<html>
<h1 style="text-align:center; color:rgb(153, 99, 0)">EXPLORING PATTERNS AND TRENDS<br> IN EVERYDAY RECIPES</h1>
<h2 style="text-align:center; color:rgb(153, 99, 0)">SENG 474 – DATA MINING</h2>
<br>
<p style="text-align:center; font-size:14pt; font-style:italic">November 30, 2018</p>
<br>

<div style="height: auto; width: 200px; text-align:right">
<h5 style="text-align: left"><b>Prepared For:</b></h5>
Alex Thomo<br>
<h5 style="text-align: left"><b>By:</b></h5>
Abdulla Almahmood()<br>
Max Gunton (V00511318)<br>
Yaxi Yu(V00828218)<br>
</div>
</html>

<span style="font-size:1.4em;text-decoration:underline;color:orange">**Introduction**</span>
<p>As long as there has been life on earth there has been food, and no other animal has mastered the art of combining and preparing food as thoroughly as humans.  </p>
<p>Scientists generally agree that our early ancestors, Homo erectus, first appeared in Africa 1–2 million years ago. They spread throughout the world and evolved into ancient humans, and approximately 250,000 years ago hearths appeared.  This is, by most accounts, the accepted archaeological estimate for the invention of cooking.  </p>
<p>Knowing that cooking and the development of recipes have been ongoing for 250,000 years, we figured there must be hidden patterns yet to be discovered in the ingredients and the cooking techniques used.  </p>


<span style="color:orange; font-size:1.4em; text-decoration:underline">**1.0 Data Collection**</span>
<h4 style="color:orange">You are only as good as your data!  And our data was …. Ok.  </h4>
<p>Initially we had hoped to collect and process our own data, but we were all quite busy this semester.  Therefore, we decided that in order to get the most out of applying the algorithms and techniques we learned, we would simply use an already existing dataset.  </p>
<p>The dataset we chose to use was uploaded to Kaggle.com and can be found using the following link:</p>
	&nbsp;<a href="https://www.kaggle.com/hugodarwood/epirecipes#full_format_recipes.json">https://www.kaggle.com/hugodarwood/epirecipes#full_format_recipes.json</a>
<p>The data comes in the form of a 26.7 CSV file and is composed of 20052 rows and 680 columns.  Each row represents a recipe structured as follows:</p>
<p>title | rating | calories | protein | fat | sodium | … | "characteristics & ingredients" </p>

<p>The first 6 columns provide information about each recipe, and the remaining 674 columns each contain a 0 or a 1 depending on whether the recipe contains the corresponding ingredient or satisfies the corresponding characteristic.  </p>
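
As a rough illustration of this layout, the sketch below builds a tiny stand-in frame and splits it into the six descriptive columns and the binary indicator columns. The recipe rows and the two indicator columns (`almond`, `vegan`) are made up for the example; only the six descriptive column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real file: all values below are invented for illustration.
toy = pd.DataFrame({
    'title': ['Recipe A', 'Recipe B', 'Recipe C'],
    'rating': [4.375, 3.75, 5.0],
    'calories': [426.0, 403.0, 165.0],
    'protein': [30.0, 18.0, 6.0],
    'fat': [7.0, 23.0, 7.0],
    'sodium': [559.0, 1439.0, 165.0],
    'almond': [0.0, 1.0, 0.0],
    'vegan': [1.0, 0.0, 0.0],
})

INFO_COLUMNS = ['title', 'rating', 'calories', 'protein', 'fat', 'sodium']

info = toy[INFO_COLUMNS]                      # descriptive columns
indicators = toy.drop(columns=INFO_COLUMNS)   # 0/1 ingredient and characteristic flags

# Every indicator value should be either 0 or 1.
is_binary = bool(indicators.isin([0, 1]).all().all())
print(info.shape, indicators.shape, is_binary)
```

The same split applied to the full dataset would be a cheap sanity check that the indicator columns really are binary before running frequent item set analysis.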

<h3 style="color:orange">1.1 Pitfalls of Our Dataset</h3>

<h4 style="color:orange">Overly Simplified</h4>
<p>One of the major downsides of using this provided dataset is that you can’t force anything out of it that isn’t already there.  For example, the presence of an ingredient is a good start when it comes to classifying recipes, but it is only part of the equation.  And we would be foolish to believe that the ratios of ingredients (their normalized weights) don’t also play a major part in finding patterns.  If we were collecting the data ourselves, we would have liked to include the weights of the ingredients instead of simply a 1 or 0.  </p>


<h4 style="color:orange">Flavour Molecule</h4>
<p>Given more time, we would also have liked to add information about the composition of the ingredients themselves.  More specifically, the composition of flavour molecules that make up each ingredient.  We believe that this is the direction that would yield the most interesting results.  Using these, we could compute the cosine similarity between ingredients, which would allow us to offer recommendations for food pairings as well as discover which flavour molecules go well with others.  However, gathering the data and structuring it proved to be too time consuming for the scope of this project.  It would also have added a level of complexity that may have been more than we could handle as beginners to data mining.  </p>
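
To make the cosine similarity idea concrete, here is a minimal sketch. It assumes each ingredient is represented as a vector marking which flavour molecules it contains; the three ingredient vectors are hypothetical and are not taken from any real flavour database.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Hypothetical molecule-presence vectors (one position per flavour molecule).
tomato = [1, 1, 0, 1]
basil = [1, 1, 0, 0]
vanilla = [0, 0, 1, 0]

print(cosine_similarity(tomato, basil))    # shares two molecules, so similarity is high
print(cosine_similarity(tomato, vanilla))  # shares none, so similarity is zero
```

Under this representation, a pairing recommendation could simply rank ingredients by their similarity to a chosen ingredient.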

<h3 style="color:orange">1.2 Benefits of Our Dataset</h3>
<p>One positive aspect of the dataset we are using is that it contains many attributes/columns, which means that there are many potential relationships to be discovered.  In addition, the data came ready to go for frequent item set analysis, which allowed us to focus on implementing and optimizing the algorithm, rather than playing around with the data.  </p>
<p>Our dataset is also large, which meant that we had to keep efficiency in mind and implement things in such a way that they could be generalized.  It also gave us a taste of how things go in industry, when you can’t see all your data at once.  This really drove home the importance of using the visualization tools in the matplotlib library.  And although we were able to load our entire dataset into Excel, it was at the upper limit of what Excel could handle.  This meant that any data preprocessing had to be done using Python, and in a generalized way, as it wasn’t feasible to do it by hand.  </p>


<span style="color:orange; font-size:1.4em; text-decoration:underline">**2.0 Data Preprocessing**</span>

The preprocessing stage aims to clean and transform the raw data into a useful form. In
order to do that, we follow these steps:  
2.1. Deal with missing values  
2.2. Data normalization  
2.3. Data visualization  
2.4. Training and testing data set generation: to generate the files required for the
data mining process (i.e., ARFF files for the WEKA tool)  


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('Structured Data/epi_r.csv', low_memory=False)

<span style="color:orange">**2.1. Deal With Missing Values.**</span>

In this section, we will look at how we can identify and mark values as missing.  
We can use plots and summary statistics to help identify missing or corrupt data.  
We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

In [None]:
print(data.describe())

Now if we chain a .sum() call onto data.isnull(), we can see how many missing values each column contains:

In [None]:
data.isnull().sum()

Let's list all the column names which contain missing values

In [None]:
data.columns[data.isnull().any()].tolist()

The simplest strategy for handling missing data is to remove records that contain a missing value.  
We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.  
Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:  

In [None]:
data.dropna(inplace=True)

Check again to see if we still have missing values:

In [None]:
data.columns[data.isnull().any()].tolist()

All set!
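
Since `dropna()` discards a whole row for a single missing value, it can be worth checking how many rows are lost. The sketch below shows the idea on a small made-up frame (the values are invented; only the column names mirror the dataset).

```python
import numpy as np
import pandas as pd

# Invented example: two of the four rows have a missing value somewhere.
toy = pd.DataFrame({
    'rating': [4.0, 3.5, np.nan, 5.0],
    'calories': [300.0, np.nan, 120.0, 450.0],
    'almond': [0.0, 1.0, 0.0, 1.0],
})

missing_per_column = toy.isnull().sum()   # count of NaN values in each column
cleaned = toy.dropna()                    # drops any row with at least one NaN
rows_lost = len(toy) - len(cleaned)

print(missing_per_column)
print('Rows before: {}, after: {}, lost: {}'.format(len(toy), len(cleaned), rows_lost))
```

Comparing the row counts before and after on the real data would show how much of the dataset this simple strategy costs.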

<span style="color:orange">**2.2. Data normalization.**</span>

In [None]:
data.head()

As we can see, the 1st column stores the name of each recipe;  
columns 2 to 6 store the general info of each recipe;  
and the rest store the ingredients and characteristics (the value of each of these columns is either 0 or 1, showing whether the ingredient or characteristic is present in the recipe).   

We want to normalize the data in the 2nd to the 6th columns.

In [None]:
data_general_info = data[['rating', 'calories', 'protein', 'fat', 'sodium']]
data_general_info.head()

Construct a method to normalize the data

In [None]:
def normalize(df):
    return (df - df.min()) * 1.0 / (df.max() - df.min())

Call this method to normalize these columns

In [None]:
data_general_info_normalized = data_general_info.apply(normalize)
data_general_info_normalized.head()
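
The method above is min-max scaling, x' = (x - min) / (max - min), which maps each column onto the range 0 to 1. The sketch below works through it on invented numbers. The helper name `min_max` and the guard for a constant column (where max equals min and the formula would divide by zero) are our additions for illustration, not part of the cell above.

```python
import pandas as pd

def min_max(column):
    """Scale a column to [0, 1]; a constant column is mapped to all zeros."""
    span = column.max() - column.min()
    if span == 0:
        return column * 0.0
    return (column - column.min()) / span

# Invented values: 'fat' is constant to exercise the guard.
toy = pd.DataFrame({'calories': [100.0, 200.0, 300.0],
                    'fat': [5.0, 5.0, 5.0]})
scaled = toy.apply(min_max)
print(scaled)
```

Here the smallest calorie value maps to 0, the largest to 1, and the middle one to 0.5.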

<span style="color:orange">**2.3. Data Visualization.**</span>

We can use seaborn (a Python data visualization library based on matplotlib) to visualize the data.

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.

In [None]:
sns.boxplot(data = data_general_info_normalized)
plt.show()
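
For reference, the five-number summary can also be computed directly. This is a small sketch on made-up values using `np.percentile`. Note that seaborn's default box plot draws whiskers at 1.5 times the interquartile range and shows points beyond them as outliers, so the whisker ends are not always the true minimum and maximum.

```python
import numpy as np

# Made-up normalized values for one attribute.
values = np.array([0.0, 0.1, 0.2, 0.4, 1.0])

# Percentiles 0, 25, 50, 75 and 100 give the five-number summary.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1   # interquartile range, the height of the box

print(minimum, q1, median, q3, maximum, iqr)
```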

A violin plot is similar to a box plot with a rotated kernel density plot on each side.  
Use Violin Plot to visualise the distribution of the data and its probability density.

In [None]:
sns.violinplot(data=data_general_info_normalized)

plt.show()

<span style="color:red;">Global attribute names:  
    **data** ... full dataset without missing values  
    **data_general_info** ... 2nd to 6th columns without missing values  
    **data_general_info_normalized**  ... normalized 2nd to 6th columns without missing values
</span>  
 

<span style="color:orange;">**2.4 Training and Testing data set generation**</span>  
We decided to implement the training / testing split using K-Fold cross validation.  

In [None]:
# crossValidator.py

import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedStratifiedKFold
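# confidenceInterval is defined in a later cell of this notebook; run that cell first,
# or uncomment the import below when running from the separate .py files
#from confidenceInterval import confidenceInterval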

# Linear Classifiers
from sklearn.linear_model import LogisticRegression, LinearRegression

# Decision Tree Classifiers
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import BernoulliNB

CLASS_LABEL = 'rating'
NUM_CLASSES = 6
NUM_SPLITS = 10

#df = pd.read_csv('Structured Data/epi_r.csv')  # read in entire dataset
#df.fillna(-1, inplace=True)    # replace NaN values with -1
df = data

# split data into attributes and class labels
y = df[CLASS_LABEL]
X = df[[col for col in df.columns if col != CLASS_LABEL]]

# convert y to a flat numpy array (reshape returns a new array, so assign the result)
y = np.array(y).reshape((-1,))

# change y from being continuous values between 0-5 to simply being an integer number
# 0, 1, 2, 3, 4, or 5
y = pd.cut(y,NUM_CLASSES, right=False, labels=[i for i in range(NUM_CLASSES)])

# print(y)
# print(X.columns[1])
strat_k_fold = StratifiedKFold(n_splits=NUM_SPLITS)  # parameters n_folds=10, shuffle=False, random_state=None

# Here we want to cast X and y as numpy arrays
X = np.array(X)
y = np.array(y)

# want to find the class that occurs the least often and only take that many instances
# from each class value so that we end up with an even distribution
minimum = np.min([np.sum(y == i) for i in range(NUM_CLASSES)])

classes = np.array([],dtype=np.int32).reshape(-1,)
for i in range(NUM_CLASSES):
    classes = np.append(classes,np.array((np.where(y == i)),dtype=np.int32).reshape(-1,)[:minimum])

X = X[classes] # extract our subset from entire dataset
y = y[classes] #              "

X = X[:, 1:]  # remove the title column from the dataset since these are unique to each entry anyway

splits = strat_k_fold.split(X, y)
#scores = np.array([], dtype=np.float32).reshape(-1,)
count = 1
for train_index, test_index in splits:
    # print('Train Index: {}'.format(train_index))
    # print('Test Index: {}'.format(test_index))
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

    # TODO: use a grid search to determine the best learning constant
    #       tweak C and max iterations to get a good mix of convergence and speed
    log_reg_m = LogisticRegression(solver='lbfgs', C=0.1, max_iter=50000, multi_class='multinomial') #max_iter=50000
    lin_reg_m = LinearRegression()
    dtr_m = DecisionTreeRegressor()
    rfr_m = RandomForestRegressor(n_estimators=100)
    nb_m = BernoulliNB()

    log_reg_m.fit(X_train, y_train)
    lin_reg_m.fit(X_train, y_train)
    dtr_m.fit(X_train, y_train)
    rfr_m.fit(X_train, y_train)
    nb_m.fit(X_train, y_train)


    # print(log_reg_m.score(X_train,y_train))  # accuracy on the training data

    y1 = log_reg_m.predict(X_test) # this one works well; not too sure about the others
    y2 = lin_reg_m.predict(X_test)
    y3 = dtr_m.predict(X_test)
    y4 = rfr_m.predict(X_test)
    y5 = nb_m.predict(X_test)

    # try taking an average and choosing the closest
    #prediction = (y1 + y2 + y3 + y4 + y5) / 5
    #prediction = np.around(prediction)
    prediction = y1
    c1 = confidenceInterval()
    c1.establish(prediction, y_test, 80) # want to know with 80% confidence
    print('\nFOLD -  {}\n'.format(count))
    c1.printConfidence()
    count += 1

    # TODO: try voting based on majority
    # prediction = np.column_stack((y1,y2,y3,y4))

#print(scores.sum()/strat_k_fold.n_splits)
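
The three data-handling steps in the cell above (binning the ratings into classes, undersampling to the rarest class, and stratified splitting) can be hard to follow among the model code. The sketch below isolates them on a small invented rating array. It uses 3 classes and 2 folds purely to keep the toy example small, whereas the cell above uses 6 classes and 10 folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Invented ratings with a deliberately uneven distribution.
ratings = np.array([0.0] * 4 + [2.5] * 6 + [5.0] * 10)

# 1. Bin the continuous ratings into integer class labels.
labels = np.asarray(pd.cut(ratings, 3, right=False, labels=[0, 1, 2])).astype(int)
counts = np.bincount(labels)

# 2. Undersample: keep only as many rows per class as the rarest class has.
smallest = counts.min()
keep = np.concatenate([np.where(labels == c)[0][:smallest] for c in range(3)])
X = np.arange(len(ratings)).reshape(-1, 1)[keep]   # placeholder feature column
y = labels[keep]

# 3. Stratified folds preserve the (now even) class distribution in each split.
folds = list(StratifiedKFold(n_splits=2).split(X, y))
for train_idx, test_idx in folds:
    print('test class counts:', np.bincount(y[test_idx]))
```

One consequence of the undersampling step is that the usable dataset shrinks to the number of classes times the size of the rarest class, which is worth keeping in mind when reading the per-fold results.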




<span style="color:orange; font-size:1.4em; text-decoration:underline">**3.0 Data Mining**</span>

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. This project aims to find interesting patterns and relationships between various attributes and to predict the rating of each recipe. 

To achieve this, we first tried to use WEKA to classify the attributes using various machine learning classifiers. However, we had a hard time importing our large CSV file into WEKA.

We first tried to import the csv file directly to WEKA:

<img src="img/weka1.png" style="height:250px"> 

However, an error occurred saying "attribute names are not unique!"

<img src="img/weka2.png" style="height:100px"> 

Since we have 680 columns in total, finding the duplicates and renaming them by hand would be time consuming. Therefore, we tried to simply remove the titles and add a range of numbers to replace them:

<img src="img/weka3.png" style="height:200px"> 
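
In hindsight, the duplicate names could probably have been located programmatically rather than by hand. The sketch below is one way this might be done, not something we ran on the real file. The header line is invented, and the helper `make_unique` and its `_2` suffix scheme are hypothetical. It reads the raw header with the `csv` module so that duplicates are seen exactly as they appear in the file.

```python
import csv
import io
from collections import Counter

# Invented CSV content with one repeated column name.
sample = "title,rating,almond,almond,vegan\nRecipe A,4.0,1,0,0\n"

# Read only the header row, exactly as written in the file.
header = next(csv.reader(io.StringIO(sample)))
name_counts = Counter(header)
duplicates = sorted(name for name, n in name_counts.items() if n > 1)

def make_unique(names):
    """Append _2, _3, ... to repeated names so every name is distinct."""
    seen = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        unique.append(name if seen[name] == 1 else '{}_{}'.format(name, seen[name]))
    return unique

print(duplicates)
print(make_unique(header))
```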

Unfortunately, a different error occurred when we tried to import this file again, saying "680 Problem encountered on line: 23":

<img src="img/weka4.png" style="height:100px"> 

While searching for a solution online, we found that a number of programmers have had problems loading large CSV files into WEKA. Either the file itself is not accepted, or loading and processing take too much time.  
Therefore, we agreed to code the classifiers ourselves. 

The class below takes the predictions and the correct classifications as arrays and uses the two to compute a confidence interval for the classification accuracy.

In [None]:
# confidenceInterval.py

import numpy as np

# TODO: expand evaluator to handle continuous class labels

# class_labels must be discrete and not continuous and numerical

class confidenceInterval:
    CLT_PROB_DIST = {0.1 : 3.09,    # 99.8% confidence
                 0.5 : 2.58,     # 99%     "
                 1.0 : 2.33,    # 98%      "
                 5.0 : 1.65,    # 90%      "
                 10.0 : 1.28,   # 80%      "
                 20.0 : 0.84,   # 60%      "
                 40.0 : 0.25}   # 20%      "
    MAX_SPREAD = 100
    p = None
    q = None
    variance = None
    SN = None
    N = None
    confidence = None
    
    
    def __init__(self, predictions=None, class_labels=None):
        pass

    def quantizeConfidence(self, confidence):
        index = (100 - confidence)/2.0
        values = np.array([0.1, 0.5, 1.0, 5.0, 10.0, 20.0, 40.0])
        diffs = abs(values - index)
        index = values[np.argmin(diffs)]
        return (100 - (2*index)) , index

    
    def establish(self, predictions, class_labels, confidence=90):
        
        assert (len(predictions) == len(class_labels))

        self.confidence, index = self.quantizeConfidence(confidence)
        self.N = len(predictions)

        # ensure that the input arrays are of type numpy.array
        predictions = np.array(predictions)     # predictions.shape = (N,)
        class_labels = np.array(class_labels)   # class_labels.shape = (N,)

        # check to make sure they are the same shape
        assert (predictions.shape == class_labels.shape)

        # compare each array element wise to see if they are the same
        # given multiple class values if they are represented using integers
        # the predictions can be subtracted from the actual class_labels and
        # resulting zeros will indicate a successful predictions; anything else will
        # indicate an error or incorrect prediction

        results = class_labels - predictions
        #print(results)  # prints correctly
        #print(self.N)  # prints 32
        
        self.p = float(len([result for result in results if result == 0]))/float(self.N)
        self.q = 1-self.p
        self.variance = self.p*self.q/self.N
        denom = np.sqrt(self.variance)

        self.SN = [(self.p - self.CLT_PROB_DIST[index]*denom),
                   (self.p + self.CLT_PROB_DIST[index]*denom)]

        return
        
   
    def printConfidence(self):
        if self.SN is not None:
            # \U+03F5 unicode for within the set
            print('Successes: {}\nErrors: {}\n___________\nTotal Instances: {}\n\nP: {}\nQ: {}\nVariance:'
                  ' {}\n'.format(self.p*self.N, self.q*self.N, self.N, self.p, self.q, self.variance))
            print('With {}% confidence probability of correct classification is in the '
                  'range {:.1f}% - {:.1f}%\n\n'.format(self.confidence,self.SN[0]*100,
                                                   self.SN[1]*100))
        else:
            print('Confidence not yet established\n\n')
        return
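
The interval computed in `establish` is the normal approximation p ± z·sqrt(p(1 − p)/N), where p is the fraction of correct predictions and z is the value looked up in `CLT_PROB_DIST`. As a worked example, the sketch below applies the same formula to invented counts (75 correct out of 100) at 80% confidence, using z = 1.28 from the table above. The function name is ours and is not part of the class.

```python
import math

def normal_approx_interval(successes, n, z):
    """Return (low, high) for the success probability using p +/- z*sqrt(p*q/n)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Invented counts: 75 correct predictions out of 100, 80% confidence (z = 1.28).
low, high = normal_approx_interval(75, 100, 1.28)
print('{:.1f}% - {:.1f}%'.format(low * 100, high * 100))
```

For these numbers the interval comes out at roughly 69.5% to 80.5%. Because the width shrinks with the square root of N, the small test folds produced by cross validation give fairly wide intervals.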

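Below is a helper class for ROC analysis of binary predictions.  `calcPoint` derives the true positive rate and false positive rate from the predictions and the correct labels, and `displayROC` plots the resulting points against the diagonal baseline.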

In [None]:
# ROC_curve.py

import cv2
import numpy as np
from matplotlib import pyplot as plt

class ROC_curve:

    SCALE = 4000
    y = None
    predictions = None

    def __init__(self, predictions, y):
        self.y = (np.array(y, dtype=np.int32)).reshape((-1,))
        self.predictions = (np.array(predictions, dtype=np.int32)).reshape((-1,))
        # compare the converted arrays so that plain Python lists are accepted too
        assert (self.y.shape == self.predictions.shape)

    def compare_labels(self, axis):
        # 0 - TP
        # 1 - TN
        # 2 - FP
        # 3 - FN
        diff = axis[0] - axis[1]
        if diff == 0:
            if axis[0] == 1:
                # True Positive
                return 0
            else:
                # True Negative
                return 1
        else:
            if axis[0] == 0:
                # False Positive
                return 2
            else:
                # False Negative
                return 3

    def calcPoint(self):

        # y = y.reshape((-1,))
        # predictions = predictions.reshape((-1,))
        values = np.column_stack((self.y, self.predictions))


        # print(y)
        # print(predictions)

        a = np.apply_along_axis(self.compare_labels,1,values)
        labels, counts = np.unique(a, return_counts=True)
        labels2, counts2 = np.unique(self.y, return_counts=True)
        # print(a)
        # print(counts)
        # print(counts2)
        index = dict(zip(labels, counts))
        index2 = dict(zip(labels2, counts2))
        print(index2)
        confusion_matrix = {'TruePos' : index.get(0,0),
                            'TrueNeg' : index.get(1,0),
                            'FalsePos' : index.get(2,0),
                            'FalseNeg' : index.get(3,0),
                            'Positives' : index2.get(1,0),
                            'Negatives' : index2.get(0,0)
                            }

        true_positive_rate = confusion_matrix['TruePos'] / confusion_matrix['Positives']
        false_positive_rate = confusion_matrix['FalsePos'] / confusion_matrix['Negatives']

        graph_point = np.array([np.round(false_positive_rate,2)*self.SCALE,np.round(true_positive_rate,2)*self.SCALE], dtype=np.int32)
        return graph_point

    # assumes a 50 / 50 distribution between two class values as a baseline
    def displayROC(self, points):

        canvas = np.ones((self.SCALE+1,self.SCALE+1,3), dtype=np.int32) * 255
        #canvas[self.SCALE-graph_point[1], graph_point[0]] = 1
        #canvas[100-10,10] = 1
        plt.xticks(([_tick for _tick in range(0,self.SCALE,np.int32(np.round(self.SCALE/10)))]+[self.SCALE-1]),
                   ['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9', '1.0'])
        plt.yticks(([_tick for _tick in range(0,self.SCALE,np.int32(np.round(self.SCALE/10)))]+[self.SCALE-1]),
                   ['1.0', '0.9', '0.8', '0.7', '0.6', '0.5', '0.4', '0.3', '0.2', '0.1', '0'])

        cv2.line(canvas,(0,self.SCALE),(self.SCALE,0), (255,0,0), thickness=20)

        for i in points:
            # x is the false positive rate; y is flipped because image rows grow downward
            cv2.circle(canvas,(int(i[0]), int(self.SCALE - i[1])), 30, (0,0,0), -1, lineType=cv2.LINE_AA)

        plt.imshow(canvas)
        plt.show()

        # print(values)
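
For comparison, the point computed by `calcPoint` can also be obtained with boolean masks, which avoids the row-by-row `apply_along_axis` call. This is a minimal sketch on invented binary labels. The function `roc_point` is a hypothetical alternative, not a drop-in replacement, since it returns unscaled rates rather than canvas coordinates.

```python
import numpy as np

def roc_point(y_true, y_pred):
    """Return (false_positive_rate, true_positive_rate) for binary 0/1 labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    true_pos = np.sum(y_true & y_pred)      # actual 1, predicted 1
    false_pos = np.sum(~y_true & y_pred)    # actual 0, predicted 1
    positives = np.sum(y_true)
    negatives = np.sum(~y_true)
    tpr = true_pos / positives if positives else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return float(fpr), float(tpr)

# Invented labels: 4 positives and 4 negatives.
fpr, tpr = roc_point([1, 1, 1, 0, 0, 0, 0, 1],
                     [1, 0, 1, 0, 1, 0, 0, 1])
print(fpr, tpr)
```

A point above the diagonal (true positive rate greater than false positive rate) indicates a classifier doing better than the 50/50 baseline.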

In [None]:
# FrequentItemSets.py

import numpy as np
import pandas as pd

class FrequentItemSet:

    SUPPORT = 100  # should maybe call this MIN_SUPPORT
    DB = None
    CONFIDENCE = 0.1
    INTEREST = 1


    def __init__(self, db, support, confidence):
        self.DB = db
        self.SUPPORT = support
        self.CONFIDENCE = confidence

    def checkAxis(self,array):
        for i in array:
            if i == 0:
                return 0
        return 1

    def createPotentialSets(self, k, fis):
        new_fis = []
        #k = 3  # for testing
        if k == 1:
            potential_fis = [[i] for i in sorted(list(self.DB))]
        else:
            potential_fis = []
            previous_sets = fis[-1]  # assuming we have run this previously already for frequent itemsets of size k-1
            #print(previous_sets)
            #print(len(previous_sets))
            #previous_sets = [['A','B'],['A','C'],['C','D']]
            #print(previous_sets)
            for i in range(len(previous_sets)):
                for j in previous_sets[(i+1):]:
                    #print('{} - {}'.format(previous_sets[i], j))
                    if previous_sets[i][:-1] == j[:-1]:
                        potential_fis.append((previous_sets[i]+[j[-1]]))
                    else:
                        pass

        # print(potential_fis)
        # need to validate the potential FIS by counting their support in the database

        #potential_fis = [['fall']]
        #print(potential_fis)
        print('\nK = {}'.format(k))
        print('Pruning Step:   Start - {} potential sets'.format(len(potential_fis)))

        # this is the pruning step to make sure that all itemsets are indeed frequent
        for i in potential_fis:
            # print(i)  # prints the potential frequent item set
            item_set = self.DB[i]
            length = len(i)
            #print(length)
            temp = np.sum(item_set, axis=1)
            #new = np.array(temp).reshape(-1,)
            support = np.array(np.where(temp == length)).size

            # Below works but is really slow
            # support = np.sum(np.apply_along_axis(self.checkAxis, 1, item_set))
            if support >= self.SUPPORT:
                new_fis.append(i)
            #print(support)
            #break
        print('Pruning Step:   Complete - {} sets remaining'.format(len(new_fis)))
        #print(new_fis)
        return new_fis

        #

    def generateFIS(self):
        k = 2
        fis = []
        current_fis = self.createPotentialSets(1, fis)
        fis.append(current_fis)
        #print('k: 1  length: {}'.format(len(current_fis)))
        while len(current_fis) > 0:
            current_fis = []
            current_fis = self.createPotentialSets(k, fis)
            k += 1
            if len(current_fis) > 0:
                fis.append(current_fis)
            #break # remove after development
            #print('k: {}  length: {}'.format(k, len(current_fis)))

        self.FIS = fis  # stored so that getFIS() has something to return
        return fis

    def generateClosedSets(self, all_sets):
        closed = {}
        all_sets.reverse()
        for i in range(len(all_sets)):
            for j in all_sets[i]:
                add = True
                if i == 0:
                    item_set = self.DB[j]
                    length = len(j)
                    # print(length)
                    temp = np.sum(item_set, axis=1)
                    # new = np.array(temp).reshape(-1,)
                    support = np.array(np.where(temp == length)).size
                    closed[frozenset(j)] = support
                else:
                    item_set = self.DB[j]
                    length = len(j)
                    # print(length)
                    temp = np.sum(item_set, axis=1)
                    # new = np.array(temp).reshape(-1,)
                    support = np.array(np.where(temp == length)).size
                    for k in all_sets[i-1]:
                        if set(j).issubset(set(k)):
                            # compute support for k
                            item_set2 = self.DB[k]
                            length2 = len(k)
                            # print(length)
                            temp2 = np.sum(item_set2, axis=1)
                            # new = np.array(temp).reshape(-1,)
                            support2 = np.array(np.where(temp2 == length2)).size

                            if support2 >= support:
                                add = False
                                break
                            else:
                                pass
                    if add:
                        closed[frozenset(j)] = support
        return closed

    def deriveInterest(self, fis):
        rules = []
        rules2 = []
        rules3 = []
        all_length = len(self.DB[:])
        for i in fis:
            entry = list(i)
            for j in entry:
                # start from the full item set for every rule so that items removed
                # for an earlier rule are not still missing here
                temp = entry.copy()
                prob_y = np.sum(self.DB[j]) / all_length
                prob_entry = np.array(np.where(np.sum(self.DB[temp], axis=1) == len(temp))).size / all_length
                temp.remove(j)
                prob_x = np.array(np.where(np.sum(self.DB[temp], axis=1) == len(temp))).size / all_length
                confidence = prob_entry / prob_x
                # print(confidence)
                if (prob_y < confidence) and (confidence >= self.CONFIDENCE):  # filters all of the trivial rules
                    interest = prob_entry/(prob_x * prob_y)
                    if interest > 1:
                        rules.append('{} --> {} : interest={:05.3f}     confidence={:05.4f}'.format(temp, [j],
                                                                                             interest, confidence))
                        rules2.append([temp,[j]])
                        
        # The following works, but is inefficient and takes too long to finish when minimum support is low
        # this loop merges appropriate rules to create longer ones
        # for i in range(len(rules2)):
        #     for j in rules2[i:]:
        #         if len(set(j[1]) & set(rules2[i][1])) != len(j[1]):
        #             entry = list(set(rules2[i][0] + j[0] + rules2[i][1] + j[1]))
        #             temp = entry.copy()
        #             prob_entry = np.array(np.where(np.sum(self.DB[temp], axis=1) == len(temp))).size / all_length
        #             temp.remove(j[1][0])
        #             temp.remove(rules2[i][1][0])
        #             if len(temp) != 0:
        #                 y = list(set(rules2[i][1] + j[1]))
        #                 prob_x = np.array(np.where(np.sum(self.DB[temp], axis=1) == len(temp))).size / all_length
        #                 prob_y = np.array(np.where(np.sum(self.DB[y], axis=1) == len(y))).size / all_length
        #                 confidence = prob_entry / prob_x
        #                 #print(confidence)
        #                 #print(prob_y)
        #                 if (prob_y <= confidence) and (
        #                         confidence >= self.CONFIDENCE):  # filters all of the trivial rules
        #                     try:
        #                         interest = prob_entry / (prob_x * prob_y)
        #                     except ZeroDivisionError:
        #                         interest = 0
        #                     #print(interest)
        #                     if interest > 1:  # change back to 1
        #                         rules.append('{} --> {} : interest={:05.3f}     confidence={:05.4f}'.format(temp, y,
        #                                                                                                     interest,
        #                                                                                                     confidence))
        #                         rules3.append([temp, j])
        #     if i == (len(rules2) - 1):
        #         rules2 = rules3.copy()




        return rules



    def getSupport(self):
        return self.SUPPORT

    def getFIS(self):
        return self.FIS

def main():

    min_support = 20  # ran with value 10
    min_confidence = 0.5
    min_interest = 1.0
    df = pd.read_csv('Structured Data/epi_r.csv')  # read in data

    # PREPROCESSING STAGE ---------------------------------------------------------------

    df = df.drop(['title', 'rating', 'calories', 'protein', 'fat', 'sodium'], axis=1)

    # After running once, we decided to drop these as they seemed to lead to uninteresting results.
    # They also occur very often, which leads to explosions in complexity when looking for frequent item sets
    # with smaller support values.  Many are also generalizations of more specific information that is already
    # included, so they often lead to redundant information
    df = df.drop(
        ['bon appétit', 'kosher', 'peanut free', 'pescatarian', 'side', 'soy free', 'tree nut free', 'vegetarian',
         'wheat/gluten-free', 'no sugar added', 'dairy free', 'vegan', 'quick & easy', 'gourmet', 'bake',
         'kidney friendly', 'fruit', 'vegetable', 'dessert', 'alcoholic','backyard bbq', 'fourth of july', 'grill',
         'grill/barbecue', 'summer', 'healthy', 'dinner', 'appetizer', 'chill', 'christmas', 'cocktail party', 'drink',
         'fall', 'herb', 'high fiber', 'kid-friendly', 'lemon', 'low cal', 'lunch', 'no-cook', 'roast', 'salad',
         'sauce', 'sauté', 'soup/stew', 'spring', 'sugar conscious', 'thanksgiving', 'winter', 'brunch', 'breakfast',
         'berry', 'nut', 'bon app��tit'], axis=1)
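    # DataFrame.to_sparse() was removed in pandas 1.0; a sparse dtype is the current equivalent
    df = df.astype(pd.SparseDtype('float', 0))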
    fis = FrequentItemSet(df,min_support,min_confidence)
    #fis.generateFIS()
    print('FREQUENT ITEM SET ANALYSIS')
    print('Items: {}'.format(np.array(df[:]).shape[1]))
    print('Rows: {}'.format(np.array(df[:]).shape[0]))
    print('Min Support: {}'.format(min_support))
    print('Min Confidence: {}'.format(min_confidence))
    print('Min Interestingness: {}'.format(min_interest))
    all_item_sets = fis.generateFIS()
    count = np.sum(np.array([len(i) for i in all_item_sets]))
    print('\nGenerated {} Frequent Item Sets'.format(count))
    closed_item_sets = fis.generateClosedSets(all_item_sets)
    print('\nCLOSED ITEM SETS:')
    print('Number of sets: {}\n'.format(len(closed_item_sets)))
    for i in closed_item_sets:
        print('{}\t-\t{}'.format(list(i),closed_item_sets[i]))

    #want to derive interestingness from the data


    interesting_rules = fis.deriveInterest(closed_item_sets)
    print('\nRULES:')
    print('Number of rules: {}\n'.format(len(interesting_rules)))
    for i in interesting_rules:
        print(i)



main()
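
To make the quantities in the class above easier to check by hand, the sketch below reproduces the candidate join step and the rule measures on a tiny invented transaction table. The function names and the three-item table are ours. Confidence is P(X and Y) / P(X), and interest is the ratio P(X and Y) / (P(X) · P(Y)) that `deriveInterest` computes (often called lift).

```python
import pandas as pd

def join_candidates(previous_sets):
    """Join sorted k-item sets that share their first k-1 items into (k+1)-item candidates."""
    candidates = []
    for i, left in enumerate(previous_sets):
        for right in previous_sets[i + 1:]:
            if left[:-1] == right[:-1]:
                candidates.append(left + [right[-1]])
    return candidates

def support_count(db, items):
    """Number of rows in which every item in `items` is present (equal to 1)."""
    return int((db[items].sum(axis=1) == len(items)).sum())

def rule_metrics(db, antecedent, consequent):
    """Return (confidence, interest) for the rule antecedent --> consequent."""
    n = len(db)
    p_xy = support_count(db, antecedent + [consequent]) / n
    p_x = support_count(db, antecedent) / n
    p_y = support_count(db, [consequent]) / n
    confidence = p_xy / p_x
    interest = p_xy / (p_x * p_y)
    return confidence, interest

# Invented 0/1 table: one row per recipe, one column per item.
toy = pd.DataFrame({
    'butter': [1, 1, 1, 0, 0, 1],
    'egg':    [1, 1, 0, 0, 0, 1],
    'flour':  [1, 1, 0, 0, 1, 0],
})

pairs = [['butter', 'egg'], ['butter', 'flour'], ['egg', 'flour']]
triples = join_candidates(pairs)
confidence, interest = rule_metrics(toy, ['butter'], 'egg')
print(triples, support_count(toy, ['butter', 'egg']), confidence, interest)
```

In this toy table, butter appears in 4 of 6 rows and butter with egg in 3, so the rule butter --> egg has confidence 0.75. Egg alone appears in half the rows, so the interest is 1.5, meaning egg is more likely in recipes with butter than overall.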