[Title(TBD)]

Abdulla Almahmood()  
Max Gunton()  
Yaxi Yu(V00828218)


<span style="color:orange">**1. Data Collection. (TODO)**</span>

<span style="color:orange">**2. Data Preprocessing.**</span>

The preprocessing step cleans and transforms the raw data into a useful form. To do
that, we follow these steps:  
2.1. Deal with missing values  
2.2. Data normalization  
2.3. Data visualization  
2.4. Training and testing data set generation: generate the files required for the
data mining process (i.e., ARFF files for the WEKA tool)  


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('Structured Data/epi_r.csv', low_memory=False)

<span style="color:orange">**2.1. Deal With Missing Values.**</span>

In this section, we will look at how we can identify and mark values as missing.  
We can use plots and summary statistics to help identify missing or corrupt data.  
We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

In [None]:
print(data.describe())

Now if we chain .isnull() and .sum() on the DataFrame, we can see how many missing values each column contains:

In [None]:
data.isnull().sum()

Let's list the names of all columns that contain missing values:

In [None]:
data.columns[data.isnull().any()].tolist()
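
Some datasets encode missing entries with a placeholder value instead of leaving the cell empty, in which case isnull() will not detect them. As a minimal sketch, assuming (hypothetically) that a value of 0 in a calories column means "not recorded", we can mark those entries as NaN before counting. The toy DataFrame and the 0-means-missing convention are assumptions made for illustration, not properties of epi_r.csv.

```python
import numpy as np
import pandas as pd

# Toy data: the column names mirror the dataset, but the values and the
# "0 means missing" convention are assumptions for this illustration.
toy = pd.DataFrame({
    'calories': [426.0, 0.0, 165.0, 0.0],
    'sodium': [559.0, 1439.0, 0.0, 165.0],
})

# Mark the placeholder as missing in the 'calories' column only.
toy['calories'] = toy['calories'].replace(0.0, np.nan)

# Count the missing values per column, as in the cell above.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Only the column where the placeholder is known to mean "missing" is converted, so a genuine 0 in another column is left alone.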

The simplest strategy for handling missing data is to remove records that contain a missing value.  
Pandas provides the dropna() function, which can drop either rows or columns with missing data.  
By default it returns a new DataFrame; with inplace=True it modifies the existing one instead. We can use dropna() to remove all rows with missing data, as follows:  

In [None]:
data.dropna(inplace=True)
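
Dropping rows can discard a large share of the data, so it may be worth checking how many rows are lost. The sketch below uses a small made-up DataFrame (the values are assumptions, not taken from epi_r.csv) and the non-inplace form of dropna(), which leaves the original untouched so the two can be compared.

```python
import numpy as np
import pandas as pd

# Toy data with two incomplete rows (values are made up for illustration).
toy = pd.DataFrame({
    'rating': [4.375, 2.5, np.nan, 5.0],
    'calories': [426.0, np.nan, 165.0, 547.0],
})

# Without inplace=True, dropna() returns a new DataFrame and keeps toy as-is.
cleaned = toy.dropna()

rows_dropped = len(toy) - len(cleaned)
print('rows before: {}, after: {}, dropped: {}'.format(
    len(toy), len(cleaned), rows_dropped))
```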

Check again to see if we still have missing values:

In [None]:
data.columns[data.isnull().any()].tolist()

All set!

<span style="color:orange">**2.2. Data Normalization.**</span>

In [None]:
data.head()

As we can see, the 1st column stores the title of each recipe;  
columns 2 to 6 store the general information for each recipe;  
and the rest store the ingredients (the values in these columns are either 0 or 1, showing whether each ingredient exists in the recipe or not).   

We want to normalize the data in the 2nd to the 6th columns.

In [None]:
data_general_info = data[['rating', 'calories', 'protein', 'fat', 'sodium']]
data_general_info.head()

Define a function that applies min-max normalization, rescaling each column to the range [0, 1]:

In [None]:
def normalize(df):
    return (df - df.min()) * 1.0 / (df.max() - df.min())

Apply this function to normalize these columns:

In [None]:
data_general_info_normalized = data_general_info.apply(normalize)
data_general_info_normalized.head()
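
Min-max normalization maps each value x to (x - min) / (max - min). If a column is constant, max - min is 0 and the division yields NaN. The sketch below shows one possible guard on a toy DataFrame; the helper name normalize_safe, the toy values, and the choice to map a constant column to 0.0 are all assumptions for illustration.

```python
import pandas as pd


def normalize_safe(column):
    """Min-max scale a Series to [0, 1]; a constant column maps to 0.0."""
    value_range = column.max() - column.min()
    if value_range == 0:
        # Avoid 0/0: every value equals the minimum, so return zeros.
        return column * 0.0
    return (column - column.min()) / value_range


# Toy data (made-up values); 'fat' is deliberately constant.
toy = pd.DataFrame({
    'calories': [100.0, 200.0, 400.0],
    'fat': [7.0, 7.0, 7.0],
})

toy_normalized = toy.apply(normalize_safe)
print(toy_normalized)
```

Min-max scaling is also sensitive to extreme values: a single very large entry compresses the rest of the column toward 0.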

<span style="color:orange">**2.3. Data Visualization.**</span>

We can use seaborn (a Python data visualization library based on matplotlib) to visualize the data.

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

The box plot (a.k.a. box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

In [None]:
sns.boxplot(data = data_general_info_normalized)
plt.show()
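
To make the five-number summary concrete, the sketch below computes it for a small made-up Series using quantile(). The values are assumptions chosen so the result is easy to check by hand. Note that with seaborn's default settings the whiskers extend at most 1.5 times the interquartile range beyond the box, so on real data they may stop short of the true minimum and maximum, with the remaining points drawn as outliers.

```python
import pandas as pd

# Made-up values for illustration.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Minimum, first quartile, median, third quartile, maximum.
five_number_summary = values.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_number_summary)
```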

A violin plot is similar to a box plot, with a rotated kernel density plot on each side.  
We use a violin plot to visualize the distribution of the data and its probability density.

In [None]:
sns.violinplot(data=data_general_info_normalized)

plt.show()

<span style="color:orange">**2.4. Training and Testing Data Set Generation. (TODO)**</span>

<span style="color:orange">**3. Data Mining.**</span>

The class below takes the predictions and the correct class labels as arrays and uses them to compute a confidence interval for the classification accuracy.

In [None]:
# confidenceInterval.py

import numpy as np
from math import sqrt

# TODO: expand evaluator to handle continuous class labels

# class_labels must be discrete and not continuous and numerical

class confidenceInterval:
    # Maps the one-tailed probability (in percent) to its z-value
    CLT_PROB_DIST = {0.1: 3.09,    # 99.8% confidence
                     0.5: 2.58,    # 99%       "
                     1.0: 2.33,    # 98%       "
                     5.0: 1.65,    # 90%       "
                     10.0: 1.28,   # 80%       "
                     20.0: 0.84,   # 60%       "
                     40.0: 0.25}   # 20%       "
    MAX_SPREAD = 100
    p = None
    q = None
    variance = None
    SN = None
    N = None
    confidence = None
    
    
    def __init__(self, predictions=None, class_labels=None):
        pass

    def quantizeConfidence(self, confidence):
        index = (100 - confidence)/2.0
        values = np.array([0.1, 0.5, 1.0, 5.0, 10.0, 20.0, 40.0])
        diffs = abs(values - index)
        index = values[np.argmin(diffs)]
        return (100 - (2*index)) , index

    
    def establish(self, predictions, class_labels, confidence=90):
        
        assert (len(predictions) == len(class_labels))

        self.confidence, index = self.quantizeConfidence(confidence)
        self.N = len(predictions)

        # ensure that the input arrays are of type numpy.array
        predictions = np.array(predictions)     # predictions.shape = (N,)
        class_labels = np.array(class_labels)   # class_labels.shape = (N,)

        # check to make sure they are the same shape
        assert (predictions.shape == class_labels.shape)

        # compare each array element wise to see if they are the same
        # given multiple class values if they are represented using integers
        # the predictions can be subtracted from the actual class_labels and
        # resulting zeros will indicate a successful predictions; anything else will
        # indicate an error or incorrect prediction

        results = class_labels - predictions
        #print(results)  # prints correctly
        #print(self.N)  # prints 32
        
        self.p = float(len([result for result in results if result == 0]))/float(self.N)
        self.q = 1-self.p
        self.variance = self.p*self.q/self.N
        denom = sqrt(self.variance)

        self.SN = [(self.p - self.CLT_PROB_DIST[index]*denom),
                   (self.p + self.CLT_PROB_DIST[index]*denom)]

        return
        
   
    def printConfidence(self):
        if self.SN is not None:
            # \U+03F5 unicode for within the set
            print('Successes: {}\nErrors: {}\n___________\nTotal Instances: {}\n\nP: {}\nQ: {}\nVariance:'
                  ' {}\n'.format(self.p*self.N, self.q*self.N, self.N, self.p, self.q, self.variance))
            print('With {}% confidence probability of correct classification is in the '
                  'range {:.1f}% - {:.1f}%\n\n'.format(self.confidence,self.SN[0]*100,
                                                   self.SN[1]*100))
        else:
            print('Confidence not yet established\n\n')
        return
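
The interval computed in establish() is the normal approximation p ± z·sqrt(p·q/N), where p is the fraction of correct predictions, q = 1 - p, and z is the value looked up in CLT_PROB_DIST. As a minimal standalone sketch of the same arithmetic, the function below (the name accuracy_interval and the example counts are assumptions for illustration) evaluates it for 24 correct predictions out of 32 at 80% confidence (z = 1.28).

```python
from math import sqrt


def accuracy_interval(num_correct, num_total, z):
    """Return (low, high) bounds of p +/- z * sqrt(p * q / N)."""
    p = num_correct / num_total
    margin = z * sqrt(p * (1.0 - p) / num_total)
    return p - margin, p + margin


# Hypothetical counts: 24 of 32 test instances classified correctly.
low, high = accuracy_interval(24, 32, 1.28)
print('{:.1f}% - {:.1f}%'.format(low * 100, high * 100))
```

The approximation becomes unreliable when N is small or p is close to 0 or 1, in which case the bounds can even fall outside the range 0 to 1.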

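Below is an implementation of a cross-validator. It bins the ratings into discrete classes, undersamples the data so that every class is equally represented, and then runs stratified 10-fold cross-validation over four models whose predictions are averaged. A more in-depth description will be added later.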

In [None]:
# crossValidator.py

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedStratifiedKFold
#from confidenceInterval import confidenceInterval

# Linear models (one classifier, one regressor)
from sklearn.linear_model import LogisticRegression, LinearRegression

# Tree-based regressors
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

CLASS_LABEL = 'rating'
NUM_CLASSES = 6
NUM_SPLITS = 10

df = pd.read_csv('Structured Data/epi_r.csv')  # read in entire dataset
df.fillna(-1, inplace=True)    # replace NaN values with -1

# split data into attributes and class labels
y = df[CLASS_LABEL]
X = df[[col for col in df.columns if col != CLASS_LABEL]]

# convert y to a flat numpy array (reshape returns a new array, so assign it)
y = np.array(y).reshape(-1)

# change y from being continuous values between 0-5 to simply being an integer number
# 0, 1, 2, 3, 4, or 5
y = pd.cut(y,NUM_CLASSES, right=False, labels=[i for i in range(NUM_CLASSES)])

# print(y)
# print(X.columns[1])
strat_k_fold = StratifiedKFold(n_splits=NUM_SPLITS)  # defaults: shuffle=False, random_state=None

# Here we want to cast X and y as numpy arrays
X = np.array(X)
y = np.array(y)

# want to find the class that occurs the least often and only take that many instances
# from each class value so that we end up with an even distribution
minimum = np.min([np.sum(y == i) for i in range(NUM_CLASSES)])

classes = np.array([],dtype=np.int32).reshape(-1,)
for i in range(NUM_CLASSES):
    classes = np.append(classes,np.array((np.where(y == i)),dtype=np.int32).reshape(-1,)[:minimum])

X = X[classes] # extract our subset from entire dataset
y = y[classes] #              "

X = X[:, 1:]  # remove the title column from the dataset since these are unique to each entry anyway

splits = strat_k_fold.split(X, y)
#scores = np.array([], dtype=np.float32).reshape(-1,)
for train_index, test_index in splits:
    # print('Train Index: {}'.format(train_index))
    # print('Test Index: {}'.format(test_index))
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

    # TODO: use a grid search to determine the best learning constant
    #       tweak C and max iterations to get a good mix of convergence and speed
    log_reg_m = LogisticRegression(solver='lbfgs', C=0.1, max_iter=50000, multi_class='multinomial') #max_iter=50000
    lin_reg_m = LinearRegression()
    dtr_m = DecisionTreeRegressor()
    rfr_m = RandomForestRegressor(n_estimators=100)

    log_reg_m.fit(X_train, y_train)
    lin_reg_m.fit(X_train, y_train)
    dtr_m.fit(X_train, y_train)
    rfr_m.fit(X_train, y_train)


    # print(log_reg_m.score(X_train,y_train))  # accuracy on the training data

    y1 = log_reg_m.predict(X_test) # this one works well; not too sure about the others
    y2 = lin_reg_m.predict(X_test)
    y3 = dtr_m.predict(X_test)
    y4 = rfr_m.predict(X_test)

    # try taking an average and choosing the closest
    prediction = (y1 + y2 + y3 + y4) / 4
    prediction = np.around(prediction)
    c1 = confidenceInterval()
    c1.establish(prediction, y_test, 80) # want to know with 80% confidence
    c1.printConfidence()

    # TODO: try voting based on majority
    # prediction = np.column_stack((y1,y2,y3,y4))

#print(scores.sum()/strat_k_fold.n_splits)
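
The class-balancing step above keeps the same number of instances from every class, namely the size of the rarest class. As a minimal sketch of that idea on toy labels (the helper name balanced_indices and the label values are assumptions for illustration), the function below returns the row indices to keep.

```python
import numpy as np


def balanced_indices(labels, num_classes):
    """Indices of the first k rows of each class, k = size of rarest class."""
    labels = np.asarray(labels)
    # Row positions of every class, in their original order.
    per_class = [np.flatnonzero(labels == c) for c in range(num_classes)]
    smallest = min(len(idx) for idx in per_class)
    return np.concatenate([idx[:smallest] for idx in per_class])


# Toy labels: class 1 is over-represented.
toy_labels = np.array([0, 1, 1, 2, 0, 1, 2, 1])
keep = balanced_indices(toy_labels, num_classes=3)
balanced_counts = np.bincount(toy_labels[keep], minlength=3)
print(keep, balanced_counts)
```

If any class has no instances at all, the smallest class size is 0 and every row is discarded, so it may be worth checking the class counts before balancing. Taking the first k rows of each class is also order-dependent; sampling them at random would be an alternative.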


