In [2]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline
import statsmodels.formula.api as smf 
import statsmodels.api as sm
import os



***Question 2***

Predicting pizza orders

Description:
Each time a customer calls in to place an order (possibly for her friends), you note the customer’s phone number and know something about the customer based on prior observations and conversations with the delivery person. 

Observed features:
    1. Weight
    2. Age
    3. Days (since last order)
    4. Vegan
    5. Cats (if any observed in home)
    6. Cash observed in home

Target features:
    1. Size
    2. Toppings

Task: There are two separate datasets which have been provided. I will train one or more classifiers on the training data. From there, I will use the trained classifier to generate labels for test data given the features.

The notebook will continue as follows: 

1. Data Observation and cleaning  
   - Check for any NaN values in the dataset. If they exist, replace each NaN cell with the most frequently observed value in its column using scikit-learn's Imputer. 
   - Convert string variables to numerical values using scikit-learn's LabelEncoder.
   - Consolidate dataset
   
2. Classification task
   - K-nearest neighbours vs Random Forest
   - Results 
   - Conclusion
   
   
I selected K-nearest neighbours for the simplicity of the algorithm and the results it may be able to generate given the context of this dataset. Because it is non-parametric, it makes no assumptions about the distribution of the data, and with a moderate k its majority vote is reasonably tolerant of isolated outliers. One caveat is that the features need to be scaled for the distances to be meaningful; this is taken into account in my model. 
    
I selected Random Forest for its robustness in classifying data. Because it is a versatile algorithm whose strength comes from ensemble learning (bagging), I believe it will contrast well with K-nearest neighbours. To limit overfitting, I have adjusted its arguments in the analysis below. 
  


***1. Data Observation***


In [39]:
os.getcwd()
pizzeria = pd.read_csv('/Users/Rong/Documents/USF/Machine Learning 2/MidTerm2/train.csv')
test = pd.read_csv('/Users/Rong/Documents/USF/Machine Learning 2/MidTerm2/test.csv')

In [40]:
pizzeria = pizzeria.iloc[:,1:]
test= test.iloc[:,1:]
print(pizzeria.head())
print(test.head())

       Weight         Age  Days  Vegan  Cats         Cash      Size   Toppings
0  106.238809   36.596211    38      0     1     5.699125  No order   No order
1  184.378192   28.739952    28      0     0     1.171537  No order   No order
2  232.475732  106.605562    38      1     1   259.440103     Large   Hawaiian
3  112.811584  103.684648   112      0     0    13.886261  No order   No order
4  139.317810   15.045878    78      0     0  1934.054928    Medium  Pepperoni
       Weight         Age  Days  Vegan  Cats         Cash
0  215.241281   45.123194    19      0     0  1955.034280
1  251.301889   17.856168    38      0     0  2532.312093
2  189.421541  105.951771     3      0     0   241.320502
3   75.000000   37.001579     7      0     0   292.279276
4  156.416838   92.159389    63      0     2   325.376085


***Checking for NaN Values***

In [41]:
print('Null values exists? --> ' +  str(pizzeria.isnull().values.any()))
print('Null values exists? --> ' +  str(test.isnull().values.any()))

Null values exists? --> False
Null values exists? --> False
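
Neither dataset contains missing values, so no imputation is needed here. Had the check returned True, the most-frequent-value replacement described in the outline could look roughly like the sketch below. It assumes a recent scikit-learn, where `SimpleImputer` has replaced the older `Imputer` class; the small frame `df_demo` is made up for illustration and is not the pizza data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps; not taken from the pizza data.
df_demo = pd.DataFrame({
    'Vegan': [0, 1, np.nan, 0],
    'Cats': [1, np.nan, 1, 2],
})

# Replace each NaN with the most frequent value in its column.
imputer = SimpleImputer(strategy='most_frequent')
df_filled = pd.DataFrame(imputer.fit_transform(df_demo), columns=df_demo.columns)
print(df_filled)
```

The imputer should be fit on the training data only and then applied to the test data with `transform`, so both sets receive the same fill values.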


***Encode String items into numerical values***

In [42]:
from sklearn.preprocessing import LabelEncoder
size_le = LabelEncoder()
toppings_le = LabelEncoder()

In [43]:
size = pizzeria.loc[:,'Size']
toppings = pizzeria.loc[:, 'Toppings']

size_lbl = size_le.fit_transform(size)
toppings_lbl = toppings_le.fit_transform(toppings)

# Assign whole columns so they take the encoder's integer dtype
pizzeria['Size'] = size_lbl
pizzeria['Toppings'] = toppings_lbl
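
As a quick check of what the encoder does, the sketch below fits a `LabelEncoder` on the size labels from the first rows printed earlier and maps the integer codes back to strings. The names `demo_le`, `demo_sizes`, `codes`, and `decoded` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Size labels as they appear in the first five rows of the training data.
demo_sizes = ['No order', 'No order', 'Large', 'No order', 'Medium']

demo_le = LabelEncoder()
codes = demo_le.fit_transform(demo_sizes)

# Classes are stored in sorted order, so the codes follow alphabetical order.
print(list(demo_le.classes_))
print(list(codes))

# inverse_transform takes an array of codes and returns the original strings.
decoded = demo_le.inverse_transform(codes)
print(list(decoded))
```

Keeping one fitted encoder per target, as done above with `size_le` and `toppings_le`, is what makes it possible to decode predictions later.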

***Curate X and Y Splits***

X - contains all the features 

Y_size - contains the target variable we want to predict for size ordered

Y_topping - contains the target variable we want to predict for topping ordered

In [47]:
X = pizzeria.iloc[:,:-2]
Y_size = pizzeria.iloc[:, -2]
Y_topping = pizzeria.iloc[:, -1]

***2. Classification Task***

**K-nearest neighbours**

Given the context of the question, my first approach is K-nearest neighbours. If individuals can be grouped by the features above, I'd suppose that the Euclidean distance between data points would indicate which size and topping category a new data point falls into. 

Given 500 data points, I run the algorithm with 3 up to 15 neighbours. 

It is also important to note that scaling of the feature values is necessary, since their ranges differ widely. 
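
To illustrate why scaling matters for a distance-based method, the sketch below takes the five training rows printed earlier and measures how much of the squared Euclidean distance between two of them comes from the Cash column, before and after standardization. The helper `cash_share` is my own, and five rows are far too few to say anything about the full dataset; this only shows the mechanism.

```python
import numpy as np

# First five training rows as printed above:
# Weight, Age, Days, Vegan, Cats, Cash
rows = np.array([
    [106.238809,  36.596211,  38, 0, 1,    5.699125],
    [184.378192,  28.739952,  28, 0, 0,    1.171537],
    [232.475732, 106.605562,  38, 1, 1,  259.440103],
    [112.811584, 103.684648, 112, 0, 0,   13.886261],
    [139.317810,  15.045878,  78, 0, 0, 1934.054928],
])

def cash_share(a, b):
    """Fraction of the squared Euclidean distance contributed by Cash."""
    sq = (a - b) ** 2
    return sq[-1] / sq.sum()

# Unscaled: the Cash column dominates the distance between rows 0 and 4.
raw_share = cash_share(rows[0], rows[4])

# Standardize each column to zero mean and unit variance (what scale() does).
scaled = (rows - rows.mean(axis=0)) / rows.std(axis=0)
scaled_share = cash_share(scaled[0], scaled[4])

print('Cash share of squared distance, raw:    %0.4f' % raw_share)
print('Cash share of squared distance, scaled: %0.4f' % scaled_share)
```

Without scaling, nearly all of the distance between these two customers is explained by Cash alone, so the other five features would barely influence which neighbours are chosen.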

My results do not go beyond 0.25 and 0.29 for predicting size and toppings respectively. This is poor in both absolute and relative terms. 


Results are shown below.


In [53]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale 
X_scaled = scale(X)


Note that we use two different classifiers here, one for size and the other for toppings. This will be consistent throughout the process. 

In [55]:
for i in range(3,16):
    # One classifier per target, both using the same k
    clf_size = KNeighborsClassifier(n_neighbors = i)
    clf_toppings = KNeighborsClassifier(n_neighbors=i)
    # Fit on the full training set; after the loop these hold the k = 15 models
    clf_size.fit(X_scaled, Y_size)
    clf_toppings.fit(X_scaled, Y_topping)
    # 10-fold cross-validated accuracy (cross_val_score refits clones internally)
    scores_size = cross_val_score(clf_size, X_scaled, Y_size, cv = 10)
    scores_topping = cross_val_score(clf_toppings, X_scaled, Y_topping, cv = 10)
    print('For k = ' + str(i) + ' neighbours, Mean accuracy for size: %0.4f' % scores_size.mean())
    print('For k = ' + str(i) + ' neighbours, Mean accuracy for toppings: %0.4f' % scores_topping.mean())
    print('\n')
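
One refinement worth considering: `scale(X)` standardizes using all rows before cross-validation, so each held-out fold influences the scaling applied to its own training folds. A `Pipeline` re-fits the scaler inside every fold instead. The sketch below shows the pattern on synthetic data from `make_classification`; the data, the names `X_demo`, `y_demo`, `knn_pipe`, and the choice of k = 5 are assumptions for illustration, not the pizza dataset or the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 500 rows, 6 features, 3 classes (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=1)

# The scaler is fit on the training part of each fold only.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores_demo = cross_val_score(knn_pipe, X_demo, y_demo, cv=10)

print('Mean accuracy: %0.4f' % scores_demo.mean())
```

With only a mean and standard deviation estimated per column, the effect on the scores is usually small, but the pipeline also guarantees that test data are later scaled with the training statistics.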

***Classifying the test set using K-nearest neighbours***

In [56]:
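from sklearn.preprocessing import StandardScaler
# Scale the test set with the training-set mean and standard deviation,
# so test rows live in the same feature space the classifiers were fit on.
test_scaled = StandardScaler().fit(X).transform(test)
hypotheses_size = clf_size.predict(test_scaled)
hypotheses_toppings = clf_toppings.predict(test_scaled)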

In [58]:
# Decode each set of predictions with its own encoder.
# inverse_transform works on the whole array, so no loop is needed.
size_results = pd.DataFrame(size_le.inverse_transform(hypotheses_size))
toppings_results = pd.DataFrame(toppings_le.inverse_transform(hypotheses_toppings))

In [60]:
print(size_results.head())
print(toppings_results.head())

***Random Forest***

In pursuit of more robust results, I will use an ensemble method to contrast against the simplicity of the K-nearest neighbours algorithm.

The Random Forest algorithm delivers a significant improvement over the K-nearest neighbours attempt, almost doubling the accuracy score to a high of 0.51 and 0.57 on the best fold for size and toppings respectively (the ten-fold means are lower, at roughly 0.41 and 0.45). Although this is a significant improvement in relative terms, the absolute value is still embarrassingly low. 




In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate



In [62]:
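# max_features='sqrt' is what 'auto' meant for classifiers; max_depth limits overfitting
clf_rf_s = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                  max_features='sqrt', max_depth=6,
                                  bootstrap=True, random_state=1)
# return_train_score=True keeps the train_accuracy column in recent scikit-learn versions
cv_rf_s = cross_validate(clf_rf_s, X=X_scaled, y=Y_size, cv=10,
                         scoring=['accuracy'], return_train_score=True)
# Place into dataframe
cv_rf_df = pd.DataFrame(cv_rf_s)
cv_rf_df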



   fit_time  score_time  test_accuracy  train_accuracy
0  0.017253    0.001114       0.346154        0.734375
1  0.016627    0.001928       0.490196        0.679287
2  0.020000    0.001669       0.431373        0.732739
3  0.016791    0.001074       0.450980        0.665924
4  0.017216    0.001216       0.431373        0.710468
5  0.015876    0.000994       0.333333        0.723831
6  0.013512    0.000796       0.510204        0.674058
7  0.015805    0.000911       0.354167        0.716814
8  0.013279    0.000826       0.333333        0.727876
9  0.012793    0.000830       0.437500        0.719027


In [48]:
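# Same settings as the size classifier, trained on the toppings target
clf_rf_t = RandomForestClassifier(n_estimators=10, criterion='entropy',
                                  max_features='sqrt', max_depth=6,
                                  bootstrap=True, random_state=1)
# return_train_score=True keeps the train_accuracy column in recent scikit-learn versions
cv_rf_t = cross_validate(clf_rf_t, X=X, y=Y_topping, cv=10,
                         scoring=['accuracy'], return_train_score=True)
# Place into dataframe
cv_rf_df = pd.DataFrame(cv_rf_t)
cv_rf_df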



   fit_time  score_time  test_accuracy  train_accuracy
0  0.019080    0.001280       0.377358        0.686801
1  0.021321    0.002363       0.415094        0.668904
2  0.023125    0.001351       0.461538        0.656250
3  0.017935    0.001385       0.480769        0.662946
4  0.016704    0.001173       0.420000        0.671111
5  0.016011    0.001072       0.571429        0.625277
6  0.014141    0.000892       0.408163        0.669623
7  0.013386    0.000946       0.479167        0.634956
8  0.013386    0.000882       0.404255        0.629139
9  0.013261    0.000874       0.489362        0.655629
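
The per-fold numbers in the two tables above can be condensed into a mean test accuracy and a train/test gap. The sketch below simply copies the `test_accuracy` and `train_accuracy` columns from those tables; the helper name `summarize` is my own.

```python
from statistics import mean

# test_accuracy / train_accuracy columns copied from the two tables above.
size_test = [0.346154, 0.490196, 0.431373, 0.45098, 0.431373,
             0.333333, 0.510204, 0.354167, 0.333333, 0.4375]
size_train = [0.734375, 0.679287, 0.732739, 0.665924, 0.710468,
              0.723831, 0.674058, 0.716814, 0.727876, 0.719027]
top_test = [0.377358, 0.415094, 0.461538, 0.480769, 0.42,
            0.571429, 0.408163, 0.479167, 0.404255, 0.489362]
top_train = [0.686801, 0.668904, 0.65625, 0.662946, 0.671111,
             0.625277, 0.669623, 0.634956, 0.629139, 0.655629]

def summarize(name, test_acc, train_acc):
    """Print mean and best test accuracy plus the mean train/test gap."""
    gap = mean(train_acc) - mean(test_acc)
    print('%s: mean test %0.3f, best fold %0.3f, train-test gap %0.3f'
          % (name, mean(test_acc), max(test_acc), gap))
    return mean(test_acc), gap

size_mean, size_gap = summarize('Size', size_test, size_train)
top_mean, top_gap = summarize('Toppings', top_test, top_train)
```

The mean test accuracy sits well below the best single fold, and training accuracy exceeds test accuracy in every fold, which suggests the forests still overfit despite the depth limit.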


Prediction results for the test set are produced below and will also be included as a txt file in the deliverable. 

In [65]:
clf_rf_s.fit(X, Y_size)
# predict() returns a 1-D array of encoded labels; decode it before building the frame
hypotheses = clf_rf_s.predict(test)
results_s = pd.DataFrame(size_le.inverse_transform(hypotheses), columns=['Size'])
# print(results_s.head())

In [64]:
clf_rf_t.fit(X, Y_topping)
# predict() returns a 1-D array of encoded labels; decode it before building the frame
hypotheses = clf_rf_t.predict(test)
results_t = pd.DataFrame(toppings_le.inverse_transform(hypotheses), columns=['Toppings'])
# print(results_t.head())



In [70]:
frames = [results_s, results_t]

final_result = pd.concat(frames, axis= 1)
final_result.to_csv('MidTerm2PredictII', sep=',', encoding='utf-8')

***Conclusion***

Overall, both classifiers perform poorly on this classification task. Further, it is apparent from the txt file, which lists the predictions of both algorithms, that their predictions for the test set differ considerably. 
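
To judge how poor these accuracies are, it helps to compare them against a classifier that always predicts the most frequent class. The sketch below shows that comparison using scikit-learn's `DummyClassifier`. The label vector `y_demo` is made up (its class proportions are an assumption, not the real distribution of sizes), and `X_demo` is a placeholder because the dummy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels: 50% 'No order', 30% 'Medium', 20% 'Large'.
y_demo = np.array(['No order'] * 250 + ['Medium'] * 150 + ['Large'] * 100)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy

baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=10)

# A real model is only useful if it beats this number.
print('Majority-class baseline accuracy: %0.4f' % baseline_scores.mean())
```

Running the same baseline on `Y_size` and `Y_topping` would show whether the K-nearest neighbours and Random Forest scores are better than simply guessing the most common label.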

While the data provided are usable, I point to a second path forward, one in which better data are the key to deeper insights. First, the quantity of data provided is limited; more observations would likely help the algorithms make better predictions, since a small sample may be insufficient to reveal patterns in the data. Second, the features provided may be weak predictors. I believe deeper feature engineering would yield better features to analyse, ultimately aiding the analysis and hence the modelling process.