## Support Vector Challenge

The goal is to see whether the ingredient and keyword list can predict a recipe's rating. For someone writing a cookbook, this could be really useful: it would help them choose recipes that are more likely to be enjoyed, which in turn makes the book more likely to succeed.

Transform this regression problem into a binary classifier and clean up the feature set. You can choose whether or not to include nutritional information, but try to cut your feature set down to the 30 most valuable features.
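
As a rough illustration of the "binary classifier" step, the sketch below turns a numeric rating into a 0/1 label. The sample values and the 4.0 cutoff are assumptions chosen for the example, not something the dataset or the challenge prescribes.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the recipe `rating` column.
ratings = pd.Series([0.0, 3.125, 3.75, 4.375, 5.0], name="rating")

# Assumed cutoff: ratings of 4.0 and above count as "high".
# This threshold is an illustrative choice only.
THRESHOLD = 4.0
high_rating = (ratings >= THRESHOLD).astype(int)

print(high_rating.tolist())
```

Whatever cutoff is chosen, it is worth checking how many recipes land on each side of it, since a very lopsided split makes plain accuracy hard to interpret.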

When you've finished that, also take a moment to think about bias. Is there anything in this dataset that makes you think it could be biased, perhaps extremely so?

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVR
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline

In [2]:
#import data
raw_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/epi_r.csv')

In [3]:
# Count nulls 
null_count = raw_data.isnull().sum()
null_count[null_count>0]

calories    4117
protein     4162
fat         4183
sodium      4119
dtype: int64

In [4]:
#dropping nulls
raw_data = raw_data.dropna()

In [5]:
#Recount nulls 
null_count = raw_data.isnull().sum()
null_count[null_count>0]

Series([], dtype: int64)

In [6]:
#importing some more modules
from sklearn import ensemble
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

In [7]:
# Round ratings to whole numbers so they can be used as class labels
raw_data['rating'] = raw_data['rating'].round()

In [8]:
#Use Random Forest to find top important features
rfc = ensemble.RandomForestClassifier()

In [9]:
# First setup of X and Y: every column except the target and the recipe title
X = raw_data.drop(['rating', 'title'], axis=1)
Y = raw_data['rating']


In [10]:
#fitting
rfc.fit(X,Y)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [11]:
# Rank features by random forest importance
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index=X.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)

In [12]:
#top 30 features
feature_importances.importance.head(30)

calories             0.054300
sodium               0.053138
protein              0.044867
fat                  0.044666
bon appétit          0.011246
gourmet              0.011124
quick & easy         0.010315
drink                0.009454
summer               0.008386
wheat/gluten-free    0.007223
winter               0.006849
bake                 0.006790
vegetarian           0.006768
kid-friendly         0.006739
tree nut free        0.006642
fall                 0.006595
dairy                0.006296
milk/cream           0.006219
sauce                0.005881
egg                  0.005848
dairy free           0.005756
alcoholic            0.005654
onion                0.005616
soy free             0.005563
kidney friendly      0.005528
vegetable            0.005526
kosher               0.005517
herb                 0.005457
fruit                0.005396
gin                  0.005395
Name: importance, dtype: float64
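
Because the forest above is fit with `random_state=None`, the importance ranking can shift from run to run. One way to keep the model inputs in sync with the ranking is to read the column names from the sorted frame instead of typing them out. The sketch below uses made-up importances and a hypothetical `TOP_N` of 2 so it runs on its own; in the notebook the frame would be `feature_importances` and the cutoff 30.

```python
import pandas as pd

# Hypothetical importances standing in for rfc.feature_importances_.
feature_importances = pd.DataFrame(
    {"importance": [0.05, 0.01, 0.04, 0.002]},
    index=["calories", "gourmet", "protein", "gin"],
).sort_values("importance", ascending=False)

TOP_N = 2  # illustrative; the notebook keeps the top 30
top_features = feature_importances.head(TOP_N).index.tolist()

print(top_features)
```

With the real data, `raw_data[top_features]` would then select the matching columns. Passing a fixed `random_state` to `RandomForestClassifier` would additionally make the ranking itself repeatable.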

In [13]:
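# Model inputs: 30 features (the four nutritional columns plus 26 tags).
# Note: this list does not exactly match the top-30 output above
# (for example, 'side' and 'dessert' appear here but not there), likely
# because the unseeded random forest ranks features differently on each run.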

X= raw_data[['sodium',
'calories',
'protein',
'fat',
'bon appétit',
'gourmet',
'quick & easy',
'drink',
'summer',
'vegetarian',
'wheat/gluten-free',
'fall',
'bake',
'winter',
'dairy',
'onion',
'kid-friendly',
'tree nut free',
'sauce',
'side',
'house & garden',
'milk/cream',
'dessert',
'vegetable',
'egg',
'herb',
'peanut free',
'kosher',
'tomato',
'chill']]

Y= raw_data['rating']


In [14]:
#Support vector
from sklearn.svm import SVR
svr = SVR()
svr.fit(X,Y)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
# R^2 on the same data the model was trained on (not a held-out score)
svr.score(X, Y)

0.5338160067222137

In [16]:
# 5-fold cross-validated R^2
from sklearn.model_selection import cross_val_score
cross_val_score(svr, X, Y, cv=5)

array([0.07147608, 0.08087719, 0.0790401 , 0.07159082, 0.0655964 ])
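
The training R^2 of about 0.53 against cross-validated scores near 0.07 suggests the model does not generalize well. One possible factor is feature scale: an RBF kernel is sensitive to it, and columns such as calories and sodium span a far larger range than the 0/1 tag columns. The sketch below is not the notebook's method. It shows, on synthetic data, how a scaled `SVC` pipeline could be cross-validated for the binary version of the problem. `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: one large-scale numeric column and one 0/1 tag column.
X_demo = np.column_stack([
    rng.normal(500, 200, size=200),
    rng.integers(0, 2, size=200),
])
# Synthetic binary target that depends on the numeric column.
y_demo = (X_demo[:, 0] > 500).astype(int)

# Scaling sits inside the pipeline, so each CV fold fits the scaler
# on its own training split only.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X_demo, y_demo, cv=5)

print(scores.mean())
```

For a classifier, `cross_val_score` reports accuracy by default, so these numbers are not directly comparable with the R^2 values above.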

Besides the lack of random sampling, there could be a few other biases in this dataset. The one that stands out to me is that people tend to have preconceived notions about certain types of food and may not give a true and honest opinion. Vegetarian, healthy, and 'xyz'-free items usually come with skepticism and some bias.

This happens for a number of reasons, which include culture, region, and even the time of year (a person is more prone to favor healthier options in the "resolution" time frame).
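
A simple way to start probing this kind of bias is to compare how ratings are distributed with and without a given tag. The sketch below uses a tiny made-up frame. The values are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Hypothetical sample: a rating column plus one 0/1 tag column.
df = pd.DataFrame({
    "rating": [5.0, 4.375, 0.0, 3.75, 4.375, 0.0],
    "vegetarian": [0, 0, 1, 1, 0, 1],
})

# Count and mean rating for recipes without (0) and with (1) the tag.
summary = df.groupby("vegetarian")["rating"].agg(["count", "mean"])

print(summary)
```

A gap between the groups would not prove reviewer bias on its own, since the recipes themselves may differ, but it indicates where a closer look is warranted.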