
In [2]:
import numpy as np
import pandas as pd
import scipy
from sklearn.svm import SVR


First, let's load the dataset. More information about the data can be found on [Kaggle](https://www.kaggle.com/hugodarwood/epirecipes).

In [3]:
raw_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/epi_r.csv')
raw_data.shape

(20052, 680)

Before running the original SVM model and scoring the result, count the nulls and drop the rows that contain them. **Note: The model fit takes a while to run.**

In [3]:
# Count nulls 
null_count = raw_data.isnull().sum()
null_count[null_count>0]

calories    4117
protein     4162
fat         4183
sodium      4119
dtype: int64

In [4]:
raw_data = raw_data.dropna()
raw_data.shape

(15864, 680)

In [10]:
svr = SVR()
# Drop the target, the title, and the dietary columns; keep a 30% sample of rows
X = raw_data.drop(['rating', 'title', 'calories', 'protein', 'fat', 'sodium'], axis=1).sample(frac=0.3, replace=True, random_state=1)
Y = raw_data.rating.sample(frac=0.3, replace=True, random_state=1)
svr.fit(X,Y)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [11]:
svr.score(X, Y)

0.4099845913800794

The model performance was pretty abysmal, even though it was scored on the same data it was fit on. Let's run a cross-validated instance to see if the results improve or are at least stable. **This also takes some time to run.**

In [12]:
from sklearn.model_selection import cross_val_score
cross_val_score(svr, X, Y, cv=5)

array([0.19563385, 0.14683481, 0.15013401, 0.13597791, 0.16082192])

__Note that this takes quite a while to run compared to some of the models we've done before.__ That's because of the high number of features we have in this dataset.

Oh dear, this did not work very well. In fact, it is remarkably poor. There are many things that we could do here.
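For context on the two scores above: for a regressor, both `score` and the default `cross_val_score` report the coefficient of determination R², but the first was computed on the rows the model was fit on, while the second uses held-out folds. Here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration and are not the recipe data.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption for illustration, not the recipe set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + rng.normal(0, 0.5, 200)

model = SVR().fit(X_demo, y_demo)

# score() on the training rows is the R^2 of the in-sample predictions.
train_r2 = model.score(X_demo, y_demo)
same_r2 = r2_score(y_demo, model.predict(X_demo))

# cross_val_score refits on 4/5 of the rows and scores R^2 on the held-out 1/5.
cv_r2 = cross_val_score(SVR(), X_demo, y_demo, cv=5)

print(train_r2, same_r2)
print(cv_r2.mean())
```

The cross-validated number is usually the more honest estimate of how the model would do on new recipes.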

__Here is your challenge.__  

Let's take some steps to improve our model. There are a few ways we can try to address this poor performance:

1. We could go back and clean up our feature set. There might be some gains to be made by getting rid of the noise.

2. We could also see how removing the nulls but including dietary information performs. Though it's a slight change to the question, we could still possibly get some improvements there.

3. Lastly, we could take our regression problem and turn it into a classification problem. With this number of features and a discontinuous outcome, we might have better luck with a classifier. We could simplify further: instead of classifying on each possible rating value, group the ratings into high and low classes to make this a binary classification problem.

Ideally we would like to cut our feature set down to the 30 most valuable features.
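One way to get down to a fixed number of features is univariate selection. The sketch below uses scikit-learn's `SelectKBest` with `f_regression` on synthetic data. This is only one possible approach, not necessarily the one intended by the challenge, and `X_demo` / `y_demo` are hypothetical stand-ins for the recipe features and ratings.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: 200 rows, 100 binary tag-like columns (assumed data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 100)).astype(float)
# Hypothetical target that depends only on the first three columns.
y_demo = X_demo[:, 0] + 2 * X_demo[:, 1] - X_demo[:, 2] + rng.normal(0, 0.1, 200)

# Keep the 30 columns with the highest univariate F-score against the target.
selector = SelectKBest(score_func=f_regression, k=30)
X_top30 = selector.fit_transform(X_demo, y_demo)

# Column indices that survived the selection.
kept = selector.get_support(indices=True)
print(X_top30.shape)
print(kept[:10])
```

On a DataFrame, `X.columns[selector.get_support()]` would recover the selected column names.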

Good luck!

1. Clean up the features a bit 
  - remove 
2. removing the nulls but including dietary info only
3. adding dietary info to the other features
4. Convert regression problem to classification - with high and low classes



In [5]:
#print the size of the data
raw_data.shape

(20052, 680)

## 1. Feature Cleanup 

Get the word count for each feature.

In [5]:
# Count the number of words in each feature (column) name
feature_str = raw_data.drop(['rating', 'title', 'calories', 'protein', 'fat', 'sodium'], axis=1).columns
a = [[wrd, len(wrd.split())] for wrd in feature_str]
# Build the DataFrame once, after all names have been processed
d = pd.DataFrame(a, columns=['feature', 'word count'])

In [6]:
d['word count'].value_counts()

1    495
2    146
3     29
4      4
Name: word count, dtype: int64

In [7]:
d[(d['word count']==4)]

Unnamed: 0,feature,word count
4,30 days of groceries,4
146,cook like a diner,4
202,epi loves the microwave,4
404,"no meat, no problem",4


In [8]:
new_cols = d[(d['word count']==1)].feature

In [9]:
X = raw_data.drop(['rating', 'title', 'calories', 'protein', 'fat', 'sodium'], axis=1).sample(frac=0.3, replace=True, random_state=1)
Y = raw_data.rating.sample(frac=0.3, replace=True, random_state=1)

In [10]:
x_new = X[new_cols]

In [11]:
from sklearn.model_selection import cross_val_score
svr = SVR()
cross_val_score(svr, x_new, Y, cv=5)

array([0.21245565, 0.15740591, 0.14852936, 0.12297991, 0.07369811])

Well that didn't improve the performance much, let's move on to another strategy.

## Now try using the feature counts to reduce the number of features.

In [12]:
dsub = raw_data[feature_str]

In [13]:
feature_counts = pd.DataFrame(dsub.sum(axis =0)).reset_index()
feature_counts.columns = ['variable','cnts']
feature_counts = feature_counts.sort_values(by='cnts',ascending=False)
feature_counts['pct'] = feature_counts['cnts']/feature_counts['cnts'].max()
feature_counts.head()

Unnamed: 0,variable,cnts,pct
57,bon appétit,7383.0,1.0
453,peanut free,6721.0,0.910335
574,soy free,6502.0,0.880672
240,gourmet,5683.0,0.769741
624,tree nut free,5616.0,0.760666


In [14]:
import plotly.express as px
# Horizontal bar chart of each feature's count relative to the most common feature
fig = px.bar(feature_counts, x="pct", y="variable", orientation='h')
fig.show()

Select features whose count is more than 10% of the most common feature's count.

In [20]:
new_cols = list(feature_counts[feature_counts['pct']> 0.1].variable.values)

In [21]:
len(new_cols)

58
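Note that `pct` above is each count divided by the largest count, not by the number of rows, so the same cutoff can select different features depending on which denominator is used. A small sketch with a made-up tag table (the column names `common`, `medium`, and `rare` are hypothetical):

```python
import pandas as pd

# Toy binary tag matrix: 10 recipes, 3 hypothetical tags.
tags = pd.DataFrame({
    "common": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # present in 8 of 10 rows
    "medium": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # present in 2 of 10 rows
    "rare":   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # present in 1 of 10 rows
})

counts = tags.sum(axis=0)

# Same normalisation as feature_counts['pct']: relative to the largest count.
pct_of_max = counts / counts.max()

# Alternative normalisation: fraction of rows in which the tag appears.
frac_of_rows = counts / len(tags)

threshold = 0.12
by_max = list(pct_of_max[pct_of_max > threshold].index)
by_rows = list(frac_of_rows[frac_of_rows > threshold].index)
print(by_max)
print(by_rows)
```

Here `rare` passes the cutoff relative to the maximum (1/8) but not as a fraction of rows (1/10).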

Add dietary info

In [22]:
diet = ['calories', 'protein', 'fat', 'sodium']
new_cols = new_cols + diet

In [29]:
len(new_cols)

62

Run the model with new features.

In [23]:
X = raw_data.drop(['rating', 'title'], axis=1).sample(frac=0.3, replace=True, random_state=1)
X_new = X[new_cols]
Y = raw_data.rating.sample(frac=0.3, replace=True, random_state=1)
cross_val_score(svr, X_new, Y, cv=5)

array([-0.13491864, -0.16798304, -0.15465153, -0.16913067, -0.17592541])

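Ooh, that is worse: the scores are now negative, meaning the model does worse than simply predicting the mean rating. One possible cause is that the dietary columns are on a much larger scale than the 0/1 tag columns, which an RBF kernel is sensitive to. Let's use more data since we have fewer features.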

In [24]:
X = raw_data.drop(['rating', 'title'], axis=1).sample(frac=0.9, replace=True, random_state=1)
X_new = X[new_cols]
Y = raw_data.rating.sample(frac=0.9, replace=True, random_state=1)
cross_val_score(svr, X_new, Y, cv=5)

array([-0.15229987, -0.16817587, -0.14773998, -0.15178299, -0.15631222])
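Negative R² values mean the model is doing worse than predicting the mean. One thing worth trying, though it is not verified on the recipe data here, is standardising the features before the SVR, because an RBF kernel measures distances and a large-valued column such as calories can swamp the 0/1 tag columns. The sketch below shows the idea on synthetic data. `tag`, `big`, `X_demo`, and `y_demo` are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 300

# Five binary tag-like columns plus one large-valued, uninformative column.
tag = rng.integers(0, 2, size=(n, 5)).astype(float)
big = rng.normal(500, 300, size=(n, 1))  # calories-like scale (assumed)
X_demo = np.hstack([tag, big])

# Hypothetical target that depends only on the first two tag columns.
y_demo = tag[:, 0] + tag[:, 1] + rng.normal(0, 0.1, n)

# SVR on the raw features: distances are dominated by the large column.
raw_scores = cross_val_score(SVR(), X_demo, y_demo, cv=5)

# Same model, but every feature is standardised inside each CV fold.
scaled_model = make_pipeline(StandardScaler(), SVR())
scaled_scores = cross_val_score(scaled_model, X_demo, y_demo, cv=5)

print(raw_scores.mean())
print(scaled_scores.mean())
```

Putting the scaler inside the pipeline keeps the scaling statistics from leaking across cross-validation folds.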

Yikes! Still very poor. Let's try a classification approach.

Convert regression to classification with ratings greater than or equal to three in one class and less than three in another class.

In [25]:
raw_data['class'] = np.where( raw_data['rating'] >= 3, 1, 0)

In [26]:
from sklearn.svm import SVC
svc = SVC()
X = raw_data.drop(['rating', 'title', 'class'], axis=1).sample(frac=0.3, replace=True, random_state=1)
Y = raw_data['class'].sample(frac=0.3, replace=True, random_state=1)
cross_val_score(svc, X, Y, cv=5)

array([0.875     , 0.875     , 0.875     , 0.875     , 0.87592008])

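Wow, those scores look much better. Keep in mind that these are accuracy scores, which are not directly comparable to the R² scores from the regression models, and that nearly identical scores across folds can be a sign that the classifier is mostly predicting the majority class. Let's try it with more data and fewer features.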

In [28]:
X = raw_data.drop(['rating', 'title', 'class'], axis=1).sample(frac=0.9, replace=True, random_state=1)
X_new = X[new_cols]
Y = raw_data['class'].sample(frac=0.9, replace=True, random_state=1)
cross_val_score(svc, X_new, Y, cv=5)

array([0.8820028 , 0.8820028 , 0.8820028 , 0.88231173, 0.88196147])
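Before reading too much into these accuracy values, it can help to compare them with a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with an assumed 88% positive rate. The real class balance should be checked with `raw_data['class'].mean()`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up labels: 352 "high" and 48 "low" ratings (assumed 88/12 split).
y_demo = np.array([1] * 352 + [0] * 48)

# Features are irrelevant to the baseline; random noise is enough here.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(len(y_demo), 4))

# Always predict the most frequent class seen in the training fold.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)

# Share of the larger class in the full label vector.
majority_rate = max(y_demo.mean(), 1 - y_demo.mean())

print(baseline_scores.mean())
print(majority_rate)
```

If the SVC accuracy is no higher than this baseline on the real data, the classifier has not learned anything beyond the class balance.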