Thinkful Bootcamp Course

Author: Ian Heaton

Email: iheaton@gmail.com

Mentor: Nemanja Radojkovic

Date: 2017/04/28


In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import us

%matplotlib inline

sb.set_style('darkgrid')
my_dpi = 96

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Delicious Classification


## Question:
How well can a Support Vector Machine classify recipe ratings?

We want to see whether the ingredients, nutritional information and keyword list can be used to predict a recipe's rating.


### Data:
The data set is drawn from the larger Epicurious dataset, a collection of recipes with their key terms, ingredients and ratings.

The dataset can be [found on Kaggle](https://www.kaggle.com/hugodarwood/epirecipes).


### Context:
For someone writing a cookbook, this information could help them choose which recipes to include. Recipes that are more likely to be enjoyed make the book more likely to be successful.

### Content:
Over 20k recipes listed by recipe rating, nutritional information and assigned category (sparse). Dataset contains 680 features.


In [50]:
# Read CSV containing text data
data_file = '/media/ianh/space/ThinkfulData/Epicurious/epi_r.csv'
recipes = pd.read_csv(data_file)
print("\nObservations : %d" % (recipes.shape[0]))


Observations : 20052


## Preprocessing and exploratory data analysis

In [38]:
# Count nulls of features
null_count = recipes.isnull().sum()
null_count[null_count > 0]

calories    4117
protein     4162
fat         4183
sodium      4119
dtype: int64

These four features have a lot of missing data points. Our next question is whether the missing data points fall mostly within the same observations (same data frame index) or are scattered throughout the data set.
Our bodies are genetically drawn to food with higher portions of fat and protein, so retaining these features may help our model classify recipe ratings.

In [45]:
# Add the indices of rows with missing feature data to a set to assess the
# dispersion of missing data across the entire data set
big_mask = recipes.isnull()
unique_rows = set(list(big_mask[big_mask.calories == True].index.values))
unique_rows.update(list(big_mask[big_mask.protein == True].index.values))
unique_rows.update(list(big_mask[big_mask.fat == True].index.values))
unique_rows.update(list(big_mask[big_mask.sodium == True].index.values))
print('\nThe missing data for the 4 features are roughly confined to the same %d indices' % (len(unique_rows)))


The missing data for the 4 features are roughly confined to the same 4188 indices


Adding the index values of every row where one of the four features is null to a set tells us how dispersed the missing data points are: the set keeps each index only once, so its size is the number of distinct rows affected. Here the set holds 4,188 indices, barely more than the 4,183 nulls in `fat` alone, so the gaps largely coincide. If the missing data were widely dispersed, we would need to drop the features from the data frame.

With 4,188 rows affected, we can still use the four features if we are willing to settle for 15,864 observations (20,052 - 4,188) to drive our Support Vector Machine.
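
The same overlap check can also be sketched with a row-wise mask instead of a set. This is a minimal sketch, not the notebook's method: the `toy` frame and its values are made up for illustration and only the four column names come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `recipes`; the values are invented.
toy = pd.DataFrame({
    "calories": [100.0, np.nan, 250.0, np.nan],
    "protein":  [5.0, np.nan, 12.0, 8.0],
    "fat":      [2.0, np.nan, np.nan, 3.0],
    "sodium":   [30.0, np.nan, 90.0, 40.0],
})

nutrition = ["calories", "protein", "fat", "sodium"]

# True for every row that is missing at least one of the four features
rows_with_missing = toy[nutrition].isnull().any(axis=1)
n_missing_rows = int(rows_with_missing.sum())

# Keep only the rows where all four nutritional features are present
complete = toy.dropna(subset=nutrition)

print("Rows with at least one missing value: %d" % n_missing_rows)
print("Rows kept: %d" % len(complete))
```

On the real data, the count of flagged rows should match the size of the set built above, and `dropna(subset=...)` would leave the observations that remain usable.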

There are features named after states, features whose column names start with ‘#’, and others with dubious labels that may hold very little information. They may not be worth the extra computation time and the added risk of overfitting our model. Let's investigate.

In [67]:
# Columns the filter must skip: 'title' is of type str, and the four
# nutritional features are the ones we are trying to keep
excluded = ['title', 'calories', 'protein', 'fat', 'sodium']

# Iterate over all column names and record those with fewer than `threshold`
# values greater than zero. This should catch the column names that start with '#'
def create_columns_filter(column_names, threshold=5):
    result = []
    for label in column_names:
        if label not in excluded:
            number = np.sum(recipes[label] > 0)
            if number < threshold:
                result.append(label)
    return result

# Find all columns that meet the criteria above
columns_to_be_dropped = create_columns_filter(recipes.columns)

# Let's add the 'title' column to the list of features to be removed
columns_to_be_dropped.insert(0, 'title')

# Remove from the passed dataframe, in place, every feature whose name
# appears in the cols parameter
def prune(cols, dataframe):
    dataframe.drop(cols, axis=1, inplace=True)

# Prune baby prune
prune(columns_to_be_dropped, recipes)
print("Number of features after pruning: %d" % (len(recipes.columns)))

Number of features after pruning: 570
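
The column filter above can also be written without an explicit loop. The following is a minimal sketch under stated assumptions: `sparse_columns` is a hypothetical helper, and the `toy` frame with its column names and values is made up for illustration rather than taken from the Epicurious data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `recipes`; the values are invented.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f"],
    "rating": [4.0, 3.5, 0.0, 5.0, 4.5, 2.5],
    "#cakeweek": [0, 0, 1, 0, 0, 0],
    "alabama": [0, 1, 0, 0, 1, 0],
    "dessert": [1, 1, 1, 0, 1, 1],
})

def sparse_columns(frame, excluded, threshold=5):
    """Return names of columns with fewer than `threshold` values above zero."""
    # Leave out the columns that must not be tested (e.g. string columns)
    candidates = frame.drop(columns=excluded)
    # Column-wise count of strictly positive entries
    positive_counts = (candidates > 0).sum()
    return positive_counts[positive_counts < threshold].index.tolist()

to_drop = sparse_columns(toy, excluded=["title"])
pruned = toy.drop(columns=["title"] + to_drop)

print("Dropped: %s" % to_drop)
print("Number of features after pruning: %d" % len(pruned.columns))
```

The comparison `candidates > 0` is evaluated for the whole frame at once, which avoids one pass per column and keeps the threshold logic in a single expression.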


## Conclusions

## References
