# Kaggle Yummly

[What's Cooking](https://www.kaggle.com/c/whats-cooking)

Use GraphLab Create to predict the cuisine type of a dish from its ingredients.

In [1]:
## import the libraries
import json
import graphlab as gl
import pandas as pd

## options
gl.canvas.set_target('ipynb') # use IPython Notebook output for GraphLab Canvas



In [14]:
## load the training json
with open("../data/train.json") as data_file:
    train_json = json.load(data_file)
    
## load the test json
with open("../data/test.json") as data_file:
    test_json = json.load(data_file)

In [37]:
## training data elements
train_id = []
train_cuisine = []
train_ingredients = []
for dish in train_json:
    train_id.append(dish['id'])
    train_cuisine.append(dish['cuisine'])
    # build a dictionary
    ing_list = dish['ingredients']
    counts = [1]*len(ing_list)
    train_ingredients.append(dict(zip(ing_list, counts)))
    

## test data elements
test_id = []
test_ingredients = []
for dish in test_json:
    test_id.append(dish['id'])
    # build a dictionary
    ing_list = dish['ingredients']
    counts = [1]*len(ing_list)
    test_ingredients.append(dict(zip(ing_list, counts)))
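
The two loops above repeat the same dictionary-building step. A minimal sketch of how it could be factored into a helper is shown here; the function name `ingredients_to_dict` and the sample record are hypothetical, not part of the original notebook.

```python
def ingredients_to_dict(ingredient_list):
    """Map each ingredient name to 1, giving a bag-of-ingredients dict.

    Equivalent to dict(zip(ingredient_list, [1] * len(ingredient_list))):
    a name that appears twice still ends up with the value 1.
    """
    return dict.fromkeys(ingredient_list, 1)


# Hypothetical record in the same shape as the competition JSON
sample_dish = {"id": 1, "cuisine": "greek",
               "ingredients": ["feta", "olive oil", "feta"]}
sample_features = ingredients_to_dict(sample_dish["ingredients"])
```

With such a helper, each loop body would reduce to a single `append(ingredients_to_dict(dish['ingredients']))` call.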
    

In [64]:
## use the elements to build SFrames that include the ingredients as a dict
train_sf = gl.SFrame({'id':train_id, 'cuisine':train_cuisine, 'ingredients':train_ingredients})
test_sf = gl.SFrame({'id':test_id, 'ingredients':test_ingredients})

## Quick look at the data

In [45]:
train_sf.column_names

<bound method SFrame.column_types of Columns:
	cuisine	str
	id	int
	ingredients	dict

Rows: 39774

Data:
+-------------+-------+-------------------------------+
|   cuisine   |   id  |          ingredients          |
+-------------+-------+-------------------------------+
|    greek    | 10259 | {'pepper': 1, 'seasoning':... |
| southern_us | 25693 | {'vegetable oil': 1, 'plai... |
|   filipino  | 20130 | {'garlic powder': 1, 'butt... |
|    indian   | 22213 | {'water': 1, 'vegetable oi... |
|    indian   | 13162 | {'butter': 1, 'garlic past... |
|   jamaican  |  6602 | {'butter': 1, 'plain flour... |
|   spanish   | 42779 | {'olive oil': 1, 'chorizo ... |
|   italian   |  3735 | {'olive oil': 1, 'almond e... |
|   mexican   | 16903 | {'olive oil': 1, 'pork': 1... |
|   italian   | 12734 | {'chopped tomatoes': 1, 'f... |
+-------------+-------+-------------------------------+
[39774 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.>

In [46]:
test_sf.column_names

<bound method SFrame.column_names of Columns:
	id	int
	ingredients	dict

Rows: 9944

Data:
+-------+-------------------------------+
|   id  |          ingredients          |
+-------+-------------------------------+
| 18009 | {'eggs': 1, 'milk': 1, 'ba... |
| 28583 | {'cream of tartar': 1, 'co... |
| 41580 | {'olive oil': 1, 'fennel b... |
| 29752 | {'vegetable oil': 1, 'garl... |
| 35687 | {'sausage casings': 1, 'pa... |
| 38527 | {'corn starch': 1, 'peach ... |
| 19666 | {'orange': 1, 'grape juice... |
| 41217 | {'vegetable oil': 1, 'whit... |
| 28753 | {'vegetable oil': 1, 'taco... |
| 22659 | {'butter': 1, 'self raisin... |
+-------+-------------------------------+
[9944 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.>

## Model Building

My first model built with GraphLab. [Documentation can be found here](https://dato.com/products/create/docs/graphlab.toolkits.classifier.html)

In [47]:
## automatic model selection, no separate hold-out sample just yet
## (by default GraphLab sets aside 5 percent of the training data for validation)
model_1 = gl.classifier.create(train_sf, target="cuisine", features=["ingredients"])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 37736
PROGRESS: Number of classes           : 20
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 6631
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   5.612e-01   5.368e-01        1.02s
PROGRESS:      1   6.335e-01   6.099e-01        2.07s
PROGRESS:      2   6

In [50]:
## summarize the model
model_1.summary()

Class                         : LogisticClassifier

Schema
------
Number of coefficients        : 126008
Number of examples            : 37736
Number of classes             : 20
Number of feature columns     : 1
Number of unpacked features   : 6631

Hyperparameters
---------------
L1 penalty                    : 0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 10
Solver status                 : TERMINATED: Iteration limit reached.
Training time (sec)           : 37.1009

Settings
--------
Log-likelihood                : 10447.5108

Highest Positive Coefficients
-----------------------------
ingredients[blueberri preserv]: 30.2263
ingredients[strawberry compote]: 21.9533
ingredients[brown beech mushrooms]: 21.1545
ingredients[instant tea powder]: 21.0494
ingredients[creme de cacao]   : 20.0717

Lowest Negative Coefficients
----------------------------
ingredients[instant tea powder]: -18.3682
i

In [51]:
## look up the documentation for predict (IPython help syntax)
model_1.predict?

In [53]:
test_preds = model_1.predict(test_sf)
test_preds

dtype: str
Rows: 9944
['british', 'southern_us', 'italian', 'cajun_creole', 'italian', 'southern_us', 'spanish', 'chinese', 'mexican', 'irish', 'italian', 'greek', 'indian', 'italian', 'british', 'italian', 'mexican', 'southern_us', 'mexican', 'southern_us', 'korean', 'indian', 'moroccan', 'vietnamese', 'italian', 'southern_us', 'vietnamese', 'korean', 'italian', 'cajun_creole', 'mexican', 'thai', 'southern_us', 'japanese', 'chinese', 'mexican', 'russian', 'indian', 'indian', 'cajun_creole', 'cajun_creole', 'chinese', 'french', 'mexican', 'italian', 'italian', 'spanish', 'indian', 'vietnamese', 'chinese', 'italian', 'thai', 'indian', 'filipino', 'italian', 'chinese', 'italian', 'japanese', 'chinese', 'jamaican', 'french', 'mexican', 'filipino', 'korean', 'mexican', 'greek', 'filipino', 'thai', 'italian', 'italian', 'french', 'indian', 'thai', 'thai', 'indian', 'japanese', 'indian', 'mexican', 'southern_us', 'greek', 'chinese', 'spanish', 'italian', 'korean', 'british', 'southern_us', '

In [65]:
## add to the SFrame
test_sf['cuisine'] = test_preds

In [66]:
test_sf

id,ingredients,cuisine
18009,"{'eggs': 1, 'milk': 1, 'baking powder': 1, ' ...",british
28583,"{'cream of tartar': 1, 'corn starch': 1, ...",southern_us
41580,"{'olive oil': 1, 'fennel bulb': 1, 'sausage ...",italian
29752,"{'vegetable oil': 1, 'garlic cloves': 1, ' ...",cajun_creole
35687,"{'sausage casings': 1, 'parmigiano reggiano ...",italian
38527,"{'corn starch': 1, 'peach slices': 1, 'baking ...",southern_us
19666,"{'orange': 1, 'grape juice': 1, 'white ...",spanish
41217,"{'vegetable oil': 1, 'white pepper': 1, 'w ...",chinese
28753,"{'vegetable oil': 1, 'taco seasoning mix': 1, ...",mexican
22659,"{'butter': 1, 'self raising flour': 1, ...",irish


In [67]:
## drop the ingredients column before writing to the submission folder
test_sf.remove_column('ingredients')

id,cuisine
18009,british
28583,southern_us
41580,italian
29752,cajun_creole
35687,italian
38527,southern_us
19666,spanish
41217,chinese
28753,mexican
22659,irish


In [61]:
## write the predictions
test_sf.save('../submissions/submission2.csv', format='csv')

In [62]:
## look at the file to make sure it looks ok for submission
!cat ../submissions/submission2.csv | head -10

id,cuisine
18009,"british"
28583,"southern_us"
41580,"italian"
29752,"cajun_creole"
35687,"italian"
38527,"southern_us"
19666,"spanish"
41217,"chinese"
28753,"mexican"
cat: stdout: Broken pipe


This submission resulted in a score of __0.74899__.

### Next Steps

-  clean and evaluate the ingredients, as there appear to be data issues such as typos, possibly because the names were never validated
-  tune the model parameters and/or search across a wider parameter space
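
Before tuning, it may help to score models locally rather than only through submissions. The sketch below is a plain-Python illustration of a hold-out split and an accuracy measure (the fraction of exact label matches, which is my understanding of how this competition is scored). The function names and toy labels are hypothetical, and nothing here calls GraphLab.

```python
import random


def holdout_split(records, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of records and split it into (train, holdout) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(actual, predicted):
    """Fraction of predictions that exactly match the true labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / float(len(actual))


# Toy data only; these are not results from the notebook
toy_train, toy_holdout = holdout_split(range(10), holdout_fraction=0.2, seed=0)
toy_accuracy = accuracy(["greek", "thai", "thai", "irish"],
                        ["greek", "thai", "indian", "irish"])
```

The same idea would apply to the dish records: hold back a fraction, train on the rest, and compare predictions on the held-back rows against their known cuisines.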

## Model Version 2

My assumption is that the ingredients need more cleaning. I could be wrong, but I suspect that no data validation was applied to the ingredient names. Let's take a look...

In [77]:
train_sf['ingredients'].head()

dtype: dict
Rows: 10
[{'pepper': 1, 'seasoning': 1, 'garbanzo beans': 1, 'grape tomatoes': 1, 'garlic': 1, 'black olives': 1, 'feta cheese crumbles': 1, 'romaine lettuce': 1, 'purple onion': 1}, {'vegetable oil': 1, 'plain flour': 1, 'thyme': 1, 'ground black pepper': 1, 'eggs': 1, 'yellow corn meal': 1, 'green tomatoes': 1, 'ground pepper': 1, 'salt': 1, 'milk': 1, 'tomatoes': 1}, {'garlic powder': 1, 'butter': 1, 'pepper': 1, 'green chilies': 1, 'cooking oil': 1, 'eggs': 1, 'mayonaise': 1, 'soy sauce': 1, 'grilled chicken breasts': 1, 'yellow onion': 1, 'salt': 1, 'chicken livers': 1}, {'water': 1, 'vegetable oil': 1, 'wheat': 1, 'salt': 1}, {'butter': 1, 'garlic paste': 1, 'bay leaf': 1, 'chili powder': 1, 'boneless chicken skinless thigh': 1, 'natural yogurt': 1, 'cornflour': 1, 'milk': 1, 'lemon juice': 1, 'black pepper': 1, 'cayenne pepper': 1, 'water': 1, 'passata': 1, 'oil': 1, 'double cream': 1, 'shallots': 1, 'onions': 1, 'garam masala': 1, 'salt': 1, 'ground cumin': 1}, {'bu

In [90]:
## put all of the ingredients together, training and test, dupes and all
ingredients = []

## the training ingredients
for ing in train_sf['ingredients']:
    dish_ings = ing.keys()
    for x in dish_ings:
        ingredients.append(x)

## rebuild because I dropped the column previously, rookie mistake
test_sf = gl.SFrame({'id':test_id, 'ingredients':test_ingredients})
        
## the test ingredients
for ing in test_sf['ingredients']:
    dish_ings = ing.keys()
    for x in dish_ings:
        ingredients.append(x)
    

In [93]:
## put into a pandas series
ingredients_sa = pd.Series(ingredients)

In [99]:
print "there are %s values" % ingredients_sa.count()

## print the counts
ingredients_sa.value_counts()

there are 535644 values


salt                                          22533
onions                                        10008
olive oil                                      9888
water                                          9293
garlic                                         9171
sugar                                          8064
garlic cloves                                  7771
butter                                         6077
ground black pepper                            5989
all-purpose flour                              5816
vegetable oil                                  5516
pepper                                         5508
eggs                                           4262
soy sauce                                      4120
kosher salt                                    3930
green onions                                   3817
tomatoes                                       3812
large eggs                                     3700
carrots                                        3542
unsalted but

The tail includes values that are mixed case (not sure that it matters) and, more importantly, bad data (e.g. `and carrot green pea`). In truth, we don't need to be concerned with ingredients that have a count of 1, as only one dish (training or test) includes each of them.

However, are we losing information on an ingredient such as `cooked cut green beans`? That's just one example, but it suggests that we probably need to clean the ingredients.

To do that, we need to find ingredients that might be mis-labeled due to a lack of data validation.
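
One low-risk first pass is to normalize case and whitespace and then merge the counts of names that collapse to the same string. This is only a sketch under that assumption; `normalize_name`, `merge_counts`, and the toy counts are hypothetical and do not come from the notebook.

```python
import re
from collections import Counter


def normalize_name(name):
    """Lowercase, trim, and collapse repeated whitespace in a name."""
    return re.sub(r"\s+", " ", name.strip().lower())


def merge_counts(raw_counts):
    """Sum the counts of all names that normalize to the same string."""
    merged = Counter()
    for name, count in raw_counts.items():
        merged[normalize_name(name)] += count
    return merged


# Toy counts for illustration only
toy_counts = {"Olive Oil": 2, "olive  oil": 3, "salt": 5}
merged = merge_counts(toy_counts)
```

Comparing the number of distinct names before and after such a pass would show how much of the long tail comes from formatting differences alone.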

In [106]:
## put the counts into a sortable dataframe
## unique names with the counts
x = ingredients_sa.value_counts()

In [123]:
## get the indexes
x_index = x.index.values

## get the values
x_values = x.values

## make the dataframe
ing_df = pd.DataFrame({'ingredient':x_index, 'counts':x_values})

## cleanup
del x

In [124]:
ing_df.shape

(7137, 2)

In [126]:
ing_df.head()

Unnamed: 0,counts,ingredient
0,22533,salt
1,10008,onions
2,9888,olive oil
3,9293,water
4,9171,garlic


In [127]:
ing_df.tail()

Unnamed: 0,counts,ingredient
7132,1,McCormick Poppy Seed
7133,1,mesquite flavored seasoning mix
7134,1,raspberry sherbet
7135,1,King Arthur Gluten Free MultiPurpose Flour
7136,1,farfalline


In [None]:
## keep only the top "N" ingredients
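
A possible way to fill in this cell, sketched with the standard library only: count names in a flat list, keep the `n` most frequent, and filter each dish dictionary against that set. The helper names and toy data are hypothetical.

```python
from collections import Counter


def top_n_ingredients(ingredient_names, n):
    """Return the set of the n most frequent names in a flat list."""
    counts = Counter(ingredient_names)
    return set(name for name, _ in counts.most_common(n))


def filter_dish(dish_dict, keep):
    """Keep only the ingredients of one dish that appear in keep."""
    return dict((k, v) for k, v in dish_dict.items() if k in keep)


# Toy data for illustration only
toy_names = ["salt", "salt", "salt", "water", "water", "saffron"]
keep = top_n_ingredients(toy_names, 2)
filtered = filter_dish({"salt": 1, "saffron": 1}, keep)
```

Note that ties at the cutoff are broken arbitrarily by `most_common`, so a count threshold may be a more stable rule than a fixed `n`.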


In [None]:
## calc ingredient similarity by names
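
One option for name similarity is the standard library's `difflib`, which scores pairs of strings by the share of matching characters. The sketch below assumes that approach; the `similar_names` helper and the 0.85 cutoff are illustrative choices, not values taken from the notebook.

```python
import difflib


def similar_names(name, candidates, cutoff=0.85, limit=5):
    """Return up to `limit` candidates that closely match `name`.

    The name itself is excluded so that only near-duplicates are returned.
    """
    others = [c for c in candidates if c != name]
    return difflib.get_close_matches(name, others, n=limit, cutoff=cutoff)


# "mayonaise" appears in the training data shown earlier; the candidate
# list is a toy example
matches = similar_names("mayonaise", ["mayonnaise", "salt", "mayonaise"])
```

Comparing every name against every other name is quadratic, so with roughly 7,000 distinct ingredients it may be worth restricting the comparison to the rarer names first.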

In [None]:
## are there outliers where the names are too close?

In [None]:
## remove ingredient text that represent amounts, brands, other things, etc.  
## try to standardize the ingredients like those that are most similar
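
A rough sketch of stripping amounts and brand names with regular expressions follows. The brand tuple is a hand-picked assumption based on two names seen in the tail of the counts, the unit list is deliberately small, and the `(10 oz.)` example string is hypothetical. A real cleanup would need a much longer list of both.

```python
import re

# Illustrative brand list only; extend after inspecting the data
BRANDS = ("mccormick", "king arthur")

# Matches amounts such as "10 oz", "(10 oz.)" or "1.5 lb"
AMOUNT_PATTERN = re.compile(
    r"\(?\b\d+(\.\d+)?\s*(oz|ounce|ounces|lb|lbs|g|kg)\b\.?\)?")


def strip_noise(name):
    """Lowercase a name and remove amounts and known brand names."""
    cleaned = AMOUNT_PATTERN.sub(" ", name.lower())
    for brand in BRANDS:
        cleaned = re.sub(r"\b" + re.escape(brand) + r"\b", " ", cleaned)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", cleaned).strip()


cleaned_brand = strip_noise("McCormick Poppy Seed")
cleaned_amount = strip_noise("(10 oz.) frozen chopped spinach")
```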

In [None]:
## question:  how to replace the old values (keys) with the cleaned values (keys) inline?
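
One possible answer, sketched in plain Python: build a mapping from old names to cleaned names, then rebuild each dish dictionary through it. The `remap_keys` helper and the example mapping are hypothetical. When two old names collapse to the same cleaned name, the sketch keeps the larger value so the 0/1 indicator used above is preserved.

```python
def remap_keys(dish_dict, mapping):
    """Return a new dict whose keys are replaced via `mapping`.

    Keys missing from `mapping` are kept unchanged. If several keys map
    to the same cleaned name, the maximum value is kept.
    """
    remapped = {}
    for name, count in dish_dict.items():
        new_name = mapping.get(name, name)
        remapped[new_name] = max(remapped.get(new_name, 0), count)
    return remapped


# Toy dish and mapping for illustration only
toy_mapping = {"mayonaise": "mayonnaise"}
toy_dish = {"mayonaise": 1, "mayonnaise": 1, "salt": 1}
remapped_dish = remap_keys(toy_dish, toy_mapping)
```

Applied to the SFrames, something like `train_sf['ingredients'].apply(lambda d: remap_keys(d, mapping))` might work, though that call is an untested assumption here.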