### Problem solving exercise

+ Because I love burritos, we're analyzing a Chipotle dataset!

+ The goal is to string together the skills we've worked on over the last few months

#### RULES

+ I will not be giving out answers! (OK, maybe some hints if you get *really* stuck)
+ You will solve this together as a class
+ When someone figures something out, they can come to the board and present their solution to the class
    + Alternative solutions can also be presented
    
##### REWARDS:
+ There will be a happy hour after this class
+ Students with the most accurate models will be eligible to vote on the bar we go to!
    + Any model with accuracy within 5% of the most accurate model qualifies
    + So if the best model is 82.2%, we'll also select anyone with accuracy greater than 78.1%
+ The student who presents the most solutions to the class will get a free drink! (or an alternative if you don't drink)
    + No ties! Only one student can win this!


### Outline:

#### Cleaning Data

+ We only briefly covered cleaning data
+ You'll need to rely more on Google and logic than class notes here
+ Cleaning data is something you just need to learn by doing
+ After cleaning, we'll run a machine learning algorithm to predict the price of an order

#### Preprocessing & ML 

+ We've covered this in class, but this time you're really driving the ship
+ Get your data into the right format, then start training your algorithm! 



### That's it! GO FOR IT! 
+ I believe in all of you!

### First import your dataset

+ Hint: examine how the values are separated
+ What's the difference between a TSV and a CSV?
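
To see what the separator changes, here is a small sketch using only the standard library's `csv` module. The two-line sample is made up for illustration and is not taken from `chipotle.tsv`.

```python
import csv
import io

# Made-up sample: tab-separated, with a comma inside one field.
sample = "order_id\titem_name\n1\tChips, Salsa\n"

# Parsed as TSV: the tab splits the columns and the comma stays inside the field.
tsv_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# Parsed as CSV: the comma is treated as the separator and the tab stays inside the field.
csv_rows = list(csv.reader(io.StringIO(sample), delimiter=","))

print(tsv_rows)
print(csv_rows)
```

The same text gives two clean columns with the tab delimiter and a mangled split with the comma delimiter, which is why the separator matters when you load the file.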


In [4]:
### Code here

import pandas as pd

df = pd.read_csv("chipotle.tsv", sep="\t")


In [5]:
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [13]:
df = pd.read_table("chipotle.tsv")
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


### Next, clean up "choice description"

+ What do the values look like? 
+ We're going to plug this into a count vectorizer later
+ How can we clean this up?

+ Check out a package called "re" for regular expressions
+ There are multiple ways to solve this problem
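
As a warm-up with `re`, here is a minimal sketch of two possible approaches. The string `raw` is a made-up value shaped like the bracketed descriptions, not an actual row from the dataset.

```python
import re

# Made-up value shaped like the nested-bracket strings in choice_description.
raw = "[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice]]"

# Option 1: remove only the square brackets and keep everything else.
no_brackets = re.sub(r"[\[\]]", "", raw)

# Option 2: pull out word-like tokens (letters, digits, underscores, apostrophes).
tokens = re.findall(r"[\w']+", raw)

print(no_brackets)
print(tokens)
```

Option 1 keeps a single readable string, while option 2 gives a list of tokens. Which one is better depends on what you plan to feed the vectorizer.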

### Niyi's solution

In [15]:
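### code here
# Missing descriptions load as float NaN; convert them to the string "nan"
# so every value supports string methods like .replace().
holder = []
for x in df["choice_description"]:
    if isinstance(x, float):
        holder.append(str(x))
    else:
        holder.append(x)
df["choice_description"] = holder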

In [17]:
for x in df["choice_description"][:20]:
    print(type(x))

<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>
<type 'str'>


In [20]:
df["choice_description"] = df["choice_description"].apply(lambda x: x.replace("[", "").replace("]",""))

In [21]:
df["choice_description"].head()

0                                                  nan
1                                           Clementine
2                                                Apple
3                                                  nan
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

### Reid's solution

In [23]:
df = pd.read_csv("chipotle.tsv", sep="\t")

In [36]:
import re

df['choice_description'] = df['choice_description'].fillna("[none]").apply(lambda x: re.findall(r"[\w']+", x))

In [37]:
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,[none],$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,[none],$2.39
4,2,2,Chicken Bowl,"[Tomatillo, Red, Chili, Salsa, Hot, Black, Bea...",$16.98


### Next, clean up "item price" 

+ What can you do here? 
+ This will be our outcome variable
+ How can we make this easier to read? 
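
One way to think about it: the prices are strings with a currency symbol, and a model needs numbers. Here is a plain-Python sketch of the conversion. The helper name `parse_price` is made up for illustration, and the sample strings follow the `$2.39` format shown in the table above.

```python
def parse_price(text):
    """Convert a string like '$16.98' to the float 16.98."""
    # Trim surrounding whitespace, then drop the leading dollar sign.
    return float(text.strip().lstrip("$"))

prices = [parse_price(p) for p in ["$2.39", "$3.39", "$16.98 "]]
print(prices)
```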

In [46]:
### code here

# Strip the leading "$" and convert to float; lstrip sidesteps any regex interpretation of "$".
df['item_price'] = df['item_price'].str.lstrip("$").astype(float)

In [None]:
### code here

### Now Preprocess your data! 

+ Use a vectorizer of your choice!

+ Consider a dimension reduction technique! 

    + PCA? SVD? LDA?
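
If you're unsure what a count vectorizer produces, the sketch below builds the same kind of matrix by hand: one row per text, one column per vocabulary term. It uses only the standard library and is an illustration of the idea, not scikit-learn's implementation. It mirrors two of `CountVectorizer`'s defaults (lowercasing, and tokens of two or more word characters). The three item names are sample values.

```python
import re
from collections import Counter

docs = ["Chicken Bowl", "Chips and Fresh Tomato Salsa", "Chicken Burrito"]

# Lowercase each text and keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# Shared, alphabetically sorted vocabulary across all texts.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per text, one column per vocabulary term, each cell a count.
matrix = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Most cells are zero, which is why the real vectorizers return sparse matrices and why dimension reduction is worth considering.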

In [74]:
## code here

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [75]:
tf = CountVectorizer(max_df=0.95)

In [76]:
tf_items = tf.fit_transform(df['item_name'])

In [77]:
tf_feature_names = tf.get_feature_names()
print tf_feature_names

[u'and', u'barbacoa', u'bottled', u'bowl', u'burrito', u'canned', u'carnitas', u'chicken', u'chili', u'chips', u'corn', u'crispy', u'drink', u'fresh', u'green', u'guacamole', u'izze', u'mild', u'nantucket', u'nectar', u'of', u'pack', u'red', u'roasted', u'salad', u'salsa', u'side', u'soda', u'soft', u'steak', u'tacos', u'tomatillo', u'tomato', u'veggie', u'water']


In [83]:
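# choice_description already holds lists of tokens (from Reid's solution),
# so the tokenizer passes each list through unchanged.
cv_vectorizer = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False, 
                                max_df=0.95, min_df=2,
                                max_features=10000,
                                stop_words='english', ngram_range=(1, 4))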

tf_choice_description = cv_vectorizer.fit_transform(df['choice_description'])

In [79]:
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,[none],2.39
1,1,1,Izze,[Clementine],3.39
2,1,1,Nantucket Nectar,[Apple],3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,[none],2.39
4,2,2,Chicken Bowl,"[Tomatillo, Red, Chili, Salsa, Hot, Black, Bea...",16.98


### Neel's merging code

In [80]:
tf_items.A

array([[1, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [81]:
tf_items1 = pd.DataFrame(tf_items.A, columns=tf.get_feature_names())

In [82]:
tf_items1.head()

Unnamed: 0,and,barbacoa,bottled,bowl,burrito,canned,carnitas,chicken,chili,chips,...,salsa,side,soda,soft,steak,tacos,tomatillo,tomato,veggie,water
0,1,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,1,1,...,1,0,0,0,0,0,1,0,0,0
4,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
tf_choice_description.A

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [84]:
tf_items2 = pd.DataFrame(tf_choice_description.A, columns=cv_vectorizer.get_feature_names())

In [87]:
tf_items2.head()

Unnamed: 0,Adobo,Adobo Marinated,Adobo Marinated Grilled,Adobo Marinated Grilled Chicken,Adobo Marinated Grilled Steak,Apple,Banana,Barbacoa,Beans,Beans Black,...,Veggies Rice,Veggies Sour,Veggies Sour Cream,Veggies Sour Cream Cheese,Veggies Sour Cream Guacamole,Veggies Sour Cream Lettuce,White,White Rice,White Rice Adobo,White Rice Adobo Marinated
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [95]:
X = pd.concat([df[['order_id', 'quantity']],tf_items1,tf_items2], axis=1)


In [97]:
print X.shape, df[['order_id', 'quantity']].shape, tf_items1.shape,tf_items2.shape

(4622, 991) (4622, 2) (4622, 35) (4622, 954)


### Niyi's dimension reduction 

In [100]:
X.head()

Unnamed: 0,order_id,quantity,and,barbacoa,bottled,bowl,burrito,canned,carnitas,chicken,...,Veggies Rice,Veggies Sour,Veggies Sour Cream,Veggies Sour Cream Cheese,Veggies Sour Cream Guacamole,Veggies Sour Cream Lettuce,White,White Rice,White Rice Adobo,White Rice Adobo Marinated
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,2,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [101]:
from sklearn import decomposition

pca = decomposition.PCA(n_components = 5)
X_pca = pca.fit_transform(X)

print(pca.explained_variance_ratio_)

[  9.99928595e-01   1.22881774e-05   6.45929676e-06   4.47660165e-06
   4.22823826e-06]


In [102]:
#lda
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [103]:
lda = LatentDirichletAllocation(n_topics = 5, max_iter = 20,
                               learning_method = 'online',
                               learning_offset = 50.,
                               random_state = 0)

In [104]:
lda.fit(X)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=20, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [110]:
y = df['item_price'].values

In [121]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Now train your model! 
+ What model you select is up to you
+ Check out the sklearn documentation!

In [122]:
from sklearn.linear_model import LinearRegression, Ridge

rid = Ridge(normalize=True)

rid.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [123]:
### code here

### Now test your model!

In [126]:
rid2 = Ridge(normalize=True)

In [129]:
from sklearn.model_selection import cross_val_score

# Score the fresh, unfitted estimator defined above with 10-fold cross-validation
scores2 = cross_val_score(rid2, X_train, y_train, scoring='r2', cv=10)

scores2.mean()

0.83324251623052015
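
For reference, `scoring='r2'` reports the coefficient of determination: 1 minus the ratio of the residual sum of squares to the total sum of squares. Here is a hand-rolled sketch of that formula in plain Python. The function name `r2_manual` and the toy numbers are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0, 8.0]

# Perfect predictions leave no residual error.
perfect = r2_manual(y_true, y_true)

# Always predicting the mean (5.0) explains none of the variance.
baseline = r2_manual(y_true, [5.0] * 4)

print(perfect, baseline)
```

So a score near 1 means the predictions track the prices closely, and a score near 0 means the model does no better than always guessing the average price.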

In [None]:
### code here

In [130]:
rid.predict(X_test)

array([  4.72167898,  10.43236248,   9.1241642 , ...,  10.75946792,
        10.72301279,   8.67896927])