### Problem solving exercise

+ Because I love burritos, we're analyzing a Chipotle dataset! 

+ The goal is to string together the skills we've worked on over the last few months

#### RULES

+ I will not be giving out answers! (ok maybe some hints if you get *really* stuck) 
+ You will solve this together as a class
+ When someone figures something out, they can come to the board and present their solution to the class
    + Alternative solutions can also be presented
    
##### REWARDS:
+ There will be a happy hour after this class
+ Students with the most accurate models will be eligible to vote on the bar we go to!
    + Any models with accuracy within 5% of the most accurate model  
    + So if the best model is 82.2%, we'll also select anyone with accuracy greater than 78.1% (82.2 × 0.95)
+ The student who presents the most solutions to the class will get a free drink! (or an alternative if you don't drink) 
    + No ties! Only one student can win this!


### Outline:

#### Cleaning Data

+ We only briefly covered cleaning data
+ You'll need to rely more on Google and logic than class notes here
+ Cleaning data is something you just need to learn by doing
+ After cleaning, we'll run a machine learning algorithm to predict the price of an order

#### Preprocessing & ML 

+ We've covered this in class, but this time you're really driving the ship
+ Get your data into the right format, then start training your algorithm! 



### That's it! GO FOR IT! 
+ I believe in all of you!

### First import your dataset

+ Hint: examine how the values are separated 
+ What's the difference between a TSV and a CSV?


In [1]:
### Code here
import os
import pandas as pd
import matplotlib as mpl
import numpy as np

df = pd.read_csv('/Users/jamiew/GA-DataScience/GA-MyDSRepo/lesson-18/chipotle.tsv', sep='\t')
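
As a minimal sketch of the hint above: a TSV separates fields with tabs instead of commas, so `pd.read_csv` has to be told `sep='\t'`. The two-row sample below is made up for illustration and is not the real `chipotle.tsv`.

```python
import io

import pandas as pd

# Tiny made-up sample in a tab-separated layout (not the real chipotle.tsv).
sample_tsv = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\t$2.39\n"
    "2\t2\tChicken Bowl\t$16.98\n"
)

# With the default comma separator, every line lands in a single wide column.
wrong = pd.read_csv(io.StringIO(sample_tsv))

# Telling pandas the separator is a tab splits the fields correctly.
right = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(wrong.shape)
print(right.shape)
```

Comparing the two shapes is a quick way to spot a wrong separator: one column usually means the delimiter was not recognized.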

### Next, clean up "choice description"

+ What do the values look like? 
+ We're going to plug this into CountVectorizer later
+ How can we clean this up?

+ Check out a package called "re" for regular expressions
+ There are multiple ways to solve this problem

**Mine**

In [2]:
### code here
df.choice_description.unique()

array([nan, '[Clementine]', '[Apple]', ...,
       '[Roasted Chili Corn Salsa, [Pinto Beans, Sour Cream, Cheese, Lettuce, Guacamole]]',
       '[Tomatillo Green Chili Salsa, [Rice, Black Beans]]',
       '[Tomatillo Green Chili Salsa, [Rice, Fajita Vegetables, Black Beans, Guacamole]]'], dtype=object)

**Reid's Solution**

In [65]:
import re
df['choice_description'] = df['choice_description'].fillna("[none]").apply(lambda x: re.findall(r"[\w']+",x))

In [66]:
re.findall

<function re.findall>
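
To see what the pattern in Reid's solution does, here is a small sketch on a single string. The string is a shortened, made-up example in the same bracketed format as the column, not an actual row.

```python
import re

# Shortened, made-up value in the same nested-bracket format as choice_description.
raw = "[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice]]"

# [\w']+ matches runs of letters, digits, underscores, and apostrophes,
# so brackets, commas, hyphens, spaces, and parentheses all act as separators.
tokens = re.findall(r"[\w']+", raw)
print(tokens)

# The "[none]" placeholder used for missing values becomes a one-token list.
print(re.findall(r"[\w']+", "[none]"))
```

Note that this splits multi-word ingredients such as "Black Beans" into separate tokens, which may or may not be what you want for the vectorizer.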

**Niyi**

In [67]:
# NaN is stored as a float, so missing values are converted to the string "nan"
holder = []
for x in df["choice_description"]:
    if isinstance(x, float):
        holder.append(str(x))
    else:
        holder.append(x)
df["choice_description"] = holder


In [68]:
df['choice_description'].apply(lambda x: x.replace("[", "").replace("]", ""))

AttributeError: 'list' object has no attribute 'replace'
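
The error above suggests the column already holds lists at this point (for example after a `re.findall` step), and lists have no `.replace` method. Below is a minimal sketch of one way to handle both cases; `strip_brackets` is a hypothetical helper name, not part of pandas.

```python
def strip_brackets(value):
    """Return a bracket-free string whether value is a token list or a raw string."""
    if isinstance(value, list):
        # Already tokenized: join the tokens back into one string.
        return " ".join(value)
    # Raw string such as "[Rice, Black Beans]": drop the bracket characters.
    return str(value).replace("[", "").replace("]", "")


print(strip_brackets(["Rice", "Black", "Beans"]))
print(strip_brackets("[Salsa, [Rice, Black Beans]]"))
```

Assuming the column is named as in this notebook, it could be applied with `df['choice_description'].apply(strip_brackets)`.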

In [69]:
df["choice_description"].head()

0                                               [none]
1                                         [Clementine]
2                                              [Apple]
3                                               [none]
4    [Tomatillo, Red, Chili, Salsa, Hot, Black, Bea...
Name: choice_description, dtype: object

### Next, clean up "item price" 

+ What can you do here? 
+ This will be our outcome variable
+ How can we make this easier to read? 

In [4]:
### code here
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [6]:
df['item_price'] = df['item_price'].str.replace('$', '')

In [7]:
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,2.39
1,1,1,Izze,[Clementine],3.39
2,1,1,Nantucket Nectar,[Apple],3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98
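
After stripping the dollar sign the values are still strings, and a regression target needs numbers. The sketch below shows the conversion on a few made-up prices. It assumes a pandas version whose `str.replace` accepts the `regex` keyword.

```python
import pandas as pd

# Made-up prices in the same "$x.xx" format as the item_price column.
prices = pd.Series(["$2.39", "$3.39", "$16.98"])

# regex=False treats "$" as a literal character instead of the regex end-of-string anchor.
numeric = prices.str.replace("$", "", regex=False).astype(float)

print(numeric.dtype)
print(numeric.tolist())
```

Once the column is numeric, operations such as `.mean()` or passing it to a model as `y` work as expected.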


In [39]:
### code here


In [19]:
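# .copy() gives df2 its own data, so the column assignments below don't trigger SettingWithCopyWarning
df2 = df.dropna().copy()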

In [20]:
df2['choice_description'] = df2['choice_description'].str.replace('[', '')
df2['choice_description'] = df2['choice_description'].str.replace(']', '')

In [21]:
df2.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
1,1,1,Izze,Clementine,3.39
2,1,1,Nantucket Nectar,Apple,3.39
4,2,2,Chicken Bowl,"Tomatillo-Red Chili Salsa (Hot), Black Beans, ...",16.98
5,3,1,Chicken Bowl,"Fresh Tomato Salsa (Mild), Rice, Cheese, Sour ...",10.98
7,4,1,Steak Burrito,"Tomatillo Red Chili Salsa, Fajita Vegetables, ...",11.75


### Now Preprocess your data! 

+ Use a vectorizer of your choice!

+ Consider a dimension reduction technique! 

    + PCA? SVD? LDA?

In [57]:
## code here
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [58]:
tf = CountVectorizer(max_df=0.95)

In [59]:
tf_items = tf.fit_transform(df['choice_description'])

In [60]:
tf_feature_names = tf.get_feature_names()
print(tf_feature_names)

[u'adobo', u'and', u'apple', u'banana', u'barbacoa', u'beans', u'black', u'blackberry', u'braised', u'brown', u'carnitas', u'cheese', u'cherry', u'chicken', u'chili', u'cilantro', u'clementine', u'coca', u'coke', u'cola', u'corn', u'cream', u'dew', u'diet', u'dr', u'fajita', u'fresh', u'grapefruit', u'green', u'grilled', u'guacamole', u'hot', u'lemonade', u'lettuce', u'lime', u'marinated', u'medium', u'mild', u'mountain', u'nestea', u'orange', u'peach', u'pepper', u'pineapple', u'pinto', u'pomegranate', u'red', u'rice', u'roasted', u'salsa', u'sour', u'sprite', u'steak', u'tomatillo', u'tomato', u'vegetables', u'vegetarian', u'veggies', u'white']


In [61]:
tf_choice_description = tf.fit_transform(df['item_name'])

In [62]:
tf_feature_names = tf.get_feature_names()
print(tf_feature_names)

[u'barbacoa', u'bowl', u'burrito', u'canned', u'carnitas', u'chicken', u'crispy', u'drink', u'izze', u'nantucket', u'nectar', u'pack', u'salad', u'soda', u'soft', u'steak', u'tacos', u'veggie']


In [63]:
### code here
#cvec = CountVectorizer()
#cvec.fit([df2["choice_description"]])

In [64]:
cv_vectorizer = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False,
                               max_df=0.95, min_df=2,
                               max_features=10000,
                               stop_words='english', ngram_range=(1, 4))

In [65]:
tf_choice_description = cv_vectorizer.fit_transform(df['choice_description'])
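
The `tokenizer=lambda doc: doc` trick above is for rows that are already lists of tokens. Here is a stripped-down sketch on made-up data, leaving out the `min_df`, `stop_words`, and `ngram_range` settings so the effect is easier to see.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up rows that are already lists of tokens.
token_lists = [["Rice", "Black", "Beans"], ["Rice", "Cheese"], ["none"]]

# The identity tokenizer passes each list through unchanged; lowercase=False is
# needed because a list has no .lower() method.
vec = CountVectorizer(tokenizer=lambda tokens: tokens, lowercase=False)
counts = vec.fit_transform(token_lists)

print(sorted(vec.vocabulary_))
print(counts.toarray())
```

Recent scikit-learn versions may emit a warning that `token_pattern` is unused when a custom tokenizer is supplied; the counts are unaffected.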

In [66]:
tf_items.A

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

^^ What is the point of these matrices? Brianna explained this to me: each row is one row of the DataFrame, each column is a token/feature, and each cell counts how often that token appears in that row.
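
To make the idea of a token-count matrix concrete, here is a hand-rolled sketch using only the standard library. It illustrates the concept on made-up data and is not how scikit-learn implements `CountVectorizer`.

```python
from collections import Counter

# Three made-up "documents", already split into tokens.
docs = [
    ["rice", "black", "beans"],
    ["rice", "rice", "cheese"],
    ["apple"],
]

# The vocabulary is every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc})

# One row per document, one column per vocabulary entry, each cell a count.
matrix = [[Counter(doc)[token] for token in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Most cells are zero because each document uses only a few of the tokens, which is why the real matrices are stored in sparse form.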

Next we want to put all the sparse matrices together

In [67]:
# Merge tf_items.A by converting it to a DataFrame

In [68]:
tf_items1 = pd.DataFrame(tf_items.A, columns=tf.get_feature_names())

ValueError: Shape of passed values is (59, 3376), indices imply (18, 3376)

In [69]:
tf_items1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,49,50,51,52,53,54,55,56,57,58
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,1,0,0,0,...,1,1,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,1,0,0,0,0
4,0,0,0,0,0,2,1,0,0,0,...,1,1,0,0,1,0,1,0,0,0


In [70]:
tf_choice_description.A

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [9, 1, 1, ..., 0, 0, 0],
       ..., 
       [8, 0, 0, ..., 0, 0, 0],
       [5, 0, 0, ..., 0, 0, 0],
       [7, 0, 0, ..., 0, 0, 0]])

In [71]:
tf_items2 = pd.DataFrame(tf_choice_description.A, columns=tf.get_feature_names())

ValueError: Shape of passed values is (851, 3376), indices imply (18, 3376)

In [72]:
X = pd.concat([df[['order_id', 'quantity']],tf_items1,tf_items2], axis=1)

NameError: name 'tf_items2' is not defined

In [74]:
print(X.shape, df.shape, tf_items1.shape, tf_items2.shape)  # the column counts should add up

NameError: name 'X' is not defined
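
The two ValueErrors above appear to come from reusing vectorizers. `tf` was refit on `item_name`, so `tf.get_feature_names()` returns 18 names for a 59-column matrix, and the 851-column matrix came from `cv_vectorizer`, not `tf`. The NameErrors then follow because `tf_items2` and `X` were never created. A minimal sketch of one way around this keeps a separate vectorizer per text column and stacks the results with `scipy.sparse.hstack`. The toy DataFrame and the names `name_vec` and `desc_vec` are made up for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the cleaned DataFrame.
toy = pd.DataFrame({
    "quantity": [1, 2, 1],
    "item_name": ["Chicken Bowl", "Steak Burrito", "Izze"],
    "choice_description": ["Rice Black Beans", "Rice Cheese", "Clementine"],
})

# One vectorizer per text column, so neither vocabulary overwrites the other.
name_vec = CountVectorizer()
desc_vec = CountVectorizer()
name_counts = name_vec.fit_transform(toy["item_name"])
desc_counts = desc_vec.fit_transform(toy["choice_description"])

# Convert the numeric column to sparse form, then stack everything side by side.
quantity = csr_matrix(toy[["quantity"]].to_numpy())
X = hstack([quantity, name_counts, desc_counts], format="csr")

# Column counts should add up: 1 + item-name vocabulary + description vocabulary.
print(X.shape, name_counts.shape, desc_counts.shape)
```

Staying in sparse form avoids building large dense DataFrames; most scikit-learn linear models accept the stacked matrix directly.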

Now talk about dimension reduction.

**Niyi's**

In [75]:
from sklearn import decomposition

pca = decomposition.PCA(n_components = 5) #look for 
X_pca = pca.fit_transform(X)

print(pca.explained_variance_ratio_)

NameError: name 'X' is not defined
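
Since `X` was never built, the PCA cell above could not run. The sketch below shows the same PCA calls on a made-up dense matrix, so the numbers say nothing about the Chipotle data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dense feature matrix: 50 rows, 6 columns, where the last two columns
# are scaled copies of the first two (so they carry no new information).
rng = np.random.RandomState(0)
base = rng.normal(size=(50, 4))
X_toy = np.hstack([base, base[:, :2] * 2.0])

# Keep the two directions that explain the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_toy)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

For a sparse count matrix, `TruncatedSVD` from the same module accepts sparse input directly and may be the more practical choice.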

### Now train your model! 
+ What model you select is up to you
+ check out sklearn documentation!

In [None]:
### code here
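
As one possible starting point, here is a sketch that fits a Ridge regression on a train/test split. The features and target are synthetic stand-ins for the real `X` and `item_price`, and Ridge is only one of many models you could pick.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic count-like features and a price-like target (not the Chipotle data).
rng = np.random.RandomState(42)
X_toy = rng.randint(0, 3, size=(200, 5)).astype(float)
true_weights = np.array([1.5, 0.5, 2.0, 0.0, 1.0])
y_toy = X_toy.dot(true_weights) + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# For regressors, score() reports R^2 on the given data.
print(model.score(X_test, y_test))
```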

### Now test your model!

In [None]:
from sklearn.linear_model import Ridge
rid2 = Ridge(normalize=True)

In [76]:
### code here
from sklearn.model_selection import cross_val_score  # K-fold splitting is built into cross_val_score
scores2 = cross_val_score(rid, X_test, y_test, scoring='r2', cv=10)

scores2.mean()

NameError: name 'rid' is not defined

In [None]:
### code here
rid.predict(X_test)
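
The NameError above appears to arise because the model was created as `rid2` in a cell that was never run, while the later cells refer to `rid`. A minimal sketch of the cross-validation and prediction steps with consistent naming, again on synthetic data rather than the real features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic features and target (not the Chipotle data).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy.dot(np.array([2.0, -1.0, 0.5, 0.0])) + rng.normal(scale=0.1, size=100)

rid = Ridge(alpha=1.0)

# cross_val_score clones and fits the model on each fold, so no prior fit() is needed.
scores = cross_val_score(rid, X_toy, y_toy, scoring="r2", cv=10)
print(scores.mean())

# predict() does need a fitted model.
rid.fit(X_toy, y_toy)
predictions = rid.predict(X_toy[:3])
print(predictions.shape)
```

Cross-validation is usually run on the training data, with the held-out test set reserved for a final check.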