Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets 
- Explore tabular data for supervised machine learning
- Join relational data for supervised machine learning

# Explore tabular data for supervised machine learning 🍌

Wrangling your dataset is often the most challenging and time-consuming part of the modeling process.

In today's lesson, we’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!

Let’s get set up:

In [1]:
# Download data
import requests

def download(url):
    filename = url.split('/')[-1]
    print(f'Downloading {url}')
    r = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(r.content)
    print(f'Downloaded {filename}')

download('https://lambdaschool-ds.s3.us-east-2.amazonaws.com/datasets%3Ainstacart.tar.gz')

Downloading https://lambdaschool-ds.s3.us-east-2.amazonaws.com/datasets%3Ainstacart.tar.gz
Downloaded datasets%3Ainstacart.tar.gz


In [2]:
# Uncompress data
import tarfile
tarfile.open('datasets%3Ainstacart.tar.gz').extractall()

In [3]:
# Change directory to where the data was uncompressed
%cd instacart_2017_05_01

/Users/nicholascifuentes-goodbody/Documents/GitHub/DS-Unit-2-Applied-Modeling/module2-wrangle-ml-datasets/instacart_2017_05_01


In [4]:
# Print the csv filenames
from glob import glob
for filename in glob('*.csv'):
    print(filename)

products.csv
orders.csv
order_products__train.csv
departments.csv
aisles.csv
order_products__prior.csv


**Before you start,** load each of the above `.csv` files into its own DataFrame.

In [38]:
import pandas as pd

orders = pd.read_csv('orders.csv')
orders_products_train = pd.read_csv('order_products__train.csv')
departments = pd.read_csv('departments.csv')
products = pd.read_csv('products.csv')
aisles = pd.read_csv('aisles.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')

# Warm-up Questions

What information is contained in the column `orders['eval_set']`?

In [39]:
orders['eval_set'].value_counts()

prior    3214874
train     131209
test       75000
Name: eval_set, dtype: int64

In [40]:
orders.head()

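| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |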


The first row of `orders['order_id']` is `2539329`. Where can we find the items that were included in that order?

In [43]:
# Order 2539329 belongs to the 'prior' eval_set, so its items are in order_products_prior
order_products_prior[order_products_prior['order_id'] == 2539329]

In [44]:
products.head()

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


The first row of `order_products_prior['product_id']` is `33120`. What is the name of that product?

In [46]:
#order_products_prior['product_id'].head()

In [48]:
products[products['product_id']==33120]

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 33119 | 33120 | Organic Egg Whites | 86 | 16 |
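
The two lookups above (find an order's product ids, then find each product's name) can also be done in a single join. The following is a minimal sketch on toy data, not the lesson's own solution: the frames `order_lines` and `product_names` and the order ids `101` and `102` are made up for illustration, while the product ids and names are taken from the outputs shown in this lesson.

```python
import pandas as pd

# Toy stand-ins for the order_products and products tables.
# Product ids and names come from the lesson; the order ids are hypothetical.
order_lines = pd.DataFrame({
    'order_id': [101, 101, 102],
    'product_id': [33120, 24852, 1],
})
product_names = pd.DataFrame({
    'product_id': [1, 24852, 33120],
    'product_name': ['Chocolate Sandwich Cookies', 'Banana', 'Organic Egg Whites'],
})

# A left join keeps every order line and attaches the matching product name.
order_lines_named = order_lines.merge(product_names, on='product_id', how='left')

# Names of the items in one (hypothetical) order.
items_in_101 = order_lines_named.loc[
    order_lines_named['order_id'] == 101, 'product_name'
].tolist()
print(items_in_101)
```

On the real data the same pattern would presumably be `order_products_prior.merge(products, on='product_id', how='left')`, though the full prior table is large, so filtering to the orders of interest before merging may be preferable.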


# Define Our ML Problem

- We want to predict whether or not a customer will purchase a specific item (of our choosing).
- The most frequently ordered item is `24852`: `'Banana'`.
- Our model is going to predict whether or not an order will include `'Banana'`.

In [49]:
order_products_prior['product_id'].value_counts().head()

24852    472565
13176    379450
21137    264683
21903    241921
47209    213584
Name: product_id, dtype: int64

In [50]:
products[products['product_id']==24852]

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 24851 | 24852 | Banana | 24 | 4 |
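
The two cells above find the most frequent product id and then look up its name by hand. As a hedged sketch, the same steps can be chained so the id never has to be typed in. The toy rows below are invented for illustration; only the product ids and names are taken from this lesson.

```python
import pandas as pd

# Toy order lines: product ids are from the lesson, the rows themselves are made up.
order_lines = pd.DataFrame({'product_id': [24852, 2, 24852, 1, 24852, 2]})
product_names = pd.DataFrame({
    'product_id': [1, 2, 24852],
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt', 'Banana'],
})

# value_counts() counts the rows per product id; idxmax() returns the id
# with the highest count.
top_id = order_lines['product_id'].value_counts().idxmax()

# Look up the name for that id by indexing the products table on product_id.
top_name = product_names.set_index('product_id').loc[top_id, 'product_name']
print(top_id, top_name)
```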


# Create Feature Matrix and Target Vector

In [51]:
orders.head()

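| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |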


In [52]:
X_train = orders[orders['eval_set']=='train']

In [53]:
X_train.head()

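| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 10 | 1187899 | 1 | train | 11 | 4 | 8 | 14.0 |
| 25 | 1492625 | 2 | train | 15 | 1 | 11 | 30.0 |
| 49 | 2196797 | 5 | train | 5 | 0 | 11 | 6.0 |
| 74 | 525192 | 7 | train | 21 | 2 | 11 | 6.0 |
| 78 | 880375 | 8 | train | 4 | 1 | 14 | 10.0 |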


In [54]:
orders_products_train.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |


In [55]:
orders_products_train['is_banana'] = orders_products_train['product_id'] == 24852

In [56]:
banana_orders = orders_products_train[orders_products_train['is_banana']]['order_id']

In [57]:
banana_orders.head()

115     226
156     473
196     878
272    1042
297    1139
Name: order_id, dtype: int64

In [58]:
y_train = X_train['order_id'].isin(banana_orders).astype(int)
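
The target is built by asking, for each training order, whether its `order_id` appears among the orders that contain the chosen product. Here is a small sketch of that `isin` pattern on made-up data; the frames `orders_toy` and `order_lines` and all order ids are hypothetical.

```python
import pandas as pd

# Hypothetical orders and their line items (product 24852 is 'Banana' in the lesson).
orders_toy = pd.DataFrame({'order_id': [10, 11, 12, 13]})
order_lines = pd.DataFrame({
    'order_id': [10, 10, 11, 13, 13],
    'product_id': [24852, 1, 2, 24852, 2],
})

# Order ids that include the target product at least once.
target_orders = order_lines.loc[order_lines['product_id'] == 24852, 'order_id']

# One label per order: 1 if the order contains the product, else 0.
# Orders with no line items at all (order 12 here) are labeled 0.
y = orders_toy['order_id'].isin(target_orders).astype(int)
print(y.tolist())
```

Because `isin` returns a Series with the same index as `orders_toy`, the labels stay aligned row by row with the feature matrix they were derived from.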

In [59]:
X_train.head()

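| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 10 | 1187899 | 1 | train | 11 | 4 | 8 | 14.0 |
| 25 | 1492625 | 2 | train | 15 | 1 | 11 | 30.0 |
| 49 | 2196797 | 5 | train | 5 | 0 | 11 | 6.0 |
| 74 | 525192 | 7 | train | 21 | 2 | 11 | 6.0 |
| 78 | 880375 | 8 | train | 4 | 1 | 14 | 10.0 |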


In [60]:
y_train.head()

10    0
25    1
49    0
74    0
78    0
Name: order_id, dtype: int64

In [61]:
# Reassign instead of using inplace=True: X_train was created by filtering `orders`,
# and an in-place drop on that filtered frame raises pandas' SettingWithCopyWarning
X_train = X_train.drop(columns=['eval_set', 'order_id', 'user_id', 'order_number'])
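
Filtering a DataFrame and then dropping columns with `inplace=True` is a common trigger for pandas' `SettingWithCopyWarning`. The sketch below shows one way to sidestep it, assuming the goal is an independent feature frame: take an explicit `.copy()` of the filtered rows and reassign the result of `drop`. The frame `orders_toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the orders table.
orders_toy = pd.DataFrame({
    'order_id': [1, 2, 3],
    'eval_set': ['prior', 'train', 'train'],
    'order_dow': [2, 4, 1],
})

# .copy() makes the filtered frame independent of orders_toy.
X = orders_toy[orders_toy['eval_set'] == 'train'].copy()

# drop() without inplace returns a new frame; the original table is untouched.
X = X.drop(columns=['eval_set', 'order_id'])
print(X)
```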


# Establish Baseline

In [65]:
print('Baseline Accuracy Score:', y_train.value_counts(normalize=True).max())

Baseline Accuracy Score: 0.85728113162969
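
The baseline is the accuracy of always guessing the most common class, which is why the largest value of the normalized class counts gives it directly. A toy illustration, using a made-up label vector `y`:

```python
import pandas as pd

# Hypothetical labels: 6 orders without the product, 2 with it.
y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Share of each class; the largest share is the majority-class baseline.
class_shares = y.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

# Sanity check: guess the majority class for every row and score that guess.
always_majority = pd.Series(majority_class, index=y.index)
accuracy_of_guess = (always_majority == y).mean()

print(majority_class, baseline_accuracy, accuracy_of_guess)
```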


# Build Model

In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [69]:
model = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

model.fit(X_train, y_train);

# Check Metrics

In [70]:
print('Training Accuracy Score:', model.score(X_train, y_train))

Training Accuracy Score: 0.85728113162969
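
The training accuracy printed above matches the majority-class baseline to every digit, which suggests (but does not by itself prove) that the model may be predicting the majority class for every order. One way to check is to look at which classes actually appear in the predictions. The helper `summarize_predictions` below is a hypothetical name introduced for this sketch, and the labels and predictions are made up.

```python
import pandas as pd

def summarize_predictions(y_true, y_pred):
    """Compare accuracy to the majority-class baseline and list predicted classes."""
    # Reset indexes so the comparison is positional, not label-aligned.
    y_true = pd.Series(y_true).reset_index(drop=True)
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    return {
        'accuracy': float((y_true == y_pred).mean()),
        'baseline': float(y_true.value_counts(normalize=True).max()),
        'predicted_classes': sorted(y_pred.unique().tolist()),
    }

# Hypothetical labels, and predictions that never leave the majority class.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0] * 8

summary = summarize_predictions(y_true, y_pred)
print(summary)
```

With the lesson's objects, a call such as `summarize_predictions(y_train, model.predict(X_train))` would show whether only one class is ever predicted. If so, the three time-of-order features alone are probably not informative enough, and features built from each user's order history are a natural next step.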


In [71]:
X_train.head()

| | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|
| 10 | 4 | 8 | 14.0 |
| 25 | 1 | 11 | 30.0 |
| 49 | 0 | 11 | 6.0 |
| 74 | 2 | 11 | 6.0 |
| 78 | 1 | 14 | 10.0 |
