Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets 🍌


In today's lesson, we’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!



### Setup

In [0]:
# Download data
import requests

def download(url):
    filename = url.split('/')[-1]
    print(f'Downloading {url}')
    r = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(r.content)
    print(f'Downloaded {filename}')

download('https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz')

Downloading https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Downloaded instacart_online_grocery_shopping_2017_05_01.tar.gz


In [0]:
# Uncompress data
import tarfile
tarfile.open('instacart_online_grocery_shopping_2017_05_01.tar.gz').extractall()

In [0]:
# Change directory to where the data was uncompressed
%cd instacart_2017_05_01

/content/instacart_2017_05_01


In [0]:
# Print the csv filenames
from glob import glob
for filename in glob('*.csv'):
    print(filename)

order_products__prior.csv
order_products__train.csv
aisles.csv
orders.csv
departments.csv
products.csv


### For each csv file, look at its shape & head 

In [0]:
import pandas as pd

from IPython.display import display

def preview():
    """Print each CSV file's name and shape, then display its first rows."""
    for filename in glob('*.csv'):
        df = pd.read_csv(filename)
        print('\n', filename, df.shape)
        display(df.head())

preview()


 order_products__prior.csv (32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0



 order_products__train.csv (1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1



 aisles.csv (134, 2)


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation



 orders.csv (3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0



 departments.csv (21, 2)


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol



 products.csv (49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [0]:
prior = pd.read_csv('order_products__prior.csv')

In [0]:
train = pd.read_csv('order_products__train.csv')

In [0]:
orders = pd.read_csv('orders.csv')

In [0]:
products = pd.read_csv('products.csv')

## The original task was complex ...

[The Kaggle competition said](https://www.kaggle.com/c/instacart-market-basket-analysis/data):

> The dataset for this competition is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order.

> orders.csv: This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders.

Each row in the submission is an order_id from the test set, followed by product_id(s) predicted to be reordered.

> sample_submission.csv: 
```
order_id,products
17,39276 29259
34,39276 29259
137,39276 29259
182,39276 29259
257,39276 29259
```
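
As a rough illustration of that submission format, here is a minimal sketch that collapses one-row-per-product predictions into one row per order. The `predictions` table and its values are made up for the example; they are not real model output.

```python
import pandas as pd

# Hypothetical predictions: one row per (order_id, product_id) pair
# that we predict will be reordered. The values are invented.
predictions = pd.DataFrame({
    'order_id':   [17, 17, 34, 137, 137],
    'product_id': [39276, 29259, 39276, 39276, 29259],
})

# Collapse to one row per order, joining the product ids with spaces
submission = (
    predictions
    .groupby('order_id')['product_id']
    .apply(lambda ids: ' '.join(str(i) for i in ids))
    .reset_index(name='products')
)

print(submission.to_csv(index=False))
```

Every order can end up with a different number of predicted products, which is part of what makes the original task complex.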

## ... but we can simplify!

Simplify the question, from "Which products will be reordered?" (Multi-class, [multi-label](https://en.wikipedia.org/wiki/Multi-label_classification) classification) to **"Will customers reorder this one product?"** (Binary classification)

Which product? How about **the most frequently ordered product?**

# Questions:

- What is the most frequently ordered product?
- How often is this product included in a customer's next order?
- Which customers have ordered this product before?
- How can we get a subset of data, just for these customers?
- What features can we engineer? We want to predict: will these customers reorder this product in their next order?

## What was the most frequently ordered product?

In [0]:
prior['product_id'].value_counts()

24852    472565
13176    379450
21137    264683
21903    241921
47209    213584
          ...  
11356         1
18001         1
6320          1
26268         1
30087         1
Name: product_id, Length: 49677, dtype: int64

In [0]:
train['product_id'].value_counts()

24852    18726
13176    15480
21137    10894
21903     9784
47626     8135
         ...  
44256        1
2764         1
4815         1
43736        1
46835        1
Name: product_id, Length: 39123, dtype: int64

In [0]:
products[products['product_id'] == 24852]

Unnamed: 0,product_id,product_name,aisle_id,department_id
24851,24852,Banana,24,4
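
The two lookups above (count product ids, then find the name) can also be chained. The following is a small sketch on made-up tables; `toy_prior` and `toy_products` are hypothetical stand-ins for `prior` and `products`.

```python
import pandas as pd

# Hypothetical stand-ins for the prior and products tables
toy_prior = pd.DataFrame({'product_id': [3, 1, 3, 2, 3, 1]})
toy_products = pd.DataFrame({
    'product_id':   [1, 2, 3],
    'product_name': ['Milk', 'Bread', 'Banana'],
})

# value_counts() sorts by frequency, so idxmax() returns the most common id
top_id = toy_prior['product_id'].value_counts().idxmax()

# Look up the name for that id
top_name = toy_products.set_index('product_id').loc[top_id, 'product_name']

print(top_id, top_name)
```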


In [0]:
prior_products = pd.merge(prior, products, on='product_id', how='inner')
prior_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,2,33120,1,1,Organic Egg Whites,86,16
1,26,33120,5,0,Organic Egg Whites,86,16
2,120,33120,13,0,Organic Egg Whites,86,16
3,327,33120,5,1,Organic Egg Whites,86,16
4,390,33120,28,1,Organic Egg Whites,86,16


In [0]:
# All rows from prior orders where the product is bananas
prior_products[prior_products['product_id'] == 24852]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
1771449,10,24852,1,1,Banana,24,4
1771450,20,24852,6,0,Banana,24,4
1771451,22,24852,3,1,Banana,24,4
1771452,26,24852,2,1,Banana,24,4
1771453,52,24852,2,1,Banana,24,4
...,...,...,...,...,...,...,...
2244009,3421027,24852,3,1,Banana,24,4
2244010,3421030,24852,9,1,Banana,24,4
2244011,3421038,24852,2,0,Banana,24,4
2244012,3421078,24852,2,1,Banana,24,4


## How often is this product included in a customer's next order?

There are [three sets of data](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b):

> "prior": orders prior to that users most recent order (3.2m orders)  
"train": training data supplied to participants (131k orders)  
"test": test data reserved for machine learning competitions (75k orders)

Customers' next orders are in the "train" and "test" sets. (The "prior" set has the orders prior to the most recent orders.)

We can't use the "test" set here, because we don't have its labels (only Kaggle & Instacart have them), so we don't know what products were bought in the "test" set orders.

So, we'll use the "train" set. It currently has one row per product_id and multiple rows per order_id.

But we don't want that. Instead we want one row per order_id, with a binary column: "Did the order include the product?"

Let's wrangle!
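
Before touching the real data, the target shape can be sketched on a tiny made-up table. The `toy` frame, the `has_product` column name, and `TARGET_PRODUCT` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the "train" table: several rows per order (values invented)
toy = pd.DataFrame({
    'order_id':   [1, 1, 1, 2, 2, 3],
    'product_id': [111, 24852, 333, 111, 444, 24852],
})
TARGET_PRODUCT = 24852

# Flag each row, then ask per order: "is any row flagged?"
toy['has_product'] = toy['product_id'] == TARGET_PRODUCT
per_order = toy.groupby('order_id')['has_product'].any().reset_index()

print(per_order)
```

The result has exactly one row per `order_id`, and the boolean column answers "Did the order include the product?"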

### Test our logic on the first 20 rows

In [0]:
BANANAS = 24852

In [0]:
df = train.head(20)

df

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1
5,1,13176,6,0
6,1,47209,7,0
7,1,22035,8,1
8,36,39612,1,0
9,36,19660,2,1


In [0]:
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
df = df.copy()
df['bananas'] = (df['product_id'] == BANANAS)

df.head(20)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,bananas
0,1,49302,1,1,False
1,1,11109,2,1,False
2,1,10246,3,0,False
3,1,49683,4,0,False
4,1,43633,5,1,False
5,1,13176,6,0,False
6,1,47209,7,0,False
7,1,22035,8,1,False
8,36,39612,1,0,False
9,36,19660,2,1,False


In [0]:
df.groupby(['order_id']).any()

Unnamed: 0_level_0,product_id,add_to_cart_order,reordered,bananas
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,True,True,True,False
36,True,True,True,False
38,True,True,True,False


### Make the y variable on the whole dataset

In [0]:
train['bananas'] = (train['product_id'] == BANANAS)

train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,bananas
0,1,49302,1,1,False
1,1,11109,2,1,False
2,1,10246,3,0,False
3,1,49683,4,0,False
4,1,43633,5,1,False


In [0]:
train.groupby('order_id')['bananas'].any()

order_id
1          False
36         False
38         False
96         False
98         False
           ...  
3421049    False
3421056    False
3421058    False
3421063    False
3421070    False
Name: bananas, Length: 131209, dtype: bool

In [0]:
train_wrangled = train.groupby('order_id')['bananas'].any().reset_index()

In [0]:
train_wrangled['bananas'].value_counts(normalize=True)

False    0.857281
True     0.142719
Name: bananas, dtype: float64

## Which customers have ordered this product before?

- Customers are identified by `user_id`
- Products are identified by `product_id`

Do we have a table with both of these IDs? (If not, how can we combine this information?)
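
One way to combine the two IDs is to merge on the column the tables share, `order_id`. This is a minimal sketch with made-up rows; `toy_orders` and `toy_order_products` are hypothetical stand-ins for `orders` and the order-products tables.

```python
import pandas as pd

# Hypothetical stand-ins: orders know the user, order-products know the product
toy_orders = pd.DataFrame({
    'order_id': [10, 11, 12],
    'user_id':  [1, 1, 2],
})
toy_order_products = pd.DataFrame({
    'order_id':   [10, 10, 11, 12],
    'product_id': [24852, 111, 222, 24852],
})

# Attach user_id to every product row.
# validate='many_to_one' raises an error if order_id is not unique in toy_orders.
merged = pd.merge(
    toy_order_products,
    toy_orders[['order_id', 'user_id']],
    on='order_id',
    how='left',
    validate='many_to_one',
)

print(merged)
```

A left merge keeps every product row, so the row count should not change.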

In [0]:
preview()


 order_products__prior.csv (32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0



 order_products__train.csv (1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1



 aisles.csv (134, 2)


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation



 orders.csv (3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0



 departments.csv (21, 2)


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol



 products.csv (49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


## How can we get a subset of data, just for these customers?

We want *all* the orders from customers who have *ever* bought the product.

(And *none* of the orders from customers who have *never* bought the product.)
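
The filtering logic can be sketched in three steps on made-up data: find the orders with the product, find the users behind those orders, then keep every order from those users. All table names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the orders and prior tables
toy_orders = pd.DataFrame({
    'order_id': [10, 11, 12, 13, 14],
    'user_id':  [1, 1, 2, 3, 3],
})
toy_prior = pd.DataFrame({
    'order_id':   [10, 10, 11, 12, 13, 14],
    'product_id': [24852, 111, 222, 333, 444, 24852],
})
PRODUCT = 24852

# 1. Orders that contained the product
product_order_ids = toy_prior.loc[toy_prior['product_id'] == PRODUCT, 'order_id']

# 2. Users who placed those orders
product_user_ids = toy_orders.loc[
    toy_orders['order_id'].isin(product_order_ids), 'user_id'
].unique()

# 3. Every order from those users, with or without the product
subset_orders = toy_orders[toy_orders['user_id'].isin(product_user_ids)]

print(subset_orders)
```

User 2 never bought the product, so order 12 is dropped; orders 11 and 13 stay even though they contain no bananas.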

In [0]:
prior[prior['product_id']==BANANAS]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
77,10,24852,1,1
180,20,24852,6,0
190,22,24852,3,1
234,26,24852,2,1
414,52,24852,2,1
...,...,...,...,...
32433984,3421027,24852,3,1
32434016,3421030,24852,9,1
32434146,3421038,24852,2,0
32434447,3421078,24852,2,1


In [0]:
# A list of all historic (prior) order ids that contained bananas
banana_prior_order_ids = prior[prior['product_id']==BANANAS]['order_id']

In [0]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [0]:
# Filter orders down to the prior orders that contained bananas.
# (These are not yet all the orders from users who have bought bananas.)
orders[orders['order_id'].isin(banana_prior_order_ids)]
# The user_ids in this result are all the users who
# have ever bought bananas in the past

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
14,738281,2,prior,4,2,10,8.0
16,1199898,2,prior,6,2,9,13.0
17,3194192,2,prior,7,2,12,14.0
18,788338,2,prior,8,1,15,27.0
19,1718559,2,prior,9,2,9,8.0
...,...,...,...,...,...,...,...
3420915,1764570,206202,prior,20,4,0,11.0
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0


In [0]:
# Double-check our work: this order should include bananas (product_id 24852)
prior[prior['order_id'] == 3194192]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
30280663,3194192,32792,1,1
30280664,3194192,12000,2,1
30280665,3194192,16589,3,1
30280666,3194192,32052,4,1
30280667,3194192,19051,5,1
30280668,3194192,32139,6,1
30280669,3194192,47209,7,1
30280670,3194192,24852,8,1
30280671,3194192,46886,9,0
30280672,3194192,40198,10,0


In [0]:
banana_orders = orders[orders['order_id'].isin(banana_prior_order_ids)]

In [0]:
# Get user_ids of everyone who has ever bought bananas previously
banana_user_ids = banana_orders['user_id'].unique()

banana_user_ids

array([     2,     10,     16, ..., 206196, 206202, 206209])

In [0]:
orders.shape

(3421083, 7)

In [0]:
orders = orders[orders['user_id'].isin(banana_user_ids)]

orders.shape

(1512975, 7)

In [0]:
# I want all order_ids associated with banana purchasers
subset_order_ids = orders['order_id'].unique()

In [0]:
prior.shape

(32434489, 4)

In [0]:
prior = prior[prior['order_id'].isin(subset_order_ids)]

prior.shape

(16534534, 4)

In [0]:
train.shape

(1384617, 5)

In [0]:
train = train[train['order_id'].isin(subset_order_ids)]

train.shape

(587269, 5)

In [0]:
target = 'bananas'

train[target].value_counts(normalize=True)

False    0.971807
True     0.028193
Name: bananas, dtype: float64
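
Because `train` still has one row per product at this point, the proportion above is measured over rows, not over orders. The sketch below, on made-up data, shows how the two rates can differ; `toy_train` is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in: order 1 has four products (one is bananas),
# order 2 has two products (no bananas)
toy_train = pd.DataFrame({
    'order_id': [1, 1, 1, 1, 2, 2],
    'bananas':  [False, True, False, False, False, False],
})

# Share of product rows that are bananas
row_rate = toy_train['bananas'].mean()

# Share of orders that contain bananas
order_rate = toy_train.groupby('order_id')['bananas'].any().mean()

print(row_rate, order_rate)
```

The per-order rate is the one that matches the question "Will customers reorder this product?"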

### Exploring the above result

In [0]:
# The share of bananas in all past orders:
# the number of orders that originally contained bananas,
# divided by the total number of products
472565 / 3443448

0.13723599136679282

## What features can we engineer? We want to predict: will these customers reorder this product in their next order?

In [0]:
# What features do we already have

preview()


 order_products__prior.csv (32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0



 order_products__train.csv (1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1



 aisles.csv (134, 2)


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation



 orders.csv (3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0



 departments.csv (21, 2)


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol



 products.csv (49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [0]:
train.shape

(587269, 5)

In [0]:
train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,bananas
0,1,49302,1,1,False
1,1,11109,2,1,False
2,1,10246,3,0,False
3,1,49683,4,0,False
4,1,43633,5,1,False


### What we know about banana purchasers / purchases:

1) How often a person purchases bananas
- percentage of previous orders that contain bananas
- How many days between banana purchases

2) Recency of banana purchases 
- number of orders since
- number of days since

3) When they purchase bananas (time of day, day of week)


4) Other items in that purchase (bought other fruit items)
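
As a rough sketch of the first two ideas (frequency and recency), the example below computes two per-user features on made-up data. It assumes the data has already been collapsed to one row per prior order; the table and the feature names `banana_order_share` and `orders_since_banana` are hypothetical.

```python
import pandas as pd

# Hypothetical table: one row per prior order, flagged for bananas
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 2, 2],
    'order_number': [1, 2, 3, 4, 1, 2, 3],
    'bananas':      [True, False, True, False, True, False, False],
})

# Frequency: share of each user's prior orders that contained bananas
share = toy.groupby('user_id')['bananas'].mean()

# Recency: orders placed since the last banana order
last_order = toy.groupby('user_id')['order_number'].max()
last_banana = toy[toy['bananas']].groupby('user_id')['order_number'].max()

features = pd.DataFrame({
    'banana_order_share': share,
    'orders_since_banana': last_order - last_banana,
})

print(features)
```

A user with no banana orders would get a missing value for the recency feature, but the subset built earlier only keeps customers who have bought bananas at least once.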



In [0]:
# Merge in the order details so that everything we might use
# to make predictions is in one place

train = pd.merge(train, orders)

train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,bananas,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1,49302,1,1,False,112108,train,4,4,10,9.0
1,1,11109,2,1,False,112108,train,4,4,10,9.0
2,1,10246,3,0,False,112108,train,4,4,10,9.0
3,1,49683,4,0,False,112108,train,4,4,10,9.0
4,1,43633,5,1,False,112108,train,4,4,10,9.0


In [0]:
# Start with one user and see if we can calculate the percentage
# of their prior orders that contained bananas
USER = 206202

In [0]:
prior = pd.merge(prior, orders[['user_id', 'order_id']])

In [0]:
prior['bananas'] = (prior['product_id'] == BANANAS)

In [0]:
df = prior[prior.user_id==USER]

df

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,bananas
178996,36819,432,1,1,206202,False
178997,36819,2693,2,1,206202,False
178998,36819,49683,3,1,206202,False
527743,109085,432,1,1,206202,False
527744,109085,26620,2,1,206202,False
...,...,...,...,...,...,...
16329241,3378039,10455,6,1,206202,False
16329242,3378039,38837,7,1,206202,False
16329243,3378039,4920,8,0,206202,False
16329244,3378039,28204,9,0,206202,False


In [0]:
df.groupby('order_id')['bananas'].sum().sum()

11

In [0]:
df['bananas'].sum()

11

In [0]:
# numerator
# the 11 orders from this user that contain bananas
df[df['bananas']]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,bananas
527747,109085,24852,5,1,206202,True
714005,147354,24852,7,1,206202,True
2120248,438707,24852,2,1,206202,True
3266870,675592,24852,2,1,206202,True
3776404,781276,24852,4,1,206202,True
6046410,1251580,24852,1,1,206202,True
8535018,1764570,24852,5,1,206202,True
10187790,2105497,24852,5,1,206202,True
12359180,2554068,24852,3,0,206202,True
13995888,2892967,24852,1,1,206202,True


In [0]:
# denominator
# total number of orders that they've made 
df['order_id'].nunique()

22

In [0]:
df['bananas'].sum() / df['order_id'].nunique()

0.5
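
The same calculation could be extended from one user to every user with a `groupby`. This is a minimal sketch on made-up data, assuming (as above) that a product appears at most once per order; `toy_prior` is a hypothetical stand-in for the merged `prior` table.

```python
import pandas as pd

# Hypothetical stand-in for prior after merging in user_id and flagging bananas
toy_prior = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 2, 2, 2],
    'order_id': [10, 10, 11, 12, 20, 21, 21],
    'bananas':  [True, False, False, True, False, True, False],
})

grouped = toy_prior.groupby('user_id')

# numerator: banana rows per user; denominator: distinct orders per user
banana_share = grouped['bananas'].sum() / grouped['order_id'].nunique()

print(banana_share)
```

Both pieces are indexed by `user_id`, so the division lines up user by user and yields one feature value per customer.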