Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets 🍌


In today's lesson, we’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!



### Setup

In [1]:
# Download data
import requests

def download(url):
    filename = url.split('/')[-1]
    print(f'Downloading {url}')
    r = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(r.content)
    print(f'Downloaded {filename}')

download('https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz')

Downloading https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Downloaded instacart_online_grocery_shopping_2017_05_01.tar.gz


In [0]:
# Uncompress data
import tarfile
tarfile.open('instacart_online_grocery_shopping_2017_05_01.tar.gz').extractall()

In [3]:
# Change directory to where the data was uncompressed
%cd instacart_2017_05_01

/content/instacart_2017_05_01


In [4]:
# Print the csv filenames
from glob import glob
for filename in glob('*.csv'):
    print(filename)

departments.csv
products.csv
order_products__train.csv
aisles.csv
order_products__prior.csv
orders.csv


### For each csv file, look at its shape & head 

In [0]:
import pandas as pd

In [6]:
departments = pd.read_csv('departments.csv')
print(departments.shape)
departments.head()

(21, 2)


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


In [7]:
products = pd.read_csv('products.csv')
print(products.shape)
products.head()

(49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [8]:
order_products__train = pd.read_csv('order_products__train.csv')
print(order_products__train.shape)
order_products__train.head()

(1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [9]:
aisles = pd.read_csv('aisles.csv')
print(aisles.shape)
aisles.head()

(134, 2)


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


In [10]:
order_products__prior = pd.read_csv('order_products__prior.csv')
print(order_products__prior.shape)
order_products__prior.head()

(32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [11]:
orders = pd.read_csv('orders.csv')
print(orders.shape)
orders.head()

(3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [12]:
for filename in glob('*.csv'):
  df = pd.read_csv(filename)
  print(filename, df.shape)

departments.csv (21, 2)
products.csv (49688, 4)
order_products__train.csv (1384617, 4)
aisles.csv (134, 2)
order_products__prior.csv (32434489, 4)
orders.csv (3421083, 7)


In [13]:
from IPython.display import display

def preview():
    for filename in glob('*.csv'):
      df = pd.read_csv(filename)
      print(filename, df.shape)
      display(df.head())
      print('\n')

preview()

departments.csv (21, 2)


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol




products.csv (49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13




order_products__train.csv (1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1




aisles.csv (134, 2)


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation




order_products__prior.csv (32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0




orders.csv (3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0






## The original task was complex ...

[The Kaggle competition](https://www.kaggle.com/c/instacart-market-basket-analysis/data) described the task this way:

> The dataset for this competition is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order.

> orders.csv: This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders.

Each row in the submission is an order_id from the test set, followed by product_id(s) predicted to be reordered.

> sample_submission.csv: 
```
order_id,products
17,39276 29259
34,39276 29259
137,39276 29259
182,39276 29259
257,39276 29259
```

## ... but we can simplify!

Simplify the question, from "Which products will be reordered?" (Multi-class, [multi-label](https://en.wikipedia.org/wiki/Multi-label_classification) classification) to **"Will customers reorder this one product?"** (Binary classification)

Which product? How about **the most frequently ordered product?**
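To make the simplification concrete, here is a minimal sketch of the two kinds of target. The order ids, product ids, and the `PRODUCT` constant are made up for illustration and are not taken from the dataset.

```python
# Toy baskets: each order id maps to the set of product ids it contains.
# All ids here are hypothetical.
baskets = {
    101: {7, 3},
    102: {5},
    103: {7},
}

PRODUCT = 7  # hypothetical id of the one product we ask about

# Multi-label target: a whole set of product ids per order.
multi_label = baskets

# Binary target: a single True/False per order.
binary = {order_id: PRODUCT in items for order_id, items in baskets.items()}
print(binary)
```

The binary target keeps one value per order, which is what an ordinary classifier expects.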

# Questions:

- What is the most frequently ordered product?
- How often is this product included in a customer's next order?
- Which customers have ordered this product before?
- How can we get a subset of data, just for these customers?
- What features can we engineer to predict whether these customers will reorder this product in their next order?

## What was the most frequently ordered product?

In [0]:
prior = pd.read_csv('order_products__prior.csv')

In [16]:
prior['product_id'].mode()

0    24852
dtype: int64

In [17]:
prior['product_id'].value_counts()

24852    472565
13176    379450
21137    264683
21903    241921
47209    213584
          ...  
11356         1
18001         1
6320          1
26268         1
30087         1
Name: product_id, Length: 49677, dtype: int64

In [0]:
train = pd.read_csv('order_products__train.csv')

In [19]:
train['product_id'].mode()

0    24852
dtype: int64

In [20]:
train['product_id'].value_counts()

24852    18726
13176    15480
21137    10894
21903     9784
47626     8135
         ...  
44256        1
2764         1
4815         1
43736        1
46835        1
Name: product_id, Length: 39123, dtype: int64

In [0]:
products = pd.read_csv('products.csv')

In [22]:
products[products['product_id'] == prior['product_id'].mode()[0]]

Unnamed: 0,product_id,product_name,aisle_id,department_id
24851,24852,Banana,24,4


## How often is this product included in a customer's next order?

## this product = bananas

There are [three sets of data](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b):

> "prior": orders prior to that users most recent order (3.2m orders)  
"train": training data supplied to participants (131k orders)  
"test": test data reserved for machine learning competitions (75k orders)

Customers' next orders are in the "train" and "test" sets. (The "prior" set has the orders prior to the most recent orders.)

We can't use the "test" set here, because we don't have its labels (only Kaggle & Instacart have them), so we don't know what products were bought in the "test" set orders.

So, we'll use the "train" set. It currently has one row per product_id and multiple rows per order_id.

But we don't want that. Instead we want one row per order_id, with a binary column: "Did the order include the product?"

Let's wrangle!

In [23]:
train.head(20)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1
5,1,13176,6,0
6,1,47209,7,0
7,1,22035,8,1
8,36,39612,1,0
9,36,19660,2,1


In [28]:
df = train.head(16).copy()
df['bananas'] = df['product_id'] == 24852
df

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,bananas
0,1,49302,1,1,False
1,1,11109,2,1,False
2,1,10246,3,0,False
3,1,49683,4,0,False
4,1,43633,5,1,False
5,1,13176,6,0,False
6,1,47209,7,0,False
7,1,22035,8,1,False
8,36,39612,1,0,False
9,36,19660,2,1,False


In [27]:
df.groupby('order_id')['bananas'].any()

order_id
1     False
36    False
Name: bananas, dtype: bool

In [0]:
train['bananas'] = train['product_id'] == 24852

In [34]:
train_wrangled = train.groupby('order_id')['bananas'].any().reset_index()
train_wrangled

Unnamed: 0,order_id,bananas
0,1,False
1,36,False
2,38,False
3,96,False
4,98,False
...,...,...
131204,3421049,False
131205,3421056,False
131206,3421058,False
131207,3421063,False


In [37]:
target = 'bananas'
train_wrangled[target].value_counts(normalize=True)

False    0.857281
True     0.142719
Name: bananas, dtype: float64
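The class shares above also give a majority-class baseline: a model that always predicts the most common class would be right about as often as that class occurs. The sketch below shows the calculation on a small made-up target, not on the real `train_wrangled` data.

```python
import pandas as pd

# Toy target with an imbalance similar in spirit to the real one (values are made up).
y = pd.Series([False] * 6 + [True] * 1, name='bananas')

# Share of each class; the largest share is the accuracy of always
# predicting the majority class.
shares = y.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = shares.max()
print(majority_class, round(baseline_accuracy, 3))
```

Any model trained later should beat this number to be worth keeping.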

Technique 2: collect each order's product ids, then check whether bananas are among them.

In [36]:
df.groupby('order_id')['product_id'].apply(list).reset_index()

Unnamed: 0,order_id,product_id
0,1,"[49302, 11109, 10246, 49683, 43633, 13176, 472..."
1,36,"[39612, 19660, 49235, 43086, 46620, 34497, 486..."


In [38]:
def includes_bananas(product_ids):
  return 24852 in list(product_ids)

df.groupby('order_id')['product_id'].apply(includes_bananas)

order_id
1     False
36    False
Name: product_id, dtype: bool

In [0]:
train_wrangle2 = (train
                  .groupby('order_id')
                  .agg({'product_id': includes_bananas})
                  .reset_index()
                  .rename(columns={'product_id': 'bananas'}))

In [40]:
train_wrangle2[target].value_counts(normalize=True)

False    0.857281
True     0.142719
Name: bananas, dtype: float64

In [62]:
train_wrangled.shape, train_wrangle2.shape

((131209, 2), (131209, 2))
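Matching shapes and class shares suggest the two techniques agree, but a row-by-row comparison is a stricter check. The sketch below runs both techniques on a small made-up table and compares the results. The `toy` frame is an assumption for illustration only.

```python
import pandas as pd

BANANAS = 24852

# Toy stand-in for order_products__train (order and product rows are made up).
toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 3, 3],
    'product_id': [24852, 13176, 21137, 47209, 24852],
})

# Technique 1: flag each row, then ask whether any row in the order is flagged.
t1 = (toy.assign(bananas=toy['product_id'] == BANANAS)
         .groupby('order_id')['bananas'].any()
         .reset_index())

# Technique 2: test membership over each order's product ids.
t2 = (toy.groupby('order_id')['product_id']
         .apply(lambda ids: BANANAS in ids.values)
         .rename('bananas')
         .reset_index())

# Compare the two results order by order.
same = bool((t1['bananas'] == t2['bananas']).all())
print(same)
```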

## Which customers have ordered this product before?

- Customers are identified by `user_id`
- Products are identified by `product_id`

Do we have a table with both of these ids? (If not, how can we combine this information?)

In [41]:
preview()



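No single table has both, but we can combine them: the order-products tables link `product_id` to `order_id`, and the orders table links `order_id` to `user_id`.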
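One way to combine the ids is a merge on the shared `order_id` column. This is a minimal sketch on made-up rows. The frames `order_products` and `orders_toy` are hypothetical stand-ins for `order_products__prior.csv` and `orders.csv`.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up).
order_products = pd.DataFrame({
    'order_id':   [10, 10, 11, 12],
    'product_id': [24852, 13176, 24852, 21137],
})
orders_toy = pd.DataFrame({
    'order_id': [10, 11, 12],
    'user_id':  [1, 2, 2],
})

# order_id is the shared key: each product row picks up the user who placed the order.
combined = order_products.merge(orders_toy, on='order_id', how='left')
print(combined)

# Users who bought product 24852 at least once.
buyers = combined.loc[combined['product_id'] == 24852, 'user_id'].unique()
print(buyers)
```

Filtering with `isin`, as the cells that follow do, reaches the same users without building the merged table, which can matter for memory with 32 million rows.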

In [42]:
# in the order products prior table, which orders included bananas

BANANAS = 24852
prior[prior['product_id']==BANANAS]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
77,10,24852,1,1
180,20,24852,6,0
190,22,24852,3,1
234,26,24852,2,1
414,52,24852,2,1
...,...,...,...,...
32433984,3421027,24852,3,1
32434016,3421030,24852,9,1
32434146,3421038,24852,2,0
32434447,3421078,24852,2,1


In [0]:
banana_prior_order_ids = prior[prior['product_id']==BANANAS]['order_id']

In [45]:
# in the orders table, which orders included bananas
orders = pd.read_csv('orders.csv')
orders.sample(n=5)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
1734900,363034,104162,prior,1,1,12,
3327849,2585585,200571,prior,7,0,17,17.0
1053336,2974064,63405,prior,6,0,14,7.0
3359511,3119663,202503,prior,11,3,11,14.0
2329170,719619,140212,prior,26,5,16,6.0


In [46]:
orders[orders['order_id'].isin(banana_prior_order_ids)]

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
14,738281,2,prior,4,2,10,8.0
16,1199898,2,prior,6,2,9,13.0
17,3194192,2,prior,7,2,12,14.0
18,788338,2,prior,8,1,15,27.0
19,1718559,2,prior,9,2,9,8.0
...,...,...,...,...,...,...,...
3420915,1764570,206202,prior,20,4,0,11.0
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0


In [47]:
# Check one of these orders to confirm that it includes bananas
prior[prior['order_id']==738281]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
6992648,738281,49451,1,0
6992649,738281,32792,2,1
6992650,738281,32139,3,0
6992651,738281,34688,4,0
6992652,738281,36735,5,0
6992653,738281,37646,6,0
6992654,738281,22829,7,0
6992655,738281,24852,8,0
6992656,738281,47209,9,0
6992657,738281,33276,10,0


In [0]:
banana_orders = orders[orders['order_id'].isin(banana_prior_order_ids)]

In [49]:
banana_orders

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
14,738281,2,prior,4,2,10,8.0
16,1199898,2,prior,6,2,9,13.0
17,3194192,2,prior,7,2,12,14.0
18,788338,2,prior,8,1,15,27.0
19,1718559,2,prior,9,2,9,8.0
...,...,...,...,...,...,...,...
3420915,1764570,206202,prior,20,4,0,11.0
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0


In [51]:
# in the orders table, which users have bought bananas
banana_user_ids = banana_orders['user_id'].unique()
banana_user_ids

array([     2,     10,     16, ..., 206196, 206202, 206209])

## How can we get a subset of data, just for these customers?

We want *all* the orders from customers who have *ever* bought the product.

(And *none* of the orders from customers who have *never* bought the product.)
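As an alternative to filtering with lists of ids, the "ever bought" rule can be sketched with `groupby(...).transform('any')`. This is a different technique from the `isin` filtering used in the cells here, and it assumes a table that already carries `user_id` on every product row. The `rows` frame is made up.

```python
import pandas as pd

BANANAS = 24852

# Toy product rows with user_id already attached (values are made up).
rows = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 3],
    'order_id':   [10, 11, 12, 13, 14],
    'product_id': [24852, 13176, 21137, 47209, 24852],
})

# Mark every row of a user as True if any of that user's rows is a banana row.
ever_bought = (rows['product_id'] == BANANAS).groupby(rows['user_id']).transform('any')

# Keep all rows for banana buyers, drop all rows for everyone else.
subset = rows[ever_bought]
print(subset)
```

User 1 keeps both orders even though only one of them contains bananas, while user 2 is dropped entirely.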

In [52]:
# orders table, shape before getting subset
orders.shape

(3421083, 7)

In [54]:
# orders table, shape after getting subset
orders2 = orders[orders['user_id'].isin(banana_user_ids)]
orders2.shape

(1512975, 7)

In [0]:
# ids from all the orders from customers that have ever bought bananas
subset_order_ids = orders2['order_id'].unique()

In [56]:
# order_products__prior table, shape before getting subset
prior.shape

(32434489, 4)

In [57]:
# order_products__prior table, shape after getting subset
prior2 = prior[prior['order_id'].isin(subset_order_ids)]
prior2.shape

(16534534, 4)

In [63]:
# order_products__train table, shape before getting subset
train_wrangled.shape

(131209, 2)

In [64]:
# order_products__train table, shape after getting subset
train2 = train_wrangled[train_wrangled['order_id'].isin(subset_order_ids)]
train2.shape

(46964, 2)

In [66]:
# In this subset, how often were bananas reordered in the customer's most recent order?
train2[target].value_counts(normalize=True)

False    0.647453
True     0.352547
Name: bananas, dtype: float64

## What features can we engineer to predict whether these customers will reorder this product in their next order?

- other fruit they buy
- time between banana orders
- frequency
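The "time between banana orders" idea could be sketched as follows for a single user. The sketch assumes a per-order history with a boolean `bananas` column, which is not a table built in this notebook, and all values are made up.

```python
import pandas as pd

# Toy order history for one user, sorted by order_number (values are made up).
history = pd.DataFrame({
    'order_number':           [1, 2, 3, 4, 5],
    'days_since_prior_order': [None, 7.0, 14.0, 7.0, 10.0],
    'bananas':                [True, False, True, False, True],
})

# Days elapsed since the user's first order (the first order counts as day 0).
history['day'] = history['days_since_prior_order'].fillna(0).cumsum()

# Gaps between consecutive banana orders, then their mean.
banana_days = history.loc[history['bananas'], 'day']
gaps = banana_days.diff().dropna()
mean_gap = gaps.mean()
print(gaps.tolist(), mean_gap)
```

The cumulative sum is needed because `days_since_prior_order` measures the gap to the previous order of any kind, not to the previous banana order.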

In [67]:
preview()



In [68]:
train2.shape

(46964, 2)

In [69]:
train2.head()

Unnamed: 0,order_id,bananas
0,1,False
1,36,False
9,349,False
13,631,False
18,878,True


In [72]:
# merge user_id, order_number, order_dow, order_hour_of_day and days_since_prior_order
# with the training data
train3 = pd.merge(train2, orders)
train3.head()

Unnamed: 0,order_id,bananas,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1,False,112108,train,4,4,10,9.0
1,36,False,79431,train,23,6,18,30.0
2,349,False,156353,train,9,3,16,30.0
3,631,False,184099,train,7,3,9,30.0
4,878,True,61911,train,9,2,13,30.0
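`pd.merge` with no `on` argument joins on every column name the two frames share, which here is only `order_id`. Spelling out the key and the expected relationship makes the intent explicit. The sketch below uses made-up frames named `labels` and `orders_toy`.

```python
import pandas as pd

# Toy stand-ins for train2 and orders (values are made up).
labels = pd.DataFrame({'order_id': [1, 36], 'bananas': [False, False]})
orders_toy = pd.DataFrame({
    'order_id':  [1, 36, 99],
    'user_id':   [112108, 79431, 5],
    'order_dow': [4, 6, 0],
})

# Explicit key, keep every labeled order, and raise an error
# if order_id is duplicated on either side.
merged = pd.merge(labels, orders_toy, on='order_id', how='left',
                  validate='one_to_one')
print(merged)
```

With `how='left'` the result keeps exactly one row per labeled order, so the row count should match the labels table.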


In [0]:
target = 'bananas'
can_be_used = ['order_id', 'user_id', 'order_number']
useful = ['order_dow', 'order_hour_of_day', 'days_since_prior_order']
useless = ['eval_set']

- frequency of banana orders
 - % of orders
 - every n days on average
- recency of banana orders
 - n of orders
 - n days
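The recency idea ("n of orders" since the last banana order) might be computed per user along these lines. The sketch assumes a per-order table with a boolean `bananas` flag, here called `orders_flagged`, which is a hypothetical name with made-up values.

```python
import pandas as pd

# Toy per-order table: one row per prior order, with a banana flag (values are made up).
orders_flagged = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2, 2, 2],
    'order_number': [1, 2, 3, 1, 2, 3, 4],
    'bananas':      [True, True, False, True, False, False, False],
})

# Each user's latest order number, and latest order number that included bananas.
last_order = orders_flagged.groupby('user_id')['order_number'].max()
last_banana = (orders_flagged[orders_flagged['bananas']]
               .groupby('user_id')['order_number'].max())

# 0 would mean the most recent prior order included bananas.
orders_since_banana = (last_order - last_banana).rename('orders_since_banana')
print(orders_since_banana)
```

Because the subset only contains customers who have bought bananas at least once, every user has a last banana order and the subtraction leaves no missing values.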

In [0]:
USER = 61911

In [0]:
prior11 = pd.merge(prior, orders[['order_id', 'user_id']])

In [78]:
# this user has ordered 196 products
prior11[prior11['user_id']==USER]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id
3451411,364645,27705,1,1,61911
3451412,364645,21137,2,1,61911
3451413,364645,27845,3,1,61911
3451414,364645,36011,4,1,61911
3451415,364645,26790,5,1,61911
...,...,...,...,...,...
21538392,2271842,25090,9,1,61911
21538393,2271842,27845,10,1,61911
21538394,2271842,36011,11,1,61911
21538395,2271842,45633,12,0,61911


In [0]:
prior11['bananas'] = prior11['product_id'] == BANANAS

In [80]:
prior11[prior11['user_id']==USER]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,bananas
3451411,364645,27705,1,1,61911,False
3451412,364645,21137,2,1,61911,False
3451413,364645,27845,3,1,61911,False
3451414,364645,36011,4,1,61911,False
3451415,364645,26790,5,1,61911,False
...,...,...,...,...,...,...
21538392,2271842,25090,9,1,61911,False
21538393,2271842,27845,10,1,61911,False
21538394,2271842,36011,11,1,61911,False
21538395,2271842,45633,12,0,61911,False


In [81]:
# has ordered bananas 6 times
prior11[prior11['user_id']==USER]['bananas'].sum()

6

In [0]:
df_user = prior11[prior11['user_id']==USER]

In [83]:
df_user[df_user['bananas']]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,bananas
3451417,364645,24852,7,1,61911,True
6999188,738971,24852,4,1,61911,True
7281896,768788,24852,6,1,61911,True
7930904,837210,24852,6,1,61911,True
14971181,1579677,24852,6,0,61911,True
21538384,2271842,24852,1,1,61911,True


In [86]:
# how many unique orders for this user
df_user['order_id'].nunique()

8

In [85]:
df_user['bananas'].sum() / df_user['order_id'].nunique()

0.75
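The single-user ratio above can be extended to every user with one `groupby`. This is a sketch on a made-up stand-in for `prior11`, and it assumes a product appears at most once per order, so that the number of banana rows equals the number of banana orders.

```python
import pandas as pd

BANANAS = 24852

# Toy stand-in for prior11: product rows with user_id attached (values are made up).
rows = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'order_id':   [10, 10, 11, 12, 13],
    'product_id': [24852, 13176, 21137, 24852, 24852],
})
rows['bananas'] = rows['product_id'] == BANANAS

# Per user: number of banana rows and number of distinct orders.
per_user = rows.groupby('user_id').agg(
    banana_orders=('bananas', 'sum'),
    total_orders=('order_id', 'nunique'),
)

# Share of each user's orders that included bananas.
per_user['banana_order_share'] = per_user['banana_orders'] / per_user['total_orders']
print(per_user)
```

The resulting frame is indexed by `user_id`, so it could be merged onto the training table as a feature column.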