Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets 
- Explore tabular data for supervised machine learning
- Join relational data for supervised machine learning

# Explore tabular data for supervised machine learning 🍌

Wrangling your dataset is often the most challenging and time-consuming part of the modeling process.

In today's lesson, we’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!

Let’s get set up:

In [1]:
# Download data
import requests

def download(url):
    filename = url.split('/')[-1]
    print(f'Downloading {url}')
    r = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(r.content)
    print(f'Downloaded {filename}')

download('https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz')

Downloading https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Downloaded instacart_online_grocery_shopping_2017_05_01.tar.gz


In [2]:
# Uncompress data
import tarfile
tarfile.open('instacart_online_grocery_shopping_2017_05_01.tar.gz').extractall()

In [3]:
# Change directory to where the data was uncompressed
%cd instacart_2017_05_01

/mnt/c/Users/joe/OneDrive/Desktop/SCHOOL/2-3/module2-wrangle-ml-datasets/instacart_2017_05_01


In [4]:
# Print the csv filenames
from glob import glob
for filename in glob('*.csv'):
    print(filename)

aisles.csv
departments.csv
orders.csv
order_products__prior.csv
order_products__train.csv
products.csv


In [5]:
# For each csv file, look at its shape & head
import pandas as pd
from IPython.display import display


def preview():
    for filename in glob('*.csv'):
        df = pd.read_csv(filename)
        print('\n', filename, df.shape)
        display(df.head())


preview()


 aisles.csv (134, 2)


Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation



 departments.csv (21, 2)


Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol



 orders.csv (3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0



 order_products__prior.csv (32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0



 order_products__train.csv (1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1



 products.csv (49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [6]:
train=pd.read_csv('order_products__train.csv')


In [7]:
products=pd.read_csv('products.csv')
train = pd.merge(train, products, on='product_id', how='inner')

In [8]:
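# Rows whose product name contains 'anana' (matches both 'Banana' and 'banana') and that were reordered
bananas = train[(train['product_name'].str.contains('anana')) & (train['reordered'] == 1)]
# Share of all train rows that are reordered banana products, as a percentage
len(bananas) / len(train) * 100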

2.515569287391387

### ... but we can simplify!

Simplify the question, from "Which products will be reordered?" (Multi-class, [multi-label](https://en.wikipedia.org/wiki/Multi-label_classification) classification) to **"Will customers reorder this one product?"** (Binary classification)

Which product? How about **the most frequently ordered product?**

### Questions:

- What is the most frequently ordered product?
- How often is this product included in a customer's next order?
- Which customers have ordered this product before?
- How can we get a subset of data, just for these customers?
- What features can we engineer to predict whether these customers will reorder this product in their next order?

## Follow Along

### What was the most frequently ordered product?
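One way to answer this is to count rows per product name. The sketch below runs on a tiny invented table shaped like `order_products__train.csv` merged with `products.csv`; the rows and the name `toy` are assumptions for illustration, not real Instacart data.

```python
import pandas as pd

# Toy stand-in for the merged order_products + products table (values are invented).
toy = pd.DataFrame({
    'order_id': [1, 1, 2, 2, 3],
    'product_name': ['Banana', 'Organic Milk', 'Banana',
                     'Bag of Organic Bananas', 'Banana'],
})

# value_counts tallies how many rows each product name appears in.
product_counts = toy['product_name'].value_counts()

# idxmax returns the label with the largest count.
most_frequent = product_counts.idxmax()
print(most_frequent, product_counts.max())
```

On the real data, the same two calls could be applied to the merged `train` dataframe, or to the much larger "prior" table if memory allows.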

### How often are bananas included in a customer's next order?

There are [three sets of data](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b):

> "prior": orders prior to that users most recent order (3.2m orders)  
"train": training data supplied to participants (131k orders)  
"test": test data reserved for machine learning competitions (75k orders)

Customers' next orders are in the "train" and "test" sets. (The "prior" set has the orders prior to the most recent orders.)

We can't use the "test" set here, because we don't have its labels (only Kaggle & Instacart have them), so we don't know what products were bought in the "test" set orders.

So, we'll use the "train" set. It currently has one row per product_id and multiple rows per order_id.

But we don't want that. Instead we want one row per order_id, with a binary column: "Did the order include bananas?"

Let's wrangle!
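A minimal sketch of that reshaping, using an invented table in place of the real "train" set. It assumes the target product is named exactly `'Banana'`; the dataframe and column names `toy_train` and `bananas` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the train set: one row per product in each order (values are invented).
toy_train = pd.DataFrame({
    'order_id': [1, 1, 2, 3, 3],
    'product_name': ['Banana', 'Organic Milk', 'Organic Milk', 'Bread', 'Banana'],
})

# Flag each row that is the target product.
toy_train['bananas'] = toy_train['product_name'] == 'Banana'

# Collapse to one row per order: True if any row in the order was flagged.
order_has_bananas = toy_train.groupby('order_id')['bananas'].any().reset_index()

# Share of orders that include the target product.
banana_rate = order_has_bananas['bananas'].mean()
print(order_has_bananas)
print(banana_rate)
```

Using `.any()` after `groupby` is what turns several product rows into a single yes/no answer per order.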

# Join relational data for supervised machine learning

## Overview
Often, you’ll need to join data from multiple relational tables before you’re ready to fit your models.

### Which customers have ordered this product before?

- Customers are identified by `user_id`
- Products are identified by `product_id`

Do we have a table with both of these IDs? (If not, how can we combine this information?)
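In the previews, `orders.csv` has `order_id` and `user_id`, while the `order_products__*.csv` files have `order_id` and `product_id`, so the two can be linked through `order_id`. Here is a sketch on invented rows; the ID values and the names `toy_orders`, `toy_order_products`, and `user_products` are assumptions.

```python
import pandas as pd

# Toy stand-ins for orders.csv and order_products__*.csv (IDs are invented).
toy_orders = pd.DataFrame({'order_id': [10, 11, 12], 'user_id': [1, 1, 2]})
toy_order_products = pd.DataFrame({
    'order_id': [10, 10, 11, 12],
    'product_id': [100, 200, 200, 100],
})

# Join on the shared key so each product row picks up the user who ordered it.
user_products = pd.merge(toy_orders, toy_order_products, on='order_id', how='inner')
user_products = user_products[['user_id', 'product_id']]

# Users who have ordered the hypothetical product 100.
buyers = set(user_products.loc[user_products['product_id'] == 100, 'user_id'])
print(buyers)
```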

In [14]:
# Rows of the merged train set whose product name contains 'Banana'
b_set = train[train.product_name.str.contains('Banana')]


In [12]:
orders=pd.read_csv('orders.csv')
orders.sample(5)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2161501,2321439,130035,prior,13,4,8,2.0
2659027,2848021,160060,prior,19,6,19,7.0
1755202,1084076,105338,prior,34,1,3,3.0
2371782,2120788,142755,prior,31,2,8,8.0
342994,2163784,20760,prior,4,5,12,8.0


In [16]:
# A subset of the orders that contain bananas.
# We can now get more information about the users who are ordering bananas.
# Next question: for this dataframe, how do we get a list of user IDs?
order_b_set = orders[orders.order_id.isin(b_set.order_id)]

In [17]:
order_b_set.user_id.value_counts().index.to_list()

[182168,
 183923,
 206126,
 15661,
 144684,
 66858,
 138537,
 5416,
 104288,
 32037,
 ...]
 

## Follow Along

### How can we get a subset of data, just for these customers?

We want *all* the orders from customers who have *ever* bought bananas.

(And *none* of the orders from customers who have *never* bought bananas.)
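A sketch of that filter in two `isin` steps, on invented data. The list `banana_order_ids` is a hypothetical stand-in for the order IDs found to contain bananas.

```python
import pandas as pd

# Toy stand-in for orders.csv (values are invented).
toy_orders = pd.DataFrame({
    'order_id': [10, 11, 12, 13],
    'user_id': [1, 1, 2, 3],
})
banana_order_ids = [11, 13]  # hypothetical orders that contained bananas

# Step 1: users who placed at least one banana order.
is_banana_order = toy_orders['order_id'].isin(banana_order_ids)
banana_user_ids = toy_orders.loc[is_banana_order, 'user_id'].unique()

# Step 2: every order from those users, including orders without bananas.
banana_customer_orders = toy_orders[toy_orders['user_id'].isin(banana_user_ids)]
print(banana_customer_orders)
```

Order 10 is kept even though it had no bananas, because user 1 bought bananas in another order. User 2 never did, so order 12 is dropped.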

### What features can we engineer to predict whether these customers will reorder bananas in their next order?
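As a starting point, here is a sketch of a few per-customer features aggregated from prior orders only, so that nothing from the order being predicted leaks into the inputs. The feature names and the toy table are assumptions, not the lesson's chosen feature set.

```python
import pandas as pd

# Toy stand-in for prior orders with a per-order banana flag (values are invented).
toy_prior = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_id': [10, 11, 12, 20, 21],
    'days_since_prior_order': [float('nan'), 7.0, 14.0, float('nan'), 30.0],
    'bananas': [True, False, True, False, False],
})

# One row per customer, using named aggregation.
features = toy_prior.groupby('user_id').agg(
    n_prior_orders=('order_id', 'nunique'),          # how many orders so far
    banana_order_rate=('bananas', 'mean'),           # share of orders with bananas
    mean_days_between_orders=('days_since_prior_order', 'mean'),  # NaN is skipped
)
print(features)
```

Each customer's row can then be joined to the target from their "train" order.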

## Challenge

**Continue to clean and explore your data.** Can you **engineer features** to help predict your target? For the evaluation metric you chose, what score would you get just by guessing? Can you **make a fast, first model** that beats guessing?
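If accuracy is the chosen metric, the "just guessing" score is the share of the most common class. A minimal sketch follows; the target values in `y` are invented.

```python
import pandas as pd

# Hypothetical binary target: did the next order include the product?
y = pd.Series([False, False, False, True, False, True, False, False])

# Share of each class; always guessing the larger one gives the baseline accuracy.
class_shares = y.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()
print(majority_class, baseline_accuracy)
```

A first model is only useful if it scores above this number on held-out data.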

We recommend that you use your portfolio project dataset for all assignments this sprint. But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset today. Follow the instructions in the assignment notebook. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!