Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle Data

## Download and Import

In [None]:
import requests
import tarfile
import pandas as pd

Download file to local machine

In [None]:
url = 'https://nicks-datasets.s3-us-west-2.amazonaws.com/instacart_online_grocery_shopping_2017_05_01.tar.gz'
filename = url.split('/')[-1]
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed

with open(filename, 'wb') as f:
    f.write(r.content)

Extract `.csv` files from the `.tar.gz` archive

In [None]:
# Extract files from our `.tar.gz` archive; the context manager closes it afterwards
with tarfile.open(filename) as tar:
    tar.extractall()

Load `.csv` files into DataFrames

In [None]:
# Load files into DataFrames
orders = pd.read_csv('instacart_2017_05_01/orders.csv')
order_products_train = pd.read_csv('instacart_2017_05_01/order_products__train.csv')
order_products_prior = pd.read_csv('instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('instacart_2017_05_01/products.csv')

## EDA - Warmup Questions

What information is included in the four DataFrames?

- `orders` contains individual orders placed by customers. Each order has a unique `'order_id'`.
- `order_products_prior` contains the items for each order in `orders` where `'eval_set' == 'prior'`.
- `order_products_train` contains the items for each order in `orders` where `'eval_set' == 'train'`.
- `products` contains all the information for each product a customer can order. Each product has a unique `'product_id'`.

What information is contained in the column `orders['eval_set']`?

In [None]:
orders['eval_set'].value_counts()

The first row of `orders['order_id']` is `2539329`. Where can we find the items that were included in that order?

In [None]:
orders.loc[0]

In [None]:
mask = order_products_prior['order_id'] == 2539329
order_products_prior[mask]

The first row of `order_products_prior['product_id']` is `33120`. What is the name of that product?

In [None]:
order_products_prior.loc[0]

In [None]:
products.head()

In [None]:
mask = products['product_id'] == 33120
products[mask]
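
The two lookups above (order to items, item to name) can also be done in one step by merging the order items with the product table. This is a minimal sketch on made-up data: `toy_items`, `toy_products`, and the IDs and names in them are illustrative stand-ins, not rows from the Instacart files.

```python
import pandas as pd

# Toy stand-ins for order_products_prior and products (values are made up).
toy_items = pd.DataFrame({
    'order_id': [1, 1, 2],
    'product_id': [10, 20, 10],
})
toy_products = pd.DataFrame({
    'product_id': [10, 20, 30],
    'product_name': ['Banana', 'Whole Milk', 'Bread'],
})

# A left merge keeps every order line and attaches the matching product name.
toy_named = toy_items.merge(toy_products, on='product_id', how='left')

# Names of the items in order 1.
order_1_items = toy_named.loc[toy_named['order_id'] == 1, 'product_name'].tolist()
print(order_1_items)
```

The same pattern should apply to `order_products_prior` and `products`, since both share the `'product_id'` column.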

## Define Our Machine Learning Problem

- We want to predict whether or not a customer will purchase an item.
- The most frequently ordered item is `'banana'` (`product_id` `24852`), so that is the item we will predict.

In [None]:
order_products_prior['product_id'].value_counts()

In [None]:
mask = products['product_id'] == 24852
products[mask]

## Create our feature matrix and our target vector

1. Limit our feature matrix to `'train'` orders.

In [None]:
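# `.copy()` gives us an independent DataFrame, so adding columns later
# does not trigger a SettingWithCopyWarning
X = orders[orders['eval_set'] == 'train'].copy()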

2. "Check every order id for the product id and see if the product id is the one for banana" (Marcos).

First, identify banana in `order_products_train`.

In [None]:
order_products_train['is_banana'] = (
    order_products_train['product_id'] == 24852
).astype(int)

Second, make a list of the `'order_id'`s that have `'banana'` in them.

In [None]:
mask = order_products_train['is_banana'] == 1
banana_order_ids = order_products_train[mask]['order_id'].unique()

Third, use the list of `'order_id'`s to create our target.

In [None]:
X['has_banana'] = X['order_id'].isin(banana_order_ids)
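
To check that the three steps do what we expect, here is a small sketch that runs them on made-up data. `toy_orders` and `toy_items` are hypothetical stand-ins for `X` and `order_products_train`; only the banana `product_id` (`24852`) comes from the notebook.

```python
import pandas as pd

TARGET_ID = 24852  # banana product_id, as used above

# Toy stand-ins (made up): three orders and their line items.
toy_orders = pd.DataFrame({'order_id': [1, 2, 3]})
toy_items = pd.DataFrame({
    'order_id':   [1, 1, 2, 3, 3],
    'product_id': [24852, 100, 200, 300, 24852],
})

# Step 1: flag the line items that are the target product.
is_target = toy_items['product_id'] == TARGET_ID

# Step 2: collect the order ids that contain at least one flagged item.
target_order_ids = toy_items.loc[is_target, 'order_id'].unique()

# Step 3: an order is positive if its id is in that collection.
toy_orders['has_banana'] = toy_orders['order_id'].isin(target_order_ids)
print(toy_orders)
```

Orders 1 and 3 contain the target product, so only those two rows should come out `True`.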

## Create new features

Size of order

In [None]:
order_size = order_products_train.groupby('order_id')['product_id'].count()
X = X.merge(order_size, left_on='order_id', right_index=True)
X = X.rename(columns={"product_id": "n_items"})
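
One thing to keep in mind with the merge above: `merge` defaults to an inner join, so any order without rows in the items table would be dropped from `X`. The sketch below shows the difference on made-up data (`toy_orders` and `toy_items` are hypothetical; order 3 deliberately has no items).

```python
import pandas as pd

# Toy stand-ins (made up): order 3 has no line items.
toy_orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [7, 8, 9]})
toy_items = pd.DataFrame({
    'order_id':   [1, 1, 1, 2],
    'product_id': [10, 20, 30, 10],
})

# Count line items per order and name the resulting Series directly.
n_items = toy_items.groupby('order_id')['product_id'].count().rename('n_items')

# Default (inner) join: order 3 disappears.
inner = toy_orders.merge(n_items, left_on='order_id', right_index=True)

# Left join: order 3 is kept, with a missing n_items value.
left = toy_orders.merge(n_items, left_on='order_id', right_index=True, how='left')

print(len(inner), len(left))
```

Comparing `len(X)` before and after the merge is a quick way to confirm that no rows were lost.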

Ordered bananas previously

In [None]:
# Identify bananas
order_products_prior['is_banana'] = (
    order_products_prior['product_id'] == 24852
).astype(int)

# Identifying prior orders with banana
mask = order_products_prior['is_banana'] == 1
banana_order_id_prior = order_products_prior[mask]['order_id'].unique()
orders['prior_banana'] = orders['order_id'].isin(banana_order_id_prior)

# Identify `user_id`s associated with those prior orders
previous_banana_users = orders[orders['prior_banana']]['user_id'].unique()

# Find those `user_id`s in `X`
X['previous_banana'] = X['user_id'].isin(previous_banana_users)

# Split Data

Split target from feature matrix

In [None]:
target = 'has_banana'
y = X[target]
X = X[['order_dow', 
       'order_hour_of_day',
       'days_since_prior_order',
       'n_items',
       'previous_banana']]

Split data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Establish Baseline

In [None]:
print('Baseline Acc:', y_train.value_counts(normalize=True).max())
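
The baseline is the accuracy we would get by always predicting the majority class. A minimal sketch on a made-up target (`toy_y` is hypothetical, not the real `y_train`):

```python
import pandas as pd

# Toy target (made up): three negatives, one positive.
toy_y = pd.Series([False, False, False, True])

# Share of each class, largest first.
class_shares = toy_y.value_counts(normalize=True)

majority_class = class_shares.idxmax()  # the class a naive model always predicts
baseline_acc = class_shares.max()       # accuracy of that naive model

print(majority_class, baseline_acc)
```

A model is only useful if it beats this number, which matters here because most orders do not contain the target item.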

# Build Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Fix the seed so the results are reproducible across runs
model = RandomForestClassifier(random_state=42)

model.fit(X_train, y_train);

# Check Metrics

In [None]:
print('Training Accuracy:', model.score(X_train, y_train))
print('Test Accuracy:', model.score(X_test, y_test))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, model.predict(X_test)))
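
The report lists precision and recall for each class. As a rough sketch of what those two numbers mean for the positive class, the example below computes them by hand on made-up labels (`y_true` and `y_pred` here are illustrative, not model output).

```python
# Toy labels and predictions (made up for illustration).
y_true = [True, False, True, False, False, True]
y_pred = [True, False, False, False, True, True]

pairs = list(zip(y_true, y_pred))
tp = sum(t and p for t, p in pairs)        # predicted positive, actually positive
fp = sum((not t) and p for t, p in pairs)  # predicted positive, actually negative
fn = sum(t and (not p) for t, p in pairs)  # predicted negative, actually positive

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f'precision={precision:.2f} recall={recall:.2f}')
```

With an imbalanced target, these are usually more informative than accuracy alone.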