# A4 Instacart

## Contents

#### Libraries and Import

#### Security Concerns
- Removing Name column

#### Geographic Regions  
- Add a new variable with regional categories based on the States column  
- Region definitions: https://simple.wikipedia.org/wiki/List_of_regions_of_the_United_States  
- Cross region with spending to determine any geographic trends

#### Low-activity Exclusion  
- Create a flag for low order volume  
- Export dataset of only "active" customers  





### Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats

In [3]:
# The dataframe is wide, so show every column when displaying it.
pd.set_option("display.max_columns", None)

### Import (+ data cleaning from 4.9)

In [2]:
# Path variable to our project folder
path = r'D:\2021 CareerFoundry Course\Immersion\Instacart Basket Analysis A4'

In [3]:
# Read pickle file to df
ords_prods_cust = pd.read_pickle(os.path.join(path, '02 Data', 'Processed', 'ords_prods_cust_merge2.pkl'))

In [5]:
# View df
ords_prods_cust.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404859 entries, 0 to 32404858
Data columns (total 33 columns):
 #   Column                        Dtype   
---  ------                        -----   
 0   order_id                      int32   
 1   user_id                       int32   
 2   order_number                  int8    
 3   orders_day_of_week            int8    
 4   order_hour_of_day             int8    
 5   days_since_user_last_ordered  float16 
 6   Repeat_orders                 bool    
 7   product_id                    int32   
 8   add_to_cart_order             int32   
 9   reordered                     int32   
 10  product_name                  object  
 11  aisle_id                      int8    
 12  department_id                 int8    
 13  prices                        float64 
 14  price_range_loc               object  
 15  busiest_day                   object  
 16  busiest_days                  object  
 17  busiest_hour                  object  
 18  

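We also need to clean up the outlier prices that were identified in 4.8/4.9: any price above 100 is treated as an error and replaced with a missing value.

In [None]:
# Replace outlier prices (anything over 100) with NaN
ords_prods_cust.loc[ords_prods_cust['prices'] > 100, 'prices'] = np.nan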
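To illustrate what the masking step above does, here is a minimal sketch on a toy dataframe. The `toy` frame and its values are invented for illustration; only the `prices` column name and the threshold of 100 come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prices column (values are made up for illustration)
toy = pd.DataFrame({'prices': [9.0, 12.5, 99999.0, 4.2]})

# Same masking rule as above: anything over 100 becomes missing
toy.loc[toy['prices'] > 100, 'prices'] = np.nan

# pandas skips NaN by default in aggregations such as max() and mean(),
# so the outlier no longer distorts summary statistics
n_missing = toy['prices'].isna().sum()
max_price = toy['prices'].max()
print(n_missing, max_price)
```

Setting the values to NaN rather than dropping the rows keeps the order and product information intact for analyses that do not use price.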

### Security/PID 
The name column has to be removed. The data contains no emails, phone numbers, or addresses. Even so, the remaining columns combined (for example gender, state, age, family status, and income) could plausibly be used to work out someone's identity. I can't fully address that, since the analysis needs most of those data points, but it is worth mentioning.

In [6]:
# Remove the Name column
ords_prods_cust.drop(columns = ['name'], inplace = True)
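One way to gauge the re-identification concern raised above is to count how many customers share each combination of descriptive columns. The sketch below is only an illustration on invented data: `toy_customers`, `quasi_identifiers`, and the values are hypothetical, and the choice of which columns count as identifying is an assumption, not something decided in this notebook.

```python
import pandas as pd

# Hypothetical customer-level rows; column names follow the dataframe above,
# values are invented for illustration
toy_customers = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'gender': ['Female', 'Female', 'Male', 'Male'],
    'state': ['Alabama', 'Alabama', 'Ohio', 'Ohio'],
    'age': [31, 31, 45, 52],
})

# Assumed set of columns that could identify a person when combined
quasi_identifiers = ['gender', 'state', 'age']

# Number of distinct customers sharing each combination
group_sizes = toy_customers.groupby(quasi_identifiers)['user_id'].nunique()

# Combinations that point to exactly one customer are the risky ones
unique_combos = int((group_sizes == 1).sum())
print(unique_combos)
```

On the real data, a large number of single-customer combinations would support the concern; coarsening columns (for example age bands instead of exact age) is a common way to reduce it.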

### Geographic Regions

In [8]:
# Defining function "region_label"
# This looks at the state column and assigns a regional category (Northeast, Midwest, South, West) 
# based on the contents.
def region_label(row):

  if row['state'] in ['Maine', 'New Hampshire','Vermont','Massachusetts','Rhode Island','Connecticut',
                    'New York', 'Pennsylvania', 'New Jersey']:
    return 'Northeast'
  elif row['state'] in ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa',
                         'Missouri', 'Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio']:
    return 'Midwest'
  elif row['state'] in ['Oklahoma', 'Texas', 'Arkansas', 'Louisiana', 'Kentucky', 'Tennessee', 'Mississippi',
                        'Alabama', 'Delaware', 'Maryland', 'District of Columbia', 'Virginia', 
                        'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida']:
    return 'South'
  elif row['state'] in ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona', 'New Mexico',
                       'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii']:
    return 'West'
  else:
    return 'Not enough data'

In [10]:
ords_prods_cust['region'] = ords_prods_cust.apply(region_label, axis = 1)
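Row-wise `apply` with `axis = 1` calls the Python function once per row, which can be slow on roughly 32 million rows. A possible alternative, sketched here on a toy dataframe, is to build a state-to-region dictionary and use `Series.map`. The names `regions`, `state_to_region`, and `toy` are introduced for illustration; the state lists are copied from `region_label` above.

```python
import pandas as pd

# Region lists copied from region_label above
regions = {
    'Northeast': ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island',
                  'Connecticut', 'New York', 'Pennsylvania', 'New Jersey'],
    'Midwest': ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa',
                'Missouri', 'Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio'],
    'South': ['Oklahoma', 'Texas', 'Arkansas', 'Louisiana', 'Kentucky', 'Tennessee',
              'Mississippi', 'Alabama', 'Delaware', 'Maryland', 'District of Columbia',
              'Virginia', 'West Virginia', 'North Carolina', 'South Carolina',
              'Georgia', 'Florida'],
    'West': ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona',
             'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii'],
}

# Invert to a flat lookup: state name -> region name
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

# Toy data; 'Guam' stands in for any value not covered by the lists
toy = pd.DataFrame({'state': ['Alabama', 'Ohio', 'Hawaii', 'Vermont', 'Guam']})

# map() leaves unmatched states as NaN; fillna reproduces the function's fallback label
toy['region'] = toy['state'].map(state_to_region).fillna('Not enough data')
print(toy)
```

The same two lines would apply to `ords_prods_cust['state']`, and should give the same labels as `region_label` as long as the state names are spelled identically.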

In [11]:
ords_prods_cust['region'].value_counts()

South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: region, dtype: int64

#### Crossing region w/ spending flag

In [9]:
# Creating the crosstab to count spending categories by region.
# This could be exported to Excel and either reported on, or made into an Excel pivot report
crosstab = pd.crosstab(ords_prods_cust['spend_flag'], ords_prods_cust['region'],
                       dropna = False)
crosstab

region      Midwest  Northeast     South     West
spend_flag
High         155975     108225    209691   160354
Low         7441350    5614511  10582194  8132559


In [10]:
# Midwest Spend ratio (high:low) percentage
(155975/7441350)*100

2.0960578389673916

In [11]:
# Northeast Spend ratio (high:low) percentage
(108225/5614511)*100

1.927594406707904

In [12]:
# South Spend ratio (high:low) percentage
(209691/10582194)*100

1.9815456038700483

In [13]:
# West Spend ratio (high:low) percentage
(160354/8132559)*100

1.9717532943812641
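The four hand-typed divisions above can also be computed in one step from the crosstab itself, which avoids copying numbers by hand. This is a sketch that rebuilds the crosstab from the counts shown in the output; in the notebook the existing `crosstab` variable could be used directly. The names `high_low_pct` and `high_share_pct` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the crosstab output above
crosstab = pd.DataFrame(
    {'Midwest': [155975, 7441350],
     'Northeast': [108225, 5614511],
     'South': [209691, 10582194],
     'West': [160354, 8132559]},
    index=['High', 'Low'],
)

# High-to-low ratio per region, as a percentage (same formula as the cells above)
high_low_pct = crosstab.loc['High'] / crosstab.loc['Low'] * 100

# Alternative view: share of all rows in each region that carry the High flag
high_share_pct = crosstab.loc['High'] / crosstab.sum() * 100

print(high_low_pct.round(3))
print(high_share_pct.round(3))
```

The second series uses the regional total as the denominator, which is often easier to explain than a high:low ratio, though both rank the regions the same way here.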

These are all fairly similar. The Midwest has a *slightly* higher high-to-low spending ratio, but the gap is small: roughly 0.1 to 0.2 percentage points above the other regions. I'm not sure the difference is meaningful in practice, although with a sample this large even a small gap may register as statistically significant.
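A chi-square test of independence is one way to check whether the regional difference is statistically distinguishable from chance. The sketch below uses `scipy.stats.chi2_contingency` on the counts from the crosstab. One caveat is an assumption worth stating loudly: each row in the dataframe is a product within an order, not a customer, so the observations are not independent and the test will overstate significance.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the crosstab above
# Rows: High, Low. Columns: Midwest, Northeast, South, West.
observed = np.array([
    [155975, 108225, 209691, 160354],
    [7441350, 5614511, 10582194, 8132559],
])

# Returns the test statistic, p-value, degrees of freedom,
# and the counts expected if spend_flag and region were independent
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

With tens of millions of rows, a very small p-value is expected even for tiny differences, so the size of the gap matters more than the p-value when deciding whether the trend is worth reporting. Deduplicating to one row per customer before testing would be a more defensible setup.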

#### Low-Activity Exclusion Flag

In [4]:
# Customers to be included
ords_prods_cust.loc[ords_prods_cust['max_order'] >= 5, 'exclusion_flag'] = 'Active'

In [5]:
# Customers to be excluded: fewer than 5 orders
ords_prods_cust.loc[ords_prods_cust['max_order'] < 5, 'exclusion_flag'] = 'Inactive'

In [6]:
ords_prods_cust['exclusion_flag'].value_counts()

Active      30964564
Inactive     1440295
Name: exclusion_flag, dtype: int64
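The two `.loc` assignments above can also be written as a single step with `np.where`. This is a sketch on a toy dataframe with invented `max_order` values; the threshold of 5 and the labels come from the cells above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the max_order column (values are made up for illustration)
toy = pd.DataFrame({'max_order': [3, 5, 10, 4]})

# 'Active' when the customer has 5 or more orders, otherwise 'Inactive'.
# Note: unlike the two .loc assignments, np.where would label a missing
# max_order as 'Inactive' instead of leaving the flag empty.
toy['exclusion_flag'] = np.where(toy['max_order'] >= 5, 'Active', 'Inactive')
print(toy)
```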

In [16]:
ords_prods_cust.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_user_last_ordered,Repeat_orders,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_hour,max_order,loyalty_flag,mean_price,spend_flag,median_freq,order_freq_flag,gender,state,age,date_joined,n_dependants,fam_status,income,_merge,region,exclusion_flag
0,2539329,1,1,2,8,,False,196,1,0,Soda,77,7,9.0,Mid,Regularly busy,Regular days,Average,10,New customer,6.367188,Low,20.5,Low,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Active
1,2398795,1,2,3,7,15.0,True,196,1,1,Soda,77,7,9.0,Mid,Regularly busy,Slowest days,Average,10,New customer,6.367188,Low,20.5,Low,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Active
2,473747,1,3,3,12,21.0,True,196,1,1,Soda,77,7,9.0,Mid,Regularly busy,Slowest days,Busy,10,New customer,6.367188,Low,20.5,Low,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Active
3,2254736,1,4,4,7,29.0,True,196,1,1,Soda,77,7,9.0,Mid,Least busy,Slowest days,Average,10,New customer,6.367188,Low,20.5,Low,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Active
4,431534,1,5,4,15,28.0,True,196,1,1,Soda,77,7,9.0,Mid,Least busy,Slowest days,Busy,10,New customer,6.367188,Low,20.5,Low,Female,Alabama,31,2/17/2019,3,married,40423,both,South,Active


#### Exclude Inactive Customers

In [64]:
# Create a subset that only includes Active customers, based on exclusion_flag
ords_active_only = ords_prods_cust.loc[ords_prods_cust['exclusion_flag'] == 'Active']

In [62]:
# Filtering dropped 1,440,295 rows (about 1.4 million)
ords_active_only.shape

(30964564, 39)

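In [5]:
# Export the active-customer subset
ords_active_only.to_pickle(os.path.join(path, '02 Data', 'Processed', 'ords_active_sample_cleaned.pkl'))

NameError: name 'ords_active_only' is not defined

This error most likely comes from running the export cell in a restarted kernel (the cell counter is back at 5) before `ords_active_only` was created. Re-run the filtering cell above first, then run the export again.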