# Contents
## Importing Libraries and Data (orders_products_all.pkl)
## Data Security Check
## Spending Habits by Region
### &emsp; Creating the 'region' column
### &emsp; Comparing 'region' and 'spending_flag'
## Excluding Low-Activity Customers
### &emsp; Creating the 'exclusion_flag' column
### &emsp; Creating the sample
### &emsp; Exporting the sample as orders_products_high.pkl
## Profiling Customers
### &emsp; Age
### &emsp; Income
### &emsp; Department
### &emsp; Family Status
## Profile Visualizations
## Profile Aggregation
## Customer Profiles by Region and Department
## Visualizations for Results
## Extra Analysis
### &emsp; Ordering Habits by Region
### &emsp; Ordering Habits by Loyalty
## Exporting as orders_products_final.pkl

# Step 1: Importing Libraries and Data

In [1]:
# Importing libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# Turning project folder into a string

path = r'C:\Users\davau\OneDrive - College of the Sequoias\Career Foundry\Data Immersion\Achievement 4 (Python)\Instacart Basket Analysis'

In [None]:
# Importing orders_products_all.pkl

df = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_all.pkl'))

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.head()

# Step 2: Checking for data security

I removed the customer names from customers.csv in Task 4.9 Part 1 before merging that data with the rest of my data.  The remaining columns cannot be used to identify any individual customer, even if taken all together.

# Step 3: Spending Habits by Region

Determine whether there’s a difference in spending habits between the different U.S. regions.

## Creating the 'region' column

In [None]:
# Creating an empty list that will become the 'region' column

region = []

In [None]:
# Checking to see how the states are written in the df

df['state'].value_counts(dropna = False)

In [None]:
# Filling 'region' with 'northeast', 'midwest', 'south', or 'west'

for x in df['state']:
    if x in ['Maine','New Hampshire', 'Vermont', 'Massachusetts','Rhode Island','Connecticut','New York','Pennsylvania','New Jersey']:
        region.append('northeast')
    elif x in ['Wisconsin','Michigan','Illinois','Indiana','Ohio','North Dakota','South Dakota','Nebraska','Kansas','Minnesota','Iowa','Missouri']:
        region.append('midwest')
    elif x in ['Delaware','Maryland','District of Columbia','Virginia','West Virginia','North Carolina','South Carolina','Georgia','Florida','Kentucky','Tennessee','Mississippi','Alabama','Oklahoma','Texas','Arkansas','Louisiana']:
        region.append('south')
    else:
        region.append('west')
            

In [None]:
region

In [None]:
# Adding 'region' column to df

df['region'] = region

In [None]:
# Getting frequency distribution for 'region'

df['region'].value_counts(dropna = False)
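
A possible alternative to the loop above, shown only as a sketch on toy data: a dictionary lookup with `Series.map`. The `STATE_TO_REGION` dictionary and the `states` series are hypothetical stand-ins (only a few states are listed), not part of the project. Unlike the `else` branch, `map` leaves unmatched values as NaN, so a misspelled or missing state is not silently labeled 'west'.

```python
import pandas as pd

# Hypothetical state-to-region lookup; only a few states are listed for brevity.
STATE_TO_REGION = {
    'Maine': 'northeast', 'New York': 'northeast',
    'Ohio': 'midwest', 'Iowa': 'midwest',
    'Texas': 'south', 'Florida': 'south',
    'California': 'west', 'Oregon': 'west',
}

# Toy stand-in for df['state'], including one misspelled value.
states = pd.Series(['Maine', 'Texas', 'Ohio', 'California', 'Texsa'])

# map() leaves unmatched values as NaN instead of assigning a fallback region.
region_demo = states.map(STATE_TO_REGION)

# States that did not match any key in the lookup.
unmapped = states[region_demo.isna()].unique().tolist()
print(region_demo.tolist())
print(unmapped)
```

With the full 50-state dictionary, an empty `unmapped` list would confirm that every row received a region.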

## Comparing 'region' with 'spending_flag'

In [None]:
# Creating a crosstab between 'region' and 'spending_flag'

region_spending_cross = pd.crosstab(df['region'], df['spending_flag'], dropna = False)

In [None]:
region_spending_cross

In [None]:
# Copying to clipboard to paste in Excel

region_spending_cross.to_clipboard()
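
Because the regions differ in size, raw counts are hard to compare directly. As a sketch on made-up data (the flag labels below are assumptions, not taken from the real df), passing `normalize='index'` to `pd.crosstab` gives the share of each spending flag within each region:

```python
import pandas as pd

# Toy data; the column names mirror df, but the values and labels are made up.
toy = pd.DataFrame({
    'region': ['south', 'south', 'south', 'south', 'west', 'west'],
    'spending_flag': ['Low spender', 'Low spender', 'Low spender', 'High spender',
                      'Low spender', 'High spender'],
})

# normalize='index' divides each row by its total, so every region sums to 1.
share = pd.crosstab(toy['region'], toy['spending_flag'], normalize='index')
print(share)
```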

# Step 4: Excluding low-activity customers

## Creating the exclusion flag

In [None]:
df.loc[df['max_order'] < 5, 'low_order_flag'] = 'Low order customer'

In [None]:
df.loc[df['max_order'] >= 5, 'low_order_flag'] = 'High order customer'

In [None]:
df['low_order_flag'].value_counts(dropna = False)

## Creating a sample with only high order customers

In [None]:
# Creating the sample

df_high = df[df['low_order_flag'] == 'High order customer']

In [None]:
df_high.head()

In [None]:
# Exporting the sample as orders_products_high.pkl

df_high.to_pickle(os.path.join(path,'02 Data','Prepared Data','orders_products_high.pkl'))

# Step 5: Profiling Customers

I've been asked to create a profiling variable based on age, income, certain goods in the 'department_id' column, and number of dependents. 

In addition to that, the 'family_status' column contains useful information about whether people are single or married, which affects how we view the 'number_of_dependants' column.

If I take each of those 5 columns, assign them 2 values (e.g. "Young" and "Old" for the 'age' column), and then look at every combination of values, there will be 2^5=32 distinct profile types.  This will be too noisy to make sense of.

So instead, I will simply create flags for each of these columns that I can use to answer the questions in the Project Brief.
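
The count of 32 can be checked with a quick sketch. The two-level labels below are hypothetical and only illustrate how the combinations multiply:

```python
from itertools import product

# Hypothetical two-level labels for each of the five profiling columns.
levels = {
    'age': ['Young', 'Old'],
    'income': ['Low', 'High'],
    'department': ['Vegan', 'Non-vegan'],
    'dependants': ['None', 'Some'],
    'family_status': ['Single', 'Married'],
}

# Every combination of one label per column: 2 * 2 * 2 * 2 * 2 = 32.
combos = list(product(*levels.values()))
print(len(combos))
```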





## Age 

In [None]:
# Creating the age_flag

age_flag = []
for x in df['age'].tolist():
    if x <= 25:
        age_flag.append('Young')
    elif x > 25 and x < 65:
        age_flag.append('Middle-aged')
    elif x >= 65:
        age_flag.append('Senior')
    else:
        print('Weird value:', x)

In [None]:
# Adding age_flag to df as 'age_profile'

df['age_profile'] = age_flag

In [None]:
# Getting frequency distribution for 'age_profile'

df['age_profile'].value_counts(dropna = False)
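
The same binning could probably be done without a loop using `pd.cut`. This is a sketch on toy ages and assumes ages are whole numbers, so that "x < 65" is equivalent to "x <= 64":

```python
import numpy as np
import pandas as pd

# Toy ages; assumes whole-number ages.
ages = pd.Series([18, 25, 26, 64, 65, 81])

# right=True (the default) makes each bin include its upper edge:
# (-inf, 25], (25, 64], (64, inf)
age_profile_demo = pd.cut(
    ages,
    bins=[-np.inf, 25, 64, np.inf],
    labels=['Young', 'Middle-aged', 'Senior'],
)
print(age_profile_demo.tolist())
```

A missing age would come out as NaN here rather than being skipped, so the result keeps one label per row.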

## Income

In [None]:
# Creating the income_flag

income_flag = []
for x in df['income'].tolist():
    if x < 75000:
        income_flag.append('Low-income')
    elif x >= 75000 and x < 150000:
        income_flag.append('Mid-income')
    elif x >= 150000:
        income_flag.append('High-income')
    else:
        print('Weird value:', x)

In [None]:
# Adding income_flag to df as 'income_profile'

df['income_profile'] = income_flag

In [None]:
# Getting frequency distribution for 'income_profile'

df['income_profile'].value_counts(dropna = False)
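
One caveat with the loop above: a value that reaches the `else` branch is printed but not appended, so the list would end up shorter than df. A sketch of one way around this, on toy incomes, uses `np.select` with a default label (the 'Unknown' label is my own placeholder):

```python
import numpy as np
import pandas as pd

# Toy incomes, including a missing value.
income = pd.Series([30000, 75000, 149999, 150000, np.nan])

conditions = [
    income < 75000,
    (income >= 75000) & (income < 150000),
    income >= 150000,
]
labels = ['Low-income', 'Mid-income', 'High-income']

# Rows that match no condition (e.g. NaN) get the default label,
# so the result always has one entry per row.
income_profile_demo = np.select(conditions, labels, default='Unknown')
print(income_profile_demo.tolist())
```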

## Department

I will use the department_id column to separate the customers into vegans and non-vegans.

Vegans don't eat meat (department_id=12) or dairy (department_id=16).

I don't want to label each purchase as vegan or non-vegan, but the customer as vegan or non-vegan, based on their purchases.

In [None]:
# Creating a crosstab between 'department_id' and 'user_id'

dep_user_cross = pd.crosstab(df['department_id'], df['user_id'], dropna = False)

In [None]:
dep_user_cross

In [None]:
# Assigning nutrition flags to users and storing them in a Python dictionary

vegan_dict = dict()
for user in dep_user_cross:
    workinglist = dep_user_cross[user].tolist()
    if workinglist[11] == 0 and workinglist[15] == 0:   # the index starts at 0, so index 0 corresponds to dep_id=1
        vegan_dict[user] = 'Vegan'
    else:
        vegan_dict[user] = 'Non-vegan'

In [None]:
# Assigning user flags to the vegan_flag

vegan_flag = []
for user in df['user_id']:
    vegan_flag.append(vegan_dict[user])

In [None]:
# Adding vegan_flag to df as 'vegan_profile'

df['vegan_profile'] = vegan_flag

In [None]:
# Getting frequency distribution for 'vegan_profile'

df['vegan_profile'].value_counts(dropna = False)
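
The crosstab approach builds one column per user and relies on the positions of the department rows. A lighter alternative, sketched here on toy data, collects the users who ever bought from department 12 or 16 and flags everyone else as vegan:

```python
import numpy as np
import pandas as pd

# Toy order lines; 12 = meat, 16 = dairy, as in the text above.
toy = pd.DataFrame({
    'user_id':       [1, 1, 2, 2, 3],
    'department_id': [4, 19, 4, 16, 12],
})

# Users with at least one purchase from meat or dairy.
non_vegan_users = toy.loc[toy['department_id'].isin([12, 16]), 'user_id'].unique()

# Flag every row of a user according to that user's purchase history.
toy['vegan_profile'] = np.where(toy['user_id'].isin(non_vegan_users), 'Non-vegan', 'Vegan')
print(toy)
```

This matches on department_id values directly, so it does not depend on every department being present in the data.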

## Family status

In [None]:
df['family_status'].value_counts(dropna = False)

I will put information from the 'family_status' column together with information about the number of dependents to get a sense for the family structure.  Here's the plan:

(family_status == married) and (number_of_dependants == 1) : married, no children

(family_status == married) and (number_of_dependants >= 2) : married with children

(family_status in (single, divorced/widowed, living with parents and siblings)) and (number_of_dependants == 0) : single, no children

(family_status in (single, divorced/widowed, living with parents and siblings)) and (number_of_dependants >= 1) : single with children

In [None]:
# Creating the 'family_profile' column and labeling the 'married no children' customers

df.loc[(df['family_status'] == 'married') & (df['number_of_dependants'] == 1), 'family_profile'] = 'Married no children'

In [None]:
# Creating the 'family_profile' column and labeling the 'married with children' customers

df.loc[(df['family_status'] == 'married') & (df['number_of_dependants'] >= 2), 'family_profile'] = 'Married with children'

In [None]:
# Creating the 'family_profile' column and labeling the 'single no children' customers

df.loc[(df['family_status'] != 'married') & (df['number_of_dependants'] == 0), 'family_profile'] = 'Single no children'

In [None]:
# Creating the 'family_profile' column and labeling the 'single with children' customers

df.loc[(df['family_status'] != 'married') & (df['number_of_dependants'] >= 1), 'family_profile'] = 'Single with children'

In [None]:
# Getting frequency distribution for 'family_profile'

df['family_profile'].value_counts(dropna = False)

# Step 6: Profile Visualizations

Create an appropriate visualization to show the distribution of profiles.

In [None]:
# age_profile

age_profile_bar = df['age_profile'].value_counts().plot.bar(rot=0)

In [None]:
# Exporting the viz

age_profile_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'age_profile_bar.png'))

In [None]:
# income_profile

income_profile_bar = df['income_profile'].value_counts().plot.bar(rot=0)

In [None]:
# Exporting the viz

income_profile_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'income_profile_bar.png'))

In [None]:
# vegan_profile

vegan_profile_bar = df['vegan_profile'].value_counts().plot.bar(rot=0)

In [None]:
# Exporting the viz

vegan_profile_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'vegan_profile_bar.png'))

In [None]:
# family_profile

family_profile_bar = df['family_profile'].value_counts().plot.bar()

In [None]:
# Exporting the viz

family_profile_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'family_profile_bar.png'))
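
The four plot-and-export pairs above follow one pattern, so they could be wrapped in a small helper. This is only a sketch: `save_profile_bar` is a hypothetical function, and the toy data and temporary folder stand in for df and the Visualizations folder.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')   # headless backend for this sketch; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def save_profile_bar(frame, column, out_dir):
    """Plot the value counts of `column` as a bar chart and save it as <column>_bar.png."""
    ax = frame[column].value_counts().plot.bar(rot=0)
    out_file = os.path.join(out_dir, column + '_bar.png')
    ax.figure.savefig(out_file, bbox_inches='tight')
    plt.close(ax.figure)   # close the figure so the next chart starts on a fresh one
    return out_file

# Toy data and a temporary folder stand in for df and the project folder.
toy = pd.DataFrame({'age_profile': ['Young', 'Senior', 'Senior'],
                    'income_profile': ['Low-income', 'Mid-income', 'Mid-income']})
out_dir = tempfile.mkdtemp()
saved = [save_profile_bar(toy, col, out_dir) for col in ['age_profile', 'income_profile']]
print(saved)
```

`bbox_inches='tight'` is an optional choice that helps keep long tick labels, such as the family profiles, from being cut off in the saved image.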

# Step 7: Profile Aggregation

Aggregate the max, mean, and min variables on a customer-profile level for usage frequency and expenditure.

## Age

In [None]:
# age_profile stats

df.groupby('age_profile').agg({'days_since_prior_order': ['mean', 'max', 'min'], 'prices': ['mean', 'max', 'min']})

Usage frequency and expenditure are similar across all age groups, though Seniors do tend to shop a little more frequently and purchase slightly more expensive products, on average.

## Income

In [None]:
# income_profile stats

df.groupby('income_profile').agg({'days_since_prior_order': ['mean', 'max', 'min'], 'prices': ['mean', 'max', 'min']})

Usage frequency and expenditure are, again, fairly close among the income groups.  Low-income customers tend to purchase slightly cheaper options, and they shop a little less frequently than mid- and high-income customers.

## Veganism

In [None]:
# vegan_profile stats

df.groupby('vegan_profile').agg({'days_since_prior_order': ['mean', 'max', 'min'], 'prices': ['mean', 'max', 'min']})

Here we see more of a difference, especially in usage frequency.  Vegans tend to go longer between orders than their non-vegan counterparts.  They also tend to go with cheaper products (which is surprising, given that vegan products can be expensive).

## Family Status

In [None]:
# family_profile stats

df.groupby('family_profile').agg({'days_since_prior_order': ['mean', 'max', 'min'], 'prices': ['mean', 'max', 'min']})

Again, we have remarkably consistent results across all groups with respect to usage frequency and expenditure.  "Single with children" customers shop the most frequently and purchase more expensive items than their peers while "Married with children" customers shop the least frequently and purchase cheaper items than their peers.  These differences are minor, though.
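
One thing to keep in mind with the aggregations in this step: each row of df is a product within an order, so an order with many products counts many times in the mean of 'days_since_prior_order'. The sketch below, on toy data, compares the row-level mean with an order-level mean taken after keeping one row per order. The output column names are my own.

```python
import pandas as pd

# Toy order lines: order 10 has three products, order 11 has one.
toy = pd.DataFrame({
    'order_id':               [10, 10, 10, 11],
    'age_profile':            ['Senior', 'Senior', 'Senior', 'Senior'],
    'days_since_prior_order': [30.0, 30.0, 30.0, 10.0],
    'prices':                 [2.0, 4.0, 6.0, 8.0],
})

# Row-level mean: the three-product order counts three times.
row_level = toy.groupby('age_profile')['days_since_prior_order'].mean()

# Order-level mean: keep one row per order before averaging.
orders = toy.drop_duplicates(subset='order_id')
order_level = orders.groupby('age_profile').agg(
    mean_days=('days_since_prior_order', 'mean'),
    max_days=('days_since_prior_order', 'max'),
    min_days=('days_since_prior_order', 'min'),
)
print(row_level['Senior'], order_level.loc['Senior', 'mean_days'])
```

Whether the row-level or order-level view is more appropriate depends on the question being asked; the price statistics are naturally row-level.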

# Step 8: Customer Profiles by Region and Department

## Customer Profiles by Region

### Age

In [None]:
# Creating a crosstab comparing 'age_profile' and 'region', looking at column percentages

age_region_cross = pd.crosstab(df['age_profile'], df['region'], normalize = 'columns')
# normalize = 'columns' gives the column percentages

In [None]:
age_region_cross

Similar age groups across all regions.

### Income

In [None]:
# Creating a crosstab comparing 'income_profile' and 'region', looking at column percentages

income_region_cross = pd.crosstab(df['income_profile'], df['region'], normalize = 'columns')
# normalize = 'columns' gives the column percentages

In [None]:
income_region_cross

Similar income groups across all regions.

### Veganism

In [None]:
# Creating a crosstab comparing 'vegan_profile' and 'region', looking at column percentages

vegan_region_cross = pd.crosstab(df['vegan_profile'], df['region'], normalize = 'columns')
# normalize = 'columns' gives the column percentages

In [None]:
vegan_region_cross

Similar rates of veganism in the midwest and northeast, but there are fewer vegans in the south, and more vegans in the west.

### Family Status

In [None]:
# Creating a crosstab comparing 'family_profile' and 'region', looking at column percentages

family_region_cross = pd.crosstab(df['family_profile'], df['region'], normalize = 'columns')
# normalize = 'columns' gives the column percentages

In [None]:
family_region_cross

Similar family structures across all regions.

## Customer Profiles by Department

### Age

In [None]:
# Creating a crosstab comparing 'age_profile' and 'department_id', looking at row percentages

age_department_cross = pd.crosstab(df['age_profile'], df['department_id'], normalize = 'index')   
# normalize = 'index' gives row percentages

In [None]:
pd.set_option('display.max_columns', None)    # displays all columns

age_department_cross

To three decimal places, all columns are the same (+/- one decimal place) except

4 (produce)


To two significant digits, all columns are the same (+/- one sig dig) except

5 (alcohol)

8 (pets)
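
Instead of comparing the columns by eye, the gap between the largest and smallest share in each department can be computed directly. This is a sketch on a made-up table shaped like age_department_cross, and the tolerance is an arbitrary value chosen for illustration:

```python
import pandas as pd

# Toy row-percentage table shaped like age_department_cross (values are made up).
cross = pd.DataFrame(
    {4: [0.290, 0.293, 0.296], 5: [0.005, 0.004, 0.005], 8: [0.003, 0.003, 0.003]},
    index=['Middle-aged', 'Senior', 'Young'],
)

# Largest minus smallest share within each department column.
spread = (cross.max() - cross.min()).sort_values(ascending=False)

# Departments whose shares differ by more than a hypothetical tolerance.
tolerance = 0.0015
flagged = spread[spread > tolerance].index.tolist()
print(flagged)
```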

### Income

In [None]:
# Creating a crosstab comparing 'income_profile' and 'department_id', looking at row percentages

income_department_cross = pd.crosstab(df['income_profile'], df['department_id'], normalize = 'index')
# normalize = 'index' gives row percentages

In [None]:
income_department_cross

To three decimal places, all columns are the same (+/- two decimal places) except 

1 (frozen)

4 (produce) *big one

7 (beverages)

9 (dry goods pasta)

12 (meat seafood) *mid sized one

13 (pantry)

15 (canned goods)

16 (dairy eggs) *mid sized one

19 (snacks) *big one


To two significant digits, all columns are the same (+/- two sig dig) except 1, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 19

### Veganism

In [None]:
# Creating a crosstab comparing 'vegan_profile' and 'department_id'

vegan_department_cross = pd.crosstab(df['vegan_profile'], df['department_id'], normalize = 'index')
# normalize = 'index' gives row percentages

In [None]:
vegan_department_cross

There are differences here in nearly all departments.  Vegans buy proportionally less from 1 (frozen), 3 (bakery), 9 (dry goods pasta), 13 (pantry), 15 (canned goods), and 20 (deli), and, by definition of the flag, nothing from meat or dairy.

Vegans buy proportionally more from 4 (produce), 5 (alcohol), 7 (beverages), 11 (personal care), 17 (household), and 19 (snacks).

In [None]:
# This isn't a large df, so I'm going to do some work with it in Excel.  Copying to clipboard...

vegan_department_cross.to_clipboard()

The largest differences in absolute terms, besides meat and dairy, are in department 7 (beverages) at 11.2% and department 19 (snacks) at 7.8%.  There's also a fairly sizeable difference in department 5 (alcohol) at 3.0%.

The largest differences (relatively speaking), besides meat and dairy, are department 5 (alcohol) from which vegans purchase 87.3% more often than non-vegans, department 9 (dry goods pasta) from which vegans purchase 65.2% less often than non-vegans, and department 17 (household) from which vegans purchase 65.1% more often than non-vegans.
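
The absolute and relative comparisons could also be done in pandas rather than Excel. This is a sketch on a made-up table shaped like vegan_department_cross; the numbers do not come from the real data:

```python
import pandas as pd

# Toy table shaped like vegan_department_cross (values are made up).
cross = pd.DataFrame(
    {5: [0.004, 0.008], 7: [0.08, 0.10], 9: [0.03, 0.01]},
    index=['Non-vegan', 'Vegan'],
)

# Absolute difference in share, in percentage points.
abs_diff = (cross.loc['Vegan'] - cross.loc['Non-vegan']) * 100

# Relative difference, as a percentage of the non-vegan share.
rel_diff = (cross.loc['Vegan'] / cross.loc['Non-vegan'] - 1) * 100

summary = pd.DataFrame({'abs_diff_pp': abs_diff, 'rel_diff_pct': rel_diff})
print(summary.round(1))
```

A small department can show a large relative difference with only a tiny absolute one, which is why the two rankings pick out different departments.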

### Family Status

In [None]:
# Creating a crosstab comparing 'family_profile' and 'department_id'

family_department_cross = pd.crosstab(df['family_profile'], df['department_id'], normalize = 'index')
# normalize = 'index' gives row percentages

In [None]:
family_department_cross

We see fairly similar distributions here.  Department 5 (alcohol) is slightly more elevated for single customers, especially those with children.  Single customers with children purchase more often from department 8 (pets).  Otherwise, it's fairly uniform.  

# Step 9: Visualizations for Results

## Visualizations for Profiles by Region

### Age

In [None]:
# age_profile by region

age_region_bar = age_region_cross.plot.bar(rot=0)

In [None]:
# Exporting viz

age_region_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'age_region_bar.png'))

### Income

In [None]:
# income_profile by region

income_region_bar = income_region_cross.plot.bar(rot=0)

In [None]:
# Exporting viz

income_region_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'income_region_bar.png'))

### Veganism

In [None]:
# vegan_profile by region

vegan_region_bar = vegan_region_cross.plot.bar(rot=0)

In [None]:
# Exporting viz

vegan_region_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'vegan_region_bar.png'))

### Family Status

In [None]:
# family_profile by region

family_region_bar = family_region_cross.plot.bar()

In [None]:
# Exporting viz

family_region_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'family_region_bar.png'))

## Visualizations for Profiles by Department

### Age

In [None]:
# Creating a new crosstab so that the departments are on the x-axis

age_department_cros = pd.crosstab(df['department_id'], df['age_profile'])

In [None]:
# Creating a stacked bar plot of the new crosstab

age_department_bar = age_department_cros.plot.bar(stacked = True)

In [None]:
# Exporting the viz

age_department_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'age_department_bar.png'))

### Income

In [None]:
# Creating a new crosstab so that the departments are on the x-axis

income_department_cros = pd.crosstab(df['department_id'], df['income_profile'])

In [None]:
# Creating a stacked bar plot of the new crosstab

income_department_bar = income_department_cros.plot.bar(stacked = True)

In [None]:
# Exporting the viz

income_department_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'income_department_bar.png'))

### Veganism

In [None]:
# Creating a new crosstab so that the departments are on the x-axis

vegan_department_cros = pd.crosstab(df['department_id'], df['vegan_profile'])

In [None]:
# Creating a stacked bar plot of the new crosstab

vegan_department_bar = vegan_department_cros.plot.bar(stacked = True)

In [None]:
# Exporting the viz

vegan_department_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'vegan_department_bar.png'))

### Family Status

In [None]:
# Creating a new crosstab so that the departments are on the x-axis

family_department_cros = pd.crosstab(df['department_id'], df['family_profile'])

In [None]:
# Creating a stacked bar plot of the new crosstab

family_department_bar = family_department_cros.plot.bar(stacked = True)

In [None]:
# Exporting the viz

family_department_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'family_department_bar.png'))

## Extra Analysis

I have not found much of a difference between different groups using my profiles.  However, I think I can do a better job of making sense of some of the data by just answering some of the questions in the Project Brief directly.  For example, the project brief asks about differences in ordering habits based on loyalty status and region.  Let's explore.

In [None]:
# Creating a crosstab comparing 'region' and 'loyalty_flag'

region_loyalty_cross = pd.crosstab(df['region'], df['loyalty_flag'])

region_loyalty_cross

In [None]:
# Creating a crosstab comparing 'region' and 'loyalty_flag' with row percentages

region_loyalty_crossed = pd.crosstab(df['region'], df['loyalty_flag'], normalize = 'index')

region_loyalty_crossed

In [None]:
# Visualizing regional loyalty

region_loyalty_bar = region_loyalty_cross.plot.bar(rot=0)

At first glance, it seems like there are tons of regular customers in the South, and while this is true in absolute terms, it is also the case that the South is the largest region of the country.  When we take the number of customers into account, we see that the distribution of loyalty is remarkably consistent across regions.

In [None]:
# Visualizing regional loyalty

region_loyalty_bar_2 = region_loyalty_crossed.plot.bar(rot=0)

As we've seen before, there's just not much of a difference in loyalty by region once the population of that region is taken into account.  The same can be said of prices, days_since_prior_order, and all other variables I've explored.  At this point, I've spent many hours on this project, and I'm going to call it here.
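
One way to put a number on "not much of a difference" is an effect size such as Cramér's V, which is 0 when two categorical variables are unrelated and 1 when one fully determines the other. The sketch below is not part of the original analysis: `cramers_v` is a hypothetical helper, and the counts and loyalty labels are made up.

```python
import numpy as np
import pandas as pd

def cramers_v(table):
    """Cramér's V for a contingency table of counts (0 = no association, 1 = perfect)."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if rows and columns were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))

# Toy counts: both regions have the same loyalty mix, only the totals differ.
toy = pd.DataFrame(
    {'Loyal customer': [100, 300], 'New customer': [50, 150], 'Regular customer': [150, 450]},
    index=['west', 'south'],
)
print(cramers_v(toy))
```

Applied to a table of raw counts like region_loyalty_cross, a value close to 0 would be consistent with the charts above. The helper assumes every row and column has at least one count.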

In [None]:
# Exporting the visualizations above

region_loyalty_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'region_loyalty_bar.png'))
region_loyalty_bar_2.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'region_loyalty_bar_2.png'))

### Creating a viz to answer "Are there differences in ordering habits based on a customer's loyalty status?"

#### By Order Frequency

In [None]:
# Creating the crosstab

frequency_loyalty_crossed = pd.crosstab(df['order_frequency_flag'], df['loyalty_flag'])

frequency_loyalty_crossed

In [None]:
# Creating the viz

frequency_loyalty_bar = frequency_loyalty_crossed.plot.bar(rot=0)

In [None]:
# Exporting the viz

frequency_loyalty_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'frequency_loyalty_bar.png'))

#### By Spending

In [None]:
# Creating the crosstab

spending_loyalty_crossed = pd.crosstab(df['spending_flag'], df['loyalty_flag'])

spending_loyalty_crossed

In [None]:
# Creating the viz

spending_loyalty_bar = spending_loyalty_crossed.plot.bar(rot=0)

In [None]:
# Exporting the viz

spending_loyalty_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'spending_loyalty_bar.png'))

### Creating a viz to answer "Are there differences in ordering habits based on a customer's region?"

#### By Order Frequency

In [None]:
# Creating the crosstab

frequency_region_crossed = pd.crosstab(df['order_frequency_flag'], df['region'])

frequency_region_crossed

In [None]:
# Creating the viz

frequency_region_bar = frequency_region_crossed.plot.bar(rot=0)

In [None]:
# Exporting the viz

frequency_region_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'frequency_region_bar.png'))

#### By Spending

In [None]:
# Creating the crosstab

spending_region_crossed = pd.crosstab(df['spending_flag'], df['region'])

spending_region_crossed

In [None]:
# Creating the viz

spending_region_bar = spending_region_crossed.plot.bar(rot=0)

In [None]:
# Exporting the viz

spending_region_bar.figure.savefig(os.path.join(path, '04 Analysis', 'Visualizations', 'spending_region_bar.png'))

# Step 10: Tidy Up, Export Df, and Save Notebook

In [144]:
# Exporting final data set as orders_products_final.pkl

df.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_final.pkl'))

In [145]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404859 entries, 0 to 32404858
Data columns (total 37 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   order_id                       int64  
 1   user_id                        int64  
 2   eval_set                       object 
 3   order_number                   int64  
 4   orders_day_of_week             int64  
 5   order_hour_of_day              int64  
 6   days_since_prior_order         float64
 7   product_id                     int64  
 8   add_to_cart_order              int64  
 9   reordered                      int64  
 10  product_name                   object 
 11  aisle_id                       int64  
 12  department_id                  int64  
 13  prices                         float64
 14  price_range_loc                object 
 15  busiest_day                    object 
 16  busiest_days                   object 
 17  busiest_period_of_day          object 
 18  