# 4.10. Final report (tasks 1-4)

List of contents:
1. Import libraries
2. Import subset of customer data
3. Check for data consistency
4. Write a brief analysis on the use of PII data
5. Compare customer behaviour in different regions
6. Create an exclusion flag for low-activity customers
7. Filter out low-activity customers
8. Export high-activity (active) customers
9. Export high- and low-activity customers combined



# Step 1

In [15]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [16]:
# allocate file path

path = r'C:\Users\admin\Desktop\10.2023 Instacart Basket Analysis'

path

'C:\\Users\\admin\\Desktop\\10.2023 Instacart Basket Analysis'

In [17]:
# Import the prepared data from the pickle file

ords_cus_merged = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data', 'ords_cus_merged_ready.pkl'))

# Check for data consistency

In [18]:
ords_cus_merged.shape

(4034590, 30)

In [19]:
ords_cus_merged.head()

Unnamed: 0,user_id,STATE,Age,number_of_dependants,fam_status,income,order_id,order_number,orders_day_of_week,order_hour,...,busiest day,busiest_day,busiest_slowest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spender_flag,median_days_since_prior_order,order_frequency_flag
3487529,3,Arizona,33,3,married,93240,-45,2,2,20,...,Regularly busy,Regularly busy,Regularly busy,Average orders,7,New customer,9.078125,Low spender,7.0,Frequent customer
1086135,59,Delaware,49,2,married,71218,78,16,1,15,...,Regularly busy,Regularly busy,Busiest days,Most orders,37,Regular customer,7.003906,Low spender,7.0,Frequent customer
6994335,27,New York,81,0,divorced/widowed,110170,66,10,1,14,...,Regularly busy,Regularly busy,Busiest days,Most orders,30,Regular customer,8.765625,Low spender,9.0,Frequent customer
12713972,26,New Mexico,59,2,married,49072,-26,27,6,11,...,Regularly busy,Regularly busy,Regularly busy,Most orders,50,Loyal customer,8.03125,Low spender,6.0,Frequent customer
5791778,52,Alabama,64,2,married,40974,-30,50,2,14,...,Regularly busy,Regularly busy,Regularly busy,Most orders,75,Loyal customer,8.117188,Low spender,5.0,Frequent customer
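
Beyond `shape` and `head()`, a consistency check usually also covers missing values, duplicate rows, and mixed-type columns. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and not taken from `ords_cus_merged`); the same calls could be run on the real data.

```python
import pandas as pd
import numpy as np

# Toy stand-in for ords_cus_merged; values are illustrative only
df = pd.DataFrame({
    'user_id': [3, 59, 27, 27],
    'STATE': ['Arizona', 'Delaware', 'New York', 'New York'],
    'income': [93240, np.nan, 110170, 110170],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(df.duplicated().sum())

# Columns whose values are stored as more than one Python type
mixed_type_cols = [
    col for col in df.columns
    if df[col].map(type).nunique() > 1
]

print(missing_counts)
print(duplicate_rows)
print(mixed_type_cols)
```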


# Step 2
Consider any security implications that might exist for this new data. You’ll need to address any PII data in the data before continuing your analysis.

The dataset used for this analysis was pre-processed to exclude any personally identifiable information (PII). PII is not relevant to the analysis objectives, and removing it keeps the work in line with data security protocols.
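
As an illustration of how such a pre-processing step might look, the sketch below drops PII columns from a toy customer table and confirms they are gone. The column names `first_name` and `last_name` are assumptions made for this example; the notebook does not state which columns were removed.

```python
import pandas as pd

# Hypothetical PII column names; adjust to the actual customer data
pii_columns = ['first_name', 'last_name']

# Toy frame standing in for the raw customer data
customers = pd.DataFrame({
    'user_id': [3, 59],
    'first_name': ['A', 'B'],
    'last_name': ['C', 'D'],
    'STATE': ['Arizona', 'Delaware'],
})

# Drop whichever PII columns are present; errors='ignore' skips missing ones
customers_no_pii = customers.drop(columns=pii_columns, errors='ignore')

# Confirm that none of the PII columns remain
remaining = [c for c in pii_columns if c in customers_no_pii.columns]
print(remaining)
```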



# Step 3

The Instacart officers are interested in comparing customer behavior in different geographic areas. Create a regional segmentation of the data. You’ll need to create a “Region” column based on the “State” column from your customers data set.

In [20]:
# assign_region maps a state name to its U.S. region
# (Northeast, Midwest, South, West). It is applied to the 'STATE' column
# of 'ords_cus_merged' to create the new 'Region' column.


def assign_region(state):
    if state in ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont',
                 'New Jersey', 'New York', 'Pennsylvania']:
        return 'Northeast'
    elif state in ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin',
                   'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota']:
        return 'Midwest'
    elif state in ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia',
                   'Alabama', 'Kentucky', 'Mississippi', 'Tennessee',
                   'Arkansas', 'Louisiana', 'Oklahoma', 'Texas']:
        return 'South'
    elif state in ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming',
                   'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']:
        return 'West'
    else:
        return 'Unknown'  # For any state not listed



In [21]:
# Apply the function to the 'STATE' column to create the 'Region' column

ords_cus_merged['Region'] = ords_cus_merged['STATE'].apply(assign_region)
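
On a DataFrame with millions of rows, a dictionary lookup with `Series.map` is a common alternative to calling a Python function row by row with `apply`. The sketch below shows the idea on a small Series; the dictionary is deliberately abbreviated (a full version would list every state), and `'Atlantis'` is a made-up value used to show the fallback.

```python
import pandas as pd

# Abbreviated lookup for illustration; a full version would list every state
region_by_state = {
    'New York': 'Northeast',
    'Ohio': 'Midwest',
    'Alabama': 'South',
    'Arizona': 'West',
}

states = pd.Series(['Arizona', 'Ohio', 'New York', 'Alabama', 'Atlantis'])

# map() returns NaN for keys missing from the dict; fillna labels them 'Unknown'
regions = states.map(region_by_state).fillna('Unknown')
print(regions.tolist())
# ['West', 'Midwest', 'Northeast', 'South', 'Unknown']
```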

In [22]:
ords_cus_merged.head()

Unnamed: 0,user_id,STATE,Age,number_of_dependants,fam_status,income,order_id,order_number,orders_day_of_week,order_hour,...,busiest_day,busiest_slowest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spender_flag,median_days_since_prior_order,order_frequency_flag,Region
3487529,3,Arizona,33,3,married,93240,-45,2,2,20,...,Regularly busy,Regularly busy,Average orders,7,New customer,9.078125,Low spender,7.0,Frequent customer,West
1086135,59,Delaware,49,2,married,71218,78,16,1,15,...,Regularly busy,Busiest days,Most orders,37,Regular customer,7.003906,Low spender,7.0,Frequent customer,South
6994335,27,New York,81,0,divorced/widowed,110170,66,10,1,14,...,Regularly busy,Busiest days,Most orders,30,Regular customer,8.765625,Low spender,9.0,Frequent customer,Northeast
12713972,26,New Mexico,59,2,married,49072,-26,27,6,11,...,Regularly busy,Regularly busy,Most orders,50,Loyal customer,8.03125,Low spender,6.0,Frequent customer,West
5791778,52,Alabama,64,2,married,40974,-30,50,2,14,...,Regularly busy,Regularly busy,Most orders,75,Loyal customer,8.117188,Low spender,5.0,Frequent customer,South


In [23]:
ords_cus_merged['Region'].value_counts(dropna = False)

Region
South        1237688
West         1082892
Midwest       906209
Northeast     710143
Unknown        97658
Name: count, dtype: int64
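
The `Unknown` group means that some values in `STATE` match none of the lists in `assign_region`. One possibility is a value such as District of Columbia, which the lists do not include, but that has not been verified here. A minimal sketch for listing the unmatched values, shown on toy data where `'Some Territory'` is a made-up placeholder:

```python
import pandas as pd

# Toy data; 'Some Territory' is a made-up value standing in for an unmatched state
df = pd.DataFrame({
    'STATE': ['Arizona', 'Some Territory', 'Ohio', 'Some Territory'],
    'Region': ['West', 'Unknown', 'Midwest', 'Unknown'],
})

# Count the state values that ended up in the 'Unknown' region
unmatched_states = df.loc[df['Region'] == 'Unknown', 'STATE'].value_counts()
print(unmatched_states)
```

Running the same selection on `ords_cus_merged` would show which state names need to be added to a region list.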

In [24]:
# To determine whether there's a difference in spending habits between U.S. regions,
# I will use two methods: a crosstab and a groupby.

crosstab = pd.crosstab(ords_cus_merged['Region'], ords_cus_merged['spender_flag'], dropna = False)

In [25]:
crosstab.to_clipboard()

#Region	High spender	Low spender
#Midwest	17819	888390
#Northeast	12666	697477
#South	25441	1212247
#Unknown	2166	95492
#West	20314	1062578


In [26]:
# Group the data by 'Region' and 'spender_flag', then count the occurrences
spending_habits_by_region = ords_cus_merged.groupby(['Region', 'spender_flag']).size().reset_index(name='count')

# Alternatively, to compare a numerical measure (such as 'average_price'), take the mean per region
# spending_habits_by_region = ords_cus_merged.groupby('Region')['average_price'].mean().reset_index(name='average_spending')

# Display the result
print(spending_habits_by_region)


      Region  spender_flag    count
0    Midwest  High spender    17819
1    Midwest   Low spender   888390
2  Northeast  High spender    12666
3  Northeast   Low spender   697477
4      South  High spender    25441
5      South   Low spender  1212247
6    Unknown  High spender     2166
7    Unknown   Low spender    95492
8       West  High spender    20314
9       West   Low spender  1062578
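
Because the regions differ in size, raw counts are hard to compare directly. The sketch below converts the counts from the output above into row-wise shares, so that each region's proportion of high-spender rows can be compared on the same scale.

```python
import pandas as pd

# Counts copied from the crosstab output above
counts = pd.DataFrame(
    {
        'High spender': [17819, 12666, 25441, 2166, 20314],
        'Low spender': [888390, 697477, 1212247, 95492, 1062578],
    },
    index=['Midwest', 'Northeast', 'South', 'Unknown', 'West'],
)

# Divide each row by its total so that every region sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

# Percentage of rows flagged 'High spender' in each region
high_spender_pct = (shares['High spender'] * 100).round(2)
print(high_spender_pct)
```

On the full DataFrame, `pd.crosstab(ords_cus_merged['Region'], ords_cus_merged['spender_flag'], normalize='index')` should give the same shares in one step.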


# Step 4
The Instacart CFO isn’t interested in customers who don’t generate much revenue for the app. Create an exclusion flag for low-activity customers (customers with fewer than 5 orders) and exclude them from the data. Make sure you export this sample.

In [None]:
# Step 1: Create the exclusion flag
ords_cus_merged['exclusion_flag'] = ords_cus_merged['order_number'] < 5

# Step 2: Filter out low-activity customers
high_activity_customers = ords_cus_merged[~ords_cus_merged['exclusion_flag']]

# Step 3: Export the sample
high_activity_customers.to_csv('high_activity_customers.csv', index=False)


In [47]:
# Assign a flag for low-activity customers
ords_cus_merged.loc[ords_cus_merged['order_number'] < 5, 'exclusion_flag'] = 'low-activity customers'

In [48]:
ords_cus_merged.head(15)

Unnamed: 0,user_id,STATE,Age,number_of_dependants,fam_status,income,order_id,order_number,orders_day_of_week,order_hour,...,busiest_slowest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spender_flag,median_days_since_prior_order,order_frequency_flag,Region,exclusion_flag
3487529,3,Arizona,33,3,married,93240,-45,2,2,20,...,Regularly busy,Average orders,7,New customer,9.078125,Low spender,7.0,Frequent customer,West,low-activity customers
1086135,59,Delaware,49,2,married,71218,78,16,1,15,...,Busiest days,Most orders,37,Regular customer,7.003906,Low spender,7.0,Frequent customer,South,
6994335,27,New York,81,0,divorced/widowed,110170,66,10,1,14,...,Busiest days,Most orders,30,Regular customer,8.765625,Low spender,9.0,Frequent customer,Northeast,
12713972,26,New Mexico,59,2,married,49072,-26,27,6,11,...,Regularly busy,Most orders,50,Loyal customer,8.03125,Low spender,6.0,Frequent customer,West,
5791778,52,Alabama,64,2,married,40974,-30,50,2,14,...,Regularly busy,Most orders,75,Loyal customer,8.117188,Low spender,5.0,Frequent customer,South,
11624605,31,Oklahoma,38,1,married,52311,79,20,0,20,...,Busiest days,Average orders,38,Regular customer,7.820312,Low spender,8.0,Frequent customer,South,
12232240,30,Ohio,28,0,single,71469,-17,20,3,10,...,Slowest days,Most orders,35,Regular customer,7.519531,Low spender,8.0,Frequent customer,Midwest,
8655287,79,North Carolina,22,0,single,53757,38,2,5,13,...,Regularly busy,Most orders,9,New customer,4.558594,Low spender,7.0,Frequent customer,South,low-activity customers
12438297,116,Illinois,22,0,single,88778,-22,13,4,15,...,Slowest days,Most orders,21,Regular customer,7.535156,Low spender,11.0,Regular customer,Midwest,
11548288,31,Oklahoma,38,1,married,52311,91,17,2,13,...,Regularly busy,Most orders,55,Loyal customer,8.023438,Low spender,5.0,Frequent customer,South,


In [49]:
ords_cus_merged['exclusion_flag'].value_counts(dropna = False)

exclusion_flag
NaN                       3042409
low-activity customers     992181
Name: count, dtype: int64
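
Note that `order_number` is recorded per order row, so the flag above marks a customer's early orders even when that customer has many orders overall (the first row shown has `max_order` 7 and is still flagged). If the goal is to flag customers with fewer than 5 orders in total, one option is to flag on `max_order` instead. The sketch below assumes that `max_order` holds each customer's highest order number; the toy rows and the `'high-activity customer'` label are illustrative.

```python
import pandas as pd
import numpy as np

# Toy rows: user 3 has 7 orders in total, user 8 has only 3
df = pd.DataFrame({
    'user_id': [3, 3, 8],
    'order_number': [2, 7, 3],
    'max_order': [7, 7, 3],
})

# Flag at customer level: every row of a customer gets the same label
df['exclusion_flag'] = np.where(
    df['max_order'] < 5, 'low-activity customer', 'high-activity customer'
)

# Keep only the rows belonging to high-activity customers
high_activity = df[df['exclusion_flag'] == 'high-activity customer']
print(high_activity)
```

Using `np.where` with two labels also avoids leaving `NaN` in the flag column for the rows that are not low-activity.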

In [50]:
# Filter out low-activity customers
high_activity_customers = ords_cus_merged[ords_cus_merged['order_number'] >= 5]


# Export the filtered data (high_activity_customers) to CSV

In [55]:
# Export the filtered data to a CSV file
high_activity_customers.to_csv(os.path.join(path, '02 Data','Prepared Data', 'high_activity_customers.csv'), index=False)

In [59]:
ords_cus_merged.head()

Unnamed: 0,user_id,STATE,Age,number_of_dependants,fam_status,income,order_id,order_number,orders_day_of_week,order_hour,...,busiest_slowest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spender_flag,median_days_since_prior_order,order_frequency_flag,Region,exclusion_flag
3487529,3,Arizona,33,3,married,93240,-45,2,2,20,...,Regularly busy,Average orders,7,New customer,9.078125,Low spender,7.0,Frequent customer,West,low-activity customers
1086135,59,Delaware,49,2,married,71218,78,16,1,15,...,Busiest days,Most orders,37,Regular customer,7.003906,Low spender,7.0,Frequent customer,South,
6994335,27,New York,81,0,divorced/widowed,110170,66,10,1,14,...,Busiest days,Most orders,30,Regular customer,8.765625,Low spender,9.0,Frequent customer,Northeast,
12713972,26,New Mexico,59,2,married,49072,-26,27,6,11,...,Regularly busy,Most orders,50,Loyal customer,8.03125,Low spender,6.0,Frequent customer,West,
5791778,52,Alabama,64,2,married,40974,-30,50,2,14,...,Regularly busy,Most orders,75,Loyal customer,8.117188,Low spender,5.0,Frequent customer,South,


# Export the DataFrame with high- and low-activity customers combined

In [62]:
# Export high and low customer activities combined to pickle

ords_cus_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'order_cus_merged_high_low.pkl'))
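
After an export, it can be worth reading the file back to confirm that nothing was lost. A minimal sketch of such a round-trip check, using a temporary folder and a toy DataFrame rather than the project path:

```python
import os
import tempfile
import pandas as pd

# Toy DataFrame standing in for the exported data
df = pd.DataFrame({'user_id': [3, 59], 'Region': ['West', 'South']})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the pickle, then read it straight back
    pkl_path = os.path.join(tmp_dir, 'sample.pkl')
    df.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The restored frame should have the same shape and contents
same_shape = restored.shape == df.shape
print(same_shape)
```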