# TASK 4.10 - Part 1, Coding Etiquette & Excel Reporting

## This script contains the following points from Steps 1-5 in task 4.10, part 1:
### - Importing libraries
### - Importing Data
### - Creating new column using 'If-Statements with For-Loops'
### - Deriving new columns with loc()
### - Creating crosstab
### - Creating & Exporting Charts
### - Merging datasets
### - Dropping Columns
### - Exporting Data in Pickle Format

# Step 1.

## 1. Importing libraries

In [79]:
# Import libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

## 2. Importing Data

In [80]:
path = r'C:\Users\Sanja\Documents\08-2020 Instacart Basket Analysis'

### 2.1 Importing 'Orders Product Customers' data set

In [81]:
ords_prods_cust_all = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data','orders_products_all.pkl'))

MemoryError: 
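
The `MemoryError` above indicates that the pickled frame did not fit in memory while loading. One common mitigation is to shrink dtypes in the upstream notebook before the frame is pickled, so the file loads with a smaller footprint. The sketch below is a minimal illustration under that assumption; `shrink_dtypes`, `max_category_ratio`, and the demo columns are hypothetical names and not part of the original workflow.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(df: pd.DataFrame, max_category_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes (hypothetical helper).

    Integer columns are downcast to the smallest integer type that holds
    their values; text columns with few distinct values become 'category'.
    Float columns are left untouched to avoid any loss of precision.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast='integer')
        elif pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            # Only convert when the column repeats a small set of values
            distinct_ratio = series.nunique(dropna=False) / max(len(series), 1)
            if distinct_ratio <= max_category_ratio:
                out[col] = series.astype('category')
    return out


# Small made-up frame standing in for the real data
demo = pd.DataFrame({
    'order_id': np.arange(1000, dtype='int64'),
    'state': ['Ohio', 'Texas'] * 500,
})
small = shrink_dtypes(demo)
print(demo.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

Whether this is enough for the full data set depends on the machine; it only reduces the size of what is loaded, it does not stream the file.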

In [None]:
# Check on the imported data
ords_prods_cust_all.shape

In [None]:
ords_prods_cust_all.head()

In [None]:
ords_prods_cust_all.info()

### 2.2 Importing 'Department' data set

In [None]:
departments = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'), index_col = False)

In [None]:
# Check on the imported data
departments

In [None]:
# renaming the 'Unnamed: 0' column to 'department_id'

departments.rename(columns={'Unnamed: 0':'department_id'},inplace=True)

In [None]:
departments

# Step 2.  Addressing PII data in the data set

### Note: Consider any security implications that might exist for this new data. You’ll need to address any PII data in the data before continuing your analysis.

## Answer: Personally Identifiable Information (PII), such as names, email addresses, physical addresses, and phone numbers, needs special attention. The data set contained customer names; because they are PII, I excluded them from further analysis and did not merge the name columns into the final data set I'm working with, to ensure data privacy and security.
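
As an illustration of the exclusion described above, the following sketch drops name columns before any merge. The column names `first_name` and `last_name`, the helper `drop_pii`, and the demo frame are assumptions for the example; the actual customer columns are not shown in this notebook.

```python
import pandas as pd

# Hypothetical PII column names; adjust to the real customers data set
PII_COLUMNS = ['first_name', 'last_name']


def drop_pii(df: pd.DataFrame, pii_columns=PII_COLUMNS) -> pd.DataFrame:
    """Return a copy of df without the listed PII columns (if present)."""
    present = [col for col in pii_columns if col in df.columns]
    return df.drop(columns=present)


# Made-up rows for demonstration only
customers_demo = pd.DataFrame({
    'user_id': [1, 2],
    'first_name': ['Ann', 'Bob'],
    'last_name': ['Lee', 'Ray'],
    'state': ['Ohio', 'Texas'],
})
customers_safe = drop_pii(customers_demo)
print(customers_safe.columns.tolist())
```

Filtering on the columns that are actually present keeps the helper from raising when a listed column has already been removed.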

# Step 3. Exploring customer behavior in different geographic areas

### Note: The Instacart officers are interested in comparing customer behavior in different geographic areas. Create a regional segmentation of the data. You’ll need to create a “Region” column based on the “State” column from your customers data set.
### Use the region information in this Wikipedia article to create your column (you only need to create regions, not divisions).
### Determine whether there’s a difference in spending habits between the different U.S. regions. (Hint: You can do this by crossing the variable you just created with the spending flag.)

## 3.1. Creating a “region” column based on the “state” column from the customers data set.

In [None]:
ords_prods_cust_all['state'].value_counts(dropna = False)

In [None]:
# Create new column for region, using 'If-Statements with For-Loops'
result =[]

for value in ords_prods_cust_all['state']:
    if value in ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']:
        result.append('Northeast')
    elif value in ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']:
        result.append('Midwest')
    elif value in ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']:
        result.append('South')
    elif value in ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona', 'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii']:
        result.append('West')
    else:
        result.append('Unknown')

In [None]:
# Preview the first labels only (displaying the full list would flood the output)
result[:10]

In [None]:
# Create new column from result output
ords_prods_cust_all['region'] = result
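
The loop above visits every row in Python, which can be slow on a frame with tens of millions of rows. A vectorized alternative, sketched here as an assumption rather than the method used in this notebook, builds a state-to-region dictionary from the same four lists and applies it with `Series.map`. The names `REGION_STATES`, `STATE_TO_REGION`, and `assign_region` are hypothetical.

```python
import pandas as pd

# Same state lists as in the loop above, keyed by region
REGION_STATES = {
    'Northeast': ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island',
                  'Connecticut', 'New York', 'Pennsylvania', 'New Jersey'],
    'Midwest': ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota',
                'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri'],
    'South': ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia',
              'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky',
              'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas',
              'Louisiana'],
    'West': ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona',
             'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii'],
}

# Invert to a flat lookup: state name -> region name
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states
}


def assign_region(states: pd.Series) -> pd.Series:
    """Map state names to regions; anything unmapped becomes 'Unknown'."""
    return states.astype(object).map(STATE_TO_REGION).fillna('Unknown')


# Made-up sample, including a value outside the lists and a missing value
demo_states = pd.Series(['Maine', 'Texas', 'Ohio', 'Hawaii', 'Puerto Rico', None])
demo_regions = assign_region(demo_states)
print(demo_regions.tolist())
```

On the real frame this would be used as `ords_prods_cust_all['region'] = assign_region(ords_prods_cust_all['state'])`, and it should give the same labels as the loop, including `'Unknown'` for unmatched states.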

In [None]:
# Check accurate regional segmentation
ords_prods_cust_all['region'].value_counts(dropna = False)

In [None]:
# Check on the dataframe
ords_prods_cust_all.shape

In [None]:
ords_prods_cust_all.head()

## 3.2 Creating crosstab to match regions column with the spending_flag column

In [None]:
print(ords_prods_cust_all['spending_flag'])

In [None]:
regional_spending_habits = pd.crosstab(ords_prods_cust_all['region'], ords_prods_cust_all['spending_flag'], dropna = False)

In [None]:
regional_spending_habits

In [None]:
# Create a bar chart of the above 'regional_spending_habits' crosstab

bar_regional_spending_habit=regional_spending_habits.plot.bar(color=['lightblue','tab:blue'])
plt.xlabel("Region", fontsize=10)
plt.ylabel("Frequency",fontsize=10)
plt.title("Regional Spending Habit", fontsize=12)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.tight_layout()  # Automatically adjust subplot parameters

### Observation: Customers' spending habits, in terms of the split between high and low spenders, do not appear to vary significantly between regions. In each region, the number of low spenders tracks the total number of customers, with the regions ranked South, West, Midwest, and Northeast. The counts of high spenders likewise show no notable differences across regions.
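
Raw counts mostly reflect how many rows each region has, so a comparison of spending habits is easier to read as row shares. The sketch below uses `pd.crosstab` with `normalize='index'` on made-up data; the flag values `'High spender'` and `'Low spender'` are assumed, since the actual values of `spending_flag` are not shown here.

```python
import pandas as pd

# Made-up rows standing in for ords_prods_cust_all
demo = pd.DataFrame({
    'region': ['South', 'South', 'South', 'West', 'West',
               'Midwest', 'Midwest', 'Northeast'],
    'spending_flag': ['Low spender', 'Low spender', 'High spender',
                      'Low spender', 'High spender',
                      'Low spender', 'Low spender', 'Low spender'],
})

# normalize='index' turns each row into shares that sum to 1
share_by_region = pd.crosstab(
    demo['region'], demo['spending_flag'], normalize='index'
)
print(share_by_region.round(3))
```

If the shares of high spenders are close across regions in the real data, that would support the observation above more directly than the counts do.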

In [None]:
# Export the chart
bar_regional_spending_habit.figure.savefig(os.path.join(path, '04 Analysis','Visualizations', 'bar_regional_spending_habit.png'))

# Step 4. Creating an exclusion flag for low-activity customers

### Note: The Instacart CFO isn’t interested in customers who don’t generate much revenue for the app. Create an exclusion flag for low-activity customers (customers with fewer than 5 orders) and exclude them from the data. Make sure you export this sample.

In [None]:
# Deriving columns with loc()
ords_prods_cust_all.loc[ords_prods_cust_all['max_order'] >= 5, 'customer_activity'] = 'active_customer'

In [None]:
ords_prods_cust_all.loc[ords_prods_cust_all['max_order'] < 5, 'customer_activity'] = 'non_active_customer'

In [None]:
# Cross-check the result
ords_prods_cust_all['customer_activity'].value_counts(dropna = False)

#### Observation: The frequencies of the new column sum to 32404859 rows, which confirms that the code operated correctly.
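
The two `loc()` assignments can also be written as a single step with `np.where`, followed by the same filter used for the export. This is a sketch on made-up data; `MIN_ORDERS` and `flag_activity` are hypothetical names. One difference to keep in mind: a missing `max_order` would be labelled `'non_active_customer'` here, whereas the `loc()` version leaves it as NaN.

```python
import numpy as np
import pandas as pd

MIN_ORDERS = 5  # threshold from the task: fewer than 5 orders is low activity


def flag_activity(max_order: pd.Series, min_orders: int = MIN_ORDERS) -> pd.Series:
    """Label each row by whether the customer's max_order reaches the threshold."""
    labels = np.where(max_order >= min_orders, 'active_customer', 'non_active_customer')
    return pd.Series(labels, index=max_order.index, name='customer_activity')


# Made-up rows for demonstration
demo = pd.DataFrame({'user_id': [1, 2, 3, 4], 'max_order': [3, 5, 12, 4]})
demo['customer_activity'] = flag_activity(demo['max_order'])

# Keep only the active customers, as in the subset exported below
active_demo = demo[demo['customer_activity'] == 'active_customer']

# The flag counts should add up to the number of rows
assert demo['customer_activity'].value_counts().sum() == len(demo)
print(active_demo)
```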

In [None]:
# Create a subset for the active customers

ords_prods_active_customers = ords_prods_cust_all[ords_prods_cust_all['customer_activity']=='active_customer']

In [None]:
ords_prods_active_customers.shape

In [None]:
ords_prods_active_customers.head()

In [None]:
# Export the 'active_customer' subset

ords_prods_active_customers.to_pickle(os.path.join(path,'02 Data','Prepared Data','ords_prods_active_customers.pkl'))

# Step 5. Customer Profiling

### Note: The marketing and business strategy units at Instacart want to create more-relevant marketing strategies for different products and are, thus, curious about customer profiling in their database. Create a profiling variable based on age, income, certain goods in the “department_id” column, and number of dependents. You might also use the “orders_day_of_the_week” and “order_hour_of_day” columns if you can think of a way they would impact customer profiles. (Hint: As an example, try thinking of what characteristics would lead you to the profile “Single adult” or “Young parent.”)

## 5.1 Creating 'Age' Groups

### The customers are grouped in 3 age groups:
#### Group 1 'Under 40': age < 40 years
#### Group 2 'Middle Age': 40 years <= age < 65 years
#### Group 3 'Seniors': age >= 65 years


In [None]:
# Deriving columns with loc(), create a flag 'age group'
ords_prods_cust_all.loc[ords_prods_cust_all['age'] < 40, 'age_group'] = 'Under 40'

In [None]:
ords_prods_cust_all.loc[(ords_prods_cust_all['age'] >= 40) & (ords_prods_cust_all['age'] < 65),'age_group']= 'Middle Age'

In [None]:
ords_prods_cust_all.loc[ords_prods_cust_all['age'] >= 65, 'age_group'] = 'Seniors'

In [None]:
# Cross-check the result
ords_prods_cust_all['age_group'].value_counts(dropna=False)

#### Observation: The frequencies of the new column sum to 32404859 rows, which confirms that the code operated correctly.
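
The three age bands can also be expressed in one call with `pd.cut`. The sketch below assumes the same boundaries as above (40 and 65, each belonging to the higher group) and uses `right=False` so the intervals are closed on the left. `AGE_BINS`, `AGE_LABELS`, and `assign_age_group` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Left-closed intervals: [-inf, 40), [40, 65), [65, inf)
AGE_BINS = [-np.inf, 40, 65, np.inf]
AGE_LABELS = ['Under 40', 'Middle Age', 'Seniors']


def assign_age_group(age: pd.Series) -> pd.Series:
    """Bin ages into the three groups; right=False puts 40 and 65 in the higher bin."""
    return pd.cut(age, bins=AGE_BINS, labels=AGE_LABELS, right=False)


# Made-up ages that sit on and around the boundaries
demo_age = pd.Series([18, 39, 40, 64, 65, 81])
demo_groups = assign_age_group(demo_age)
print(demo_groups.tolist())
```

The result is a categorical column, which also keeps the groups in their natural order in `value_counts(sort=False)` and in charts. The same pattern would fit any other numeric banding with different bins and labels.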

## 5.2 Customer Profiling based on Income - Creating 'Income' Groups

In [None]:
# Define the intervals of the groups
ords_prods_cust_all['spending_power'].describe()

### The customers are grouped in 3 groups, based on income:
#### Group 1 'Lower': income < 67,000
#### Group 2 'Medium': 67,000 <= income < 130,000
#### Group 3 'Higher': income >= 130,000

In [None]:
# Deriving columns with loc(), create a flag 'income group'
ords_prods_cust_all.loc[ords_prods_cust_all['spending_power'] < 67000, 'income_group'] = 'Lower'

In [None]:
ords_prods_cust_all.loc[(ords_prods_cust_all['spending_power'] >= 67000) & (ords_prods_cust_all['spending_power'] < 130000),'income_group']= 'Medium'

In [None]:
ords_prods_cust_all.loc[ords_prods_cust_all['spending_power'] >= 130000, 'income_group'] = 'Higher'

In [None]:
# Cross-check the result
ords_prods_cust_all['income_group'].value_counts(dropna=False)

#### Observation: The frequencies of the new column sum to 32404859 rows, which confirms that the code operated correctly.

## 5.3 Creating 'Family Status' Groups, based on Marital Status & Number of Dependents

In [None]:
# Define the intervals of the groups
ords_prods_cust_all['marital_status'].value_counts(dropna=False)

In [None]:
ords_prods_cust_all['marital_status'].describe()

In [None]:
ords_prods_cust_all['number_of_dependents'].value_counts(dropna=False)

In [None]:
# Define the intervals of the groups
ords_prods_cust_all['number_of_dependents'].describe()

In [None]:
# Deriving columns with loc(), create a flag 'family status'
ords_prods_cust_all.loc[(ords_prods_cust_all['marital_status'].isin(['divorced/widowed','single', 'living with parents and siblings']))&(ords_prods_cust_all['number_of_dependents'] == 0),'family_status_flag']= 'Single adult'

In [None]:
ords_prods_cust_all.loc[(ords_prods_cust_all['marital_status'].isin(['living with parents and siblings']))&(ords_prods_cust_all['number_of_dependents'] > 0),'family_status_flag']= 'Young parent'

In [None]:
ords_prods_cust_all.loc[(ords_prods_cust_all['marital_status'].isin(['divorced/widowed','single']))&(ords_prods_cust_all['number_of_dependents'] > 0),'family_status_flag']= 'Single adult with children'

In [None]:
ords_prods_cust_all.loc[(ords_prods_cust_all['marital_status'].isin(['married']))&(ords_prods_cust_all['number_of_dependents'] > 0),'family_status_flag']= 'Family'

In [None]:
ords_prods_cust_all.loc[(ords_prods_cust_all['marital_status'].isin(['married']))&(ords_prods_cust_all['number_of_dependents'] == 0),'family_status_flag']= 'Family without children'

In [None]:
# Cross-check the result
ords_prods_cust_all['family_status_flag'].value_counts(dropna=False)

#### Observation: The frequencies of the new column sum to 32404859 rows, which confirms that the code operated correctly.
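
The five family-status rules can be collected in one place with `np.select`, which makes it easier to see that they do not overlap. This is a sketch on made-up rows, not the code used above; `assign_family_status` and the fallback label `'Unassigned'` are hypothetical (the `loc()` version leaves unmatched rows as NaN instead).

```python
import numpy as np
import pandas as pd


def assign_family_status(marital_status: pd.Series, dependents: pd.Series) -> pd.Series:
    """Apply the five family-status rules; rows matching none get 'Unassigned'."""
    no_dependents = dependents == 0
    has_dependents = dependents > 0
    living_with_family = marital_status == 'living with parents and siblings'
    single_like = marital_status.isin(['divorced/widowed', 'single'])
    married = marital_status == 'married'

    conditions = [
        (single_like | living_with_family) & no_dependents,
        living_with_family & has_dependents,
        single_like & has_dependents,
        married & has_dependents,
        married & no_dependents,
    ]
    labels = [
        'Single adult',
        'Young parent',
        'Single adult with children',
        'Family',
        'Family without children',
    ]
    result = np.select(conditions, labels, default='Unassigned')
    return pd.Series(result, index=marital_status.index, name='family_status_flag')


# Made-up rows covering each rule once, plus a second 'Single adult' case
demo = pd.DataFrame({
    'marital_status': ['single', 'married', 'married',
                       'living with parents and siblings',
                       'divorced/widowed',
                       'living with parents and siblings'],
    'number_of_dependents': [0, 2, 0, 1, 3, 0],
})
demo['family_status_flag'] = assign_family_status(
    demo['marital_status'], demo['number_of_dependents']
)
print(demo)
```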

## 5.4. Customer Profiling based on Certain Goods in the 'department_id'

## 5.4.1. Merging the prepared Instacart 'ords_prods_cust_all' data with the wrangled 'departments' data set

In [None]:
# Merge 'ords_prods_cust_all' and 'departments' using department_id as a key 
df_instacart_all = ords_prods_cust_all.merge(departments, on = 'department_id')

In [None]:
df_instacart_all.shape

In [None]:
df_instacart_all.head()

In [None]:
# Add the merge indicator; note that with the default inner join every row is flagged 'both', so the outer merge further down is the real test of a full match
df_instacart_all_test = ords_prods_cust_all.merge(departments, on = 'department_id', indicator = True)

In [None]:
df_instacart_all_test['_merge'].value_counts()

In [None]:
# Merge 'ords_prods_cust_all' and 'departments' on department_id with how = 'outer' to double-check the full match
df_instacart_all_test_1 = ords_prods_cust_all.merge(departments, on = 'department_id', indicator = True, how = 'outer')

In [None]:
# Use the indicator argument to check whether there was a full match between the two dataframes
df_instacart_all_test_1['_merge'].value_counts()

#### Note: After using this method to double-check the merge, we can see that we do have a full match.
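
A compact way to run this kind of check is a left merge with both `indicator=True` and `validate='many_to_one'`. The left join keeps every order row, the indicator marks rows whose `department_id` has no partner, and `validate` raises if `department_id` is duplicated in the lookup table. The frames below are made up for the sketch; an inner join is not suitable for this check because all of its rows are `'both'` by construction.

```python
import pandas as pd

# Made-up stand-ins for ords_prods_cust_all and departments
orders_demo = pd.DataFrame({
    'product_id': [10, 11, 12, 13],
    'department_id': [4, 16, 4, 99],  # 99 has no match in the lookup
})
departments_demo = pd.DataFrame({
    'department_id': [4, 16, 19],
    'department': ['produce', 'dairy eggs', 'snacks'],
})

checked = orders_demo.merge(
    departments_demo,
    on='department_id',
    how='left',               # keep every order row
    validate='many_to_one',   # raise if department_id repeats in the lookup
    indicator=True,           # adds the '_merge' column
)

# Rows that found no department
unmatched_rows = checked[checked['_merge'] == 'left_only']
print(checked['_merge'].value_counts())
print(unmatched_rows)
```

An empty `unmatched_rows` on the real data would correspond to the full match noted above.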

## 5.4.2. Creating Groups of Goods - Grocery Essentials & Non-Grocery Items

In [None]:
# Deriving columns with loc(), create a flag 'goods_group'
df_instacart_all.loc[(df_instacart_all['department'].isin(['produce', 'dairy eggs', 'snacks', 'beverages', 'frozen', 'pantry', 'bakery'])),'goods_group']= 'Grocery Essentials'

In [None]:
df_instacart_all.loc[(df_instacart_all['department'].isin(['canned goods', 'deli', 'dry goods pasta', 'household', 'meat seafood', 'breakfast', 'personal care', 'babies', 'international', 'alcohol', 'pets', 'other'])),'goods_group']= 'Non-Grocery Items'

In [None]:
df_instacart_all.loc[(df_instacart_all['department'].isin(['bulk', 'missing'])),'goods_group']= 'Not Specified'

In [None]:
# Cross-check the result
df_instacart_all['goods_group'].value_counts(dropna=False)

#### Observation: The frequencies of the new column sum to 32404859 rows, which confirms that the code operated correctly.
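
Because the groups are defined by three hand-written lists, it can help to confirm that every department appears in exactly one of them. The sketch below keeps the same lists in a dictionary and reports any department without a group. `GOODS_GROUPS`, `DEPARTMENT_TO_GROUP`, and `unassigned_departments` are hypothetical names, and `'seasonal'` is a made-up department used only to trigger the check.

```python
import pandas as pd

# Same department lists as in the loc() statements above
GOODS_GROUPS = {
    'Grocery Essentials': ['produce', 'dairy eggs', 'snacks', 'beverages',
                           'frozen', 'pantry', 'bakery'],
    'Non-Grocery Items': ['canned goods', 'deli', 'dry goods pasta', 'household',
                          'meat seafood', 'breakfast', 'personal care', 'babies',
                          'international', 'alcohol', 'pets', 'other'],
    'Not Specified': ['bulk', 'missing'],
}

# Flat lookup: department -> group
DEPARTMENT_TO_GROUP = {
    dept: group
    for group, depts in GOODS_GROUPS.items()
    for dept in depts
}

# A department listed twice would silently collapse in the dict, so compare sizes
total_listed = sum(len(depts) for depts in GOODS_GROUPS.values())
assert total_listed == len(DEPARTMENT_TO_GROUP), 'a department is listed in two groups'


def unassigned_departments(department: pd.Series) -> list:
    """Return the department names that have no goods group."""
    return sorted(set(department.dropna().unique()) - set(DEPARTMENT_TO_GROUP))


# Made-up sample; 'seasonal' is not a real department in this notebook
demo_department = pd.Series(['produce', 'pets', 'bulk', 'bakery', 'seasonal'])
demo_group = demo_department.map(DEPARTMENT_TO_GROUP)
print(unassigned_departments(demo_department))
```

Run against `df_instacart_all['department']`, an empty list would indicate that no row is left without a `goods_group`.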

## 5.4.3. Creating the Groups of the goods based on their sales count

In [None]:
# Define the intervals of the groups
df_instacart_all['department'].value_counts()

In [None]:
# Deriving columns with loc(), create a flag 'goods_sales_count'
df_instacart_all.loc[(df_instacart_all['department'].isin(['produce', 'dairy eggs'])),'goods_sales_count']= 'High Sales'

In [None]:
df_instacart_all.loc[(df_instacart_all['department'].isin(['snacks', 'beverages', 'frozen', 'pantry', 'bakery', 'canned goods', 'deli', 'dry goods pasta', 'household', 'meat seafood', 'breakfast', 'personal care', 'babies'])),'goods_sales_count']= 'Medium Sales'

In [None]:
df_instacart_all.loc[(df_instacart_all['department'].isin(['international', 'alcohol', 'pets', 'missing', 'other', 'bulk'])),'goods_sales_count']= 'Low Sales'

In [None]:
# Cross-check the result
df_instacart_all['goods_sales_count'].value_counts(dropna=False)

#### Observation: The frequencies of the new column sum to 32404859 rows, which confirms that the code operated correctly.

In [None]:
df_instacart_all.shape

## Note: 'df_instacart_all' will serve as the final dataframe used for the final analysis

In [None]:
# Dropping the unnecessary columns from the final dataframe
instacart_all = df_instacart_all.drop(columns = ['add_to_cart_order', 'reordered', 'aisle_id'])

In [None]:
instacart_all.shape

In [None]:
# Export Final Data in Pickle Format
instacart_all.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'instacart_all.pkl'))