# Kmart Analysis

## Objectives


* Define business requirements
* Conduct EDA: explore data distribution, plot data for discovery
* Start defining the content of the Tableau dashboard (use cases, KPIs, design)


## Inputs

* `inputs/dataset/kmart_processed_data.csv`, generated by "notebooks/01 - Data Collection.ipynb"

## Outputs

* same as Objectives







---

# Install and load packages

In [None]:
! pip install pandas==1.3.5
! pip install matplotlib==3.5.0
! pip install seaborn==0.11.2
! pip install plotly==5.1.0
! pip install feature-engine==1.4.0 
! pip install pandas-profiling==3.3.0 

import os
# Kill the current kernel process so the runtime restarts and loads the versions installed above
os.kill(os.getpid(), 9)

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

---

# Load Data

Clone Repo used for project development

In [None]:
! git clone https://github.com/FernandoRocha88/portfolio-kmart.git

In [None]:
df = pd.read_csv("/content/portfolio-kmart/inputs/dataset/kmart_processed_data.csv")
print(df.shape)
df.head(2)

---

# EDA

Univariate Exploration with Pandas Profiling

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

* The dataset has **185.6k rows**
* There are **178.4k unique orders** in 2019
* There are **19 products** sold. Charging cables, batteries and headphones are the best sellers
* There are 140.7k unique purchase addresses.
  * We assume a given customer has 1 address and didn't change it in 2019.
  * As a result, there are **140.7k customers** buying from us in 2019
* **Line of products**: 45% of ordered items are Cables and Accessories, followed by Audio with 26% and PC and Video Games with 13%. Together, these account for 84%
* **8 States**: 40% of ordered items are from CA, 13.5% each from NY and TX, and 11% from MA. These 4 states account for almost 78%
* **10 Cities**: 25% of ordered items are in San Francisco, 16% in LA, 13% in NY and 10% in Boston. Together, these account for 64% of all ordered items
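The shares quoted above are relative frequencies, and the customer count relies on the one-address-per-customer assumption. A minimal sketch of how such figures can be computed, using a small made-up frame (the rows are hypothetical; only the column names `State` and `Purchase Address` follow the dataset):

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the processed dataset
toy = pd.DataFrame({
    "State": ["CA", "CA", "NY", "TX", "CA"],
    "Purchase Address": ["1 A St", "1 A St", "2 B St", "3 C St", "4 D St"],
})

# Share of ordered items per state, as percentages
state_share = (toy["State"].value_counts(normalize=True) * 100).round(1)

# One address is assumed to identify one customer
n_customers = toy["Purchase Address"].nunique()

print(state_share)
print(n_customers)
```

On the real data, the same two calls applied to `df` should give figures comparable to the profiling report.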




---

# Business Requirements, Analysis and Dashboard

### Business Requirements

* Understand Overall Results and Patterns, in terms of Margin, Revenue, # Orders, # Customers, over time or across different products/regions 

* Discuss business strategies, based on product and customer behaviour

### Analysis

Questions to be answered in the notebook, since these use cases belong to an initial exploratory data analysis and tend not to require the interactivity that Tableau brings
* **What is the City, State distribution and its amount of orders?**
* **Product Basket**
* **Customer behaviour**



---

### Dashboard

Use cases to be explored in Tableau (due to its interactivity), using flexible, parameter-controlled plot types
* Analytics Explorer
  * What is the `percentage margin / sales / margin / cost / count of orders / count of customers` per `quarter / month / week / day of month / weekday / hour / city / state / product line / product` colored by `time / city / product`?
  * Bar plot, controlled by parameters for x axis, y axis and color by. Hamburger menu with an additional set of filters

* Tracker Over Time
  * What is the `margin / quantity of products / percentage margin / unit product price/cost` over time per `product / region`?
  * Line plot, controlled by parameters for y axis and color by. Hamburger menu with an additional set of filters

* Matrix Performance
  * It shows how 2 different measures (`margin, revenue, cost, # orders, quantity, # customers, percentage margin, avg product price/cost`) are broken down by a given variable (`product, region`), in a given time range
  * Scatter plot, controlled by parameters for x axis, y axis and color by.
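
The dashboard itself is built in Tableau, but the table behind a view such as Matrix Performance can be sketched in pandas to check the aggregation logic. The example below is only an illustration on hypothetical rows: it aggregates a few measures per product, and any two of the resulting columns could serve as the x and y axes of the scatter plot.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the processed dataset
toy = pd.DataFrame({
    "Order ID": [1, 1, 2, 3],
    "Product": ["Wired Headphones", "Lightning Charging Cable",
                "Wired Headphones", "Lightning Charging Cable"],
    "Quantity Ordered": [1, 2, 1, 1],
    "Margin": [6.0, 15.0, 6.0, 7.5],
})

# One row per product, one column per measure
matrix = (
    toy.groupby("Product")
    .agg(Margin=("Margin", "sum"),
         Orders=("Order ID", "nunique"),
         Quantity=("Quantity Ordered", "sum"))
    .reset_index()
)
print(matrix)
```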


Design:
  * Each use case will have its own individual view

---

### Strategic Business Concerns for meeting discussion

* Pricing
  * Under which circumstances could we adjust (increase/decrease) the price?
  * Are the prices in-line with the market reality?


* Product
  * Can we do: up-selling, cross-selling?
  * Which products captured our attention and why?
  * What are the product portfolio risks?



* Customer Behaviour
  * What is the product recurrency profile?
  * How do I find more people that buy recurrently?
  * Where do I find more customers that have higher margin (regardless of whether they are recurrent)?


* Place
  * What is our coverage/presence and its performance?

* Promotion
  * How are our advertising initiatives performing, either online or in-store?


---

# Questions to be answered in the notebook


### City, State distribution

* What is the City, State distribution and its amount of orders?
  * There are 8 states, 10 cities
  * CA and TX have customers from 2 cities, whereas the remaining states have customers from 1 city
  * CA and NY are the states with the most orders


In [None]:
df_order_city = df.groupby('Order ID')['City'].unique().reset_index() # group data by order ID, get unique City
df_order_city['City'] = df_order_city['City'].apply(lambda x: x[0]) # CityState comes in array, get first item. 
                                                                    # It is assumed that a given order is associated to a unique City 
df_order_city = (df_order_city
                  .groupby('City')['Order ID']
                  .count()
                  .reset_index()
                  .sort_values(by='City',ascending=True)) # count how many orders happened in a given City
df_order_city['State'] = df_order_city['City'].apply(lambda x: x.split(",")[0]) # extract State, for viz purpose
df_order_city

In [None]:
plt.figure(figsize=(17, 4))
sns.barplot(data=df_order_city, x="City", y="Order ID", hue="State", dodge=False)
plt.xticks(rotation=90)
plt.show()
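
The same per-city order count can also be obtained by counting distinct order IDs per city. This is a sketch on hypothetical rows. It assumes, as the state extraction above suggests, that `City` holds a "State , City" label.

```python
import pandas as pd

# Hypothetical toy rows; order 1 has two lines but is a single order
toy = pd.DataFrame({
    "Order ID": [1, 1, 2, 3, 4],
    "City": ["CA , San Francisco", "CA , San Francisco", "CA , Los Angeles",
             "NY , New York", "CA , San Francisco"],
})

orders_per_city = (
    toy.groupby("City")["Order ID"]
    .nunique()                    # distinct orders, so multi-line orders count once
    .reset_index(name="Orders")
    .sort_values("City")
)
# State is the part before the comma, with surrounding spaces removed
orders_per_city["State"] = orders_per_city["City"].str.split(",").str[0].str.strip()
print(orders_per_city)
```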

---

### Product Basket

  * What percentage of orders contain only 1 product?
  * Which products are sold together?
  * How many items do people typically buy?

In [None]:
df.head(2)

* What percentage of orders contain only 1 product?

`df_order`
* Group `df` by `Order ID` for product basket analysis: each row is an order
* Extract information such as: quantity of products in an order, products, number of distinct products in an order, margin

In [None]:
def generate_product_basket_data(df, group_by ):

  df_order = (df
              .groupby(by=[group_by])
              .agg(Quantity=('Quantity Ordered','sum'),
                    Product=('Product','unique'),
                    NumberOfProducts=('Product','nunique'),
                    Margin = ('Margin', 'sum'),
                    CustomerAddress = ("Purchase Address", 'unique') ) # assume each customer has 1 address
              .reset_index()
              .sort_values(by=['NumberOfProducts'],ascending=False)
              )

  for col in ['CustomerAddress', 'Product']:
    df_order[col] = (df_order[col]
                    .astype(str)
                    .map(lambda x: x.lstrip('[').rstrip(']')) # removes brackets
                    ) 

  df_order['CustomerAddress'] = df_order['CustomerAddress'].map(lambda x: x[1:-1]) # removes quotes
  return df_order


df_order = generate_product_basket_data(df=df, group_by='Order ID')
df_order

What percentage of orders contain only 1 item?
* 86.8% of orders contain 1 item
* 13.1% of orders contain more than 1 item

How many orders have more than 1 item?
* 23.3k

How much margin do orders with more than 1 item bring?
* They bring 2.0M of margin, or 10.4% of the total margin

In [None]:
print(len(df_order.query("Quantity == 1")) / len(df_order)) # fraction of orders with 1 item
print(len(df_order.query("Quantity != 1")) / len(df_order)) # fraction of orders with more than 1 item
print()

print(len(df_order.query("Quantity != 1")))  # orders with more than 1 item
print()

# margin for orders with more than 1 item
print(round(df_order.query("Quantity != 1")['Margin'].sum(), 0) )
print(round(df_order.query("Quantity != 1")['Margin'].sum() / df_order['Margin'].sum() * 100, 1), "% of all margin result" )

What is the Quantity distribution across the orders?
* Among orders with more than 1 item, orders with 2 items are the most common

In [None]:
qty_distribution = df_order['Quantity'].value_counts()
pd.DataFrame(data= {"Absolute": qty_distribution , "Relative Percentage": round(qty_distribution / len(df_order) * 100 ,1)})

What are the most common product set combinations?

In [None]:
def common_product_set_combinations(df_order, top_n, x_size=6):

  combinations_frequency = df_order['Product'].astype(str).value_counts()
  print(f"* There are {combinations_frequency.shape[0]} distinct combinations of product sets \n\n* These are the top {top_n}:")
  print(combinations_frequency.head(top_n),"\n\n")

  plt.figure(figsize=(x_size, 3))
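  # name the columns explicitly so the plot does not depend on the default labels from reset_index
  top_combinations = combinations_frequency.head(top_n).rename_axis("Combination").reset_index(name="Count")
  splot = sns.barplot(x="Combination", y="Count", data=top_combinations)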
  for p in splot.patches:
      splot.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha = 'center', va = 'center', xytext = (0, 9), textcoords = 'offset points')
  plt.xticks(rotation=90)
  plt.title(f"Top {top_n} product combinations")
  splot.set(xlabel=None)
  plt.show()

Common product set combinations for orders with quantity > 1

In [None]:
common_product_set_combinations(df_order=df_order.query("Quantity > 1"), top_n=15) 
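
The function above counts exact product sets. A complementary view on "which products are sold together" is to count product pairs, so that a pair is counted even when it appears inside a larger basket. The following is a sketch on hypothetical orders, not part of the original analysis:

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical toy orders; column names mirror the processed dataset
toy = pd.DataFrame({
    "Order ID": [1, 1, 2, 2, 2, 3],
    "Product": ["Apple Airpods Headphones", "Lightning Charging Cable",
                "Apple Airpods Headphones", "Lightning Charging Cable", "Wired Headphones",
                "Wired Headphones"],
})

pair_counts = Counter()
for products in toy.groupby("Order ID")["Product"].unique():
    # sorted() makes (a, b) and (b, a) the same key; single-product orders yield no pair
    pair_counts.update(combinations(sorted(products), 2))

print(pair_counts.most_common(3))
```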

Just for exploration and sanity check: query the orders that contain a given product

In [None]:
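products = 'Lightning Charging Cable'
df_multi_item = df_order.query("Quantity != 1")  # filter first so the mask below shares the same index
df_multi_item[df_multi_item['Product'].str.contains(products, regex=False)]  # regex=False matches the text literally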

For quick exploration and validation: query a given order ID from `df` and inspect its content

In [None]:
order_id = 201904
df.query(f"`Order ID` == {order_id}")

---

###  Customer Behaviour

Data where each row is an order

In [None]:
df_order.head(2)

`df_customer`
* Group `df_order` by `customer address` to analyze customer order recurrency: each row is a customer 
* Extract information like: recurrency times, quantity of products bought, unique products, number of distinct products, customer margin

In [None]:
df_customer = (df_order
             .groupby(by=['CustomerAddress'])
             .agg(RecurrencyTimes=('Order ID','count'),
                  QuantityOfProducts = ('Quantity', 'sum'),
                  Product=('Product','unique'),
                  NumberOfProducts=('NumberOfProducts','sum'),
                  CustomerMargin = ('Margin', 'sum'),
                  ) 
             .sort_values(by=['RecurrencyTimes'], ascending=False)
             .reset_index()
             )

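# Addresses look like '<street>, <city>, <state> <zip>'
df_customer['State'] = df_customer['CustomerAddress'].apply(lambda x: x.split(",")[-1][1:3])  # 2-letter state code
df_customer['City'] = df_customer['CustomerAddress'].apply(lambda x: x.split(",")[-2][1:])  # city name without the leading space
df_customer['City'] = df_customer['State'] + " , " + df_customer['City']  # "State , City" label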

print(df_customer.shape)
df_customer.head(5)

How many customers re-purchased? What percentage of the customer base is that?
* 30.7k customers (21.9% of customer base)

In [None]:
print(f"There are {len(df_customer.query('RecurrencyTimes != 1'))} customers that re-purchased")
print(f"Or, {round(len(df_customer.query('RecurrencyTimes != 1'))  / len(df_customer) * 100 , 1)}% of customer base")

What is the recurrency times distribution?
* 78% of customers buy once
* 17% buy twice
* 3.4% buy 3 times

In [None]:
recurrency_distr = df_customer['RecurrencyTimes'].value_counts()
pd.DataFrame(data= {"Absolute": recurrency_distr , "Relative Percentage": round(recurrency_distr / len(df_customer) * 100 ,1)})

Which customers are recurrent?

In [None]:
recurrency_times = [2,3,4,5,6,7]
customer_address_list = df_customer.query(f"RecurrencyTimes in {recurrency_times}")['CustomerAddress'].to_list()
# isin avoids embedding thousands of addresses in a query string
df_recurrent_customers = df[df['Purchase Address'].isin(customer_address_list)]

print(f"There are {len(customer_address_list)} customers that bought {recurrency_times} times. These are the first 5 customers:\n")
customer_address_list[:5]

In [None]:
# df_recurrent_customers.head(2)

---

#### What is the product recurrency profile?

For these customers, which products have they bought, regardless of whether it was in the same order?
* When using `generate_product_basket_data()`, we group by `Purchase Address`, since we want the products for recurrent customers

In [None]:
df_products_recurr_cust = generate_product_basket_data(
                            df = df_recurrent_customers, # only recurrent customers
                            group_by = 'Purchase Address' )

print(df_products_recurr_cust.shape)
df_products_recurr_cust

For these recurrent customers, what are the most common product baskets?
* Batteries + Charging Cable  
* Charging Cable
* Wired/Apple Headphones + Charging Cable
* Batteries
* Apple/Bose/wired Headphones + Batteries
* Wired Headphones


In [None]:
common_product_set_combinations(df_order=df_products_recurr_cust, top_n=40, x_size=18)

Just for exploration/sanity check: query recurrent customers that bought a given product set


In [None]:
products = "Apple Airpods Headphones" # "'Apple Airpods Headphones' 'Wired Headphones'"
df_products_recurr_cust[df_products_recurr_cust['Product'].str.contains(products)]

For quick exploration/validation: query a customer's order records from `df` based on the customer address (pick an address from the table above)

In [None]:
customer_id = '1 12th St, San Francisco, CA 94016'

customer_orders_date = df.query(f"`Purchase Address` == '{customer_id}' ")['Order Date'].unique()
print(f"{len(customer_orders_date)} orders made on:\n {customer_orders_date}\n" )

df.query(f"`Purchase Address` == '{customer_id}' ")

---

#### What is the profile of recurrent customers?

In [None]:
# df.head(2)

What are the absolute and relative frequencies for State and City in the recurrent customers data?
* Customers from **San Francisco, Los Angeles, New York and Boston** tend to be more recurrent
* Is this proportion different from the one in the data with all customers?

In [None]:
for col in ['State', 'City']:
  print(f"Absolute and relative frequencies for {col}")
  relative_proportion = pd.DataFrame(data= {"Absolute": df_recurrent_customers[col].value_counts(),
                            "Relative":  df_recurrent_customers[col].value_counts(normalize=True)}
                    )
  print(relative_proportion)
  print("\n\n")

Let's compare: what is the State and City distribution for all customers and for recurrent customers?
* There is an indication that customers from San Francisco are especially recurrent, since the relative frequency of San Francisco in the "recurrent customers" data is greater than in the "all customers" data
* The relative proportions for San Francisco, Los Angeles, New York and Boston are closer between "all customers" and "recurrent customers" than for the other cities

For roadmap
* A chi-square test could be applied to check whether the distributions differ
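
As a rough illustration of that roadmap item, the sketch below runs a chi-square test of independence with `scipy.stats.chi2_contingency` on hypothetical customer-level data. Note one assumption that differs from the plots in this notebook: the test compares recurrent against non-recurrent customers (disjoint groups) rather than recurrent against all customers, since the latter group contains the former.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical customer-level rows: one row per customer
toy = pd.DataFrame({
    "City": ["San Francisco"] * 60 + ["Boston"] * 40,
    "Recurrent": [True] * 30 + [False] * 30 + [True] * 8 + [False] * 32,
})

# Rows: city; columns: recurrent or not
contingency = pd.crosstab(toy["City"], toy["Recurrent"])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2={chi2:.2f}, p-value={p_value:.4f}, dof={dof}")
```

A small p-value would suggest that recurrency and city are not independent. On the real data, a `Recurrent` flag could be derived from `RecurrencyTimes > 1` in `df_customer`.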

In [None]:
def plot_distributions_for_same_variables_from_2_datasets(df_a,df_a_name, df_b, df_b_name, variables, title):

  for col in variables:
    
    print(f"=== {col} ===")
    # Proportions are hard to compare across datasets of different sizes,
    # but the distributions are as follows
    df1 = pd.concat([
        pd.DataFrame(data={"Type": df_a_name, "Data": df_a[col]}),
        pd.DataFrame(data={"Type": df_b_name, "Data": df_b[col]}),
    ])  # pd.concat replaces DataFrame.append, which is deprecated since pandas 1.4
    plt.figure(figsize=(8, 4))
    sns.countplot(x='Type',hue='Data',data=df1)
    plt.title(f"{col} {title}")
    plt.show()
    print("\n\n")

    print("Relative Proportion")
    relative_proportion = pd.DataFrame(data= {df_b_name: df_b[col].value_counts(normalize=True),
                            df_a_name: df_a[col].value_counts(normalize=True)}
                    )
    relative_proportion['Difference'] = round(relative_proportion[df_b_name] - relative_proportion[df_a_name] ,3)
    print(relative_proportion)
    print("\n\n")


plot_distributions_for_same_variables_from_2_datasets(df_a= df, df_a_name='AllCustomers',
                                                      df_b=df_recurrent_customers, df_b_name = 'RecurrentCustomers',
                                                      variables= ['State', 'City'],
                                                      title= 'distribution on All Customers and Recurrent Customers')

---

#### What is the profile of customers with high margin?

In [None]:
print(df_customer.shape) # all base
df_customer.head(2)

What is the margin distribution across all customers?
* **Assume top-margin customers are those in the top 10% (above the 90th percentile), meaning their margin should be greater than 400**
* Roadmap: analyze customers in the top 1% in more detail

In [None]:
df_customer['CustomerMargin'].quantile([0,0.25,0.5,0.75,0.8,0.85,0.87,0.9,0.95,0.99,1])
# Q1 - Q3: 8.15 to 165
# 80% of customers have a margin of up to 200
# 90% of customers have a margin of up to 400
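
The cutoff of 400 is read off the quantile table by hand. A possible alternative, sketched here on hypothetical values, is to derive the threshold from the 90th percentile directly so that it follows the data if the dataset changes:

```python
import pandas as pd

# Hypothetical customer margins; column names mirror df_customer
toy_customers = pd.DataFrame({
    "CustomerAddress": [f"address {i}" for i in range(10)],
    "CustomerMargin": [5, 8, 12, 20, 40, 90, 150, 210, 390, 900],
})

# 90th percentile of customer margin (linear interpolation by default)
threshold = toy_customers["CustomerMargin"].quantile(0.9)

# Customers strictly above the threshold are treated as top margin
top_margin = toy_customers[toy_customers["CustomerMargin"] > threshold]
print(threshold, len(top_margin))
```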

In [None]:
df_customer.query('CustomerMargin < 700')['CustomerMargin'].hist()
plt.show()

What are the absolute and relative frequencies for State and City in the data for customers with top margin (margin > 400)?
* Customers from **San Francisco, Los Angeles, New York and Boston** tend to have more margin
* Is this proportion different from the one in the data with all customers?

In [None]:
for col in ['State', 'City']:
  print(f"Absolute and relative frequencies for {col}")
  relative_proportion = pd.DataFrame(data= {"Absolute": df_customer.query("CustomerMargin > 400")[col].value_counts(),
                            "Relative":  df_customer.query("CustomerMargin > 400")[col].value_counts(normalize=True)}
                    )
  print(relative_proportion)
  print("\n\n")

Let's compare: what is the State and City distribution for all customers and for top margin customers?
* There is an indication that customers from San Francisco especially tend to have more margin, since the relative frequency of San Francisco in the "top margin customers" data is greater than in the "all customers" data
* The relative proportions for San Francisco, Los Angeles, New York and Boston are closer between "all customers" and "top margin customers" than for the other cities

For roadmap
* A chi-square test could be applied to check whether the distributions differ

In [None]:
plot_distributions_for_same_variables_from_2_datasets(df_a= df_customer, df_a_name='AllCustomers',
                                                      df_b=df_customer.query("CustomerMargin > 400"), df_b_name = 'CustomersTopMargin',
                                                      variables= ['State', 'City'], title= 'distribution on All customers and with top margin')



---