# Cohort Analysis (Retention over User & Product Lifetime)

# We use cohort analysis especially often for e-commerce websites and mobile applications.

# A cohort is a group of subjects who share a defining characteristic. We can observe how a cohort behaves over time and compare it with other cohorts. Cohorts are used in medicine, psychology, and econometrics.

In [6]:
# For data manipulation
import pandas as pd
# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For machine learning algorithms
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Let's load the data
ad_df = pd.read_excel(r"C:\Users\jki\Downloads\4 Addidas\clean\Adidas US Sales Datasets.xlsx")
ad_df.head(5)

Unnamed: 0,Retailer,Retailer ID,Invoice Date,Region,State,City,Product,Price per Unit,Units Sold,Total Sales,Operating Profit,Operating Margin,Sales Method
0,Foot Locker,1185732,2020-01-01,Northeast,New York,New York,Men's Street Footwear,50.0,1200,600000.0,300000.0,0.5,In-store
1,Foot Locker,1185732,2020-01-02,Northeast,New York,New York,Men's Athletic Footwear,50.0,1000,500000.0,150000.0,0.3,In-store
2,Foot Locker,1185732,2020-01-03,Northeast,New York,New York,Women's Street Footwear,40.0,1000,400000.0,140000.0,0.35,In-store
3,Foot Locker,1185732,2020-01-04,Northeast,New York,New York,Women's Athletic Footwear,45.0,850,382500.0,133875.0,0.35,In-store
4,Foot Locker,1185732,2020-01-05,Northeast,New York,New York,Men's Apparel,60.0,900,540000.0,162000.0,0.3,In-store


In [7]:
# Let's check for missing values
missing_values = ad_df.isna().sum()
print(missing_values)

Retailer            0
Retailer ID         0
Invoice Date        0
Region              0
State               0
City                0
Product             0
Price per Unit      0
Units Sold          0
Total Sales         0
Operating Profit    0
Operating Margin    0
Sales Method        0
dtype: int64


In [8]:
# Let's check the data types
ad_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9648 entries, 0 to 9647
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Retailer          9648 non-null   object        
 1   Retailer ID       9648 non-null   int64         
 2   Invoice Date      9648 non-null   datetime64[ns]
 3   Region            9648 non-null   object        
 4   State             9648 non-null   object        
 5   City              9648 non-null   object        
 6   Product           9648 non-null   object        
 7   Price per Unit    9648 non-null   float64       
 8   Units Sold        9648 non-null   int64         
 9   Total Sales       9648 non-null   float64       
 10  Operating Profit  9648 non-null   float64       
 11  Operating Margin  9648 non-null   float64       
 12  Sales Method      9648 non-null   object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(6)
memory usage: 980.0+ KB


In [9]:
# Let's check whether we have negative values
ad_df.describe()

Unnamed: 0,Retailer ID,Price per Unit,Units Sold,Total Sales,Operating Profit,Operating Margin
count,9648.0,9648.0,9648.0,9648.0,9648.0,9648.0
mean,1173850.0,45.216625,256.930037,93273.4375,34425.244761,0.422991
std,26360.38,14.705397,214.25203,141916.016727,54193.113713,0.097197
min,1128299.0,7.0,0.0,0.0,0.0,0.1
25%,1185732.0,35.0,106.0,4254.5,1921.7525,0.35
50%,1185732.0,45.0,176.0,9576.0,4371.42,0.41
75%,1185732.0,55.0,350.0,150000.0,52062.5,0.49
max,1197831.0,110.0,1275.0,825000.0,390000.0,0.8
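
Reading the `min` row of `describe()` is an indirect way to look for negative values. A more explicit check could be sketched as follows. The frame `toy_df` and its values are made up for illustration and only borrow three column names from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset above.
toy_df = pd.DataFrame({
    "Units Sold": [1200, 0, 850],
    "Total Sales": [600000.0, 0.0, 382500.0],
    "Operating Profit": [300000.0, 0.0, -50.0],
})

# Restrict the check to numeric columns, then count values below zero.
numeric = toy_df.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

# Zero-unit rows may deserve a look too, since they carry no sales.
zero_unit_rows = int((toy_df["Units Sold"] == 0).sum())

print(negative_counts)
print("Rows with zero units sold:", zero_unit_rows)
```

Swapping `toy_df` for `ad_df` would be expected to report no negatives in any column, which is consistent with the `min` row above, where every minimum is zero or greater.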


## Let's Perform the Cohort Analysis

## For cohort analysis, there are a few labels that we have to create:

## Invoice period: A string representation of the year and month of a single transaction/invoice.

## Cohort group: A string representation of the year and month of a retailer's first purchase. This label is common across all invoices for a particular retailer.

## Cohort period / cohort index: An integer representation of a retailer’s stage in its “lifetime”. The number represents the number of months that have passed since the first purchase.
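
To make the three labels concrete, here is a minimal sketch on a toy frame. The retailer IDs, dates, and the column names `InvoicePeriod`, `CohortGroup`, and `CohortIndex` are assumptions chosen for illustration, not names taken from the dataset.

```python
import pandas as pd

# Hypothetical invoices for two made-up retailer IDs.
toy = pd.DataFrame({
    "Retailer ID": [1, 1, 1, 2, 2],
    "Invoice Date": pd.to_datetime(
        ["2020-01-15", "2020-03-02", "2021-02-20", "2020-02-10", "2020-05-05"]
    ),
})

# Truncate each invoice date to the first day of its month.
toy["InvoiceMonth"] = toy["Invoice Date"].dt.to_period("M").dt.to_timestamp()

# The cohort month is the earliest invoice month of each retailer.
toy["CohortMonth"] = toy.groupby("Retailer ID")["InvoiceMonth"].transform("min")

# Cohort index: whole months between the cohort month and the invoice month.
year_diff = toy["InvoiceMonth"].dt.year - toy["CohortMonth"].dt.year
month_diff = toy["InvoiceMonth"].dt.month - toy["CohortMonth"].dt.month
toy["CohortIndex"] = year_diff * 12 + month_diff

# String representations of the two period labels.
toy["InvoicePeriod"] = toy["InvoiceMonth"].dt.strftime("%Y-%m")
toy["CohortGroup"] = toy["CohortMonth"].dt.strftime("%Y-%m")

print(toy[["Retailer ID", "InvoicePeriod", "CohortGroup", "CohortIndex"]])
```

This sketch counts from 0, following the definition of months passed since the first purchase. Some analyses add 1 so that the first month is period 1; either convention works as long as it is applied consistently.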

In [12]:
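import datetime as dt

def get_month(x):
    """Truncate a timestamp to the first day of its month."""
    return dt.datetime(x.year, x.month, 1)

# Invoice month: the month in which each transaction took place
ad_df['InvoiceMonth'] = ad_df['Invoice Date'].apply(get_month)
# Cohort month: the month of each retailer's first invoice
grouping = ad_df.groupby('Retailer ID')['InvoiceMonth']
ad_df['CohortMonth'] = grouping.transform('min')
ad_df.tail()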


Unnamed: 0,Retailer,Retailer ID,Invoice Date,Region,State,City,Product,Price per Unit,Units Sold,Total Sales,Operating Profit,Operating Margin,Sales Method,InvoiceMonth,CohortMonth
9643,Foot Locker,1185732,2021-01-24,Northeast,New Hampshire,Manchester,Men's Apparel,50.0,64,3200.0,896.0,0.28,Outlet,2021-01-01,2020-01-01
9644,Foot Locker,1185732,2021-01-24,Northeast,New Hampshire,Manchester,Women's Apparel,41.0,105,4305.0,1377.6,0.32,Outlet,2021-01-01,2020-01-01
9645,Foot Locker,1185732,2021-02-22,Northeast,New Hampshire,Manchester,Men's Street Footwear,41.0,184,7544.0,2791.28,0.37,Outlet,2021-02-01,2020-01-01
9646,Foot Locker,1185732,2021-02-22,Northeast,New Hampshire,Manchester,Men's Athletic Footwear,42.0,70,2940.0,1234.8,0.42,Outlet,2021-02-01,2020-01-01
9647,Foot Locker,1185732,2021-02-22,Northeast,New Hampshire,Manchester,Women's Street Footwear,29.0,83,2407.0,649.89,0.27,Outlet,2021-02-01,2020-01-01
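
The month truncation above calls a Python function once per row. On larger frames a vectorized form may be preferable. The sketch below compares the two on a few made-up dates; the series `dates` is hypothetical, and only `get_month` mirrors the notebook.

```python
import datetime as dt
import pandas as pd

# Hypothetical dates, chosen to span a year boundary.
dates = pd.Series(pd.to_datetime(["2020-01-31", "2020-12-15", "2021-02-22"]))

# Row-wise version, mirroring the notebook's helper.
def get_month(x):
    return dt.datetime(x.year, x.month, 1)

row_wise = dates.apply(get_month)

# Vectorized version: convert to a monthly period, then back to a
# timestamp, which defaults to the start of the period.
vectorized = dates.dt.to_period("M").dt.to_timestamp()

print(vectorized.tolist())
```

Both versions should yield the same first-of-month timestamps, so the choice mostly affects speed and readability.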
