### Introduction
Cohort analysis is a technique that helps businesses track customer retention over time. A cohort is a group of customers who share a characteristic, such as the month of their first purchase or the region they live in. By analyzing the retention rate of each cohort, businesses can identify trends and patterns in customer behavior and act to improve retention.
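
To make the idea of a cohort retention rate concrete, here is a minimal sketch on hypothetical toy data. The frame `tx` and its values are made up for illustration, and only the column names `CustomerID` and `InvoiceDate` come from the dataset described below. Cohorts are defined here by the month of first purchase, which is one of several possible choices.

```python
import pandas as pd

# Hypothetical toy transactions: one row per purchase
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3, 4],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10",  # customer 1: January cohort, returns in February
        "2011-01-20", "2011-03-02",  # customer 2: January cohort, returns in March
        "2011-02-07", "2011-03-15",  # customer 3: February cohort, returns in March
        "2011-02-11",                # customer 4: February cohort, never returns
    ]),
})

# Month of each purchase, and month of each customer's first purchase (the cohort)
tx["InvoiceMonth"] = tx["InvoiceDate"].dt.to_period("M").dt.to_timestamp()
tx["CohortMonth"] = tx.groupby("CustomerID")["InvoiceMonth"].transform("min")

# Number of months elapsed between the cohort month and the purchase month
tx["CohortIndex"] = (
    (tx["InvoiceMonth"].dt.year - tx["CohortMonth"].dt.year) * 12
    + (tx["InvoiceMonth"].dt.month - tx["CohortMonth"].dt.month)
)

# Unique active customers per cohort and month offset
cohort_counts = tx.pivot_table(
    index="CohortMonth", columns="CohortIndex",
    values="CustomerID", aggfunc="nunique",
)

# Retention rate: active customers divided by the cohort's initial size
retention = cohort_counts.divide(cohort_counts[0], axis=0)
print(retention)
```

Each row of `retention` is one cohort and each column is the number of months since that cohort's first purchase, so column 0 is always 1.0.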

#### Methodology
- Data Cleaning
- Descriptive Statistics

### The Dataset

#### Description
The dataset contains all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer that mainly sells unique all-occasion gifts. Many of the company's customers are wholesalers.

#### Data Overview
- **File(s):** Online Retail.xlsx (the loading code below reads a CSV copy, `Online Retail.csv`)
- **Table(s):** Online Retail
- **Variables in the table:** InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID and Country

#### Attribute Overview
- **InvoiceNo:** A unique 6-digit number assigned to each transaction. If the number starts with the letter 'C', it indicates a cancellation
- **StockCode:** A unique 5-digit code assigned to each product, sometimes followed by a letter (for example, 85123A)
- **Description:** The name of the product
- **Quantity:** The quantity of each product per transaction
- **InvoiceDate:** The date and time when the transaction was generated
- **UnitPrice:** Product price per unit in sterling
- **CustomerID:** A unique 5-digit number assigned to each customer
- **Country:** The name of the country where the customer resides
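
The cancellation rule for `InvoiceNo` can be turned into a boolean flag. This is a minimal sketch on a hypothetical three-row sample. The frame `sample` and its values are illustrative, and the check is made case-insensitive as a precaution.

```python
import pandas as pd

# Hypothetical sample rows; an invoice number with a "C" prefix marks a cancellation
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381"],
    "Quantity": [6, -1, 12],
})

# Cast to str first (the column may hold mixed types), then test the prefix
is_cancelled = sample["InvoiceNo"].astype(str).str.upper().str.startswith("C")

# Keep only the rows that are not cancellations
purchases = sample[~is_cancelled]
print(purchases)
```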

#### Importing Python libraries and Loading the dataset

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv('Online Retail.csv')

#### Displaying the first 5 observations

In [3]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 08:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 08:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01-12-2010 08:26,3.39,17850.0,United Kingdom


#### Displaying the last 5 observations

In [4]:
df.tail()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,09-12-2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,09-12-2011 12:50,2.1,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,09-12-2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,09-12-2011 12:50,4.15,12680.0,France
541908,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,09-12-2011 12:50,4.95,12680.0,France


#### Displaying the number of observations (rows) and variables (columns)

In [5]:
df.shape

(541909, 8)

#### Displaying the attributes and their data types

In [6]:
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
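
`InvoiceDate` is stored as `object` (plain strings), so any date arithmetic needs a conversion first. A minimal sketch, assuming the day-month-year layout seen in the rows displayed earlier (`01-12-2010 08:26`), where the series `raw` is a hypothetical two-value sample:

```python
import pandas as pd

# Two date strings in the same layout as the InvoiceDate column
raw = pd.Series(["01-12-2010 08:26", "09-12-2011 12:50"], name="InvoiceDate")

# An explicit format avoids day/month ambiguity (01-12-2010 is 1 December 2010)
parsed = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")
print(parsed.dt.month.tolist())
```

On the full dataset, the same call could be applied as `df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%d-%m-%Y %H:%M')`, provided every row follows this layout.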

### Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If the data is incorrect, outcomes and algorithms become unreliable even when they are implemented correctly, which is why data cleaning is a crucial step.

The most common data cleaning practices include:

<ul style="list-style-type:square">
    <li>Removing duplicate or irrelevant observations</li>
    <li>Fixing structural errors such as strange naming conventions, typos or incorrect capitalization</li>
    <li>Filtering unwanted outliers (if needed)</li>
    <li>Handling missing data</li>
</ul>

- Count of non-missing values for each variable

In [7]:
df.count()

InvoiceNo      541909
StockCode      541909
Description    540455
Quantity       541909
InvoiceDate    541909
UnitPrice      541909
CustomerID     406829
Country        541909
dtype: int64

- Detecting missing values

In [8]:
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

- Removing missing values

In [9]:
df.dropna(inplace=True)
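
Called without arguments, `dropna` removes every row that has a missing value in any column. A minimal sketch of how that differs from restricting the check to one column with `subset`, using a hypothetical toy frame (`toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks a Description, row 2 lacks a CustomerID
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "NAPKINS"],
    "CustomerID": [17850.0, 12680.0, np.nan],
})

# Drops any row with at least one missing value (rows 1 and 2)
dropped_any = toy.dropna()

# Drops only rows where CustomerID is missing (row 2)
dropped_id = toy.dropna(subset=["CustomerID"])

print(len(dropped_any), len(dropped_id))
```

Which variant is appropriate depends on which columns the later analysis actually needs.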

- Detecting redundant data

In [10]:
df.duplicated().sum()

5225

- Removing redundant data

In [11]:
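# Note: drop_duplicates() returns a de-duplicated copy, which is displayed here;
# df itself is not modified unless the result is assigned back to it
df.drop_duplicates()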

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 08:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 08:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01-12-2010 08:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,09-12-2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,09-12-2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,09-12-2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,09-12-2011 12:50,4.15,12680.0,France
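
Because `drop_duplicates()` returns a new DataFrame rather than changing the original, the result has to be assigned back for later cells to see the de-duplicated data. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "Quantity": [6, 6, 8],
})

# The result is discarded here, so toy keeps all of its rows
toy.drop_duplicates()
rows_before = len(toy)

# Assigning the result back makes the removal persist
toy = toy.drop_duplicates()
rows_after = len(toy)

print(rows_before, rows_after)
```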


### Descriptive statistics

Descriptive statistics summarize a given dataset, which can represent either an entire population or a sample of it. They can be broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, whereas measures of variability include the standard deviation, variance, and the minimum and maximum values. Skewness and kurtosis additionally describe the shape of the distribution.

- **count:** Count of non-empty values
- **mean:** The mean (average) value
- **std:** Standard deviation
- **min:** Minimum value
- **25%:** The 25th percentile (first quartile)
- **50%:** The 50th percentile (median)
- **75%:** The 75th percentile (third quartile)
- **max:** Maximum value

In [12]:
df.describe()

,Quantity,UnitPrice,CustomerID
count,406829.0,406829.0,406829.0
mean,12.061303,3.460471,15287.69057
std,248.69337,69.315162,1713.600303
min,-80995.0,0.0,12346.0
25%,2.0,1.25,13953.0
50%,5.0,1.95,15152.0
75%,12.0,3.75,16791.0
max,80995.0,38970.0,18287.0
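
The percentile rows reported by `describe()` can be reproduced with `quantile`. A minimal sketch on a hypothetical five-value series (`quantity` and its values are illustrative):

```python
import pandas as pd

# Hypothetical quantities, already sorted for readability
quantity = pd.Series([1, 2, 5, 12, 80], name="Quantity")

summary = quantity.describe()

# quantile() returns the same values as the 25%, 50% and 75% rows of describe()
q25, q50, q75 = quantity.quantile([0.25, 0.50, 0.75])

print(q25, q50, q75)
```

The 50th percentile is the median, so `q50` also equals `quantity.median()`.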


#### Summarizing the dataframe

`df.info()` prints a short summary of the DataFrame: the index range, each column with its non-null count and dtype, and the memory usage.

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    406829 non-null  object 
 1   StockCode    406829 non-null  object 
 2   Description  406829 non-null  object 
 3   Quantity     406829 non-null  int64  
 4   InvoiceDate  406829 non-null  object 
 5   UnitPrice    406829 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      406829 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB


#### Variance Analysis

Variance is the expected value of the squared deviation of a random variable from its mean. In other words, variance measures how far the data points are spread out from their mean.

In [14]:
df.var(numeric_only=True)

Quantity      6.184839e+04
UnitPrice     4.804592e+03
CustomerID    2.936426e+06
dtype: float64
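
The definition above can be checked by hand. This is a minimal sketch on a hypothetical four-value series (`x` is illustrative). By default, pandas `var` computes the sample variance, which divides by n - 1 (`ddof=1`), while `ddof=0` gives the population variance.

```python
import pandas as pd

# Hypothetical data with mean 4.0
x = pd.Series([2.0, 4.0, 4.0, 6.0])
mean = x.mean()

# Squared deviations from the mean: 4, 0, 0, 4
squared_dev = (x - mean) ** 2

# Sample variance divides by n - 1; population variance divides by n
sample_var = squared_dev.sum() / (len(x) - 1)
population_var = squared_dev.sum() / len(x)

print(sample_var, x.var())
print(population_var, x.var(ddof=0))
```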

#### Standard Deviation Analysis

Standard deviation (or σ) is the square root of the variance and measures how dispersed the data is in relation to the mean. It is never negative: a standard deviation close to zero indicates that the data points lie close to the mean, whereas a large standard deviation indicates that they are spread over a wide range around it.

In [15]:
df.std(numeric_only=True)

Quantity       248.693370
UnitPrice       69.315162
CustomerID    1713.600303
dtype: float64
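
Since the standard deviation is the square root of the variance, the two outputs above can be cross-checked against each other. A minimal sketch, where `x` is a hypothetical series and the two `reported_*` values are the Quantity figures copied from the outputs above:

```python
import math

import pandas as pd

# Hypothetical data: the square root of the variance equals the standard deviation
x = pd.Series([2.0, 4.0, 4.0, 6.0])
std_from_var = math.sqrt(x.var())

# Quantity figures copied from the outputs above
reported_std = 248.693370
reported_var = 6.184839e+04

# Squaring the reported std should recover the reported variance, up to rounding
difference = abs(reported_std ** 2 - reported_var)

print(std_from_var, x.std())
print(difference)
```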

From the analysis of Variance and Standard Deviation it can be concluded that:

<ul style="list-style-type:square">
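    <li>Quantity and UnitPrice are widely dispersed: their standard deviations (about 248.7 and 69.3) are many times larger than their means (about 12.1 and 3.5), which points to a small number of extreme values, such as the quantities of -80995 and 80995 and the maximum unit price of 38970</li>
    <li>CustomerID has the largest variance and standard deviation, but it is an identifier rather than a measurement, so these statistics only reflect the range of ID numbers (12346 to 18287) and carry no analytical meaning</li>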
</ul>