<a href="https://colab.research.google.com/github/Kamal-4u/End-to-End-Machine-Learning/blob/main/Retail_Customer_Segmentation_Unsupervised2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Kamalkant Singh


# **Project Summary -**

This project analyzes a transnational dataset of transactions that took place between December 1, 2010, and December 9, 2011. The dataset belongs to a UK-based online retail company specializing in unique all-occasion gifts. Notably, a significant portion of the company's clientele consists of wholesale buyers.

Several key libraries are used for this analysis. Pandas handles data manipulation and aggregation, which is crucial for cleaning and organizing the dataset. Matplotlib and Seaborn are used for data visualization, giving insight into customer behavior. NumPy provides computationally efficient numerical operations. Lastly, Scikit-learn is used to cluster the customer base into segments.

The primary objective of this project is to identify major customer segments within this online retail company's customer base. By scrutinizing the transactional data, the team aims to categorize customers into distinct segments, potentially based on their purchasing behaviors, frequency, or other relevant factors.

In essence, this data-driven analysis aims to provide the company with a deeper understanding of its customer base. By uncovering major customer segments, the company can tailor its marketing and business strategies more effectively. For instance, they may identify high-value customers, allowing for targeted promotions or loyalty programs. Simultaneously, understanding the needs and preferences of wholesale buyers may lead to improved inventory management and supply chain optimization.

This project underscores the significance of data analysis and segmentation in enhancing business operations. By harnessing the power of data and employing the aforementioned libraries, the company is poised to make informed decisions that can positively impact its bottom line and customer satisfaction.

# **GitHub Link -**

GitHub Link:-https://github.com/Kamal-4u/End-to-End-Machine-Learning


# **Problem Statement**


**A primary goal for any business is to understand its target customers: how they operate and how they use its services. Every customer may use a company's services differently. The problem we're trying to solve is to define this online retail company's customers by identifying the behaviors and patterns with which they use the company's services.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import datetime as dt
from datetime import datetime
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import f_oneway, ttest_1samp
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv('/content/drive/MyDrive/Projects/Online_Retail.csv')

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The dataset contains duplicate rows as well as missing values in the Description and CustomerID columns: Description has 1454 missing values and CustomerID has 135080.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
#Looking for the description of the dataset to get insights of the data
df.describe()

### Variables Description

* InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

* StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

* Description: Product (item) name. Nominal.

* Quantity: The quantities of each product (item) per transaction. Numeric.

* InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.

* UnitPrice: Unit price. Numeric, product price per unit in sterling.

* CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

* Country: Country name. Nominal, the name of the country where each customer resides.
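
The cancellation rule described for InvoiceNo can be checked with a small sketch. The rows below are made up for illustration (they are not taken from Online_Retail.csv), and the names `toy`, `is_cancelled`, and `completed` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself reads Online_Retail.csv
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "C536383"],
    "Quantity": [6, -1, 24, -1],
})

# Cast to string first, since InvoiceNo can be parsed as a mixed-type column
invoice_no = toy["InvoiceNo"].astype(str)

# A cancellation is an invoice whose code starts with 'C'
is_cancelled = invoice_no.str.startswith("C")
print("Cancelled lines:", is_cancelled.sum())

# Keep only the completed transactions
completed = toy[~is_cancelled]
print(completed)
```

Using `str.startswith` instead of `str.contains` only matches the leading character, which is what the variable description states.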

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
#print the unique value
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Count the duplicate rows
value = df.duplicated().sum()
print("The number of duplicate rows in the dataset:", value)

In [None]:
# Drop the duplicate rows and confirm that none remain
df = df.drop_duplicates()
df.duplicated().sum()

In [None]:
df.shape

In [None]:
# Drop the rows whose InvoiceNo contains 'C', because 'C' marks a cancellation
df["InvoiceNo"] = df["InvoiceNo"].astype("str")

In [None]:
df[df['InvoiceNo'].str.contains('C')]

In [None]:
df=df[~df['InvoiceNo'].str.contains('C')]
df.shape

In [None]:
df.columns

In [None]:
# Convert InvoiceDate to datetime and split it into year, month, day, hour, minute, and second
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["InvoiceDate_year"] = df['InvoiceDate'].dt.year
df['InvoiceDate_month'] = df['InvoiceDate'].dt.month
df['InvoiceDate_day'] = df['InvoiceDate'].dt.day
df['InvoiceDate_hour'] = df['InvoiceDate'].dt.hour
df['InvoiceDate_minute'] = df['InvoiceDate'].dt.minute
df['InvoiceDate_second'] = df['InvoiceDate'].dt.second

In [None]:
print("Columns and data types")
pd.DataFrame(df.dtypes).rename(columns = {0:'dtype'})

In [None]:
df.columns

In [None]:
df.shape

### What all manipulations have you done and insights you found?

* There were some duplicate rows, so I removed them.
* I dropped the rows whose InvoiceNo starts with 'C', because 'C' indicates a cancellation.
* I split the InvoiceDate column into 'year', 'month', 'day', 'hour', 'minute', and 'second' columns.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Check the number of unique customer IDs
print('The no. of customers = ', df['CustomerID'].nunique())
# Count the number of transactions for each customer and sort them in descending order
active_customers = df['CustomerID'].value_counts().reset_index()
active_customers.columns = ['CustomerID', 'Count']  # Renaming columns for clarity

# Display the most active customer
most_active_customer = active_customers.iloc[0]

print("The most active customer is CustomerID:", most_active_customer['CustomerID'])
print("Number of transactions:", most_active_customer['Count'])

# Plot the top 5 active customers
plt.figure(figsize=(9, 8))
plt.title('Top 5 active customers ID')
sns.barplot(x='CustomerID', y='Count', data=active_customers.head(5))
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is the most suitable choice for comparing values across a categorical variable, here the number of transactions per customer.

##### 2. What is/are the insight(s) found from the chart?

The top 5 most active customers who have been making regular purchases have the following IDs: 17841, 14911, 14096, 12748, and 14606. These customers can be considered special because they are highly likely to make frequent purchases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can identify the most active customers, and the company can then concentrate its efforts on nurturing and retaining these customers. By doing so, the frequency of repeat purchases increases, leading to an increase in revenue.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Analysis of Categorical Features
categorical_columns = list(df.select_dtypes(['object']).columns)
categorical_features = pd.Index(categorical_columns)

# Analysis of Description Variable
Description_df = df['Description'].value_counts().reset_index()
# Assign the column names directly; the names produced by reset_index() differ across pandas versions
Description_df.columns = ['Description_Name', 'Count']

# Display the first few rows of Description_df
print(Description_df.head())

# Plot the top 5 Product Names
plt.figure(figsize=(9, 8))
plt.title('Top 5 Product Name')
sns.barplot(x='Description_Name', y='Count', data=Description_df[:5])
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is the ideal choice for comparing values across a categorical variable, here the number of transactions per product.

##### 2. What is/are the insight(s) found from the chart?

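WHITE HANGING HEART T-LIGHT HOLDER is the most frequently purchased product, appearing in approximately 2315 invoice lines. JUMBO BAG RED RETROSPOT is the second most frequently purchased product, with almost 2112 invoice lines. Note that these counts are numbers of invoice lines, not numbers of units sold.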

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The high counts of these top products indicate a strong demand and popularity among customers. This information can be utilized to ensure sufficient stock availability and plan targeted promotions, which, in turn, can boost product sales and increase revenue.
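
One caveat when reading these counts: `value_counts()` counts how many invoice lines mention a product, which is not the same as the number of units sold. The sketch below uses made-up rows and hypothetical names (`toy`, `line_counts`, `units_sold`) to show how the two rankings can differ.

```python
import pandas as pd

# Hypothetical toy data: MUG appears on fewer lines but in larger quantities
toy = pd.DataFrame({
    "Description": ["MUG", "MUG", "BAG", "BAG", "BAG"],
    "Quantity": [10, 5, 1, 1, 1],
})

# Number of invoice lines per product (what value_counts reports)
line_counts = toy["Description"].value_counts()

# Number of units sold per product
units_sold = (
    toy.groupby("Description")["Quantity"]
    .sum()
    .sort_values(ascending=False)
)

print(line_counts)
print(units_sold)
```

If units sold are the quantity of interest, summing `Quantity` per product is the more direct measure.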

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(9,7))
plt.title('Bottom 5 product Name')
sns.barplot(x='Description_Name', y='Count', data=Description_df[-5:])
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart draws the categories on one axis and their corresponding values on the other axis.

##### 2. What is/are the insight(s) found from the chart?

It has been observed that items such as "TINY CRYSTAL BRACELET RED," "4 GOLD FLOCK CHRISTMAS BALLS," "ZINC STAR T-LIGHT HOLDER," "BLUE GINGHAM ROSE CUSHION COVER," and "PAPER CRAFT, LITTLE BIRDIE" have notably low counts, suggesting limited demand or popularity among customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Analysis of StockCode Variable
StockCode_df = df['StockCode'].value_counts().reset_index()
# Assign the column names directly; the names produced by reset_index() differ across pandas versions
StockCode_df.columns = ['StockCode_Name', 'Count']

plt.figure(figsize=(9, 7))
plt.title('Top 5 Stock Names')
sns.barplot(x='StockCode_Name', y='Count', data=StockCode_df.head(5))  # Use head(5) to select the top 5
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()


##### 1. Why did you pick the specific chart?

For categorical variables, a bar chart is the best option.

##### 2. What is/are the insight(s) found from the chart?

StockCode-85123A is the top-selling product, and StockCode-85099B is the second-highest selling product.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Analyzing the demand for top stock codes enables data-driven decisions for procurement, production, and restocking, optimizing sales, reducing excess inventory, and enhancing customer satisfaction while improving revenue and profitability.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(9,8))
plt.title('Bottom 5 Stock Name')
sns.barplot(x='StockCode_Name',y='Count',data=StockCode_df[-5:])
plt.show()

##### 1. Why did you pick the specific chart?

For comparing categorical variables, a bar chart is the best option.

##### 2. What is/are the insight(s) found from the chart?

The bottom 5 stock codes represent the least popular items, indicating limited customer demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights have been observed to create a positive business impact. By identifying the least popular stock codes, the company can optimize its inventory management. In turn, the company can focus on high-demand products, thereby increasing revenue.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Analysis of the 'Country' variable
country_df = df['Country'].value_counts().reset_index()
country_df.columns = ['Country_Name', 'Count']

# Display the top 5 countries with the most customers
top_5_countries = country_df.head(5)

# Create a pie chart to visualize the top 5 countries
plt.figure(figsize=(10, 8))
plt.title('Top 5 Countries based on the Most Number of Customers')
plt.pie(top_5_countries['Count'], labels=top_5_countries['Country_Name'], autopct='%1.1f%%', startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.

plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows each country's share of the total among the top five countries, which makes the dominance of a single country easy to see.

##### 2. What is/are the insight(s) found from the chart?

It has been observed that the UK has the highest number of customers, while Germany, France, and Ireland have nearly equal numbers of customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By identifying the countries with the highest number of customers, the company can concentrate its marketing efforts and allocate resources accordingly. This approach can lead to targeted marketing campaigns, enhanced customer engagement, and increased sales in these pivotal markets.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Bar plot of the 5 countries with the fewest customers

plt.figure(figsize=(8,7))
plt.title('Bottom 5 Countries by Number of Customers')
sns.barplot(x='Country_Name',y='Count',data=country_df[-5:])
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the most suitable choice for comparing values across a categorical variable.
In this case, we are plotting the customer counts of the five countries at the bottom of the ranking.






##### 2. What is/are the insight(s) found from the chart?

There are very few customers from Saudi Arabia, and Bahrain is the second country with the lowest number of customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By comprehending the challenges or barriers specific to certain countries, the company can formulate tailored strategies for market entry, enabling it to surmount obstacles, allure a larger customer base, and attain business growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Analysis of numeric features
numerical_columns = list(df.select_dtypes(['int64', 'float64']).columns)
numerical_features = pd.Index(numerical_columns)

for col in numerical_features:
    feature = df[col]

    # Histogram with the mean (magenta) and median (cyan) marked
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature.hist(bins=50, ax=ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
    plt.show()

    # Distribution plot with a KDE overlay
    # (sns.distplot is deprecated; sns.histplot with kde=True replaces it)
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    sns.histplot(feature.dropna(), bins=50, kde=True, ax=ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
    plt.show()

    print("Skewness:", feature.skew())
    print("Kurtosis:", feature.kurt())

##### 1. Why did you pick the specific chart?

A histogram is a suitable choice for visualizing the distribution of numerical data, allowing us to visualize how the data is distributed across different ranges or bins. It provides insights into the frequency or count of data points within each bin, giving a sense of the data's distribution pattern.

##### 2. What is/are the insight(s) found from the chart?

It has been observed that some of the data are nearly normally distributed, some exhibit a positive skew, and others display a significant positive skew, while some are negatively skewed.
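
To make the skewness readings easier to interpret, here is a minimal sketch on synthetic data (not the retail dataset). The series names `symmetric` and `right_skewed` are hypothetical; the point is only that a positive skew value goes together with a long right tail and a mean above the median.

```python
import numpy as np
import pandas as pd

# Synthetic samples with a fixed seed so the run is repeatable
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=10_000))          # roughly symmetric
right_skewed = pd.Series(rng.exponential(size=10_000))  # long right tail

# Skewness near 0 suggests symmetry; a clearly positive value suggests a right tail
print("Symmetric skew:", symmetric.skew())
print("Right-skewed skew:", right_skewed.skew())

# For a right-skewed sample the mean is pulled above the median
print("Mean above median:", right_skewed.mean() > right_skewed.median())
```

The same reading applies to the dashed mean and median lines drawn on the histograms: the further apart they are, the more skewed the column.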

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights help create a positive business impact:

By understanding the distribution of data, it helps in making decisions and improving strategies in various business areas, such as marketing, sales, and operations.
By detecting and addressing outliers, businesses can improve data quality, enhancing decision-making accuracy.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Plot a boxplot for each numerical feature
for col in numerical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    df.boxplot(column=col, ax=ax)
    ax.set_title('Box plot of ' + col)
    plt.show()

##### 1. Why did you pick the specific chart?

Box plots offer a succinct and informative visualization of numerical data, showcasing the distribution, spread, outliers, and skewness.

##### 2. What is/are the insight(s) found from the chart?

Most of the columns do not contain outliers, but a few columns do have outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from these plots can help support business growth.
* The identification of outliers in the data through boxplots can help businesses to identify and address data quality issues.
* By detecting and addressing outliers, businesses can improve the accuracy and reliability of their data, leading to better decision-making and improved business performance.

#### Chart - 10 - Correlation Heatmap

In [None]:
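# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
# Restrict to numeric columns; newer pandas versions raise an error on text columns
correlation = df.select_dtypes('number').corr()
# abs() shows the strength of each correlation, not its sign
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')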

##### 1. Why did you pick the specific chart?

I want to view the correlation between all the columns at once. To achieve this, I utilized a heatmap, which displays the correlations between columns in a single chart.

##### 2. What is/are the insight(s) found from the chart?

The heatmap plots absolute values, so it shows the strength of each relationship rather than its sign. The notable correlations are between InvoiceDate_day and InvoiceDate_month, between InvoiceDate_minute and InvoiceDate_hour, and between InvoiceDate_hour and InvoiceDate_month.
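
Because the heatmap code plots `abs(correlation)`, the direction of a relationship cannot be read from the chart. The sketch below uses two made-up columns (`x` and `y`, both hypothetical) to show how a perfectly negative correlation looks identical to a positive one after taking the absolute value.

```python
import pandas as pd

# Hypothetical toy columns: y decreases exactly as x increases
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 8, 6, 4, 2],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric column, excluded below
})

# Keep numeric columns only, then compute the signed Pearson correlation
signed = toy.select_dtypes("number").corr()
strength = signed.abs()

print(signed.loc["x", "y"])    # signed value keeps the direction
print(strength.loc["x", "y"])  # absolute value keeps only the strength
```

Plotting the signed matrix (for example with a diverging colormap centered at zero) is one way to keep the direction visible.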

#### Chart - 11 - Pair Plot

In [None]:
# Pair Plot visualization code
# pairplot creates its own figure and selects the numeric columns automatically
sns.pairplot(df, kind="scatter", diag_kind="kde")
plt.show()

##### 1. Why did you pick the specific chart?

Pairplot illustrates the pairwise relationships between numerical columns. It automatically detects numerical columns and generates pairplots for all possible pairs of numerical columns. My intent in using pairplot was to examine the correlations between these numerical columns.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

The means of invoices across different months are consistent. The mean quantity matches a specified value, and the mean unit price is equivalent to a predetermined value. These statements highlight the uniformity and agreement in statistical measures, which can be essential for data analysis and decision-making.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): The means of invoices across different months are equal.

Alternative hypothesis (H1): At least one of the means of invoices across different months is different.

Test Type: one-way ANOVA test.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Assumption: "invoices" is taken to mean the value of each invoice line (Quantity * UnitPrice).
# Grouping the month column by itself gives constant groups, which ANOVA cannot compare.
line_value = df['Quantity'] * df['UnitPrice']
# Prepare one group of line values per month for the one-way ANOVA test
groups = [values.to_numpy() for _, values in line_value.groupby(df['InvoiceDate_month'])]
# Perform the one-way ANOVA test
f_statistic, p_value = f_oneway(*groups)
print("F-statistic:", f_statistic, "p-value:", p_value)
# Set the significance level (alpha)
alpha = 0.05
# Compare p-value to the significance level
if p_value < alpha:
    print("Reject null hypothesis. At least one of the means of invoices across different months is different.")
else:
    print("Fail to reject null hypothesis. The means of invoices across different months are equal.")



##### Which statistical test have you done to obtain P-Value?

I used the F-statistic. The F-statistic test is commonly used in statistical analysis, specifically in analysis of variance (ANOVA) tests. ANOVA is used to compare the means of two or more groups to determine if there are any significant differences between them. The F-statistic is the test statistic used in ANOVA to compare the variation between group means with the variation within groups.

##### Why did you choose the specific statistical test?

The data is treated as approximately normally distributed, which is why I used the F-statistic test.
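
As a minimal sketch of how a one-way ANOVA compares group means, the example below builds one group per month from synthetic values and passes them to `f_oneway`. The column names `month` and `value` and the numbers are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic data: months 1 and 2 share a mean of 10, month 3 has a mean of 12
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "month": np.repeat([1, 2, 3], 200),
    "value": np.concatenate([
        rng.normal(10, 1, 200),
        rng.normal(10, 1, 200),
        rng.normal(12, 1, 200),
    ]),
})

# One array of measurements per month; the measured variable must differ
# from the grouping variable, otherwise every group is constant
groups = [g["value"].to_numpy() for _, g in toy.groupby("month")]

f_statistic, p_value = f_oneway(*groups)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
```

A small p-value here reflects the shifted mean of the third group; with all three means equal, the F-statistic would typically be close to 1.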

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): The mean quantity is equal to a specified value.

Alternative hypothesis (H1): The mean quantity is not equal to the specified value.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Extract the "Quantity" column
quantity = df['Quantity']
# Specify the value to test against
specified_value = 100  # For example, testing against a mean quantity of 100
# Perform the one-sample t-test
t_statistic, p_value = ttest_1samp(quantity, specified_value)
# Set the significance level (alpha)
alpha = 0.05
# Compare p-value to the significance level
if p_value < alpha:
    print("Reject null hypothesis. The mean quantity is significantly different from the specified value.")
else:
    print("Fail to reject null hypothesis. The mean quantity is not significantly different from the specified value.")

##### Which statistical test have you done to obtain P-Value?

I have used t-Test as the statistical testing to obtain P-Value and found the result that Null hypothesis has been rejected and the mean quantity is significantly different from the specified value.

##### Why did you choose the specific statistical test?

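The data is skewed rather than normally distributed, but the sample is very large, so the sample mean is approximately normally distributed (central limit theorem) and a one-sample t-test is a reasonable choice for comparing it against a specified value.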
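
The following sketch illustrates a one-sample t-test on a large, right-skewed synthetic sample. The data, the scale of 2, and the comparison value of 100 are assumptions chosen for illustration; only `ttest_1samp` itself is taken from the notebook.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Large right-skewed synthetic sample with a true mean of about 2
rng = np.random.default_rng(7)
sample = rng.exponential(scale=2.0, size=5_000)

# Test the sample mean against a value far from the true mean
specified_value = 100
t_statistic, p_value = ttest_1samp(sample, specified_value)

print("Sample mean:", sample.mean())
print("t-statistic:", t_statistic)
print("p-value:", p_value)

alpha = 0.05
reject_null = p_value < alpha
print("Reject null hypothesis:", reject_null)
```

With thousands of observations, the t-test relies on the sample mean being approximately normal rather than on the raw values being normal, which is what makes it usable on skewed columns.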

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


Null hypothesis (H0): The mean unit price is equal to a specified value.

Alternative hypothesis (H1): The mean unit price is not equal to the specified value.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Extract the "UnitPrice" column
unit_price = df['UnitPrice']
# Specify the value to test against
specified_value = 110  # For example, testing against a mean unit price of 110
# Perform the one-sample t-test
t_statistic, p_value = ttest_1samp(unit_price, specified_value)
# Set the significance level (alpha)
alpha = 0.05
# Compare p-value to the significance level
if p_value < alpha:
    print("Reject null hypothesis. The mean unit price is significantly different from the specified value.")
else:
    print("Fail to reject null hypothesis. The mean unit price is not significantly different from the specified value.")

##### Which statistical test have you done to obtain P-Value?

I utilized a t-Test for statistical analysis, obtaining a P-Value that led to the rejection of the null hypothesis. This indicates a significant difference between the mean unit price and the specified value.

##### Why did you choose the specific statistical test?


Our data distribution exhibits positive skewness. Because the sample is very large, the sample mean is still approximately normally distributed, so I employed a one-sample t-test to compare it against the specified value.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
missing_counts = df.isnull().sum()
# To check missing values for a specific column, you can use df['ColumnName'].isnull().sum()
description_missing = df['Description'].isnull().sum()
customer_id_missing = df['CustomerID'].isnull().sum()
# Print the missing value counts for each column
print("Missing Value Counts for All Columns:")
print(missing_counts)
# Print the missing value count for the 'Description' column
print("Missing Value Count for 'Description' column:", description_missing)
# Print the missing value count for the 'CustomerID' column
print("Missing Value Count for 'CustomerID' column:", customer_id_missing)

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df.shape


#### What all missing value imputation techniques have you used and why did you use those techniques?

I simply removed rows where missing values were present in this dataset. I removed all null and missing values for Description (1454) and CustomerID (134658).

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Plot a boxplot for each numerical feature
for col in numerical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    df.boxplot(column=col, ax=ax)
    ax.set_title('Box plot of ' + col)
    plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?


I kept outliers in 'Quantity' and 'UnitPrice' as they can provide valuable insights into customer behavior, preferences, and purchasing patterns. Outliers may represent high-value transactions, bulk purchases, or unique buying behaviors that are essential to understand for effective customer segmentation. Removing outliers may lead to the loss of such valuable information.
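
Keeping the outliers does not rule out measuring them. The sketch below flags outliers with the 1.5 × IQR rule, the same rule a box plot uses for its whiskers, without removing any rows. The rule is an assumption added here for illustration (the notebook does not apply it), and the values are made up.

```python
import pandas as pd

# Hypothetical toy quantities with one bulk purchase
quantity = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="Quantity")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = quantity.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the outliers instead of dropping them
is_outlier = (quantity < lower) | (quantity > upper)
outliers = quantity[is_outlier]

print("Fences:", lower, upper)
print("Flagged values:", outliers.tolist())
```

A flag column of this kind lets the bulk purchases stay in the data while still being easy to inspect separately.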

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Assuming 'df' is your DataFrame, let's create a new feature 'Day' from 'InvoiceDate'
df['Day'] = df['InvoiceDate'].dt.day_name()

# Calculate the 'TotalAmount' by multiplying 'Quantity' and 'UnitPrice'
df['TotalAmount'] = df['Quantity'] * df['UnitPrice']

# Display summary statistics for the 'TotalAmount' column
total_amount_summary = df['TotalAmount'].describe()
print(total_amount_summary)

# Create a DataFrame to count the occurrences of each day
day_df = df['Day'].value_counts().reset_index()
# Assign the column names directly; the names produced by reset_index() differ across pandas versions
day_df.columns = ['Day_Name', 'Count']

# Plot a bar chart of the day counts
plt.figure(figsize=(11,6))
plt.title('Day')
sns.barplot(x='Day_Name', y='Count', data=day_df)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()  # Show the plot


In [None]:
# Create a DataFrame of transaction counts per month
month_df = df['InvoiceDate_month'].value_counts().reset_index()
# Assign the column names directly; the names produced by reset_index() differ across pandas versions
month_df.columns = ['Month_Name', 'Count']
month_df

In [None]:
# Create a bar chart from the month DataFrame
plt.figure(figsize=(11,6))
plt.title('Month')
sns.barplot(x='Month_Name',y='Count',data=month_df)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Create a DataFrame of transaction counts per hour
hour_df = df['InvoiceDate_hour'].value_counts().reset_index()
# Assign the column names directly; the names produced by reset_index() differ across pandas versions
hour_df.columns = ['Hour_Name', 'Count']
hour_df

In [None]:
# Create a bar chart from the hour DataFrame
plt.figure(figsize=(13,8))
plt.title('Hour')
sns.barplot(x='Hour_Name',y='Count',data=hour_df)

In [None]:
# Function to divide the hour of the day into 3 categories
def time_type(time):
  if 6 <= time <= 11:
    return 'Morning'
  elif 12 <= time <= 17:
    return 'Afternoon'
  else:
    return 'Evening'

In [None]:
df['Time_type']=df['InvoiceDate_hour'].apply(time_type)

In [None]:
plt.figure(figsize=(9,8))
plt.title('Time_type')
sns.countplot(x='Time_type',data=df)
plt.show()

##### What all feature selection methods have you used  and why?

From this graph we can see that most customers purchase items in the afternoon.


##### Which all features you found important and why?

Most customers have purchased gifts in the months of November, October, and December. Fewer customers have purchased gifts in the months of April, January, and February. From this graph, we can see that most customers make purchases in the afternoon. Moderate numbers of customers buy items in the morning, and the fewest customers make purchases in the evening.
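The morning/afternoon/evening comparison can also be computed without a row-by-row `apply`. The following is a minimal sketch, assuming the same hour boundaries as `time_type` (6 to 11 is morning, 12 to 17 is afternoon, everything else is evening). The `hours` series is made-up illustration data, not the retail dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical invoice hours, for illustration only
hours = pd.Series([7, 9, 12, 15, 17, 18, 20, 6])

# Series.between is inclusive on both ends by default
conditions = [hours.between(6, 11), hours.between(12, 17)]
labels = ['Morning', 'Afternoon']

# Any hour that matches neither condition falls back to 'Evening'
time_type_col = pd.Series(
    np.select(conditions, labels, default='Evening'),
    index=hours.index,
)

counts = time_type_col.value_counts()
print(counts)
```

On the real data, `hours` would be `df['InvoiceDate_hour']`, and `counts` holds the values that the count plot displays.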

### 5. Data Transformation



#### Creating RFM model
Before applying any clustering algorithm, it is necessary to determine the quantitative factors on which the algorithm will perform segmentation. Examples are features such as the amount spent, how active the customer is, and the time since their last visit.

The RFM model, which stands for Recency, Frequency, and Monetary, is one such step. For each customer we determine recency (days since the last purchase), frequency (how often the customer purchases), and monetary value (the customer's total expenditure). A further step usually bins each of these features and calculates a score for each customer. However, that scoring approach does not require machine learning algorithms, as the segmentation can be done manually. Therefore, the clustering algorithms are fed the RFM features directly rather than the scores.

Recency = Latest Date - Last Invoice Date,

Frequency = count of invoice numbers (transactions) for each customer,

Monetary = Sum of Total Amount for each customer
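The three definitions above can be checked by hand on a small example. This is a minimal sketch on a hypothetical six-row table (`toy`), reusing the notebook's column names. The reference date is assumed to be one day after the last invoice, and frequency counts invoice rows, as the notebook does.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy transactions; column names mirror the notebook's df
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 3],
    'InvoiceNo': ['A1', 'A2', 'B1', 'B2', 'B3', 'C1'],
    'InvoiceDate': pd.to_datetime([
        '2011-12-01', '2011-12-08', '2011-11-01',
        '2011-11-15', '2011-12-09', '2011-10-10',
    ]),
    'TotalAmount': [10.0, 20.0, 5.0, 5.0, 15.0, 100.0],
})

# Reference date: one day after the last invoice in the toy data
latest_date = toy['InvoiceDate'].max() + dt.timedelta(days=1)

# Named aggregation produces the R, F and M columns in one pass
toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda s: (latest_date - s.max()).days),
    Frequency=('InvoiceNo', 'count'),
    Monetary=('TotalAmount', 'sum'),
)
print(toy_rfm)
```

Customer 3 bought once, long ago, but spent the most. This is the kind of contrast the three features are meant to separate.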


#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
#Set Latest date 2011-12-10 as last invoice date was 2011-12-09. This is to calculate the number of days from recent purchase
Latest_Date = dt.datetime(2011,12,10)

#Create RFM Modelling scores for each customer
rfm_df = df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

#Convert Invoice Date into type int
rfm_df['InvoiceDate'] = rfm_df['InvoiceDate'].astype(int)

#Rename column names to Recency, Frequency and Monetary
rfm_df.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalAmount': 'Monetary'}, inplace=True)

rfm_df.reset_index().head()

In [None]:
#Descriptive Statistics (Recency)
rfm_df.Recency.describe()

In [None]:
#Recency distribution plot
x = rfm_df['Recency']
plt.figure(figsize=(13,5))
sns.histplot(x, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases

In [None]:
#Descriptive Statistics (Frequency)
rfm_df.Frequency.describe()

In [None]:
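#Frequency distribution plot, taking observations which have frequency less than 1000
x = rfm_df.query('Frequency < 1000')['Frequency']
plt.figure(figsize=(13,5))
sns.histplot(x, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases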

In [None]:
#Descriptive Statistics (Monetary)
rfm_df.Monetary.describe()

In [None]:
#Monetary distribution plot, taking observations which have monetary value less than 10000
x = rfm_df.query('Monetary < 10000')['Monetary']
plt.figure(figsize=(13,5))
sns.histplot(x, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases

In [None]:
# Handle negative and zero values to avoid issues during log transformation
def handle_neg_n_zero(num):
    if num <= 0:
        return 1
    else:
        return num

# Apply handle_neg_n_zero function to Recency and Monetary columns
rfm_df['Recency'] = [handle_neg_n_zero(x) for x in rfm_df['Recency']]
rfm_df['Monetary'] = [handle_neg_n_zero(x) for x in rfm_df['Monetary']]

# Perform Log transformation to bring data into normal or near-normal distribution
# (np.log is applied to the whole frame at once instead of row by row)
Log_Tfd_Data = np.log(rfm_df[['Recency', 'Frequency', 'Monetary']]).round(3)

# Calculate quantiles on the transformed data
quantiles = Log_Tfd_Data.quantile(q=[0.25, 0.5, 0.75])
quantiles = quantiles.to_dict()

# Define scoring functions for R, F, and M segments on transformed data
def RScoring(x, p, d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

def FnMScoring(x, p, d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1

# Calculate R, F, and M segment values on the transformed data
# (the quantiles come from Log_Tfd_Data, so every score must use the log-transformed columns)
rfm_df['R'] = Log_Tfd_Data['Recency'].apply(RScoring, args=('Recency', quantiles,))
rfm_df['F'] = Log_Tfd_Data['Frequency'].apply(FnMScoring, args=('Frequency', quantiles,))
rfm_df['M'] = Log_Tfd_Data['Monetary'].apply(FnMScoring, args=('Monetary', quantiles,))

# Calculate and add RFMGroup value column showing combined concatenated score of RFM
rfm_df['RFMGroup'] = rfm_df['R'].map(str) + rfm_df['F'].map(str) + rfm_df['M'].map(str)

# Calculate and add RFMScore value column showing total sum of RFMGroup values
rfm_df['RFMScore'] = rfm_df[['R', 'F', 'M']].sum(axis=1)

rfm_df.head()

In [None]:
#Data distribution after log transformation for Recency
Recency_Plot = Log_Tfd_Data['Recency']
plt.figure(figsize=(9,8))
sns.histplot(Recency_Plot, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases

In [None]:
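#Data distribution after log transformation for Frequency
# (no threshold needed: on the log scale every value is far below 1000)
Frequency_Plot = Log_Tfd_Data['Frequency']
plt.figure(figsize=(9,8))
sns.histplot(Frequency_Plot, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases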

In [None]:
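#Data distribution after log transformation for Monetary
# (no threshold needed: on the log scale every value is far below 10000)
Monetary_Plot = Log_Tfd_Data['Monetary']
plt.figure(figsize=(9,8))
sns.histplot(Monetary_Plot, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases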

In [None]:
from sklearn import preprocessing
import math
rfm_df['Recency_log'] = rfm_df['Recency'].apply(math.log)
rfm_df['Frequency_log'] = rfm_df['Frequency'].apply(math.log)
rfm_df['Monetary_log'] = rfm_df['Monetary'].apply(math.log)

**Some of the features are positively skewed, so I applied a log transformation to bring them closer to a normal distribution.**
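The effect of a log transformation on skewness can be measured directly. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed feature such as Monetary. The data, seed, and parameters are assumptions chosen for illustration and are not taken from the retail dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed "spend" values (log-normal); not the retail data
spend = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000))

# Sample skewness before and after the log transformation
skew_raw = spend.skew()
skew_log = np.log(spend).skew()

print(f"skew before log: {skew_raw:.2f}")
print(f"skew after log:  {skew_log:.2f}")
```

A skewness near 0 after the transformation indicates a roughly symmetric distribution. The same two calls can be run on `rfm_df['Monetary']` and `rfm_df['Monetary_log']` to check the real features.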

### 6. Data Scaling

In [None]:
# Scaling your data
# lets scale on Recency and Monetary
features_rec_mon=['Recency_log','Monetary_log']
X_features_rec_mon=rfm_df[features_rec_mon].values
scaler_rec_mon=preprocessing.StandardScaler()
X_rec_mon=scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon

##### Which method have you used to scale your data and why?

Here, the StandardScaler method is used to scale the data. It transforms each feature so that it has a mean of 0 and a standard deviation of 1, which puts Recency and Monetary on a comparable scale for the clustering algorithms. Note that standardization is not robust to outliers, because the mean and standard deviation are themselves pulled by extreme values. The log transformation applied earlier is what limits the influence of the outliers in our dataset.
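To see how a single extreme value affects standardization, and how a prior log transformation helps, here is a small sketch in plain NumPy. It applies the same per-column formula as StandardScaler, (x - mean) / std with the population standard deviation. The `values` array and the `standardize` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical single feature with one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 500.0])

def standardize(x):
    # Same formula StandardScaler applies per column:
    # (x - mean) / std, using the population std (ddof=0)
    return (x - x.mean()) / x.std()

z_raw = standardize(values)          # standardize the raw values
z_log = standardize(np.log(values))  # log-transform first, then standardize

# Spread of the five ordinary points after scaling
spread_raw = z_raw[:5].max() - z_raw[:5].min()
spread_log = z_log[:5].max() - z_log[:5].min()

print("z-scores on raw values:", z_raw.round(2))
print("z-scores on log values:", z_log.round(2))
```

On the raw values, the extreme point compresses the five ordinary points into a very narrow band. After the log transformation they keep a visibly wider spread, which is why the notebook takes logs before scaling.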

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Calculate quantiles on the transformed data
quantiles = rfm_df[['Recency_log', 'Frequency_log', 'Monetary_log']].quantile(q=[0.25, 0.5, 0.75])
quantiles = quantiles.to_dict()

# Define scoring functions for R, F, and M segments on transformed data
def RScoring(x, p, d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

def FnMScoring(x, p, d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1

# Calculate R, F, and M segment values on the transformed data
rfm_df['R'] = rfm_df['Recency_log'].apply(RScoring, args=('Recency_log', quantiles,))
rfm_df['F'] = rfm_df['Frequency_log'].apply(FnMScoring, args=('Frequency_log', quantiles,))
rfm_df['M'] = rfm_df['Monetary_log'].apply(FnMScoring, args=('Monetary_log', quantiles,))

# Calculate and add RFMGroup value column showing combined concatenated score of RFM
rfm_df['RFMGroup'] = rfm_df['R'].map(str) + rfm_df['F'].map(str) + rfm_df['M'].map(str)

# Calculate and add RFMScore value column showing total sum of RFMGroup values
rfm_df['RFMScore'] = rfm_df[['R', 'F', 'M']].sum(axis=1)

rfm_df.head()

##### What data splitting ratio have you used and why?

Answer Here.

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***