## Project Title: 
## *🛒 Shopper Spectrum: Customer Segmentation and Product Recommendations in E-Commerce*

## By Shubham Pandey

## Problem Type:
#### *● Unsupervised Machine Learning – Clustering*
#### *● Collaborative Filtering – Recommendation System*

## Problem Statement:
#### *The global e-commerce industry generates vast amounts of transaction data daily, offering valuable insights into customer purchasing behaviors. Analyzing this data is essential for identifying meaningful customer segments and recommending relevant products to enhance customer experience and drive business growth. This project aims to examine transaction data from an online retail business to uncover patterns in customer purchase behavior, segment customers based on Recency, Frequency, and Monetary (RFM) analysis, and develop a product recommendation system using collaborative filtering techniques.*

## GitHub Link:
#### *https://github.com/Shubhampandey1git/-Shopper-Spectrum-Customer-Segmentation-and-Product-Recommendations-in-E-Commerce*

# **Imports**

In [1]:
import pandas as pd

# **1. Dataset Collection and Understanding** 

***Loading the dataset and checking the top rows, data types, missing values, and summary statistics***

In [2]:
# Loading the dataset
df = pd.read_csv('online_retail.csv')

# Printing the top 5 rows
print(df.head())

# Getting the data types and missing values
print(df.info())

# Getting the summary stats
print(df.describe(include='all'))

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2022-12-01 08:26:00       2.55     17850.0  United Kingdom  
1  2022-12-01 08:26:00       3.39     17850.0  United Kingdom  
2  2022-12-01 08:26:00       2.75     17850.0  United Kingdom  
3  2022-12-01 08:26:00       3.39     17850.0  United Kingdom  
4  2022-12-01 08:26:00       3.39     17850.0  United Kingdom  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       ------------

***Observations:***
1. Some rows are missing a CustomerID
2. Some Quantity and UnitPrice values are negative
3. Some InvoiceNo values start with 'C' (cancelled transactions)
4. The InvoiceDate column is still stored as a string
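
The observations above can be turned into counts before any cleaning is done. The following is a minimal sketch, not the notebook's own code: the `sample` frame and the `data_quality_report` helper are hypothetical and only mirror the column names of `online_retail.csv`.

```python
import pandas as pd

# Hypothetical toy frame that mirrors the columns of online_retail.csv
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381"],
    "Quantity": [6, -1, 4, 2],
    "UnitPrice": [2.55, 27.50, 0.0, 3.39],
    "CustomerID": [17850.0, 14527.0, None, 13047.0],
    "InvoiceDate": ["2022-12-01 08:26:00"] * 4,
})


def data_quality_report(frame: pd.DataFrame) -> dict:
    """Count the rows affected by each issue listed in the observations."""
    is_cancelled = frame["InvoiceNo"].astype(str).str.startswith("C")
    return {
        "missing_customer_ids": int(frame["CustomerID"].isna().sum()),
        "non_positive_quantity": int((frame["Quantity"] <= 0).sum()),
        "non_positive_price": int((frame["UnitPrice"] <= 0).sum()),
        "cancelled_invoices": int(is_cancelled.sum()),
        # False while the column still holds strings
        "invoice_date_is_datetime": bool(
            pd.api.types.is_datetime64_any_dtype(frame["InvoiceDate"])
        ),
    }


report = data_quality_report(sample)
print(report)
```

Calling `data_quality_report(df)` on the real data would show how many rows each cleaning step is expected to affect.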

# **2. Data Preprocessing**

***Cleaning the data:***
1. Removing the rows with missing CustomerIDs
2. Removing cancelled transactions
3. Removing rows with zero or negative quantities or prices
4. Converting InvoiceDate to datetime

In [5]:
# Removing the rows with missing CustomerID
df_clean = df.dropna(subset=['CustomerID'])

# Removing cancelled transactions (InvoiceNo values starting with 'C')
df_clean = df_clean[~df_clean['InvoiceNo'].astype(str).str.startswith('C')]

# Removing the rows with negative or zero quantity or price
# (.copy() avoids a SettingWithCopyWarning on the column assignment below)
df_clean = df_clean[(df_clean['Quantity'] > 0) & (df_clean['UnitPrice'] > 0)].copy()

# Convert InvoiceDate to datetime
df_clean['InvoiceDate'] = pd.to_datetime(df_clean['InvoiceDate'])
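
To see how much each cleaning step contributes, the same filters can be applied one at a time while recording the rows dropped. This is a sketch under assumptions: the `raw` frame is hypothetical toy data, and the `steps` dictionary is an illustrative structure rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per kind of problem, plus two valid rows
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381", "536382"],
    "Quantity": [6, -1, 4, 2, 3],
    "UnitPrice": [2.55, 27.50, 0.0, 3.39, 1.25],
    "CustomerID": [17850.0, 14527.0, 13047.0, None, 12583.0],
})

# The same filters as the cleaning cell, applied in the same order
steps = {
    "missing CustomerID": lambda d: d.dropna(subset=["CustomerID"]),
    "cancelled invoices": lambda d: d[~d["InvoiceNo"].astype(str).str.startswith("C")],
    "non-positive quantity or price": lambda d: d[(d["Quantity"] > 0) & (d["UnitPrice"] > 0)],
}

cleaned = raw
removed = {}
for name, step in steps.items():
    before = len(cleaned)
    cleaned = step(cleaned)
    removed[name] = before - len(cleaned)  # rows dropped by this step alone

print(removed)
print(f"Rows kept: {len(cleaned)} of {len(raw)}")
```

Because the filters run in sequence, a row that fails several checks is attributed only to the first step that removes it.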

***Checking for duplicates***

In [6]:
duplicates = df_clean.duplicated().sum()
print(f'Duplicate rows: {duplicates}')

Duplicate rows: 5192


***Removing the Duplicates***

In [7]:
df_clean = df_clean.drop_duplicates()
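
As a small illustration of how the duplicate count relates to the rows removed: with its default `keep='first'`, `duplicated()` flags only the repeated copies, so the count it reports equals the number of rows that `drop_duplicates()` drops. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are exact duplicates
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "71053", "22633"],
    "Quantity": [6, 6, 6, 6],
})

# Default keep='first': the first occurrence is not flagged, later copies are
n_dupes = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()
rows_removed = len(toy) - len(deduped)

# keep=False flags every member of a duplicate group, useful for inspection
all_copies = toy[toy.duplicated(keep=False)]

print(f"Duplicate rows: {n_dupes}, rows removed: {rows_removed}")
print(all_copies)
```

Inspecting `df_clean[df_clean.duplicated(keep=False)]` before dropping may help confirm that the flagged rows are true repeats rather than legitimate repeat purchases on the same invoice.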

***Checking the summary statistics of the final DataFrame***

In [11]:
print(df_clean.describe(include='all'))

       InvoiceNo StockCode                         Description       Quantity  \
count     392692    392692                              392692  392692.000000   
unique     18532      3665                                3877            NaN   
top       576339    85123A  WHITE HANGING HEART T-LIGHT HOLDER            NaN   
freq         542      2023                                2016            NaN   
mean         NaN       NaN                                 NaN      13.119702   
min          NaN       NaN                                 NaN       1.000000   
25%          NaN       NaN                                 NaN       2.000000   
50%          NaN       NaN                                 NaN       6.000000   
75%          NaN       NaN                                 NaN      12.000000   
max          NaN       NaN                                 NaN   80995.000000   
std          NaN       NaN                                 NaN     180.492832   

                          I