# **Project Name**    - Online Retail



##### **Project Type**    - Cluster
##### **Contribution**    - Individual
##### **Name**            - Sana Fatima


# **Project Summary -**

This project aims to perform customer segmentation for an online retail store by applying clustering techniques to its transactional dataset. The goal is to identify distinct customer groups based on their purchasing behaviors, demographics, and other relevant attributes.

This information can then be used to tailor marketing strategies, personalize customer experiences, and ultimately improve business performance.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



The online retail industry is highly competitive, and businesses need to understand their customers deeply to thrive. Traditional marketing approaches often treat all customers the same, leading to inefficient campaigns and missed opportunities. This project aims to address this challenge by leveraging customer segmentation through clustering.


# **Attribute information**

1. InvoiceNo: A unique identifier for each transaction. An identifier starting with 'C' indicates a cancellation.

2. StockCode: A unique identifier for each product.

3. Description: A textual description of the product.

4. Quantity: The number of units of a product purchased in a single transaction.

5. InvoiceDate: The date and time of the transaction.

6. UnitPrice: The price of a single unit of the product, in the store's currency (e.g., pounds sterling).

7. CustomerID: A unique identifier for each customer.

8. Country: The country where the customer resides.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import difflib
from os import path

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.cluster.hierarchy as sch
import seaborn as sns
from nltk.corpus import stopwords
from PIL import Image
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import KFold, train_test_split
from wordcloud import ImageColorGenerator, STOPWORDS, WordCloud

# Download the NLTK stop-word list used during text cleaning
nltk.download('stopwords')

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('/content/Online Retail.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.describe(include='all')

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

In [None]:
# Check for duplicate rows
duplicates = df.duplicated()

In [None]:
# Remove duplicate rows, keeping the first occurrence, and reset the index
data = df.drop_duplicates().reset_index(drop=True)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

**Description and CustomerID have null Values**

In [None]:
#null value in percentage
for col in df.columns:
  print(f"{col} : Count : {df[col].isnull().sum()} : Percentage : {round(df[col].isnull().sum()/df.shape[0]*100, 2)}")
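The per-column loop above can also be written without a loop. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `null_summary` name are illustrative assumptions, not part of the retail dataset); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the retail data (values are made up)
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP", "BAG"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Country": ["United Kingdom", "France", "France", "Germany"],
})

# isnull().sum() counts nulls per column; isnull().mean() gives the null fraction
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": (toy.isnull().mean() * 100).round(2),
})
print(null_summary)
```

Applied to `df`, the same two expressions would give the count and percentage for every column in one table.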


In [None]:
# Fill missing Description values with a placeholder
df['Description'] = df['Description'].fillna('No Description')

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

In [None]:
# Drop the CustomerID column (about 25% of its values are missing)
df.drop('CustomerID', axis=1, inplace=True)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.sample(10)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

### Variables Description

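See the attribute descriptions in the **Attribute information** section above. Note that CustomerID has been dropped from `df` at this point.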

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Replace any remaining null values in the 'Description' column with 'unknown'
df['Description'] = df['Description'].replace(np.nan, "unknown")

In [None]:
df.notnull().sum()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
df["Country"].value_counts(normalize=True)

In [None]:
import plotly.express as px

# Calculate country distribution
country_distribution = df["Country"].value_counts(normalize=True)

# Create the bar chart
fig = px.bar(
    x=country_distribution.index,  # Country names on the x-axis
    y=country_distribution.values,  # Normalized counts on the y-axis
    title="Distribution of Countries",
    labels={"x": "Country", "y": "Normalized Count"},  # Add axis labels for clarity
)

fig.show()

In [None]:
# Convert "InvoiceNo" to a string type series
df["InvoiceNo"] = df.InvoiceNo.astype("str")
# Convert "Description" to a string type series and remove extra whitespaces
df["Description"] = df.Description.astype("str")
df["Description"] = df.Description.str.strip()


In [None]:
#creating copy for plot
products=data.copy()

#removing unknown
StockCode=products[products['StockCode']!='unknown']

In [None]:
# Plot the 10 most frequent stock codes

plt.figure(figsize = (14,6))
sns.countplot(y='StockCode',data=StockCode,order=StockCode.StockCode.value_counts().head(10).index,palette="gist_rainbow")
plt.title('Top 10 products by StockCode count',fontweight="bold")
plt.show()

# Product cancellation count by country

In [None]:
# Filter rows where InvoiceNo starts with 'C' indicating cancellation
canceled_orders = df[df['InvoiceNo'].str.startswith('C')]

# Group by Country and StockCode to count cancellations for each product in each country
country_product_cancellations = canceled_orders.groupby(['Country', 'StockCode'])['InvoiceNo'].count().reset_index()

# Rename the count column for clarity
country_product_cancellations.rename(columns={'InvoiceNo': 'CancellationCount'}, inplace=True)

# Display the resulting DataFrame
print(country_product_cancellations)
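To make the cancellation logic easier to check by hand, here is a small sketch on a made-up toy frame (the invoice numbers, stock codes, and the `toy` name are illustrative assumptions). It applies the same idea as the cell above: keep invoices whose number starts with 'C', then count them per country and product.

```python
import pandas as pd

# Toy transactions: three of the five invoices are cancellations ('C' prefix)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "C536383", "536390", "C536391"],
    "StockCode": ["85123A", "D", "35004C", "22941", "35004C"],
    "Country": ["United Kingdom", "United Kingdom", "France", "France", "France"],
})

# Boolean mask marking cancelled invoices
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")

# Count cancellations for each (Country, StockCode) pair
counts = (
    toy[is_cancelled]
    .groupby(["Country", "StockCode"])
    .size()
    .reset_index(name="CancellationCount")
)

# Total cancellations per country
per_country = counts.groupby("Country")["CancellationCount"].sum()
print(counts)
print(per_country)
```

Using `size()` with `reset_index(name=...)` names the count column directly, so a separate rename step is not needed.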


In [None]:
# Calculate total cancellations per country
country_cancellations = country_product_cancellations.groupby('Country')['CancellationCount'].sum().reset_index()

# Sort countries by cancellation count in descending order
country_cancellations = country_cancellations.sort_values('CancellationCount', ascending=False)

# Plotting the countries with the highest cancellation counts

In [None]:
# plot the bar graph
plt.figure(figsize=(12, 6))
sns.barplot(x='Country', y='CancellationCount', data=country_cancellations, order=country_cancellations['Country'])
top_country = country_cancellations.iloc[0]['Country']
plt.title(f'Product Cancellations by Country (Highest in {top_country})')
plt.xlabel('Country')
plt.ylabel('Cancellation Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

# We will create a word cloud to see which words appear most often in the Description column

In [None]:
plt.subplots(figsize=(25,15))
# parameters for wordcloud
wordcloud = WordCloud(
                          background_color='Black',
                          stopwords=set(STOPWORDS),
                          max_words=1000,
                          max_font_size=50,
                          random_state=42,
                          width=1920,
                          height=1080
                         ).generate(" ".join(data['Description'].astype(str)))
# Plot the image
plt.title('Most used words in Description column', fontsize = 20, pad=25)
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('products.png')
plt.show()

# Analysis of the 'Country' column
Next, we create a word cloud for the Country column.

In [None]:
plt.subplots(figsize=(25,15))
# parameters for wordcloud
wordcloud = WordCloud(
                          background_color='white',
                          width=1920,
                          height=1080
                         ).generate(" ".join(data['Country'].astype(str))) # Join country names with spaces
# Plot the image
plt.title('Most used words in Country', fontsize = 20, pad=25)
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('Country.png')
plt.show()

## ***5. Feature Engineering & Data Pre-processing***

Feature Engineering

We combine the text-based and categorical columns into a single `text_info` column. As written, only the Country column is included.

In [None]:
# Build the combined text column (currently from Country only)
data['text_info'] = data['Country'].astype(str)

In [None]:
# Checking
data['text_info'][0]

# Text cleaning

In [None]:
#text cleaning function
import re
def clean_text(x):
    return re.sub(r"[^a-zA-Z ]","",str(x))

In [None]:
# Applying above function on our combined column
data['text_info'] = data['text_info'].apply(clean_text)

In [None]:
# we will convert all words in lowercase
data['text_info'] = data['text_info'].str.lower()
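As a quick check of what the cleaning and lowercasing steps do, the sketch below applies the same regular expression to a few made-up strings (the `samples` list is an illustrative assumption). It also shows a side effect worth knowing: removing a hyphen merges the two words around it, and removing trailing digits can leave a trailing space.

```python
import re

def clean_text(x):
    # Keep only ASCII letters and spaces; everything else is removed
    return re.sub(r"[^a-zA-Z ]", "", str(x))

# Made-up sample strings, not taken from the dataset
samples = ["United Kingdom", "EIRE", "Hong-Kong 2011!"]

# Clean each string, then lowercase it
cleaned = [clean_text(s).lower() for s in samples]
print(cleaned)
```

If merged words such as "hongkong" are not wanted, one option is to replace removed characters with a space instead of an empty string and then collapse repeated spaces.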

In [None]:
# Imports and NLTK resources needed for tokenizing and stemming
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Stemming-

In [None]:
#stemming
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

In [None]:
# Tokenize a string, drop the given stop words, and stem the remaining tokens
def filter_words(text, words_to_remove):
  tokens = word_tokenize(text)
  return [stemmer.stem(word) for word in tokens if word not in words_to_remove]

# Apply the filter to every row of the combined text column
data['cleaned_text'] = data['text_info'].apply(lambda text: filter_words(text, stop_words))

data['cleaned_text']

In [None]:
#join words fun
def join_words(x):
  return " ".join(x)

In [None]:
#final column
data['cleaned_text'] = data['cleaned_text'].apply(join_words)

In [None]:
data.head(4)

In [None]:
words = data.cleaned_text
words

# using TF-IDF

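Term frequency-inverse document frequency (TF-IDF) is a text vectorizer that turns text into numeric vectors. It weights each term by how often it appears in a document and down-weights terms that appear in many documents.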

In [None]:
#using tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
t_vectorizer = TfidfVectorizer(max_df = 0.9,min_df = 0.01, max_features=5000)
X= t_vectorizer.fit_transform(words)

In [None]:
X

# Applying PCA-Principal Component Analysis to reduce dimensions.

In [None]:
#PCA Code
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
transformer = PCA()
transformer.fit(X.toarray())

In [None]:
# Cumulative explained variance vs. number of components
plt.plot(np.cumsum(transformer.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
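The cumulative variance curve can also be read off numerically. The following sketch uses synthetic data (the `X_toy` array, built from three latent factors plus small noise, is an assumption made purely for illustration) to find how many components are needed to pass 95% explained variance, and compares that with what `PCA(n_components=0.95)` selects.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Fit a full PCA and accumulate the explained variance ratios
pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float in (0, 1) asks PCA to pick that number itself
pca_95 = PCA(n_components=0.95).fit(X_toy)
print(n_needed, pca_95.n_components_)
```

On the real TF-IDF matrix, the same two lines (`np.cumsum` and `np.searchsorted`) would give the component count directly instead of estimating it from the plot.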

In [None]:
#using tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
# Adjust the parameters of TfidfVectorizer
# Try reducing min_df or increasing max_features
t_vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.001, max_features=5000)
X = t_vectorizer.fit_transform(words)

# Check the shape of X after vectorization
print("Shape of X after TF-IDF:", X.shape)

# If X.shape[1] is still 0, it means no features were extracted
if X.shape[1] == 0:
    print("No features were extracted by TF-IDF. Please adjust the parameters.")
else:
    # Proceed with PCA
    from sklearn.decomposition import PCA
    transformer = PCA(n_components=0.95)
    transformer.fit(X.toarray())
    X_transformed = transformer.transform(X.toarray())
    print("Shape of X_transformed:", X_transformed.shape)

## ***7. ML Model Implementation***

In [None]:
# Vectorizing the cleaned text with the fitted TF-IDF vectorizer
X_vectorized = t_vectorizer.transform(words)

In [None]:
#applying pca
X= transformer.transform(X_vectorized.toarray())

In [None]:
X


# Cluster Model Implementation



In [None]:
# We will plot the graph to get the no. of clusters
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 60):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=200, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)


plt.plot(range(1, 60), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
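To show what the elbow curve is expected to look like when the data has a clear cluster structure, here is a sketch on synthetic blobs (the `X_toy` data with four centres is an assumption for illustration only). It records the WCSS for each k and the drop between consecutive values; the drops usually become small once k passes the true number of groups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with four well-separated groups
X_toy, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Within-cluster sum of squares (inertia) for k = 1..8
wcss = []
k_values = range(1, 9)
for k in k_values:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X_toy)
    wcss.append(model.inertia_)

# Decrease in WCSS when moving from each k to k + 1
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
for k, drop in zip(k_values, drops):
    print(f"k={k} -> k={k + 1}: WCSS drops by {drop:.1f}")
```

Printing the drops alongside the plot can help when the elbow is hard to see by eye.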

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples

In [None]:
plt.figure(figsize=(16,12))
plt.title('dendrogram')
dend = dendrogram(linkage(X, method='ward'))
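Hierarchical linkage works from pairwise distances, so its memory use grows roughly quadratically with the number of rows. One possible workaround, sketched below on synthetic data, is to build the linkage on a random sample. The sample size of 200 and the `X_toy` blobs are illustrative assumptions, not values tuned for this dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic data standing in for the PCA-transformed matrix
X_toy, _ = make_blobs(n_samples=2000, centers=3, cluster_std=0.5, random_state=0)

# Draw a random sample of rows without replacement
rng = np.random.default_rng(0)
sample_idx = rng.choice(X_toy.shape[0], size=200, replace=False)
X_sample = X_toy[sample_idx]

# Ward linkage on the sample only; Z has one row per merge
Z = linkage(X_sample, method="ward")

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(Z.shape, np.unique(labels))
```

The linkage matrix `Z` can be passed to `dendrogram` in the same way as in the cell above.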

**AgglomerativeClustering**

In [None]:
cluster = AgglomerativeClustering(n_clusters=355)
labels_ = cluster.fit_predict(X)
labels_

In [None]:
silhouette_score(X, labels_)

In [None]:
# Check the silhouette score for each candidate number of clusters
silhouette_score_ = []
range_n_clusters = list(range(2, 55))
for n_clusters in range_n_clusters:
    clusterer = AgglomerativeClustering(n_clusters=n_clusters)
    preds = clusterer.fit_predict(X)
    #centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    silhouette_score_.append([int(n_clusters) , round(score , 3)])
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
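Rather than reading the best value from the printed list, the scores can be kept in a dictionary and the maximum picked programmatically. This is a minimal sketch on synthetic blobs (the `X_toy` data and the range 2 to 7 are assumptions for illustration); higher silhouette scores indicate tighter, better-separated clusters.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data used only to illustrate the selection step
X_toy, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Silhouette score for each candidate number of clusters
scores = {}
for n_clusters in range(2, 8):
    preds = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_toy)
    scores[n_clusters] = silhouette_score(X_toy, preds)

# Candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The same pattern applies to KMeans by swapping the estimator inside the loop.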

In [None]:
plt.figure(figsize=(14,8))
plt.scatter(X[:,0],X[:,1],c=cluster.labels_, cmap='rainbow')
plt.show()

**KMeans Clusters**

In [None]:
cluster = KMeans(n_clusters=395, random_state=0)
y_pred = cluster.fit_predict(X)

silhouette_score(X, y_pred)

In [None]:
silhouette_samples(X, y_pred)

In [None]:
# Reuse the labels predicted by the KMeans model fitted above
label = y_pred

# Getting unique labels
u_labels = np.unique(label)

# Plotting the results (set the figure size before drawing)
plt.figure(figsize=(20, 8))
for i in u_labels:
    plt.scatter(X[label == i , 0] , X[label == i , 1] , label = i)
plt.legend()
plt.show()

# **Conclusion**

1- Data Overview

We have 533645 rows and 8 columns provided in the data.

The dataset has 2 object columns and 6 integer columns.

2- Checking the null values

| Column | Null count | Percentage |
|---|---|---|
| InvoiceNo | 0 | 0.0 |
| StockCode | 0 | 0.0 |
| Description | 0 | 0.0 |
| Quantity | 0 | 0.0 |
| InvoiceDate | 0 | 0.0 |
| UnitPrice | 0 | 0.0 |
| CustomerID | 135037 | 25.16 |
| Country | 0 | 0.0 |

First, the CustomerID column has 135037 null values, almost 25% of its rows, so we cannot use this column in model training, but we can still use it in EDA.

3- Check duplicate values in the dataset

We do not have any duplicate values in the dataset.

Number of unique values per column:

| Column | Unique values |
|---|---|
| Country | 38 |
| Description | 4224 |
| Quantity | 722 |
| UnitPrice | 1630 |
| year | 2 |
| month | 12 |
| month_name | 12 |
| week_name | 6 |
| quarter | 4 |
| days | 31 |
| week | 6 |
| hour | 15 |
| minute | 60 |

2- Data pre-processing

1- Feature Engineering

To train the model, we build a combined text column, `text_info`, from the text-based columns. The code above currently populates it from the Country column only.

2- Text cleaning

We remove non-alphabetic characters and convert all words to lowercase.

3- Stop-word removal and stemming

We remove all stop words and apply a stemming function.

4- TF-IDF vectorization

Term frequency-inverse document frequency (TF-IDF) is a text vectorizer that transforms text into a numeric vector.

5- Applying PCA (Principal Component Analysis) to reduce dimensions

We keep enough components to explain 95% of the variance (`n_components=0.95`).

3- Applying models

1- Find the number of clusters

We use the elbow method to find candidate values of k.
We also use the silhouette score to compare candidate values.
We also use a dendrogram to estimate the number of clusters.

2- Use Agglomerative Clustering

3- Use KMeans Clustering






### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***