# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">🛍 Recommendation System 🛍</p>

<img src="https://www.digitaledge.org/wp-content/uploads/2018/09/What-is-a-recommendation-system.jpg">

    

<a id='contents_tabel'></a>
<div class="list-group" id="list-tab" role="tablist">
<p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">TABLE OF CONTENTS</p>   
    
* [1. IMPORTING LIBRARIES](#1)
    
* [2. LOADING DATA](#2)
    
* [3. Data Overview](#3)

* [4. EDA](#4)
        
* [5. Preprocessing](#5) 

<!-- * [6. SPLITING DATA](#6) 
 -->
* [6. Scaling](#7) 
   
* [7. MODEL](#8)

<!-- * [9. SUMMRY](#9)
  -->
* [8. END](#10)
    


<div style="border-radius:20px; padding: 20px; background-color: #f0f0f0; font-size:130%; text-align:left; border: 2px solid #4682B4;">

<h3 align="left" style="color: #4CAF50;"><font>Problem Statement:</font></h3>

    
    

In this project, we delve deep into the thriving sector of __online retail__ by analyzing a __transactional dataset__ from a UK-based retailer, available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/352/online+retail). This dataset documents all transactions between 2010 and 2011. Our primary objective is to amplify the efficiency of marketing strategies and boost sales through __customer segmentation__. We aim to transform the transactional data into a customer-centric dataset by creating new features that will facilitate the segmentation of customers into distinct groups using the __K-means clustering__ algorithm. This segmentation will allow us to understand the distinct __profiles__ and preferences of different customer groups. Building upon this, we intend to develop a __recommendation system__ that will suggest top-selling products to customers within each segment who haven't purchased those items yet, ultimately enhancing marketing efficacy and fostering increased sales.



This is a non-trivial task due to the following challenges:

1. **Data Quality**: Missing values, especially in crucial fields like Customer ID, can introduce significant biases.
   
2. **Feature Engineering**: Correctly defining and computing features such as RFM (Recency, Frequency, Monetary value) is critical for the success of the clustering algorithm.
   
3. **Optimal Number of Clusters**: Determining the right number of clusters that make business sense is essential for the recommendation engine to be useful.
   
4. **Scalability**: The algorithm needs to be efficient as we are dealing with a potentially large dataset with numerous transactions.
   
5. **Interpreting Clusters**: Post clustering, understanding the characteristics of each cluster is important for practical implementation.

We will employ the KMeans algorithm for the clustering task, fine-tune the model based on evaluation metrics, and ultimately generate a list of product recommendations for customers within each cluster. By doing so, we aim to improve the store's marketing strategies, increase sales, and enhance customer satisfaction.
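
The paragraph above mentions evaluation metrics without naming one. As a minimal sketch, assuming the silhouette score as the metric and synthetic points standing in for the real R, F, M scores, the number of clusters could be compared like this (`X`, `centers`, and `best_k` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for an R, F, M score matrix: three loose groups (assumption, not real data)
rng = np.random.default_rng(0)
centers = np.array([[1, 1, 1], [4, 4, 4], [4, 1, 1]])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 3)) for c in centers])

# Fit K-means for several candidate k values and record the mean silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher silhouette score indicates tighter, better-separated clusters, but the chosen number should still make business sense for the segments it produces.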

By the end of this project, we hope to provide actionable insights that could be used to improve the effectiveness of the online store's recommendation system, thereby contributing to increased customer engagement and revenue growth.

<a id="1"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">📚 IMPORTING LIBRARIES</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import plotly.express as px
from pandas import Timestamp
from sklearn.metrics.pairwise import cosine_similarity
from ipywidgets import widgets
from IPython.display import display
%matplotlib inline

In [None]:
# Set style of seaborn plots
sns.set_style('darkgrid')

# Palette
palette = sns.color_palette('plasma')

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

<a id="2"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">📖 LOADING DATA</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
data = pd.read_excel('/kaggle/input/online-retail/Online Retail.xlsx')
data.head()

<a id="3"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">🧐 Data Overview</p>
⬆️ [Table of Contents](#contents_tabel)

# Data Definitions

- **InvoiceNo**: Unique invoice identifier; shared across rows for multiple purchases in one invoice
- **StockCode**: Product (item) code
- **Description**: Text description of the product
- **Quantity**: Number of units of the product on that invoice line
- **InvoiceDate**: Date and time of the invoice
- **UnitPrice**: Price per item
- **CustomerID**: Customer identifier
- **Country**: Customer's country

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isnull().sum()


<a id="4"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">📊 EDA</p>
⬆️ [Table of Contents](#contents_tabel)

In [None]:
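# Count transaction rows per country; the United Kingdom (the retailer's home market) is excluded
plt.figure(figsize=(14, 6))
country_orders = data.groupby('Country')['InvoiceNo'].count().sort_values(ascending=False)
country_orders = country_orders.drop('United Kingdom')
country_orders.plot(kind='bar', color='lightgreen')
plt.ylabel('Number of transaction rows')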

<a id="5"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">🧹 Preprocessing </p>
⬆️ [Table of Contents](#contents_tabel)

In [None]:
# Preprocessing: drop rows without a CustomerID (copy to avoid chained-assignment warnings)
data = data[data['CustomerID'].notnull()].copy()

In [None]:
# Create InvoiceDay column
data['InvoiceDay'] = pd.to_datetime(data['InvoiceDate']).dt.date

In [None]:
# Calculate the total value of each line item
data['TotalSum'] = data['Quantity'] * data['UnitPrice']

In [None]:
data.head()

In [None]:
# Create the 'pin_date' for recency calculation
pin_date = max(data['InvoiceDate']) + pd.Timedelta(days=1)

<div style="border-radius:20px; padding: 20px; background-color: #f0f0f0; font-size:130%; text-align:left; border: 2px solid #4682B4;">

The code above sets the reference date used to measure Recency in RFM (Recency, Frequency, Monetary value) analysis. RFM evaluates and segments customers on three qualities:

- **Recency**: How recently a customer has purchased
- **Frequency**: How often a customer has purchased
- **Monetary value**: How much a customer has spent in total

To calculate Recency, you need the date of each customer's last purchase and a fixed reference point to compare it with. Here, `pin_date` serves as that reference point, or "pin":

- `max(data['InvoiceDate'])`: Returns the timestamp of the last invoice in the data.
- `pd.Timedelta(days=1)`: Adds one day to that timestamp, so that even the most recent customer has a Recency of at least one day.
- `pin_date`: The resulting reference point for the Recency calculation. The closer a customer's last purchase is to this date, the better (lower) their Recency.

In other words, `pin_date` is used to calculate the time difference between each customer's last purchase date and the reference date. This shows how recently each customer has bought.
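
As a small worked example of the Recency calculation, here is a sketch on made-up transactions. The `toy`, `toy_pin_date`, and `toy_recency` names and all values are illustrative assumptions, not taken from the dataset:

```python
import pandas as pd

# Toy transactions (made-up values, not from the dataset)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceDate': pd.to_datetime(['2011-12-01 10:00', '2011-12-08 09:30', '2011-11-20 15:45']),
})

# Reference point: one day after the latest invoice in the data
toy_pin_date = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

# Recency: whole days between each customer's last purchase day and the reference point
last_purchase = toy.groupby('CustomerID')['InvoiceDate'].max()
toy_recency = (toy_pin_date - last_purchase.dt.normalize()).dt.days
print(toy_recency)
```

Customer 1 bought on the last day in the data and therefore has the lowest Recency; customer 2 last bought several weeks earlier and has a higher one.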

In [None]:
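rfm = data.groupby('CustomerID').agg({
    # Recency: days between the reference date and the customer's last purchase day
    'InvoiceDay': lambda x: (pin_date - Timestamp(x.max())).days,
    # Frequency: number of transaction rows (line items), not distinct invoices
    'InvoiceNo': 'count',
    # Monetary: total spend
    'TotalSum': 'sum'
})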

In [None]:
rfm.rename(columns={
    'InvoiceDay' : 'Recency',
    'InvoiceNo' : 'Frequency',
    'TotalSum': 'Monetary'},
    inplace = True
)
rfm

<a id="7"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">📐 Scaling</p>
⬆️ [Table of Contents](#contents_tabel)

In [None]:
r_labels = range(4, 0, -1) #[4, 3, 2, 1]
r_groups = pd.qcut(rfm['Recency'], q=4, labels=r_labels)
f_labels = range(1, 5) # [1, 2, 3, 4]
f_groups = pd.qcut(rfm['Frequency'], q=4, labels=f_labels)
m_labels = range(1, 5)
m_groups = pd.qcut(rfm['Monetary'], q=4, labels=m_labels)

In [None]:
rfm['R'] = r_groups.values
rfm['F'] = f_groups.values
rfm['M'] = m_groups.values
rfm
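
The quartile scoring above can be easier to follow on a tiny example. The sketch below uses made-up recency values (the `toy_recency` and `toy_r` names are illustrative) to show how reversed labels give the most recent buyers the highest score:

```python
import pandas as pd

# Made-up recency values in days (not from the dataset)
toy_recency = pd.Series([2, 5, 10, 20, 40, 80, 160, 320])

# Reversed labels: the lowest recency quartile (most recent buyers) gets score 4
toy_r = pd.qcut(toy_recency, q=4, labels=[4, 3, 2, 1])
print(toy_r.tolist())
```

Frequency and Monetary use ascending labels instead, because for those features a larger value is the better one.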

<a id="8"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">💡 MODEL</p>
⬆️ [Table of Contents](#contents_tabel)

In [None]:
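# Cluster customers on their R, F, M quartile scores (cast from categorical to int)
x = rfm[['R', 'F', 'M']].astype(int)
# random_state fixes the initialisation so cluster labels are reproducible across runs
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, random_state=42)
rfm['kmeans_cluster'] = kmeans.fit_predict(x)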

In [None]:
rfm

In [None]:
cluster_counts = rfm['kmeans_cluster'].value_counts().reset_index()
cluster_counts.columns = ['Cluster', 'Number of Customers']

fig = px.bar(cluster_counts, x='Cluster', y='Number of Customers', title="Number of Customers per Cluster")
fig.show()

In [None]:
def suggest_top_product_for_cluster(cluster_number, data, rfm):
    """Suggest a cluster's top-selling product to its customers who have not bought it yet."""
    # Validate against the cluster labels that actually exist instead of a hard-coded range
    if cluster_number not in rfm['kmeans_cluster'].unique():
        return "Invalid cluster number."

    # Get the top-selling product for the cluster (by number of transaction rows)
    customers_in_cluster = rfm[rfm['kmeans_cluster'] == cluster_number].index.tolist()
    cluster_data = data[data['CustomerID'].isin(customers_in_cluster)]
    top_product = (cluster_data.groupby('StockCode')['InvoiceNo']
                   .count()
                   .sort_values(ascending=False)
                   .index[0])

    # Find customers in the cluster who haven't bought the top product
    buyers = set(cluster_data.loc[cluster_data['StockCode'] == top_product, 'CustomerID'])
    customers_not_bought = [c for c in customers_in_cluster if c not in buyers]

    return {
        "cluster": cluster_number,
        "top_product": top_product,
        "customers_suggested": customers_not_bought
    }


In [None]:
sample_suggestion = suggest_top_product_for_cluster(4, data, rfm)
sample_suggestion

In [None]:
def prompt_for_customer_id():
    """
    Prompt the user to input a CustomerID and validate the input.
    """
    try:
        return float(input("Enter the CustomerID to recommend StockCodes: "))
    except ValueError:
        print("Invalid input. Ensure you enter a numeric CustomerID.")
        return None

def get_cluster_for_customer(customer_id):
    """
    Retrieve the associated cluster for a given customer.
    """
    return rfm.loc[customer_id, 'kmeans_cluster']

def get_popular_products_for_cluster(cluster):
    """
    Return the popular products for a specific cluster.
    """
    cluster_customers = rfm[rfm['kmeans_cluster'] == cluster].index
    return (data[data['CustomerID'].isin(cluster_customers)]
            .groupby('StockCode')['InvoiceNo']
            .count()
            .sort_values(ascending=False))

def recommend_products(customer_id, num_recommendations=5):
    """
    Recommend products to a customer based on their cluster's preferences.
    """
    # Customers removed during preprocessing or never seen have no cluster
    if customer_id not in rfm.index:
        print(f"CustomerID {customer_id} not found.")
        return None

    customer_stock_codes = data[data['CustomerID'] == customer_id]['StockCode'].unique()
    cluster = get_cluster_for_customer(customer_id)
    popular_products = get_popular_products_for_cluster(cluster)

    # Keep only the cluster's popular products that this customer has not bought yet
    recommended_products = popular_products[~popular_products.index.isin(customer_stock_codes)]
    print(f"\nTop Products for Cluster {cluster}:")
    for stock_code, count in popular_products.head(num_recommendations).items():
        print(f"Stock Code: {stock_code}, Purchased: {count} times")

    print(f"\nRecommended Stock Codes for Customer {customer_id}:")
    for stock_code in recommended_products.head(num_recommendations).index:
        print(f"Stock Code: {stock_code}")

    return recommended_products.head(num_recommendations)

if __name__ == "__main__":
    customer_id = prompt_for_customer_id()
    # Compare against None so that a valid ID is never rejected for being falsy
    if customer_id is not None:
        recommend_products(customer_id)
    else:
        print("Exiting due to invalid input.")
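
Because the cell above relies on interactive input, the underlying idea (recommend what is popular in the customer's cluster and not yet bought by that customer) can be checked more easily on toy data. This is a minimal sketch under that assumption; `toy_data`, `toy_clusters`, and `toy_recommend` are hypothetical names and the values are made up:

```python
import pandas as pd

# Made-up purchases and cluster labels (not from the dataset)
toy_data = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'StockCode': ['A', 'B', 'A', 'C', 'A'],
})
toy_clusters = pd.Series({1: 0, 2: 0, 3: 1}, name='kmeans_cluster')

def toy_recommend(customer_id, n=2):
    """Most-purchased products in the customer's cluster that the customer has not bought."""
    cluster = toy_clusters.loc[customer_id]
    peers = toy_clusters.index[toy_clusters == cluster]
    # Popularity is the number of purchase rows per product within the cluster
    popularity = toy_data.loc[toy_data['CustomerID'].isin(peers), 'StockCode'].value_counts()
    already_bought = toy_data.loc[toy_data['CustomerID'] == customer_id, 'StockCode']
    # Drop products the customer already owns, then keep the n most popular
    return popularity[~popularity.index.isin(already_bought)].head(n).index.tolist()

print(toy_recommend(1))
print(toy_recommend(2))
```

Customers 1 and 2 share a cluster, so each is offered the product that only the other one bought; customer 3 is alone in a cluster and has nothing new to be recommended.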

<a id="10"></a>
# <p style="background-color:#4B0082;font-family:newtimeroman;color:#FFF9ED;font-size:150%;text-align:center;border-radius:10px 10px;">👋 END </p>
⬆️ [Table of Contents](#contents_tabel)