<div align="Center">

# JK Lakshmipat University
## Institute of Engineering and Technology
### Machine Learning (CS1138)
#### Project-I
#### RFM model-based Customer Segmentation using Clustering and Classification

</div>
<hr>

#### Importing the Libraries

In [None]:
import pandas as pd

import plotly.express as px

from sklearn.model_selection import train_test_split

<hr>

### Data Configuration

#### Importing the Data

In [None]:
df1 = pd.read_excel('online_retail_II.xlsx', sheet_name='Year 2009-2010')
df2 = pd.read_excel('online_retail_II.xlsx', sheet_name='Year 2010-2011')
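# ignore_index=True gives every row a unique label; otherwise both sheets reuse 0..n
# and the later index-based drops would remove rows from both years at once
df = pd.concat([df1, df2], ignore_index=True)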

#### Initial Dataset

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

#### Imputing the Dataset

In [None]:
df.isnull().sum()

Missing values are replaced with the following sentinel values:

- Description : `'Not Available'`
- Customer ID : `-1`

In [None]:
df['Description'] = df['Description'].fillna('Not Available')
df['Customer ID'] = df['Customer ID'].fillna(-1)

In [None]:
df.drop_duplicates(keep='first', inplace=True)

#### Feature Engineering

In [None]:
df['Customer ID'] = df['Customer ID'].astype(int)
df['TotalPrice'] = df['Price'] * df['Quantity']

In [None]:
df['Country'] = df['Country'].astype('category')
df['Description'] = df['Description'].astype('category')
df['StockCode'] = df['StockCode'].astype(str)

In [None]:
df[df['Price'] < 0]

In [None]:
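# Invoice numbers containing 'C' mark cancellations; those containing 'A' mark bad-debt adjustments
df['Cancelled'] = df['Invoice'].astype(str).str.contains('C').astype(int)
df['Bad Debt'] = df['Invoice'].astype(str).str.contains('A').astype(int)
# Strip the letters so the invoice number can be stored as an integer
df['Invoice'] = df['Invoice'].astype(str).str.replace('[A-Z]', '', regex=True).astype(int)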

`StockCode` contains codes for different situations, so it cannot be converted to a numeric type. The cell below counts the distinct alphabetic parts that appear in it.

In [None]:
len(df['StockCode'].str.extractall(r"([a-zA-Z]+)").groupby(level=0).sum(numeric_only=False)[0].unique())
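
To illustrate why `StockCode` stays a string, here is a minimal sketch on toy data. The codes below are hypothetical examples, not rows taken from the dataset, and `str.extract` is used as a simpler single-match variant of the `extractall` call above.

```python
import pandas as pd

# Hypothetical stock codes: plain numeric, numeric with a letter suffix, purely alphabetic
codes = pd.Series(['85048', '79323P', '79323W', 'POST', 'M', '22041'])

# Coercing to numeric turns every code that contains a letter into NaN
as_number = pd.to_numeric(codes, errors='coerce')
lost = codes[as_number.isna()].tolist()

# First alphabetic run in each code (NaN when the code is purely numeric)
letters = codes.str.extract(r"([a-zA-Z]+)", expand=False).dropna()
unique_letters = sorted(letters.unique())

print(lost)
print(unique_letters)
```

Four of the six toy codes would be lost by a numeric conversion, which is why the column is kept as `str`.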

#### Final Dataset

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

<hr>

### Exploratory Data Analysis

#### Top Selling Products

In [None]:
plotDF = df.groupby('Description', observed=True).size().sort_values(ascending=False).reset_index()
plotDF.drop(plotDF[plotDF['Description'] == 'Not Available'].index, inplace=True)
plotDF.columns = ['Description', 'Count']

fig = px.bar(plotDF.head(20), x='Description', y='Count', title='Top Selling Products')
fig.show()

#### Most Expensive and Least Expensive Products

In [None]:
# Keep only regular sales: no bad debt, cancellations, placeholder values or non-numeric stock codes
excluded = (
    (df['Bad Debt'] == 1) | (df['Cancelled'] == 1)
    | (df['Description'] == 'Not Available') | (df['Customer ID'] == -1)
    | (df['Price'] == 0.0) | ~df['StockCode'].str.isdigit()
)
plotDF = df.loc[~excluded, ['Description', 'Price']]
# Keep the highest price seen for each product
plotDF = plotDF.sort_values(by='Price', ascending=False).drop_duplicates(subset=['Description'], keep='first').reset_index(drop=True)

In [None]:
fig = px.line(plotDF, x='Description', y='Price', title='Prices of Offered Products')
fig.update_xaxes(showticklabels=False)
fig.show()

In [None]:
fig = px.bar(plotDF.head(20), x='Description', y='Price', title='Most Expensive Products')
fig.show()

In [None]:
fig = px.bar(plotDF.tail(20), x='Description', y='Price', title='Least Expensive Products')
fig.show()

#### Average No. of Orders per Customer

In [None]:
# One row per (invoice, customer) pair, ignoring rows without a known customer
ordersPerCustomer = df.loc[df['Customer ID'] != -1, ['Invoice', 'Customer ID']].drop_duplicates()
ordersPerCustomer = ordersPerCustomer.groupby('Customer ID')['Invoice'].count().sort_values(ascending=False).reset_index()
ordersPerCustomer.columns = ['Customer ID', 'Number of Orders']

In [None]:
print(f"Average Orders per Customer: {ordersPerCustomer['Number of Orders'].mean()}")

In [None]:
plotDF = ordersPerCustomer.head(20)

fig = px.bar(plotDF, x=plotDF.index, y='Number of Orders', hover_data=['Customer ID'], title='Customers with the Most Orders')
fig.update_xaxes(title='Customer Rank')
fig.show()

#### Average No. of Unique Items per Customer and per Order

#### Top Countries by No. of Customers and No. of Orders

In [None]:
plotDF = df.groupby('Country', observed=True).size().sort_values(ascending=False).reset_index()
plotDF.columns = ['Country', 'Count']

fig = px.bar(plotDF, x='Country', y='Count', title='Number of Transaction Lines per Country')
fig.show()

In [None]:
# Exclude the -1 placeholder so unknown customers are not counted as one customer per country
plotDF = df.loc[df['Customer ID'] != -1, ['Country', 'Customer ID']].drop_duplicates()
plotDF = plotDF.groupby(['Country'], observed=True)['Customer ID'].count().sort_values(ascending=False).reset_index()
plotDF.columns = ['Country', 'Count']

fig = px.bar(plotDF, x='Country', y='Count', title='Number of Customers per Country')
fig.show()

In [None]:
plotDF = df[['Country','Invoice']].drop_duplicates()
plotDF = plotDF.groupby(['Country'], observed=True)['Invoice'].count().sort_values(ascending=False).reset_index()
plotDF.columns = ['Country', 'Count']

fig = px.bar(plotDF, x='Country', y='Count', title='Number of Orders per Country')
fig.show()

#### Total Sales per Month, per Week and per Day

#### Cancelled Items Analysis

#### Bad Debt Analysis

<hr>

## Machine Learning

#### Splitting Data into Train, Validation and Test Sets

In [None]:
dfShuffled = df.sample(frac=1, random_state=42)

In [None]:
dfTrain, dfTest = train_test_split(dfShuffled, test_size=0.2, random_state=1)

dfTrain, dfValidate = train_test_split(dfTrain, test_size=0.2, random_state=1)
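
Because the second split is applied to the remaining 80% of the rows, the resulting shares are 64% train, 16% validation and 20% test. The sketch below checks this arithmetic on a hypothetical 1,000-row frame (`toy` and its column `x` are illustrative names, not part of the dataset).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the shuffled dataset
toy = pd.DataFrame({'x': range(1000)})

# Same two-step split as above: 20% test, then 20% of the remainder for validation
train, test = train_test_split(toy, test_size=0.2, random_state=1)
train, validate = train_test_split(train, test_size=0.2, random_state=1)

shares = {
    'train': len(train) / len(toy),
    'validate': len(validate) / len(toy),
    'test': len(test) / len(toy),
}
print(shares)
```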

<hr>

### RFM Analysis
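
A minimal sketch of an RFM table on toy data, under stated assumptions: recency is taken as days since a customer's last purchase, frequency as the number of distinct invoices, and monetary value as the sum of `TotalPrice`. The `InvoiceDate` column, the snapshot date and the toy rows are assumptions made for illustration; this is not necessarily the computation used in this project.

```python
import pandas as pd

# Hypothetical transactions using the column names from the dataset above
tx = pd.DataFrame({
    'Customer ID': [1, 1, 2, 2, 2, 3],
    'Invoice': [100, 101, 102, 102, 103, 104],
    'InvoiceDate': pd.to_datetime([
        '2011-01-05', '2011-03-01', '2011-02-10',
        '2011-02-10', '2011-02-20', '2010-12-01',
    ]),
    'TotalPrice': [20.0, 35.0, 10.0, 5.0, 50.0, 12.5],
})

# Assumed reference point: one day after the last recorded purchase
snapshot = tx['InvoiceDate'].max() + pd.Timedelta(days=1)

rfm = tx.groupby('Customer ID').agg(
    Recency=('InvoiceDate', lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=('Invoice', 'nunique'),                              # distinct invoices
    Monetary=('TotalPrice', 'sum'),                                # total spend
)
print(rfm)
```

On the real data, rows with `Customer ID == -1` and cancelled or bad-debt invoices would likely need to be filtered out first, since they do not describe an identifiable customer's purchases.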

<hr>

### BG/NBD CLV Modelling
Beta-Geometric/Negative Binomial Distribution Customer Lifetime Value Modelling

<hr>

### Gamma-Gamma Modelling

<hr>

### k-Means Clustering

<hr>

### Hierarchical Clustering

<hr>

### k-NN Classification

<hr>

### Logistic Regression

<hr>

## Conclusion

<hr>