## Using Machine Learning to Predict Whether Customers Will Purchase in the Next 90 Days and How Much They Will Spend
This notebook builds a machine learning model to predict whether customers will make their next purchase within a certain period.

If there is one major lesson that the retail business has learnt from the SARS-CoV-2 pandemic, it is the need to switch to doing business via the Internet, i.e., e-commerce. E-commerce data helps those in managerial positions make decisions for the progress of their companies. Most of these decisions are influenced by results that experts in data analysis, data science, and machine learning derive from studying the purchasing behaviour of online customers.

## Problem Statement
The managerial team of an online retail shop approaches you, a data scientist, with a dataset. They want to know whether customers will make their next purchase within 90 days of their last purchase. Your answer will help them identify which customers the marketing team should focus on in the next promotional offers they roll out.

In [12]:
## load tools

# avoid displaying warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#import machine learning related libraries
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix,classification_report
from sklearn.cluster import KMeans
import xgboost as xgb
import math
plt.style.use('seaborn-v0_8-whitegrid')

In [13]:
## load data

purchase_df = pd.read_csv('online_retail_II.csv')
purchase_df

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.10,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
...,...,...,...,...,...,...,...,...
1067366,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
1067367,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
1067368,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
1067369,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [14]:
# rename Invoice and Price to InvoiceID and UnitPrice (the 'Customer ID' column keeps its name)
purchase_df.rename(columns = {
    'Invoice' : 'InvoiceID',
    'Price' : 'UnitPrice'
},inplace = True)

purchase_df

Unnamed: 0,InvoiceID,StockCode,Description,Quantity,InvoiceDate,UnitPrice,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.10,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
...,...,...,...,...,...,...,...,...
1067366,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
1067367,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
1067368,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
1067369,581587,22138,BAKING SET 9 PIECE RETROSPOT,3,2011-12-09 12:50:00,4.95,12680.0,France


In [15]:
# checking information about data
purchase_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   InvoiceID    1067371 non-null  object 
 1   StockCode    1067371 non-null  object 
 2   Description  1062989 non-null  object 
 3   Quantity     1067371 non-null  int64  
 4   InvoiceDate  1067371 non-null  object 
 5   UnitPrice    1067371 non-null  float64
 6   Customer ID  824364 non-null   float64
 7   Country      1067371 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 65.1+ MB


* `Description` and `Customer ID` have fewer non-null entries than the other columns, so they contain missing values. `InvoiceDate` is stored as `object` and must be converted to the proper data type, datetime.

In [16]:
# checking for missing values

purchase_df.isnull().sum()

InvoiceID           0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
UnitPrice           0
Customer ID    243007
Country             0
dtype: int64

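* As noted earlier, `Description` and `Customer ID` contain missing values that must be dealt with.
* We drop all rows with missing values. Rows without a `Customer ID` cannot be attributed to a customer, so they are of no use for customer-level modelling. Note that this removes 243,007 of 1,067,371 rows (about 23%).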
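
Before dropping rows, it can help to quantify how much data is affected. The sketch below is a minimal illustration on a made-up toy frame (the `toy` dataframe and its values are assumptions, not the retail data); the same two expressions could be applied to `purchase_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for purchase_df (values are made up for illustration)
toy = pd.DataFrame({
    "Description": ["A", None, "C", "D"],
    "Customer ID": [1.0, np.nan, np.nan, 4.0],
    "Quantity": [1, 2, 3, 4],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Number of rows that would survive a blanket dropna()
rows_kept = len(toy.dropna())
print(rows_kept)
```

If the missing share is large, it is worth stating it explicitly, since a blanket `dropna()` silently shrinks the dataset.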

In [17]:
purchase_df.dropna(inplace = True)
purchase_df.head()

Unnamed: 0,InvoiceID,StockCode,Description,Quantity,InvoiceDate,UnitPrice,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [18]:
purchase_df.isnull().sum()

InvoiceID      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
Customer ID    0
Country        0
dtype: int64

In [19]:
# converting InvoiceDate to datetime object
purchase_df['InvoiceDate'] = pd.to_datetime(purchase_df['InvoiceDate'])

In [20]:
purchase_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 824364 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceID    824364 non-null  object        
 1   StockCode    824364 non-null  object        
 2   Description  824364 non-null  object        
 3   Quantity     824364 non-null  int64         
 4   InvoiceDate  824364 non-null  datetime64[ns]
 5   UnitPrice    824364 non-null  float64       
 6   Customer ID  824364 non-null  float64       
 7   Country      824364 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 56.6+ MB


In [21]:
pd.DataFrame(purchase_df['InvoiceDate'].describe())

Unnamed: 0,InvoiceDate
count,824364
unique,41439
top,2011-11-14 15:27:00
freq,543
first,2009-12-01 07:45:00
last,2011-12-09 12:50:00


* Looking at the dataframe above, we can see that the online purchases range from 2009-12-01 to 2011-12-09.

## Exploratory Data Analysis (EDA)

In [23]:
# number of customers
print(f"The total number of online customers: {purchase_df['Customer ID'].nunique()} customers")

In [None]:
# number of countries
print(f'The total number of countries customers are located in: {purchase_df.Country.nunique()} countries')

* __What countries are represented most in the dataset:__

In [None]:
cus_cntry_df =  purchase_df.groupby(['Customer ID', 'Country']).count().reset_index()
cus_cntry_df = cus_cntry_df.groupby('Country')['Customer ID'].count().reset_index().sort_values(
    by=['Customer ID'], ascending=False)

# Create a new column, Percentage to calculate the customer representation in percentage
cus_cntry_df['Percentage']= np.round(cus_cntry_df['Customer ID'] / cus_cntry_df['Customer ID'].sum() * 100, 2)

cus_cntry_df.head(10)

In [None]:
percent_margin = 0.25
cus_cntry_df['CountryCategory'] = cus_cntry_df['Country']
# Set Countries with Percentage less than or equal to percent_margin to 'Other Countries'

cus_cntry_df.loc[cus_cntry_df.Percentage <= percent_margin, 'CountryCategory'] = 'Other Countries'

cus_cntry_df.head(11)

In [None]:
# plot how each country is represented among the customers
plt.figure(figsize = (20,10))
country =cus_cntry_df['Percentage'][:11]
label = cus_cntry_df['CountryCategory'][:11]

# define Seaborn color palette to use
palette_color = sns.color_palette('bright')
plt.pie(country,labels = label,colors = palette_color,autopct='%.0f%%')
plt.axis('equal')  
plt.tight_layout()
plt.title('Top 10 Countries customers are living in',fontweight = 'bold',fontsize = 23);

**The vast majority of online customers (about 92%) live in the United Kingdom.**

* __Calculate the revenue that was made in each month and what is the percentage revenue based on countries?__

In [None]:
purchase_df['Revenue'] = purchase_df['UnitPrice'] * purchase_df['Quantity']
purchase_df.head()

In [None]:
# index the revenue by invoice date, then sum it per calendar month
cus_revenue = purchase_df.set_index('InvoiceDate')[['Revenue']]
cus_revenue = cus_revenue.resample('M').sum()
cus_revenue.head()

In [None]:
cus_revenue.describe()

In [None]:
plt.figure(figsize = (30,5))
sns.lineplot(data = cus_revenue,y='Revenue',x=cus_revenue.index)
plt.title('Monthly Revenue from Dec. 2009 to Dec. 2011',fontweight = 'bold')
plt.xlabel('Invoice Date-Monthly')
plt.ylabel('Monthly Revenue');

**Here, one can observe that the company recorded its highest revenue in the month of November 2010, followed by November 2011. In addition, there is a rise in monthly revenue after August.**

In [None]:

cus_revenue['MonthlyGrowth'] = cus_revenue['Revenue'].pct_change()
plt.figure(figsize = (30,5))
plt.plot(cus_revenue.index,cus_revenue['MonthlyGrowth'])
plt.title('Monthly Revenue Growth from Dec. 2009 to Dec. 2011',fontweight = 'bold')
plt.xlabel('Invoice Date-Monthly')
plt.ylabel('Monthly Revenue Growth');

__Monthly growth peaks around three periods in a year (January, March and August) and trends downward around December and April.__
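
For reference, `pct_change()` computes the relative change between consecutive rows, i.e. `(current - previous) / previous`. The following is a small sketch on made-up monthly figures (the `toy_revenue` series is an assumption for illustration only):

```python
import pandas as pd

# Made-up monthly revenue totals, indexed by month end
toy_revenue = pd.Series(
    [100.0, 150.0, 120.0],
    index=pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31"]),
    name="Revenue",
)

# Growth relative to the previous month; the first month has no predecessor
growth = toy_revenue.pct_change()
print(growth)
```

The first value is `NaN` because there is no earlier month to compare against, so the first point of a growth plot is always missing.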

## Monthly Active Customers

In [None]:
cus_monthly_active = (purchase_df[['Customer ID','InvoiceDate']])
cus_monthly_active = cus_monthly_active.groupby(pd.Grouper(key='InvoiceDate', freq='M'))['Customer ID'].nunique().reset_index()
cus_monthly_active

In [None]:
plt.figure(figsize=(20,6))
plt.bar(cus_monthly_active['InvoiceDate'],cus_monthly_active['Customer ID'])
plt.title('MONTHLY ACTIVE USERS')
plt.xticks(rotation =90);

__There are more active users around August, October and November.__

## Monthly Order Count

In [None]:

#create a new dataframe for no. of order by using quantity field
cus_monthly_order = (purchase_df[['InvoiceDate','Quantity']])
cus_monthly_order = cus_monthly_order.groupby(pd.Grouper(key='InvoiceDate', freq='M'))['Quantity'].sum().reset_index()
cus_monthly_order.head()

In [None]:
plt.figure(figsize=(20,6))
plt.bar(cus_monthly_order['InvoiceDate'],cus_monthly_order['Quantity'])
plt.title('MONTHLY USERS ORDER')
plt.xticks(rotation =90);

__As expected, there is a decrease in April and December. The drop in the active customer count directly drives the drop in the order count.__

## Average Revenue per Order

In [None]:
# mean revenue per invoice line in each month, used here as a proxy for revenue per order
cus_monthly_order_rev = (purchase_df[['InvoiceDate','Revenue']])
cus_monthly_order_rev = cus_monthly_order_rev.groupby(pd.Grouper(key='InvoiceDate', freq='M'))['Revenue'].mean().reset_index()
cus_monthly_order_rev.head()

In [None]:
plt.figure(figsize=(20,6))
plt.bar(cus_monthly_order_rev['InvoiceDate'],cus_monthly_order_rev['Revenue'])
plt.title('MONTHLY USERS ORDER REVENUE')
plt.xticks(rotation =90);

### Next, explore the percentage of revenue generated by the retail shop by the countries in which its customers reside

In [None]:
cntry_revenue_df = purchase_df.groupby(['Country'])['Revenue'].sum().reset_index().sort_values(by=['Revenue'], 
                                                                                        ascending=False)

cntry_revenue_df['Percentage'] = np.round(cntry_revenue_df.Revenue / cntry_revenue_df.Revenue.sum() * 100, 2)

cntry_revenue_df.head(5)

From the output above, the top 5 countries with respect to revenue generated are:

* The United Kingdom
* The Republic of Ireland (EIRE)
* The Netherlands
* Germany
* France

The United Kingdom records the highest share of revenue (83%).

In [None]:
## group countries with revenue percentage value less than or equal to 0.25 together and then plot a pie chart.
percent_margin = 0.25

# Create a new column, CountryCategory and set values to the corresponding values of the Country column
cntry_revenue_df['CountryCategory'] = cntry_revenue_df.Country

# Set Countries with Percentage less than or equal to percent_margin to 'Other Countries'

cntry_revenue_df.loc[cntry_revenue_df.Percentage <= percent_margin, 'CountryCategory'] = 'Other Countries'

cntry_revenue_df.head(11)

In [None]:
# revenue share by country
plt.figure(figsize = (20,10))
country =cntry_revenue_df['Percentage'][:11]
label = cntry_revenue_df['CountryCategory'][:11]

# define Seaborn color palette to use
palette_color = sns.color_palette('bright')
plt.pie(country,labels = label,colors = palette_color,autopct='%.0f%%')
plt.axis('equal')  
plt.tight_layout()
plt.title('Top 10 Countries by share of revenue',fontweight = 'bold',fontsize = 23);

**With such a huge customer base in the United Kingdom, it is not surprising that 83% of the company’s revenue came from the United Kingdom.**

In [None]:
# highest selling product

high_sell_product = pd.DataFrame(purchase_df.groupby('Description')['Revenue'].sum().sort_values(ascending=False))

high_sell_product[:10]

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(data = high_sell_product[:10],y='Revenue',x=high_sell_product[:10].index)
plt.title('Highest revenue product')
plt.xticks(rotation = 90);

In [None]:
# least selling product

least_sell_product = pd.DataFrame(purchase_df.groupby('Description')['Revenue'].sum().sort_values(ascending=True))

least_sell_product[:10]

plt.figure(figsize=(15,5))
sns.barplot(data = least_sell_product[:10],y='Revenue',x=least_sell_product[:10].index)
plt.title('least revenue product')
plt.xticks(rotation = 90);

## Observation:
* Some products show negative revenue. This is most likely because they were returned, so no revenue was generated from them.

## Predicting customer's purchase

The goal of this section is to build a model, using the dataframe `purchase_df`, that estimates whether a given customer will buy something again from the online shop in the next quarter.

We split the dataframe into two sub-dataframes:
* The first dataframe contains the purchases made from 01–12–2009 to 31–08–2011. It is used to find each customer's last purchase in that window and to study the customers' purchasing behaviour.

* The second dataframe contains the purchases made from 01–09–2011 up to, but not including, 09–12–2011. It is used to find each customer's first purchase in the next quarter.


In [None]:
purchase_df

In [None]:
# split the transactions into a past window and a next-quarter window
cus_past_df = purchase_df[
    (purchase_df['InvoiceDate'] >= pd.Timestamp(2009, 12, 1))
    & (purchase_df['InvoiceDate'] < pd.Timestamp(2011, 9, 1))
].reset_index(drop=True)

cus_next_quarter = purchase_df[
    (purchase_df['InvoiceDate'] >= pd.Timestamp(2011, 9, 1))
    & (purchase_df['InvoiceDate'] < pd.Timestamp(2011, 12, 9))
].reset_index(drop=True)

In [None]:
cus_past_df['InvoiceDate'].min(),cus_past_df['InvoiceDate'].max(),cus_next_quarter['InvoiceDate'].min(),cus_next_quarter['InvoiceDate'].max()

In [None]:
# get the distinct customers in cus_past_df
cus_df = pd.DataFrame(cus_past_df['Customer ID'].unique())
cus_df.columns = ['Customer ID']

In [None]:
cus_df.head()

Let's find the first purchase made by each customer in the next quarter.

In [None]:
# Create a dataframe with CustomerID and customers first purchase 
# date in cus_next_quarter

cus_1st_purchase_in_next_quarter = cus_next_quarter.groupby('Customer ID')['InvoiceDate'].min().reset_index()
cus_1st_purchase_in_next_quarter.columns = ['Customer ID','FirstPurchaseDate']
cus_1st_purchase_in_next_quarter.head()

Let's find the last purchase made by each customer in the dataframe `cus_past_df`

In [None]:
cus_last_purchase_past_df = cus_past_df.groupby('Customer ID')['InvoiceDate'].max().reset_index()
cus_last_purchase_past_df.columns = ['Customer ID','LastPurchaseDate']
cus_last_purchase_past_df.head()

In [None]:
# Merge two dataframes cus_last_purchase_past_df and cus_1st_purchase_in_next_quarter
cus_purchase_dates = pd.merge(cus_last_purchase_past_df,cus_1st_purchase_in_next_quarter,on='Customer ID',how = 'left')
cus_purchase_dates.head()

Let's calculate the time difference in days between customer's last purchase in the dataframe `cus_last_purchase_past_df` and the first purchase in the dataframe `cus_1st_purchase_in_next_quarter`.

In [None]:
cus_purchase_dates['NextPurchaseDay'] = (cus_purchase_dates['FirstPurchaseDate'] - cus_purchase_dates['LastPurchaseDate']).dt.days
cus_purchase_dates.head()
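
To make the effect of the left merge concrete, here is a minimal sketch on two made-up customers (the toy frames and dates are assumptions for illustration). A customer with no purchase in the next quarter gets a missing `FirstPurchaseDate`, and therefore a missing `NextPurchaseDay`.

```python
import pandas as pd

# Last purchase of each customer in the past window (made-up dates)
last_purchase = pd.DataFrame({
    "Customer ID": [1, 2],
    "LastPurchaseDate": pd.to_datetime(["2011-08-20", "2011-07-01"]),
})

# First purchase in the next quarter; customer 2 did not come back
first_next = pd.DataFrame({
    "Customer ID": [1],
    "FirstPurchaseDate": pd.to_datetime(["2011-09-10"]),
})

# how="left" keeps every customer from the past window
dates = last_purchase.merge(first_next, on="Customer ID", how="left")
dates["NextPurchaseDay"] = (
    dates["FirstPurchaseDate"] - dates["LastPurchaseDate"]
).dt.days
print(dates)
```

Customer 1 returns after 21 days, while customer 2 has a missing value, which is why a placeholder has to be filled in before modelling.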

In [None]:
# merge with cus_df

cus_df = pd.merge(cus_df,cus_purchase_dates[['Customer ID','NextPurchaseDay']],on='Customer ID',how='left')

cus_df.head()

In [None]:
# fill missing values with 9999, a placeholder meaning the customer did not purchase in the next quarter

cus_df.fillna(9999,inplace=True)
cus_df.head()

Next, we will define some features and add them to the dataframe `cus_df` to build our machine learning model. We will use the Recency-Frequency-Monetary Value (RFM) segmentation method. That is, we will put the customers into groups based on the following:

* __Recency:__ Customers' purchase behaviour based on their most recent purchase date and how many days they have been inactive since their last purchase.

* __Frequency:__ Customers' purchase behaviour based on the number of times they buy from the online retail shop.

* __Monetary Value/Revenue:__ Customers' purchase behaviour based on the revenue they generate.

Afterwards, we will apply K-means clustering to assign each customer a score for each of these features.
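
As a compact illustration of the three RFM quantities, the sketch below computes them in one pass on a made-up transaction table. The `tx` frame, the `snapshot` variable and the use of named aggregation are assumptions for illustration; the notebook itself builds each feature in a separate step.

```python
import pandas as pd

# Made-up transactions for two customers
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-01-01", "2011-01-11", "2011-01-06"]),
    "Revenue": [10.0, 20.0, 5.0],
})

# Reference date: the latest purchase in the window
snapshot = tx["InvoiceDate"].max()

rfm = tx.groupby("Customer ID").agg(
    # days since the customer's most recent purchase
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    # number of transaction rows
    Frequency=("InvoiceDate", "count"),
    # total revenue generated
    TotalRevenue=("Revenue", "sum"),
).reset_index()
print(rfm)
```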


## Recency
First, we find the most recent purchase date of each customer and see how many days they have been inactive. Afterwards, we can apply K-means clustering to assign customers a recency score.

In [None]:
cus_max_purchase = cus_past_df.groupby('Customer ID')['InvoiceDate'].max().reset_index()
cus_max_purchase.columns = ['Customer ID','LastPurchaseDate']
cus_max_purchase.head()

In [None]:
# recency = days between a customer's last purchase and the latest purchase in the past window
cus_max_purchase['Recency'] = (cus_max_purchase['LastPurchaseDate'].max()-cus_max_purchase['LastPurchaseDate']).dt.days

cus_df = pd.merge(cus_df,cus_max_purchase[['Customer ID','Recency']],on='Customer ID')

cus_df.head()

In [None]:
pd.DataFrame(cus_df['Recency'].describe())

In [None]:
plt.figure(figsize = (15,6))
sns.histplot(x='Recency',data=cus_df)
plt.title("Customers Recency in Days");

Next, we will apply K-means clustering to assign a recency score. However, the K-means algorithm needs the number of clusters as an input, so we first plot the inertia for several cluster counts and look for the "elbow" of the curve.

In [None]:
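# elbow method: record the K-means inertia for k = 1..9
my_dict={}
cus_recency = cus_df[['Recency']]
for idx in range(1, 10):
    # the labels are not stored on cus_recency, so every fit sees the Recency column only
    kmeans = KMeans(n_clusters=idx, max_iter=1000).fit(cus_recency)
    my_dict[idx] = kmeans.inertia_

plt.figure(figsize=(10,5))
sns.lineplot(x=list(my_dict.keys()),y=list(my_dict.values()))
plt.xlabel('Number of clusters')
plt.ylabel('Inertia');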
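
One way to read an elbow curve without relying on the eye alone is to look at the relative drop in inertia from one cluster count to the next. The sketch below does this on synthetic one-dimensional data with three well-separated groups; the data, the `n_init` and `random_state` settings and the `drops` helper are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 1-D "recency" values drawn around three centres (made-up data)
values = np.concatenate([
    rng.normal(10, 2, 50),
    rng.normal(100, 5, 50),
    rng.normal(300, 10, 50),
]).reshape(-1, 1)

# Inertia (within-cluster sum of squares) for k = 1..6
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    inertias[k] = km.inertia_

# Relative drop in inertia when going from k-1 to k clusters
drops = {k: 1 - inertias[k] / inertias[k - 1] for k in range(2, 7)}
for k, drop in drops.items():
    print(k, round(drop, 3))
```

Once adding another cluster no longer reduces the inertia by much, the elbow has been passed. On real data the bend is usually less sharp, so the choice remains a judgment call.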

In [None]:
number_of_clusters = 4
kmeans = KMeans(n_clusters=number_of_clusters)
kmeans.fit(cus_df[['Recency']])
cus_df['RecencyCluster'] = kmeans.predict(cus_df[['Recency']])
cus_df.head()

In [None]:
def order_cluster(df, target_field_name, cluster_field_name, ascending):
    """
    INPUT:
        - df                  - pandas DataFrame
        - target_field_name   - str - A column in the pandas DataFrame df
        - cluster_field_name  - str - Expected to be a column in the pandas DataFrame df
        - ascending           - Boolean
        
    OUTPUT:
        - df_final            - pandas DataFrame with target_field_name and cluster_field_name as columns
    
    """
    # Add the string "new_" to cluster_field_name
    new_cluster_field_name = "new_" + cluster_field_name
    
    # Create a new dataframe by grouping the input dataframe by cluster_field_name and extract target_field_name 
    # and find the mean
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    
    # Sort the new dataframe df_new, by target_field_name in the order given by `ascending`
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    
    # Create a new column in df_new with column name index and assign it values to df_new.index
    df_new["index"] = df_new.index
    
    # Create a new dataframe by merging input dataframe df and part of the columns of df_new based on 
    # cluster_field_name
    df_final = pd.merge(df, df_new[[cluster_field_name, "index"]], on=cluster_field_name)
    
    # Update the dataframe df_final by deleting the column cluster_field_name
    df_final = df_final.drop([cluster_field_name], axis=1)
    
    # Rename the column index to cluster_field_name
    df_final = df_final.rename(columns={"index": cluster_field_name})
    
    return df_final
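
The idea behind `order_cluster` is that K-means labels are arbitrary, so they are renumbered by the mean of the target column. The sketch below shows the same relabelling on a made-up frame using a plain dictionary mapping; it is an equivalent illustration under that assumption, not a replacement for the function above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Recency": [5, 7, 200, 210, 90],
    # arbitrary labels, as K-means might return them
    "RecencyCluster": [0, 0, 2, 2, 1],
})

# Mean recency per cluster, sorted in descending order (like ascending=False)
means = (
    toy.groupby("RecencyCluster")["Recency"].mean().sort_values(ascending=False)
)

# Old label -> new ordered label (0 = highest mean recency, i.e. most inactive)
mapping = {old: new for new, old in enumerate(means.index)}
toy["RecencyCluster"] = toy["RecencyCluster"].map(mapping)
print(toy)
```

After relabelling, a higher cluster number consistently means a more recent customer, which is what makes the cluster numbers usable as scores.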

In [None]:
cus_df= order_cluster(cus_df, 'Recency', 'RecencyCluster', False)
cus_df.head()

In [None]:
#print cluster characteristics
cus_df.groupby('RecencyCluster')['Recency'].describe()


Observe from the above that cluster 3 covers the most recent customers, whereas cluster 0 contains the most inactive customers.

## Frequency

Next, we will find customers purchase behaviour based on the number of times they buy from the online retail shop. That is, the total number of orders by each customer.

In [None]:
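# count the invoice lines of each customer as a proxy for their number of orders
# (counting distinct InvoiceID values instead would give the number of separate orders)

cus_frequency = cus_past_df.groupby('Customer ID')['InvoiceDate'].count().reset_index()
cus_frequency.columns = ['Customer ID','Frequency']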
cus_frequency.head()

In [None]:
cus_df = pd.merge(cus_df,cus_frequency,on = 'Customer ID')

cus_df.head()

In [None]:
pd.DataFrame(cus_df['Frequency'].describe())

In [None]:
plt.figure(figsize = (10,6))
sns.histplot(x='Frequency',data=cus_df)
plt.title("Customers Frequency");

In [None]:
kmeans = KMeans(n_clusters = number_of_clusters)
kmeans.fit(cus_df[['Frequency']])
cus_df['FrequencyCluster'] = kmeans.predict(cus_df[['Frequency']])

cus_df.head()

In [None]:
cus_df = order_cluster(cus_df, 'Frequency', 'FrequencyCluster', True)
cus_df.head()

In [None]:
#see details of each cluster
cus_df.groupby('FrequencyCluster')['Frequency'].describe()

A higher frequency cluster number indicates a better customer.

## Revenue

In [None]:
cus_revenue = cus_past_df.groupby('Customer ID')['Revenue'].sum().reset_index()
cus_df = pd.merge(cus_df,cus_revenue,on='Customer ID')
cus_df.rename(columns={'Revenue':'TotalRevenue'},inplace=True)


cus_revenue_mean = cus_past_df.groupby('Customer ID')['Revenue'].mean().reset_index()
cus_df = pd.merge(cus_df,cus_revenue_mean,on='Customer ID')
cus_df.rename(columns={'Revenue':'MeanRevenue'},inplace =True)

cus_df.head()

In [None]:
plt.figure(figsize = (10,6))
sns.histplot(x='TotalRevenue',data=cus_df)
plt.title("Customers Total Revenue");

In [None]:
kmeans = KMeans(n_clusters = number_of_clusters)
kmeans.fit(cus_df[['TotalRevenue']])
cus_df['TotalRevenueCluster'] = kmeans.predict(cus_df[['TotalRevenue']])

cus_df.head()

In [None]:
# ordering cluster number
cus_df = order_cluster(cus_df, 'TotalRevenue', 'TotalRevenueCluster', True)
cus_df.head()

In [None]:
pd.DataFrame(cus_df.groupby('TotalRevenueCluster')['TotalRevenue'].describe())

## Overall Score

In [None]:
#calculate overall score and use mean() to see details

cus_df['OverallScore'] = cus_df['RecencyCluster'] + cus_df['FrequencyCluster'] + cus_df['TotalRevenueCluster']

cus_df.groupby('OverallScore')[['Recency','Frequency','TotalRevenue']].mean()


The scoring above clearly shows us that customers with score 8 are our best customers whereas those who score 3 are the worst.

In [None]:
cus_df['Segment'] = 'Low-Value'
cus_df.loc[cus_df['OverallScore'] > 2, 'Segment'] = 'Mid-Value'
cus_df.loc[cus_df['OverallScore'] > 4, 'Segment'] = 'High-Value'

In [None]:
fig,axs = plt.subplots(nrows=3,ncols=1,figsize=(10,10))
sns.scatterplot(x='Frequency',y='TotalRevenue',data=cus_df,hue=cus_df['Segment'],ax=axs[0])
sns.scatterplot(x='Recency',y='TotalRevenue',data=cus_df,hue=cus_df['Segment'],ax=axs[1])
sns.scatterplot(x='Recency',y='Frequency',data=cus_df,hue=cus_df['Segment'],ax=axs[2])
plt.tight_layout()

In [None]:
cus_df.head()

In [None]:
# create cus_data as a copy of cus_df before applying get_dummies
cus_data = cus_df.copy()
cus_data = pd.get_dummies(cus_data)
cus_data.head()

In [None]:
cus_data['NextPurchaseDay'].describe()

Since our goal is to estimate whether a customer will make a purchase in the next quarter, we will create a new column, `NextPurchaseDayRange`, defined as follows:

* 0–20: customers who will purchase within 0–20 days (class 2)
* 21–50: customers who will purchase within 21–50 days (class 1)
* > 50: customers who will purchase after more than 50 days, or not at all (class 0)
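
The binning rule can be sketched on a few made-up values. The thresholds below follow the comparisons used in the notebook's code (`> 20` and `> 50`); the `days` series and the use of `np.select` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up NextPurchaseDay values, including the 9999 placeholder
days = pd.Series([5, 20, 21, 50, 51, 9999], name="NextPurchaseDay")

# np.select returns the choice of the first condition that is true
label = np.select([days > 50, days > 20], [0, 1], default=2)
print(label.tolist())
```

Checking boundary values such as 20, 21, 50 and 51 is a quick way to confirm that the written class definitions and the code agree.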

### Creating target variables

In [None]:
# customer purchase day range
cus_data['NextPurchaseDayRange'] = 2
cus_data.loc[cus_data.NextPurchaseDay>20,'NextPurchaseDayRange'] = 1
cus_data.loc[cus_data.NextPurchaseDay>50,'NextPurchaseDayRange'] = 0

__Checking the correlation between the features and labels__

In [None]:
cus_data

In [None]:
# plot the class distribution of the target
plt.figure(figsize=(15,8))
sns.countplot(data=cus_data,x='NextPurchaseDayRange')
plt.title('NextPurchaseRange');

__The dataset is highly imbalanced.__

In [None]:
corr = cus_data.corr()

In [None]:
plt.figure(figsize=(40,30))
sns.heatmap(corr,annot = True,fmt='.2f',linewidths = 0.2);

## Building Machine Learning Model

In [None]:
# dropping nextpurchaseday
cus_data.drop('NextPurchaseDay',axis = 1,inplace = True)

In [None]:
# separate the features from the target

x = cus_data.drop('NextPurchaseDayRange',axis = 1)
y = cus_data['NextPurchaseDayRange']

# removing imbalance in the dataset
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy='not majority')
x_over,y_over = ros.fit_resample(x,y)

In [None]:
y_over.value_counts()
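
A point worth keeping in mind with random oversampling: duplicated rows that end up on both sides of a later split can make scores look better than they are. A common precaution is to split first and oversample only the training portion. The sketch below illustrates that order of operations on synthetic data using `sklearn.utils.resample`; it is a hand-rolled illustration under those assumptions, not how `RandomOverSampler` is implemented and not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic, imbalanced data: 80 samples of class 0 and 20 of class 1
X = pd.DataFrame({"feature": rng.normal(size=100)})
y = pd.Series([0] * 80 + [1] * 20, name="label")

# Split first, keeping the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train = X_train.assign(label=y_train)
majority_size = train["label"].value_counts().max()

# Resample only the smaller classes (with replacement) up to the majority size
parts = []
for _, group in train.groupby("label"):
    if len(group) < majority_size:
        group = resample(
            group, replace=True, n_samples=majority_size, random_state=0
        )
    parts.append(group)
balanced = pd.concat(parts)

print(balanced["label"].value_counts())
print(y_test.value_counts())
```

The test set keeps its original class proportions, so metrics computed on it reflect the imbalance the model would face in practice.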

In [None]:
# instantiating models

models = {
    "LogisticRegression" : LogisticRegression(),
    "GaussianNB" : GaussianNB(),
    "RandomForestClassifier" : RandomForestClassifier(),
    "SVC" : SVC(),
    "DecisionTreeClassifier" : DecisionTreeClassifier(),
    "xgb.XGBClassifier" : xgb.XGBClassifier(eval_metric='mlogloss'),
    "KNeighborsClassifier" : KNeighborsClassifier()
}


In [None]:
class_scorer = ['accuracy']

In [None]:
# cross-validate every model on every metric and tabulate the mean scores
def fit_and_score(model, x, y, scorer):
    kfold = KFold(n_splits=2, random_state=24, shuffle=True)
    results = {}
    for name, estimator in model.items():
        # mean cross-validated score for each requested metric
        results[name] = [
            np.mean(cross_val_score(estimator, x, y, scoring=metric, cv=kfold))
            for metric in scorer
        ]
    return pd.DataFrame.from_dict(results, orient='index', columns=scorer)

In [None]:
# performing cross_validation for the classification
class_metrics = fit_and_score(model=models,scorer=class_scorer,y=y_over,x=x_over)

class_metrics
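
Accuracy alone can hide poor performance on individual classes. As a hedged sketch, scikit-learn's `cross_validate` accepts several scoring names at once, so a metric such as macro-averaged F1 can be reported alongside accuracy. The synthetic data from `make_classification` and the `summary` dictionary below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic three-class problem standing in for the customer features
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=0
)

cv = KFold(n_splits=2, shuffle=True, random_state=24)
metrics = ["accuracy", "f1_macro"]

# One cross-validation run evaluates every metric in the list
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring=metrics
)

# Results are stored under "test_<metric name>", one value per fold
summary = {name: scores[f"test_{name}"].mean() for name in metrics}
print(summary)
```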

__The `RandomForestClassifier` performed better than the other models.__

## Next Purchase Probability

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_over, y_over, test_size=0.2)

In [None]:
# instantiating model and fitting
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

In [None]:
# making prediction probabilities
prediction_proba = rf.predict_proba(x_test)

# predictions
preds = rf.predict(x_test)

## Classification Report

In [None]:
# scikit-learn expects the true labels first, then the predictions
print(classification_report(y_test,preds))
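
The argument order matters because scikit-learn's metric functions take `(y_true, y_pred)`: swapping them exchanges precision and recall in the report. The small sketch below uses made-up labels (an assumption for illustration) to show how rows and columns of the confusion matrix relate to the per-class scores.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# output_dict=True returns the report as a nested dictionary
report = classification_report(y_true, y_pred, output_dict=True)
print(report["1"])
```

Here class 1 is predicted three times but is correct only twice (precision 2/3), while both true class-1 samples are found (recall 1.0).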

In [None]:
## Saving model
from joblib import dump
dump(rf, 'C:/Users/user/Data_Driven/models/Customer_next_purchase_day_model_1.joblib')
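
A saved model can be restored with `joblib.load`. The sketch below is a minimal round trip on a toy model written to a temporary directory; the synthetic data and the file name `toy_model.joblib` are hypothetical and used only for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data standing in for the customer features
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    dump(model, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Models saved this way are generally best loaded with the same scikit-learn version that produced them.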