<a href="https://colab.research.google.com/github/CharlesKasasira/ML_assignment/blob/main/ML_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Group Assignment

| Names | Student Number | Registration Number |
| --- | --- | --- |
| Ssemakula Martin | 2100719937 | 21/U/19937/EVE |
| Nagaba Norman | 2100721359 | 21/U/21359/EVE |
| Kasasira Charles Derrick | 2100705662 | 21/U/05662/EVE |
| Kaboggoza Ronnie | 2100710546 | 21/U/10546/EVE |
| Mafabi Daniel | 2100719677 | 21/U/19677/EVE |

#### Question
- Go to Kaggle and select a dataset.
- Describe the data and the objective, and build candidate models using 3-5 algorithms.
- Compare the performance and pick the best performing algorithm for further tuning.
- Run the final optimized/tuned model for your algorithm and report the results.
- Present all steps along with the necessary explanations as a Google Colab or Jupyter notebook. You can share Jupyter notebooks via GitHub. Use **sklearn** to implement the algorithms.

Dataset: https://www.kaggle.com/datasets/blewitts/ecommerce-rfm-analysis

#### Objective

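The objective of this project is to build a machine learning model to predict which customers are likely to churn, where churn refers to customers who stop making purchases from the online retailer. The dataset has no explicit churn column, so the models below predict the `Customer_Segment` label, whose values (for example, "Lost Lowest" versus "Loyal Customers") describe how engaged each customer is.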



#### Build candidate models using 5 algorithms.
We will build five candidate models to predict customer churn:

  - Decision trees
  - K-Nearest Neighbors
  - Naive Bayes
  - Support vector machines (SVMs)
  - Random forests


In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
import pandas as pd

df = pd.read_csv('/content/gdrive/My Drive/ecom_data_rfm.csv')

#### Data Description

In [None]:
df.head()

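| | Unnamed: 0 | CustomerID | Frequency | Recency | Monetary | rankR | rankF | rankM | groupRFM | Country | Customer_Segment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 12346 | 2 | 358 | 2.08 | 2 | 1 | 1 | 211 | United Kingdom | Lost Lowest |
| 1 | 2 | 12347 | 182 | 35 | 481.21 | 5 | 4 | 3 | 543 | Iceland | Loyal Customers |
| 2 | 3 | 12348 | 31 | 108 | 178.71 | 5 | 1 | 2 | 512 | Finland | Potential Loyalist |
| 3 | 4 | 12349 | 73 | 51 | 605.1 | 5 | 2 | 4 | 524 | Italy | Recent High Spender |
| 4 | 5 | 12350 | 17 | 343 | 65.3 | 2 | 1 | 1 | 211 | Norway | Lost Lowest |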


This dataset contains customer purchase data from an e-commerce store, with columns such as:

  - **CustomerID**: unique identifier for each customer
  - **Frequency**: the number of times each customer made a purchase
  - **Recency**: how recently each customer made their last purchase
  - **Monetary**: the total amount spent by each customer on purchases
  - **rankR**: a rank assigned to each customer based on their Recency value
  - **rankF**: a rank assigned to each customer based on their Frequency value
  - **rankM**: a rank assigned to each customer based on their Monetary value
  - **groupRFM**: a combination of the rank values for Recency, Frequency, and Monetary, used to segment customers into different groups based on their purchase behavior
  - **Country**: the country where each customer is located
  - **Customer_Segment**: a label assigned to each customer based on their purchase behavior and groupRFM value.

The objective of our analysis will be to predict the **Customer_Segment** label for new customers based on their purchase behavior.
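
Judging from the five sample rows shown above, `groupRFM` appears to be the three rank digits written side by side (`rankR`, then `rankF`, then `rankM`). The sketch below checks that reading on those five rows only; the concatenation rule is an assumption drawn from the sample, not a documented property of the dataset.

```python
import pandas as pd

# The five sample rows from df.head() above (rank columns only).
sample = pd.DataFrame({
    "rankR": [2, 5, 5, 5, 2],
    "rankF": [1, 4, 1, 2, 1],
    "rankM": [1, 3, 2, 4, 1],
    "groupRFM": [211, 543, 512, 524, 211],
})

# Assumed rule: concatenate the three rank digits into one number.
rebuilt = sample["rankR"] * 100 + sample["rankF"] * 10 + sample["rankM"]

# True if the assumed rule reproduces groupRFM on every sample row.
print((rebuilt == sample["groupRFM"]).all())
```

If the same rule holds for the full dataset, the target label is close to a deterministic function of the rank features, which would be worth keeping in mind when reading very high scores later on.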

In [None]:
df.columns.values.tolist()

['Unnamed: 0',
 'CustomerID',
 'Frequency',
 'Recency',
 'Monetary',
 'rankR',
 'rankF',
 'rankM',
 'groupRFM',
 'Country',
 'Customer_Segment']

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [None]:
# Check for missing values, then drop any rows that contain them
print(df.isnull().sum())  # print explicitly; only a cell's last expression is displayed
df.dropna(inplace=True)

In [None]:
# Drop unnecessary columns
df.drop(['Unnamed: 0', 'Country'], axis=1, inplace=True) # dropping Unnamed and country columns


In [None]:
df.head()

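| | CustomerID | Frequency | Recency | Monetary | rankR | rankF | rankM | groupRFM | Customer_Segment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12346 | 2 | 358 | 2.08 | 2 | 1 | 1 | 211 | Lost Lowest |
| 1 | 12347 | 182 | 35 | 481.21 | 5 | 4 | 3 | 543 | Loyal Customers |
| 2 | 12348 | 31 | 108 | 178.71 | 5 | 1 | 2 | 512 | Potential Loyalist |
| 3 | 12349 | 73 | 51 | 605.1 | 5 | 2 | 4 | 524 | Recent High Spender |
| 4 | 12350 | 17 | 343 | 65.3 | 2 | 1 | 1 | 211 | Lost Lowest |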


In [None]:
# Encode the Customer_Segment column
le = LabelEncoder()
df['Customer_Segment'] = le.fit_transform(df['Customer_Segment'])
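
The classification reports further down show only the integer codes produced by `LabelEncoder`. As a minimal sketch of how those codes map back to segment names, the example below fits a separate encoder (the hypothetical `le_demo`) on the four segment names visible in the sample rows. The full dataset has nine segments, so the actual codes in this notebook will differ from the ones printed here.

```python
from sklearn.preprocessing import LabelEncoder

# Segment names taken from the sample rows shown earlier.
segments = ["Lost Lowest", "Loyal Customers", "Potential Loyalist",
            "Recent High Spender", "Lost Lowest"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(segments)

# classes_ holds the original labels in sorted order; position == integer code.
for code, name in enumerate(le_demo.classes_):
    print(code, name)

# inverse_transform maps integer codes back to the original names.
decoded = le_demo.inverse_transform(codes)
print(decoded)
```

In the notebook itself, `le.classes_` plays the same role, and it can be passed as `target_names` to `classification_report` so that each row is labelled with a segment name instead of a number.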

In [None]:
# Split the data into training and testing sets
X = df.drop(['Customer_Segment'], axis=1)
y = df['Customer_Segment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
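
Note that `X` as built above still contains `CustomerID`, which identifies a row rather than describing purchase behavior. The following is a small sketch, using only the sample rows shown earlier, of how the identifier could be left out of the feature matrix. Whether to exclude it is a modeling choice that this notebook does not make; `X_demo` and `y_demo` are illustrative names.

```python
import pandas as pd

# A few columns of the five sample rows from df.head().
sample = pd.DataFrame({
    "CustomerID": [12346, 12347, 12348, 12349, 12350],
    "Frequency": [2, 182, 31, 73, 17],
    "Recency": [358, 35, 108, 51, 343],
    "Monetary": [2.08, 481.21, 178.71, 605.10, 65.30],
    "Customer_Segment": ["Lost Lowest", "Loyal Customers", "Potential Loyalist",
                         "Recent High Spender", "Lost Lowest"],
})

# Keep the identifier out of the feature matrix; it is only a row key.
X_demo = sample.drop(columns=["CustomerID", "Customer_Segment"])
y_demo = sample["Customer_Segment"]

print(X_demo.columns.tolist())
```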


In [None]:
X

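| | CustomerID | Frequency | Recency | Monetary | rankR | rankF | rankM | groupRFM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12346 | 2 | 358 | 2.08 | 2 | 1 | 1 | 211 |
| 1 | 12347 | 182 | 35 | 481.21 | 5 | 4 | 3 | 543 |
| 2 | 12348 | 31 | 108 | 178.71 | 5 | 1 | 2 | 512 |
| 3 | 12349 | 73 | 51 | 605.10 | 5 | 2 | 4 | 524 |
| 4 | 12350 | 17 | 343 | 65.30 | 2 | 1 | 1 | 211 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4375 | 18280 | 10 | 310 | 47.65 | 2 | 1 | 1 | 211 |
| 4376 | 18281 | 7 | 213 | 39.36 | 3 | 1 | 1 | 311 |
| 4377 | 18282 | 13 | 40 | 62.68 | 5 | 1 | 1 | 511 |
| 4378 | 18283 | 756 | 36 | 1220.93 | 5 | 5 | 5 | 555 |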


In [None]:
y

0       2
1       3
2       6
3       8
4       2
       ..
4375    2
4376    0
4377    5
4378    3
4379    6
Name: Customer_Segment, Length: 4326, dtype: int64

#### 1. Decision Tree Classifier

In [None]:
# Train and test a Decision Tree Classifier
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
print('Decision Tree Classifier:')
print(classification_report(y_test, y_pred_dtc))

Decision Tree Classifier:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        69
           1       1.00      1.00      1.00         3
           2       1.00      1.00      1.00        98
           3       1.00      1.00      1.00       126
           4       1.00      1.00      1.00         4
           5       1.00      1.00      1.00       245
           6       1.00      0.99      1.00       239
           7       1.00      1.00      1.00        81
           8       1.00      1.00      1.00         1

    accuracy                           1.00       866
   macro avg       1.00      1.00      1.00       866
weighted avg       1.00      1.00      1.00       866



#### 2. K-Nearest Neighbors Classifier

In [None]:
# Train and test a K-Nearest Neighbors Classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print('K-Nearest Neighbors Classifier:')
print(classification_report(y_test, y_pred_knn))

K-Nearest Neighbors Classifier:
              precision    recall  f1-score   support

           0       0.94      0.99      0.96        69
           1       0.00      0.00      0.00         3
           2       1.00      0.99      0.99        98
           3       0.96      0.93      0.94       126
           4       0.40      0.50      0.44         4
           5       0.93      0.98      0.95       245
           6       0.93      0.88      0.91       239
           7       0.92      0.99      0.95        81
           8       0.00      0.00      0.00         1

    accuracy                           0.94       866
   macro avg       0.68      0.69      0.68       866
weighted avg       0.94      0.94      0.94       866





#### 3. Naïve Bayes Classifier

In [None]:
# Train and test a Naive Bayes Classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
print('Naive Bayes Classifier:')
print(classification_report(y_test, y_pred_gnb))


Naive Bayes Classifier:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        69
           1       0.43      1.00      0.60         3
           2       1.00      0.98      0.99        98
           3       1.00      0.98      0.99       126
           4       0.50      0.75      0.60         4
           5       1.00      1.00      1.00       245
           6       1.00      0.99      0.99       239
           7       1.00      1.00      1.00        81
           8       1.00      1.00      1.00         1

    accuracy                           0.99       866
   macro avg       0.88      0.97      0.91       866
weighted avg       0.99      0.99      0.99       866



#### 4. Support Vector Classifier

In [None]:
# Train and test a Support Vector Classifier
svc = SVC(random_state=42)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
print('Support Vector Classifier:')
print(classification_report(y_test, y_pred_svc))

Support Vector Classifier:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        69
           1       0.00      0.00      0.00         3
           2       0.00      0.00      0.00        98
           3       0.98      0.44      0.60       126
           4       0.00      0.00      0.00         4
           5       0.32      1.00      0.49       245
           6       0.00      0.00      0.00       239
           7       0.00      0.00      0.00        81
           8       0.00      0.00      0.00         1

    accuracy                           0.35       866
   macro avg       0.15      0.16      0.12       866
weighted avg       0.23      0.35      0.23       866
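
One plausible reason for these low scores is feature scaling: an RBF-kernel `SVC` works on distances between rows, and the features here span very different ranges (rank columns from 1 to 5 next to `CustomerID` values in the tens of thousands). `StandardScaler` is imported above but never applied. The sketch below illustrates the effect on synthetic stand-in data, not on the RFM dataset, so it shows the mechanism only and says nothing about how much the scores in this notebook would change.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in data (not the RFM dataset): one small-scale feature that
# carries the class signal and one large-scale feature that is pure noise.
y_demo = rng.integers(0, 2, size=n)
signal = np.where(y_demo == 1, 1.0, -1.0) + rng.normal(0, 0.3, size=n)
noise = rng.normal(0, 1000.0, size=n)
X_demo = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# SVC on raw features: distances are dominated by the large-scale column.
unscaled_acc = SVC(random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

# Same model, but each feature is standardized first inside a pipeline.
scaled_svc = make_pipeline(StandardScaler(), SVC(random_state=42))
scaled_acc = scaled_svc.fit(X_tr, y_tr).score(X_te, y_te)

print("unscaled accuracy:", unscaled_acc)
print("scaled accuracy:", scaled_acc)
```

Putting the scaler inside a pipeline means it is fitted on the training rows only, so no information from the test rows leaks into the scaling step.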





#### 5. Random Forest

In [None]:
# Train and test a Random Forest
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
y_pred_rfc = rfc.predict(X_test)

print('Random Forest Classifier:')
print(classification_report(y_test, y_pred_rfc))

Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        69
           1       1.00      1.00      1.00         3
           2       1.00      1.00      1.00        98
           3       1.00      1.00      1.00       126
           4       1.00      0.75      0.86         4
           5       1.00      1.00      1.00       245
           6       1.00      0.99      1.00       239
           7       1.00      1.00      1.00        81
           8       1.00      1.00      1.00         1

    accuracy                           1.00       866
   macro avg       1.00      0.97      0.98       866
weighted avg       1.00      1.00      1.00       866



### Model Comparison

We will compare the performance of the five candidate models using a holdout set (here, the 20% test split created above). The holdout set is a subset of the data that is not used to train the models; it is used to evaluate their performance on unseen data.

We will use the following metrics to evaluate the performance of the models:

- Accuracy: The percentage of predictions that are correct.
- F1 score: The harmonic mean of precision and recall.
- Area under the curve (AUC): The area under the ROC curve, which measures how well the model's scores separate the classes across all decision thresholds.
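
As a minimal sketch of how these three metrics can be computed for a multiclass problem with scikit-learn, the example below uses the built-in iris data as a stand-in so that it runs on its own. The variable names are illustrative, and the one-vs-rest averaging for AUC is one reasonable choice among several.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in multiclass data (iris), used only to keep the example self-contained.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)  # AUC needs class probabilities, not labels

acc = accuracy_score(y_te, y_hat)
f1_weighted = f1_score(y_te, y_hat, average="weighted")
f1_macro = f1_score(y_te, y_hat, average="macro")
# One-vs-rest AUC, macro-averaged over the classes.
auc_ovr = roc_auc_score(y_te, y_proba, multi_class="ovr")

print(acc, f1_weighted, f1_macro, auc_ovr)
```

Because AUC is computed from predicted probabilities, a model needs a `predict_proba` method to be scored this way; `SVC` only provides one when it is constructed with `probability=True`.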

#### Best Performing Algorithm
The best performing algorithm is the one that has the highest **accuracy**, **F1 score**, and **AUC** on the holdout set. 

In this case, the decision tree classifier achieves the highest accuracy on the holdout set (0.9977), narrowly ahead of the random forest (0.9965).

In [None]:
# Compare the performance of the five algorithms
print("ACCURACY")
print('Decision Tree Classifier accuracy:', accuracy_score(y_test, y_pred_dtc))
print('K-Nearest Neighbors Classifier accuracy:', accuracy_score(y_test, y_pred_knn))
print('Naive Bayes Classifier accuracy:', accuracy_score(y_test, y_pred_gnb))
print('Support Vector Classifier accuracy:', accuracy_score(y_test, y_pred_svc))
print('Random Forest Classifier accuracy:', accuracy_score(y_test, y_pred_rfc))


ACCURACY
Decision Tree Classifier accuracy: 0.9976905311778291
K-Nearest Neighbors Classifier accuracy: 0.9399538106235565
Naive Bayes Classifier accuracy: 0.9907621247113164
Support Vector Classifier accuracy: 0.3464203233256351
Random Forest Classifier accuracy: 0.9965357967667436


Based on the accuracy scores, it appears that the **Decision Tree Classifier** is the best performing algorithm.
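
On a test set of 866 rows, the gap between the decision tree (0.9977) and the random forest (0.9965) corresponds to a single prediction (864 versus 863 correct), so one split cannot really separate the two. A hedged sketch of a more stable comparison using k-fold cross-validation is shown below; it runs on the iris data as a stand-in, and the `models` dictionary and fold settings are illustrative choices rather than part of the original analysis.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data so the example runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Shuffled, stratified 5-fold split shared by every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean together with the standard deviation across folds makes it easier to see whether a difference between two models is larger than the fold-to-fold noise. On the RFM data, some segments have very few rows, and scikit-learn warns when a class has fewer members than the number of folds.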

### Final Model
We will tune the hyperparameters of the Decision Tree Classifier to try to improve its performance. Hyperparameters are settings of the model that are chosen before training rather than learned from the data. We will use a grid search with 5-fold cross-validation, via the `GridSearchCV` class from scikit-learn, to find the best combination of `max_depth` and `min_samples_split`.

In [None]:
# Define the parameter grid
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 15, 20]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

# Fit the grid search object to the data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best parameters: ", grid_search.best_params_)

Best parameters:  {'max_depth': 10, 'min_samples_split': 2}
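
The best parameters can be reused without retyping them. As a minimal sketch, again on the iris stand-in data and with illustrative variable names, a fitted `GridSearchCV` object already exposes the winning model through `best_estimator_`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data so the example runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [5, 10, 15, 20],
                "min_samples_split": [2, 5, 10, 15, 20]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already retrained on all of
# X_tr using best_params_, so the values never need to be copied by hand.
tuned = search.best_estimator_
print(search.best_params_)
print("test accuracy:", tuned.score(X_te, y_te))
```

Using `best_estimator_` (or unpacking `best_params_` into the constructor) avoids any mismatch between the parameters the search reports and the ones the final model is built with.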


In [None]:
# Create a Decision Tree Classifier object with the best hyperparameters
# reported by the grid search (max_depth=10, min_samples_split=2)
dtc_tuned = DecisionTreeClassifier(
    max_depth=10,
    min_samples_split=2,
    random_state=42
)

# Train the model on the training data
dtc_tuned.fit(X_train, y_train)

# Evaluate the model on the test data
y_pred = dtc_tuned.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print("Tuned Decision Tree Classifier accuracy: ", accuracy)
print("Tuned Decision Tree Classifier F1 Score: ", f1)

Tuned Decision Tree Classifier accuracy:  0.9976905311778291
Tuned Decision Tree Classifier F1 Score:  0.9977021755584451


### Results

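The final model has an accuracy of about 99.8% and a weighted F1 score of 0.9977 on the test set. This means that the model assigns the correct `Customer_Segment` label, which is what we use here to identify customers who are likely to churn, to about 99.8% of the customers in the test set. Tuning did not change the test accuracy relative to the untuned decision tree (0.9977 in both cases).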