# Classification Model Development (Individual)

      Cross_Sell_Success_Dataset_2023
      
 Name: Piyush Kumar
 
 Due Date: March 3
 

# Introduction

The development of a classification model is the process of building a machine learning model that can accurately divide data into predefined classes or categories. Image recognition, sentiment analysis, fraud detection, and spam filtering are all common uses for this kind of model. In this context, a dataset containing customer data for a meal delivery service has been provided. The development of a classification model typically involves the following key steps:

Data collection and preparation: A dataset with labelled examples of the classes the model should predict must be gathered and prepared. The dataset needs to be carefully curated so that it accurately represents the real-world cases the model will later classify.

Feature selection and engineering:  Changing the raw data into a more meaningful representation or choosing a subset of features that are most crucial to the task at hand may be necessary.

Model selection and training: To determine whether a customer will purchase the new product line, five classification models are developed: logistic regression, a decision tree classifier, a random forest classifier, a gradient boosting classifier, and a k-nearest neighbors (KNeighbors) classifier.

Model evaluation and tuning: The next step is to evaluate the model's performance and adjust its parameters to increase accuracy after it has been trained. 

Deployment: A model that accurately predicts the likelihood of a customer purchasing the new product line would allow the company to run more targeted marketing campaigns and increase its cross-selling success rate.

In this project, several classification models are developed to predict cross-sell success. The target variable is CROSS_SELL_SUCCESS.
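
Because the models in this notebook are compared by their AUC score, it may help to see what that number measures. The sketch below is an illustration only, not the scikit-learn implementation: it computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The function name `auc_from_scores` and the toy data are assumptions made for this example.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives and two positives
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)

# A model that gives every example the same score cannot rank anything
constant_auc = auc_from_scores(toy_labels, [0.5, 0.5, 0.5, 0.5])
print(constant_auc)
```

An AUC of 0.5 corresponds to a ranking that is no better than chance, which is a useful reference point when reading the scores reported later.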



In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


# Load the dataset
df = pd.read_excel('Cross_Sell_Success_Dataset_2023.xlsx')
df.columns = df.columns.str.strip()

# Set random state
random_state = 219

# function to plot confusion matrix as heatmap
def plot_heatmap(confusion_matrix):
  # Define the labels for the classes
  class_names = ['0', '1']

  # Plot the heatmap
  sns.heatmap(confusion_matrix, annot=True, cmap='Blues', xticklabels=class_names, yticklabels=class_names)

  # Add labels and title
  plt.xlabel('Predicted labels')
  plt.ylabel('True labels')
  plt.title('Confusion Matrix')

  # Show the plot
  plt.show()

In [None]:
## reading the data from the excel dataset. 
cross_df = pd.read_excel('Cross_Sell_Success_Dataset_2023.xlsx')

cross_df.head()

In [None]:
## checking the characteristics of the dataframe
cross_df.info()

In [None]:
# finding the null values
df.isnull().sum()

Checking correlation to find which features are most strongly related to the CROSS_SELL_SUCCESS target

In [None]:
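# Check correlation of each numeric feature with the target
print(df.columns)

# numeric_only=True skips text columns such as EMAIL, which cannot be correlated
correlation = df.corr(numeric_only=True)['CROSS_SELL_SUCCESS'].sort_values(ascending=False)
correlation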

In [None]:
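# dropping unnecessary columns
X = df.drop(['CROSS_SELL_SUCCESS', 'EMAIL'], axis=1)

# output prediction column
y = df['CROSS_SELL_SUCCESS']

# splitting dataset into training and testing sets (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=random_state, stratify=y)

# standardizing features: the scaler is fitted on the training set only,
# and the same transformation is then applied to the test set
scaler = StandardScaler()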
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
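
The scaling step above fits the scaler on the training set and only transforms the test set. The sketch below illustrates that idea on a single hypothetical column using the standard library. The helper names `fit_scaler` and `apply_scaler` are assumptions for this example. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant column
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical single feature, already split into train and test
train_col = [10.0, 20.0, 30.0]
test_col = [40.0]

mu, sigma = fit_scaler(train_col)                    # statistics come from train only
train_scaled = apply_scaler(train_col, mu, sigma)
test_scaled = apply_scaler(test_col, mu, sigma)      # test reuses the train statistics

print(train_scaled, test_scaled)
```

Reusing the training statistics keeps information about the test set out of the fitted model, which is why `fit_transform` is called on `X_train` and only `transform` on `X_test`.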

## 1. Developing Logistic regression classifier

Logistic regression, a popular classification method, is frequently used in binary classification tasks. We will go over the steps involved in creating a logistic regression classifier in this section. When the relationship between the input features and the target variable is roughly linear, a logistic regression classifier can be very effective for classification tasks. However, it might struggle with relationships between features and the target variable that are more complicated and non-linear.


Steps:
Import the necessary libraries: pandas for data manipulation, scikit-learn for modelling, and matplotlib for visualizations.
Load the data into a pandas dataframe, then separate the target variable from the predictor variables. Fit the model on the training set, then calculate the accuracy, train-test gap, and AUC score.
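
To make the "roughly linear" assumption concrete, the sketch below shows how a fitted logistic regression turns a linear combination of features into a probability and then a class label. It is a minimal illustration, not the scikit-learn internals. The weights, bias, and feature values are hypothetical and are not taken from the fitted `lr` model.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return int(predict_proba(features, weights, bias) >= threshold)


# Hypothetical coefficients for two standardized features
example_weights = [0.8, 0.3]
example_bias = -0.1
example_customer = [1.0, -2.0]

example_probability = predict_proba(example_customer, example_weights, example_bias)
example_label = predict(example_customer, example_weights, example_bias)
print(example_probability, example_label)
```

Because the decision depends only on the sign of the linear score, the boundary between the two classes is a straight line (or hyperplane) in feature space.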


In [None]:
model_name = "Logistic Regression"
# Create and Fit Logistic regression model
lr = LogisticRegression(random_state=random_state)
lr.fit(X_train, y_train)

# predictions on train and test
lr_train_preds = lr.predict(X_train)
lr_test_preds = lr.predict(X_test)

# accuracy calculation (built-in round works for both Python and NumPy floats)
lr_train_accuracy = round(accuracy_score(y_train, lr_train_preds), 4)
lr_test_accuracy = round(accuracy_score(y_test, lr_test_preds), 4)

# train test gap
lr_train_test_gap = round(abs(lr_train_accuracy - lr_test_accuracy), 4)

# auc score (uses the predicted probability of class 1)
lr_auc = round(roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]), 4)

# confusion matrix
lr_cm = confusion_matrix(y_test, lr_test_preds)

model_summary =  f"""\
Model Name:     {model_name}
Train Accuracy: {lr_train_accuracy}
Test Accuracy:  {lr_test_accuracy}
Train-Test Gap: {lr_train_test_gap}
AUC Score:      {lr_auc}
Confusion Matrix:
{lr_cm}
"""

# print(model_summary)
# plot_heatmap(lr_cm)

## 2. Developing Decision tree classifier

A decision tree is a popular classification method that is frequently used in binary classification tasks. Based on a set of input features, the algorithm constructs a tree-like structure that models decisions and their potential outcomes. The tree is generated by recursively partitioning the feature space into subsets that correspond to the various classes. As a classification algorithm, decision trees offer several advantages. Because the tree structure reflects the algorithm's decision-making process, they are simple to comprehend and interpret. They are relatively robust to outliers and, depending on the implementation, can handle missing values and categorical as well as numerical features. Because the features selected near the root node are the most informative for the classification task, decision trees can also be used for feature selection.


Steps:
The training data are used to train the model. The model learns how to best divide the classes into a tree-like structure during this step. By comparing the predicted values of the model to the actual values of the testing data, one can determine the model's accuracy. 
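
The way a tree "learns how to best divide the classes" can be sketched with the Gini impurity, which is the default splitting criterion of `DecisionTreeClassifier`. The example below is a simplified illustration for one feature and binary labels. The helper names and toy data are assumptions, and it tests observed values as thresholds rather than the midpoints a real implementation would use.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Threshold with the lowest weighted impurity (needs >= 2 distinct values)."""
    candidates = sorted(set(values))[:-1]  # the largest value would leave one side empty
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


# Toy feature (hypothetical): low values belong to class 0, high values to class 1
toy_values = [1, 2, 3, 4, 5, 6]
toy_classes = [0, 0, 0, 1, 1, 1]

chosen = best_threshold(toy_values, toy_classes)
print(chosen, split_impurity(toy_values, toy_classes, chosen))
```

With `max_depth=1`, as used in the cell below, the tree consists of a single split of this kind.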


In [None]:
model_name = "Decision Tree"
# Create and Fit Decision tree model
dt = DecisionTreeClassifier(random_state=random_state, max_depth=1,
                            class_weight='balanced')
dt.fit(X_train, y_train)

# predictions on train and test data
dt_train_preds = dt.predict(X_train)
dt_test_preds = dt.predict(X_test)

# accuracy calculation (built-in round works for both Python and NumPy floats)
dt_train_accuracy = round(accuracy_score(y_train, dt_train_preds), 4)
dt_test_accuracy = round(accuracy_score(y_test, dt_test_preds), 4)

# train test gap
dt_train_test_gap = round(abs(dt_train_accuracy - dt_test_accuracy), 4)

# auc score (uses the predicted probability of class 1)
dt_auc = round(roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]), 4)

# confusion matrix
dt_cm = confusion_matrix(y_test, dt_test_preds)

model_summary =  f"""\
Model Name:     {model_name}
Train Accuracy: {dt_train_accuracy}
Test Accuracy:  {dt_test_accuracy}
Train-Test Gap: {dt_train_test_gap}
AUC Score:      {dt_auc}
Confusion Matrix:
{dt_cm}
"""

# print(model_summary)
# plot_heatmap(dt_cm)

## 3. Developing Random Forest Classifier

The Random Forest Classifier is a powerful machine learning algorithm that performs well across a wide variety of classification tasks, and it is applied here to the cross_sell_success_2023 dataset. It builds many decision trees and combines their predictions, which reduces overfitting and usually increases accuracy compared with a single tree. Deciding which features to incorporate into the model is one of the main obstacles in the development of a Random Forest Classifier. This may entail transforming the raw data into a more meaningful representation that the algorithm can use, as well as determining which variables are most crucial to the classification task. Feature selection is an iterative process that frequently requires extensive data knowledge and domain expertise.

Avoiding overfitting, which can occur when the model is too complex and fits the training data too closely, is another challenge in developing a Random Forest Classifier. Overfitting harms the model's accuracy on new data and its generalizability. To prevent it, the Random Forest Classifier employs multiple decision trees and randomly selects subsets of features and data points for each tree. This reduces the correlation between the trees and improves the model's accuracy.
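
The two sources of randomness described above, and the way individual trees are combined, can be sketched as follows. This is an illustration of the idea only. The helper names are assumptions, and scikit-learn's `RandomForestClassifier` differs in detail: it draws feature subsets at each split and averages the trees' predicted probabilities rather than counting hard votes.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Combine the class predictions of several trees into one label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(219)  # fixed seed for reproducibility
rows = list(range(10))    # hypothetical row identifiers

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=8, k=3, rng=rng)
forest_label = majority_vote([1, 0, 1, 1, 0])  # hypothetical votes from five trees

print(sample, features, forest_label)
```

Because each tree sees a different sample and different features, the trees make different mistakes, and combining them tends to cancel those mistakes out.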


In [None]:
model_name = "Random Forest"

# Create and fit Random Forest model
rf = RandomForestClassifier(random_state=random_state, max_depth=1,
                            warm_start=False,
                            # criterion='entropy',
                            class_weight='balanced')
rf.fit(X_train, y_train)

# predictions on train and test data
rf_train_preds = rf.predict(X_train)
rf_test_preds = rf.predict(X_test)

# accuracy calculation (built-in round works for both Python and NumPy floats)
rf_train_accuracy = round(accuracy_score(y_train, rf_train_preds), 4)
rf_test_accuracy = round(accuracy_score(y_test, rf_test_preds), 4)

# train test gap
rf_train_test_gap = round(abs(rf_train_accuracy - rf_test_accuracy), 4)

# auc score (uses the predicted probability of class 1)
rf_auc = round(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]), 4)

# confusion matrix
rf_cm = confusion_matrix(y_test, rf_test_preds)

model_summary =  f"""\
Model Name:     {model_name}
Train Accuracy: {rf_train_accuracy}
Test Accuracy:  {rf_test_accuracy}
Train-Test Gap: {rf_train_test_gap}
AUC Score:      {rf_auc}
Confusion Matrix:
{rf_cm}
"""

# print(model_summary)
# plot_heatmap(rf_cm)

## 4. Developing Gradient boosting classifier

In [None]:

model_name = "Gradient Boosting"
# creating a Gradient Boosting Classifier
gbm = GradientBoostingClassifier(
    learning_rate= 0.0001, max_depth= 1, n_estimators= 2000,
    min_samples_leaf = 50, random_state=random_state,
)

# fitting the model to the training dataset
gbm.fit(X_train, y_train)

# predicting test set results
gbm_train_preds = gbm.predict(X_train)
gbm_test_preds = gbm.predict(X_test)

# training and testing accuracy (built-in round works for both Python and NumPy floats)
gbm_train_accuracy = round(accuracy_score(y_train, gbm_train_preds), 4)
gbm_test_accuracy = round(accuracy_score(y_test, gbm_test_preds), 4)

# train-test gap
gbm_train_test_gap = round(abs(gbm_train_accuracy - gbm_test_accuracy), 4)

# auc score (uses the predicted probability of class 1)
gbm_auc = round(roc_auc_score(y_true=y_test, y_score=gbm.predict_proba(X_test)[:, 1]), 4)

# confusion matrix
gbm_cm = confusion_matrix(y_true = y_test, y_pred = gbm_test_preds)

model_summary =  f"""\
Model Name:     {model_name}
Train Accuracy: {gbm_train_accuracy}
Test Accuracy:  {gbm_test_accuracy}
Train-Test Gap: {gbm_train_test_gap}
AUC Score:      {gbm_auc}
Confusion Matrix:
{gbm_cm}
"""

print(model_summary)
plot_heatmap(gbm_cm)

The Gradient Boosting Classifier is a powerful machine learning algorithm for classification tasks such as the one posed by the cross_sell_success_2023 dataset. The algorithm adds decision trees to the model iteratively, with each new tree fitted to correct the errors of the trees before it. With this method, complex and non-linear relationships between features and the target variable can be captured in highly accurate models. One of the most difficult challenges is tuning the parameters of a Gradient Boosting Classifier, such as the learning rate, tree depth, and number of estimators, to prevent overfitting, which can occur when the model is too complex and fits the training data too closely. With well-tuned parameters, the resulting model can be used in production to classify new data accurately and efficiently.


Steps:
With the desired parameters, create an instance of the Gradient Boosting Classifier model and fit it to the training dataset. Predict outcomes from the testing dataset using the model. Improve the model's performance by adjusting its hyperparameters.
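
The iterative idea described above, where each new tree corrects the errors left by the previous ones, can be sketched on a toy problem. The example below is an analogy only: it boosts one-split trees (stumps) on a single feature with a squared-error loss, whereas `GradientBoostingClassifier` optimizes a log-loss by default. All names and data are assumptions for this illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold that best fits the residuals (least squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_estimators, learning_rate):
    """Start from the mean, then repeatedly add a shrunken stump fitted to the residuals."""
    preds = [sum(ys) / len(ys)] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return preds


# Toy data (hypothetical): the target switches from 0 to 1 between x=2 and x=3
toy_x = [1, 2, 3, 4]
toy_y = [0, 0, 1, 1]

fast_preds = boost(toy_x, toy_y, n_estimators=20, learning_rate=0.5)
slow_preds = boost(toy_x, toy_y, n_estimators=20, learning_rate=0.0001)
print(fast_preds)
print(slow_preds)
```

In this toy setting, a very small learning rate leaves the predictions close to the starting mean unless the number of estimators is increased to compensate, which is the trade-off to keep in mind when tuning `learning_rate` and `n_estimators` together.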

From this model we got:

Model Name:     Gradient Boosting

Train Accuracy: 0.6785

Test Accuracy:  0.6797

Train-Test Gap: 0.0012

AUC Score:      0.601

Confusion Matrix:

[[  0 156]
 [  0 331]]



## 5. Developing KNeighbors classifier

In [None]:

model_name = "KNeighbors"

# creating a KNeighbors Classifier (default settings: 5 neighbors, Euclidean distance)
knn = KNeighborsClassifier()

# fitting the model to the training dataset
knn.fit(X_train, y_train)

# predicting test set results
knn_train_preds = knn.predict(X_train)
knn_test_preds = knn.predict(X_test)

# training and testing accuracy (built-in round works for both Python and NumPy floats)
knn_train_accuracy = round(accuracy_score(y_train, knn_train_preds), 4)
knn_test_accuracy = round(accuracy_score(y_test, knn_test_preds), 4)

# train-test gap
knn_train_test_gap = round(abs(knn_train_accuracy - knn_test_accuracy), 4)

# auc score (uses the predicted probability of class 1)
knn_auc = round(roc_auc_score(y_true=y_test, y_score=knn.predict_proba(X_test)[:, 1]), 4)

# confusion matrix
knn_cm = confusion_matrix(y_true = y_test, y_pred = knn_test_preds)

model_summary =  f"""\
Model Name:     {model_name}
Train Accuracy: {knn_train_accuracy}
Test Accuracy:  {knn_test_accuracy}
Train-Test Gap: {knn_train_test_gap}
AUC Score:      {knn_auc}
Confusion Matrix:
{knn_cm}
"""

# print(model_summary)
# plot_heatmap(knn_cm)

For classification tasks such as the one posed by the cross_sell_success_2023 dataset, the KNeighbors Classifier is a straightforward but effective machine learning algorithm. The algorithm works by determining a data point's k nearest neighbors and classifying it according to the majority class of those neighbors. When using the KNeighbors Classifier, selecting the appropriate distance metric for locating neighbors is an important consideration. Depending on the data and the classification task at hand, this could be the Manhattan distance, the Euclidean distance, or something else entirely.


Steps:
Create a KNeighbors classifier and fit it to the training dataset, which stores the training observations. For each test observation, compute its distance to every training observation, select the k nearest neighbors, and assign the majority class among them. Compare the predictions with the actual values to calculate the accuracy, train-test gap, and AUC score.
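
The nearest-neighbor vote and the choice of distance metric can be sketched in a few lines. The example below is a minimal illustration with hypothetical two-dimensional points, not the scikit-learn implementation. The function names are assumptions. The default of `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier`.

```python
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=5, distance=euclidean):
    """Label the query with the majority class among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: one cluster per class
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
near_corner = knn_predict(points, labels, (5.5, 5.5), k=3, distance=manhattan)
print(near_origin, near_corner)
```

Since every prediction depends on distances, features on very different scales would dominate the vote, which is one reason the features were standardized before modelling.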


# Evaluation
Based on the guideline given for the classification model development, the final model's AUC score should be less than or equal to 0.90. Under that guideline, five models were compared: logistic regression, a decision tree classifier, a random forest classifier, a gradient boosting classifier, and a KNeighbors classifier. The logistic regression classifier gives an AUC of 0.56 with a train-test gap of 0.015. The decision tree classifier gives an AUC of 0.60 with a train-test gap of 0.028. The random forest classifier gives an AUC of 0.55 with a train-test gap of 0.0124. The KNeighbors classifier gives an AUC of 0.50 with a train-test gap of 0.014. Gradient boosting is selected as the final model because it has the joint-highest AUC (0.60, matching the decision tree) and the smallest train-test gap (0.0012), which is also in line with the guideline. Its confusion matrix, plotted as a heat map above, shows that the model predicts class 1 for every test observation, so its test accuracy of 0.6797 equals the share of class 1 in the test set (331 of 487).

From this model we got:

Model Name:     Gradient Boosting

Train Accuracy: 0.6785

Test Accuracy:  0.6797

Train-Test Gap: 0.0012

AUC Score:      0.601

Confusion Matrix:

[[  0 156]
 [  0 331]]
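
The reported confusion matrix can be unpacked into a few additional metrics to help interpret the accuracy figure. The sketch below uses only the four counts shown above and the scikit-learn layout (rows are true classes, columns are predicted classes). The variable names are assumptions for this illustration.

```python
# Confusion matrix reported for the final model: rows = true class, columns = predicted class
cm = [[0, 156],
      [0, 331]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total        # share of correct predictions
recall = tp / (tp + fn)             # share of true class 1 that was found
specificity = tn / (tn + fp)        # share of true class 0 that was found

# Accuracy of always predicting the more frequent class
majority_baseline = max(tn + fp, fn + tp) / total

print(round(accuracy, 4), recall, specificity, round(majority_baseline, 4))
```

Comparing accuracy against the majority-class baseline, alongside the AUC, gives a fuller picture of how much the model has learned beyond the class balance.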
