Classification Model Development

Classification is the process of assigning a new observation to one of a set of predefined categories, or classes, based on the observation's features. It is a supervised machine learning technique: the algorithm is trained on a labeled dataset to learn the relationships between input features and output classes, and the trained model can then predict the class of previously unseen observations (Amin et al., 2019). Classification models are commonly used in image recognition, text classification, fraud detection, and spam filtering, among other applications (Amin et al., 2019). Examples of classification models include Logistic Regression, Decision Trees, Random Forests, Support Vector Machines (SVMs), Naive Bayes, and Neural Networks.

A logistic regression model is typically used for binary classification problems, in which the output variable has only two possible values (e.g., 0 or 1, yes or no). It uses a logistic function to model the probability that the output variable belongs to a particular class (Zou et al., 2019). Decision trees, on the other hand, use a tree-like structure to represent the classification decision-making process: a tree divides the input according to the values of its features, and each branch corresponds to a feature-based decision. The random forest model is a collection of decision trees, with each tree trained on a different subset of the input data and features (Hao & Ho, 2019). The final prediction is made by aggregating the predictions of all the trees.
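The role of the logistic function can be illustrated with a minimal sketch. The weights, bias, and the helper names logistic and predict_class below are hypothetical and chosen only for illustration; they are not fitted to the project data.

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one observation."""
    # Linear score: bias plus the weighted sum of the features
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = logistic(z)
    return p, int(p >= threshold)


# Hypothetical coefficients, for illustration only
weights = [0.8, -0.4]
bias = -0.2

p, label = predict_class([1.0, 2.0], weights, bias)
print(f"P(class = 1) = {p:.3f}, predicted label = {label}")
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual default threshold between the two classes.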

Furthermore, SVMs are used to solve both binary and multi-class classification problems. An SVM works by finding the hyperplane that best separates the input data into different classes (Hao & Ho, 2019). The Naive Bayes model is based on Bayes' theorem and assumes that, given the class, the features are conditionally independent. It is frequently used in text classification tasks. Finally, a neural network is a complex, non-linear model that can be applied to binary and multi-class classification problems (Hao & Ho, 2019). It learns features and makes predictions using multiple layers of nodes. These are just a few examples of the many classification models available, and the choice of model depends on the specific problem being solved and the characteristics of the input data.

In this project, we build three models, namely logistic regression, random forest, and support vector machine, to classify the binary response variable. We chose these models because all three have been used extensively in research and industry for binary classification tasks, making them reliable choices for our specific problem (Chen et al., 2020). By employing these algorithms, we hope to build a robust and accurate classification system that can effectively predict the binary response variable.

The code performs a binary classification task on the Cross_Sell_Success_Dataset_2023.xlsx dataset using three different models: logistic regression, random forest, and support vector machine. The goal is to evaluate the performance of each model and select the best one based on its train-test accuracy gap and its area under the receiver operating characteristic curve (AUC).
Here is how the code is built and how it works:
First, the necessary libraries are imported: pandas for data handling; train_test_split from sklearn for splitting the data into training and testing sets; SimpleImputer for imputing missing values, StandardScaler for scaling the data, and Pipeline for chaining these two preprocessing steps; and LogisticRegression, RandomForestClassifier, SVC, accuracy_score, roc_auc_score, and confusion_matrix from sklearn for building and evaluating the classification models.

Second, the dataset is loaded using pd.read_excel(). The target variable y is the first column of the dataset, and the features X are all columns except the first two. The data is then split into training and testing sets using train_test_split(), with 10% of the observations held out for testing. Third, a Pipeline is created to preprocess the data. This pipeline includes an imputer that fills missing values with the median and a standard scaler that scales the data. Imputation is a method for estimating missing values from the available data. Median imputation is a commonly used approach that replaces each missing value with the median of its feature. This ensures that the dataset is complete, so the machine learning algorithms can be applied without discarding observations that contain missing values (Choudhury & Pal, 2019).
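As a minimal sketch of median imputation, the example below applies SimpleImputer to a small made-up array. The arrays X_train_toy and X_test_toy are illustrative assumptions and are not taken from the project dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data with missing values (np.nan); illustrative only
X_train_toy = np.array([[1.0, 10.0],
                        [2.0, np.nan],
                        [np.nan, 30.0],
                        [9.0, 50.0]])

# Toy test row in which every feature is missing
X_test_toy = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="median")

# The medians are learned from the training data only...
X_train_filled = imputer.fit_transform(X_train_toy)

# ...and the same training medians are reused to fill the test data
X_test_filled = imputer.transform(X_test_toy)

print("Learned medians:", imputer.statistics_)
print("Filled test row:", X_test_filled)
```

Because the medians come from the training set alone, no information from the test set leaks into the preprocessing step.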

The standard scaler standardizes the dataset's features by transforming each feature to have zero mean and unit variance (Nguyen et al., 2019). As a result, the features are on a consistent scale and are comparable. Scaling is useful because it ensures that no single feature dominates the others simply because of its units or magnitude (Nguyen et al., 2019). This can improve the performance of algorithms that are sensitive to feature scale, such as logistic regression and support vector machines.
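The effect of standardization can be checked on a small example. The sketch below uses made-up arrays (X_train_toy and X_test_toy are assumptions for illustration) whose two features differ in magnitude by a factor of 100.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales; illustrative only
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])
X_test_toy = np.array([[4.0, 400.0]])

scaler = StandardScaler()

# Mean and standard deviation are estimated from the training data
X_train_scaled = scaler.fit_transform(X_train_toy)

# The test data is scaled with the training mean and standard deviation
X_test_scaled = scaler.transform(X_test_toy)

print("Training means after scaling:", X_train_scaled.mean(axis=0))
print("Training std devs after scaling:", X_train_scaled.std(axis=0))
print("Scaled test row:", X_test_scaled)
```

After scaling, both features have the same spread in the training data, so the second feature no longer outweighs the first merely because its raw values are larger.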

In the fourth step, the training and testing data are preprocessed using the pipeline created in step 3: the pipeline is fitted on the training data and then applied to both sets. Next, a list of models to test is defined, comprising logistic regression, random forest, and support vector machine. In the fifth step, a loop goes through each model in the list and fits it to the training data. Predictions are then computed for both the training and testing data. The accuracy score and AUC score are calculated for both sets, and a confusion matrix is printed for the testing data. The best model is chosen based on the Train-Test Gap, defined as the training accuracy minus the testing accuracy, and the AUC score on the testing data.

The AUC (Area Under the Receiver Operating Characteristic Curve) score is a popular metric for assessing the performance of a binary classification model. It measures the model's ability to distinguish between the positive and negative classes (Sofaer et al., 2019). The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) across the classifier's probability thresholds. The AUC score is the area under the ROC curve; it ranges from 0.0 to 1.0, with 0.5 representing a random guess and 1.0 representing perfect classification (Sofaer et al., 2019). A high AUC score indicates that the model separates the positive and negative classes well, whereas a score near 0.5 indicates poor performance (Sofaer et al., 2019). The AUC score is especially useful when the class distribution is skewed, with one class outnumbering the other (Sofaer et al., 2019). In such cases, accuracy alone can be deceptive because the model can achieve high accuracy simply by predicting the majority class and ignoring the minority class.
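Because the ROC curve is defined over probability thresholds, the value returned by roc_auc_score depends on whether it receives continuous scores or hard class labels. The sketch below illustrates this with made-up labels and scores (y_true and scores are assumptions, not project data). It also shows that a classifier which predicts the positive class for every observation scores exactly 0.5.

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities; illustrative only
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.2, 0.6, 0.7, 0.9, 0.4, 0.3]

# AUC from continuous scores: every threshold contributes to the ROC curve
auc_from_scores = roc_auc_score(y_true, scores)

# AUC from hard labels: the ROC curve collapses to a single operating point
labels_at_half = [int(s >= 0.5) for s in scores]
auc_from_labels = roc_auc_score(y_true, labels_at_half)

# A classifier that always predicts the positive class cannot rank anything
all_positive = [1] * len(y_true)
auc_all_positive = roc_auc_score(y_true, all_positive)

print(f"AUC from scores:       {auc_from_scores:.3f}")
print(f"AUC from hard labels:  {auc_from_labels:.3f}")
print(f"AUC, all-positive:     {auc_all_positive:.3f}")
```

In practice, scores for this purpose would typically come from a model's predict_proba or decision_function method rather than from predict.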

In this project, we select the best model based on the Train-Test Gap and the AUC score: the gap should be low, and the AUC should be high. As can be seen in the code, we initialize best_auc to 0 and lowest_gap to infinity before looping through the models. For each model, we evaluate its performance, calculate the train-test gap, and compare that gap with the lowest gap seen so far. If the current model has a test AUC score greater than or equal to 0.5 and a lower train-test gap, we update the best model, best AUC score, and lowest train-test gap accordingly. The Support Vector Machine, with an AUC score of 0.5 and a Train-Test Gap of 0.024, is the only model that meets these criteria; logistic regression (AUC 0.492) and random forest (AUC 0.488) fall below the 0.5 threshold. Note, however, that an AUC of 0.5 is the random-guess level described above, and the confusion matrix shows that the SVM assigns all 195 test observations to the positive class, so this result should be interpreted with caution. In general, SVM is an effective binary classification algorithm because it can handle high-dimensional data and, with a suitable kernel, learn non-linear decision boundaries. With appropriate regularization, it also tends to generalize well to previously unseen data. It is a versatile and widely used algorithm in domains such as finance, healthcare, and image classification because it can be used for both linear and non-linear classification tasks. The last line of the code prints the best model together with its AUC score and Train-Test Gap.
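The selection rule can be traced in isolation. The sketch below applies the same condition to the rounded Train-Test Gap and AUC values printed in the output; the function name select_best and the results list are introduced here for illustration and do not appear in the project code.

```python
# (model name, train-test gap, test AUC), rounded as in the printed output
results = [
    ("Logistic Regression", 0.024, 0.492),
    ("Random Forest", 0.349, 0.488),
    ("Support Vector Machine", 0.024, 0.500),
]


def select_best(results, min_auc=0.5):
    """Return (name, auc, gap) of the model with the lowest gap among
    those whose test AUC is at least min_auc, or (None, 0.0, inf) if
    no model passes the AUC threshold."""
    best_name, best_auc, lowest_gap = None, 0.0, float("inf")
    for name, gap, auc in results:
        # Same condition as in the evaluation loop
        if auc >= min_auc and gap < lowest_gap:
            best_name, best_auc, lowest_gap = name, auc, gap
    return best_name, best_auc, lowest_gap


best_name, best_auc, lowest_gap = select_best(results)
print(f"Best model: {best_name}, AUC: {best_auc:.3f}, gap: {lowest_gap:.3f}")
```

Under this rule the AUC acts only as a pass/fail threshold, and the train-test gap decides among the models that pass. A model with a higher AUC but a slightly larger gap would therefore not be selected.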

References

Amin, M. S., Chiam, Y. K., & Varathan, K. D. (2019). Identification of significant features and data mining techniques in predicting heart disease. Telematics and Informatics, 36, 82-93.
Chen, R. C., Dewi, C., Huang, S. W., & Caraka, R. E. (2020). Selecting critical features for data classification based on machine learning methods. Journal of Big Data, 7(1), 52.
Choudhury, S. J., & Pal, N. R. (2019). Imputation of missing data with neural networks for classification. Knowledge-Based Systems, 182, 104838.
Hao, J., & Ho, T. K. (2019). Machine learning made easy: a review of scikit-learn package in python programming language. Journal of Educational and Behavioral Statistics, 44(3), 348-361.
Nguyen, G., Dlugolinsky, S., Bobák, M., Tran, V., López García, Á., Heredia, I., ... & Hluchý, L. (2019). Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artificial Intelligence Review, 52, 77-124.
Sofaer, H. R., Hoeting, J. A., & Jarnevich, C. S. (2019). The area under the precision‐recall curve as a performance metric for rare binary events. Methods in Ecology and Evolution, 10(4), 565-577.
Zou, X., Hu, Y., Tian, Z., & Shen, K. (2019, October). Logistic regression model optimization and case analysis. In 2019 IEEE 7th international conference on computer science and network technology (ICCSNT) (pp. 135-139). IEEE.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

# Load dataset
df = pd.read_excel(r'C:\Users\User\Desktop\Cross_Sell_Success_Dataset_2023.xlsx')

# Separate features and target: the target is the first column,
# and the features are every column from the third onward
X = df.iloc[:, 2:].values
y = df.iloc[:, 0].values

# Split data into training (90%) and testing (10%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=219)

# Create pipeline to impute missing values with the median and standardize the features
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

# Fit the pipeline on the training data only, then apply it to both sets
X_train_prepared = num_pipeline.fit_transform(X_train)
X_test_prepared = num_pipeline.transform(X_test)

# Define list of models to test
models = [('Logistic Regression', LogisticRegression()),
          ('Random Forest', RandomForestClassifier()),
          ('Support Vector Machine', SVC())]

# Track the best model: test AUC of at least 0.5 and the lowest train-test gap seen so far
best_model = None
best_auc = 0
lowest_gap = float('inf')

# Loop through models and evaluate performance
for name, model in models:
    model.fit(X_train_prepared, y_train)

    # Predicted class labels for both sets
    y_train_pred = model.predict(X_train_prepared)
    y_test_pred = model.predict(X_test_prepared)

    # Accuracy on both sets
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)

    # AUC computed from the predicted class labels (not from probabilities or decision scores)
    train_auc = roc_auc_score(y_train, y_train_pred)
    test_auc = roc_auc_score(y_test, y_test_pred)

    # Train-test gap: training accuracy minus testing accuracy
    train_test_gap = train_acc - test_acc
    cm = confusion_matrix(y_test, y_test_pred)

    print(f"{name} - Training Accuracy: {train_acc:.3f}, Testing Accuracy: {test_acc:.3f}, Train-Test Gap: {train_test_gap:.3f}, AUC Score: {test_auc:.3f}")
    print(f"Confusion Matrix:\n{cm}")

    # Update the best model when both selection criteria are met
    if test_auc >= 0.5 and train_test_gap < lowest_gap:
        best_model = model
        best_auc = test_auc
        lowest_gap = train_test_gap

print(f"\nBest model: {best_model}\nBest AUC score: {best_auc:.3f}, Lowest Train-Test Gap: {lowest_gap:.3f}")


Logistic Regression - Training Accuracy: 0.680, Testing Accuracy: 0.656, Train-Test Gap: 0.024, AUC Score: 0.492
Confusion Matrix:
[[  0  65]
 [  2 128]]
Random Forest - Training Accuracy: 1.000, Testing Accuracy: 0.651, Train-Test Gap: 0.349, AUC Score: 0.488
Confusion Matrix:
[[  0  65]
 [  3 127]]
Support Vector Machine - Training Accuracy: 0.690, Testing Accuracy: 0.667, Train-Test Gap: 0.024, AUC Score: 0.500
Confusion Matrix:
[[  0  65]
 [  0 130]]

Best model: SVC()
Best AUC score: 0.500, Lowest Train-Test Gap: 0.024
