# Text Classification Model Analysis

In this assignment, we explore and compare different machine learning models for a text classification task. Our objective is to identify the most effective model based on performance metrics such as accuracy, precision, and recall. This notebook covers the entire workflow: data preprocessing, model training, hyperparameter tuning, and the final comparison and selection of the best model.

### Problem statement, dataset, and objective of the analysis

This assignment focuses on text classification for cancer-related documents. The selected dataset contains 7570 documents, distributed across 3 cancer types: thyroid, colon, and lung. The purpose of the analysis is to conduct text analysis and build machine learning models that accurately identify the type of cancer discussed in each document.

The dataset can be downloaded from the following source: https://www.kaggle.com/datasets/falgunipatel19/biomedical-text-publication-classification/data

# Pre-processing


### Data Preprocessing Steps:

- **Loading and Understanding the Data**: Performing exploratory data analysis to understand the structure, content, and any immediate data quality issues.
- **Cleaning Data**: Removing special characters, numbers, and unnecessary whitespace.
- **Tokenization**: Splitting text into individual words or tokens.
- **Lemmatization/Stemming**: Reducing words to their base or root form.
- **Stop Word Removal**: Eliminating common words that provide little value in the analysis.
- **Vectorization (TF-IDF)**: Converting text into numeric form to create feature vectors.
- **Feature Engineering**: Encoding the text with sklearn's `TfidfVectorizer` to produce a matrix of TF-IDF features.
- **Splitting the Data**: The data is divided into training and testing sets to evaluate the model's performance on unseen data. 

### Rationale Behind Chosen Methods:

- **Cleaning Data**: Ensures data quality and consistency, which are fundamental for reliable model predictions.
- **Feature Engineering**: Turns the cleaned text into TF-IDF features, which weight each term by how distinctive it is for a document, giving the models more informative signals than raw word counts.
- **Text Processing**: Text data must be converted into a numerical format for machine learning algorithms to process it. This step is crucial for any task involving text analysis.
- **Normalization/Standardization**: `TfidfVectorizer` L2-normalizes each document vector by default, which equalizes the influence of long and short documents on the model's outcome and improves training stability.
- **Splitting the Data**: Essential for validating the model's ability to generalize from the training data to unseen data, providing an estimate of its performance in real-world scenarios.
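
The steps listed above can also be chained into a single scikit-learn `Pipeline`, so that the vectorizer and every later transformer are fitted on the training data only. The following is a minimal sketch on a made-up toy corpus; the documents, the shortened labels, and `n_components=2` are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the cancer documents (hypothetical data).
docs = [
    "thyroid nodule surgery hormone",
    "thyroid hormone gland carcinoma",
    "colon polyp colorectal screening",
    "colorectal colon tumor resection",
    "lung nodule smoking carcinoma",
    "lung airway smoking tumor",
]
labels = ["Thyroid", "Thyroid", "Colon", "Colon", "Lung", "Lung"]

# TF-IDF -> truncated SVD (LSA) -> classifier, fitted as one unit.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(docs, labels)

predictions = pipe.predict(["colon screening resection", "thyroid gland surgery"])
print(predictions)
```

A pipeline like this can be passed directly to `GridSearchCV`, in which case the TF-IDF vocabulary is re-learned inside each cross-validation fold.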



In [89]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from sklearn.model_selection import train_test_split
from sklearn import preprocessing #for preprocessing text data
from sklearn.feature_extraction.text import TfidfVectorizer #TfidfVectorizer (which includes pre-processing, tokenization, and filtering out stop words)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import TruncatedSVD
from string import punctuation
import time
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV


In [51]:
df = pd.read_csv("Cancer_Dataset.csv", encoding='latin1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,a
0,0,Thyroid_Cancer,Thyroid surgery in children in a single insti...
1,1,Thyroid_Cancer,""" The adopted strategy was the same as that us..."
2,2,Thyroid_Cancer,coronary arterybypass grafting thrombosis ï¬b...
3,3,Thyroid_Cancer,Solitary plasmacytoma SP of the skull is an u...
4,4,Thyroid_Cancer,This study aimed to investigate serum matrix ...


In [52]:
#Dropping the irrelevant column
df = df.drop('Unnamed: 0',axis=1)
# Renaming the column names
df.columns=['Class_Labels', 'Research_Paper_Text']
df

Unnamed: 0,Class_Labels,Research_Paper_Text
0,Thyroid_Cancer,Thyroid surgery in children in a single insti...
1,Thyroid_Cancer,""" The adopted strategy was the same as that us..."
2,Thyroid_Cancer,coronary arterybypass grafting thrombosis ï¬b...
3,Thyroid_Cancer,Solitary plasmacytoma SP of the skull is an u...
4,Thyroid_Cancer,This study aimed to investigate serum matrix ...
...,...,...
7565,Colon_Cancer,we report the case of a 24yearold man who pres...
7566,Colon_Cancer,among synchronous colorectal cancers scrcs rep...
7567,Colon_Cancer,the heterogeneity of cancer cells is generally...
7568,Colon_Cancer,"""adipogenesis is the process through which mes..."


In [53]:
#Dropping Null Values
count = df['Class_Labels'].isna().sum()
if  count > 0:
    print(f'Found {count} null values in Class_Labels column')
    #df['Class_Labels'].fillna('missing', inplace=True) # though we could do this, we will drop the rows instead - as there is no way to impute the text
    df = df.dropna(subset=['Class_Labels'])

In [54]:
#Dropping Null Values
count = df['Research_Paper_Text'].isna().sum()
if  count > 0:
    print(f'Found {count} null values in Research_Paper_Text column')
    #df['Research_Paper_Text'].fillna('missing', inplace=True) # though we could do this, we will drop the rows instead - as there is no way to impute the text
    df = df.dropna(subset=['Research_Paper_Text'])

In [55]:
df['Class_Labels'].unique()

array(['Thyroid_Cancer', 'Colon_Cancer', 'Lung_Cancer'], dtype=object)

In [56]:
#checking data imbalance
df['Class_Labels'].value_counts()

Thyroid_Cancer    2810
Colon_Cancer      2580
Lung_Cancer       2180
Name: Class_Labels, dtype: int64

In [57]:
# Define stopwords list
stopwords_list = set(stopwords.words('english'))

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Define function for text cleaning and lemmatization
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    # Cleaning and lemmatization: keep alphabetic tokens longer than two
    # characters that are not stop words (isalpha() already excludes punctuation)
    clean_lemmatized_text = [
        lemmatizer.lemmatize(token.lower())
        for token in tokens
        if token.isalpha() and len(token) > 2 and token.lower() not in stopwords_list
    ]
    return " ".join(clean_lemmatized_text)

# Apply text preprocessing to the 'Research_Paper_Text' column
df['Research_Paper_Text'] = df['Research_Paper_Text'].apply(preprocess_text)

In [58]:
df.head()

Unnamed: 0,Class_Labels,Research_Paper_Text
0,Thyroid_Cancer,thyroid surgery child single institution osama...
1,Thyroid_Cancer,adopted strategy used prior year based four ex...
2,Thyroid_Cancer,coronary arterybypass grafting thrombosis muta...
3,Thyroid_Cancer,solitary plasmacytoma skull uncommon clinical ...
4,Thyroid_Cancer,study aimed investigate serum matrix metallopr...


In [59]:
X = df['Research_Paper_Text']

In [60]:
y = df['Class_Labels']

In [61]:
le = preprocessing.LabelEncoder()
le.fit(y)
classes = list(enumerate(le.classes_))
print(classes)
y = le.transform(y)
y

[(0, 'Colon_Cancer'), (1, 'Lung_Cancer'), (2, 'Thyroid_Cancer')]


array([2, 2, 2, ..., 0, 0, 0])

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [63]:
X_train.shape, y_train.shape

((5299,), (5299,))

In [64]:
X_test.shape, y_test.shape

((2271,), (2271,))
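
The split above does not pass `stratify`, so the class proportions in the training and test sets may drift slightly from those of the full data. A minimal sketch of a stratified split follows. It uses placeholder features and labels that only reproduce the class counts printed earlier (`X_demo` and `y_demo` are hypothetical names, not the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the value_counts() output above
# (2 = Thyroid_Cancer, 0 = Colon_Cancer, 1 = Lung_Cancer).
y_demo = np.array([2] * 2810 + [0] * 2580 + [1] * 2180)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Class proportions in the test split should closely match the full data.
full_share = np.bincount(y_demo) / len(y_demo)
test_share = np.bincount(y_te) / len(y_te)
print(full_share.round(3), test_share.round(3))
```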

In [65]:
y_train

array([0, 2, 1, ..., 1, 1, 2])

Sklearn: Text preparation
The text was already cleaned, filtered for stop words, and lemmatized by `preprocess_text` above, so no further cleaning is applied here. `TfidfVectorizer` is used with its default settings on the preprocessed text.

In [66]:
tfidf_vect = TfidfVectorizer() 

X_train = tfidf_vect.fit_transform(X_train)

In [67]:
X_train.shape

(5299, 146913)

In [68]:
X_train

<5299x146913 sparse matrix of type '<class 'numpy.float64'>'
	with 4612930 stored elements in Compressed Sparse Row format>

In [69]:
print(y_train)

[0 2 1 ... 1 1 2]


In [70]:
# Perform the TfidfVectorizer transformation


X_test = tfidf_vect.transform(X_test)

In [71]:
X_train.shape, X_test.shape

((5299, 146913), (2271, 146913))
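
Both matrices have the same number of columns because the vocabulary and IDF weights are learned from the training documents only and then reused for the test documents. The toy sketch below (made-up sentences, hypothetical variable names) illustrates the consequence: a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["thyroid surgery child", "colon cancer cell"]
test_docs = ["lung cancer surgery"]  # "lung" never appears in the training text

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vect.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vect.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Fitting the vectorizer on the test set as well would leak information about unseen documents into the features.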

In [72]:
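# These matrices are sparse. Printing their values requires a dense conversion such as todense() or toarray(),
# which allocates every cell and needs several GB of memory for a matrix of this shape.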
np.set_printoptions(precision=3)
print(X_train.todense())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
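
The printed corners are all zeros because almost every cell of a TF-IDF matrix is empty. The arithmetic below uses the shape and stored-element count reported for `X_train` to estimate the cost of the dense conversion, and then shows on a small made-up matrix how a slice can be inspected without densifying everything. This is a sketch; `demo` and `corner` are hypothetical names.

```python
import numpy as np
from scipy import sparse

# Shape and stored-element count reported for X_train above.
n_rows, n_cols, n_stored = 5299, 146913, 4612930

dense_gb = n_rows * n_cols * 8 / 1e9    # float64 uses 8 bytes per cell
density = n_stored / (n_rows * n_cols)  # share of non-zero cells
print(f"dense size ~{dense_gb:.1f} GB, density {density:.2%}")

# Slicing first and converting second densifies only the part being printed.
demo = sparse.csr_matrix(np.diag([0.3, 0.5, 0.2, 0.9]))
corner = demo[:2, :3].toarray()
print(corner)
```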


## Latent Semantic Analysis (Singular Value Decomposition)

In [73]:
# n_components is the number of latent dimensions (topics) to keep; it should be
# smaller than both the number of features and the number of rows in the matrix
svd = TruncatedSVD(n_components=1000, n_iter=10)

X_train_dim_reduct = svd.fit_transform(X_train)
X_test__dim_reduct = svd.transform(X_test)

In [74]:
X_train.shape, X_test.shape

((5299, 146913), (2271, 146913))

In [75]:
X_train_dim_reduct.shape, X_test__dim_reduct.shape

((5299, 1000), (2271, 1000))

In [76]:
X_train_dim_reduct

array([[ 2.705e-01,  1.300e-01, -3.006e-02, ..., -1.256e-18, -3.663e-18,
        -1.542e-18],
       [ 1.672e-01, -1.148e-01,  9.938e-03, ..., -6.370e-19, -1.247e-18,
        -1.965e-19],
       [ 2.557e-01, -1.916e-01,  4.097e-02, ..., -2.819e-18, -1.708e-18,
        -5.001e-18],
       ...,
       [ 2.344e-01,  1.359e-01, -7.398e-02, ...,  1.237e-18,  6.895e-19,
         7.098e-19],
       [ 2.948e-01, -5.169e-02,  3.892e-02, ..., -1.081e-18,  6.539e-19,
         1.445e-18],
       [ 2.183e-01, -2.331e-01, -7.712e-02, ...,  2.067e-19,  1.128e-18,
         1.517e-18]])
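
The trailing columns in the output above are on the order of 1e-18, which is numerically zero and suggests that fewer than 1000 components carry signal. One way to check how many are needed is the cumulative `explained_variance_ratio_` of the fitted `TruncatedSVD`. The sketch below demonstrates this on synthetic low-rank data; the data, `svd_demo`, and the 99% threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows that are combinations of only 5 latent patterns.
low_rank = rng.random((200, 5)) @ rng.random((5, 50))

svd_demo = TruncatedSVD(n_components=20, random_state=42)
svd_demo.fit(low_rank)

cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 99%.
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_needed, cumulative[:6].round(4))
```

Applied to the notebook's fitted `svd` object, the same two lines would indicate whether a smaller `n_components` keeps nearly all of the variance at a lower training cost.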

In [77]:
df = pd.DataFrame(X_train_dim_reduct, columns=[f"svd{num:04}" for num in range(0,X_train_dim_reduct.shape[1])])
df


Unnamed: 0,svd0000,svd0001,svd0002,svd0003,svd0004,svd0005,svd0006,svd0007,svd0008,svd0009,...,svd0990,svd0991,svd0992,svd0993,svd0994,svd0995,svd0996,svd0997,svd0998,svd0999
0,0.270461,0.130039,-0.030061,-0.004413,-0.012367,0.005488,-0.021099,0.060785,0.083293,0.032068,...,-2.253108e-18,-3.083200e-18,-2.683400e-18,3.001885e-18,4.065758e-19,-1.599198e-18,-3.781155e-18,-1.256150e-18,-3.662570e-18,-1.541600e-18
1,0.167226,-0.114772,0.009938,-0.036315,0.001427,0.059813,-0.013987,-0.020511,0.045512,-0.116298,...,3.794708e-19,-4.092863e-18,-1.694066e-18,1.497554e-18,4.252105e-19,1.355253e-20,-7.411538e-19,-6.369688e-19,-1.246832e-18,-1.965116e-19
2,0.255663,-0.191578,0.040965,0.098437,-0.165433,0.070955,0.071593,0.018734,0.056828,-0.038015,...,-2.195509e-18,-4.106416e-18,-5.204170e-18,-3.611748e-18,-6.308701e-18,-1.192622e-18,-3.601584e-18,-2.818926e-18,-1.707618e-18,-5.000883e-18
3,0.125376,-0.105049,-0.015376,-0.009029,-0.042645,-0.075272,0.026641,-0.000863,-0.028413,0.011569,...,-2.867365e-18,-6.819462e-18,1.782157e-18,-5.746272e-18,-1.228198e-20,-7.559769e-18,1.314172e-18,5.655639e-18,2.422938e-18,3.423284e-18
4,0.320904,0.160357,-0.002871,-0.014655,-0.034423,-0.044569,-0.028687,0.017214,0.078894,0.015232,...,1.450120e-18,2.317482e-18,1.246832e-18,6.606857e-19,-1.253609e-18,-1.463673e-18,5.739495e-18,1.057097e-18,-1.016440e-18,3.652406e-18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5294,0.225822,-0.005949,-0.020814,0.108820,-0.060078,-0.013237,-0.045673,0.005723,0.089451,-0.059352,...,6.911789e-19,-3.117081e-19,-6.640738e-19,3.794708e-19,-1.029992e-18,-1.707618e-18,-1.761829e-19,3.388132e-19,3.489776e-19,8.402567e-19
5295,0.220313,0.029904,0.004537,0.035671,0.017494,0.066387,-0.006452,-0.005851,0.052100,0.031431,...,-5.963112e-19,-3.947174e-19,-6.234162e-19,-5.421011e-19,-1.341700e-18,-2.385245e-18,-1.897354e-19,2.981556e-19,7.589415e-19,5.692061e-19
5296,0.234396,0.135858,-0.073980,0.022718,-0.037104,-0.010010,0.017729,0.049966,-0.038229,-0.001414,...,1.804180e-19,1.639856e-18,1.738112e-18,-4.582448e-19,9.249600e-19,4.404571e-20,1.235398e-18,1.236668e-18,6.894848e-19,7.098136e-19
5297,0.294845,-0.051692,0.038916,0.175387,-0.313029,0.133039,0.252425,-0.170509,-0.031680,0.312841,...,1.560658e-19,-2.082007e-18,8.368686e-19,-1.624609e-18,1.394216e-18,-6.516860e-19,-4.133521e-19,-1.080814e-18,6.539094e-19,1.445038e-18


## Model Training and Evaluation

We train and evaluate seven different models to understand their baseline performance on our dataset. Each model is assessed based on its accuracy on the training and test sets. The models include:

1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Machine (SVM)
4. Decision Tree
5. Random Forest
6. AdaBoost
7. XGBoost

For each model, the training process is followed by an evaluation using accuracy as the primary metric.

## Hyperparameter Tuning

To optimize the performance of each model, we apply hyperparameter tuning using GridSearchCV.  
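
`GridSearchCV.best_score_` is the mean accuracy over the held-out cross-validation folds for the best parameter combination, so it is a validation estimate rather than a training accuracy. The tuning cells below repeat the same pattern for each model. As a sketch, that pattern could be wrapped in one helper; `tune_and_report` is a hypothetical name, and the synthetic data only stands in for the SVD features.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_and_report(estimator, param_grid, X_tr, y_tr, X_te, y_te, cv=5):
    """Grid-search an estimator and return CV accuracy, test accuracy, and time."""
    start = time.time()
    grid = GridSearchCV(estimator, param_grid, cv=cv, scoring="accuracy")
    grid.fit(X_tr, y_tr)
    elapsed = time.time() - start
    return {
        "best_params": grid.best_params_,
        "cv_accuracy": grid.best_score_,  # mean over held-out folds
        "test_accuracy": accuracy_score(y_te, grid.predict(X_te)),
        "seconds": elapsed,
    }


# Synthetic three-class data standing in for the SVD features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

report = tune_and_report(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, X_tr, y_tr, X_te, y_te
)
print(report)
```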


### Logistic Regression

In [78]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix


lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_lr = lr_clf.predict(X_train_dim_reduct)
y_test_pred_lr = lr_clf.predict(X_test__dim_reduct)

print("Logistic Regression:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_lr):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_lr):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_lr))

Logistic Regression:
Train accuracy: 0.9619
Test accuracy: 0.9348
Confusion Matrix:
[[727   0  85]
 [  0 618   0]
 [ 63   0 778]]
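
Accuracy alone hides which classes are being confused. Per-class precision and recall can be derived directly from the confusion matrix printed above, where rows are true classes and columns are predicted classes in the `LabelEncoder` order. The sketch below redoes that arithmetic with NumPy.

```python
import numpy as np

# Test-set confusion matrix printed above (rows = true class, columns = predicted).
cm = np.array([[727, 0, 85],
               [0, 618, 0],
               [63, 0, 778]])
class_names = ["Colon_Cancer", "Lung_Cancer", "Thyroid_Cancer"]  # LabelEncoder order

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = predicted counts
recall = true_positives / cm.sum(axis=1)     # row sums = true counts
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<15} precision={p:.4f} recall={r:.4f}")
print(f"accuracy={accuracy:.4f}")
```

All of the errors fall between the colon and thyroid classes, while lung documents are separated perfectly. `sklearn.metrics.classification_report` gives the same per-class figures from the predictions.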


Hyperparameter tuning for logistic regression

In [93]:


from sklearn.model_selection import GridSearchCV
start_time = time.time()
lr_grid = GridSearchCV(LogisticRegression(random_state=42, max_iter=1000), 
                       {'C': [0.1, 1, 10], 'solver': ['liblinear']}, 
                       cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
lr_grid.fit(X_train_dim_reduct, y_train)

print(f"Best LR Parameters: {lr_grid.best_params_}")
print(f"Best LR Score: {lr_grid.best_score_:.4f}")

print("Logistic Regression Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_lr):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_lr):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {lr_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, lr_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")


lr_time = time.time() - start_time

print(f"Logistic Regression Tuning Time: {lr_time:.2f} seconds.")


Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best LR Parameters: {'C': 10, 'solver': 'liblinear'}
Best LR Score: 0.9598
Logistic Regression Comparison:
Baseline Train Accuracy: 0.9619, Test Accuracy: 0.9348
After Tuning CV Accuracy: 0.9598, Test Accuracy: 0.9498
Logistic Regression Tuning Time: 4.20 seconds.


### K-Nearest Neighbors (KNN)

In [81]:


knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_knn = knn_clf.predict(X_train_dim_reduct)
y_test_pred_knn = knn_clf.predict(X_test__dim_reduct)

print("K-Nearest Neighbors (KNN):")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_knn):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_knn):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_knn))

K-Nearest Neighbors (KNN):
Train accuracy: 0.9989
Test accuracy: 0.9982
Confusion Matrix:
[[808   0   4]
 [  0 618   0]
 [  0   0 841]]


Hyperparameter tuning for KNN

In [94]:


start_time = time.time()
knn_grid = GridSearchCV(KNeighborsClassifier(), 
                        {'n_neighbors': [3, 5, 7], 'metric': ['euclidean', 'manhattan']}, 
                        cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
knn_grid.fit(X_train_dim_reduct, y_train)

print(f"Best KNN Parameters: {knn_grid.best_params_}")
print(f"Best KNN Score: {knn_grid.best_score_:.4f}")

print("KNN Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_knn):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_knn):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {knn_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, knn_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")
knn_time = time.time() - start_time

print(f"KNN Tuning Time: {knn_time:.2f} seconds.")


Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best KNN Parameters: {'metric': 'euclidean', 'n_neighbors': 3}
Best KNN Score: 0.9942
KNN Comparison:
Baseline Train Accuracy: 0.9989, Test Accuracy: 0.9982
After Tuning CV Accuracy: 0.9942, Test Accuracy: 0.9982
KNN Tuning Time: 9.78 seconds.


### Support Vector Machine (SVM)

In [83]:

svm_clf = SVC()
svm_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_svm = svm_clf.predict(X_train_dim_reduct)
y_test_pred_svm = svm_clf.predict(X_test__dim_reduct)

print("Support Vector Machine (SVM):")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_svm):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_svm):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_svm))

Support Vector Machine (SVM):
Train accuracy: 0.9564
Test accuracy: 0.9256
Confusion Matrix:
[[719   0  93]
 [  0 618   0]
 [ 76   0 765]]


Hyperparameter tuning for SVM

In [95]:


start_time = time.time()

svm_grid = GridSearchCV(SVC(random_state=42), 
                        {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, 
                        cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
svm_grid.fit(X_train_dim_reduct, y_train)
print(f"Best SVM Parameters: {svm_grid.best_params_}")
print(f"Best SVM Score: {svm_grid.best_score_:.4f}")

print("SVM Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_svm):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_svm):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {svm_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, svm_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")
svm_time = time.time() - start_time

print(f"SVM Tuning Time: {svm_time:.2f} seconds.")

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best SVM Parameters: {'C': 10, 'kernel': 'rbf'}
Best SVM Score: 0.9623
SVM Comparison:
Baseline Train Accuracy: 0.9564, Test Accuracy: 0.9256
After Tuning CV Accuracy: 0.9623, Test Accuracy: 0.9599
SVM Tuning Time: 44.52 seconds.


### Decision Tree

In [85]:

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_dt = dt_clf.predict(X_train_dim_reduct)
y_test_pred_dt = dt_clf.predict(X_test__dim_reduct)

print("Decision Tree:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_dt):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_dt):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_dt))

Decision Tree:
Train accuracy: 0.9989
Test accuracy: 0.9978
Confusion Matrix:
[[808   0   4]
 [  1 617   0]
 [  0   0 841]]


Hyperparameter tuning for decision tree

In [96]:

start_time = time.time()

dt_param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
dt = DecisionTreeClassifier(random_state=42)
dt_grid = GridSearchCV(dt, dt_param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
dt_grid.fit(X_train_dim_reduct, y_train)

print(f"Best parameters for Decision Tree: {dt_grid.best_params_}")
print(f"Best score for Decision Tree: {dt_grid.best_score_}")

print("Decision Tree Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_dt):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_dt):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {dt_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, dt_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")
dt_time = time.time() - start_time

print(f"Decision Tree Tuning Time: {dt_time:.2f} seconds.")


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters for Decision Tree: {'max_depth': None, 'min_samples_split': 2}
Best score for Decision Tree: 0.9962248115880058
Decision Tree Comparison:
Baseline Train Accuracy: 0.9989, Test Accuracy: 0.9978
After Tuning CV Accuracy: 0.9962, Test Accuracy: 0.9978
Decision Tree Tuning Time: 10.34 seconds.


### Random Forest

In [87]:

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=10, n_jobs=-1)
rf_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_rf = rf_clf.predict(X_train_dim_reduct)
y_test_pred_rf = rf_clf.predict(X_test__dim_reduct)

print("Random Forest:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_rf):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_rf):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_rf))

Random Forest:
Train accuracy: 0.8420
Test accuracy: 0.8159
Confusion Matrix:
[[466  73 273]
 [  0 617   1]
 [  0  71 770]]


Hyperparameter tuning for random forest

In [97]:
start_time = time.time()

rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(rf, rf_param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
rf_grid.fit(X_train_dim_reduct, y_train)

print(f"Best parameters for Random Forest: {rf_grid.best_params_}")
print(f"Best score for Random Forest: {rf_grid.best_score_}")

print("Random Forest Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_rf):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_rf):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {rf_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, rf_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")
rf_time = time.time() - start_time

print(f"Random Forest Tuning Time: {rf_time:.2f} seconds.")

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best parameters for Random Forest: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Best score for Random Forest: 0.9984905660377358
Random Forest Comparison:
Baseline Train Accuracy: 0.8420, Test Accuracy: 0.8159
After Tuning CV Accuracy: 0.9985, Test Accuracy: 0.9982
Random Forest Tuning Time: 152.10 seconds.


### AdaBoost

In [40]:

ada_clf = AdaBoostClassifier()
ada_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_ada = ada_clf.predict(X_train_dim_reduct)
y_test_pred_ada = ada_clf.predict(X_test__dim_reduct)

print("AdaBoost:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_ada):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_ada):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_ada))

AdaBoost:
Train accuracy: 0.6580
Test accuracy: 0.6684
Confusion Matrix:
[[617   0 195]
 [  2 612   4]
 [552   0 289]]


Hyperparameter tuning for AdaBoost

In [98]:


start_time = time.time()

ab_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1]
}
ab = AdaBoostClassifier(random_state=42)
ab_grid = GridSearchCV(ab, ab_param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
ab_grid.fit(X_train_dim_reduct, y_train)

print(f"Best parameters for AdaBoost: {ab_grid.best_params_}")
print(f"Best score for AdaBoost: {ab_grid.best_score_}")

print("AdaBoost Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_ada):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_ada):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {ab_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, ab_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")
ab_time = time.time() - start_time

print(f"AdaBoost Tuning Time: {ab_time:.2f} seconds.")


Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best parameters for AdaBoost: {'learning_rate': 1, 'n_estimators': 200}
Best score for AdaBoost: 0.7248611185347515
AdaBoost Comparison:
Baseline Train Accuracy: 0.6580, Test Accuracy: 0.6684
After Tuning CV Accuracy: 0.7249, Test Accuracy: 0.7107
AdaBoost Tuning Time: 184.71 seconds.


### XGBoost

In [42]:

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_xgb = xgb_clf.predict(X_train_dim_reduct)
y_test_pred_xgb = xgb_clf.predict(X_test__dim_reduct)

print("XGBoost:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_xgb):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_xgb):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_xgb))

XGBoost:
Train accuracy: 0.9989
Test accuracy: 0.9982
Confusion Matrix:
[[808   0   4]
 [  0 618   0]
 [  0   0 841]]


Hyperparameter tuning for XGBoost

In [99]:

start_time = time.time()

xgb_grid = GridSearchCV(xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'), 
                        {'learning_rate': [0.1, 0.2], 'max_depth': [3, 6], 'n_estimators': [50, 100]}, 
                        cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
xgb_grid.fit(X_train_dim_reduct, y_train)
print(f"Best XGB Parameters: {xgb_grid.best_params_}")
print(f"Best XGB Score: {xgb_grid.best_score_:.4f}")

print("XGBoost Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_xgb):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_xgb):.4f}")
# best_score_ is the mean cross-validated accuracy of the best setting, not accuracy on the full training set
print(f"After Tuning CV Accuracy: {xgb_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, xgb_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")

xgb_time = time.time() - start_time

print(f"XGBoost Tuning Time: {xgb_time:.2f} seconds.")

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best XGB Parameters: {'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 100}
Best XGB Score: 0.9975
XGBoost Comparison:
Baseline Train Accuracy: 0.9989, Test Accuracy: 0.9982
After Tuning CV Accuracy: 0.9975, Test Accuracy: 0.9982
XGBoost Tuning Time: 62.85 seconds.


# Challenges Faced

Training the models and tuning their hyperparameters was computationally demanding. The TF-IDF matrix has 146,913 features, and even after reduction to 1,000 SVD components the grid searches took up to about 185 seconds per model, as shown in the results. This made it slow to test and adjust the models and code, because every change required a lengthy rerun.

Another challenge is guarding against overfitting. Many of the models reach very high accuracies, which might suggest that the data is biased or that the models are fitting the training set too closely.
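
When test accuracy is this close to 1.0, one thing worth ruling out is duplicated documents that land in both the training and test splits. A model such as KNN could then match a test document to its identical training copy. Whether that applies to this dataset is not established here. The sketch below only shows what such a check could look like, on a made-up frame that reuses the notebook's column names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame reusing the notebook's column names; rows are illustrative only.
demo_df = pd.DataFrame({
    "Class_Labels": ["Thyroid_Cancer", "Thyroid_Cancer", "Colon_Cancer",
                     "Colon_Cancer", "Lung_Cancer", "Lung_Cancer"],
    "Research_Paper_Text": ["text a", "text a", "text b",
                            "text c", "text d", "text d"],
})

# Share of rows whose text already appeared earlier in the frame.
duplicate_share = demo_df["Research_Paper_Text"].duplicated().mean()

# Dropping duplicates before splitting keeps identical texts out of both splits.
deduped = demo_df.drop_duplicates(subset="Research_Paper_Text")
train_part, test_part = train_test_split(deduped, test_size=0.5, random_state=42)
overlap = set(train_part["Research_Paper_Text"]) & set(test_part["Research_Paper_Text"])
print(f"duplicate share: {duplicate_share:.2f}, overlap after dedup: {len(overlap)}")
```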

In [115]:


baseline_accuracies = {
    'Logistic Regression': accuracy_score(y_test, y_test_pred_lr),
    'KNN': accuracy_score(y_test, y_test_pred_knn),
    'SVM': accuracy_score(y_test, y_test_pred_svm),
    'Decision Tree': accuracy_score(y_test, y_test_pred_dt),
    'Random Forest': accuracy_score(y_test, y_test_pred_rf),
    'AdaBoost': accuracy_score(y_test, y_test_pred_ada),
    'XGBoost': accuracy_score(y_test, y_test_pred_xgb)
}

tuned_accuracies = {
    'Logistic Regression': accuracy_score(y_test, lr_grid.best_estimator_.predict(X_test__dim_reduct)),
    'KNN': accuracy_score(y_test, knn_grid.best_estimator_.predict(X_test__dim_reduct)),
    'SVM': accuracy_score(y_test, svm_grid.best_estimator_.predict(X_test__dim_reduct)),
    'Decision Tree': accuracy_score(y_test, dt_grid.best_estimator_.predict(X_test__dim_reduct)),
    'Random Forest': accuracy_score(y_test, rf_grid.best_estimator_.predict(X_test__dim_reduct)),
    'AdaBoost': accuracy_score(y_test, ab_grid.best_estimator_.predict(X_test__dim_reduct)),
    'XGBoost': accuracy_score(y_test, xgb_grid.best_estimator_.predict(X_test__dim_reduct))
}

tuned_times = {
    'Logistic Regression': lr_time,
    'KNN': knn_time,
    'SVM': svm_time,
    'Decision Tree': dt_time,
    'Random Forest': rf_time,
    'AdaBoost': ab_time,
    'XGBoost': xgb_time
}

    

In [120]:
models = ["Logistic Regression", "KNN", "SVM", "Decision Tree", "Random Forest", "AdaBoost", "XGBoost"]
print(f"{'Model':<20} {'Baseline Acc.':<15} {'Tuned Acc.':<15} {'Tuning Time (s)':<15}")
for model in models:
    print(f"{model:<20} {baseline_accuracies[model]:<15.4f} {tuned_accuracies[model]:<15.4f} {tuned_times[model]:<15.2f}")

Model                Baseline Acc.   Tuned Acc.      Tuning Time (s)
Logistic Regression  0.9348          0.9498          4.20
KNN                  0.9982          0.9982          9.78
SVM                  0.9256          0.9599          44.52
Decision Tree        0.9978          0.9978          10.34
Random Forest        0.8159          0.9982          152.10
AdaBoost             0.6684          0.7107          184.71
XGBoost              0.9982          0.9982          62.85


In [123]:
# Print comparison of baseline vs tuned accuracies
for model in baseline_accuracies.keys():
    print(f"{model}: Baseline Accuracy = {baseline_accuracies[model]:.4f}, Tuned Accuracy = {tuned_accuracies[model]:.4f}.")

# Identify the best performing model after tuning (ties resolve to the first model in dictionary order)
best_model = max(tuned_accuracies, key=tuned_accuracies.get)
print(f"\nThe best performing model after hyperparameter tuning is {best_model} with an accuracy of {tuned_accuracies[best_model]:.4f} and a tuning time of {tuned_times[best_model]:.2f} seconds.")


Logistic Regression: Baseline Accuracy = 0.9348, Tuned Accuracy = 0.9498.
KNN: Baseline Accuracy = 0.9982, Tuned Accuracy = 0.9982.
SVM: Baseline Accuracy = 0.9256, Tuned Accuracy = 0.9599.
Decision Tree: Baseline Accuracy = 0.9978, Tuned Accuracy = 0.9978.
Random Forest: Baseline Accuracy = 0.8159, Tuned Accuracy = 0.9982.
AdaBoost: Baseline Accuracy = 0.6684, Tuned Accuracy = 0.7107.
XGBoost: Baseline Accuracy = 0.9982, Tuned Accuracy = 0.9982.

The best performing model after hyperparameter tuning is KNN with an accuracy of 0.9982 and a tuning time of 9.78 seconds.


## Model Performance Improvements

The results reveal significant improvements in accuracy for some models post-tuning, notably for **Logistic Regression** and **SVM**, where accuracy increased from 0.9348 to 0.9498 and from 0.9256 to 0.9599, respectively. These improvements underscore the value of hyperparameter tuning in optimizing model performance, particularly for models that are sensitive to parameter settings.

**Random Forest** and **AdaBoost** exhibited the largest performance shifts. Random Forest's accuracy rose from 0.8159 to 0.9982, while AdaBoost saw a more modest increase from 0.6684 to 0.7107. The Random Forest baseline was not run with default settings: it was constrained with `max_leaf_nodes=10`, whereas the best grid-search configuration (`max_depth=None`, `min_samples_split=2`, `n_estimators=100`) places no such limit on tree size. The gain therefore mostly reflects removing that constraint rather than fine-tuning. For AdaBoost, the gain, although smaller, moves in the right direction, but the model remains far behind the others.

Conversely, **KNN**, **Decision Tree**, and **XGBoost** maintained the same high level of accuracy before and after tuning, which suggests two possibilities: either the default parameters were already near-optimal, or the range of hyperparameters explored during tuning did not significantly impact these models' performance.

## Computational Costs

The computational cost associated with tuning varied widely among the models. Notably, **AdaBoost** and **Random Forest** required the most time, with 184.71 and 152.10 seconds, respectively. This significant computational investment yielded substantial accuracy improvements for Random Forest, making the cost worthwhile in this scenario. For AdaBoost, however, the considerable tuning time did not produce a similarly dramatic performance increase, which makes it a less efficient use of computational resources.

**SVM** also showed a notable computational cost at 44.52 seconds but delivered a solid performance improvement, indicating a good balance between cost and benefit.

In contrast, **Logistic Regression**, despite its relatively low computational cost of 4.20 seconds, achieved a meaningful accuracy improvement, showcasing an efficient tuning process.
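
The cost-benefit comparison above can be put into one rough figure per model: accuracy points gained per 100 seconds of tuning. The sketch below computes it from the test accuracies and tuning times reported in the comparison table, with times rounded to two decimals. The metric itself is an illustrative choice, and the timings also include the prediction and printing done inside each timed cell.

```python
# Baseline accuracy, tuned accuracy, and tuning time (s) from the comparison table.
results = {
    "Logistic Regression": (0.9348, 0.9498, 4.20),
    "KNN": (0.9982, 0.9982, 9.78),
    "SVM": (0.9256, 0.9599, 44.52),
    "Decision Tree": (0.9978, 0.9978, 10.34),
    "Random Forest": (0.8159, 0.9982, 152.10),
    "AdaBoost": (0.6684, 0.7107, 184.71),
    "XGBoost": (0.9982, 0.9982, 62.85),
}

# Percentage points of test accuracy gained per 100 seconds of tuning.
gain_per_100s = {
    name: (tuned - base) * 100 / seconds * 100
    for name, (base, tuned, seconds) in results.items()
}

for name, value in sorted(gain_per_100s.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {value:6.2f} points per 100 s")
```

By this measure Logistic Regression gains the most per unit of tuning time, followed by Random Forest, SVM, and AdaBoost. The three models whose accuracy did not change score zero.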

## Best Model Selection

Considering both the accuracy improvements and the computational costs, **Random Forest** stands out as the most improved model, reaching a test accuracy of 0.9982 after tuning. That figure is shared with **KNN** and **XGBoost**, which reached it before any tuning, and KNN needed only 9.78 seconds of tuning time against 152.10 seconds for Random Forest. KNN is therefore the most economical of the three top performers, and it is the model reported as best by the comparison code above.

However, if computational efficiency is a priority, **Logistic Regression** offers a compelling alternative, with a significant accuracy improvement and minimal tuning time.

## Future Directions

The results suggest several avenues for further exploration:

- **Exploring Further Hyperparameter Ranges**: For models like KNN, Decision Tree, and XGBoost, which showed no improvement with tuning, exploring a broader or different set of hyperparameters might uncover untapped performance potential.
- **Feature Engineering and Selection**: Improving model performance might also be achieved by refining the input features, suggesting a direction for further research and experimentation.
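
A broader hyperparameter range does not have to mean a proportionally longer grid search. `RandomizedSearchCV` samples a fixed number of settings from the search space. The following is a minimal sketch for KNN on synthetic data; the parameter ranges, `n_iter=10`, and the data are assumptions chosen for illustration, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the SVD features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)

# Sample 10 settings from a wider space instead of enumerating every combination.
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "n_neighbors": randint(1, 30),
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan", "cosine"],
    },
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```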





# Discussion and Conclusion


### Insights Gained and Potential Applications:
- The high accuracy achieved by KNN suggests that the dataset contains clear, distinguishable patterns that KNN can exploit effectively, making it highly suitable for applications requiring high precision and reliability.
- The insights gained from the model comparison and selection process can inform future projects, emphasizing the value of exploratory data analysis, feature engineering, and rigorous model evaluation.

