### Rishin Tiwari
# Text Classification Model Analysis

In this project, we aim to explore and compare different machine learning models for a text classification task. Our objective is to identify the most effective model based on performance metrics such as accuracy, precision, and recall. This notebook covers the entire workflow, from data preprocessing, model training, and hyperparameter tuning to the final comparison and selection of the best model.

### Problem statement, dataset, and objective of the analysis

This assignment focuses on text classification for cancer-related documents. The selected dataset contains 7569 documents distributed across 3 cancer types: thyroid, colon, and lung. The purpose of the analysis is to conduct text analysis and build machine learning models that accurately identify the type of cancer for each observation.

The dataset can be downloaded from the following source: https://www.kaggle.com/datasets/falgunipatel19/biomedical-text-publication-classification/data

# Pre-processing


### Data Preprocessing Steps:

- **Loading and Understanding the Data**: Performing exploratory data analysis to understand the structure, content, and any immediate data quality issues.
- **Cleaning Data**: Removing special characters, numbers, and unnecessary whitespace.
- **Tokenization**: Splitting text into individual words or tokens.
- **Lemmatization/Stemming**: Reducing words to their base or root form.
- **Stop Word Removal**: Eliminating common words that provide little value in the analysis.
- **Vectorization (TF-IDF)**: Converting text into numeric form to create feature vectors.
- **Feature Engineering**: Encoding the text with sklearn's `TfidfVectorizer` to produce a matrix of TF-IDF features.
- **Splitting the Data**: The data is divided into training and testing sets to evaluate the model's performance on unseen data. 

### Rationale Behind Chosen Methods:

- **Cleaning Data**: Ensures data quality and consistency, which are fundamental for reliable model predictions.
- **Feature Engineering**: Leverages domain knowledge to introduce new relevant features, potentially improving model performance by providing more informative signals.
- **Text Processing**: Text data must be converted into a numerical format for machine learning algorithms to process it. This step is crucial for any task involving text analysis.
- **Normalization/Standardization**: Helps to equalize the influence of features on the model's outcome, improving training stability and performance.
- **Splitting the Data**: Essential for validating the model's ability to generalize from the training data to unseen data, providing an estimate of its performance in real-world scenarios.
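
To make the vectorization step concrete, the sketch below computes TF-IDF weights by hand on a tiny made-up corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults. Tokenization is simplified to whitespace splitting, so it is an illustration rather than a replacement for the vectorizer. The function name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Made-up documents for illustration only.
toy_docs = [
    "thyroid nodule tumor",
    "colon polyp tumor",
    "lung nodule tumor",
]
weights = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, weights):
    print(doc, {term: round(w, 3) for term, w in sorted(row.items())})
```

In this toy corpus, "tumor" appears in every document and therefore receives the lowest weight in each row, while class-specific terms such as "thyroid" receive the highest.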



In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from sklearn.model_selection import train_test_split
from sklearn import preprocessing #for preprocessing text data
from sklearn.feature_extraction.text import TfidfVectorizer #TfidfVectorizer (which includes pre-processing, tokenization, and filtering out stop words)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import TruncatedSVD
from string import punctuation
import time
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV


In [None]:
df = pd.read_csv("Cancer_Dataset.csv", encoding='latin1')
df.head()

In [None]:
#Dropping the irrelevant column
df = df.drop('Unnamed: 0',axis=1)
# Renaming the column names
df.columns=['Class_Labels', 'Research_Paper_Text']
df

In [None]:
#Dropping Null Values
count = df['Class_Labels'].isna().sum()
if  count > 0:
    print(f'Found {count} null values in Class_Labels column')
    #df['Class_Labels'].fillna('missing', inplace=True) # though we could do this, we will drop the rows instead - as there is no way to impute the text
    df = df.dropna(subset=['Class_Labels'])

In [None]:
#Dropping Null Values
count = df['Research_Paper_Text'].isna().sum()
if  count > 0:
    print(f'Found {count} null values in Research_Paper_Text column')
    #df['Research_Paper_Text'].fillna('missing', inplace=True) # though we could do this, we will drop the rows instead - as there is no way to impute the text
    df = df.dropna(subset=['Research_Paper_Text'])

In [None]:
df['Class_Labels'].unique()

In [None]:
#checking data imbalance
df['Class_Labels'].value_counts()

In [None]:
# Define stopwords list
stopwords_list = set(stopwords.words('english'))

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Define function for text cleaning and lemmatization
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    # Cleaning and lemmatization
    clean_lemmatized_text = [lemmatizer.lemmatize(token.lower()) for token in tokens if (token.lower() not in punctuation) and (token.lower() not in stopwords_list) and (len(token) > 2) and token.isalpha()]
    return " ".join(clean_lemmatized_text)

# Apply text preprocessing to the 'Research_Paper_Text' column
df['Research_Paper_Text'] = df['Research_Paper_Text'].apply(preprocess_text)

In [None]:
df.head()

In [None]:
X = df['Research_Paper_Text']

In [None]:
y = df['Class_Labels']

In [None]:
le = preprocessing.LabelEncoder()
le.fit(y)
classes = list(enumerate(le.classes_))
print(classes)
y = le.transform(y)
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
y_train

Sklearn: Text preparation
The text was already cleaned, lemmatized, and stripped of stop words by `preprocess_text` above, so `TfidfVectorizer` is used here with its default settings. It is fit on the training text only and then applied to the test text.

In [None]:
tfidf_vect = TfidfVectorizer() 

X_train = tfidf_vect.fit_transform(X_train)

In [None]:
X_train.shape

In [None]:
X_train

In [None]:
print(y_train)

In [None]:
# Perform the TfidfVectorizer transformation


X_test = tfidf_vect.transform(X_test)

In [None]:
X_train.shape, X_test.shape

In [None]:
# These datasets are sparse matrices; their values are only displayed after converting them with todense() or toarray()
np.set_printoptions(precision=3)
print(X_train.todense())

## Latent Semantic Analysis (Singular Value Decomposition)

In [None]:
svd = TruncatedSVD(n_components=1000, n_iter=10) #n_components is the number of topics, which should be less than the number of features, and number of rows in the matrix

X_train_dim_reduct = svd.fit_transform(X_train)
X_test__dim_reduct = svd.transform(X_test)

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train_dim_reduct.shape, X_test__dim_reduct.shape

In [None]:
X_train_dim_reduct

In [None]:
df = pd.DataFrame(X_train_dim_reduct, columns=[f"svd{num:04}" for num in range(0,X_train_dim_reduct.shape[1])])
df


## Model Training and Evaluation

We train and evaluate seven different models to understand their baseline performance on our dataset. Each model is assessed based on its accuracy on the training and test sets. The models include:

1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Machine (SVM)
4. Decision Tree
5. Random Forest
6. AdaBoost
7. XGBoost

For each model, the training process is followed by an evaluation using accuracy as the primary metric.
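
Since the notebook prints a confusion matrix for every model, per-class precision and recall can be read off it directly. The sketch below shows one way to do that, assuming the scikit-learn convention that rows are true classes and columns are predicted classes. The counts in `toy_cm` are made up for illustration and are not results from this notebook.

```python
def per_class_precision_recall(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

# Hypothetical 3-class confusion matrix (made-up counts).
toy_cm = [
    [50, 3, 2],
    [4, 45, 1],
    [6, 2, 40],
]

print(f"Accuracy: {accuracy(toy_cm):.4f}")
for k, (p, r) in enumerate(per_class_precision_recall(toy_cm)):
    print(f"Class {k}: precision={p:.4f}, recall={r:.4f}")
```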

## Hyperparameter Tuning

To optimize the performance of each model, we apply hyperparameter tuning using GridSearchCV.  
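
One caveat with tuning on `X_train_dim_reduct`: the TF-IDF vocabulary and the SVD components were fit on the full training set, so the validation part of each cross-validation fold has already influenced the features. A common way to avoid this is to wrap the steps in a `Pipeline` so they are refit inside every fold. The sketch below illustrates the idea on a tiny made-up corpus. The corpus, `n_components=2`, and `cv=2` are assumptions chosen only to keep the example small, and this is not the configuration used in this notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus for illustration only (not the cancer dataset).
toy_texts = [
    "thyroid gland hormone nodule", "thyroid hormone iodine uptake",
    "thyroid nodule biopsy result", "thyroid iodine gland scan",
    "colon polyp bowel screening", "colon bowel rectal lesion",
    "colon polyp rectal biopsy", "colon screening bowel lesion",
    "lung airway smoking nodule", "lung smoking bronchial lesion",
    "lung airway bronchial scan", "lung nodule smoking biopsy",
]
toy_labels = [0] * 4 + [1] * 4 + [2] * 4

# Vectorizer, dimensionality reduction, and classifier are fit together per fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(toy_texts, toy_labels)

print(grid.best_params_, round(grid.best_score_, 4))
```

The trade-off is runtime: the vectorizer and SVD are refit for every candidate and fold, which would add to the tuning times reported later.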


### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix


lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_lr = lr_clf.predict(X_train_dim_reduct)
y_test_pred_lr = lr_clf.predict(X_test__dim_reduct)

print("Logistic Regression:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_lr):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_lr):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_lr))

Hyperparameter tuning for logistic regression

In [None]:


from sklearn.model_selection import GridSearchCV
start_time = time.time()
lr_grid = GridSearchCV(LogisticRegression(random_state=42, max_iter=1000), 
                       {'C': [0.1, 1, 10], 'solver': ['liblinear']}, 
                       cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
lr_grid.fit(X_train_dim_reduct, y_train)

print(f"Best LR Parameters: {lr_grid.best_params_}")
print(f"Best LR Score: {lr_grid.best_score_:.4f}")

print("Logistic Regression Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_lr):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_lr):.4f}")
print(f"After Tuning CV Accuracy: {lr_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, lr_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy


lr_time = time.time() - start_time

print(f"Logistic Regression Tuning Time: {lr_time:.2f} seconds.")


### K-Nearest Neighbors (KNN)

In [None]:


knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_knn = knn_clf.predict(X_train_dim_reduct)
y_test_pred_knn = knn_clf.predict(X_test__dim_reduct)

print("K-Nearest Neighbors (KNN):")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_knn):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_knn):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_knn))

Hyperparameter tuning for KNN

In [None]:


start_time = time.time()
knn_grid = GridSearchCV(KNeighborsClassifier(), 
                        {'n_neighbors': [3, 5, 7], 'metric': ['euclidean', 'manhattan']}, 
                        cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
knn_grid.fit(X_train_dim_reduct, y_train)

print(f"Best KNN Parameters: {knn_grid.best_params_}")
print(f"Best KNN Score: {knn_grid.best_score_:.4f}")

print("KNN Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_knn):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_knn):.4f}")
print(f"After Tuning CV Accuracy: {knn_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, knn_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy
knn_time = time.time() - start_time

print(f"KNN Tuning Time: {knn_time:.2f} seconds.")


### Support Vector Machine (SVM)

In [None]:

svm_clf = SVC()
svm_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_svm = svm_clf.predict(X_train_dim_reduct)
y_test_pred_svm = svm_clf.predict(X_test__dim_reduct)

print("Support Vector Machine (SVM):")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_svm):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_svm):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_svm))

Hyperparameter tuning for SVM

In [None]:


start_time = time.time()

svm_grid = GridSearchCV(SVC(random_state=42), 
                        {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, 
                        cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
svm_grid.fit(X_train_dim_reduct, y_train)
print(f"Best SVM Parameters: {svm_grid.best_params_}")
print(f"Best SVM Score: {svm_grid.best_score_:.4f}")

print("SVM Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_svm):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_svm):.4f}")
print(f"After Tuning CV Accuracy: {svm_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, svm_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy
svm_time = time.time() - start_time

print(f"SVM Tuning Time: {svm_time:.2f} seconds.")

### Decision Tree

In [None]:

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_dt = dt_clf.predict(X_train_dim_reduct)
y_test_pred_dt = dt_clf.predict(X_test__dim_reduct)

print("Decision Tree:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_dt):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_dt):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_dt))

Hyperparameter tuning for decision tree

In [None]:

start_time = time.time()

dt_param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
dt = DecisionTreeClassifier(random_state=42)
dt_grid = GridSearchCV(dt, dt_param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
dt_grid.fit(X_train_dim_reduct, y_train)

print(f"Best parameters for Decision Tree: {dt_grid.best_params_}")
print(f"Best score for Decision Tree: {dt_grid.best_score_:.4f}")

print("Decision Tree Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_dt):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_dt):.4f}")
print(f"After Tuning CV Accuracy: {dt_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, dt_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy
dt_time = time.time() - start_time

print(f"Decision Tree Tuning Time: {dt_time:.2f} seconds.")


### Random Forest

In [None]:

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=10, n_jobs=-1)
rf_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_rf = rf_clf.predict(X_train_dim_reduct)
y_test_pred_rf = rf_clf.predict(X_test__dim_reduct)

print("Random Forest:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_rf):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_rf):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_rf))

Hyperparameter tuning for random forest

In [None]:
start_time = time.time()

rf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(rf, rf_param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
rf_grid.fit(X_train_dim_reduct, y_train)

print(f"Best parameters for Random Forest: {rf_grid.best_params_}")
print(f"Best score for Random Forest: {rf_grid.best_score_:.4f}")

print("Random Forest Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_rf):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_rf):.4f}")
print(f"After Tuning CV Accuracy: {rf_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, rf_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy
rf_time = time.time() - start_time

print(f"Random Forest Tuning Time: {rf_time:.2f} seconds.")

### AdaBoost

In [None]:

ada_clf = AdaBoostClassifier()
ada_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_ada = ada_clf.predict(X_train_dim_reduct)
y_test_pred_ada = ada_clf.predict(X_test__dim_reduct)

print("AdaBoost:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_ada):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_ada):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_ada))

Hyperparameter tuning for AdaBoost

In [None]:


start_time = time.time()

ab_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1]
}
ab = AdaBoostClassifier(random_state=42)
ab_grid = GridSearchCV(ab, ab_param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
ab_grid.fit(X_train_dim_reduct, y_train)

print(f"Best parameters for AdaBoost: {ab_grid.best_params_}")
print(f"Best score for AdaBoost: {ab_grid.best_score_:.4f}")

print("AdaBoost Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_ada):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_ada):.4f}")
print(f"After Tuning CV Accuracy: {ab_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, ab_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy
ab_time = time.time() - start_time

print(f"AdaBoost Tuning Time: {ab_time:.2f} seconds.")


### XGBoost

In [None]:

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_dim_reduct, y_train)

y_train_pred_xgb = xgb_clf.predict(X_train_dim_reduct)
y_test_pred_xgb = xgb_clf.predict(X_test__dim_reduct)

print("XGBoost:")
print(f"Train accuracy: {accuracy_score(y_train, y_train_pred_xgb):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred_xgb):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_xgb))

Hyperparameter tuning for XGBoost

In [None]:

start_time = time.time()

xgb_grid = GridSearchCV(xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'), 
                        {'learning_rate': [0.1, 0.2], 'max_depth': [3, 6], 'n_estimators': [50, 100]}, 
                        cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
xgb_grid.fit(X_train_dim_reduct, y_train)
print(f"Best XGB Parameters: {xgb_grid.best_params_}")
print(f"Best XGB Score: {xgb_grid.best_score_:.4f}")

print("XGBoost Comparison:")
print(f"Baseline Train Accuracy: {accuracy_score(y_train, y_train_pred_xgb):.4f}, Test Accuracy: {accuracy_score(y_test, y_test_pred_xgb):.4f}")
print(f"After Tuning CV Accuracy: {xgb_grid.best_score_:.4f}, Test Accuracy: {accuracy_score(y_test, xgb_grid.best_estimator_.predict(X_test__dim_reduct)):.4f}")  # best_score_ is the mean cross-validated accuracy, not train accuracy

xgb_time = time.time() - start_time

print(f"XGBoost Tuning Time: {xgb_time:.2f} seconds.")

# Challenges Faced

Applying the models and performing hyperparameter tuning was complex because the dataset is large. Hyperparameter tuning had a very high computational cost, as shown in the results, with models taking up to 184 seconds to tune. This made it difficult to test and adjust the models and code, since every iteration takes significant time.

Another challenge is guarding against overfitting: many of the models reach very high accuracies, which might suggest that the data is biased or that the models are fitting the training set too closely.
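
One check that could help separate genuine overfitting from data issues is counting how many test documents also appear verbatim in the training set, since exact duplicates across the split would inflate test accuracy. The sketch below is a minimal illustration. The helper name `overlap_report` and the toy strings are hypothetical, and no claim is made here about whether the cancer dataset actually contains such duplicates.

```python
def overlap_report(train_texts, test_texts):
    """Count test documents whose exact text also occurs in the training set."""
    train_set = set(train_texts)
    shared = sum(1 for text in test_texts if text in train_set)
    fraction = shared / len(test_texts) if test_texts else 0.0
    return shared, fraction

# Made-up example strings for illustration only.
toy_train = ["a b c", "d e f", "g h i"]
toy_test = ["d e f", "x y z"]

shared, fraction = overlap_report(toy_train, toy_test)
print(f"{shared} of {len(toy_test)} test documents also appear in training ({fraction:.0%})")
```

In this notebook, such a check would need to run on the text Series returned by `train_test_split`, before `X_train` and `X_test` are overwritten by the TF-IDF matrices.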

In [None]:


baseline_accuracies = {
    'Logistic Regression': accuracy_score(y_test, y_test_pred_lr),
    'KNN': accuracy_score(y_test, y_test_pred_knn),
    'SVM': accuracy_score(y_test, y_test_pred_svm),
    'Decision Tree': accuracy_score(y_test, y_test_pred_dt),
    'Random Forest': accuracy_score(y_test, y_test_pred_rf),
    'AdaBoost': accuracy_score(y_test, y_test_pred_ada),
    'XGBoost': accuracy_score(y_test, y_test_pred_xgb)
}

tuned_accuracies = {
    'Logistic Regression': accuracy_score(y_test, lr_grid.best_estimator_.predict(X_test__dim_reduct)),
    'KNN': accuracy_score(y_test, knn_grid.best_estimator_.predict(X_test__dim_reduct)),
    'SVM': accuracy_score(y_test, svm_grid.best_estimator_.predict(X_test__dim_reduct)),
    'Decision Tree': accuracy_score(y_test, dt_grid.best_estimator_.predict(X_test__dim_reduct)),
    'Random Forest': accuracy_score(y_test, rf_grid.best_estimator_.predict(X_test__dim_reduct)),
    'AdaBoost': accuracy_score(y_test, ab_grid.best_estimator_.predict(X_test__dim_reduct)),
    'XGBoost': accuracy_score(y_test, xgb_grid.best_estimator_.predict(X_test__dim_reduct))
}

tuned_times = {
    'Logistic Regression': lr_time,
    'KNN': knn_time,
    'SVM': svm_time,
    'Decision Tree': dt_time,
    'Random Forest': rf_time,
    'AdaBoost': ab_time,
    'XGBoost': xgb_time
}

    

In [None]:
models = ["Logistic Regression", "KNN", "SVM", "Decision Tree", "Random Forest", "AdaBoost", "XGBoost"]
print(f"{'Model':<20} {'Baseline Acc.':<15} {'Tuned Acc.':<15} {'Tuning Time (s)':<15}")
for model in models:
    print(f"{model:<20} {baseline_accuracies[model]:<15.4f} {tuned_accuracies[model]:<15.4f} {tuned_times[model]:<15.2f}")

In [None]:
# Print comparison of baseline vs tuned accuracies
for model in baseline_accuracies.keys():
    print(f"{model}: Baseline Accuracy = {baseline_accuracies[model]:.4f}, Tuned Accuracy = {tuned_accuracies[model]:.4f}.")

# Identify the best performing model after tuning
best_model = max(tuned_accuracies, key=tuned_accuracies.get)
# Report only the best model's own figures (the loop variable `model` must not be reused here)
print(f"\nThe best performing model after hyperparameter tuning is {best_model} with an accuracy of {tuned_accuracies[best_model]:.4f}, Tuned time = {tuned_times[best_model]:.2f} seconds.")


## Model Performance Improvements

The results reveal significant improvements in accuracy for some models post-tuning, notably for **Logistic Regression** and **SVM**, where accuracy increased from 0.9348 to 0.9498 and from 0.9256 to 0.9599, respectively. These improvements underscore the value of hyperparameter tuning in optimizing model performance, particularly for models that are sensitive to parameter settings.

**Random Forest** and **AdaBoost** exhibited the most remarkable performance shifts. Random Forest's accuracy leaped from 0.8159 to 0.9982, while AdaBoost saw a modest increase from 0.6684 to 0.7107. This dramatic improvement for Random Forest suggests that the baseline configuration, which capped each tree at `max_leaf_nodes=10`, was far from optimal and that the model benefited significantly from tuning. For AdaBoost, the gain, although smaller, is a step towards better performance.

Conversely, **KNN**, **Decision Tree**, and **XGBoost** maintained the same high level of accuracy before and after tuning, which suggests two possibilities: either the default parameters were already near-optimal, or the range of hyperparameters explored during tuning did not significantly impact these models' performance.

## Computational Costs

The computational cost associated with tuning varied widely among the models. Notably, **AdaBoost** and **Random Forest** required the most time, with 184.71 and 152.09 seconds, respectively. This significant computational investment yielded substantial accuracy improvements for Random Forest, making the cost worthwhile in this scenario. However, for AdaBoost, the considerable tuning time did not result in a similarly dramatic performance increase, highlighting a less efficient use of computational resources.

**SVM** also showed a notable computational cost at 44.52 seconds but delivered a solid performance improvement, indicating a good balance between cost and benefit.

In contrast, **Logistic Regression**, despite its relatively low computational cost of 4.20 seconds, achieved a meaningful accuracy improvement, showcasing an efficient tuning process.
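
Part of the difference in tuning time comes simply from how many fits each grid requires. The sketch below counts the candidates for the parameter grids used in this notebook and multiplies by the 5 cross-validation folds. The helper `count_fits` is a hypothetical name. The totals exclude the single final refit on the full training set that `GridSearchCV` performs by default, and they say nothing about the cost of an individual fit, which also differs between models.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Return (number of parameter combinations, combinations x folds)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv

# Parameter grids copied from the tuning cells in this notebook.
grids = {
    "Logistic Regression": {"C": [0.1, 1, 10], "solver": ["liblinear"]},
    "KNN": {"n_neighbors": [3, 5, 7], "metric": ["euclidean", "manhattan"]},
    "SVM": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    "Decision Tree": {"max_depth": [None, 10, 20, 30], "min_samples_split": [2, 5, 10]},
    "Random Forest": {"n_estimators": [100, 200, 300], "max_depth": [None, 10, 20],
                      "min_samples_split": [2, 5, 10]},
    "AdaBoost": {"n_estimators": [50, 100, 200], "learning_rate": [0.01, 0.1, 1]},
    "XGBoost": {"learning_rate": [0.1, 0.2], "max_depth": [3, 6], "n_estimators": [50, 100]},
}

for name, grid in grids.items():
    candidates, fits = count_fits(grid, cv=5)
    print(f"{name:<20} {candidates:>3} candidates -> {fits:>4} cross-validation fits")
```

Random Forest has the largest grid (27 candidates, 135 fits), while Logistic Regression has the smallest (3 candidates, 15 fits).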

## Best Model Selection

Considering both the accuracy improvements and the computational costs, **Random Forest** stands out as the most improved model due to tuning, achieving the highest accuracy among all models post-tuning. This remarkable improvement demonstrates the model's potential when optimally tuned, despite the higher computational cost.

However, if computational efficiency is a priority, **Logistic Regression** offers a compelling alternative, with a significant accuracy improvement and minimal tuning time.
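
The cost-versus-benefit comparison can be made explicit by dividing each accuracy gain by its tuning time. The sketch below uses only the figures quoted in the text above. "Gain per second" is an illustrative ratio introduced here, not a standard metric, and it ignores that timings vary between runs and machines.

```python
# (baseline accuracy, tuned accuracy, tuning time in seconds), as quoted in the text above.
reported = {
    "Logistic Regression": (0.9348, 0.9498, 4.20),
    "SVM": (0.9256, 0.9599, 44.52),
    "Random Forest": (0.8159, 0.9982, 152.09),
    "AdaBoost": (0.6684, 0.7107, 184.71),
}

def gain_per_second(baseline, tuned, seconds):
    """Accuracy gained per second of tuning time."""
    return (tuned - baseline) / seconds

# Rank models from most to least accuracy gained per second of tuning.
ranked = sorted(reported, key=lambda name: gain_per_second(*reported[name]), reverse=True)
for name in ranked:
    baseline, tuned, seconds = reported[name]
    print(f"{name:<20} gain={tuned - baseline:.4f}  "
          f"per-second={gain_per_second(baseline, tuned, seconds):.6f}")
```

By this ratio, Logistic Regression ranks first and AdaBoost last, which matches the qualitative assessment above.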

## Future Directions

The results suggest several avenues for further exploration:

- **Exploring Further Hyperparameter Ranges**: For models like KNN, Decision Tree, and XGBoost, which showed no improvement with tuning, exploring a broader or different set of hyperparameters might uncover untapped performance potential.
- **Feature Engineering and Selection**: Improving model performance might also be achieved by refining the input features, suggesting a direction for further research and experimentation.





# Discussion and Conclusion


### Insights Gained and Potential Applications:
- The high accuracy achieved by KNN suggests that the dataset contains clear, distinguishable patterns that KNN can exploit effectively, making it highly suitable for applications requiring high precision and reliability.
- The insights gained from the model comparison and selection process can inform future projects, emphasizing the value of exploratory data analysis, feature engineering, and rigorous model evaluation.

