%load_ext watermark
%watermark -a "Chibuzor Enyioko" -d -v -p numpy,pandas,matplotlib,seaborn,sklearn

# Project 2: Supervised Classification

This project uses Python packages to apply several supervised classification methods to a breast cancer dataset and a diabetes dataset.

## Part 1: Breast Cancer Dataset
### Importing the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns

# importing data sets
bc_training_data = pd.read_csv("cancer_training.csv")
bc_test_data = pd.read_csv("cancer_testing.csv")


### Problems
1. Identify which column(s) in the training and test sets have missing values, and identify the corresponding row IDs.
“Impute” them with “Average/Most Frequent” values.

In [2]:
from sklearn.impute import SimpleImputer

bc_training_data.replace("?", np.nan, inplace=True)
bc_test_data.replace("?", np.nan, inplace=True)

# Identifying columns with missing values
# training set

missing_val_columns = bc_training_data.isnull().sum()
print(f"In the training set, the missing columns are:\n{missing_val_columns[missing_val_columns > 0]}")

for col in missing_val_columns[missing_val_columns > 0].index:
    missing_rows = bc_training_data[bc_training_data[col].isnull()]
    print(f"Column '{col}' has missing values")
    row_id = missing_rows['id'].tolist()
    print(f"Row ID's with a missing value in column '{col}': {row_id}")

# test set

missing_val_columns_test = bc_test_data.isnull().sum()
print(f"\nIn the test set, the missing columns are:\n{missing_val_columns_test[missing_val_columns_test > 0]}")
for col in missing_val_columns_test[missing_val_columns_test > 0].index:
    missing_rows_test = bc_test_data[bc_test_data[col].isnull()]
    print(f"Column '{col}' has missing values")
    row_id = missing_rows_test['id'].tolist()
    print(f"Row ID's with a missing value in column '{col}': {row_id}")


# Imputing missing values with the most frequent value
# (the imputer is fit on the training set only, then applied to both sets)

pd.set_option('future.no_silent_downcasting', True)
imputer = SimpleImputer(strategy='most_frequent')
train_imputed = pd.DataFrame(imputer.fit_transform(bc_training_data), columns=bc_training_data.columns)
test_imputed = pd.DataFrame(imputer.transform(bc_test_data), columns=bc_test_data.columns)





In the training set, the missing columns are:
node-caps      3
breast-quad    1
dtype: int64
Column 'node-caps' has missing values
Row ID's with a missing value in column 'node-caps': [7, 64, 179]
Column 'breast-quad' has missing values
Row ID's with a missing value in column 'breast-quad': [155]

In the test set, the missing columns are:
node-caps    5
dtype: int64
Column 'node-caps' has missing values
Row ID's with a missing value in column 'node-caps': [21, 32, 51, 55, 72]
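
The most-frequent imputation used above can be checked on a small example. This is a minimal sketch on made-up data: the `toy_*` frames and their values are hypothetical stand-ins, not rows from the cancer files. It fits the imputer on the training frame only and confirms that no missing values remain in either frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the real train/test data.
toy_train = pd.DataFrame(
    {
        "node-caps": ["no", "no", np.nan, "yes"],
        "breast-quad": ["left_low", np.nan, "left_low", "right_up"],
    },
    dtype=object,
)
toy_test = pd.DataFrame(
    {
        "node-caps": [np.nan, "yes"],
        "breast-quad": ["left_up", "left_low"],
    },
    dtype=object,
)

# Learn the most frequent value per column from the training frame only.
toy_imputer = SimpleImputer(strategy="most_frequent")
toy_train_imputed = pd.DataFrame(
    toy_imputer.fit_transform(toy_train), columns=toy_train.columns
)
# Reuse the training-set modes to fill the test frame.
toy_test_imputed = pd.DataFrame(
    toy_imputer.transform(toy_test), columns=toy_test.columns
)

# Count any missing values left after imputation.
remaining_missing = int(
    toy_train_imputed.isnull().sum().sum() + toy_test_imputed.isnull().sum().sum()
)
print(remaining_missing)
```

The same `isnull().sum().sum()` check could be run on `train_imputed` and `test_imputed` to confirm the real imputation left no gaps.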


2. Calculate accuracy using each of these classifiers (up to 3 decimal places):

3. Now tweak the parameters of the above models. What is the best result you can get? Write the answer and upload the workbook as proof. Name this classifier widget “<classifier>-best”. Example (if the tree widget is the best performer)

#### Logistic Regression

In [3]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, classification_report

X_train = train_imputed.drop(columns=['id', 'class'])
y_train = train_imputed['class']
X_test = test_imputed.drop(columns=['id', 'class'])
y_test = test_imputed['class']

# Columns to one-hot encode, identified from the training set
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Create a column transformer to one-hot encode categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'  # Keep other columns as they are
)

# Fit the encoder on the training data, then transform both sets
X_train_encoded = preprocessor.fit_transform(X_train)
X_test_encoded = preprocessor.transform(X_test)


clf = LogisticRegression(penalty='l2', C=0.5, max_iter=1000)
clf.fit(X_train_encoded, y_train)

# Metrics

y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred, digits=3))


                      precision    recall  f1-score   support

no-recurrence-events      0.783     0.871     0.824        62
   recurrence-events      0.529     0.375     0.439        24

            accuracy                          0.733        86
           macro avg      0.656     0.623     0.632        86
        weighted avg      0.712     0.733     0.717        86
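
The `accuracy` row of the report is the figure the problem asks for. A minimal sketch of computing it on its own with `accuracy_score`, rounded to 3 decimal places, is shown here on made-up labels. The `*_demo` lists and the `accuracy_3dp` helper are illustrative names, not part of the notebook.

```python
from sklearn.metrics import accuracy_score


def accuracy_3dp(y_true, y_pred):
    """Return classification accuracy rounded to 3 decimal places."""
    return round(accuracy_score(y_true, y_pred), 3)


# Made-up labels: 5 of the 7 predictions match the true label.
y_true_demo = ["a", "a", "b", "b", "a", "b", "a"]
y_pred_demo = ["a", "b", "b", "b", "a", "a", "a"]

acc_demo = accuracy_3dp(y_true_demo, y_pred_demo)
print(acc_demo)  # 5/7 rounded to 3 decimal places: 0.714
```

In the notebook, the equivalent call would be `accuracy_3dp(y_test, y_pred)`.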



In [4]:
# tweaking parameters for Logistic Regression
clf_best = LogisticRegression(penalty='l1', solver='saga', C=7, max_iter=1000)
clf_best.fit(X_train_encoded, y_train)
y_pred_best = clf_best.predict(X_test_encoded)
print(classification_report(y_test, y_pred_best, digits=3))

                      precision    recall  f1-score   support

no-recurrence-events      0.825     0.839     0.832        62
   recurrence-events      0.565     0.542     0.553        24

            accuracy                          0.756        86
           macro avg      0.695     0.690     0.693        86
        weighted avg      0.753     0.756     0.754        86
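
The parameter values in this cell were picked by hand. One common alternative is a cross-validated grid search on the training data, sketched below. This is not the method used in this notebook: the data is synthetic (`make_classification`), and the grid values are assumptions chosen for illustration. Other parameters, such as the penalty type, could be added to the grid in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: regularization strength and class weighting.
param_grid_demo = {
    "C": [0.1, 0.5, 1.0, 7.0],
    "class_weight": [None, "balanced"],
}

search_demo = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid_demo,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search_demo.fit(X_demo, y_demo)

best_params_demo = search_demo.best_params_
best_cv_accuracy_demo = round(search_demo.best_score_, 3)
print(best_params_demo, best_cv_accuracy_demo)
```

Selecting parameters this way keeps the test set out of the tuning loop, so the final test accuracy remains an estimate on unseen data.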



#### Naive Bayes

In [5]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

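# GaussianNB does not accept sparse input, so the one-hot matrices are converted to dense arrays
gnb = GaussianNB()
gnb.fit(X_train_encoded.toarray(), y_train)
y_pred_gnb = gnb.predict(X_test_encoded.toarray())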
print(classification_report(y_test, y_pred_gnb, digits=3))

                      precision    recall  f1-score   support

no-recurrence-events      0.889     0.258     0.400        62
   recurrence-events      0.324     0.917     0.478        24

            accuracy                          0.442        86
           macro avg      0.606     0.587     0.439        86
        weighted avg      0.731     0.442     0.422        86



#### SVM

In [6]:
# SVM
from sklearn import svm

svm_clf = svm.SVC(C=1.0, kernel='rbf', gamma='auto', max_iter=100)
svm_clf.fit(X_train_encoded, y_train)
y_pred_svm = svm_clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred_svm, zero_division=0, digits=3))


                      precision    recall  f1-score   support

no-recurrence-events      0.726     0.984     0.836        62
   recurrence-events      0.500     0.042     0.077        24

            accuracy                          0.721        86
           macro avg      0.613     0.513     0.456        86
        weighted avg      0.663     0.721     0.624        86





In [13]:

# tweaking parameters for SVM
svm_clf_best = svm.SVC(C=1.1, kernel='poly', gamma='scale', max_iter=-1)
svm_clf_best.fit(X_train_encoded, y_train)
y_pred_svm_best = svm_clf_best.predict(X_test_encoded)
print(classification_report(y_test, y_pred_svm_best, zero_division=0, digits=3))


                      precision    recall  f1-score   support

no-recurrence-events      0.846     0.887     0.866        62
   recurrence-events      0.667     0.583     0.622        24

            accuracy                          0.802        86
           macro avg      0.756     0.735     0.744        86
        weighted avg      0.796     0.802     0.798        86
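
The per-class precision and recall in these reports come from the counts in a confusion matrix. A minimal sketch with `confusion_matrix` on made-up labels (the `*_cm` lists are hypothetical, not the notebook's predictions) shows how the rows and columns are laid out:

```python
from sklearn.metrics import confusion_matrix

class_labels = ["no-recurrence-events", "recurrence-events"]

# Made-up true and predicted labels: 4 negatives and 2 positives.
y_true_cm = [
    "no-recurrence-events", "no-recurrence-events",
    "no-recurrence-events", "no-recurrence-events",
    "recurrence-events", "recurrence-events",
]
y_pred_cm = [
    "no-recurrence-events", "no-recurrence-events",
    "no-recurrence-events", "recurrence-events",
    "recurrence-events", "no-recurrence-events",
]

# Rows are true classes, columns are predicted classes, in `class_labels` order.
cm_demo = confusion_matrix(y_true_cm, y_pred_cm, labels=class_labels)
print(cm_demo)

# Recall for the recurrence class: correct recurrence predictions / true recurrences.
recall_recurrence = cm_demo[1, 1] / cm_demo[1].sum()
```

In the notebook, `confusion_matrix(y_test, y_pred_svm_best)` would give the corresponding counts for the tuned SVM.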



#### Random Forest

In [8]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=10, min_samples_split=5)
rf_clf.fit(X_train_encoded, y_train)
y_pred_rf = rf_clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred_rf, digits=3))

                      precision    recall  f1-score   support

no-recurrence-events      0.803     0.790     0.797        62
   recurrence-events      0.480     0.500     0.490        24

            accuracy                          0.709        86
           macro avg      0.642     0.645     0.643        86
        weighted avg      0.713     0.709     0.711        86



In [9]:
# tweaking random forest parameters
rf_clf_best = RandomForestClassifier(n_estimators=50, min_samples_split=10)
rf_clf_best.fit(X_train_encoded, y_train)
y_pred_rf_best = rf_clf_best.predict(X_test_encoded)
print(classification_report(y_test, y_pred_rf_best, digits=3))

                      precision    recall  f1-score   support

no-recurrence-events      0.803     0.855     0.828        62
   recurrence-events      0.550     0.458     0.500        24

            accuracy                          0.744        86
           macro avg      0.677     0.657     0.664        86
        weighted avg      0.732     0.744     0.737        86
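
Neither forest above sets `random_state`, so the reported accuracies may change from run to run. A minimal sketch, on synthetic data and with an arbitrary seed of 42 (both assumptions), of how fixing the seed makes two fits agree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded training data.
X_rf_demo, y_rf_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Two forests with identical settings and the same seed.
rf_a = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=42)
rf_b = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=42)
rf_a.fit(X_rf_demo, y_rf_demo)
rf_b.fit(X_rf_demo, y_rf_demo)

# With a fixed seed, both forests produce the same predictions.
same_predictions = bool((rf_a.predict(X_rf_demo) == rf_b.predict(X_rf_demo)).all())
print(same_predictions)
```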



#### k-Nearest Neighbors

In [10]:
# k-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='uniform')
knn_clf.fit(X_train_encoded, y_train)
y_pred_knn = knn_clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred_knn, digits=3))


                      precision    recall  f1-score   support

no-recurrence-events      0.803     0.790     0.797        62
   recurrence-events      0.480     0.500     0.490        24

            accuracy                          0.709        86
           macro avg      0.642     0.645     0.643        86
        weighted avg      0.713     0.709     0.711        86
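
No tuned variant is reported for k-Nearest Neighbors. One way `n_neighbors` could be explored is a cross-validated sweep, sketched here on synthetic data. The candidate values of k and the `*_knn_demo` names are assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded training data.
X_knn_demo, y_knn_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k.
cv_accuracy_by_k = {}
for k in (1, 3, 5, 7, 9, 11):
    knn_demo = KNeighborsClassifier(n_neighbors=k, metric="euclidean", weights="uniform")
    scores = cross_val_score(knn_demo, X_knn_demo, y_knn_demo, cv=5, scoring="accuracy")
    cv_accuracy_by_k[k] = round(scores.mean(), 3)

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_accuracy_by_k, key=cv_accuracy_by_k.get)
print(cv_accuracy_by_k, best_k)
```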



#### Tree

In [11]:
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_split=5, min_samples_leaf=2)
tree_clf.fit(X_train_encoded, y_train)
y_pred_tree = tree_clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred_tree, digits=3))

                      precision    recall  f1-score   support

no-recurrence-events      0.806     0.871     0.837        62
   recurrence-events      0.579     0.458     0.512        24

            accuracy                          0.756        86
           macro avg      0.692     0.665     0.674        86
        weighted avg      0.743     0.756     0.746        86



### Final Result

After parameter tuning, the best-performing model was the **SVM model** (polynomial kernel, `C=1.1`, `gamma='scale'`), with a classification accuracy (CA) of **0.802**.
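
The accuracies reported in the cells above can be collected into one table for comparison. The sketch below copies the values from the printed reports. Classifiers with no tuned run are left as missing, and the column names are illustrative.

```python
import pandas as pd

# Accuracy values copied from the classification reports above.
results = pd.DataFrame(
    [
        ("Logistic Regression", 0.733, 0.756),
        ("Naive Bayes", 0.442, None),
        ("SVM", 0.721, 0.802),
        ("Random Forest", 0.709, 0.744),
        ("k-Nearest Neighbors", 0.709, None),
        ("Tree", 0.756, None),
    ],
    columns=["classifier", "baseline_accuracy", "tuned_accuracy"],
)

# Best accuracy per classifier, ignoring missing tuned runs.
results["best_accuracy"] = results[["baseline_accuracy", "tuned_accuracy"]].max(axis=1)
ranked = results.sort_values("best_accuracy", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```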