%load_ext watermark
%watermark -a "Chibuzor Enyioko" -d -v -p numpy,pandas,matplotlib,seaborn,sklearn

# Project 2: Supervised Classification

This project uses Python packages to apply several supervised classification methods to a breast cancer dataset and a diabetes dataset.

## Part 1: Breast Cancer Dataset
### Importing the Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns

# importing data sets
bc_training_data = pd.read_csv("cancer_training.csv")
bc_test_data = pd.read_csv("cancer_testing.csv")


### Problems
1. Identify which column(s), in both the training and test sets, have missing values, and list the corresponding row ids.
"Impute" them with the average or most frequent value.

In [10]:
from sklearn.impute import SimpleImputer

# Missing values are encoded as "?"; convert them to NaN first so that
# they can be detected and imputed
train_raw = bc_training_data.replace("?", np.nan)
test_raw = bc_test_data.replace("?", np.nan)

def report_missing(df, name):
    """Print the columns with missing values and the ids of the affected rows."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]
    print(f"In the {name} set, the missing columns are:\n{counts}")
    for col in counts.index:
        row_id = df.loc[df[col].isnull(), 'id'].tolist()
        print(f"Column '{col}' has missing values")
        print(f"Row ID's with a missing value in column '{col}': {row_id}")

# Identifying columns with missing values
report_missing(train_raw, "training")
print()
report_missing(test_raw, "test")

# Imputing missing values with the most frequent value
# (fit on the training set only, then applied to both sets)
imputer = SimpleImputer(strategy='most_frequent')
train_imputed = pd.DataFrame(imputer.fit_transform(train_raw), columns=train_raw.columns)
test_imputed = pd.DataFrame(imputer.transform(test_raw), columns=test_raw.columns)




In the training set, the missing columns are:
node-caps      3
breast-quad    1
dtype: int64
Column 'node-caps' has missing values
Row ID's with a missing value in column 'node-caps': [7, 64, 179]
Column 'breast-quad' has missing values
Row ID's with a missing value in column 'breast-quad': [155]

In the test set, the missing columns are:
Series([], dtype: int64)
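The order of the two steps matters: `SimpleImputer` only fills values it recognises as missing (NaN by default), so the "?" placeholders have to be converted before imputation. The sketch below illustrates this on a small made-up frame; the `toy` data and its values are hypothetical and only mimic the column names reported above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; "?" marks a missing value, as in the cancer files
toy = pd.DataFrame({
    "node-caps": ["no", "yes", "?", "no"],
    "breast-quad": ["left_low", "?", "left_low", "right_up"],
})

# Without this step, "?" is treated as an ordinary category and is never imputed
toy_nan = toy.replace("?", np.nan)

# Fill each column's NaN entries with that column's most frequent value
imputer = SimpleImputer(strategy="most_frequent")
toy_filled = pd.DataFrame(imputer.fit_transform(toy_nan), columns=toy_nan.columns)
print(toy_filled)
```

In this toy frame the missing `node-caps` entry becomes "no" and the missing `breast-quad` entry becomes "left_low", since those are the most common values in their columns.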




2. Calculate accuracy using each of these classifiers (up to 3 decimal places):

In [13]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, classification_report

X_train = train_imputed.drop(columns=['id', 'class'])
y_train = train_imputed['class']
X_test = test_imputed.drop(columns=['id', 'class'])
y_test = test_imputed['class']

# Select the categorical columns from the training features only
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Create a column transformer to one-hot encode categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'  # Keep other columns as they are
)

# Fit the encoder on the training data, then apply it to both sets
X_train_encoded = preprocessor.fit_transform(X_train)
X_test_encoded = preprocessor.transform(X_test)


clf = LogisticRegression(penalty='l2', C=0.5, max_iter=1000)
clf.fit(X_train_encoded, y_train)

# Metrics

y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))


                      precision    recall  f1-score   support

no-recurrence-events       0.79      0.87      0.83        62
   recurrence-events       0.56      0.42      0.48        24

            accuracy                           0.74        86
           macro avg       0.67      0.64      0.65        86
        weighted avg       0.73      0.74      0.73        86
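The classification report rounds to two decimal places, while the problem asks for accuracy to three. One way to get that, sketched here under the assumption that `accuracy_score` from `sklearn.metrics` is acceptable for the task, is to compute accuracy directly and format it. The labels below are hypothetical stand-ins for `y_test` and `y_pred`, not the project's data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_test and y_pred
y_true_demo = ["no-recurrence-events"] * 5 + ["recurrence-events"] * 2
y_pred_demo = ["no-recurrence-events"] * 4 + ["recurrence-events"] * 3

# Fraction of predictions that match the true labels (6 of 7 here)
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(f"Accuracy: {accuracy:.3f}")  # prints "Accuracy: 0.857"
```

In the notebook, the same call would take `y_test` and `y_pred` from the cell above.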

