# Maciek Staniszewski - Homework

I would like to present my solutions to the classification homework task. Let's start by importing the necessary libraries:

In [63]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import zscore

## Loading Data

I will import the dataset as a data frame:

In [6]:
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns= np.append(cancer['feature_names'], ['target']))
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0.0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0.0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0.0


## Train/Test Split

I need to create a train/test split so that the models are not evaluated on the same data they were trained on.

### Features

I will extract features from the dataset and apply the z-score transformation to all of the numeric columns:

In [8]:
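# Separate the features from the target column
X = df.drop(['target'], axis=1)
numeric_cols = X.select_dtypes(include=[np.number]).columns
# Display the z-scored features. The result is not assigned back to X,
# so the later cells (and the outputs shown below) work on the unscaled values.
X[numeric_cols].apply(zscore)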

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.886690,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.511870,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,1.298575,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,1.901185,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,1.536720,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635
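
The cell above displays the z-scored frame without storing it, and it computes the means and standard deviations over all rows, including those that later end up in the test set. A minimal sketch of one alternative follows. It assumes `StandardScaler` as a stand-in for `scipy.stats.zscore` (both divide by the population standard deviation by default) and uses hypothetical variable names (`X_tr`, `X_te`, `scaler`) that are not part of the notebook:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data and split settings as in the notebook (30% test, random_state=123)
X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=123
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test rows

# Sanity check: every training column now has mean ~0 and std ~1
train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
print(np.abs(train_means).max(), train_stds.min(), train_stds.max())
```

The test columns will not have exactly zero mean or unit variance, which is expected: they are scaled with statistics they did not contribute to.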


### Target

Extraction of the target is easy:

In [10]:
y = df['target'].values

### Split

Lastly, the actual split: I hold out 30% of the rows for testing and fix `random_state` so that the split is reproducible:

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X.values,y, test_size=0.3, random_state=123)
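
A quick way to check what the split produced is sketched below. It reloads the data with `return_X_y=True` instead of going through `df`, and the names `X_all`, `y_all` and `same_split` are illustrative assumptions. With 569 rows and `test_size=0.3`, scikit-learn rounds the test set up to 171 rows, which leaves 398 for training:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=123
)
n_train, n_test = len(y_tr), len(y_te)

# Repeating the call with the same random_state gives the same rows again
_, _, _, y_te_again = train_test_split(
    X_all, y_all, test_size=0.3, random_state=123
)
same_split = np.array_equal(y_te, y_te_again)

# Number of class-0 and class-1 samples in the test set
test_class_counts = np.bincount(y_te)
print(n_train, n_test, same_split, test_class_counts)
```

If the class proportions in the two parts should match exactly, `train_test_split` also accepts `stratify=y_all`; the notebook does not use it, so doing so would change the split.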

## Feature Selection

I will use `sklearn` to select the top features for classification:

In [71]:
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(X_train, y_train)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
X_train_og = X_train.copy()
X_test_og = X_test.copy()
X_train = pd.DataFrame(X_train[:,cols_idxs], columns=df.columns[cols_idxs])
X_test = pd.DataFrame(X_test[:,cols_idxs], columns=df.columns[cols_idxs])
X_train.head()

Unnamed: 0,mean perimeter,mean concave points,worst radius,worst perimeter,worst concave points
0,74.52,0.04105,12.48,82.28,0.09653
1,88.06,0.01917,14.67,94.17,0.05802
2,111.6,0.06527,21.58,140.5,0.1984
3,88.44,0.01141,15.49,100.3,0.05104
4,126.2,0.09664,23.72,159.8,0.1872


I see that the 5 features returned by the `SelectKBest` method are:
* mean perimeter
* mean concave points
* worst radius
* worst perimeter
* worst concave points
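
To see why these five were chosen, the ANOVA F-scores that `f_classif` assigns to each feature can be ranked. The sketch below refits a selector on a freshly loaded copy of the data with the same split settings; `kbest`, `f_scores` and the other names are illustrative, not taken from the notebook:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=123
)

# Fit the selector on the training rows only, as in the notebook
kbest = SelectKBest(f_classif, k=5).fit(X_tr, y_tr)

# One F-score per feature; higher means better class separation by that feature alone
f_scores = pd.Series(kbest.scores_, index=data.feature_names).sort_values(ascending=False)
top5 = list(f_scores.index[:5])
selected = list(data.feature_names[kbest.get_support()])

print(f_scores.head(10))
```

Because `f_classif` scores each feature on its own, strongly correlated features (for example radius and perimeter) can all rank highly even though they carry similar information.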

## Simple Logistic Regression

I create the logistic regression model and compute its MSE and confusion matrix to serve as a benchmark:

In [67]:
logreg = LogisticRegression(random_state=16)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# Calculate the Mean Squared Error
mse_basic = mean_squared_error(y_test, y_pred)
print(f"Basic Logit MSE: {mse_basic}")
cnf_matrix = confusion_matrix(y_test, y_pred)  # imported directly; `metrics` itself is never imported
cnf_matrix

Basic Logit MSE: 0.06432748538011696


array([[61,  7],
       [ 4, 99]])

I see that the accuracy of the model is **93.57%** (160 of 171 test samples classified correctly).
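
The accuracy can be read straight off the confusion matrix, and for hard 0/1 predictions the MSE is simply the misclassification rate, because each wrong prediction contributes a squared error of 1 and each correct one contributes 0. A small worked check using the numbers printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix reported for logistic regression (rows: true class, columns: predicted class)
cnf = np.array([[61, 7],
                [4, 99]])

n_total = cnf.sum()           # all test samples
n_correct = np.trace(cnf)     # diagonal = correctly classified samples
accuracy = n_correct / n_total
error_rate = 1 - accuracy     # equals the MSE for 0/1 labels and hard predictions

print(f"accuracy = {accuracy:.4f}, error rate = {error_rate:.4f}")
```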

## AdaBoost Model

I'm using AdaBoost in the same way as in the laboratory exercises:

In [68]:
# Step 1: Create a weak classifier (a depth-1 decision tree, i.e. a stump)
base_classifier = DecisionTreeClassifier(max_depth=1)

# Step 2: Create an AdaBoost classifier
adaboost_classifier = AdaBoostClassifier(base_classifier, random_state=42)

# Step 3: Define hyperparameters for tuning
param_grid = {
    'n_estimators': [10, 20, 50, 100, 200, 300, 500],  # Number of weak classifiers
    'learning_rate': [0.01, 0.1, 0.5, 1.0, 5, 10]  # Learning rate
}

# Step 4: Create a GridSearchCV object for hyperparameter tuning
grid_search = GridSearchCV(estimator=adaboost_classifier, param_grid=param_grid, cv=5)

# Step 5: Train the model with cross-validation
grid_search.fit(X_train, y_train)

# Step 6: Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Step 7: Get the best model (refitted on the whole training set)
best_model = grid_search.best_estimator_

# Step 8: Make predictions on the test set using the best model
y_pred = best_model.predict(X_test)

# Step 9: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
# Calculate the Mean Squared Error
mse_basic = mean_squared_error(y_test, y_pred)
print(f"Basic ADA MSE: {mse_basic}")
# confusion_matrix is imported directly; `metrics` itself is never imported
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_matrix

Best Hyperparameters: {'learning_rate': 0.5, 'n_estimators': 20}
Accuracy: 0.9473684210526315
Basic ADA MSE: 0.05263157894736842


array([[63,  5],
       [ 4, 99]])

I see that AdaBoost performs better than logistic regression: the accuracy is **94.74%**. It correctly classified two more class-0 (malignant) samples that logistic regression had predicted as class 1 (benign), while the number of misclassified class-1 samples stays at 4.
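
The difference between the two models can be located per class by comparing the off-diagonal entries of the two confusion matrices. The sketch below uses the matrices printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes) together with the `load_breast_cancer` encoding of 0 for malignant and 1 for benign; `per_class_errors` is a hypothetical helper:

```python
import numpy as np

cnf_logreg = np.array([[61, 7],
                       [4, 99]])
cnf_ada = np.array([[63, 5],
                    [4, 99]])
class_names = ["malignant (0)", "benign (1)"]

def per_class_errors(cnf):
    """Samples of each true class that were predicted as the other class."""
    return cnf.sum(axis=1) - np.diag(cnf)

errors_logreg = per_class_errors(cnf_logreg)
errors_ada = per_class_errors(cnf_ada)
improvement = errors_logreg - errors_ada  # positive = fewer mistakes with AdaBoost

for name, before, after in zip(class_names, errors_logreg, errors_ada):
    print(f"{name}: {before} -> {after} misclassified")
```

Whether the two recovered samples count as false positives or false negatives depends on which class is treated as positive, so naming the class explicitly avoids the ambiguity.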

## GBM Model

In [69]:
# Initialize and fit a GradientBoostingRegressor
gbm_basic = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbm_basic.fit(X_train, y_train)

# Make predictions on the test set
y_pred_basic = gbm_basic.predict(X_test)

# Calculate the Mean Squared Error
mse_basic = mean_squared_error(y_test, y_pred_basic)
print(f"Basic GBM MSE: {mse_basic}")

Basic GBM MSE: 0.04398605736480542
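
`GradientBoostingRegressor` returns continuous scores rather than class labels, which is why `confusion_matrix` cannot be applied to its output directly. Two possible ways to obtain labels are sketched below: thresholding the regressor's output (the 0.5 cut-off is my assumption) or switching to `GradientBoostingClassifier`. For brevity the sketch trains on all 30 features of a freshly loaded copy of the data, not on the five selected ones, so its numbers are not comparable with the cells above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=123
)

# Option A: keep the regressor and turn its scores into labels with a 0.5 threshold
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbr.fit(X_tr, y_tr)
labels_from_regressor = (gbr.predict(X_te) >= 0.5).astype(int)
cnf_regressor = confusion_matrix(y_te, labels_from_regressor)

# Option B: use the classifier variant, which predicts class labels itself
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gbc.fit(X_tr, y_tr)
labels_from_classifier = gbc.predict(X_te)
cnf_classifier = confusion_matrix(y_te, labels_from_classifier)
acc_classifier = accuracy_score(y_te, labels_from_classifier)

print(cnf_regressor)
print(cnf_classifier, acc_classifier)
```

Option B also puts the GBM on the same footing as the other two models, since accuracy and MSE are then computed on hard labels in all three cases.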


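Because `GradientBoostingRegressor` returns continuous predictions rather than class labels, I cannot create a confusion matrix for the GBM model directly. By MSE it performed the best of all three models, although its MSE is computed on continuous predictions while the other two are computed on hard 0/1 labels, so the comparison is only indicative. As GBM has a feature selection mechanism built in, I want to see if it outperforms `SelectKBest`: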

In [72]:
gbm_basic.fit(X_train_og, y_train)

# Make predictions on the test set
y_pred_basic = gbm_basic.predict(X_test_og)

# Calculate the Mean Squared Error
mse_basic = mean_squared_error(y_test, y_pred_basic)
print(f"Basic GBM MSE: {mse_basic}")

Basic GBM MSE: 0.026687219989606607


Indeed! GBM trained on the original features (`X_train_og`, evaluated on `X_test_og`) performs much better: the MSE drops from 0.0440 to 0.0267. The built-in feature selection mechanism outperforms the top 5 features selected by `SelectKBest`.
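
What is called feature selection here is, more precisely, the way the boosted trees weight features: a feature that is rarely used for splits ends up with an importance close to zero. A minimal sketch for inspecting those weights through the `feature_importances_` attribute follows; it refits the same regressor settings on a freshly loaded copy of the data, and `gbm_all` and `importances` are illustrative names:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=123
)

# Same hyperparameters as gbm_basic in the notebook, trained on all 30 features
gbm_all = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbm_all.fit(X_tr, y_tr)

# Impurity-based importances, normalised to sum to 1
importances = pd.Series(
    gbm_all.feature_importances_, index=data.feature_names
).sort_values(ascending=False)

print(importances.head(10))
```

Comparing the top of this ranking with the five `SelectKBest` features would show whether the extra accuracy comes from features that the univariate F-test ranked lower.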

## Results
The results are as expected given the strong predictive power of the GBM model and its robustness to outliers. The feature selection mechanism built into GBM suggests using more than the five features chosen by `SelectKBest`. Logistic regression is the simplest model and it performed the worst of the three tested models. The grid search for AdaBoost settled on only 20 weak classifiers with a learning rate of 0.5, which is fairly surprising. AdaBoost also took the longest to compute.