# Project: Breast Cancer Prediction

## Feature Selection & Hyperparameter Tuning

**Introduction:**

This project focuses on predictive modeling and model optimization using the Breast Cancer dataset (`data_breast_cancer.csv`). The goal is to build and compare several machine-learning classification models, evaluate their performance, and improve accuracy through hyperparameter tuning.

**Models**
- Logistic Regression
- Decision Tree
- Random Forest
- SVM
- K-Nearest Neighbors (KNN)

**Hyperparameter Tuning**

- Apply Random Search & Grid Search to optimize each model.
- Compare the tuned models against baseline performance.
- Identify the best-performing classifier for breast cancer prediction.


In [None]:
__author__ = "Bing Huang"
__email__ = "Binghuang1990@gmail.com"

In [231]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


## 0. Load Data

In [232]:
df = pd.read_csv("data_breast_cancer.csv", index_col=0)
df.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [233]:
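# Reload without index_col so that id is a regular column; it is dropped in the next cell,
# because an identifier used as a feature would cause problems when making predictions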
df = pd.read_csv("data_breast_cancer.csv")
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [234]:
df = df.drop(columns="id")
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 1. Preprocess the data

1.1 Split dataset

1.2 Define transformer pipelines

- Define Categorical Transformer Pipeline
   - Missing values: constant imputer to fill in missing data
   - Dummy variables: one-hot encoder to convert categorical variables into dummy variables

- Define Numeric Transformer Pipeline
   - Missing values: K-nearest neighbors imputer to fill in missing values
   - Standard scaler to scale the numeric features

All 30 features in this dataset are numeric, so the code in this section defines only the numeric pipeline, and only its scaling step.
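
The two transformer pipelines listed above can be sketched as follows. This is a minimal illustration on a tiny made-up DataFrame: the column names `radius`, `texture`, `site` and all values are hypothetical and are not taken from the breast cancer data. It assumes scikit-learn's `SimpleImputer`, `OneHotEncoder`, `KNNImputer` and `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in numeric and categorical columns
toy = pd.DataFrame({
    "radius": [14.0, np.nan, 20.5, 12.3],
    "texture": [10.4, 17.8, np.nan, 20.4],
    "site": ["left", "right", np.nan, "left"],
})

# Categorical: fill missing values with a constant, then one-hot encode
cat_pipe = Pipeline(steps=[
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="UNKNOWN")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric: fill missing values from the nearest neighbors, then standardize
num_pipe = Pipeline(steps=[
    ("knn_imputer", KNNImputer(n_neighbors=2)),
    ("scaler", StandardScaler()),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_pipe, ["radius", "texture"]),
        ("cat", cat_pipe, ["site"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_transformed = toy_preprocessor.fit_transform(toy)
print(toy_transformed.shape)  # 2 numeric columns + 3 one-hot columns
```

Wrapping both branches in one `ColumnTransformer` means the imputers and the scaler are fitted on the training data only when the whole pipeline is fitted, which avoids leaking test-set statistics.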

In [235]:
df.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [236]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

In [237]:
# Split dataset

X = df.drop(columns = "diagnosis")
y = df.diagnosis

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, stratify= y, random_state = 42)

X_train.shape,  y_train.shape

((455, 30), (455,))

In [238]:
# define pipelines

num_features = X_train.columns.to_list()

num_transformer = Pipeline(steps = [('scaler', StandardScaler().set_output(transform="pandas"))
                                    ])

preprocessor = ColumnTransformer(transformers=[("num", num_transformer,
                                                       num_features)]).set_output(transform= "pandas")

preprocessor

## 2. Baseline model: Logistic Regression

#### 2.1 Build the model

In [239]:
lr_model = LogisticRegression(random_state=42, solver='liblinear')

pipeline_lr = Pipeline(steps=[
                              ("pre_process", preprocessor),
                              ("model", lr_model)
                              ])
pipeline_lr

#### 2.2 Train and Test model

In [240]:
from sklearn.metrics import r2_score, mean_squared_error

In [241]:
# train model
pipeline_lr.fit(X_train, y_train)
# make prediction 
y_pred_lr = pipeline_lr.predict(X_test)

In [242]:
y_pred_lr

array(['B', 'M', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B',
       'B', 'M', 'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M',
       'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'B',
       'B', 'M', 'M', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M',
       'B', 'M', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M'], dtype=object)

In [243]:
type(y_test)

pandas.core.series.Series

#### 2.3 Check all metrics
- **Classification problems**
   1. Confusion matrix
   2. classification_report
      - Precision: the ratio of true positives to the sum of true and false positives for each class.
      - Recall: the ratio of true positives to the sum of true positives and false negatives for each class.
      - F1-score: the harmonic mean of precision and recall.
- **Regression problems**
   1. MAE (mean absolute error): the average of the absolute differences between predicted and actual values.
   2. MSE (mean squared error): the average of the squared differences between predicted and actual values. It penalizes larger errors more than smaller ones, which makes it sensitive to outliers.
   3. RMSE (root mean squared error): the square root of MSE, which puts the error in the same units as the original values.
   4. R-squared: the proportion of variance in the dependent variable that is predictable from the independent variables.

When to Use Each Metric:

- MSE/RMSE: Penalize larger errors more; use them when large errors are especially bad.
- MAE: Use when you want to treat all errors equally without over-penalizing outliers.
- R-squared: Use to understand the proportion of variance in the target variable explained by the model.

While R-squared gives some insight into how well the model explains the variance in the data, it does not directly measure predictive capability, especially on unseen data or for non-linear relationships. To fully assess predictive performance, complement R-squared with RMSE, MAE, and cross-validation.
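
As a small worked example of the regression metrics above, the sketch below computes MAE, MSE, RMSE and R-squared by hand. The four actual and predicted values are made up purely for illustration and have nothing to do with the breast cancer data.

```python
import math

# Hypothetical actual and predicted values for a regression target
y_true = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_hat)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # same units as the target

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

The single error of 2 accounts for half of the total absolute error but two thirds of the total squared error, which is what "penalizes larger errors more" means in practice.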

In [244]:
# 1. check confusion_matrix

class_labels = pipeline_lr.named_steps['model'].classes_
# print(confusion_matrix(y_true = y_test, y_pred=y_pred))
pd.DataFrame(confusion_matrix(y_true = y_test, y_pred=y_pred_lr),
             columns = class_labels,
             index = class_labels)

Unnamed: 0,B,M
B,71,1
M,2,40
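
To connect this confusion matrix to the precision, recall and F1 definitions, the sketch below recomputes them by hand from the four counts, treating "M" (malignant) as the positive class. The variable names are illustrative.

```python
# Counts read from the logistic regression confusion matrix, with "M" as positive
tp = 40  # actual M, predicted M
fn = 2   # actual M, predicted B
fp = 1   # actual B, predicted M
tn = 71  # actual B, predicted B

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these come out as 0.98, 0.95, 0.96 and 0.97. In a screening context, recall for "M" is often the number to watch, since a false negative is a missed malignant case.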


In [245]:
# 2. classification_report
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

           B       0.97      0.99      0.98        72
           M       0.98      0.95      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [277]:
score_LR = round(accuracy_score(y_test, y_pred_lr) * 100, 2)

Insight: The baseline model already predicts well: 97% test accuracy, with 2 of 42 malignant cases missed and 1 of 72 benign cases misclassified.

## 3. Decision Tree

 - Build DT model (no scaling needed)
 - Train model
 - Make predictions
 - Check metrics
 - Tune the model
    - Cross-validation
    - Check metrics



In [247]:
from sklearn.tree import DecisionTreeClassifier

In [248]:
# num_features = X_train.select_dtypes(include=['int', 'float']).columns.tolist()
# cat_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# cat_transformer = Pipeline(steps=[("cat_imputer", SimpleImputer(strategy='constant', fill_value='UNKNOWN').set_output(transform="pandas")),
#                                   ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore").set_output(transform="pandas"))
#                                   ])

# num_transformer = Pipeline(steps=[("knn_imputer", KNNImputer(n_neighbors=5).set_output(transform="pandas")),
#                                   ("scaler", StandardScaler().set_output(transform="pandas"))
#                                   ])

# preprocessor_new = ColumnTransformer(transformers=[
#                                                ("num", num_transformer, / "num", 'passthrough',
#                                                        num_features),
#                                                ("cat", cat_transformer,
#                                                        cat_features)
#                                                ]).set_output(transform="pandas")

# dtree = DecisionTreeClassifier(random_state=42)

# pipeline_dtree = Pipeline([("pre_process", preprocessor_new),
#                            ("model", dtree)])

# pipeline_dtree.fit(X_train, y_train)

# y_pred = pipeline_dtree.predict(X_test)

In [249]:
# Decision trees do not need scaled features
# No preprocessing here: the data has no missing values and no categorical features to one-hot encode

dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)

y_pred_dt = dtree.predict(X_test)

In [250]:
dtree

In [251]:
# Check cross-validation F1 score
# Note: with string labels "M"/"B", scoring="f1" does not work directly; the positive label must be specified explicitly
# Alternatively, look at accuracy
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score

# Perform cross-validation
f1_scorer = make_scorer(f1_score, pos_label='M')
cv_scores = cross_val_score(dtree, X_train, y_train, cv=5, scoring=f1_scorer)  # for f1 score you need to identify which label treated as positive

print(f"Cross-validation F1 scores: {cv_scores}")
print(f"Mean cross-validation F1 score: {cv_scores.mean()}")

Cross-validation F1 scores: [0.95522388 0.94285714 0.875      0.91428571 0.85714286]
Mean cross-validation F1 score: 0.9089019189765457


In [252]:
cv_scores = cross_val_score(dtree, X_train, y_train, cv=5, scoring='accuracy')  # accuracy needs no positive label

print(f"Cross-validation accuracy scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

Cross-validation accuracy scores: [0.96703297 0.95604396 0.91208791 0.93406593 0.89010989]
Mean cross-validation accuracy: 0.9318681318681318


In [253]:
# check confusion matrix
# check classification report

cm = confusion_matrix(y_test, y_pred_dt, labels = ["B", "M"])
cm

array([[68,  4],
       [ 4, 38]], dtype=int64)

In [254]:
cm1 = confusion_matrix(y_test, y_pred_dt, labels = ["M", "B"])
cm1

array([[38,  4],
       [ 4, 68]], dtype=int64)

In [255]:
pd.DataFrame(cm,
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,68,4
Actual M,4,38


In [256]:
print(classification_report(y_test, y_pred_dt))

              precision    recall  f1-score   support

           B       0.94      0.94      0.94        72
           M       0.90      0.90      0.90        42

    accuracy                           0.93       114
   macro avg       0.92      0.92      0.92       114
weighted avg       0.93      0.93      0.93       114



In [274]:
score_dt = round(accuracy_score(y_test, y_pred_dt) * 100, 2)

#### Tuning the decision tree
- criterion:
  - {“gini”, “entropy”}, default=”gini”
  - The function used to measure the quality of a split.
- max_depth:
  - Controls the maximum depth to which the decision tree is allowed to grow.
  - A deeper tree can capture more complex patterns in the training data, which may reduce the training error. However, setting max_depth too high can lead to overfitting, where the model memorizes noise in the training data. Tune max_depth carefully to balance model complexity against generalization performance.
- max_features:
  - Controls the number of features considered when looking for the best split.
  - Accepts an integer, a float, “sqrt”, “log2”, or None (use all features).
  - “auto” is a legacy option that recent scikit-learn releases no longer accept.
- min_samples_split:
  - Defines the minimum number of samples needed to split a node.
  - min_samples_split = 10 ensures a node must have at least 10 samples before it can be split.
- min_samples_leaf:
  - Defines the minimum number of samples required at a leaf node.
  - min_samples_leaf = 5 ensures each leaf node in the decision tree contains at least 5 samples.
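
The trade-off described for max_depth can be observed by comparing training accuracy with cross-validated accuracy at several depths. The sketch below is an illustration, not the notebook's own run: it assumes scikit-learn's built-in `load_breast_cancer` copy of the dataset (labels are 0/1 there instead of "M"/"B") in place of the CSV file, and the depth values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in copy of the dataset stands in for data_breast_cancer.csv
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

depth_results = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Cross-validated accuracy estimates generalization
    cv_acc = cross_val_score(tree, X_tr, y_tr, cv=5, scoring="accuracy").mean()
    # Training accuracy shows how closely the tree fits the training data
    train_acc = tree.fit(X_tr, y_tr).score(X_tr, y_tr)
    depth_results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A growing gap between training and cross-validated accuracy as the depth increases is the usual sign of overfitting.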

In [257]:
dtree.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 42,
 'splitter': 'best'}

- Tuning model: Grid Search + Cross validation

In [258]:
from sklearn.model_selection import GridSearchCV

param_grid = {
              'criterion': ['gini', 'entropy'],
              'max_depth': [3, 5, 7, 10, 20],
              'max_features': [None, 10, 20,'log2', 'sqrt'],
              'min_samples_split': [2, 10, 20, 50],
              'min_samples_leaf': [1, 3, 10]
}

# run grid search
grid_search = GridSearchCV(estimator=dtree, 
                           param_grid=param_grid, 
                           cv=5, 
                           scoring='accuracy', 
                           verbose=1, 
                           n_jobs=-1) # Parallel Execution, speed up the process
 
# fit the model on the training data for every parameter combination in the grid
grid_search.fit(X_train, y_train)

# get best parameters and best scores
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters for Decision Tree is: {best_params}")
print(f"Best cross-validation score for Decision Tree is: {best_score}")



Fitting 5 folds for each of 600 candidates, totalling 3000 fits
Best parameters for Decision Tree is: {'criterion': 'entropy', 'max_depth': 7, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 10}
Best cross-validation score for Decision Tree is: 0.9494505494505496


- Check predictions - best model

In [259]:
dtree_best = grid_search.best_estimator_

dtree_best.fit(X_train, y_train)

y_pred_dt_tuning = dtree_best.predict(X_test)

pd.DataFrame(confusion_matrix(y_test, y_pred_dt_tuning, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,72,0
Actual M,6,36


In [260]:
print(classification_report(y_test, y_pred_dt_tuning))

              precision    recall  f1-score   support

           B       0.92      1.00      0.96        72
           M       1.00      0.86      0.92        42

    accuracy                           0.95       114
   macro avg       0.96      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114



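**Insight:** 

After tuning, the mean cross-validation accuracy improves (0.932 to 0.949), but this does not mean the model will necessarily perform better on unseen data. On the test set, overall accuracy rises from 0.93 to 0.95, yet recall for the malignant class drops from 0.90 to 0.86 (6 malignant cases missed instead of 4).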

## 4. Random Forest
- build RF model
- train the model
- make predictions
- check metrics
- Tune the model
  - cross validation
  - check metrics

In [261]:
from sklearn.ensemble import RandomForestClassifier

RForest = RandomForestClassifier(random_state = 42)  # creates an instance of RandomForestClassifier 
RForest.fit(X_train, y_train)  # fits the model using X_train and y_train.

y_pred_rf = RForest.predict(X_test)

In [262]:
# check metrics

pd.DataFrame(confusion_matrix(y_test, y_pred_rf, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])


Unnamed: 0,predicted B,predicted M
Actual B,72,0
Actual M,3,39


In [263]:
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

           B       0.96      1.00      0.98        72
           M       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



- Tuning the Random Forest model
   - criterion: {“gini”, “entropy”}
   - n_estimators: The number of trees in the forest. int, default=100
   - max_depth: The maximum depth of the tree. int, default=None
   - min_samples_split: The minimum number of samples required to split an internal node
   - max_features: {“sqrt”, “log2”, None}


In [264]:
RForest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [265]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'n_estimators': [100, 150, 200],
              'criterion': ['gini', 'entropy'],
              'max_depth': [3, 5, 7, 10, 20],
              'max_features': [None, 10, 20,'log2', 'sqrt'],
              'min_samples_split': [2, 10, 20, 50],
              'min_samples_leaf': [1, 3, 10]
}

# run randomized search
random_search = RandomizedSearchCV(estimator=RForest, 
                                 param_distributions = param_grid, # RandomizedSearchCV takes param_distributions instead of param_grid
                                 cv=5, 
                                 scoring='accuracy', 
                                 verbose=1, 
                                 n_jobs=-1) # Parallel Execution, speed up the process
 
# fit the model on the training data for each sampled parameter combination (n_iter defaults to 10)
random_search.fit(X_train, y_train)

# get best parameters and best scores
best_params = random_search.best_params_
best_score  = random_search.best_score_

print(f"Best parameters for RandomForest is: {best_params}")
print(f"Best cross-validation score for RandomForest is: {best_score}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters for RandomForest is: {'n_estimators': 150, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': None, 'max_depth': 7, 'criterion': 'gini'}
Best cross-validation score for RandomForest is: 0.9626373626373628
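
To see how much of the search space the randomized search covers, the sketch below counts the combinations in the same parameter dictionary and draws 10 of them. `ParameterGrid` and `ParameterSampler` are scikit-learn helpers; the seed of 42 is an arbitrary choice for the illustration and does not reproduce the candidates sampled in the run above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

rf_params = {
    "n_estimators": [100, 150, 200],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10, 20],
    "max_features": [None, 10, 20, "log2", "sqrt"],
    "min_samples_split": [2, 10, 20, 50],
    "min_samples_leaf": [1, 3, 10],
}

# An exhaustive grid search would evaluate every combination
n_grid = len(ParameterGrid(rf_params))

# A randomized search evaluates only n_iter sampled combinations
sampled = list(ParameterSampler(rf_params, n_iter=10, random_state=42))

print(f"grid candidates: {n_grid}, sampled candidates: {len(sampled)}")
print(sampled[0])
```

With 5-fold cross-validation, the full grid would need 9,000 forest fits against 50 for the randomized search, which is why the randomized search is the cheaper first pass.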


In [266]:
# check prediction with Best RF

RForest_best = random_search.best_estimator_

RForest_best.fit(X_train, y_train)

y_pred_rf_tuning = RForest_best.predict(X_test)


In [267]:

pd.DataFrame(confusion_matrix(y_test, y_pred_rf_tuning, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,72,0
Actual M,3,39


In [268]:
print(classification_report(y_test, y_pred_rf_tuning))

              precision    recall  f1-score   support

           B       0.96      1.00      0.98        72
           M       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



Next, try GridSearchCV.

In [280]:
param_grid = {'n_estimators': [100, 150, 200],
              'criterion': ['gini', 'entropy'],
              'max_depth': [3, 10, 20],
              'max_features': [None, 'log2', 'sqrt'],
              'min_samples_split': [2, 10],
              'min_samples_leaf': [1, 3]
}

# run grid search
grid_search = GridSearchCV(estimator=RForest, 
                                 param_grid = param_grid, # GridSearchCV takes param_grid instead of param_distributions
                                 cv=5, 
                                 scoring='accuracy', 
                                 verbose=1, 
                                 n_jobs=-1) # Parallel Execution, speed up the process
 
# fit the model on the training data for every parameter combination in the grid
grid_search.fit(X_train, y_train)

# get best parameters and best scores
best_params = grid_search.best_params_
best_score  = grid_search.best_score_

print(f"Best parameters for RandomForest is: {best_params}")
print(f"Best cross-validation score for RandomForest is: {best_score}")

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


Best parameters for RandomForest is: {'criterion': 'gini', 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
Best cross-validation score for RandomForest is: 0.9626373626373628


In [281]:
# prediction using the best model, and check metrics

RForest_best = grid_search.best_estimator_
RForest_best.fit(X_train, y_train)

y_pred_rf_tuning = RForest_best.predict(X_test)

pd.DataFrame(confusion_matrix(y_test, y_pred_rf_tuning, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,72,0
Actual M,3,39


In [282]:
print(classification_report(y_test, y_pred_rf_tuning))

              precision    recall  f1-score   support

           B       0.96      1.00      0.98        72
           M       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



In [283]:
# Summary
from sklearn.metrics import accuracy_score
score_RF = round(accuracy_score(y_test, y_pred_rf_tuning) * 100, 2 )
cm_RF = pd.DataFrame(confusion_matrix(y_test, y_pred_rf_tuning, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

In [284]:
print(score_RF)
print(cm_RF)

97.37
          predicted B  predicted M
Actual B           72            0
Actual M            3           39


## 5. Support Vector Machine (SVM)
  - Build model
  - Tune the model with cross-validation
  - Train model
  - Make predictions
  - Check metrics

### SVM: SVC vs. linear SVC
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
https://scikit-learn.org/stable/modules/svm.html#svm-classification

- SVC (Support Vector Classification):

The implementation is based on **libsvm**. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets, consider using LinearSVC or SGDClassifier instead.

- LinearSVC (Linear Support Vector Classification):

Similar to SVC with parameter kernel=’linear’, but implemented in terms of **liblinear** rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

- Difference:

The main differences between LinearSVC and SVC lie in the loss function used by default and in how the two implementations handle intercept regularization.

When choosing between SVC and LinearSVC, one important criterion is that LinearSVC tends to converge faster as the number of samples grows. This is because the linear kernel is a special case that liblinear is optimized for, but libsvm is not.
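
The two linear variants can be compared side by side with a sketch like the one below. It assumes scikit-learn's built-in `load_breast_cancer` copy of the dataset in place of the CSV file; `C=1.0` and `max_iter=10000` are illustrative settings rather than tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Assumption: built-in copy of the dataset stands in for data_breast_cancer.csv
X, y = load_breast_cancer(return_X_y=True)

linear_models = {
    # libsvm-based solver with a linear kernel
    "SVC(kernel='linear')": make_pipeline(
        StandardScaler(), SVC(kernel="linear", C=1.0)
    ),
    # liblinear-based solver; different default loss and intercept handling
    "LinearSVC": make_pipeline(
        StandardScaler(), LinearSVC(C=1.0, max_iter=10000, random_state=42)
    ),
}

cv_means = {}
for name, model in linear_models.items():
    cv_means[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {cv_means[name]:.3f}")
```

On a dataset of a few hundred rows either solver fits quickly, so the scaling argument for LinearSVC matters mainly for much larger sample counts.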


In [204]:
# SVM needs scaled features, so use the pipeline

from sklearn.svm import SVC

svc_model = SVC(random_state=42)

pipeline_svc = Pipeline(steps=[
                              ("pre_process", preprocessor),
                              ("model", svc_model)
                              ])
pipeline_svc

In [205]:
pipeline_svc.fit(X_train, y_train)

y_pred_svc = pipeline_svc.predict(X_test)

pd.DataFrame(confusion_matrix(y_test, y_pred_svc, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,72,0
Actual M,3,39


In [206]:
print(classification_report(y_test, y_pred_svc))

              precision    recall  f1-score   support

           B       0.96      1.00      0.98        72
           M       1.00      0.93      0.96        42

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



In [272]:
score_svc = round(accuracy_score(y_test, y_pred_svc) * 100, 2 )
score_svc

97.37

In [None]:
# # without pipeline

# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)
# from sklearn import svm
# # Create a Support Vector Classifier
# svc = svm.SVC(random_state=42)
# # Train the model using the training sets 
# svc.fit(X_train,y_train)
# # Prediction on test data
# y_pred = svc.predict(X_test)
# # Calculating the accuracy
# acc_svm = round( metrics.accuracy_score(y_test, y_pred) * 100, 2 )
# print( 'Accuracy of SVM model : ', acc_svm )

- SVC tuning
  - C: Regularization parameter.
    - The strength of the regularization is inversely proportional to C.
    - default=1.0
  - kernel:
    - {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}
  - degree:
    - Degree of the polynomial kernel (‘poly’).
    - Ignored by all other kernels.
  - gamma:
    - Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
    - ‘scale’, ‘auto’, or a positive float

- Radial Basis Function (RBF) kernel: tune C and gamma.
- Linear kernel: tune C.

In [182]:
svc_model.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': 42,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [222]:
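# Grid search over two kernel families (RBF and linear)
# Note: the search runs on svc_model, not pipeline_svc, so the features are not scaled here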

# param_grid = {
#                "C": [0.1, 1, 10],
#                'kernel': ['linear', 'poly', 'rbf'],
#                'gamma' : ['scale', 'auto'],
#                'degree' : [2, 3, 4]}

param_grid = [
    {"C": [0.01, 0.1, 1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': ["auto", "scale", 0.001, 0.01, 0.1, 1]},
    {"C": [0.01, 0.1, 1, 10, 100, 1000], 'kernel': ['linear']}
]

grid_search = GridSearchCV(estimator=svc_model, 
                          param_grid=param_grid,           
                                   scoring='accuracy', 
                                   cv=5, 
                                   verbose=1, 
                                   n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters for SVC : {best_params}")
print(f"Best cross-validation accuracy for SVC: {best_score}")

Fitting 5 folds for each of 42 candidates, totalling 210 fits
Best parameters for SVC : {'C': 100, 'kernel': 'linear'}
Best cross-validation accuracy for SVC: 0.9692307692307693


In [223]:
svc_best = grid_search.best_estimator_

y_pred_svc_tuning = svc_best.predict(X_test)

pd.DataFrame(confusion_matrix(y_test, y_pred_svc_tuning, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,70,2
Actual M,6,36


In [224]:
print(classification_report(y_test, y_pred_svc_tuning))

              precision    recall  f1-score   support

           B       0.92      0.97      0.95        72
           M       0.95      0.86      0.90        42

    accuracy                           0.93       114
   macro avg       0.93      0.91      0.92       114
weighted avg       0.93      0.93      0.93       114



In [271]:
score_svc_tuning = round(accuracy_score(y_test, y_pred_svc_tuning) * 100, 2 )
score_svc_tuning

92.98
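
The grid search above is run on `svc_model` itself rather than on `pipeline_svc`, so the candidate SVCs are trained on unscaled features. That may explain why the tuned model scores lower on the test set (92.98) than the untuned pipeline (97.37). A minimal sketch of tuning the whole pipeline instead is shown below. It assumes scikit-learn's built-in `load_breast_cancer` copy of the dataset in place of the CSV file and uses a reduced, illustrative grid. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: built-in copy of the dataset stands in for data_breast_cancer.csv
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

svc_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", SVC(random_state=42)),
])

# "model__C" targets the C parameter of the step named "model"
svc_pipe_grid = [
    {"model__kernel": ["rbf"],
     "model__C": [0.1, 1, 10, 100],
     "model__gamma": ["scale", 0.001, 0.01, 0.1]},
    {"model__kernel": ["linear"],
     "model__C": [0.1, 1, 10, 100]},
]

# The scaler is refitted inside each CV fold, so no fold sees the others' statistics
svc_pipe_search = GridSearchCV(svc_pipe, svc_pipe_grid, scoring="accuracy", cv=5)
svc_pipe_search.fit(X_tr, y_tr)

print(svc_pipe_search.best_params_)
print(f"best CV accuracy: {svc_pipe_search.best_score_:.3f}")
print(f"test accuracy: {svc_pipe_search.score(X_te, y_te):.3f}")
```

Because `best_estimator_` is then the full pipeline, calling `predict` on raw test features applies the same scaling automatically.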

## 6. K-Nearest Neighbors Model

In [218]:

from sklearn.neighbors import KNeighborsClassifier

# knn_model = KNeighborsClassifier( random_state = 42)
# The KNeighborsClassifier in scikit-learn does not have a random_state parameter.

knn_model = KNeighborsClassifier()

pipeline_knn = Pipeline(steps=[
                              ("pre_process", preprocessor),
                              ("model", knn_model)
                              ])
pipeline_knn

In [None]:
# # if do not use pipeline
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

In [220]:
knn_model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

In [226]:
param_grid = {'n_neighbors': [3, 5, 7, 10],
            'weights': ['uniform', 'distance'],
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
            'leaf_size': [10, 20, 30, 40]}

# Note: the search runs on knn_model, not pipeline_knn, so the features are not scaled here
grid_search = GridSearchCV(estimator=knn_model, 
                           param_grid=param_grid,           
                                   scoring='accuracy', 
                                   cv=5, 
                                   verbose=1, 
                                   n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best parameters for KNN : {best_params}")
print(f"Best cross-validation accuracy for KNN: {best_score}")

Fitting 5 folds for each of 128 candidates, totalling 640 fits
Best parameters for KNN : {'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 3, 'weights': 'distance'}
Best cross-validation accuracy for KNN: 0.9252747252747252


In [227]:
knn_best = grid_search.best_estimator_

y_pred_knn = knn_best.predict(X_test)

pd.DataFrame(confusion_matrix(y_test, y_pred_knn, labels = ["B", "M"]),
             columns = ["predicted B", "predicted M"],
             index = ["Actual B", "Actual M"])

Unnamed: 0,predicted B,predicted M
Actual B,72,0
Actual M,9,33


In [229]:
score_knn = round(accuracy_score(y_test, y_pred_knn) * 100, 2 )
score_knn

92.11
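
KNN classifies by distance, so features with large ranges (such as the area columns) can dominate the neighbor search when the data is not scaled. The sketch below compares cross-validated accuracy with and without a `StandardScaler` step. It assumes scikit-learn's built-in `load_breast_cancer` copy of the dataset in place of the CSV file and the default `n_neighbors=5`, so the numbers are not those of the tuned model above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: built-in copy of the dataset stands in for data_breast_cancer.csv
X, y = load_breast_cancer(return_X_y=True)

# KNN on raw features: distances are dominated by large-valued columns
knn_raw = KNeighborsClassifier()
# KNN on standardized features: every column contributes on the same scale
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

acc_raw = cross_val_score(knn_raw, X, y, cv=5, scoring="accuracy").mean()
acc_scaled = cross_val_score(knn_scaled, X, y, cv=5, scoring="accuracy").mean()

print(f"KNN without scaling: {acc_raw:.3f}")
print(f"KNN with scaling:    {acc_scaled:.3f}")
```

The same `<step name>__<parameter>` naming used for pipeline grid searches would let the KNN hyperparameters be tuned on scaled features.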

## 7. Compare model performance

In [285]:
# compare test accuracy (%); LR, DT and SVM are the untuned models, RF and KNN the tuned ones

scores = pd.DataFrame(
    {'model': ['LR', 'DT', 'RF', 'SVM', 'KNN'],
     'score': [score_LR, score_dt, score_RF, score_svc, score_knn]
    }
)

scores.sort_values(by='score', ascending=False)

Unnamed: 0,model,score
0,LR,97.37
2,RF,97.37
3,SVM,97.37
1,DT,92.98
4,KNN,92.11
