# **Importing Libraries**

- Pandas for data handling
- The three classification models that I will use:
    + Support Vector Machine (SVC)
    + Logistic Regression
    + Random Forest Classifier
- Model evaluation metrics:
    + Accuracy
    + Confusion Matrix
    + Precision, Recall, F1-Score
- GridSearchCV for hyperparameter tuning
- train_test_split to split the data


In [1]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# **Reading The CSV File**

In [2]:
df = pd.read_csv("Breast Cancer Wisconsin Dataset.csv")

# **Data Preprocessing**
In this step, some data preprocessing is done to prepare the dataset for model training:

1. The **"id"** column is dropped from the DataFrame since it does not contribute to the prediction.
2. The **"diagnosis"** column, which contains categorical values ("M" for malignant and "B" for benign), is transformed into numerical values:
   - **1** for malignant tumors
   - **0** for benign tumors

After this replacement, the unique values in the **"diagnosis"** column are checked to ensure the transformation has been applied correctly.


In [3]:
df.drop(columns=["id"], inplace=True)

# Assign the result back to the column. Calling a method with inplace=True on
# df["diagnosis"] raises a FutureWarning and will stop working in pandas 3.0.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
df["diagnosis"].unique()


array([1, 0])
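As a hedged illustration of the encoding step, the sketch below applies `Series.map` with an explicit dictionary to a small made-up DataFrame (not the real dataset). Any label missing from the dictionary becomes `NaN`, which makes unexpected values easy to detect. The helper name `encode_diagnosis` is hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_diagnosis(labels):
    """Map "M" -> 1 and "B" -> 0; raise if any other label is present."""
    encoded = labels.map({"M": 1, "B": 0})
    if encoded.isna().any():
        unexpected = sorted(labels[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unexpected diagnosis labels: {unexpected}")
    return encoded.astype(int)

# Toy data for illustration only
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})
toy["diagnosis"] = encode_diagnosis(toy["diagnosis"])
print(toy["diagnosis"].tolist())  # [1, 0, 0, 1]
```

Raising on unknown labels is a design choice for this sketch; silently keeping them would leave non-numeric values in the target column.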

In [4]:
df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'diagnosis'],
      dtype='object')

# **Selecting Features for Modeling**
The dataset contains 30 features, excluding the id and the diagnosis columns.

These features are based on **10 core properties**, each measured in three different ways:

- **Mean** (average value across cells)
- **Standard Error (SE)** (variation within the sample)
- **Worst/Max** (largest value observed)

This results in:
> 10 properties × 3 measurement types = 30 total features


To simplify the model and reduce redundancy, **only the 10 "mean" features** will be used for classification.

The goal is to classify tumors as **malignant** or **benign** using these core features.
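A label slice such as `"radius_mean" : "fractal_dimension_mean"` works only while the ten mean columns stay contiguous. As a hedged alternative, the sketch below picks them by name suffix instead. The column names are rebuilt from the `df.columns` output shown earlier; the variable names `properties`, `columns`, and `mean_features` are illustrative.

```python
# The 10 core properties, in the order they appear in df.columns
properties = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# 10 properties x 3 measurement types, plus the target column
columns = [f"{p}_{suffix}" for suffix in ("mean", "se", "worst") for p in properties]
columns.append("diagnosis")

# Keep only the features whose name ends in "_mean"
mean_features = [c for c in columns if c.endswith("_mean")]
print(len(columns) - 1, len(mean_features))  # 30 10
```

With the real DataFrame, the same list could be used as `df[mean_features]`.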

In [5]:
x = df.loc[:, "radius_mean" : "fractal_dimension_mean"]
y = df["diagnosis"]

# test_size=20 is an integer, so exactly 20 rows are held out for testing (not 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=20, random_state=10)
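The `test_size` argument of `train_test_split` is read as an absolute number of samples when it is an integer and as a fraction when it is a float. The sketch below shows the difference on 50 synthetic samples (made-up data, not the tumor dataset). The `stratify` argument is shown as an optional extra that keeps class proportions equal in both splits; the notebook itself does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 synthetic samples, 2 features
y = np.array([0, 1] * 25)          # balanced synthetic labels

# Integer test_size: an absolute number of test samples
_, X_test_int, _, _ = train_test_split(X, y, test_size=20, random_state=10)

# Float test_size: a fraction of the data
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)

# Optional: stratify keeps the class ratio the same in train and test
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

print(len(X_test_int), len(X_test_frac))  # 20 10
```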

# **Accuracy Test**
A helper function evaluates each of the three models on the test set using the metrics imported earlier:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix

In [6]:
def test_scores(model_name, predictions):
    """Score test-set predictions against the global y_test."""
    accuracy = accuracy_score(y_test, predictions)
    confusion = confusion_matrix(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

    # confusion_matrix rows are actual classes, columns are predicted classes
    confusion_str = (f"True Negative: {confusion[0][0]}, "
                     f"True Positive: {confusion[1][1]}, "
                     f"False Positive: {confusion[0][1]}, "
                     f"False Negative: {confusion[1][0]}")

    return {
        "Model": model_name,
        "Accuracy": accuracy,
        "Confusion Matrix": confusion_str,
        "Precision": precision,
        "Recall": recall,
        "F1": f1
    }
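The indexing in `test_scores` relies on the layout scikit-learn uses for a binary confusion matrix: rows are the actual classes and columns are the predicted classes, both ordered 0 then 1. The sketch below rebuilds that layout in plain Python on made-up labels to show which cell holds which count. The function `confusion_counts` is a hypothetical stand-in, not the scikit-learn implementation.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 list where matrix[actual][predicted] counts samples."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels: 0 = benign, 1 = malignant
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

m = confusion_counts(y_true, y_pred)
tn, fp = m[0]  # actual benign: predicted benign, predicted malignant
fn, tp = m[1]  # actual malignant: predicted benign, predicted malignant
print(tn, fp, fn, tp)  # 2 1 1 3
```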

# **Support Vector Classifier with Hyperparameter Tuning**
In this step, a Support Vector Classifier (SVC) is used along with `GridSearchCV` to find the best combination of hyperparameters.

The following settings are explored:
- **C**: [0.01, 0.1, 1, 10]
- **Kernel**: ["linear", "rbf"]

In [7]:
svc = SVC()
cvSVC = GridSearchCV(
    svc,
    {"C": [0.01, 0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=10,
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="accuracy",
)

cvSVC.fit(x_train, y_train)
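To make the cost of this search concrete, the sketch below expands the same grid with `itertools.product` and counts the model fits. It is a plain-Python approximation of what a grid search enumerates, not scikit-learn's own code; the names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

param_grid = {"C": [0.01, 0.1, 1, 10], "kernel": ["linear", "rbf"]}
cv_folds = 10

# Every combination of C and kernel is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per fold
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)  # 8 80
```

Because `refit` is set, the search also trains the winning candidate once more on the full training set after the 80 cross-validation fits.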


## **Viewing Grid Search Results**
The results from `GridSearchCV` are stored in a DataFrame to allow detailed comparison of each parameter combination.

The table includes:
- Parameters tested: **C** and **kernel**
- Mean and standard deviation for each evaluation metric across the 10 folds:
  - Accuracy
  - Precision
  - Recall
  - F1-Score
- Ranking of parameter combinations based on each metric

The results are sorted by **accuracy rank** to highlight the best-performing configuration.

In [8]:
pd.options.display.max_columns = None

results_df = pd.DataFrame(cvSVC.cv_results_)
results_df[["param_C","param_kernel", "mean_test_accuracy","mean_test_precision","mean_test_recall", "mean_test_f1", "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1", "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"]].sort_values("rank_test_accuracy")

|   | param_C | param_kernel | mean_test_accuracy | mean_test_precision | mean_test_recall | mean_test_f1 | std_test_accuracy | std_test_precision | std_test_recall | std_test_f1 | rank_test_accuracy | rank_test_precision | rank_test_recall | rank_test_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 10.0 | linear | 0.925286 | 0.932413 | 0.867857 | 0.895433 | 0.01732 | 0.043865 | 0.076541 | 0.027136 | 1 | 5 | 1 | 1 |
| 4 | 1.0 | linear | 0.916162 | 0.91107 | 0.863095 | 0.883557 | 0.026173 | 0.043605 | 0.078515 | 0.039077 | 2 | 6 | 2 | 2 |
| 2 | 0.1 | linear | 0.905219 | 0.888906 | 0.853095 | 0.868757 | 0.033566 | 0.042006 | 0.078953 | 0.05079 | 3 | 7 | 3 | 3 |
| 0 | 0.01 | linear | 0.903367 | 0.888518 | 0.848095 | 0.865737 | 0.035869 | 0.045683 | 0.083604 | 0.054327 | 4 | 8 | 4 | 4 |
| 7 | 10.0 | rbf | 0.886936 | 0.948057 | 0.74119 | 0.82844 | 0.038586 | 0.068939 | 0.084509 | 0.064086 | 5 | 4 | 5 | 5 |
| 5 | 1.0 | rbf | 0.8833 | 0.949182 | 0.731429 | 0.820961 | 0.038036 | 0.06798 | 0.097018 | 0.066473 | 6 | 3 | 6 | 6 |
| 3 | 0.1 | rbf | 0.866936 | 0.969925 | 0.668095 | 0.786237 | 0.03776 | 0.060244 | 0.09354 | 0.069812 | 7 | 2 | 7 | 7 |
| 1 | 0.01 | rbf | 0.792323 | 1.0 | 0.444048 | 0.610552 | 0.030808 | 0.0 | 0.080526 | 0.079911 | 8 | 1 | 8 | 8 |
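The same list of result columns is typed out by hand for every search in this notebook. As a hedged convenience, the sketch below builds that list from the metric names instead. The helper `summarize_cv` and the constant `METRICS` are hypothetical, and the input here is a made-up dictionary that only mimics the key layout of `cv_results_`.

```python
import pandas as pd

METRICS = ["accuracy", "precision", "recall", "f1"]

def summarize_cv(cv_results, sort_by="accuracy"):
    """Keep the parameter columns plus mean/std/rank of each metric, sorted by rank."""
    table = pd.DataFrame(cv_results)
    param_cols = [c for c in table.columns if c.startswith("param_")]
    stat_cols = [f"{stat}_test_{m}" for stat in ("mean", "std", "rank") for m in METRICS]
    return table[param_cols + stat_cols].sort_values(f"rank_test_{sort_by}")

# Made-up results for two candidates (illustration only)
toy_results = {"param_C": [0.1, 1.0], "mean_fit_time": [0.01, 0.02]}
for m in METRICS:
    toy_results[f"mean_test_{m}"] = [0.90, 0.92]
    toy_results[f"std_test_{m}"] = [0.03, 0.02]
    toy_results[f"rank_test_{m}"] = [2, 1]

summary = summarize_cv(toy_results)
print(summary["param_C"].tolist())  # [1.0, 0.1]
```

With a fitted search, the call would look like `summarize_cv(cvSVC.cv_results_)`.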


## **Testing the Best SVC Model**
After identifying the best-performing Support Vector Classifier through GridSearchCV, the model is evaluated on the test set.

In [9]:
svc = cvSVC.best_estimator_
svc_pred = svc.predict(x_test)

svc_scores = test_scores("SVC", svc_pred)
svc_scores

{'Model': 'SVC',
 'Accuracy': 0.85,
 'Confusion Matrix': 'True Negative: 11, True Positive: 6, False Positive: 2, False Negative: 1',
 'Precision': 0.75,
 'Recall': 0.8571428571428571,
 'F1': 0.8}
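The reported metrics can be checked by hand from the four confusion-matrix counts. The sketch below applies the standard definitions of accuracy, precision, recall, and F1 to the counts printed above; it is a cross-check, not a replacement for the scikit-learn functions.

```python
# Counts taken from the SVC test output above
tn, tp, fp, fn = 11, 6, 2, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted malignant that are malignant
recall = tp / (tp + fn)                     # share of actual malignant that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.85 0.75 0.8571 0.8
```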

# **Logistic Regression with Hyperparameter Tuning**
In this step, a Logistic Regression model is used along with `GridSearchCV` to find the best value for the regularization parameter **C**.

The following setting is explored:
- **C**: [0.01, 0.1, 1, 10, 100]

In [10]:
lg = LogisticRegression(max_iter=1000)

cvLG = GridSearchCV(
    lg,
    {"C": [0.01, 0.1, 1, 10, 100]},
    cv=10,
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="accuracy",
)

cvLG.fit(x_train, y_train)

In [11]:
results_df = pd.DataFrame(cvLG.cv_results_)
results_df[["param_C", "mean_test_accuracy","mean_test_precision","mean_test_recall", "mean_test_f1", "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1", "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"]].sort_values("rank_test_accuracy")

|   | param_C | mean_test_accuracy | mean_test_precision | mean_test_recall | mean_test_f1 | std_test_accuracy | std_test_precision | std_test_recall | std_test_f1 | rank_test_accuracy | rank_test_precision | rank_test_recall | rank_test_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 100.0 | 0.93798 | 0.948445 | 0.886905 | 0.91293 | 0.023698 | 0.049732 | 0.080877 | 0.037059 | 1 | 1 | 1 | 1 |
| 3 | 10.0 | 0.919832 | 0.921528 | 0.863095 | 0.888109 | 0.023368 | 0.0433 | 0.078515 | 0.035107 | 2 | 2 | 2 | 2 |
| 2 | 1.0 | 0.912525 | 0.903873 | 0.858095 | 0.878029 | 0.031428 | 0.036186 | 0.083233 | 0.048555 | 3 | 3 | 3 | 3 |
| 0 | 0.01 | 0.903367 | 0.893468 | 0.843333 | 0.865291 | 0.029831 | 0.041118 | 0.078743 | 0.0464 | 4 | 4 | 5 | 4 |
| 1 | 0.1 | 0.901549 | 0.883616 | 0.848095 | 0.863712 | 0.038651 | 0.048617 | 0.083604 | 0.05742 | 5 | 5 | 4 | 5 |
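The winning setting does not have to be read off the table by eye: a fitted `GridSearchCV` exposes it through `best_params_`, `best_score_`, and `best_index_`. The sketch below shows these attributes on a synthetic dataset from `make_classification`, which is an assumption made so the example runs on its own; the numbers it prints will not match the tumor data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=10)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10, 100]},
    cv=10,
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="accuracy",  # the metric that decides best_params_ and best_score_
)
search.fit(X, y)

print(search.best_params_)           # the C value ranked first on accuracy
print(round(search.best_score_, 3))  # its mean cross-validated accuracy

# best_index_ points at the matching row of cv_results_
best_row_accuracy = search.cv_results_["mean_test_accuracy"][search.best_index_]
```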


## **Testing the Best Logistic Regression Model**
After identifying the best-performing Logistic Regression model through GridSearchCV, the model is evaluated on the test set.

In [12]:
lg = cvLG.best_estimator_
lg_pred = lg.predict(x_test)

lg_scores = test_scores("Logistic Regression", lg_pred)
lg_scores

{'Model': 'Logistic Regression',
 'Accuracy': 0.85,
 'Confusion Matrix': 'True Negative: 11, True Positive: 6, False Positive: 2, False Negative: 1',
 'Precision': 0.75,
 'Recall': 0.8571428571428571,
 'F1': 0.8}

# **Random Forest Classifier with Hyperparameter Tuning**
In this step, a Random Forest Classifier is used along with `GridSearchCV` to find the optimal number of trees in the forest.

The following setting is explored:
- **n_estimators**: [100, 200, 300, 400]

In [13]:
rf = RandomForestClassifier(criterion="entropy", random_state=10)

cvRF = GridSearchCV(
    rf,
    {"n_estimators": [100, 200, 300, 400]},
    cv=10,
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="accuracy",
)

cvRF.fit(x_train, y_train)

In [14]:
results_df = pd.DataFrame(cvRF.cv_results_)
results_df[["param_n_estimators", "mean_test_accuracy","mean_test_precision","mean_test_recall", "mean_test_f1", "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1", "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"]].sort_values("rank_test_accuracy")

|   | param_n_estimators | mean_test_accuracy | mean_test_precision | mean_test_recall | mean_test_f1 | std_test_accuracy | std_test_precision | std_test_recall | std_test_f1 | rank_test_accuracy | rank_test_precision | rank_test_recall | rank_test_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 0.93798 | 0.924287 | 0.911905 | 0.916422 | 0.027567 | 0.058902 | 0.049131 | 0.037855 | 1 | 1 | 1 | 1 |
| 1 | 200 | 0.936128 | 0.920042 | 0.911667 | 0.913965 | 0.030109 | 0.063139 | 0.054236 | 0.04129 | 2 | 2 | 2 | 2 |
| 2 | 300 | 0.93431 | 0.919609 | 0.906905 | 0.911418 | 0.030039 | 0.063227 | 0.052516 | 0.040992 | 3 | 3 | 3 | 3 |
| 3 | 400 | 0.932492 | 0.919082 | 0.901905 | 0.908598 | 0.030947 | 0.06341 | 0.055244 | 0.042619 | 4 | 4 | 4 | 4 |


## **Refining Random Forest Hyperparameter Tuning**

After the initial `GridSearchCV` showed that 100 estimators gave the best results, a second cross-validation was run to check whether a smaller value of **n_estimators** would perform as well or better.

The following values were tested:
- **n_estimators**: [25, 50, 75, 100]

In [15]:
cvRF = GridSearchCV(
    rf,
    {"n_estimators": [25, 50, 75, 100]},
    cv=10,
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="accuracy",
)

cvRF.fit(x_train, y_train)

In [16]:
results_df = pd.DataFrame(cvRF.cv_results_)
results_df[["param_n_estimators", "mean_test_accuracy","mean_test_precision","mean_test_recall", "mean_test_f1", "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1", "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"]].sort_values("rank_test_accuracy")

|   | param_n_estimators | mean_test_accuracy | mean_test_precision | mean_test_recall | mean_test_f1 | std_test_accuracy | std_test_precision | std_test_recall | std_test_f1 | rank_test_accuracy | rank_test_precision | rank_test_recall | rank_test_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 50 | 0.93798 | 0.921163 | 0.916905 | 0.916983 | 0.032007 | 0.065773 | 0.054528 | 0.042963 | 1 | 2 | 2 | 1 |
| 3 | 100 | 0.93798 | 0.924287 | 0.911905 | 0.916422 | 0.027567 | 0.058902 | 0.049131 | 0.037855 | 1 | 1 | 3 | 2 |
| 2 | 75 | 0.936128 | 0.916051 | 0.916905 | 0.914969 | 0.0361 | 0.069521 | 0.050197 | 0.048108 | 3 | 3 | 1 | 3 |
| 0 | 25 | 0.93431 | 0.915929 | 0.911667 | 0.911974 | 0.033177 | 0.065503 | 0.054236 | 0.044502 | 4 | 4 | 4 | 4 |


## **Evaluating Random Forest Hyperparameters**
After conducting the second round of cross-validation with smaller values for **n_estimators**, the results for **n_estimators = 50** and **n_estimators = 100** showed very close performance in terms of **accuracy**, **precision**, **recall**, and **F1-score**.

Although **n_estimators = 50** achieved slightly better **recall** and **F1-score** (while **n_estimators = 100** had slightly better **precision**), the decision was made to select **n_estimators = 100**. This choice was based on the following:
- The mean accuracy scores were identical between the two values.
- **n_estimators = 100** had a **lower standard deviation** on all four metrics, indicating more stable and consistent performance across folds.
- The differences in **recall** and **F1-score** were minimal, and the stability of the **100 estimators** configuration outweighed the slight advantage of **50 estimators**.

Therefore, **n_estimators = 100** was selected for its more reliable performance across different folds in cross-validation.
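The selection rule described here (highest mean accuracy first, then the lowest standard deviation among ties) can be written as a short sort key. The sketch below applies it to the accuracy figures from the second grid search; the tuple layout and variable names are illustrative.

```python
# (n_estimators, mean_test_accuracy, std_test_accuracy) from the results table
rows = [
    (50, 0.93798, 0.032007),
    (100, 0.93798, 0.027567),
    (75, 0.936128, 0.0361),
    (25, 0.93431, 0.033177),
]

# Highest mean accuracy wins; among equal means, the lowest std wins
best = min(rows, key=lambda r: (-r[1], r[2]))
print(best[0])  # 100
```

`GridSearchCV` also accepts a callable for `refit` that receives `cv_results_` and returns the index of the chosen candidate, so a rule like this could be applied inside the search instead of by hand.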

## **Training the Random Forest Classifier**

After refining the **n_estimators** parameter to 100, the Random Forest Classifier was reinitialized and retrained using the selected hyperparameters:
- **n_estimators**: 100
- **criterion**: "entropy"
- **random_state**: 10

In [17]:
del rf

rf = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=10)
rf.fit(x_train, y_train)

rf_pred = rf.predict(x_test)
rf_scores = test_scores("Random Forest", rf_pred)

rf_scores

{'Model': 'Random Forest',
 'Accuracy': 1.0,
 'Confusion Matrix': 'True Negative: 13, True Positive: 7, False Positive: 0, False Negative: 0',
 'Precision': 1.0,
 'Recall': 1.0,
 'F1': 1.0}

# **Comparing Model Performance**
To compare the performance of the three models (Support Vector Classifier, Logistic Regression, and Random Forest), the scores for each model were collected into a list called `combined_scores`. These scores include evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrix for each model.

The results were then converted into a DataFrame (`comparative_df`), which provides an easy-to-read summary of the performance of all three models. This DataFrame allows for a direct comparison of how each model performed across the different metrics.


In [18]:
pd.options.display.max_colwidth = None
combined_scores = [svc_scores, lg_scores, rf_scores]
comparative_df = pd.DataFrame(combined_scores)

comparative_df

|   | Model | Accuracy | Confusion Matrix | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 0 | SVC | 0.85 | True Negative: 11, True Positive: 6, False Positive: 2, False Negative: 1 | 0.75 | 0.857143 | 0.8 |
| 1 | Logistic Regression | 0.85 | True Negative: 11, True Positive: 6, False Positive: 2, False Negative: 1 | 0.75 | 0.857143 | 0.8 |
| 2 | Random Forest | 1.0 | True Negative: 13, True Positive: 7, False Positive: 0, False Negative: 0 | 1.0 | 1.0 | 1.0 |


# **Model Comparison Results**

The performance of the three models—Support Vector Classifier (SVC), Logistic Regression, and Random Forest—was evaluated across several metrics, including **accuracy**, **precision**, **recall**, **F1-score**, and the **confusion matrix**. 

- **SVC** and **Logistic Regression** both achieved an accuracy of **0.85**, with identical confusion matrices: 11 true negatives, 6 true positives, 2 false positives, and 1 false negative. They therefore share the same **precision** (**0.75**), **recall** (**0.857**), and **F1-score** (**0.80**).

- **Random Forest** outperformed the other two models on the 20-sample test set with an accuracy of **1.00**: 13 true negatives, 7 true positives, and no false positives or false negatives. Its **precision**, **recall**, and **F1-score** were all **1.00**.
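Because the test set holds only 20 samples, each sample is worth 5 percentage points of accuracy. The sketch below converts the reported accuracies back into counts of correct predictions to show how many samples separate the models; the variable names are illustrative.

```python
n_test = 20  # 13 benign + 7 malignant samples in the test set

reported_accuracy = {"SVC": 0.85, "Logistic Regression": 0.85, "Random Forest": 1.0}

# Accuracy times test-set size gives the number of correct predictions
correct = {name: round(acc * n_test) for name, acc in reported_accuracy.items()}
gap = correct["Random Forest"] - correct["SVC"]

print(correct["SVC"], correct["Random Forest"], gap)  # 17 20 3
```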

## **Conclusion**
Given that **Random Forest** achieved perfect scores on every test-set metric, it is the best of the three models on this test split. The margin should be read with some caution: the test set holds only 20 samples, and in cross-validation the best Random Forest and the best Logistic Regression (C = 100) reached the same mean accuracy of about 0.938, ahead of the best SVC at about 0.925.
