1. Load breast cancer dataset (**structured data**)

For more details about the data: https://scikit-learn.org/1.5/modules/generated/sklearn.datasets.load_breast_cancer.html

In [175]:

from sklearn.datasets import load_breast_cancer

my_data = load_breast_cancer()
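
Before moving on, it can help to confirm what was loaded. The following is a minimal sketch, assuming the standard `data`, `target`, and `target_names` attributes of the returned object; `bunch` and `class_counts` are illustrative names that are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

# One row per sample, one column per numeric feature.
print(bunch.data.shape)

# Label 0 is 'malignant' and label 1 is 'benign'.
print([str(name) for name in bunch.target_names])

# Count how many samples fall in each class.
class_counts = {
    str(name): int(count)
    for name, count in zip(bunch.target_names, np.bincount(bunch.target))
}
print(class_counts)
```

The label order matters later: metrics that default to "positive class = 1" will treat benign as the positive class.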


2. Visualize the data

- Only **5 points** for visualizing the data
- Use TSNE algorithm: sklearn.manifold.TSNE
- A simple, readable example is available here (it uses PCA instead of TSNE): https://skp2707.medium.com/pca-on-cancer-dataset-4d7a97f5fdb8

In [176]:
from sklearn.manifold import TSNE
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
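
The cell above only imports the libraries. One possible way to produce the requested plot is sketched below. It assumes that standardizing the features first is acceptable and uses `perplexity=30` as a starting value rather than a tuned one; `bunch`, `X_scaled`, and `embedding` are illustrative names. Plain matplotlib is used here, although seaborn would work equally well.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

bunch = load_breast_cancer()

# t-SNE is distance-based, so put all features on a comparable scale first.
X_scaled = StandardScaler().fit_transform(bunch.data)

# Project the 30-dimensional samples onto 2 dimensions.
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

# Draw one scatter layer per class so the legend shows the class names.
fig, ax = plt.subplots(figsize=(7, 5))
for label, name in enumerate(bunch.target_names):
    mask = bunch.target == label
    ax.scatter(embedding[mask, 0], embedding[mask, 1], s=12, alpha=0.7, label=str(name))
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
ax.set_title("Breast cancer dataset, t-SNE projection")
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. Different `perplexity` values can change the layout noticeably, so it is worth trying a few.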



3. Split **my_data** into train and test sets:

- Define X_train, X_test, Y_train, Y_test
- Choose **test_size** for splitting **my_data**
- Use **train_test_split** (for details: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html)

In [177]:
from sklearn.model_selection import train_test_split
# Separate features and labels, then hold out 20% of the samples for testing.
X = my_data.data
Y = my_data.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
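
The split above is purely random. As an optional variation (not what this notebook uses), `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both subsets. A minimal sketch, with `X_all`, `y_all`, `X_tr`, `X_te`, `y_tr`, and `y_te` as illustrative names:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the benign/malignant ratio the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

print("train:", X_tr.shape, "test:", X_te.shape)

# Label 1 is benign, so the mean of the labels is the benign share.
print("benign share (train):", round(float(np.mean(y_tr)), 3))
print("benign share (test): ", round(float(np.mean(y_te)), 3))
```

With 569 samples and `test_size=0.2`, the test set holds 114 samples, which matches the totals in the confusion matrices reported in step 7.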

4. Train **model_decision_tree**

- Library: sklearn.tree.DecisionTreeClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize DecisionTreeClassifier options   

In [178]:
from sklearn.tree import DecisionTreeClassifier

# Shallow tree with minimum split and leaf sizes to limit overfitting.
model_decision_tree = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=2,
    random_state=42,
)
model_decision_tree.fit(X_train, Y_train)
Y_pred_dt = model_decision_tree.predict(X_test)
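
The hyperparameters above were set by hand. One way to "explore and optimize" them more systematically is a cross-validated grid search, sketched below. The grid values are hypothetical and only illustrate the idea; `tree_search` and `param_grid` are illustrative names, and F1 is assumed as the scoring metric to match the evaluation in step 7.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Hypothetical search grid; the ranges are illustrative, not tuned values.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 2, 5],
}

# 5-fold cross-validation on the training split only, scored by F1.
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
tree_search.fit(X_tr, y_tr)

print("best params:", tree_search.best_params_)
print("best CV F1:", round(tree_search.best_score_, 3))
```

Because the search only sees the training split, the test set remains an unbiased check for the final comparison.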

5. Train model_random_forest
- Library: sklearn.ensemble.RandomForestClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize RandomForestClassifier options

In [179]:
from sklearn.ensemble import RandomForestClassifier

# model_random_forest = RandomForestClassifier(...)
model_random_forest = RandomForestClassifier(
    n_estimators=100,
    criterion='entropy',
    max_depth=4,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1
)
model_random_forest.fit(X_train, Y_train)
Y_pred_rf = model_random_forest.predict(X_test)
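
A random forest has more options than is practical to search exhaustively, so a randomized search over a few candidate values is one reasonable way to explore them. This is a minimal sketch under assumed candidate values; `forest_search` and `param_distributions` are illustrative names, and `n_iter=6` with 3 folds is chosen only to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Hypothetical candidate values; each trial samples one combination from these lists.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 4, 6, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
}

# Try 6 random combinations, each scored by 3-fold cross-validated F1.
forest_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=6,
    scoring="f1",
    cv=3,
    random_state=42,
)
forest_search.fit(X_tr, y_tr)

print("best params:", forest_search.best_params_)
print("best CV F1:", round(forest_search.best_score_, 3))
```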

6. Train model_adaboost

- Library: sklearn.ensemble.AdaBoostClassifier
- Data: X_train, Y_train
- **Essential**: explore and optimize AdaBoostClassifier options

In [180]:
from sklearn.ensemble import AdaBoostClassifier

# Note: algorithm='SAMME.R' is deprecated since scikit-learn 1.4 and scheduled
# for removal in 1.6; on newer versions, drop this argument (results may differ).
model_adaboost = AdaBoostClassifier(
    n_estimators=50,
    learning_rate=1.0,
    algorithm='SAMME.R',
    random_state=42,
)
model_adaboost.fit(X_train, Y_train)
Y_pred_ab = model_adaboost.predict(X_test)
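
For AdaBoost, the number of boosting rounds is one of the main options to explore. The sketch below uses `staged_score` to track validation accuracy after each round. It assumes a validation set carved out of the training data and an arbitrary `learning_rate=0.5`; `booster`, `val_accuracy`, and `best_rounds` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Carve a validation set out of the training data so the test set stays untouched.
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=42)

# `algorithm` is left at its default so the sketch runs across scikit-learn versions.
booster = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
booster.fit(X_fit, y_fit)

# staged_score yields the validation accuracy after each boosting round.
val_accuracy = np.array(list(booster.staged_score(X_val, y_val)))

# argmax returns the first round that reaches the highest validation accuracy.
best_rounds = int(np.argmax(val_accuracy)) + 1
print("best number of rounds:", best_rounds)
print("validation accuracy at that point:", round(float(val_accuracy.max()), 3))
```

The round count found this way could then be passed as `n_estimators` when refitting on the full training split.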




7. Evaluate model_decision_tree, model_random_forest, model_adaboost

- Library: sklearn.metrics
- Data: X_test, Y_test
- **Calculate** and **print** results of each classifier
- **Choose** the decisive metric
- **Compare** between the classifiers and declare the winner


In [181]:
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def eval_model(model_name, y_test, y_pred):
    """Print the standard classification metrics for one model."""
    print(f"--- {model_name} Evaluation ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\n")


y_pred_dt = model_decision_tree.predict(X_test)
eval_model("Decision Tree", Y_test, y_pred_dt)


y_pred_rf = model_random_forest.predict(X_test)
eval_model("Random Forest", Y_test, y_pred_rf)


y_pred_ab = model_adaboost.predict(X_test)
eval_model("AdaBoost", Y_test, y_pred_ab)


f1_dt = f1_score(Y_test, y_pred_dt)
f1_rf = f1_score(Y_test, y_pred_rf)
f1_ab = f1_score(Y_test, y_pred_ab)

f1_scores = {"Decision Tree": f1_dt, "Random Forest": f1_rf, "AdaBoost": f1_ab}
winner = max(f1_scores, key=f1_scores.get)

print("Decisive Metric: F1 Score")
print("Winner:", winner)

--- Decision Tree Evaluation ---
Accuracy: 0.9649122807017544
Precision: 0.9466666666666667
Recall: 1.0
F1 Score: 0.9726027397260274
Confusion Matrix:
 [[39  4]
 [ 0 71]]


--- Random Forest Evaluation ---
Accuracy: 0.9649122807017544
Precision: 0.958904109589041
Recall: 0.9859154929577465
F1 Score: 0.9722222222222222
Confusion Matrix:
 [[40  3]
 [ 1 70]]


--- AdaBoost Evaluation ---
Accuracy: 0.9736842105263158
Precision: 0.9722222222222222
Recall: 0.9859154929577465
F1 Score: 0.9790209790209791
Confusion Matrix:
 [[41  2]
 [ 1 70]]


Decisive Metric: F1 Score
Winner: AdaBoost
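
The precision, recall, and F1 values printed above use scikit-learn's default positive label, 1, which in this dataset is the benign class. If missed malignant cases are the main concern, recall for class 0 may be a more telling number. The sketch below derives it directly from the confusion matrices printed above (rows are true labels, columns are predicted labels); `confusion_matrices` and `malignant_recall` are illustrative names.

```python
import numpy as np

# Confusion matrices copied from the output above
# (rows: true label, columns: predicted label; 0 = malignant, 1 = benign).
confusion_matrices = {
    "Decision Tree": np.array([[39, 4], [0, 71]]),
    "Random Forest": np.array([[40, 3], [1, 70]]),
    "AdaBoost": np.array([[41, 2], [1, 70]]),
}

malignant_recall = {}
for name, cm in confusion_matrices.items():
    # Share of truly malignant samples that were predicted malignant.
    malignant_recall[name] = float(cm[0, 0] / cm[0].sum())
    print(f"{name}: malignant recall = {malignant_recall[name]:.3f}")
```

The same quantity can be computed from predictions with `recall_score(Y_test, y_pred, pos_label=0)`. Note also that the three models differ by only one or two test samples here, so the ranking may change with a different `random_state`.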
