Mbhele S; 22339019

Loading the Dataset

In [11]:
from sklearn.datasets import load_breast_cancer

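# Load the breast cancer dataset bundled with scikit-learn
data = load_breast_cancer()

X = data.data      # feature matrix
y = data.target    # class labels (0 or 1)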

print("Features shape:", X.shape)
print("Labels shape:", y.shape)

print("Feature names:", data.feature_names)


Features shape: (569, 30)
Labels shape: (569,)
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
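
Since the evaluation later reports metrics per class, it helps to confirm which integer label stands for which diagnosis. A minimal check, using the `target_names` attribute of the loaded dataset (the names `label_map` and `class_counts` are only illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Map each integer label to its class name
label_map = {int(i): str(name) for i, name in enumerate(data.target_names)}
print("Label encoding:", label_map)

# Count how many samples fall in each class
labels, counts = np.unique(data.target, return_counts=True)
class_counts = {label_map[int(k)]: int(v) for k, v in zip(labels, counts)}
print("Class counts:", class_counts)
```

In scikit-learn's copy of this dataset, label 0 corresponds to malignant and label 1 to benign.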


Splitting the Data and Training the Models

In [12]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

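# Hold out 30% of the samples for testing; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# max_iter is raised so the solver can converge on the unscaled features
log_model = LogisticRegression(max_iter=10000)
log_model.fit(X_train, y_train)

# No random_state is set here, so the forest's results can vary slightly between runs
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)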


Making Predictions and Evaluating the Models


In [13]:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred_log = log_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)

print("🔎 Logistic Regression Evaluation:")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))
print("Classification Report:\n", classification_report(y_test, y_pred_log))
print("Accuracy:", accuracy_score(y_test, y_pred_log))

print("\n🌲 Random Forest Evaluation:")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
print("Accuracy:", accuracy_score(y_test, y_pred_rf))


🔎 Logistic Regression Evaluation:
Confusion Matrix:
 [[ 61   2]
 [  2 106]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97        63
           1       0.98      0.98      0.98       108

    accuracy                           0.98       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171

Accuracy: 0.9766081871345029

🌲 Random Forest Evaluation:
Confusion Matrix:
 [[ 59   4]
 [  2 106]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.94      0.95        63
           1       0.96      0.98      0.97       108

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171

Accuracy: 0.9649122807017544


### Model Evaluation Summary

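Two classification models were evaluated on a 30% hold-out test set (171 samples) from the breast cancer dataset. In scikit-learn's encoding of this dataset, class 0 is malignant and class 1 is benign: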

---

#### Logistic Regression:
- **Accuracy**: 97.66%
- **Precision**:
  - Class 0 (malignant): 97%
  - Class 1 (benign): 98%
- **Recall**:
  - Class 0 (malignant): 97%
  - Class 1 (benign): 98%
- **F1-score**: 97% for class 0 and 98% for class 1
- **Confusion Matrix**:
  - 4 misclassifications out of 171: 2 malignant cases predicted as benign and 2 benign cases predicted as malignant

**Interpretation**:  
Logistic Regression performed **very well**, with **high accuracy**, **balanced precision and recall**, and very few errors. It's especially impressive given that it's a simple model.
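
The precision, recall, F1 and accuracy figures can be recomputed by hand from the confusion matrix printed for Logistic Regression. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in label order 0, 1); the helper name `per_class_metrics` is hypothetical:

```python
# Confusion matrix reported for Logistic Regression
# (rows = true class, columns = predicted class, label order 0, 1)
cm = [[61, 2],
      [2, 106]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column total: everything predicted as k
    actual_k = sum(cm[k])                     # row total: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Swapping in the Random Forest matrix `[[59, 4], [2, 106]]` gives that model's figures in the same way.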



#### Random Forest:
- **Accuracy**: 96.49%
- **Precision**:
  - Class 0 (malignant): 97%
  - Class 1 (benign): 96%
- **Recall**:
  - Class 0 (malignant): 94%
  - Class 1 (benign): 98%
- **F1-score**: 95% for class 0 and 97% for class 1
- **Confusion Matrix**:
  - 4 malignant cases wrongly predicted as benign
  - 2 benign cases wrongly predicted as malignant

**Interpretation**:  
Random Forest also performed well, especially in **identifying benign cases** (98% recall). However, it **missed more malignant cases** than Logistic Regression (4 versus 2), which lowers its recall on the malignant class to 94%.


While both models gave **high accuracy**, the **Logistic Regression model slightly outperformed Random Forest** on this test split. It misclassified **fewer malignant cases as benign** (2 versus 4), had **more balanced precision and recall**, and achieved higher F1-scores for both classes (0.97 and 0.98 versus 0.95 and 0.97).
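
The two models differ by only two test samples, so a single split is thin evidence for ranking them. One way to check whether the ordering holds up is k-fold cross-validation. The following is a rough sketch, not part of the original run: the choice of 5 stratified folds and the seed 42 are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Assumed settings: 5 stratified folds and a fixed seed, chosen only for illustration
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.4f} (std = {scores.std():.4f})")
```

If the mean accuracies fall within a standard deviation of each other, the gap seen on the single split may simply reflect which samples landed in the test set.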

**Preferred Model**: Logistic Regression  
It is the simpler model and gave slightly more accurate results on this test split.


### Note:
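In real-world applications like medical diagnosis, minimizing missed malignant cases (**false negatives** when malignant is treated as the positive class) is critical. Logistic Regression did better in that regard, missing 2 of 63 malignant cases compared with 4 of 63 for Random Forest, and it also provided a **more balanced performance** across both classes.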
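
Because malignant is label 0 in this dataset, scikit-learn's default binary metrics (which treat label 1 as positive) do not directly report the miss rate for cancers. A small sketch of how malignant recall could be computed with `recall_score` and `pos_label=0`: it rebuilds label arrays from the two reported confusion matrices so that it runs on its own, and the helper `labels_from_confusion_matrix` is hypothetical.

```python
import numpy as np
from sklearn.metrics import recall_score

def labels_from_confusion_matrix(cm):
    """Rebuild (y_true, y_pred) arrays that reproduce a given confusion matrix."""
    y_true, y_pred = [], []
    for true_label, row in enumerate(cm):
        for pred_label, count in enumerate(row):
            y_true += [true_label] * count
            y_pred += [pred_label] * count
    return np.array(y_true), np.array(y_pred)

# Confusion matrices as printed in the evaluation output
reported = {
    "Logistic Regression": [[61, 2], [2, 106]],
    "Random Forest": [[59, 4], [2, 106]],
}

malignant_recall = {}
for name, cm in reported.items():
    y_true, y_pred = labels_from_confusion_matrix(cm)
    # pos_label=0 treats malignant (label 0) as the positive class
    malignant_recall[name] = recall_score(y_true, y_pred, pos_label=0)
    print(f"{name}: malignant recall = {malignant_recall[name]:.4f}")
```

Inside the notebook, the same call can be applied to the real predictions, for example `recall_score(y_test, y_pred_log, pos_label=0)`.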
