# Breast Cancer Classification - Supervised Learning Assessment

## Step 1: Loading Required Libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

## Step 2: Loading and Preprocessing the Dataset

In [4]:
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
print(X.head())
print(y.head())


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0           

In [6]:
# Check for missing values
print("Missing values in dataset:", X.isnull().sum().sum()) 


Missing values in dataset: 0


In [12]:
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
print("Dataset loaded and preprocessed successfully.")


Dataset loaded and preprocessed successfully.


### Preprocessing Steps and Justification

1. **Loading the Dataset**:
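   - The breast cancer dataset was loaded from `sklearn.datasets`. It contains 569 samples with 30 numeric features used to predict whether a tumor is malignant or benign. In scikit-learn's encoding, target `0` is malignant and `1` is benign, which is how the classes appear in the classification reports.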

2. **Checking for Missing Values**:
   - `X.isnull().sum().sum()` was used to check for any missing values. The result was 0, which means the dataset is clean and does not require imputation.

3. **Feature Scaling**:
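   - `StandardScaler` was applied so that each feature has a mean of 0 and a standard deviation of 1.
   - **Why?** Many machine learning algorithms (like SVM, k-NN, and logistic regression) are sensitive to the scale of the input features. Without scaling, features with large numeric ranges (such as `mean area`) would dominate distance and optimization calculations.
   - **Caveat**: The scaler here was fit on the full dataset before the split, so test-set statistics influence the scaling. Fitting it on the training split only gives a cleaner estimate of generalization.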

4. **Train-Test Split**:
   - The dataset was split into 80% training and 20% testing data using `train_test_split` with a fixed random state for reproducibility.
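
A minimal sketch of the split-then-scale order, assuming the same dataset and split parameters as this notebook. Wrapping the scaler in a pipeline means it is fit on the training rows only. The names `pipe`, `X_tr`, `X_te`, `y_tr`, `y_te`, and the `max_iter=1000` setting are illustrative choices, and the resulting score may differ slightly from the ones reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# training means and standard deviations to transform X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

The same pattern works for any of the classifiers below by swapping the final pipeline step.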


## Step 3: Implementing Classification Algorithms

**1. Logistic Regression**:
   - **How it works**: Logistic Regression models the probability that a given input belongs to a certain class using a logistic function (sigmoid).
   - **Suitability**: It is suitable for binary classification problems like this one (malignant vs. benign). It is simple, interpretable, and often serves as a strong baseline.
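
The sigmoid step can be sketched in a few lines of plain Python. This is an illustration of the idea, not scikit-learn's implementation; the weights, bias, and two-feature sample are hypothetical values chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights and a hypothetical two-feature sample.
weights = [0.8, -1.2]
bias = 0.1
sample = [0.5, 0.25]

p = predict_proba(sample, weights, bias)
# A probability of at least 0.5 is mapped to class 1, otherwise class 0.
label = 1 if p >= 0.5 else 0
print(f"p = {p:.3f}, predicted label = {label}")
```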

In [22]:
results = {}

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)
results['Logistic Regression'] = accuracy_score(y_test, y_pred_lr)
print("\n1. Logistic Regression:")
print("Logistic Regression models the probability of the positive class using a sigmoid function.")
print(classification_report(y_test, y_pred_lr))



1. Logistic Regression:
Logistic Regression models the probability of the positive class using a sigmoid function.
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



**2. Decision Tree Classifier**:
   - **How it works**: This algorithm splits the data recursively based on feature thresholds, forming a tree where each internal node represents a decision based on a feature.
   - **Suitability**: It can handle non-linear relationships and does not require feature scaling. It is easy to visualize and interpret.
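
To make the threshold search concrete, here is a minimal sketch that picks the best split on a single feature using Gini impurity (the default criterion of `DecisionTreeClassifier`). The helper names `gini` and `best_threshold` and the toy data are hypothetical; a real tree repeats this search over every feature and recurses into each side.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best split on one feature."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy one-feature data (hypothetical values, not from the dataset).
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)  # 6.5 0.0
```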

In [34]:
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
y_pred_dt = dtree.predict(X_test)
results['Decision Tree'] = accuracy_score(y_test, y_pred_dt)

print("\n2. Decision Tree Classifier:")
print("Decision Trees split the data recursively based on feature thresholds to make predictions.")
print(classification_report(y_test, y_pred_dt))


2. Decision Tree Classifier:
Decision Trees split the data recursively based on feature thresholds to make predictions.
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



**3. Random Forest Classifier**:
   - **How it works**: An ensemble method that builds multiple decision trees, each on a bootstrap sample of the training data, and combines their outputs (usually by majority voting) to improve accuracy and reduce overfitting.
   - **Suitability**: Random Forests are robust to noisy features and usually perform well on tabular data with little tuning.
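
The aggregation step can be sketched as a simple majority vote. The votes below are hypothetical, and this shows the textbook scheme only: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree labels for one sample into a single label."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for three samples.
votes_per_sample = [
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0],
]

forest_predictions = [majority_vote(v) for v in votes_per_sample]
print(forest_predictions)  # [1, 0, 1]
```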


In [46]:
rforest = RandomForestClassifier(random_state=42)
rforest.fit(X_train, y_train)
y_pred_rf = rforest.predict(X_test)
results['Random Forest'] = accuracy_score(y_test, y_pred_rf)

print("\n3. Random Forest Classifier:")
print("Random Forest uses multiple decision trees and aggregates their output for better performance.")
print(classification_report(y_test, y_pred_rf))



3. Random Forest Classifier:
Random Forest uses multiple decision trees and aggregates their output for better performance.
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



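**4. Support Vector Machine (SVM)**:
   - **How it works**: SVM finds the hyperplane that best separates the classes by maximizing the margin between them. `SVC()` uses an RBF kernel by default, so here the separating hyperplane lies in the kernel-induced feature space rather than the original 30-dimensional space.
   - **Suitability**: SVM is effective in high-dimensional spaces, which is useful here since the dataset has 30 features. It performs best when the classes are separated by a clear margin.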
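
For the linear case, the hyperplane and its margin can be written out directly. The sketch below assumes an already-trained linear model with hypothetical weights `w` and bias `b`; it only evaluates the decision rule and the margin width `2 / ||w||`, and does not perform the optimization that finds them.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters and sample points.
w = [3.0, 4.0]
b = -5.0
points = [[3.0, 4.0], [0.0, 0.0], [2.0, 1.0]]

predictions = [1 if decision_value(p, w, b) >= 0 else -1 for p in points]
print(predictions)      # [1, -1, 1]
print(margin_width(w))  # 0.4
```

Maximizing the margin is equivalent to minimizing the norm of `w`, which is why a smaller weight vector corresponds to a wider gap between the classes.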


In [48]:
svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
results['SVM'] = accuracy_score(y_test, y_pred_svm)

print("\n4. Support Vector Machine (SVM):")
print("SVM finds the optimal hyperplane that separates classes with maximum margin.")
print(classification_report(y_test, y_pred_svm))



4. Support Vector Machine (SVM):
SVM finds the optimal hyperplane that separates classes with maximum margin.
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



**5. k-Nearest Neighbors (k-NN)**:
   - **How it works**: Classifies a sample based on the most common class among its k nearest neighbors in the feature space.
   - **Suitability**: Simple and intuitive. However, it can be sensitive to the choice of `k` and requires feature scaling due to distance-based calculation.
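
A from-scratch sketch of the voting rule, using Euclidean distance and hypothetical toy points. The function name `knn_predict` is illustrative; `KNeighborsClassifier` follows the same idea with faster neighbor search.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Label a query point by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training set with two well-separated classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
print(near_origin, near_far_cluster)  # 0 1
```

Because the vote depends on raw distances, a feature measured on a larger scale would dominate `math.dist`, which is the reason scaling matters for this model.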

In [50]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
results['k-NN'] = accuracy_score(y_test, y_pred_knn)

print("\n5. k-Nearest Neighbors (k-NN):")
print("k-NN classifies a data point based on majority class among its k nearest neighbors.")
print(classification_report(y_test, y_pred_knn))




5. k-Nearest Neighbors (k-NN):
k-NN classifies a data point based on majority class among its k nearest neighbors.
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



## Step 4: Model Comparison


In [53]:
print("\nModel Comparison (Accuracy Scores):")
for model, acc in results.items():
    print(f"{model}: {acc:.4f}")



Model Comparison (Accuracy Scores):
Logistic Regression: 0.9737
Decision Tree: 0.9474
Random Forest: 0.9649
SVM: 0.9737
k-NN: 0.9474
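
To put these accuracies in perspective, the sketch below converts them back into counts on the 114-sample test set. The dictionary simply restates the scores printed above; the name `reported_accuracy` is illustrative.

```python
# Accuracy scores as printed in the model comparison output.
reported_accuracy = {
    "Logistic Regression": 0.9737,
    "Decision Tree": 0.9474,
    "Random Forest": 0.9649,
    "SVM": 0.9737,
    "k-NN": 0.9474,
}
n_test = 114  # support reported in each classification report

errors = {}
for model, acc in reported_accuracy.items():
    correct = round(acc * n_test)
    errors[model] = n_test - correct
    print(f"{model}: {correct}/{n_test} correct, {errors[model]} misclassified")
```

The best and worst models differ by only three test samples, so the ranking could change with a different random split.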


In [55]:
# Identifying best and worst
best_model = max(results, key=results.get)
worst_model = min(results, key=results.get)

print(f"\nBest Performing Model: {best_model} with accuracy {results[best_model]:.4f}")
print(f"Worst Performing Model: {worst_model} with accuracy {results[worst_model]:.4f}")


Best Performing Model: Logistic Regression with accuracy 0.9737
Worst Performing Model: Decision Tree with accuracy 0.9474
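
Note that `max` and `min` return the first key encountered when several models tie, so SVM (tied for best) and k-NN (tied for worst) are not reported above. A minimal tie-aware sketch, using the accuracy values printed in the comparison, could look like this:

```python
import math

# Accuracy scores as printed in the model comparison output.
results = {
    "Logistic Regression": 0.9737,
    "Decision Tree": 0.9474,
    "Random Forest": 0.9649,
    "SVM": 0.9737,
    "k-NN": 0.9474,
}

best_score = max(results.values())
worst_score = min(results.values())

# Collect every model whose score matches the extreme value.
best_models = [m for m, a in results.items() if math.isclose(a, best_score)]
worst_models = [m for m, a in results.items() if math.isclose(a, worst_score)]

print(f"Best: {', '.join(best_models)} ({best_score:.4f})")
print(f"Worst: {', '.join(worst_models)} ({worst_score:.4f})")
```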
