# Loading and Preprocessing

In [2]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
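
Before preprocessing, it can help to confirm what was loaded. The sketch below repeats the loading step and prints the table shape and the class balance; the names `class_counts` and `name="target"` are illustrative additions, not part of the original cell.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset the same way as the cell above
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")  # the Series name is an illustrative choice

# Number of samples and features
print(X.shape)

# Samples per class, labelled with the dataset's own class names
class_counts = y.value_counts().sort_index()
for label, count in class_counts.items():
    print(data.target_names[label], count)
```

The classes are moderately imbalanced, which is worth keeping in mind when accuracy is the only metric reported.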

In [3]:
# Check for missing values
print(X.isnull().sum().sum())  # Should be 0 for this dataset


0


In [4]:
# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Explanation:

The Breast Cancer dataset from sklearn does not contain missing values, so no imputation is needed.

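Feature scaling with StandardScaler matters for models such as SVM and k-NN, which rely on distances or dot products between samples and are therefore sensitive to the scale of the input features. It also helps the Logistic Regression solver converge. Tree-based models (Decision Tree, Random Forest) are largely unaffected by it. Note that the scaler here is fitted on the full dataset before the train/test split, so test-set statistics influence the scaling; fitting it on the training split only avoids this leakage.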
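
The scaling cell above fits StandardScaler on all rows, and the train/test split only happens afterwards. A minimal sketch of one alternative, assuming the same split parameters as the notebook, is to split first and let a pipeline fit the scaler on the training rows only. The names `X_raw`, `X_tr`, `X_te`, `pipe`, and `fitted_scaler` are hypothetical and chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split the raw (unscaled) features first
X_raw, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses those
# statistics to transform X_te at prediction time
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_acc = pipe.score(X_te, y_te)

# Check that the scaler's means come from the training rows alone
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print(pipe_acc)
```

On a dataset this small the effect on accuracy is usually minor, but the pipeline form also makes cross-validation safe, because the scaler is refitted inside every fold.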

# Classification Algorithm Implementation

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


## Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_pred)


Description: Logistic Regression models the probability of class membership using the logistic function. It works well for binary classification problems like this one.
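
To make the role of the logistic function concrete, the sketch below recomputes the predicted probabilities by hand from the linear score and compares them with `predict_proba`. It is an illustration fitted on the full scaled dataset, not the notebook's train/test workflow; `X_std`, `clf`, `z`, and `manual_proba` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_raw, y_all = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X_raw)

clf = LogisticRegression().fit(X_std, y_all)

# Linear score for the first five samples: w . x + b
z = clf.decision_function(X_std[:5])

# Logistic (sigmoid) function applied to the score gives P(class 1)
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = clf.predict_proba(X_std[:5])[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

`predict` then thresholds this probability at 0.5, which is the same as checking whether the linear score is positive.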

## Decision Tree Classifier

In [7]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)


Description: Decision Trees split the data based on feature thresholds. They are interpretable and handle non-linear relationships but can overfit easily.
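
One way to look for overfitting is to compare training and test accuracy for an unrestricted tree and a depth-limited one. This is a rough sketch under the same split parameters as the notebook; the depth limit of 3 is an arbitrary illustrative value, and `full_tree` and `shallow_tree` are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Trees do not need scaled features, so the raw values are used here
X_raw, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# A large gap between train and test accuracy suggests overfitting
for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "train:", round(tree.score(X_tr, y_tr), 4),
        "test:", round(tree.score(X_te, y_te), 4),
    )
```

Whether the shallower tree generalizes better depends on the split, so the depth limit is best chosen by cross-validation rather than fixed in advance.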

## Random Forest Classifier

In [8]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)


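Description: Random Forest is an ensemble method that trains many Decision Trees on bootstrap samples with random feature subsets and combines their votes. This usually improves accuracy and reduces overfitting compared with a single tree.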
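
The ensemble can be inspected after fitting. The sketch below, assuming the same split parameters as the notebook, counts the individual trees and ranks features by impurity-based importance. `n_estimators=100` is set explicitly here (it is also the default in recent scikit-learn releases); `rf_demo` and `importances` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Each element of estimators_ is one fitted DecisionTreeClassifier
n_trees = len(rf_demo.estimators_)
print(n_trees)

# Impurity-based importances, averaged over the trees and summing to 1
importances = pd.Series(
    rf_demo.feature_importances_, index=data.feature_names
).sort_values(ascending=False)
print(importances.head(5))
```

Impurity-based importances can be biased toward features with many distinct values, so they are a starting point for inspection rather than a definitive ranking.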

## Support Vector Machine (SVM)

In [9]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
svm_acc = accuracy_score(y_test, svm_pred)


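Description: SVM finds the hyperplane that separates the classes with the largest margin. SVC() uses an RBF kernel by default, so the decision boundary here can be non-linear in the original feature space. SVMs perform well on high-dimensional data, particularly when the classes are close to separable.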
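
As a rough illustration of the kernel choice, the sketch below fits `SVC` with its default kernel and with a linear kernel on features scaled from the training split, then reports how many support vectors each one keeps. The names `rbf_svm`, `linear_svm`, `X_tr_s`, and `X_te_s` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

rbf_svm = SVC().fit(X_tr_s, y_tr)                    # default kernel
linear_svm = SVC(kernel="linear").fit(X_tr_s, y_tr)  # explicit hyperplane

for name, model in [("default", rbf_svm), ("linear", linear_svm)]:
    print(
        name,
        "kernel:", model.kernel,
        "support vectors:", int(model.n_support_.sum()),
        "test accuracy:", round(model.score(X_te_s, y_te), 4),
    )
```

Only the linear kernel exposes the hyperplane weights directly through `coef_`; with the RBF kernel the boundary is defined implicitly by the support vectors.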

## k-Nearest Neighbors (k-NN)

In [10]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
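knn_acc = accuracy_score(y_test, knn_pred)


Description: k-NN classifies a sample by majority vote among its k nearest training samples (KNeighborsClassifier uses k = 5 by default). It needs no training beyond storing the data, but it is sensitive to feature scale and to the choice of k.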
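
The k-NN cell above relies on the default number of neighbors. A hedged sketch of how k could be chosen instead is to cross-validate a few candidate values on the training split only. The candidate list `k_values` is an arbitrary illustrative choice, and `cv_scores` and `best_k` are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=42
)

k_values = [1, 3, 5, 7, 9, 11]  # illustrative candidates
cv_scores = {}
for k in k_values:
    # Scaling lives inside the pipeline so it is refitted in every fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
for k, score in cv_scores.items():
    print(k, round(score, 4))
print("best k:", best_k)
```

The test split is left untouched during the search, so it can still give an unbiased accuracy estimate for the chosen k afterwards.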


# Model Comparison

In [11]:
results = {
    "Logistic Regression": lr_acc,
    "Decision Tree": dt_acc,
    "Random Forest": rf_acc,
    "SVM": svm_acc,
    "k-NN": knn_acc
}

results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy']).sort_values(by='Accuracy', ascending=False)
print(results_df)


                 Model  Accuracy
0  Logistic Regression  0.973684
3                  SVM  0.973684
2        Random Forest  0.964912
1        Decision Tree  0.947368
4                 k-NN  0.947368
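
The accuracies above are easier to interpret as error counts. The sketch below converts them using the size of the test split, assuming the dataset's 569 samples and scikit-learn's rule of rounding a fractional `test_size` up to a whole number of test rows; `n_test` and `errors` are hypothetical names.

```python
import math

n_samples = 569                       # rows in the Breast Cancer dataset
n_test = math.ceil(0.2 * n_samples)   # test_size=0.2 is rounded up

# Accuracies copied from the results table above
accuracies = {
    "Logistic Regression": 0.973684,
    "SVM": 0.973684,
    "Random Forest": 0.964912,
    "Decision Tree": 0.947368,
    "k-NN": 0.947368,
}

# Misclassified test samples implied by each accuracy
errors = {
    model: round((1 - acc) * n_test) for model, acc in accuracies.items()
}

print("test samples:", n_test)
for model, n_err in errors.items():
    print(model, n_err)
```

With a test split this small, each misclassified sample moves accuracy by a little under one percentage point, so small differences between models should be read with caution.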


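## Best Performing Model: Logistic Regression and SVM tie at 0.9737 accuracy on this split, followed by Random Forest at 0.9649. The ranking can change with a different split.

## Worst Performing Model: Decision Tree and k-NN tie at 0.9474. A single Decision Tree tends to overfit, and k-NN is sensitive to noisy features and to the choice of k. The gap to the best models is only three test samples out of 114.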
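
Because the ranking depends on a single split, a more stable comparison can be sketched with 5-fold cross-validation over the whole dataset. This is an illustration, not the notebook's evaluation: the fold count, the pipeline wrapping, and the names `models` and `cv_means` are assumptions made here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_raw, y_all = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a scaler inside the pipeline, refitted per fold
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean accuracy across the 5 folds for each model
cv_means = {
    name: cross_val_score(model, X_raw, y_all, cv=5).mean()
    for name, model in models.items()
}

cv_df = pd.DataFrame(
    list(cv_means.items()), columns=["Model", "CV Accuracy"]
).sort_values(by="CV Accuracy", ascending=False)
print(cv_df)
```

Reporting the standard deviation across folds alongside the mean would show whether the differences between models exceed the fold-to-fold variation.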

