## ML Assignment / Classification

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1. Loading and Preprocessing

In [6]:
# Load the dataset
data = load_breast_cancer()

# Check the dataset structure
print(type(data))  # Bunch: a dict-like container with attribute access
print(data.keys())  # Shows available components

<class 'sklearn.utils._bunch.Bunch'>
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


In [9]:
print("Data shape:", df.shape)
print("\nData types:\n", df.dtypes)
print("\nMissing values:\n", df.isnull().sum())
print("\n Statistical information:\n", df.describe())

Data shape: (569, 31)

Data types:
 mean radius                float64
mean texture               float64
mean perimeter             float64
mean area                  float64
mean smoothness            float64
mean compactness           float64
mean concavity             float64
mean concave points        float64
mean symmetry              float64
mean fractal dimension     float64
radius error               float64
texture error              float64
perimeter error            float64
area error                 float64
smoothness error           float64
compactness error          float64
concavity error            float64
concave points error       float64
symmetry error             float64
fractal dimension error    float64
worst radius               float64
worst texture              float64
worst perimeter            float64
worst area                 float64
worst smoothness           float64
worst compactness          float64
worst concavity            float64
worst concave point

In [8]:
# Create DataFrame for easier analysis
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Check DataFrame info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

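This is a clean, processed version of the Breast Cancer Wisconsin (Diagnostic) dataset.

The features are already numeric (all `float64`) but are not normalized: they sit on very different scales, which is why `StandardScaler` is applied before training.

There are no missing values: `df.info()` reports 569 non-null entries for every column shown, and the dataset description (`data.DESCR`) lists none.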
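A quick way to check both points (missing values and feature scale) is sketched below. The variable names `n_missing` and `feature_std` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Total number of missing cells across all feature columns
n_missing = int(df.isnull().sum().sum())

# Per-feature standard deviation; a wide spread between the smallest and
# largest value means the features are on very different scales
feature_std = df.std()

print("Missing cells:", n_missing)
print("Smallest feature std:", feature_std.min())
print("Largest feature std:", feature_std.max())
```

A large gap between the smallest and largest standard deviation is the usual signal that distance-based and margin-based models (k-NN, SVM) will need scaled inputs.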

In [17]:
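# Separate the features from the target column
X = df.drop(columns='target')
y = df['target']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling: fit on the training set only, then reuse that transform on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)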

### Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

#### Description:
Logistic regression models the probability of the target class using a logistic function. It's a linear model that works well for binary classification problems like this one.

#### Suitability:
Good baseline model for binary classification. Works well when there's a roughly linear decision boundary.
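To make the "logistic function" concrete, the sketch below recomputes the predicted probability by hand from the model's linear score and compares it with `predict_proba`. For brevity it fits on the full scaled dataset rather than the train/test split used in this notebook; that simplification and the names `scores` and `manual_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

clf = LogisticRegression(max_iter=10000).fit(X_scaled, y)

# Linear score w.x + b for each sample
scores = clf.decision_function(X_scaled)

# The logistic (sigmoid) function maps each score to P(class 1)
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = clf.predict_proba(X_scaled)[:, 1]

print("Manual and library probabilities agree:",
      np.allclose(manual_proba, sklearn_proba))
```

A score of 0 corresponds to a probability of 0.5, which is the default decision threshold, so the decision boundary is the set of points where the linear score is zero.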

### Decision Tree Classifier

In [12]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

#### Description:
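Decision trees split the feature space recursively, choosing at each step the feature threshold that most reduces impurity (Gini impurity by default in scikit-learn; entropy, i.e. information gain, is available via `criterion`).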

#### Suitability:
Can capture non-linear relationships and doesn't require feature scaling. Prone to overfitting when the tree is grown to full depth, as it is with the default settings used here.
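One common way to see the overfitting risk is to compare an unrestricted tree with a depth-limited one. The sketch below does that on the same split parameters as this notebook; `max_depth=3` is an arbitrary illustrative choice, not a tuned value, and whether the shallower tree scores better on the test set depends on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: grows until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Depth-limited tree: a simple form of regularization
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: depth={tree.get_depth()}, "
          f"train acc={tree.score(X_train, y_train):.3f}, "
          f"test acc={tree.score(X_test, y_test):.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the typical symptom of overfitting.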

### Random Forest Classifier

In [13]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

#### Description:
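An ensemble method that builds multiple decision trees on bootstrap samples of the training data and combines their predictions (scikit-learn averages the trees' predicted class probabilities rather than taking a hard majority vote).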

#### Suitability:
Typically performs better than single decision trees by reducing overfitting. Handles non-linear relationships well.
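The combination step can be reproduced by hand: in scikit-learn, the forest's predicted probabilities are the mean of the individual trees' predicted probabilities. The sketch below checks this; `n_estimators=25` is chosen only to keep the example fast, and the names `tree_probas` and `mean_proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_train, y_train)

# Stack each tree's class probabilities: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])

# Average over the trees to get the ensemble's probabilities
mean_proba = tree_probas.mean(axis=0)

print("Matches rf.predict_proba:",
      np.allclose(mean_proba, rf.predict_proba(X_test)))
```

Averaging many trees that each saw a different bootstrap sample smooths out the idiosyncrasies of any single tree, which is where the reduction in overfitting comes from.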

### Support Vector Machine (SVM)

In [14]:
from sklearn.svm import SVC

svm = SVC(probability=True, random_state=42)
svm.fit(X_train, y_train)

#### Description:
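Finds the hyperplane that separates the classes with the largest margin. With the default RBF kernel used here, that hyperplane lives in an implicit transformed feature space, so the boundary can be non-linear in the original features.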

#### Suitability:
Effective in high-dimensional spaces. Works well when there's a clear margin of separation.
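As a rough illustration of the hyperplane idea, the sketch below fits a linear-kernel SVM (an assumption made for readability; the notebook itself uses the default RBF kernel) and shows that the predicted class is determined by which side of the hyperplane a sample falls on, i.e. the sign of `decision_function`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

svm_linear = SVC(kernel="linear").fit(X_train_s, y_train)

# Signed score relative to the separating hyperplane
scores = svm_linear.decision_function(X_test_s)
# Positive side -> class 1, negative side -> class 0
pred_from_sign = (scores > 0).astype(int)

print("Sign of score matches predict():",
      np.array_equal(pred_from_sign, svm_linear.predict(X_test_s)))
print("Support vectors per class:", svm_linear.n_support_)
```

Only the support vectors, the training points on or inside the margin, determine where the hyperplane sits; the remaining training points could be removed without changing the fitted boundary.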

### k-Nearest Neighbors (k-NN)

In [15]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

#### Description:
Classifies samples based on the majority class among their k nearest neighbors in feature space.

#### Suitability:
Simple and effective for small to medium-sized datasets. Benefits from feature scaling.
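The majority-vote rule can be reproduced directly from the neighbor indices. The sketch below assumes the scikit-learn defaults (`n_neighbors=5`, uniform weights); `manual_pred` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale first so no single large-valued feature dominates the distances
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# Indices of the 5 nearest training samples for each test sample
_, neighbor_idx = knn.kneighbors(X_test_s)
neighbor_labels = y_train[neighbor_idx]  # shape (n_test, 5)

# Majority vote: class 1 wins if at least 3 of the 5 neighbors are class 1
manual_pred = (neighbor_labels.sum(axis=1) >= 3).astype(int)

print("Manual vote matches predict():",
      np.array_equal(manual_pred, knn.predict(X_test_s)))
```

Because the vote depends entirely on Euclidean distances, an unscaled feature with a large numeric range (such as the area features) would otherwise dominate the choice of neighbors.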

In [19]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train all models
models = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'k-NN': KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)

# 4. Evaluation metrics 
results = []
for name, model in models.items():
    y_pred = model.predict(X_test)
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred)
    })


### Model Comparison

#### Evaluation Metrics

In [20]:
# Convert results to DataFrame for nice display
results_df = pd.DataFrame(results)
results_df

                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.982456   0.990654  0.981481  0.986047
1        Decision Tree  0.935673   0.970874  0.925926  0.947867
2        Random Forest  0.964912   0.963636  0.981481  0.972477
3                  SVM  0.976608   0.981481  0.981481  0.981481
4                 k-NN  0.959064   0.963303  0.972222  0.967742


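### Conclusion
Logistic regression is the strongest model on this test split: it has the highest accuracy (0.982), precision (0.991), and F1 score (0.986), and it ties with random forest and SVM for the highest recall (0.981).
The decision tree is the weakest: it has the lowest accuracy (0.936), recall (0.926), and F1 score (0.948), although its precision (0.971) is still reasonable.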
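To see what precision and recall measure, the sketch below derives both from the confusion matrix for a logistic regression model and checks them against the scikit-learn helpers. Note that in this dataset label 1 means benign and label 0 means malignant, so with the default `pos_label=1` these scores describe how well the benign class is identified. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=10000).fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_test))

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found

print(f"precision={precision:.6f}, recall={recall:.6f}")
```

If the goal is to avoid missing malignant cases, it may be more informative to report recall for class 0 as well, for example with `recall_score(y_test, y_pred, pos_label=0)`.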