
# **Classification Algorithms Model Building**

## **Loading and Preprocessing**

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

In [2]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
print('X Shape:', X.shape)
print('y Shape:', y.shape)
X.head()

X Shape: (569, 30)
y Shape: (569,)


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [3]:
print('Missing values per column:')
print(X.isnull().sum().sort_values(ascending=False))

Missing values per column:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
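# Hold out 20% for testing; stratify=y keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Fit the scaler on the training split only so test-set statistics do not leak into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)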
print('Train shape:', X_train_scaled.shape)
print('Test shape :', X_test_scaled.shape)

Train shape: (455, 30)
Test shape : (114, 30)


**Explanation:**

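Missing values: The dataset contains no missing values, so no imputation is needed.

Feature scaling: StandardScaler is fit on the training split only and then applied to both splits, so no test-set statistics leak into training. Scaling matters for k-NN and the RBF-kernel SVM, which rely on distances between samples, and for Logistic Regression, whose regularization and solver convergence are sensitive to feature magnitudes. The tree-based models are insensitive to feature scale and are trained on the unscaled data.

Train-test split: An 80/20 stratified split keeps the class proportions approximately equal in the training and test sets.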
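
The two preprocessing claims above can be checked directly. The following is a minimal sketch that reuses the same split parameters as the notebook; the variable names `full_ratio`, `train_ratio`, and `test_ratio` are introduced here for illustration and do not appear in the original cells.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of class 1 in the full data and in each split (illustrative names)
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"class-1 share: full={full_ratio:.3f} train={train_ratio:.3f} test={test_ratio:.3f}")

# Fit on the training split only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features end up with mean ~0 and std ~1; test features are only close
print("max |train mean|:", np.abs(X_train_scaled.mean(axis=0)).max())
print("max |test mean| :", np.abs(X_test_scaled.mean(axis=0)).max())
```

The test-set means are not exactly zero because the scaler never saw the test data, which is the intended behavior.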

## **Classification Algorithms**

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
models = {
    'Logistic Regression': LogisticRegression(max_iter=500, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (rbf)': SVC(kernel='rbf', probability=True, random_state=42),
    'K-NN (k=5)': KNeighborsClassifier(n_neighbors=5)
}

In [6]:
results = []
for name, model in models.items():
    # Tree-based models are insensitive to feature scale, so they use the unscaled features
    if name in ['Decision Tree', 'Random Forest']:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test)[:, 1]
    else:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        # SVC exposes predict_proba only when constructed with probability=True
        if hasattr(model, 'predict_proba'):
            y_proba = model.predict_proba(X_test_scaled)[:, 1]
        else:
            # Fallback: use the decision scores as ranking scores
            y_scores = model.decision_function(X_test_scaled)
            # Min-max scale to [0, 1]; ROC AUC depends only on the ranking, so this does not change it
            y_proba = (y_scores - y_scores.min()) / (y_scores.max() - y_scores.min())
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    results.append({'Model': name, 'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1': f1, 'ROC_AUC': auc})
results_df = pd.DataFrame(results).set_index('Model')
results_df

| Model | Accuracy | Precision | Recall | F1 | ROC_AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.982456 | 0.986111 | 0.986111 | 0.986111 | 0.99537 |
| Decision Tree | 0.912281 | 0.955882 | 0.902778 | 0.928571 | 0.915675 |
| Random Forest | 0.95614 | 0.958904 | 0.972222 | 0.965517 | 0.993717 |
| SVM (rbf) | 0.982456 | 0.986111 | 0.986111 | 0.986111 | 0.99504 |
| K-NN (k=5) | 0.95614 | 0.958904 | 0.972222 | 0.965517 | 0.978836 |
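
Accuracy, precision, recall, and F1 all follow from the four confusion-matrix counts. As a worked check, the sketch below uses counts that are inferred from the Logistic Regression row and the 114-sample test set. This is an assumption: the notebook imports `confusion_matrix` but never prints the matrix, so the counts are a reconstruction consistent with the reported values rather than recorded output.

```python
# Confusion-matrix counts inferred from the Logistic Regression row
# (assumption: the notebook does not print the matrix itself).
tp, fp, fn, tn = 71, 1, 1, 41

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # how many predicted positives are correct
recall = tp / (tp + fn)                      # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

With these counts the four values agree with the Logistic Regression row to six decimal places.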


**Description:**

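Logistic Regression: A linear model that passes a weighted sum of the features through the sigmoid function to estimate the probability of the positive class. It suits binary classification, and its coefficients are directly interpretable.

Decision Tree: A tree-based model that recursively splits the data on feature thresholds. It captures non-linear relationships and is easy to interpret, but an unpruned tree tends to overfit.

Random Forest: An ensemble of decision trees trained on bootstrap samples with random feature subsets. Averaging the trees reduces overfitting and often yields strong performance.

SVM: Finds a maximum-margin decision boundary. With the RBF kernel used here, the boundary is a hyperplane in the kernel-induced feature space, which allows non-linear separation in the original features. SVMs remain effective in high-dimensional spaces.

k-NN: An instance-based method that predicts the majority class among the k nearest training samples (k = 5 here). It is simple, but because it relies on distances between samples it needs scaled features.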
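
To make the Logistic Regression description concrete, the sketch below recomputes the predicted probabilities by hand from the fitted coefficients and compares them with `predict_proba`. For brevity it fits on the full scaled dataset, which is an assumption made for illustration only and not how the notebook evaluates the model; `z`, `p_manual`, and `p_sklearn` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Illustration only: scale and fit on all samples, no train/test split
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=500, random_state=42).fit(X_scaled, y)

# Linear score z = w . x + b for every sample
z = X_scaled @ clf.coef_.ravel() + clf.intercept_[0]
# The sigmoid maps the score to an estimate of P(y = 1 | x)
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = clf.predict_proba(X_scaled)[:, 1]
print("max abs difference:", np.abs(p_manual - p_sklearn).max())
```

The difference should be at the level of floating-point rounding, which shows that the probability is simply the sigmoid of a linear function of the scaled features.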

## **Model Comparison**

In [7]:
# Which algorithm performed best and which performed worst (ranked by F1)?
# idxmax/idxmin return the first matching row when several models tie.
best_by_f1 = results_df['F1'].idxmax()
worst_by_f1 = results_df['F1'].idxmin()
print('Best model (by F1):', best_by_f1)
print('Worst model (by F1):', worst_by_f1)
results_df

Best model (by F1): Logistic Regression
Worst model (by F1): Decision Tree


| Model | Accuracy | Precision | Recall | F1 | ROC_AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.982456 | 0.986111 | 0.986111 | 0.986111 | 0.99537 |
| Decision Tree | 0.912281 | 0.955882 | 0.902778 | 0.928571 | 0.915675 |
| Random Forest | 0.95614 | 0.958904 | 0.972222 | 0.965517 | 0.993717 |
| SVM (rbf) | 0.982456 | 0.986111 | 0.986111 | 0.986111 | 0.99504 |
| K-NN (k=5) | 0.95614 | 0.958904 | 0.972222 | 0.965517 | 0.978836 |
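
Logistic Regression and SVM (rbf) share the top F1 score (0.986111), and `idxmax` reports only the first of them. One way to surface the tie and break it is to rank on a second metric. The sketch below rebuilds the F1 and ROC_AUC columns from the table above; choosing ROC_AUC as the tie-breaker is an assumption for illustration, and `tied_best` and `ranked` are names introduced here.

```python
import pandas as pd

# F1 and ROC_AUC values copied from the results table
results_df = pd.DataFrame(
    {
        "F1": [0.986111, 0.928571, 0.965517, 0.986111, 0.965517],
        "ROC_AUC": [0.99537, 0.915675, 0.993717, 0.99504, 0.978836],
    },
    index=pd.Index(
        ["Logistic Regression", "Decision Tree", "Random Forest",
         "SVM (rbf)", "K-NN (k=5)"],
        name="Model",
    ),
)

# List every model that reaches the best F1, not just the first one
top_f1 = results_df["F1"].max()
tied_best = results_df.index[results_df["F1"] == top_f1].tolist()
print("Tied on best F1:", tied_best)

# Rank by F1 first, then by ROC_AUC to break ties (illustrative choice)
ranked = results_df.sort_values(["F1", "ROC_AUC"], ascending=False)
print(ranked)
```

On a single 114-sample test split, a difference this small between the two top models is within what a different random split could produce, so the ranking between them should be read as a tie.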
