## Problem Framing

This is a supervised binary classification problem: predict whether a breast tumor is malignant or benign from its diagnostic features. Because a false negative (a malignant tumor classified as benign) can have serious consequences, recall on the malignant class matters alongside overall generalization performance.

In [6]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


In [7]:
columns = [
    'id', 'diagnosis',
    'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
    'compactness_mean', 'concavity_mean', 'concave_points_mean', 'symmetry_mean',
    'fractal_dimension_mean',
    'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
    'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
    'fractal_dimension_se',
    'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
    'smoothness_worst', 'compactness_worst', 'concavity_worst',
    'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst'
]

# data.csv has no header row, so the column names are supplied explicitly.
data = pd.read_csv("data.csv", header=None, names=columns)


In [8]:
data.head()


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [9]:
data.shape


(569, 32)

In [10]:
data.isnull().sum()


id,0
diagnosis,0
radius_mean,0
texture_mean,0
perimeter_mean,0
area_mean,0
smoothness_mean,0
compactness_mean,0
concavity_mean,0
concave_points_mean,0


In [11]:
data = data.drop(columns=['id'])
data.head()


Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [12]:
X = data.drop(columns=['diagnosis'])
y = data['diagnosis'].map({'M': 1, 'B': 0})


In [13]:
X.shape, y.value_counts()


((569, 30),
 diagnosis
 0    357
 1    212
 Name: count, dtype: int64)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [15]:
scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse its statistics on the test split.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
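
One way to keep this fit-on-train-only discipline automatic, including under cross-validation, is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is not part of the original notebook. It assumes that `load_breast_cancer` (scikit-learn's bundled copy of the Wisconsin diagnostic data) is an acceptable stand-in for `data.csv`, and the names `X_demo`, `y_demo`, `pipe`, and `cv_recall` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data.csv.
X_demo, target = load_breast_cancer(return_X_y=True)
# The bundled target uses 0 = malignant, 1 = benign; flip it so that
# malignant = 1, matching the encoding used in this notebook.
y_demo = 1 - target

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_recall = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="recall")

print("Recall per fold:", [round(float(r), 3) for r in cv_recall])
print("Mean recall:", round(float(cv_recall.mean()), 3))
```

Averaging recall over several folds gives a steadier estimate than a single 114-sample test split, at the cost of fitting the model several times.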


In [16]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)


In [17]:
y_train_pred_lr = lr.predict(X_train_scaled)
y_test_pred_lr = lr.predict(X_test_scaled)


In [18]:
train_error_lr = 1 - accuracy_score(y_train, y_train_pred_lr)
test_error_lr = 1 - accuracy_score(y_test, y_test_pred_lr)

print("Logistic Regression")
print("Train Error:", train_error_lr)
print("Test Error:", test_error_lr)
print("Accuracy:", accuracy_score(y_test, y_test_pred_lr))
print("Precision:", precision_score(y_test, y_test_pred_lr))
print("Recall:", recall_score(y_test, y_test_pred_lr))
print("F1 Score:", f1_score(y_test, y_test_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred_lr))


Logistic Regression
Train Error: 0.01318681318681314
Test Error: 0.03508771929824561
Accuracy: 0.9649122807017544
Precision: 0.975
Recall: 0.9285714285714286
F1 Score: 0.9512195121951219
Confusion Matrix:
 [[71  1]
 [ 3 39]]


In [19]:
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
# Tree splits depend only on the ordering of feature values, so the unscaled features are used.
dt.fit(X_train, y_train)


In [20]:
y_train_pred_dt = dt.predict(X_train)
y_test_pred_dt = dt.predict(X_test)


In [21]:
train_error_dt = 1 - accuracy_score(y_train, y_train_pred_dt)
test_error_dt = 1 - accuracy_score(y_test, y_test_pred_dt)

print("Decision Tree")
print("Train Error:", train_error_dt)
print("Test Error:", test_error_dt)
print("Accuracy:", accuracy_score(y_test, y_test_pred_dt))
print("Precision:", precision_score(y_test, y_test_pred_dt))
print("Recall:", recall_score(y_test, y_test_pred_dt))
print("F1 Score:", f1_score(y_test, y_test_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred_dt))


Decision Tree
Train Error: 0.01318681318681314
Test Error: 0.07894736842105265
Accuracy: 0.9210526315789473
Precision: 0.9459459459459459
Recall: 0.8333333333333334
F1 Score: 0.8860759493670886
Confusion Matrix:
 [[70  2]
 [ 7 35]]
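
The reported metrics can be re-derived by hand from the two confusion matrices, which is a quick sanity check on the printed numbers. The helper below is a minimal sketch; the function name `metrics_from_confusion` is hypothetical, and the counts are copied from the outputs above.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1


# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
lr_metrics = metrics_from_confusion(tn=71, fp=1, fn=3, tp=39)
dt_metrics = metrics_from_confusion(tn=70, fp=2, fn=7, tp=35)

for name, m in [("Logistic Regression", lr_metrics), ("Decision Tree", dt_metrics)]:
    print(name, [round(v, 4) for v in m])
# Logistic Regression [0.9649, 0.975, 0.9286, 0.9512]
# Decision Tree [0.9211, 0.9459, 0.8333, 0.8861]
```

The false-negative count is the number to watch here: 3 missed malignant cases for Logistic Regression against 7 for the Decision Tree.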


## Generalization Error & Overfitting Analysis

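Logistic Regression has a training error of about 1.3% and a test error of about 3.5%. The gap is small, which indicates good generalization and a balanced bias-variance tradeoff. A linear decision boundary, combined with the L2 regularization that scikit-learn's LogisticRegression applies by default, limits how closely the model can fit noise in the training data. With only 114 test samples, each misclassification shifts the test error by almost 0.9 percentage points, so small differences should be read with caution.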

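The Decision Tree reaches the same training error (about 1.3%) but a test error of about 7.9%, roughly twice that of Logistic Regression. This larger gap indicates overfitting: even with max_depth=5, the tree is a high-variance model that fits training patterns which do not carry over to unseen data.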
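
To see how tree depth drives this gap, training and test error can be compared across several values of `max_depth`. This is a sketch rather than part of the original analysis. It assumes `load_breast_cancer` as a stand-in for `data.csv`, and the `depths` grid is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for data.csv.
X_demo, target = load_breast_cancer(return_X_y=True)
y_demo = 1 - target  # flip so that malignant = 1, as in this notebook

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

depths = [1, 2, 3, 5, 8, None]  # None removes the depth limit
results = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)
    test_err = 1 - tree.score(X_te, y_te)
    results.append((depth, train_err, test_err))
    print(f"max_depth={depth!s:>4}  train error={train_err:.3f}  test error={test_err:.3f}")
```

A training error that keeps falling while the test error stalls or rises is the usual signature of overfitting. Picking the depth from a single test split would itself leak information, so a validation split or cross-validation is the safer way to choose it.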

## ML Issues Relevant to This Problem

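Feature Scaling: Logistic Regression is trained on standardized features. Scaling helps the solver converge and makes the regularization penalty treat all coefficients comparably. The Decision Tree is trained on the raw features because threshold-based splits are unaffected by feature scale.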

Data Leakage: The scaler is fitted on the training split only and then applied to the test split, so no test-set statistics influence training.

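Class Imbalance: The classes are moderately imbalanced (357 benign vs. 212 malignant), so precision and recall on the malignant class are more informative than accuracy alone. On the test set, Logistic Regression misses 3 of 42 malignant cases and the Decision Tree misses 7.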

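Feature Correlation: Many features are strongly correlated (radius, perimeter, and area all measure tumor size, for example). This makes the individual coefficients of a linear model unstable and harder to interpret, even if its predictions remain accurate.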
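
The correlation point can be illustrated with synthetic data. The sketch below does not use the notebook's dataset: it generates hypothetical, idealized circular tumors, so that perimeter and area are exact functions of radius, and adds an unrelated `texture` column for contrast. The 0.9 cutoff is an illustrative choice. On the real data, calling `X.corr()` on the feature DataFrame would produce the equivalent matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, idealized tumors: circular outlines with random radii.
radius = rng.uniform(5.0, 25.0, size=500)
perimeter = 2 * np.pi * radius
area = np.pi * radius**2
# An unrelated feature for contrast.
texture = rng.normal(20.0, 4.0, size=500)

features = np.column_stack([radius, perimeter, area, texture])
names = ["radius", "perimeter", "area", "texture"]

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(features, rowvar=False)

# Report feature pairs whose absolute correlation exceeds the cutoff.
threshold = 0.9  # illustrative cutoff, not taken from the notebook
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > threshold:
            print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:.3f}")
```

Pairs flagged this way are candidates for dropping one member or for stronger regularization when coefficient interpretability matters.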