## Importing Libraries 

In [1]:
import pandas as pd 
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import numpy as np 
from sklearn.model_selection import GridSearchCV

## Load Dataset 

In [2]:
data = load_breast_cancer()

In [3]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
df = pd.DataFrame(data.data, columns=data.feature_names)

In [5]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [6]:
df["Target"] = data.target
X,y = df.drop("Target", axis=1), df.Target
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

## 🧑‍💻 Scenario 1: Original Features (Baseline)
We'll start with the original dataset containing all features, and we'll perform hyperparameter tuning using Logistic Regression.

### Hyperparameter grid for Logistic Regression

In [8]:
param_grid = {'C': [0.01, 0.1,1,10,100], # Regularization parameters 
              'solver': ["liblinear", "saga"] # Optimization algorithm
              }

In [9]:
logistic_reg_model = LogisticRegression(max_iter=10000)
grid_search_model = GridSearchCV(logistic_reg_model, param_grid, cv=5)
grid_search_model.fit(X_train_scaled, y_train)
y_pred = grid_search_model.predict(X_test_scaled)

In [10]:
# Best hyperparameters and evaluation
print("Best Hyperparameters (Original Features):", grid_search_model.best_params_)
y_pred = grid_search_model.predict(X_test_scaled)
print("Accuracy (Original Features):", accuracy_score(y_test, y_pred))

Best Hyperparameters (Original Features): {'C': 0.1, 'solver': 'liblinear'}
Accuracy (Original Features): 0.9824561403508771


### Explanation:
**Feature Set:** All original features are used.

**Hyperparameter Tuning:** We use GridSearchCV to tune the regularization parameter C and the solver for the Logistic Regression model.

The result is the best combination of hyperparameters that minimize overfitting or underfitting.

**Effect of Features:** As the feature set is relatively manageable in this case, regularization (C) is tuned to prevent overfitting. The number of features is not very large here.

## 🧑‍💻 Scenario 2: Adding Noisy Features
Now, we'll add some noisy features to the dataset. Noisy features increase the model's complexity, and proper regularization becomes even more critical.

In [11]:
X_noisy = np.hstack((X, np.random.randn(X.shape[0], 500)))
X.shape, X_noisy.shape

((569, 30), (569, 530))

In [12]:
pd.DataFrame(X).head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [13]:
pd.DataFrame(X_noisy).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,520,521,522,523,524,525,526,527,528,529
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,0.414004,-0.489939,-0.921796,0.638564,1.54046,-0.761651,-0.230016,-0.22004,-0.153252,-1.324632
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,-1.003317,2.265019,-0.768769,-0.615758,0.796203,-1.843241,-0.139062,1.862789,-0.643894,-0.189425
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,0.733086,-0.122692,1.220189,-0.792799,-0.10606,0.778413,-0.03123,0.645287,-0.378387,-0.097456
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,-1.106638,0.014557,-0.004681,-0.242764,-0.215467,-0.387234,-0.906176,1.682662,0.982661,-1.009722
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,-0.910099,0.237844,-1.037534,0.229552,-0.149861,-0.920872,-0.381326,-1.043273,0.865814,-1.011791


In [14]:
X_train_noisy, X_test_noisy, y_train, y_test = train_test_split(X_noisy, y, test_size=0.3, random_state=42)

In [15]:
X_train_noisy_scaled = scaler.fit_transform(X_train_noisy)
X_test_noisy_scaled = scaler.fit_transform(X_test_noisy)

In [16]:
param_grid

{'C': [0.01, 0.1, 1, 10, 100], 'solver': ['liblinear', 'saga']}

In [17]:
grid_search_noisy_model = GridSearchCV(logistic_reg_model, param_grid, cv=5)
grid_search_noisy_model.fit(X_train_noisy_scaled, y_train)
# grid_search_noisy_model.fit(X_train_noisy_scaled, y_train)

In [18]:
# Best hyperparameters and evaluation with noisy features
print("Best Hyperparameters (Noisy Features):", grid_search_noisy_model.best_params_)
y_pred_noisy = grid_search_noisy_model.predict(X_test_noisy_scaled)
print("Accuracy (Noisy Features):", accuracy_score(y_test, y_pred_noisy))

Best Hyperparameters (Noisy Features): {'C': 10, 'solver': 'saga'}
Accuracy (Noisy Features): 0.9532163742690059


Explanation:
Feature Set: 50 noisy features are added to the original dataset.

Impact on Hyperparameter Tuning: With noisy features, the model has more dimensions to deal with, which can lead to overfitting if regularization is not properly tuned. The regularization parameter C will be important to prevent this overfitting.

Effect of Noisy Features: As the feature space increases with noise, the model becomes more complex, and we may need stronger regularization to maintain a balance between fitting the data and avoiding overfitting.



## 🧑‍💻 Scenario 3: PCA-Reduced Features (Dimensionality Reduction)
To reduce the number of features, we can use Principal Component Analysis (PCA). This technique reduces the dimensionality of the data while retaining most of the variance in the data.

In [19]:
from sklearn.decomposition import PCA

# Apply PCA to reduce dimensions (let's say to 10 components)
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_noisy)

# Train-test split with PCA-reduced features
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Standardizing the PCA data
X_train_pca_scaled = scaler.fit_transform(X_train_pca)
X_test_pca_scaled = scaler.transform(X_test_pca)

# Hyperparameter tuning with PCA-reduced features
grid_search_pca = GridSearchCV(lr, param_grid, cv=5)
grid_search_pca.fit(X_train_pca_scaled, y_train)

# Best hyperparameters and evaluation with PCA
print("Best Hyperparameters (PCA):", grid_search_pca.best_params_)
y_pred_pca = grid_search_pca.predict(X_test_pca_scaled)
print("Accuracy (PCA Features):", accuracy_score(y_test, y_pred_pca))


NameError: name 'lr' is not defined

Explanation:
Feature Set: The dataset is reduced to 10 principal components using PCA, which removes noise and reduces the dimensionality.

Impact on Hyperparameter Tuning: With fewer features, the model has less complexity, and the tuning of regularization (C) becomes simpler. The model is less likely to overfit, and cross-validation is more stable.

Effect of PCA: PCA helps to focus on the most significant features, thus reducing the impact of irrelevant dimensions. This makes the hyperparameter tuning process more effective, as fewer features typically result in a less complex model with lower risk of overfitting.

Summary of Key Points
Original Features: Regularization (C) is tuned to avoid overfitting as we try to balance model complexity and performance.

Noisy Features: The additional features add complexity, increasing the risk of overfitting. Regularization becomes more important to prevent the model from fitting noise rather than patterns in the data.

PCA-Reduced Features: By reducing the number of features, we simplify the model, making it easier to tune hyperparameters like regularization. PCA helps the model to generalize better and reduces overfitting.