
# Task Todo

## Task 1


### • Use the California Housing dataset from sklearn.datasets.
### • Split the dataset into **training (80%)** and **test (20%)** sets.




In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("camnugent/california-housing-prices")

print("Path to dataset files:", path)


Using Colab cache for faster access to the 'california-housing-prices' dataset.
Path to dataset files: /kaggle/input/california-housing-prices


In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


In [4]:

# x, y = fetch_california_housing(return_X_y=True)
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

housing_data_path = "/kaggle/input/california-housing-prices/housing.csv"
df = pd.read_csv(housing_data_path)
df = df.dropna()  # remove rows that contain null values
df.head()



   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY


In [6]:
X = df.drop("median_house_value", axis=1)  # features
Y = df["median_house_value"]               # target

# One-hot encode categorical column
X = pd.get_dummies(X, columns=['ocean_proximity'], drop_first=True)

print("Features after encoding:")
print(X.columns.tolist())
print(f"\nX shape: {X.shape}")  # (rows, features)
print(f"y shape:{Y.shape}")
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)


Features after encoding:
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']

X shape: (20433, 12)
y shape:(20433,)
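
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame (the toy values are invented for illustration; only the category labels mirror the real column). `pd.get_dummies` orders the category labels and drops the first one, which for this column is `<1H OCEAN`. That category becomes the implicit reference level, which is why it does not appear in the feature list above.

```python
import pandas as pd

# Hypothetical toy rows; only the category labels mirror the real column
toy = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN"],
})

encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)

# The first category in sorted order ('<1H OCEAN') gets no column of its own:
# a row whose dummy columns are all zero/False belongs to that category.
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns, which matters most for the unregularized baseline model.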


## 3.2 Task 2: Complete all the Tasks
### • Regression Task (California Housing):

### – Step 1: Baseline Model (No Regularization). Build a Linear Regression model without any regularization.
#### ∗ Train the model on the training set.
#### ∗ Observe the coefficients of the model.
#### ∗ Compute the Mean Squared Error (MSE) on both training and test sets.
#### ∗ This step helps to understand how a simple linear model behaves on the dataset and serves as a baseline for comparison.

In [7]:
# Scale the features and the target variable
scaler_X = StandardScaler()  # for features
scaler_y = StandardScaler()  # for target
x_train_scaled = scaler_X.fit_transform(x_train)
x_test_scaled = scaler_X.transform(x_test)  # reuse the training statistics; do not refit on test data
# StandardScaler expects a 2D array, but y_train is a 1D pandas Series,
# so reshape it to a single column and flatten the result back to 1D with ravel()
y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1)).ravel()


linear_model = LinearRegression()
linear_model.fit(x_train_scaled, y_train_scaled)


In [8]:
# Make predictions
# Training-set predictions show how well the model fits the data it learned from;
# comparing them with test-set predictions helps detect underfitting or overfitting
y_train_pred = linear_model.predict(x_train_scaled)
y_test_pred = linear_model.predict(x_test_scaled)

In [9]:
# Calculate MSE on scaled data
train_mse = mean_squared_error(y_train_scaled, y_train_pred)
test_mse = mean_squared_error(y_test_scaled, y_test_pred)

print(f"\nTraining MSE: {train_mse:.4f}")
print(f"Test MSE:     {test_mse:.4f}")


Training MSE: 0.3544
Test MSE:     0.3628
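
Because the target was standardized, these MSE values are expressed in units of the target's variance rather than in squared dollars. On the training set, where the scaled target has variance 1, an MSE of 0.3544 corresponds to a training R² of about 0.65. The sketch below, on synthetic stand-in data (all names and numbers in it are assumptions for illustration), shows one way to convert a scaled MSE back to the original units: invert the target scaling on the predictions, or equivalently multiply the scaled MSE by the squared scale factor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for house values (hypothetical numbers)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 200_000 + 50_000 * X_demo[:, 0] + rng.normal(scale=20_000, size=200)

scaler_y_demo = StandardScaler()
y_scaled = scaler_y_demo.fit_transform(y_demo.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_demo, y_scaled)
pred_scaled = model.predict(X_demo)
mse_scaled = mean_squared_error(y_scaled, pred_scaled)

# Undo the target scaling to express the error in the original units
pred_orig = scaler_y_demo.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
mse_orig = mean_squared_error(y_demo, pred_orig)
rmse_orig = np.sqrt(mse_orig)

print(f"MSE (scaled):   {mse_scaled:.4f}")
print(f"RMSE (original units): {rmse_orig:,.0f}")
# Same number via the scale factor: mse_scaled * std(y)^2
print(np.isclose(mse_orig, mse_scaled * scaler_y_demo.scale_[0] ** 2))
```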


In [10]:
# Display coefficients
coefficients_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': linear_model.coef_
})

print("\nTop 10 Model Coefficients:")
print(coefficients_df.sort_values('Coefficient', key=abs, ascending=False).head(10))


Top 10 Model Coefficients:
                  Feature  Coefficient
7           median_income     0.647872
1                latitude    -0.476378
0               longitude    -0.472620
4          total_bedrooms     0.373744
5              population    -0.357397
8  ocean_proximity_INLAND    -0.158489
6              households     0.141739
3             total_rooms    -0.118322
2      housing_median_age     0.118209
9  ocean_proximity_ISLAND     0.025155


In [11]:
# x_train_scaled.shape is a tuple (number_of_rows, number_of_features); shape[0] is the number of samples
print(f"\nTraining set size: {x_train_scaled.shape[0]}")
print(f"Test set size: {x_test_scaled.shape[0]}")


Training set size: 16346
Test set size: 4087
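
The two sizes follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the remainder goes to training. A small sketch that reproduces the arithmetic on a plain index array (the array stands in for the 20433 rows left after `dropna()`):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 20433  # rows remaining after dropna(), as reported earlier
indices = np.arange(n_rows)

train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

# For a float test_size, the test count is ceil(test_size * n_rows)
expected_test = math.ceil(0.2 * n_rows)
print(len(train_idx), len(test_idx), expected_test)
```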


### – Step 2: Hyperparameter Tuning. Use GridSearchCV or RandomizedSearchCV to tune hyperparameters for Ridge and Lasso regression models.
#### ∗ Define a grid of alpha values (regularization strength).
#### ∗ Run cross-validation on the training set to find the optimal alpha.
#### ∗ Evaluate the model performance on the test set using MSE.
#### ∗ This step demonstrates the importance of selecting appropriate hyperparameters to improve generalization.

In [12]:
# Define the grid of alpha values (regularization strength)
# Using a wide range from very small to very large values
# Alpha controls how strongly the model penalizes large coefficients.
alpha_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
}

print("\nAlpha values to test:", alpha_grid['alpha'])


Alpha values to test: [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
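
The same one-value-per-decade grid can also be generated instead of typed out. A minimal sketch using `np.logspace` (the name `alpha_grid_alt` is hypothetical, chosen so it does not overwrite the grid above):

```python
import numpy as np

# 8 points from 10^-3 to 10^4, one per decade
alphas = np.logspace(-3, 4, num=8)
alpha_grid_alt = {"alpha": alphas.tolist()}

print(alpha_grid_alt["alpha"])
```

Raising `num` gives a finer grid over the same range, which can help when the cross-validation error is flat between two neighbouring decades.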


In [13]:
# ============================================
# RIDGE REGRESSION TUNING
# ============================================

print("\n" + "-"*70)
print("RIDGE REGRESSION")
print("-"*70)

# Create Ridge model
ridge = Ridge()

# Perform GridSearchCV with 5-fold cross-validation
ridge_grid_search = GridSearchCV(
    estimator=ridge,
    param_grid=alpha_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("\nTraining Ridge with GridSearchCV...")
ridge_grid_search.fit(x_train_scaled, y_train_scaled)


----------------------------------------------------------------------
RIDGE REGRESSION
----------------------------------------------------------------------

Training Ridge with GridSearchCV...
Fitting 5 folds for each of 8 candidates, totalling 40 fits


In [14]:
# Get the best value of alpha (Parameter)
best_ridge_alpha = ridge_grid_search.best_params_['alpha']
print(f"\n✓ Best alpha for Ridge: {best_ridge_alpha}")

# Get cross-validation results
ridge_cv_results = pd.DataFrame(ridge_grid_search.cv_results_)
ridge_cv_results['mean_mse'] = -ridge_cv_results['mean_test_score']
ridge_cv_results['std_mse'] = ridge_cv_results['std_test_score']

print("\nCross-Validation Results for Ridge:")
print(ridge_cv_results[['param_alpha', 'mean_mse', 'std_mse']].to_string(index=False))




✓ Best alpha for Ridge: 10

Cross-Validation Results for Ridge:
 param_alpha  mean_mse  std_mse
       0.001  0.357540 0.029032
       0.010  0.357540 0.029032
       0.100  0.357540 0.029031
       1.000  0.357534 0.029015
      10.000  0.357493 0.028856
     100.000  0.357787 0.027482
    1000.000  0.369099 0.022582
   10000.000  0.476420 0.019969
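
The table is nearly flat for alpha up to 100, and a likely reason is visible in the closed-form Ridge solution, which solves (XᵀX + αI)w = Xᵀy. With standardized features each diagonal entry of XᵀX equals the number of rows (about 16346 for the full training set), so adding α = 10 changes the system very little, while α = 10000 is of the same order as the data term. The sketch below checks the closed form against `Ridge` on synthetic data; the data, the helper name `ridge_closed_form`, and `fit_intercept=False` are assumptions made to keep the comparison exact.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 500, 4
X_demo = rng.normal(size=(n, p))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardized columns
y_demo = X_demo @ np.array([0.6, -0.4, 0.0, 0.2]) + rng.normal(scale=0.5, size=n)
y_demo = y_demo - y_demo.mean()                               # centered target


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


norms = {}
for alpha in [0.001, 10, 10000]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
    norms[alpha] = np.linalg.norm(w)
    print(alpha, np.round(w, 4), "matches sklearn:", np.allclose(w, w_sklearn, atol=1e-6))
```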


In [15]:
# Train the best Ridge model
best_ridge_model = Ridge(alpha=best_ridge_alpha)
best_ridge_model.fit(x_train_scaled, y_train_scaled)

# Make predictions
ridge_train_pred = best_ridge_model.predict(x_train_scaled)
ridge_test_pred = best_ridge_model.predict(x_test_scaled)

# Calculate MSE
ridge_train_mse = mean_squared_error(y_train_scaled, ridge_train_pred)
ridge_test_mse = mean_squared_error(y_test_scaled, ridge_test_pred)

print(f"\nRidge Model Performance:")
print(f"Training MSE: {ridge_train_mse:.4f}")
print(f"Test MSE:     {ridge_test_mse:.4f}")


Ridge Model Performance:
Training MSE: 0.3544
Test MSE:     0.3628


### – Step 3: Regularization Experiments (L1 vs L2). Train L1 (Lasso) and L2 (Ridge) regression models using the optimal hyperparameters.

#### ∗ Compare the coefficients learned by both models. Notice that L1 tends to produce sparse coefficients (some set to zero), while L2 shrinks coefficients without zeroing them.
#### ∗ Evaluate and compare the MSE on training and test sets.
#### ∗ Discuss the effect of regularization on the bias-variance tradeoff:
##### · How L1/L2 reduces variance and prevents overfitting.
##### · How excessive regularization may increase bias and underfit the data.
#### ∗ Visualize the effect by plotting coefficients or training/test error versus alpha values (optional but recommended).

In [16]:
# ============================================
# LASSO REGRESSION TUNING
# ============================================


# Create Lasso model
lasso = Lasso(max_iter=10000)  # Increased max_iter for convergence

# Perform GridSearchCV with 5-fold cross-validation
lasso_grid_search = GridSearchCV(
    estimator=lasso,
    param_grid=alpha_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("\nTraining Lasso with GridSearchCV...")
lasso_grid_search.fit(x_train_scaled, y_train_scaled)


Training Lasso with GridSearchCV...
Fitting 5 folds for each of 8 candidates, totalling 40 fits


In [17]:
# Get the best parameters
best_lasso_alpha = lasso_grid_search.best_params_['alpha']
print(f"\n✓ Best alpha for Lasso: {best_lasso_alpha}")

# Get cross-validation results
lasso_cv_results = pd.DataFrame(lasso_grid_search.cv_results_)
lasso_cv_results['mean_mse'] = -lasso_cv_results['mean_test_score']
lasso_cv_results['std_mse'] = lasso_cv_results['std_test_score']

print("\nCross-Validation Results for Lasso:")
print(lasso_cv_results[['param_alpha', 'mean_mse', 'std_mse']].to_string(index=False))


✓ Best alpha for Lasso: 0.001

Cross-Validation Results for Lasso:
 param_alpha  mean_mse  std_mse
       0.001  0.357523 0.028480
       0.010  0.364981 0.024084
       0.100  0.431730 0.021229
       1.000  1.000163 0.036360
      10.000  1.000163 0.036360
     100.000  1.000163 0.036360
    1000.000  1.000163 0.036360
   10000.000  1.000163 0.036360
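
The identical MSE of about 1.0 for every alpha of 1 or more has a simple probable explanation. scikit-learn's Lasso minimizes (1/(2n))·‖y − Xw‖² + α‖w‖₁, and every coefficient is zero once α reaches max|Xᵀy|/n. With standardized features and target that quantity is the largest absolute feature-target correlation, which cannot exceed 1. The model then predicts only the mean, so its MSE is roughly the target variance. The sketch below illustrates this on synthetic data (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(400, 5))
y_raw = 2.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + rng.normal(size=400)

X_std = StandardScaler().fit_transform(X_raw)
y_std = StandardScaler().fit_transform(y_raw.reshape(-1, 1)).ravel()

# Smallest alpha at which every Lasso coefficient is zero (for scikit-learn's objective)
alpha_max = np.max(np.abs(X_std.T @ y_std)) / len(y_std)

# Fit slightly above alpha_max to stay clear of floating-point ties
lasso_big = Lasso(alpha=alpha_max * 1.01, max_iter=10000).fit(X_std, y_std)
mse_null = np.mean((y_std - lasso_big.predict(X_std)) ** 2)

print(f"alpha_max = {alpha_max:.3f}")
print("all coefficients zero:", bool(np.all(lasso_big.coef_ == 0)))
print(f"MSE of the all-zero model = {mse_null:.3f}")
```

This suggests that, for standardized data, the useful part of a Lasso grid lies below 1, so a finer grid between 0.0001 and 0.1 would probe the region where the model actually changes.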


In [18]:
# Train the best Lasso model
best_lasso_model = Lasso(alpha=best_lasso_alpha, max_iter=10000)
best_lasso_model.fit(x_train_scaled, y_train_scaled)

# Make predictions
lasso_train_pred = best_lasso_model.predict(x_train_scaled)
lasso_test_pred = best_lasso_model.predict(x_test_scaled)

# Calculate MSE
lasso_train_mse = mean_squared_error(y_train_scaled, lasso_train_pred)
lasso_test_mse = mean_squared_error(y_test_scaled, lasso_test_pred)

print(f"\nLasso Model Performance:")
print(f"Training MSE: {lasso_train_mse:.4f}")
print(f"Test MSE:     {lasso_test_mse:.4f}")

# Count non-zero coefficients (feature selection by Lasso)
non_zero_coefs = np.sum(best_lasso_model.coef_ != 0)
print(f"Number of non-zero coefficients: {non_zero_coefs} out of {len(best_lasso_model.coef_)}")


Lasso Model Performance:
Training MSE: 0.3545
Test MSE:     0.3629
Number of non-zero coefficients: 12 out of 12


In [19]:
# HYPERPARAMETER TUNING SUMMARY
# ============================================

print("\n" + "="*70)
print("HYPERPARAMETER TUNING SUMMARY")
print("="*70)

summary_data = {
    'Model': ['Ridge', 'Lasso'],
    'Best Alpha': [best_ridge_alpha, best_lasso_alpha],
    'Training MSE': [f"{ridge_train_mse:.4f}", f"{lasso_train_mse:.4f}"],
    'Test MSE': [f"{ridge_test_mse:.4f}", f"{lasso_test_mse:.4f}"],
    'Non-zero Coefs': [np.sum(best_ridge_model.coef_ != 0), non_zero_coefs]  # count non-zeros for both models
}

summary_df = pd.DataFrame(summary_data)
print("\n", summary_df.to_string(index=False))

print("\n" + "="*70)
print("Key Insights:")
print("-"*70)
print(f"• Ridge selected alpha = {best_ridge_alpha}, keeping all features")
print(f"• Lasso selected alpha = {best_lasso_alpha}, keeping {non_zero_coefs}/{len(best_lasso_model.coef_)} features")
print(f"• Lasso performs feature selection by setting {len(best_lasso_model.coef_) - non_zero_coefs} coefficients to zero")

if ridge_test_mse < lasso_test_mse:
    print(f"• Ridge has lower test MSE ({ridge_test_mse:.4f} vs {lasso_test_mse:.4f})")
else:
    print(f"• Lasso has lower test MSE ({lasso_test_mse:.4f} vs {ridge_test_mse:.4f})")

print("="*70)


HYPERPARAMETER TUNING SUMMARY

 Model  Best Alpha Training MSE Test MSE  Non-zero Coefs
Ridge      10.000       0.3544   0.3628              12
Lasso       0.001       0.3545   0.3629              12

Key Insights:
----------------------------------------------------------------------
• Ridge selected alpha = 10, keeping all features
• Lasso selected alpha = 0.001, keeping 12/12 features
• Lasso performs feature selection by setting 0 coefficients to zero
• Ridge has lower test MSE (0.3628 vs 0.3629)



In [20]:
# Scale features and target
scaler_X = StandardScaler()
scaler_y = StandardScaler()

x_train_scaled = scaler_X.fit_transform(x_train)
x_test_scaled = scaler_X.transform(x_test)
y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1)).ravel()

feature_names = X.columns.tolist()

print("="*80)
print("STEP 3: REGULARIZATION EXPERIMENTS (L1 vs L2)")
print("="*80)

STEP 3: REGULARIZATION EXPERIMENTS (L1 vs L2)


In [21]:
# ============================================
# PART 1: TRAIN MODELS WITH OPTIMAL ALPHAS
# ============================================

print("\n" + "-"*80)
print("PART 1: TRAINING MODELS WITH OPTIMAL HYPERPARAMETERS")
print("-"*80)

# Alphas used for this experiment. Step 2's grid search selected alpha = 10 for Ridge
# and alpha = 0.001 for Lasso; the Lasso value below (0.01) is ten times larger than
# that optimum, so the Lasso results in this step reflect alpha = 0.01.
optimal_ridge_alpha = 10    # best Ridge alpha from Step 2
optimal_lasso_alpha = 0.01  # Step 2 selected 0.001; set to best_lasso_alpha to use it

print(f"\nUsing optimal alphas:")
print(f"  Ridge (L2): α = {optimal_ridge_alpha}")
print(f"  Lasso (L1): α = {optimal_lasso_alpha}")


--------------------------------------------------------------------------------
PART 1: TRAINING MODELS WITH OPTIMAL HYPERPARAMETERS
--------------------------------------------------------------------------------

Using optimal alphas:
  Ridge (L2): α = 10
  Lasso (L1): α = 0.01


In [22]:
# Train baseline model (no regularization)
baseline_model = LinearRegression()
baseline_model.fit(x_train_scaled, y_train_scaled)

# Train Ridge model (L2 regularization)
ridge_model = Ridge(alpha=optimal_ridge_alpha)
ridge_model.fit(x_train_scaled, y_train_scaled)

# Train Lasso model (L1 regularization)
lasso_model = Lasso(alpha=optimal_lasso_alpha, max_iter=10000)
lasso_model.fit(x_train_scaled, y_train_scaled)

print("\n✓ All models trained successfully!")


✓ All models trained successfully!


In [23]:
# ============================================
# PART 2: COEFFICIENT COMPARISON
# ============================================

print("\n" + "-"*80)
print("PART 2: COEFFICIENT ANALYSIS")
print("-"*80)

# Create coefficient comparison dataframe
coef_comparison = pd.DataFrame({
    'Feature': feature_names,
    'Baseline': baseline_model.coef_,
    'Ridge (L2)': ridge_model.coef_,
    'Lasso (L1)': lasso_model.coef_
})

# Count zero coefficients
baseline_zeros = np.sum(np.abs(baseline_model.coef_) < 1e-10)
ridge_zeros = np.sum(np.abs(ridge_model.coef_) < 1e-10)
lasso_zeros = np.sum(lasso_model.coef_ == 0)

print(f"\nCoefficient Statistics:")
print(f"  Total features: {len(feature_names)}")
print(f"  Baseline - Zero coefficients: {baseline_zeros}")
print(f"  Ridge (L2) - Zero coefficients: {ridge_zeros}")
print(f"  Lasso (L1) - Zero coefficients: {lasso_zeros}")
print(f"\n  → Lasso eliminated {lasso_zeros} features (sparse solution)")
print(f"  → Ridge kept all {len(feature_names)} features (dense solution)")

# Show top features by absolute coefficient value
print("\n" + "="*80)
print("TOP 10 FEATURES BY ABSOLUTE COEFFICIENT VALUE")
print("="*80)

for model_name in ['Baseline', 'Ridge (L2)', 'Lasso (L1)']:
    print(f"\n{model_name}:")
    # Create absolute value column for sorting
    coef_comparison['abs_coef'] = coef_comparison[model_name].abs()
    top_features = coef_comparison.nlargest(10, 'abs_coef')
    print(top_features[['Feature', model_name]].to_string(index=False))


--------------------------------------------------------------------------------
PART 2: COEFFICIENT ANALYSIS
--------------------------------------------------------------------------------

Coefficient Statistics:
  Total features: 12
  Baseline - Zero coefficients: 0
  Ridge (L2) - Zero coefficients: 0
  Lasso (L1) - Zero coefficients: 2

  → Lasso eliminated 2 features (sparse solution)
  → Ridge kept all 12 features (dense solution)

TOP 10 FEATURES BY ABSOLUTE COEFFICIENT VALUE

Baseline:
               Feature  Baseline
         median_income  0.647872
              latitude -0.476378
             longitude -0.472620
        total_bedrooms  0.373744
            population -0.357397
ocean_proximity_INLAND -0.158489
            households  0.141739
           total_rooms -0.118322
    housing_median_age  0.118209
ocean_proximity_ISLAND  0.025155

Ridge (L2):
               Feature  Ridge (L2)
         median_income    0.647006
              latitude   -0.466173
             longi

In [24]:
# ============================================
# PART 3: MODEL PERFORMANCE COMPARISON
# ============================================

print("\n" + "-"*80)
print("PART 3: MODEL PERFORMANCE EVALUATION")
print("-"*80)

# Calculate predictions and MSE for all models
models = {
    'Baseline': baseline_model,
    'Ridge (L2)': ridge_model,
    'Lasso (L1)': lasso_model
}

results = []

for name, model in models.items():
    train_pred = model.predict(x_train_scaled)
    test_pred = model.predict(x_test_scaled)

    train_mse = mean_squared_error(y_train_scaled, train_pred)
    test_mse = mean_squared_error(y_test_scaled, test_pred)

    # Calculate difference (overfitting measure)
    overfit_gap = test_mse - train_mse
    overfit_pct = (overfit_gap / train_mse) * 100

    results.append({
        'Model': name,
        'Train MSE': train_mse,
        'Test MSE': test_mse,
        'Gap': overfit_gap,
        'Gap %': overfit_pct
    })

results_df = pd.DataFrame(results)
print("\nModel Performance Comparison:")
print(results_df.to_string(index=False))


--------------------------------------------------------------------------------
PART 3: MODEL PERFORMANCE EVALUATION
--------------------------------------------------------------------------------

Model Performance Comparison:
     Model  Train MSE  Test MSE      Gap    Gap %
  Baseline   0.354352  0.362787 0.008436 2.380601
Ridge (L2)   0.354359  0.362798 0.008439 2.381501
Lasso (L1)   0.362951  0.372429 0.009478 2.611396


In [25]:
# ============================================
# PART 4: BIAS-VARIANCE TRADEOFF ANALYSIS
# ============================================

print("\n" + "-"*80)
print("PART 4: BIAS-VARIANCE TRADEOFF ANALYSIS")
print("-"*80)

print("\nKey Observations:")

# Analyze each model
for idx, row in results_df.iterrows():
    print(f"\n{row['Model']}:")
    print(f"  Training MSE: {row['Train MSE']:.4f}")
    print(f"  Test MSE:     {row['Test MSE']:.4f}")
    print(f"  Gap:          {row['Gap']:.4f} ({row['Gap %']:.2f}%)")

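    if row['Gap %'] > 5:
        print(f"  → Shows signs of overfitting (high variance)")
    elif row['Gap %'] < -5:
        # A test error below the training error does not indicate underfitting;
        # underfitting shows up as high error on both sets
        print(f"  → Test error below training error (likely split noise)")
    else:
        print(f"  → Well-balanced bias-variance tradeoff")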

print("\n" + "="*80)
print("REGULARIZATION EFFECTS:")
print("="*80)
print("""
L1 (Lasso) Regularization:
  • Produces SPARSE solutions (sets some coefficients to zero)
  • Performs automatic feature selection
  • Reduces variance by eliminating irrelevant features
  • Useful when you suspect many features are irrelevant

L2 (Ridge) Regularization:
  • Produces DENSE solutions (shrinks all coefficients)
  • Keeps all features but reduces their magnitude
  • Reduces variance by penalizing large coefficients
  • Useful when all features may contribute to prediction

Bias-Variance Tradeoff:
  • Too little regularization (small α) → High variance, overfitting
  • Optimal regularization → Balanced bias and variance
  • Too much regularization (large α) → High bias, underfitting
""")


--------------------------------------------------------------------------------
PART 4: BIAS-VARIANCE TRADEOFF ANALYSIS
--------------------------------------------------------------------------------

Key Observations:

Baseline:
  Training MSE: 0.3544
  Test MSE:     0.3628
  Gap:          0.0084 (2.38%)
  → Well-balanced bias-variance tradeoff

Ridge (L2):
  Training MSE: 0.3544
  Test MSE:     0.3628
  Gap:          0.0084 (2.38%)
  → Well-balanced bias-variance tradeoff

Lasso (L1):
  Training MSE: 0.3630
  Test MSE:     0.3724
  Gap:          0.0095 (2.61%)
  → Well-balanced bias-variance tradeoff

REGULARIZATION EFFECTS:

L1 (Lasso) Regularization:
  • Produces SPARSE solutions (sets some coefficients to zero)
  • Performs automatic feature selection
  • Reduces variance by eliminating irrelevant features
  • Useful when you suspect many features are irrelevant

L2 (Ridge) Regularization:
  • Produces DENSE solutions (shrinks all coefficients)
  • Keeps all features but reduce

# Part 2: Classification Task (Breast Cancer)

## 4.1 Task 1: Load and Split Dataset
##### • Use the Breast Cancer dataset from sklearn.datasets.
##### • Treat it as a binary classification task.
##### • Split into training (80%) and test (20%) sets.


In [26]:

from sklearn import datasets
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [31]:
X, Y = load_breast_cancer(return_X_y=True)
print(X.shape)
print(Y.shape)




(569, 30)
(569,)


In [32]:
# ============================================
# SPLITTING DATASET
# ============================================

X_train, X_test, Y_train , Y_test = train_test_split(X, Y, test_size=0.2, random_state=30)
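
With only 569 samples, the class balance of a random split can drift between the two subsets. A possible variant, not used in this notebook, is to pass `stratify` so that both subsets keep nearly the same class ratio. The variable names below are hypothetical and chosen to avoid overwriting the split above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# stratify=y_bc keeps the class ratio (nearly) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=30, stratify=y_bc
)

print(X_tr.shape, X_te.shape)
print(f"share of label 1: all={y_bc.mean():.3f}, "
      f"train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```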



## 4.2 Task 2: Complete all the Tasks
### • Classification Task (Breast Cancer):
#### Step 1: Baseline Model (No Regularization). Build a Logistic Regression model without specifying any regularization (default settings).
#### ∗ Train the model on the training set.
#### ∗ Observe the coefficients of the model.
#### ∗ Compute the accuracy on both training and test sets.
#### ∗ This step serves as a baseline for comparison and helps to understand the behavior of a standard logistic regression model on the dataset.

In [33]:
# ============================================
# LOGISTIC REGRESSION
# ============================================

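# Note: with default settings, scikit-learn's LogisticRegression already applies
# L2 regularization with C=1.0, so this baseline is not strictly unregularized
baseModel = LogisticRegression(max_iter=3000)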
baseModel.fit(X_train,Y_train)
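
Since the default settings include an L2 penalty, one way to see how much that penalty matters is to compare the default `C=1.0` with a much larger `C`, which weakens the penalty. The sketch below is an illustration under added assumptions: it standardizes the features in a pipeline (the notebook fits on raw features), and `C=1e4` is an arbitrary stand-in for "almost no regularization".

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.2, random_state=30)


def fit_and_norm(C):
    """Fit a scaled logistic regression; return (pipeline, L2 norm of its coefficients)."""
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    pipe.fit(X_tr, y_tr)
    coef = pipe.named_steps["logisticregression"].coef_[0]
    return pipe, float(np.linalg.norm(coef))


default_pipe, default_norm = fit_and_norm(1.0)  # scikit-learn's default strength
weak_pipe, weak_norm = fit_and_norm(1e4)        # much weaker penalty (assumed value)

print(f"C=1   : coef norm={default_norm:.2f}, test acc={default_pipe.score(X_te, y_te):.4f}")
print(f"C=1e4 : coef norm={weak_norm:.2f}, test acc={weak_pipe.score(X_te, y_te):.4f}")
```

A larger coefficient norm at large `C` is the expected sign of weaker shrinkage; whether test accuracy improves or degrades depends on the split.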

In [39]:
# ============================================
# OBSERVE THE COEFFICIENTS OF THE MODEL
# ============================================


coefficient = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': baseModel.coef_[0]
})

print(f"Coefficient of the model:")
print(coefficient.to_string(index=False))

Coefficient of the model:
                Feature  Coefficient
            mean radius     0.781768
           mean texture     0.133793
         mean perimeter    -0.242255
              mean area     0.036599
        mean smoothness    -0.083891
       mean compactness    -0.157471
         mean concavity    -0.383470
    mean concave points    -0.185278
          mean symmetry    -0.107122
 mean fractal dimension    -0.026240
           radius error    -0.012104
          texture error     0.675176
        perimeter error     1.057930
             area error    -0.211218
       smoothness error    -0.015809
      compactness error    -0.005888
        concavity error    -0.076169
   concave points error    -0.028368
         symmetry error    -0.021347
fractal dimension error    -0.000496
           worst radius     0.374870
          worst texture    -0.402317
        worst perimeter    -0.286185
             worst area    -0.012772
       worst smoothness    -0.192463
      worst 

In [36]:
# ============================================
# COMPUTING ACCURACY
# ============================================

# Predict on both sets to compare training and test accuracy
train_predictions = baseModel.predict(X_train)
test_predictions = baseModel.predict(X_test)



print(f"Train Accuracy: {accuracy_score(Y_train, train_predictions)}")
print(f"Test Accuracy:  {accuracy_score(Y_test, test_predictions)}")



Train Accuracy: 0.9714285714285714
Test Accuracy:  0.8947368421052632


#### Step 2: Hyperparameter Tuning. Use GridSearchCV or RandomizedSearchCV to tune hyperparameters for logistic regression models with regularization.
#### ∗ Focus on tuning the C parameter (inverse of regularization strength) and the penalty type (l1 or l2).
#### ∗ Run cross-validation on the training set to identify the optimal hyperparameters.
#### ∗ Evaluate the model performance on the test set using accuracy.
#### ∗ This step demonstrates the importance of hyperparameter selection for improving model generalization.

In [40]:


from sklearn.model_selection import GridSearchCV

# ============================================
# TUNING THE C PARAMETER AND PENALTY
# ============================================

model_to_tune = LogisticRegression(solver='liblinear', max_iter=3000)

param_grid = {
    'C': [0.01, 0.1 , 1 , 10, 100],
    'penalty': ['l1', 'l2']
}

In [43]:
# ============================================
# CROSS VALIDATION
# ============================================


grid_search = GridSearchCV(model_to_tune, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print(f"Optimal Hyperparameters found: {best_params}")
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

Optimal Hyperparameters found: {'C': 100, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9648
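
The task also allows `RandomizedSearchCV`, which samples settings instead of trying every grid point. A minimal sketch, assuming the same solver and C range as the grid above and an arbitrary budget of 10 sampled settings; `loguniform` comes from SciPy and spreads the samples evenly across decades of C.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.2, random_state=30)

# C is sampled log-uniformly over the grid's range; the penalty is drawn from a list
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "penalty": ["l1", "l2"],
}

random_search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", max_iter=3000),
    param_distributions,
    n_iter=10,        # number of sampled settings (assumed budget)
    cv=5,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X_tr, y_tr)

print(random_search.best_params_)
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

Random search can land on C values between the grid points, which is useful when the best grid value sits at the edge of the range, as `C=100` does here.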


In [45]:
# ============================================
# EVALUATION ON THE TEST SET (ACCURACY)
# ============================================


test_preds = best_model.predict(X_test)
test_acc = accuracy_score(Y_test, test_preds)

print(f"Test Set Accuracy with Optimal Model: {test_acc:.4f}")

Test Set Accuracy with Optimal Model: 0.9561


#### Step 3: Regularization Experiments (L1 vs L2). Train separate logistic regression models using L1 (Lasso-like) and L2 (Ridge-like) regularization with the optimal hyperparameters.
#### ∗ Compare the coefficients for L1 and L2. Observe that L1 tends to produce sparse coefficients (some exactly zero), whereas L2 shrinks all coefficients but rarely sets them exactly to zero.
#### ∗ Evaluate and compare accuracy on training and test sets.
#### ∗ Discuss the effect of regularization on the bias-variance tradeoff:
##### · How L1/L2 reduces variance and mitigates overfitting.
##### · How overly strong regularization may increase bias, reducing accuracy.
#### ∗ Optional: Visualize the coefficients or plot accuracy vs C values to better understand the impact of L1 and L2 regularization.

In [46]:

l1_model = LogisticRegression(C=100, penalty='l1', solver='liblinear', max_iter=3000)
l2_model = LogisticRegression(C=100, penalty='l2', solver='liblinear', max_iter=3000)

l1_model.fit(X_train, Y_train)
l2_model.fit(X_train, Y_train)

In [47]:

# ============================================
# COMPARING COEFFICIENT SPARSITY
# ============================================
print(f"L1 (Lasso) Non-Zero Coefficients: {sum(l1_model.coef_[0] != 0)} out of 30")
print(f"L2 (Ridge) Non-Zero Coefficients: {sum(l2_model.coef_[0] != 0)} out of 30")

L1 (Lasso) Non-Zero Coefficients: 23 out of 30
L2 (Ridge) Non-Zero Coefficients: 30 out of 30
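
At `C=100` the penalty is weak, so L1 removes only a few features. To see how sparsity depends on regularization strength, the sketch below counts non-zero coefficients for several C values under both penalties. Standardizing the features first is an added assumption (the notebook fits on raw features), made so the penalty treats all features on the same scale.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.2, random_state=30)

# Scaling is an added assumption so that the penalty treats all features alike
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

nonzero = {"l1": {}, "l2": {}}
for C in [0.01, 0.1, 1, 10, 100]:
    for penalty in ["l1", "l2"]:
        clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear", max_iter=3000)
        clf.fit(X_tr_s, y_tr)
        nonzero[penalty][C] = int(np.sum(clf.coef_[0] != 0))
        print(f"C={C:<6} {penalty}: non-zero={nonzero[penalty][C]:2d}, "
              f"test acc={clf.score(X_te_s, y_te):.4f}")
```

Smaller C means a stronger penalty, so the L1 count is expected to fall as C decreases while the L2 count stays at 30.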


In [49]:
# ============================================
# COMPARING ACCURACY
# ============================================


l1_acc = accuracy_score(Y_test, l1_model.predict(X_test))
l2_acc = accuracy_score(Y_test, l2_model.predict(X_test))


print(f"L1 Test Accuracy: {l1_acc:.4f}")
print(f"L2 Test Accuracy: {l2_acc:.4f}")

L1 Test Accuracy: 0.9561
L2 Test Accuracy: 0.9386
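
The task asks for accuracy on both the training and the test set, and the precision, recall, and F1 functions imported earlier are not used above. The sketch below is one way to gather all of these for the two penalties, assuming the same split and settings as the cells above. In this dataset label 1 is the benign class, so precision and recall here describe benign predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.2, random_state=30)

report = {}
for penalty in ["l1", "l2"]:
    clf = LogisticRegression(C=100, penalty=penalty, solver="liblinear", max_iter=3000)
    clf.fit(X_tr, y_tr)
    test_pred = clf.predict(X_te)
    report[penalty] = {
        "train_acc": accuracy_score(y_tr, clf.predict(X_tr)),
        "test_acc": accuracy_score(y_te, test_pred),
        "precision": precision_score(y_te, test_pred),
        "recall": recall_score(y_te, test_pred),
        "f1": f1_score(y_te, test_pred),
    }

for penalty, metrics in report.items():
    print(penalty, {k: round(v, 4) for k, v in metrics.items()})
```

A large gap between `train_acc` and `test_acc` for one penalty would point to higher variance for that model on this split.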
