# Lab 1 - Logistic Regression with Regularization
## COSC 4337 - Data Science II
***

### 1) Introduction

In this lab, we will build upon our knowledge of Logistic Regression by introducing **regularization**. Regularization is a technique used to prevent overfitting by penalizing large coefficients in the model. This helps create a more generalizable model that performs better on unseen data.

We will explore two common types of regularization:
- **L1 Regularization (the Lasso penalty)**: Adds a penalty proportional to the sum of the *absolute values* of the coefficients. This can shrink some coefficients to exactly zero, effectively performing feature selection.
- **L2 Regularization (the Ridge penalty)**: Adds a penalty proportional to the sum of the *squared* coefficients. This shrinks coefficients towards zero but rarely sets them exactly to zero.
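
To make the two penalties concrete, the short sketch below computes each one for a small coefficient vector `w`. The values in `w` are made up for illustration and are not coefficients from this lab.

```python
import numpy as np

# Hypothetical coefficient vector, chosen only for illustration.
w = np.array([0.5, -2.0, 0.0, 3.0])

# L1 penalty: sum of the absolute values of the coefficients.
l1_penalty = np.sum(np.abs(w))

# L2 penalty: sum of the squared coefficients.
l2_penalty = np.sum(w ** 2)

print(f"L1 penalty: {l1_penalty}")  # 5.5
print(f"L2 penalty: {l2_penalty}")  # 13.25
```

The largest coefficient (3.0) contributes 9.0 of the 13.25 L2 total but only 3.0 of the 5.5 L1 total, which illustrates why the L2 penalty pushes hardest on large weights.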

We will use the **Wisconsin Breast Cancer dataset**, where the goal is to predict whether a tumor is malignant or benign based on various measurements.

#### Dataset Variable Information
| Feature           | Description                                              |
|-------------------|----------------------------------------------------------|
| **`target`**      | Diagnosis (0 = malignant, 1 = benign)                    |
| `mean radius`     | Mean distance from the center to points on the perimeter |
| `mean texture`    | Standard deviation of gray-scale values                  |
| `mean perimeter`  | Mean perimeter of the cell nuclei                        |
| `mean area`       | Mean area of the cell nuclei                             |
| `mean smoothness` | Mean local variation in radius lengths                   |
| ...               | (and 25 other features computed from the cell nuclei)    |

### 2) Requirements

For this assignment, we will use the breast cancer dataset available directly from the `scikit-learn` library. No external files are needed.

***
**NOTE**: Actions you need to perform are marked with a leading `***`. A `***` on a line by itself is only a separator.

#### 2.1) Initial Imports
***
*** Begin by importing the necessary packages for the exercise.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')

### 3) Data Loading and Preparation
***
We'll load the dataset from `scikit-learn` and place it into a pandas DataFrame for easier manipulation.

In [2]:
# Load the dataset
cancer = load_breast_cancer()

# Create a DataFrame with features
feats = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Create a DataFrame for the target variable
target = pd.DataFrame(cancer.target, columns=['target'])

#### 3.1) Data Exploration
***
*** Take a look at the feature and target data using `.head()` to understand its structure.

In [5]:
# *** YOUR CODE HERE ***
target.head()

|   | target |
|---|--------|
| 0 | 0      |
| 1 | 0      |
| 2 | 0      |
| 3 | 0      |
| 4 | 0      |
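
The output above previews only the target. As a minimal sketch of the rest of this exploration step, the block below also previews the features and counts the samples in each class. It reloads the data so that it runs on its own; inside the notebook you would reuse `feats` and `target` from Section 3. The name `counts` is introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Reload the data so this sketch is self-contained.
cancer = load_breast_cancer()
feats = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['target'])

# First rows of the 30 feature columns.
print(feats.head())

# Number of samples per class, relabelled with the class names.
counts = target['target'].value_counts().sort_index()
counts.index = [cancer.target_names[i] for i in counts.index]
print(counts)
```

The class counts show that the dataset is moderately imbalanced, which is worth keeping in mind when reading accuracy figures later on.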


#### 3.2) Train-Test Split
***
Now, we will split our data into training and testing sets. This ensures we evaluate our model on data it has never seen before. We will use 20% of the data for testing.

We will also use **cross-validation** on the training set to find the best regularization hyperparameter. `LogisticRegressionCV` handles this automatically.

In [6]:
# Split the data into training and testing sets
test_size = 0.2
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size=test_size, random_state=random_state)

*** Let's verify the shapes of our new datasets to ensure the split was successful.

In [7]:
# Print the shapes of the training and testing sets
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_test: {y_test.shape}')

Shape of X_train: (455, 30)
Shape of y_train: (455, 1)
Shape of X_test: (114, 30)
Shape of y_test: (114, 1)


### 4) Model Training with Regularization and Cross-Validation
***
We will now train two Logistic Regression models using `LogisticRegressionCV`, which performs cross-validation to find the best regularization strength `C`.

- **`model_l1`**: Uses L1 regularization (`penalty='l1'`). This requires a solver that supports L1, such as `'liblinear'` or `'saga'`; the default `'lbfgs'` solver does not.
- **`model_l2`**: Uses L2 regularization (`penalty='l2'`), which is the default.

The parameter `Cs` defines the grid of values for `C` (which is the *inverse* of regularization strength, so smaller `C` means stronger regularization) that the cross-validation will test.
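
As a quick illustration of what such a grid looks like, the sketch below prints the ten values produced by `np.logspace(-4, 4, 10)`, the grid used in this lab, next to the corresponding regularization strength `1/C`.

```python
import numpy as np

# 10 values from 1e-4 to 1e4, evenly spaced on a log10 scale.
Cs = np.logspace(-4, 4, 10)

for C in Cs:
    # Smaller C means a larger 1/C, i.e. stronger regularization.
    print(f"C = {C:12.4f}   regularization strength 1/C = {1 / C:12.4f}")
```

Because the values are spaced logarithmically, the search covers eight orders of magnitude with only ten candidates.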

*** Instantiate and fit both an L1 and an L2 regularized model.

In [14]:
# *** YOUR CODE HERE ***
# Define the grid of C values to test
Cs = np.logspace(-4, 4, 10)

# L1-regularized logistic regression
model_l1 = LogisticRegressionCV(
    Cs=Cs,
    cv=5,
    penalty='l1',
    solver='liblinear',
    scoring='accuracy',
    max_iter=10000,
    random_state=42
)
model_l1.fit(X_train, y_train.values.ravel())

# L2-regularized logistic regression
model_l2 = LogisticRegressionCV(
    Cs=Cs,
    cv=5,
    penalty='l2',
    solver='lbfgs',
    scoring='accuracy',
    max_iter=10000,
    random_state=42
)
model_l2.fit(X_train, y_train.values.ravel())

# Evaluate both Models:
print("L1 Test Accuracy:", model_l1.score(X_test, y_test))
print("L2 Test Accuracy:", model_l2.score(X_test, y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=10000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


L1 Test Accuracy: 0.9736842105263158
L2 Test Accuracy: 0.9736842105263158
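
The convergence warning in the output above comes from the `lbfgs` solver, and the message itself suggests scaling the data. One possible way to do that, sketched here as an assumption and not as part of the original lab, is to standardize the features inside a pipeline before fitting. The name `scaled_l2` is hypothetical, and the block reloads and re-splits the data so it can run on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Recreate the split from Section 3.2.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features, then fit the same L2 model as in Section 4.
scaled_l2 = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 10),
        cv=5,
        penalty='l2',
        solver='lbfgs',
        scoring='accuracy',
        max_iter=10000,
    ),
)
scaled_l2.fit(X_train, y_train)

scaled_accuracy = scaled_l2.score(X_test, y_test)
print("Scaled L2 test accuracy:", scaled_accuracy)
print("Best C on scaled features:", scaled_l2[-1].C_[0])
```

Note that the best `C` found on standardized features is not directly comparable to the value found on the raw features, because the penalty acts on coefficients whose scale depends on the feature scale.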


#### 4.1) Optimal Hyperparameters
***
The fitted `LogisticRegressionCV` object stores the best value of `C` found during cross-validation in its `C_` attribute. `C_` is an array, so for this binary problem we read its single element with `C_[0]`.

*** Print the best value of `C` for each model.

In [15]:
# *** YOUR CODE HERE ***
print("Best C for L1:", model_l1.C_[0])
print("Best C for L2:", model_l2.C_[0])


Best C for L1: 21.54434690031882
Best C for L2: 166.81005372000558


### 5) Model Evaluation
***
Now we'll use our trained models to make predictions on the held-out test set (`X_test`) and compare them to the true labels (`y_test`).
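
One way this step might look is sketched below. To stay self-contained and quick to run, it refits stand-in models (`demo_l1`, `demo_l2`) on standardized features with a smaller `Cs` grid; these names, the helper `fit_demo`, the scaling step, and the grid are assumptions made for this example. In the notebook you would call `.predict(X_test)` directly on `model_l1` and `model_l2`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_demo(penalty, solver):
    """Hypothetical helper: fit a scaled, cross-validated stand-in model."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(
            Cs=np.logspace(-2, 2, 5),  # reduced grid, for speed only
            cv=5,
            penalty=penalty,
            solver=solver,
            max_iter=10000,
            random_state=42,
        ),
    )
    return model.fit(X_train, y_train)

demo_l1 = fit_demo('l1', 'liblinear')
demo_l2 = fit_demo('l2', 'lbfgs')

# Predicted class labels (0 = malignant, 1 = benign) for the test set.
y_pred_l1 = demo_l1.predict(X_test)
y_pred_l2 = demo_l2.predict(X_test)

print("True labels (first 10):", y_test[:10])
print("L1 predictions        :", y_pred_l1[:10])
print("L2 predictions        :", y_pred_l2[:10])
print("Samples where the models disagree:", int(np.sum(y_pred_l1 != y_pred_l2)))
```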

In [None]:
# *** YOUR CODE HERE ***
# Make predictions using both models




#### 5.1) Evaluation Metrics
***
Let's calculate the key evaluation metrics for both models to see how they performed.

*** Print the accuracy, precision, recall, and f1-score for each model.
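
The four metrics can be collected with a small helper, sketched below. The function name `summarize` and the toy labels are made up for illustration; only the `sklearn.metrics` functions are real library calls.

```python
from sklearn import metrics

def summarize(y_true, y_pred, pos_label=1):
    """Hypothetical helper: return the four metrics as a dict."""
    return {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "precision": metrics.precision_score(y_true, y_pred, pos_label=pos_label),
        "recall": metrics.recall_score(y_true, y_pred, pos_label=pos_label),
        "f1": metrics.f1_score(y_true, y_pred, pos_label=pos_label),
    }

# Made-up labels for illustration only (1 = benign, 0 = malignant).
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

scores = summarize(y_true, y_pred)
for name, value in scores.items():
    print(f"{name:>9}: {value:.3f}")
```

In the notebook, the same helper could be called as `summarize(y_test.values.ravel(), model_l1.predict(X_test))` and likewise for `model_l2`. By default, precision, recall, and F1 treat class 1 (benign) as the positive class; pass `pos_label=0` to report them for the malignant class instead.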

In [None]:
# *** YOUR CODE HERE ***




### 6) Feature Importances
***
A key difference between L1 and L2 regularization is their effect on the model's coefficients. Let's examine them.

- **L1** tends to produce **sparse** coefficients, driving the weights of less important features to exactly zero.
- **L2** penalizes large coefficients but only shrinks them towards zero, rarely making them exactly zero.
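
A side-by-side view of the coefficients makes this difference easy to check. The sketch below is self-contained, so it refits stand-in models (`demo_l1`, `demo_l2`) on standardized features with a reduced `Cs` grid; those names, the helper `fit_demo`, the scaling step, and the grid are assumptions for this example, and the resulting coefficients are on the standardized scale. In the notebook, `model_l1.coef_[0]` and `model_l2.coef_[0]` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

def fit_demo(penalty, solver):
    """Hypothetical helper: fit a scaled, cross-validated stand-in model."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(
            Cs=np.logspace(-2, 2, 5),  # reduced grid, for speed only
            cv=5, penalty=penalty, solver=solver,
            max_iter=10000, random_state=42,
        ),
    )
    return model.fit(X_train, y_train)

demo_l1 = fit_demo('l1', 'liblinear')
demo_l2 = fit_demo('l2', 'lbfgs')

# coef_ has shape (1, n_features) for a binary problem; take the single row.
coefs = pd.DataFrame({
    'feature': cancer.feature_names,
    'l1_coef': demo_l1[-1].coef_[0],
    'l2_coef': demo_l2[-1].coef_[0],
})

# Count coefficients that are exactly zero under each penalty.
n_zero_l1 = int((coefs['l1_coef'] == 0).sum())
n_zero_l2 = int((coefs['l2_coef'] == 0).sum())

print(coefs.sort_values('l1_coef', key=np.abs, ascending=False).to_string(index=False))
print(f"Exactly-zero coefficients: L1 = {n_zero_l1}, L2 = {n_zero_l2}")
```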

#### 6.1) L1 Model Coefficients
***
*** Create a DataFrame to view the feature names and their corresponding coefficients from the L1 model. Notice how many have become zero.

In [None]:
# *** YOUR CODE HERE ***




#### 6.2) L2 Model Coefficients
***
*** Now, do the same for the L2 model. Observe that while many coefficients are small, none (or very few) are exactly zero.

In [None]:
# *** YOUR CODE HERE ***




### 7) Conclusion
***
In this lab, we have seen how to create logistic regression models that include regularization. We used cross-validation to tune the regularization hyperparameter `C` and compared the performance of L1 and L2 penalties.

Regularization helps machine learning models generalize to new data by limiting overfitting. We also observed that L1 regularization can act as a form of feature selection, which is especially useful when a model has a large number of features.