### Problem Statement

You are a data scientist / AI engineer at a healthcare consulting firm. You have been provided with a dataset named **`"patient_health_data.csv"`**, which includes records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age`: The age of the patient.
- `bmi`: Body Mass Index of the patient.
- `blood_pressure`: The blood pressure of the patient.
- `cholesterol`: Cholesterol levels of the patient.
- `glucose`: Glucose levels of the patient.
- `insulin`: Insulin levels of the patient.
- `heart_rate`: Heart rate of the patient.
- `activity_level`: Activity level of the patient.
- `diet_quality`: Quality of diet of the patient.
- `smoking_status`: Whether the patient smokes (Yes or No).
- `alcohol_intake`: The amount of alcohol intake by the patient.
- `health_risk_score`: A composite score representing the overall health risk of a patient.

Your task is to use this dataset to build a linear regression model to predict the health risk score based on the given predictor variables. Additionally, you will use L1 (Lasso) and L2 (Ridge) regularization techniques to improve the model's performance.

**Import Necessary Libraries**

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score

### Task 1: Data Preparation and Exploration

1. Import the data from the **`"patient_health_data.csv"`** file and store it in a variable df.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset and handle them appropriately.
5. Encode the categorical variable `'smoking_status'` by converting 'Yes' to 1 and 'No' to 0.

In [25]:
# Step 1: Import the data from the "patient_health_data.csv" file and store it in a variable 'df'
df = pd.read_csv('/content/patient_health_data.csv')

# Step 2: Display the number of rows and columns in the dataset
# (print is needed here: a notebook cell only echoes its last expression)
print(df.shape)

# Step 3: Display the first few rows of the dataset to get an overview
display(df.head())

| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | No | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | No | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | Yes | 4.740542 | 187.487398 |
| 3 | 34 | 22.196893 | 153.164775 | 234.594764 | 136.351714 | 15.140336 | 95.348387 | 3.19291 | 2.745585 | No | 2.226231 | 148.773138 |
| 4 | 62 | 29.837173 | 92.768973 | 276.106498 | 158.753516 | 17.228576 | 77.680975 | 7.044026 | 8.918348 | No | 3.944011 | 170.609655 |


In [27]:
# Step 4: Check for any missing values in the dataset and handle them appropriately
df.isnull().sum()

| | 0 |
|---|---|
| age | 0 |
| bmi | 0 |
| blood_pressure | 0 |
| cholesterol | 0 |
| glucose | 0 |
| insulin | 0 |
| heart_rate | 0 |
| activity_level | 0 |
| diet_quality | 0 |
| smoking_status | 0 |
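
The check above reports no missing values for the columns shown, so no handling is needed here. If a column did report a non-zero count, one common option is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below assumes a small toy frame with made-up values (`toy` is a hypothetical name, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; not taken from patient_health_data.csv
toy = pd.DataFrame({
    "age": [58, 71, np.nan, 34],
    "bmi": [24.9, np.nan, 22.3, 22.2],
    "smoking_status": ["No", "No", None, "Yes"],
})

# Numeric columns: fill gaps with each column's median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical column: fill gaps with the most frequent value
most_common = toy["smoking_status"].mode()[0]
toy["smoking_status"] = toy["smoking_status"].fillna(most_common)

print(toy.isnull().sum())
```

Median imputation is only one choice; dropping rows or model-based imputation may suit other datasets better.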


In [28]:
# Step 5: Encode the categorical variable 'smoking_status' by converting 'Yes' to 1 and 'No' to 0.
# .map avoids the downcasting FutureWarning that .replace can raise in recent pandas versions
df['smoking_status'] = df['smoking_status'].str.strip().str.lower().map({'yes': 1, 'no': 0})
display(df['smoking_status'].head())


| | smoking_status |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |


In [29]:
df['smoking_status'].unique()

array([0, 1])

### Task 2: Train Linear Regression Models

1. Select the features and the target variable for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Linear Regression model, and evaluate its performance using R-squared.
4. Initialize and train a Lasso Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.
5. Initialize and train a Ridge Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.

In [22]:
# Step 1: Select the features and target variable for modeling
X = df.drop('health_risk_score', axis=1)
y = df['health_risk_score']

# Step 2: Split the data into training and test sets with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [23]:
# Step 3: Initialize and train a Linear Regression model, and evaluate its performance using R-squared
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
lr_r2 = r2_score(y_test, lr_pred)
print(f"Linear Regression R-squared: {lr_r2}")

Linear Regression R-squared: 0.764362090675749
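
R-squared compares the model's squared errors with those of a baseline that always predicts the mean of the target: it equals 1 minus the ratio of the two. The sketch below recomputes it by hand on made-up numbers (the arrays `y_true` and `y_hat` are hypothetical) and checks the result against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([150.0, 160.0, 187.0, 149.0, 171.0])
y_hat = np.array([155.0, 158.0, 180.0, 152.0, 168.0])

# Residual sum of squares: error of the model
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```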


In [24]:
# Step 4: Initialize and train a Lasso Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
lasso_alphas = [0.01, 0.1, 1.0, 10.0]
for alpha in lasso_alphas:
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    lasso_pred = lasso_model.predict(X_test)
    lasso_r2 = r2_score(y_test, lasso_pred)
    print(f"Lasso Regression (alpha={alpha}) R-squared: {lasso_r2}")

Lasso Regression (alpha=0.01) R-squared: 0.7645437646395714
Lasso Regression (alpha=0.1) R-squared: 0.7660509914802164
Lasso Regression (alpha=1.0) R-squared: 0.7819763683575137
Lasso Regression (alpha=10.0) R-squared: 0.7873364302158369
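
The L1 penalty acts on coefficient magnitudes, so features measured on very different numeric scales are penalized unevenly. A common remedy is to standardize features before fitting. The sketch below is a minimal illustration on synthetic data from `make_regression` (an assumption; it does not use the patient dataset) and counts how many coefficients Lasso sets to exactly zero at each alpha:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 4 of which influence the target
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=0
)

zero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    # Scale inside the pipeline so the penalty treats all features alike
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10_000))
    pipe.fit(X_syn, y_syn)
    coefs = pipe.named_steps["lasso"].coef_
    zero_counts[alpha] = int(np.sum(coefs == 0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of {coefs.size} coefficients are exactly zero")
```

Placing the scaler inside the pipeline means it is fitted only on the data passed to `fit`, which avoids leaking test-set statistics when the same pipeline is used with a train/test split.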


In [26]:
# Step 5: Initialize and train a Ridge Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
ridge_alphas = [0.01, 0.1, 1.0, 10.0]
for alpha in ridge_alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    ridge_pred = ridge_model.predict(X_test)
    ridge_r2 = r2_score(y_test, ridge_pred)
    print(f"Ridge Regression (alpha={alpha}) R-squared: {ridge_r2}")

Ridge Regression (alpha=0.01) R-squared: 0.764363158939054
Ridge Regression (alpha=0.1) R-squared: 0.7643727707489341
Ridge Regression (alpha=1.0) R-squared: 0.7644686367656156
Ridge Regression (alpha=10.0) R-squared: 0.7654030812954534
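
The loops above compare alpha values by their score on the test set, which makes the winning score an optimistic estimate. One common alternative is to choose alpha by cross-validation on the training split and use the test set only once at the end. The sketch below assumes synthetic data and uses `RidgeCV` with the same alpha list:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
X_syn, y_syn = make_regression(n_samples=300, n_features=11, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

alphas = [0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation on the training split picks the alpha
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_tr, y_tr)

# The test split is touched only for the final estimate
held_out_r2 = r2_score(y_te, ridge_cv.predict(X_te))
print(f"alpha chosen by 5-fold CV: {ridge_cv.alpha_}")
print(f"held-out R-squared: {held_out_r2}")
```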


In [30]:
# Elastic Net regression


In [31]:
from sklearn.linear_model import ElasticNet

# Initialize and train an Elastic Net Regression model with various alpha and l1_ratio values
elastic_net_alphas = [0.01, 0.1, 1.0, 10.0]
elastic_net_l1_ratios = [0.2, 0.5, 0.8]

for alpha in elastic_net_alphas:
    for l1_ratio in elastic_net_l1_ratios:
        elastic_net_model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
        elastic_net_model.fit(X_train, y_train)
        elastic_net_pred = elastic_net_model.predict(X_test)
        elastic_net_r2 = r2_score(y_test, elastic_net_pred)
        print(f"Elastic Net Regression (alpha={alpha}, l1_ratio={l1_ratio}) R-squared: {elastic_net_r2}")

Elastic Net Regression (alpha=0.01, l1_ratio=0.2) R-squared: 0.7645575791910162
Elastic Net Regression (alpha=0.01, l1_ratio=0.5) R-squared: 0.7645525258365449
Elastic Net Regression (alpha=0.01, l1_ratio=0.8) R-squared: 0.7645473216789129
Elastic Net Regression (alpha=0.1, l1_ratio=0.2) R-squared: 0.7662415574125775
Elastic Net Regression (alpha=0.1, l1_ratio=0.5) R-squared: 0.7661911479732026
Elastic Net Regression (alpha=0.1, l1_ratio=0.8) R-squared: 0.7661182435598584
Elastic Net Regression (alpha=1.0, l1_ratio=0.2) R-squared: 0.7781061715079659
Elastic Net Regression (alpha=1.0, l1_ratio=0.5) R-squared: 0.7793019551765732
Elastic Net Regression (alpha=1.0, l1_ratio=0.8) R-squared: 0.7807786923510525
Elastic Net Regression (alpha=10.0, l1_ratio=0.2) R-squared: 0.7941512141138394
Elastic Net Regression (alpha=10.0, l1_ratio=0.5) R-squared: 0.7915384948216033
Elastic Net Regression (alpha=10.0, l1_ratio=0.8) R-squared: 0.7884079763157471
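
Twelve printed lines are hard to compare by eye. The sketch below arranges the R-squared values from the output above (rounded to four decimals) into an alpha by l1_ratio grid; the variable names are illustrative:

```python
import pandas as pd

# (alpha, l1_ratio, R-squared) copied from the printed output, rounded to 4 decimals
results = [
    (0.01, 0.2, 0.7646), (0.01, 0.5, 0.7646), (0.01, 0.8, 0.7645),
    (0.1, 0.2, 0.7662), (0.1, 0.5, 0.7662), (0.1, 0.8, 0.7661),
    (1.0, 0.2, 0.7781), (1.0, 0.5, 0.7793), (1.0, 0.8, 0.7808),
    (10.0, 0.2, 0.7942), (10.0, 0.5, 0.7915), (10.0, 0.8, 0.7884),
]
long = pd.DataFrame(results, columns=["alpha", "l1_ratio", "r2"])

# Rows are alpha values, columns are l1_ratio values
table = long.pivot(index="alpha", columns="l1_ratio", values="r2")
print(table)

# Best combination on this test split
best_row = long.loc[long["r2"].idxmax()]
print(best_row)
```

On this split the highest score is at `alpha=10.0`, `l1_ratio=0.2`. Because these scores come from the test set itself, the ranking is a comparison rather than an unbiased estimate of future performance.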


The two hyperparameters used above, **alpha** and **l1\_ratio**, control how **Elastic Net** regularizes the model. Here is what each one does, in simple terms.

---

### 🔑 Elastic Net Basics

Elastic Net is a regression method that **combines**:

* **Lasso (L1 regularization)** → pushes some coefficients to **zero** (feature selection)
* **Ridge (L2 regularization)** → shrinks coefficients but doesn’t remove them

So Elastic Net is a **mix** of Ridge and Lasso.
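
For reference, the objective minimized by scikit-learn's `ElasticNet` can be written as follows, where $n$ is the number of samples, $w$ is the coefficient vector, $\alpha$ is `alpha`, and $\rho$ stands for `l1_ratio` (the symbol $\rho$ is a notational choice made here):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \, \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \, \lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 term (Lasso), and setting $\rho = 0$ leaves only the L2 term.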

---

### ⚙️ Parameters

#### 1. **alpha**

* Controls the **overall strength** of regularization (both L1 & L2).
* Higher `alpha` = **more shrinkage** = smaller coefficients = simpler model.
* Lower `alpha` = less shrinkage = coefficients closer to ordinary linear regression.

👉 Think of `alpha` as a **knob for how much penalty to apply**.

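* `alpha = 0` → equivalent to ordinary least squares (no penalty). In scikit-learn, use `LinearRegression` for this case; fitting `Lasso` or `ElasticNet` with `alpha = 0` is not advised for numerical reasons.
* Large `alpha` → coefficients shrink toward zero, and with an L1 component many become exactly zero.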
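
The effect of `alpha` can be seen by tracking the total coefficient magnitude as it grows. The sketch below assumes synthetic, standardized data (not the patient dataset) and a fixed `l1_ratio` of 0.5:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data, standardized so the penalty treats all features alike
X_syn, y_syn = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_std = StandardScaler().fit_transform(X_syn)

coef_norms = []
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10_000).fit(X_std, y_syn)
    # Sum of absolute coefficients: a simple measure of overall shrinkage
    coef_norms.append(float(np.abs(model.coef_).sum()))
    print(f"alpha={alpha}: sum of |coefficients| = {coef_norms[-1]:.2f}")
```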

---

#### 2. **l1\_ratio**

* Controls the **balance between L1 (Lasso) and L2 (Ridge)**.
* Ranges between **0 and 1**:

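  * `l1_ratio = 0` → L2 penalty only (Ridge-style). In scikit-learn the `ElasticNet` loss is scaled by the number of samples, so the same `alpha` does not give the same fit as `Ridge`.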
  * `l1_ratio = 1` → pure Lasso (only L1).
  * `0 < l1_ratio < 1` → mix of both (Elastic Net).

👉 Example:

* `l1_ratio = 0.3` → 30% Lasso + 70% Ridge.
* `l1_ratio = 0.7` → 70% Lasso + 30% Ridge.
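
To see `l1_ratio` shift the penalty toward L1, the sketch below holds `alpha` at 1.0 and counts exact zeros. It assumes an idealized setup chosen to make the outcome predictable: a noise-free target and an orthogonal design built with a QR decomposition (`X_orth`, `w_true`, and the other names are hypothetical). In that setting a coefficient is zeroed when its true size is below `alpha * l1_ratio`:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5

# Orthogonal columns scaled so that X.T @ X / n_samples is the identity
Q, _ = np.linalg.qr(rng.normal(size=(n_samples, n_features)))
X_orth = Q * np.sqrt(n_samples)

# Noise-free target built from known coefficients
w_true = np.array([5.0, 2.0, 0.6, 0.3, 0.0])
y_orth = X_orth @ w_true

zeros_by_ratio = {}
for l1_ratio in [0.1, 0.5, 0.9]:
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, fit_intercept=False, max_iter=10_000)
    model.fit(X_orth, y_orth)
    # Count coefficients that the L1 part set to exactly zero
    zeros_by_ratio[l1_ratio] = int(np.sum(model.coef_ == 0))
    print(f"l1_ratio={l1_ratio}: {zeros_by_ratio[l1_ratio]} coefficients are exactly zero")
```

With real, correlated features the pattern is less tidy, but the tendency is the same: a higher `l1_ratio` removes more features, while a lower one keeps them and shrinks them.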

---

### ⚖️ Summary

* **alpha** → how strong the penalty is (bigger = stronger).
* **l1\_ratio** → what proportion of that penalty is L1 vs L2.

