### Problem Statement

You are a data scientist / AI engineer at a healthcare consulting firm. You have been provided with a dataset named **`"patient_health_data.csv"`**, which includes records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age`: The age of the patient.
- `bmi`: Body Mass Index of the patient.
- `blood_pressure`: The blood pressure of the patient.
- `cholesterol`: Cholesterol levels of the patient.
- `glucose`: Glucose levels of the patient.
- `insulin`: Insulin levels of the patient.
- `heart_rate`: Heart rate of the patient.
- `activity_level`: Activity level of the patient.
- `diet_quality`: Quality of diet of the patient.
- `smoking_status`: Whether the patient smokes (Yes or No).
- `alcohol_intake`: The amount of alcohol intake by the patient.
- `health_risk_score`: A composite score representing the overall health risk of a patient.

Your task is to use this dataset to build a linear regression model to predict the health risk score based on the given predictor variables. Additionally, you will use L1 (Lasso) and L2 (Ridge) regularization techniques to improve the model's performance. 
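
For reference, the three models differ only in the penalty added to the squared-error loss. The block below is a sketch of the objectives in the scaling that scikit-learn documents for `LinearRegression`, `Lasso`, and `Ridge`; the symbols ($n$ samples, weight vector $w$, regularization strength $\alpha$) are notation assumed here, and the intercept is omitted for brevity.

```latex
\begin{aligned}
\text{Linear regression:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Lasso (L1):}\quad & \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge (L2):}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
```

Because the Lasso loss is divided by $2n$ and the Ridge loss is not, the same numeric `alpha` does not represent the same amount of regularization in the two models.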

**Import Necessary Libraries**

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

### Task 1: Data Preparation and Exploration

1. Import the data from the **`"patient_health_data.csv"`** file and store it in a variable `df`.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset and handle them appropriately.
5. Encode the categorical variable `'smoking_status'` by converting 'Yes' to 1 and 'No' to 0.

In [20]:
# Step 1: Import the data from the "patient_health_data.csv" file and store it in a variable 'df'
df = pd.read_csv("patient_health_data.csv")

# Step 2: Display the number of rows and columns in the dataset
print("Number of rows and columns:", df.shape)

# Step 3: Display the first few rows of the dataset to get an overview
print("First few rows of the dataset:")
df.head()

Number of rows and columns: (250, 12)
First few rows of the dataset:


,age,bmi,blood_pressure,cholesterol,glucose,insulin,heart_rate,activity_level,diet_quality,smoking_status,alcohol_intake,health_risk_score
0,58,24.865215,122.347094,165.730375,149.289441,22.306844,75.866391,1.180237,7.675409,No,0.824123,150.547752
1,71,19.103168,136.852028,260.610781,158.584646,13.869817,69.481114,7.634622,8.933057,No,0.85291,160.32035
2,48,22.316562,137.592457,177.342582,178.760166,22.849816,69.386962,7.917398,3.501119,Yes,4.740542,187.487398
3,34,22.196893,153.164775,234.594764,136.351714,15.140336,95.348387,3.19291,2.745585,No,2.226231,148.773138
4,62,29.837173,92.768973,276.106498,158.753516,17.228576,77.680975,7.044026,8.918348,No,3.944011,170.609655


In [21]:
# Step 4: Check for any missing values in the dataset and handle them appropriately
# (every count in the output is zero, so no imputation or row removal is needed)
df.isna().sum()

age                  0
bmi                  0
blood_pressure       0
cholesterol          0
glucose              0
insulin              0
heart_rate           0
activity_level       0
diet_quality         0
smoking_status       0
alcohol_intake       0
health_risk_score    0
dtype: int64
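
This dataset has no missing values, so the "handle them appropriately" part of Step 4 is a no-op here. If gaps were present, one common option is sketched below on a made-up toy frame: numeric columns are filled with the median and other columns with the most frequent value. The helper name `fill_missing` and the toy values are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "age": [58, 71, np.nan, 34],
    "bmi": [24.9, np.nan, 22.3, 22.2],
    "smoking_status": ["No", None, "Yes", "No"],
})

def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if filled[col].isna().any():
            if pd.api.types.is_numeric_dtype(filled[col]):
                filled[col] = filled[col].fillna(filled[col].median())
            else:
                filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

cleaned = fill_missing(toy)
print(cleaned.isna().sum())
```

Median and mode imputation are only one choice; dropping incomplete rows with `dropna()` is a reasonable alternative when few rows are affected.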

In [23]:
# Step 5: Encode the categorical variable 'smoking_status' by converting 'Yes' to 1 and 'No' to 0.
df["smoking_status"] = df["smoking_status"].apply(lambda x: 1 if x == "Yes" else 0)
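
An encoding that sends every value other than 'Yes' to 0 also sends misspelled or unexpected labels to 0 without warning. The sketch below, on a hypothetical sample series, shows an alternative based on `Series.map` with an explicit dictionary, which leaves unknown labels as `NaN` so they can be inspected. The sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample values; the last entry is a deliberately unexpected label.
status = pd.Series(["Yes", "No", "No", "yes"])

# Explicit mapping: labels missing from the dictionary become NaN.
encoded = status.map({"Yes": 1, "No": 0})

# List the labels that were not mapped instead of silently treating them as 0.
unexpected = status[encoded.isna()].unique().tolist()
print(unexpected)
```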

In [24]:
df

,age,bmi,blood_pressure,cholesterol,glucose,insulin,heart_rate,activity_level,diet_quality,smoking_status,alcohol_intake,health_risk_score
0,58,24.865215,122.347094,165.730375,149.289441,22.306844,75.866391,1.180237,7.675409,0,0.824123,150.547752
1,71,19.103168,136.852028,260.610781,158.584646,13.869817,69.481114,7.634622,8.933057,0,0.852910,160.320350
2,48,22.316562,137.592457,177.342582,178.760166,22.849816,69.386962,7.917398,3.501119,1,4.740542,187.487398
3,34,22.196893,153.164775,234.594764,136.351714,15.140336,95.348387,3.192910,2.745585,0,2.226231,148.773138
4,62,29.837173,92.768973,276.106498,158.753516,17.228576,77.680975,7.044026,8.918348,0,3.944011,170.609655
...,...,...,...,...,...,...,...,...,...,...,...,...
245,73,33.923019,154.961623,239.257347,175.833417,11.178057,99.249455,8.894246,1.837274,0,3.200992,162.542038
246,51,18.666168,83.047215,200.890687,82.041575,21.733684,78.995462,4.904205,1.264277,0,1.492175,136.146456
247,51,25.105083,166.721498,235.416528,168.392792,2.207699,75.301051,2.447634,8.406626,0,1.201912,156.758986
248,43,34.448869,115.414667,283.119072,179.062287,15.533491,64.793160,6.022381,2.887777,0,1.559495,204.649145


In [25]:
df.sample(3)

,age,bmi,blood_pressure,cholesterol,glucose,insulin,heart_rate,activity_level,diet_quality,smoking_status,alcohol_intake,health_risk_score
154,43,18.66019,100.722765,170.867429,142.782069,24.481727,91.73629,9.894477,5.90595,0,0.496511,128.780001
47,71,21.88535,122.609148,257.768591,146.034404,20.142561,65.02966,5.31419,6.394321,1,4.399313,180.297445
39,47,24.758724,153.703556,284.491539,151.585564,4.954321,60.695192,3.902752,8.868362,0,1.092549,175.756101


### Task 2: Train Linear Regression Models

1. Select the features and the target variable for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Linear Regression model, and evaluate its performance using R-squared.
4. Initialize and train a Lasso Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.
5. Initialize and train a Ridge Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.

In [26]:
# Imports for modeling and evaluation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# The Linear Regression model is created here and trained in Step 3
model = LinearRegression()

# Step 1: Select the features and target variable for modeling
X = df.drop(['health_risk_score'], axis=1)
y = df['health_risk_score']

# Step 2: Split the data into training and test sets with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
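
To check what a 25% test split means for 250 rows, the sketch below repeats the same `train_test_split` call on a synthetic array with the shape of the feature matrix (250 rows, 11 features). The random data and the `*_demo` names are assumptions for illustration; only the shapes carry over to the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the feature matrix (250 rows, 11 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(250, 11))
y_demo = rng.normal(size=250)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)
```

With a fractional `test_size`, scikit-learn rounds the test set up, so 250 rows give 63 test rows and 187 training rows.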


In [30]:
# Step 3: Initialize and train a Linear Regression model, and evaluate its performance using R-squared
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2_score(y_test, y_pred)

0.764362090675749
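
R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand on a few made-up numbers and compares the result with `r2_score`; the arrays are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true = np.array([150.0, 160.0, 187.0, 149.0, 171.0])
y_hat = np.array([152.0, 158.0, 180.0, 151.0, 169.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

The `.score(X_test, y_test)` method used for the Lasso and Ridge models returns this same quantity, so the numbers reported for all three models are directly comparable.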

In [35]:
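# Step 4: Initialize and train a Lasso Regression model for each alpha value, and evaluate its performance using R-squared
from sklearn.linear_model import Lasso

lasso_alphas = [0.01, 0.1, 1.0, 10.0]
for alpha in lasso_alphas:
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    lasso_r2 = lasso_model.score(X_test, y_test)
    print(f"Lasso Regression R-squared (alpha={alpha}):", lasso_r2)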

Lasso Regression R-squared (alpha=0.01): 0.7645437646395712
Lasso Regression R-squared (alpha=0.1): 0.7660509914802163
Lasso Regression R-squared (alpha=1.0): 0.7819763683575139
Lasso Regression R-squared (alpha=10.0): 0.7873364302158369


In [37]:
# Step 4 (repeated): the same Lasso loop as the previous cell, so the scores are identical.
# This cell also imports Ridge for Step 5 and leaves `lass` holding the last model fitted (alpha=10.0).
from sklearn.linear_model import Lasso, Ridge

alpha_scr = [0.01, 0.1, 1.0, 10.0]
for i in alpha_scr:
    lass = Lasso(alpha=i)
    lass.fit(X_train, y_train)
    lass_scr = lass.score(X_test, y_test)
    print(lass_scr)

0.7645437646395712
0.7660509914802163
0.7819763683575139
0.7873364302158369


In [38]:
# Step 5: Initialize and train a Ridge Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
alpha_scr = [0.01, 0.1, 1.0, 10.0]
for i in alpha_scr:
    rdg = Ridge(alpha=i)
    rdg.fit(X_train, y_train)
    rdg_scr = rdg.score(X_test, y_test)
    print(rdg_scr)

0.764363158939054
0.7643727707489341
0.7644686367656155
0.7654030812954535


In [39]:
# Coefficients of the last Ridge model fitted in the loop (alpha=10.0)
rdg.coef_

array([ 0.34392938,  0.10347539,  0.29308944,  0.18334059,  0.52992792,
        0.39522627, -0.54453356, -0.67791659, -1.11179359,  0.39935093,
       -0.968424  ])

In [40]:
# Coefficients of the last Lasso model fitted in the loop (alpha=10.0)
lass.coef_

array([ 0.28820889,  0.        ,  0.30265323,  0.17999586,  0.52419905,
        0.14579411, -0.4822855 , -0.        , -0.        ,  0.        ,
       -0.        ])

In [41]:
# Coefficients of the unregularized Linear Regression model
model.coef_

array([ 0.34418045,  0.10327706,  0.29276989,  0.18342512,  0.53001696,
        0.39607504, -0.54400426, -0.68687515, -1.12172705,  0.49876073,
       -1.00163997])
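
The raw arrays are easier to read once each coefficient is paired with its feature name. The sketch below copies the three arrays printed above into a table and lists the features whose Lasso coefficient is exactly zero. It assumes the coefficients follow the column order of `X` (the dataset columns without `health_risk_score`) and that `rdg` and `lass` are the models from the last loop iteration, that is, alpha=10.0.

```python
import pandas as pd

# Feature order assumed to match X = df.drop(['health_risk_score'], axis=1).
features = [
    "age", "bmi", "blood_pressure", "cholesterol", "glucose", "insulin",
    "heart_rate", "activity_level", "diet_quality", "smoking_status", "alcohol_intake",
]

# Values copied from the coef_ outputs above.
coefs = pd.DataFrame(
    {
        "linear": [0.34418045, 0.10327706, 0.29276989, 0.18342512, 0.53001696,
                   0.39607504, -0.54400426, -0.68687515, -1.12172705, 0.49876073,
                   -1.00163997],
        "ridge_alpha_10": [0.34392938, 0.10347539, 0.29308944, 0.18334059, 0.52992792,
                           0.39522627, -0.54453356, -0.67791659, -1.11179359, 0.39935093,
                           -0.968424],
        "lasso_alpha_10": [0.28820889, 0.0, 0.30265323, 0.17999586, 0.52419905,
                           0.14579411, -0.4822855, -0.0, -0.0, 0.0, -0.0],
    },
    index=features,
)

# Features that Lasso removed from the model by shrinking their coefficient to zero.
dropped = coefs.index[coefs["lasso_alpha_10"] == 0].tolist()
print(coefs.round(3))
print("Zeroed by Lasso:", dropped)
```

Read this way, Ridge at alpha=10.0 shrinks the coefficients only slightly relative to plain linear regression, while Lasso at alpha=10.0 removes five of the eleven features and still reaches the highest test R-squared among the runs shown.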