### Problem Statement

You are a data scientist / AI engineer at a healthcare consulting firm. You have been provided with a dataset named **`"patient_health_data.csv"`**, which includes records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age`: The age of the patient.
- `bmi`: Body Mass Index of the patient.
- `blood_pressure`: The blood pressure of the patient.
- `cholesterol`: Cholesterol level of the patient.
- `glucose`: Glucose level of the patient.
- `insulin`: Insulin level of the patient.
- `heart_rate`: Heart rate of the patient.
- `activity_level`: Activity level of the patient.
- `diet_quality`: Quality of the patient's diet.
- `smoking_status`: Whether the patient smokes (Yes or No).
- `alcohol_intake`: The patient's alcohol intake.
- `health_risk_score`: A composite score representing the overall health risk of the patient.

Your task is to use this dataset to build a linear regression model that predicts `health_risk_score` from the other columns. You will also fit L1 (Lasso) and L2 (Ridge) regularized models and check whether regularization improves performance on the test set.
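
For reference, the three models differ only in the penalty added to the least-squares loss. The formulas below follow the objectives documented for scikit-learn's `LinearRegression`, `Lasso`, and `Ridge`; the symbols are notation introduced here, not taken from the dataset: $X$ is the feature matrix, $y$ the target vector, $w$ the coefficient vector, $n$ the number of training rows, and $\alpha$ the `alpha` argument.

$$
\begin{aligned}
\text{Linear regression:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 \\
\text{Lasso (L1):} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge (L2):} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
$$

Because the Lasso loss is divided by $2n$ and the Ridge loss is not, the same numeric `alpha` does not mean the same penalty strength in the two models.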

**Import Necessary Libraries**

In [22]:
# Import necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import scipy.stats as stats

### Task 1: Data Preparation and Exploration

1. Import the data from the **`"patient_health_data.csv"`** file and store it in a variable df.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset and handle them appropriately.
5. Encode the categorical variable `'smoking_status'` by converting 'Yes' to 1 and 'No' to 0.

In [23]:
# Step 1: Import the data from the "patient_health_data.csv" file and store it in a variable 'df'
df = pd.read_csv('patient_health_data.csv')

In [24]:
# Step 2: Display the number of rows and columns in the dataset
df.shape

(250, 12)

In [25]:
# Step 3: Display the first few rows of the dataset to get an overview
df.head()

| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | No | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | No | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | Yes | 4.740542 | 187.487398 |
| 3 | 34 | 22.196893 | 153.164775 | 234.594764 | 136.351714 | 15.140336 | 95.348387 | 3.19291 | 2.745585 | No | 2.226231 | 148.773138 |
| 4 | 62 | 29.837173 | 92.768973 | 276.106498 | 158.753516 | 17.228576 | 77.680975 | 7.044026 | 8.918348 | No | 3.944011 | 170.609655 |


In [26]:
# Step 4: Check for any missing values in the dataset and handle them appropriately
df.isnull().sum()

age                  0
bmi                  0
blood_pressure       0
cholesterol          0
glucose              0
insulin              0
heart_rate           0
activity_level       0
diet_quality         0
smoking_status       0
alcohol_intake       0
health_risk_score    0
dtype: int64
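
No column has missing values, so nothing needs to be handled here. If some had appeared, one common option is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below shows that on a small made-up frame; `toy` and its values are hypothetical and are not part of the patient dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (the real dataset has none).
toy = pd.DataFrame({
    "bmi": [24.9, np.nan, 22.3, 29.8],
    "smoking_status": ["No", "Yes", None, "No"],
})

# Numeric columns: fill each gap with that column's median.
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Remaining (categorical) columns: fill with the most frequent value.
cat_cols = toy.columns.difference(num_cols)
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

# Total count of missing cells after imputation.
print(toy.isnull().sum().sum())
```

Median and mode imputation are only one choice; dropping rows with `dropna()` is another when few rows are affected.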

In [27]:
# Step 5: Encode the categorical variable 'smoking_status' by converting 'Yes' to 1 and 'No' to 0.
# Assign the result back to the column. Calling replace(..., inplace=True) on df.smoking_status
# raises a pandas FutureWarning and will not modify df in pandas 3.0.
df['smoking_status'] = df['smoking_status'].map({'Yes': 1, 'No': 0})


In [28]:
df.smoking_status.unique()

array([0, 1])

### Task 2: Train Linear Regression Models

1. Select the features and the target variable for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Linear Regression model, and evaluate its performance using R-squared.
4. Initialize and train a Lasso Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.
5. Initialize and train a Ridge Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.

In [38]:
# Step 1: Select the features and target variable for modeling
x = df.drop('health_risk_score', axis=1)
y = df['health_risk_score']

# Step 2: Split the data into training and test sets with a test size of 25%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

In [40]:
print("Train: " + str(x_train.shape[0]))
print("Test: " + str(x_test.shape[0]))

Train: 187
Test: 63


In [41]:
# Step 3: Initialize and train a Linear Regression model, and evaluate its performance using R-squared

model = LinearRegression()
model.fit(x_train, y_train)

In [52]:
model.score(x_test, y_test)

0.764362090675749
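
`model.score` returns R-squared only. The notebook also imports `mean_absolute_error` and `mean_squared_error`, which report the error in the units of the target. The sketch below shows how the three metrics could be computed together. It runs on synthetic stand-in data so that it is self-contained; `X_demo`, `y_demo`, and `demo_model` are hypothetical names, and the printed numbers say nothing about the patient data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 250 rows, 3 features, linear target plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(250, 3))
y_demo = 150 + X_demo @ np.array([5.0, -3.0, 2.0]) + rng.normal(scale=4.0, size=250)

# Same split settings as the notebook: 25% test, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
demo_model = LinearRegression().fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

mae = mean_absolute_error(y_te, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_te, y_pred))   # root of the mean squared error
r2 = r2_score(y_te, y_pred)                        # same value as demo_model.score(X_te, y_te)
print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

On the real data, the same three calls would take `y_test` and `model.predict(x_test)`.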

In [59]:
# Step 4: Initialize and train a Lasso Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
alphas = [0.01, 0.1, 1.0, 10.0]

for alpha in alphas:
    model_lasso = Lasso(alpha=alpha)
    model_lasso.fit(x_train, y_train)
    score = model_lasso.score(x_test, y_test)
    print(f"alpha {alpha}:- {score}")

alpha 0.01:- 0.7645437646395714
alpha 0.1:- 0.766050991480216
alpha 1.0:- 0.7819763683575135
alpha 10.0:- 0.7873364302158369
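
A property that separates Lasso from Ridge is that the L1 penalty can set coefficients to exactly zero as `alpha` grows, which effectively drops features. The scores above do not show whether that happened on the patient data; inspecting `model_lasso.coef_` would. The sketch below illustrates the effect on synthetic data in which only two of five features matter. `X_demo`, `y_demo`, and `zero_counts` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): only the first two of five features affect the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count coefficients that the L1 penalty shrank all the way to zero.
    zero_counts[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha {alpha}: {zero_counts[alpha]} of {lasso.coef_.size} coefficients are exactly zero")
```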


In [60]:
# Step 5: Initialize and train a Ridge Regression model with various alpha values provided in a list, and evaluate its performance using R-squared

alphas = [0.01, 0.1, 1.0, 10.0]

for alpha in alphas:
    model_ridge = Ridge(alpha=alpha)
    model_ridge.fit(x_train, y_train)
    score = model_ridge.score(x_test, y_test)
    print(f"alpha {alpha}:- {score}")

alpha 0.01:- 0.764363158939054
alpha 0.1:- 0.7643727707489341
alpha 1.0:- 0.7644686367656158
alpha 10.0:- 0.7654030812954533
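
The loops above score every `alpha` on the test set, which is fine for comparison. If the goal were to pick a single `alpha`, a common approach is to choose it by cross-validation on the training rows only, and to standardize the features first so the penalty treats them on the same scale. The sketch below shows one way to do that with `RidgeCV` inside a pipeline. It uses synthetic data, the names `X_demo`, `y_demo`, and `pipe` are hypothetical, and this is an optional extension, not part of the task as stated.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): four features on very different scales.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(250, 4)) * np.array([1.0, 10.0, 100.0, 0.1])
y_demo = X_demo @ np.array([2.0, 0.3, 0.05, 8.0]) + rng.normal(scale=1.0, size=250)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Standardize, then let 5-fold cross-validation on the training set pick alpha.
alphas = [0.01, 0.1, 1.0, 10.0]
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
pipe.fit(X_tr, y_tr)

best_alpha = pipe.named_steps["ridgecv"].alpha_
test_r2 = pipe.score(X_te, y_te)  # the test set is used once, for the final score
print(f"chosen alpha: {best_alpha}, test R2: {test_r2:.3f}")
```

`LassoCV` could be swapped in for `RidgeCV` in the same way.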
