# **Question 1: Linear Regression**
a) Load the "Boston Housing" dataset from scikit-learn's built-in datasets.

b) Split the data into training and testing sets.

If your roll number is even, use 80% training and 20% testing.

If your roll number is odd, use 70% training and 30% testing.

c) Train a linear regression model on the training data and make predictions on the testing data.

d) Calculate the mean squared error (MSE) between the predicted and actual values.
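
The roll-number rule in part (b) maps directly onto the `test_size` argument of `train_test_split`. The following is a minimal sketch, assuming a hypothetical helper `pick_test_size`, illustrative roll numbers, and a placeholder array with the 506 rows and 13 features of the Boston data; none of these names come from the assignment itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def pick_test_size(roll_number: int) -> float:
    """Hypothetical helper: even roll numbers test on 20%, odd ones on 30%."""
    return 0.2 if roll_number % 2 == 0 else 0.3


# Placeholder data with the same shape as the Boston feature matrix.
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

split_sizes = {}
for roll_number in (12, 7):  # illustrative roll numbers only
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=pick_test_size(roll_number), random_state=0
    )
    split_sizes[roll_number] = (len(X_tr), len(X_te))
    print(roll_number, split_sizes[roll_number])
```

Passing `random_state` fixes the shuffle, so the split (and therefore any MSE computed from it) is the same on every run.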

In [1]:
from sklearn import datasets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

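# load_boston was removed in scikit-learn 1.2, so read the raw file from its original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines: 11 values on the first, 3 on the second
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]  # last value of each record is MEDV, the target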

X = data
y = target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a linear regression model
reg = LinearRegression().fit(X_train, y_train)

# Make predictions on the testing data
y_pred = reg.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)

Mean Squared Error: 26.081847631350556
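
For reference, the MSE reported above is the average of the squared differences between actual and predicted values. The sketch below checks that definition against `mean_squared_error` on four made-up numbers (they are illustrative only, not taken from the dataset) and also derives the RMSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration; not rows from the Boston data.
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([25.0, 20.6, 30.7, 35.4])

# Squared errors are roughly 1, 1, 16 and 4, so the mean is about 5.5.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE is the square root of the MSE.
rmse = np.sqrt(mse_manual)

print("manual MSE: ", mse_manual)
print("sklearn MSE:", mse_sklearn)
print("RMSE:       ", rmse)
```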


# **Question 2: L1 Regularization (Lasso)**
a) Load the "Diabetes" dataset from scikit-learn's built-in datasets.

b) Split the data into training and testing sets.

If your roll number is even, use 80% training and 20% testing.

If your roll number is odd, use 70% training and 30% testing.

c) Train a Lasso regression model on the training data with an alpha value of 0.1.

***Model name should be your first name***

d) Evaluate the model's performance using the mean squared error (MSE) on the testing data.

e) Identify the features that were selected (non-zero coefficients) by the Lasso model.

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Load the "Diabetes" dataset from scikit-learn's built-in datasets.
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a Lasso regression model on the training data with an alpha value of 0.1.
Mudit = Lasso(alpha=0.1)
Mudit.fit(X_train, y_train)

# Evaluate the model's performance using the mean squared error (MSE) on the testing data.
y_pred = Mudit.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Identify the features that were selected (non-zero coefficients) by the Lasso model.
selected_features = [name for name, coef in zip(diabetes.feature_names, Mudit.coef_) if coef != 0]
print("Selected Features:", selected_features)

Mean Squared Error: 3360.338021314688
Selected Features: ['sex', 'bmi', 'bp', 's1', 's3', 's5', 's6']
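
How many features survive depends on `alpha`: a larger penalty pushes more coefficients to exactly zero. The sketch below refits Lasso for a few values on the same dataset. The extra alpha values and `random_state=0` are assumptions chosen for illustration, so the selected features need not match the run above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

diabetes_demo = datasets.load_diabetes()
X_tr, X_te, y_tr, y_te = train_test_split(
    diabetes_demo.data, diabetes_demo.target, test_size=0.2, random_state=0
)

n_selected = {}
for alpha in (0.01, 0.1, 1.0, 100.0):  # illustrative penalty strengths
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    kept = np.flatnonzero(model.coef_)  # indices of non-zero coefficients
    n_selected[alpha] = len(kept)
    print(alpha, [diabetes_demo.feature_names[i] for i in kept])
```

With a very large penalty such as `alpha=100.0`, every coefficient is driven to zero and the model predicts only the intercept.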


# **Question 3: L2 Regularization (Ridge)**
a) Load the "California Housing" dataset from an online source (e.g., Kaggle). The file is saved locally as *housing.csv*.

b) Perform any necessary preprocessing steps, such as handling missing values or scaling the features.

c) Split the data into training and testing sets.

If the last two digits of your roll number form a prime, use 85% training and 15% testing.

If the last two digits of your roll number do not form a prime, use 65% training and 35% testing.

d) Train a Ridge regression model on the training data with an alpha value of 0.01.

e) Calculate the mean squared error (MSE) on the testing data to assess the model's performance.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Load the dataset
df = pd.read_csv('housing.csv')

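# Preprocessing steps: drop rows with missing values and rescale the target
df = df.dropna()
df['price_scaled'] = df['price'] / 10000

# Features: numeric columns only, excluding the target in both of its forms
X = df.drop(columns=['price', 'price_scaled']).select_dtypes(include='number')
y = df['price_scaled']  # the MSE below is therefore in units of (price / 10000)^2

# Split the data into training and testing sets (35% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)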

# Train a Ridge regression model on the training data with an alpha value of 0.01
ridge = Ridge(alpha=0.01)
ridge.fit(X_train, y_train)

# Calculate the mean squared error (MSE) on the testing data to assess the model's performance
y_pred = ridge.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
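
Part (b) also mentions scaling the features. One way to do that without leaking test-set statistics is to put a `StandardScaler` and the `Ridge` model in a single pipeline, so the scaler is fitted on the training rows only. The sketch below is an assumption-laden illustration: `housing.csv` is not available here, so it uses a synthetic frame with hypothetical `area`, `bedrooms` and `price` columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for housing.csv; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.uniform(500, 3000, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
})
toy["price"] = 100 * toy["area"] + 20000 * toy["bedrooms"] + rng.normal(0, 5000, n)
toy.loc[::25, "bedrooms"] = np.nan  # inject a few missing values

# Handle missing values, then separate features from the target.
toy = toy.dropna()
X_toy = toy.drop(columns=["price"])
y_toy = toy["price"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.35, random_state=42
)

# The scaler is fitted inside the pipeline, on the training split only.
model = make_pipeline(StandardScaler(), Ridge(alpha=0.01))
model.fit(X_tr, y_tr)

toy_mse = mean_squared_error(y_te, model.predict(X_te))
print(f"Mean Squared Error on synthetic data: {toy_mse}")
```

Ridge penalizes large coefficients, so features on very different scales are penalized unevenly unless they are standardized first; that is the reason for scaling here.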