### Project Report: Linear Regression and Regularized Models

#### Overview

I used several approaches to predict house prices: standard Linear Regression and several combinations of three regularized models (Ridge, Lasso, and ElasticNet). The training data was split 80/20 into training and validation sets, standardized with `StandardScaler`, and then used for model training and evaluation.

#### Results

- **Linear Regression**:
  - **MSE**: 23.48673519542582
  - **R² Score**: 0.7390315860425438

- **Average of Ridge, Lasso, and ElasticNet**:
  - **MSE**: 23.48673519542582
  - **R² Score**: 0.7390315860425438
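
For reference, both validation metrics can be computed by hand. The sketch below is a minimal pure-Python illustration of the standard definitions behind `mean_squared_error` and `r2_score`. The helper names `manual_mse` and `manual_r2` and the sample numbers are hypothetical; they are not values from the housing data.

```python
def manual_mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def manual_r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the share of target variance the model explains."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only
y_true = [20.0, 25.0, 30.0, 35.0]
y_pred = [22.0, 24.0, 29.0, 37.0]

print(manual_mse(y_true, y_pred))  # squared residuals 4, 1, 1, 4 average to 2.5
print(manual_r2(y_true, y_pred))   # 1 - 10 / 125, about 0.92
```

MSE is expressed in squared units of the target, so lower is better, while an R² closer to 1 means more of the variance in `medv` is explained.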

**Kaggle Scores:**
- **Linear Regression**:
  - Private Score: 4.86599
  - Public Score: 4.81636

- **Average of Ridge, Lasso, and ElasticNet**:
  - Private Score: 4.84934
  - Public Score: 4.75777

#### Report and Analysis

To improve performance, I experimented with various combinations of Ridge, Lasso, and ElasticNet models. The best results were achieved by averaging the predictions from these regularized models.
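
Because Ridge, Lasso, and ElasticNet all produce linear predictors, averaging their predictions is equivalent to using one linear model whose coefficients and intercept are the averages of the three. The sketch below checks this with NumPy. The coefficient vectors, intercepts, and feature matrix are made-up stand-ins, not fitted values from this project.

```python
import numpy as np

# Hypothetical (coefficients, intercept) pairs standing in for three fitted linear models
linear_models = {
    "ridge": (np.array([1.0, -2.0, 0.5]), 3.0),
    "lasso": (np.array([0.8, -1.5, 0.0]), 3.2),
    "elastic_net": (np.array([0.9, -1.8, 0.2]), 3.1),
}

# Hypothetical scaled feature matrix: 4 samples, 3 features
X = np.array([
    [0.1, 1.2, -0.3],
    [-1.0, 0.4, 0.8],
    [0.5, -0.7, 1.5],
    [2.0, 0.0, -1.1],
])

# Average the per-model predictions (one column per model, then the row-wise mean)
per_model = np.column_stack(
    [X @ coef + intercept for coef, intercept in linear_models.values()]
)
averaged_predictions = per_model.mean(axis=1)

# Build a single linear model from the averaged coefficients and intercept
mean_coef = np.mean([coef for coef, _ in linear_models.values()], axis=0)
mean_intercept = np.mean([intercept for _, intercept in linear_models.values()])
single_model_predictions = X @ mean_coef + mean_intercept

# The two sets of predictions agree up to floating-point error
print(np.allclose(averaged_predictions, single_model_predictions))
```

This suggests the averaged ensemble is still a single linear model, so any gain over one well-tuned regularized model is likely to be small.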

The averaged model scored slightly better on Kaggle than standard Linear Regression (private 4.84934 vs. 4.86599; public 4.75777 vs. 4.81636), which is consistent with the nature of regularization. Ridge, Lasso, and ElasticNet add penalties on the coefficients that help control model complexity and reduce overfitting, and averaging their predictions combines their individual strengths. The validation MSE and R² listed above are identical for both approaches, so the gain appears only in the Kaggle scores, and it is modest: regularization helps here, but its effect is not dramatic.
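
For reference, the penalties mentioned above can be written out. The following are the objectives minimized by scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` as given in its documentation (intercept omitted for brevity), where $w$ is the coefficient vector, $n$ is the number of training samples, $\alpha$ corresponds to `alpha`, and $\rho$ corresponds to `l1_ratio`:

$$
\begin{aligned}
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

With $\alpha = 0$ each objective reduces to ordinary least squares. The $\ell_1$ term can drive individual coefficients exactly to zero, while the $\ell_2$ term only shrinks them toward zero.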

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load training data
train_file_path = "path/to/train.csv"  # Placeholder: add training data file path here
train_data = pd.read_csv(train_file_path)

# Load testing data
test_file_path = "path/to/test.csv"  # Placeholder: add testing data file path here
test_data = pd.read_csv(test_file_path)

# Extract features (X) and target (y) from the training data
X = train_data.drop(columns=['ID', 'medv'])
y = train_data['medv']  # Target variable

# Extract features (X) from the test data (excluding ID column)
X_test = test_data.drop(columns=['ID'], errors='ignore')

# Split the training data into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate the model
y_valid_pred = model.predict(X_valid_scaled)
mse = mean_squared_error(y_valid, y_valid_pred)
r2 = r2_score(y_valid, y_valid_pred)

print(f"Linear Regression - MSE: {mse}, R2 Score: {r2}")

# Predict on the test data
y_test_pred = model.predict(X_test_scaled)

# Save predictions to CSV
output = pd.DataFrame({'ID': test_data['ID'], 'medv': y_test_pred})
output_file_path = "path/to/submission.csv"  # Placeholder: add the path to which you want to save the file
output.to_csv(output_file_path, index=False)

print(f"Predictions saved to {output_file_path}")
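
The scaling step above fits the scaler on the training split only and reuses those statistics for the validation and test data, so no information from held-out rows leaks into training. The NumPy sketch below illustrates the same idea on made-up arrays. The arrays are hypothetical, and the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical feature matrices (rows are samples, columns are features)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_valid = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training split only
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)  # population std (ddof=0), as in StandardScaler

# The same training statistics are applied to both splits
X_train_scaled = (X_train - train_mean) / train_std
X_valid_scaled = (X_valid - train_mean) / train_std

print(X_train_scaled.mean(axis=0))  # approximately zero for each column
print(X_valid_scaled)               # validation rows expressed in training-set units
```

The validation columns will generally not have exactly zero mean or unit variance after this transform, which is expected.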


In [None]:
import pandas as pd  # For data management
from sklearn.model_selection import train_test_split  # For splitting the data
from sklearn.preprocessing import StandardScaler  # For scaling the data
from sklearn.linear_model import Ridge, Lasso, ElasticNet  # Regularized models mentioned in the article; their predictions are averaged below
from sklearn.metrics import mean_squared_error, r2_score  # For measuring prediction error

# We load the training data from a CSV file using the standard pandas function
train_file_path = "path/to/train.csv"  # Placeholder: add training data file path here
train_data = pd.read_csv(train_file_path)

# We load the testing data in the same manner
test_file_path = "path/to/test.csv"  # Placeholder: add testing data file path here
test_data = pd.read_csv(test_file_path)

# We extract the features (X) and target (y) from the training data.
# 'ID' is dropped because it is only an identifier and does not help predict values on the testing data;
# 'medv' is dropped because it is the target.
X = train_data.drop(columns=['ID', 'medv'])
# The target is the house price ('medv'). Each model fits a linear function of the remaining columns to predict it.
y = train_data['medv']

# We extract the features (X) from the test data, excluding the ID column
X_test = test_data.drop(columns=['ID'], errors='ignore')

# We split the training data into training and validation sets in the standard 80% to 20% ratio
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42) 

# We standardise the features (zero mean, unit variance), fitting the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

# We initialize the three regularized models. I submitted each one individually and in different combinations,
# and the average of all three gave the best result.
# alpha sets the strength of the penalty and l1_ratio sets the L1/L2 mix in ElasticNet; both keep overfitting in check.
# The alpha values used here were suggested by ChatGPT.
models = {
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}  

# We define a function to evaluate each model on the validation set.
# MSE (mean squared error) is the average squared difference between the predicted and expected values.
# The R² score is the proportion of the variance in the target that the model explains.

def evaluate_model(model, X_train, y_train, X_valid, y_valid):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    mse = mean_squared_error(y_valid, y_pred)
    r2 = r2_score(y_valid, y_pred)
    return mse, r2

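# We create DataFrames to store each model's predictions on the validation and test sets
valid_predictions_df = pd.DataFrame()
predictions_df = pd.DataFrame()
for name, model in models.items():
    # We train the model on the training set and report its validation metrics
    model_mse, model_r2 = evaluate_model(model, X_train_scaled, y_train, X_valid_scaled, y_valid)
    print(f"{name} - MSE: {model_mse}, R2 Score: {model_r2}")

    # We store the trained model's validation and test predictions for averaging
    valid_predictions_df[name] = model.predict(X_valid_scaled)
    predictions_df[name] = model.predict(X_test_scaled)

# We calculate the average of the predictions from the three models
average_valid_prediction = valid_predictions_df.mean(axis=1)
average_prediction = predictions_df.mean(axis=1)

# We evaluate the averaged predictions on the validation set
mse = mean_squared_error(y_valid, average_valid_prediction)
r2 = r2_score(y_valid, average_valid_prediction)

# We save the average test predictions to a CSV file
output = pd.DataFrame({'ID': test_data['ID'], 'medv': average_prediction})
output_file_path = "path/to/submission.csv"  # Placeholder: add the path to which you want to save the file
output.to_csv(output_file_path, index=False)

print(f"Average of all 3 - MSE: {mse}, R2 Score: {r2}")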

# We print confirmation that the predictions have been saved
print(f"Predictions saved to {output_file_path}")
