![Northeastern University wordmark](https://upload.wikimedia.org/wikipedia/commons/0/02/Northeastern_Wordmark.svg)

# Linear Regression

Copyright: Prof. Shanu Sushmita

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [2]:
# Load California housing dataset
california_housing = fetch_california_housing()

In [11]:
# We can have a first look at the available description of the dataset

print(california_housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [3]:
# Convert to pandas DataFrame for easier manipulation
data = pd.DataFrame(data=california_housing.data, columns=california_housing.feature_names)
target = pd.DataFrame(data=california_housing.target, columns=['target'])

In [4]:
# Concatenate features and target into a single DataFrame
df = pd.concat([data, target], axis=1)

In [5]:
# Split data into features and target
X = df.drop('target', axis=1)
y = df['target']

In [6]:
# Initialize Linear Regression model
model = LinearRegression()

In [7]:
# Initialize k-fold cross-validation
kf = KFold(n_splits=3, shuffle=True, random_state=42)
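
To see what this splitter produces, the sketch below runs the same `KFold` configuration over a plain index array. The array is a stand-in (an assumption) for the 20640 rows of the housing data; only the row count matters for the split sizes.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 20640 rows of the housing data; only the row count matters here.
n_samples = 20640
indices = np.arange(n_samples)

kf = KFold(n_splits=3, shuffle=True, random_state=42)

fold_sizes = []
seen_test_rows = []
for fold, (train_index, test_index) in enumerate(kf.split(indices), start=1):
    fold_sizes.append((len(train_index), len(test_index)))
    seen_test_rows.append(test_index)
    print(f"Fold {fold}: {len(train_index)} training rows, {len(test_index)} test rows")

# Across the three folds, every row lands in a test set exactly once.
all_test_rows = np.sort(np.concatenate(seen_test_rows))
print("Each row tested exactly once:", np.array_equal(all_test_rows, indices))
```

With 3 splits, each fold trains on two thirds of the rows and tests on the remaining third. `shuffle=True` with a fixed `random_state` makes the assignment random but reproducible.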

In [8]:
# Initialize lists to store metrics
rmse_scores = []
mae_scores = []
r2_scores = []

In [9]:
# Perform k-fold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate evaluation metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Append scores to lists
    rmse_scores.append(rmse)
    mae_scores.append(mae)
    r2_scores.append(r2)

# Calculate average scores
avg_rmse = np.mean(rmse_scores)
avg_mae = np.mean(mae_scores)
avg_r2 = np.mean(r2_scores)
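
The manual loop above can also be written more compactly with scikit-learn's `cross_validate`. This is a minimal sketch, not a replacement for the loop: it uses synthetic data from `make_regression` (an assumption, so it runs without downloading the housing data), and the names `X_demo`, `y_demo`, and the `*_cv` variables are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (an assumption, so the sketch runs offline);
# in the notebook, X and y from the housing DataFrame would be passed instead.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

kf = KFold(n_splits=3, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so error metrics are negated.
scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
results = cross_validate(LinearRegression(), X_demo, y_demo, cv=kf, scoring=scoring)

# Flip the sign back to report positive errors.
avg_rmse_cv = -results["test_rmse"].mean()
avg_mae_cv = -results["test_mae"].mean()
avg_r2_cv = results["test_r2"].mean()

print("Average RMSE:", avg_rmse_cv)
print("Average MAE:", avg_mae_cv)
print("Average R^2:", avg_r2_cv)
```

`cross_validate` fits a fresh clone of the estimator on each fold and returns one score per fold under the keys `test_<name>`, so averaging them mirrors the averages computed by the loop.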

In [10]:
print("Average RMSE:", avg_rmse)
print("Average MAE:", avg_mae)
print("Average R^2:", avg_r2)

Average RMSE: 0.7265250485193588
Average MAE: 0.531505251553053
Average R^2: 0.6035381394351562
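
Because the target is expressed in units of $100,000, the error metrics are in the same units. The short sketch below converts the averages printed above into dollars; the constant name `UNIT_DOLLARS` is illustrative.

```python
# Averages copied from the output above (target units: $100,000).
avg_rmse = 0.7265250485193588
avg_mae = 0.531505251553053

UNIT_DOLLARS = 100_000  # one target unit, per the dataset description

rmse_dollars = avg_rmse * UNIT_DOLLARS
mae_dollars = avg_mae * UNIT_DOLLARS

print(f"Average RMSE: about ${rmse_dollars:,.0f}")
print(f"Average MAE:  about ${mae_dollars:,.0f}")
```

In other words, the model's predictions of median house value are off by roughly $53,000 on average (MAE), with an RMSE of roughly $73,000. The R^2 of about 0.60 means the model explains about 60% of the variance in the target.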


### Assignment 1

- Use the same dataset.
- Train a linear regression model using statsmodels:
    - import statsmodels.api as sm
    - https://www.statsmodels.org/stable/regression.html
    
- Find the best model (the set of features, selected based on p-values) that gives the
    - lowest RMSE
    - lowest MAE
    - highest R^2

In [1]:
## Your code here