In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Load dataset
df = pd.read_csv("energydata_complete.csv")
df.head()

,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


## Attribute Information:
- date, timestamp in the format year-month-day hour:minute:second
- Appliances, energy use in Wh
- lights, energy use of light fixtures in the house in Wh
- T1, Temperature in kitchen area, in Celsius
- RH_1, Humidity in kitchen area, in %
- T2, Temperature in living room area, in Celsius
- RH_2, Humidity in living room area, in %
- T3, Temperature in laundry room area, in Celsius
- RH_3, Humidity in laundry room area, in %
- T4, Temperature in office room, in Celsius
- RH_4, Humidity in office room, in %
- T5, Temperature in bathroom, in Celsius
- RH_5, Humidity in bathroom, in %
- T6, Temperature outside the building (north side), in Celsius
- RH_6, Humidity outside the building (north side), in %
- T7, Temperature in ironing room , in Celsius
- RH_7, Humidity in ironing room, in %
- T8, Temperature in teenager room 2, in Celsius
- RH_8, Humidity in teenager room 2, in %
- T9, Temperature in parents room, in Celsius
- RH_9, Humidity in parents room, in %
- T_out, Temperature outside (from Chievres weather station), in Celsius
- Press_mm_hg, Pressure (from Chievres weather station), in mm Hg
- RH_out, Humidity outside (from Chievres weather station), in %
- Windspeed, Wind speed (from Chievres weather station), in m/s
- Visibility (from Chievres weather station), in km
- Tdewpoint, Dew point temperature (from Chievres weather station), in °C
- rv1, Random variable 1, nondimensional
- rv2, Random variable 2, nondimensional


### From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value to two decimal places?

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Extract the required columns
x = df['T2'].values.reshape(-1, 1)
y = df['T6']

# Fit the linear regression model
model = LinearRegression()
model.fit(x, y)

# Make predictions
y_pred = model.predict(x)

# Calculate R-squared value
r2 = r2_score(y, y_pred)

# Print the R-squared value
print(f"R^2 value: {r2:.2f}")


R^2 value: 0.64
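
For a one-feature least-squares fit with an intercept, the R^2 measured on the fitting data equals the squared Pearson correlation between x and y, which gives a quick cross-check of the value above. The sketch below illustrates this on synthetic data; `x_demo` and `y_demo` are made-up stand-ins for `T2` and `T6`, not values from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-ins for df['T2'] and df['T6'] (assumed data, not the real dataset)
rng = np.random.default_rng(0)
x_demo = rng.normal(20.0, 2.0, size=500)
y_demo = 3.0 * x_demo - 50.0 + rng.normal(0.0, 4.0, size=500)

# Fit a one-feature linear model and score it on the same data
demo_model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
r2_fit = r2_score(y_demo, demo_model.predict(x_demo.reshape(-1, 1)))

# Squared Pearson correlation between x and y
r2_corr = np.corrcoef(x_demo, y_demo)[0, 1] ** 2

print(f"R^2 from the fit:    {r2_fit:.6f}")
print(f"Squared correlation: {r2_corr:.6f}")
```

On the notebook's data, `df['T2'].corr(df['T6']) ** 2` should reproduce the same figure, assuming neither column contains missing values.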


### Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:
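
As a reminder of what the normalization step does, `MinMaxScaler` maps each column to the range [0, 1] via (x - min) / (max - min). The following is a minimal sketch on a small made-up frame; the column names mirror the dataset but the values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small made-up frame (illustrative values, not taken from the dataset)
demo_df = pd.DataFrame({
    "Appliances": [60.0, 50.0, 90.0, 40.0],
    "T1": [19.89, 20.10, 21.50, 18.70],
})

# Scale with scikit-learn
scaled = MinMaxScaler().fit_transform(demo_df)

# Same transform by hand: (x - min) / (max - min), column by column
manual = (demo_df - demo_df.min()) / (demo_df.max() - demo_df.min())

print(np.allclose(scaled, manual.to_numpy()))
```

Because the target column `Appliances` is scaled together with the features in this notebook, the error metrics reported in the following cells are in normalized units rather than Wh.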

### What is the Mean Absolute Error (in two decimal places)?

In [4]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

scaler = MinMaxScaler()
df1 = df.drop(columns=['date', 'lights'])
normalized_df = pd.DataFrame(scaler.fit_transform(df1), columns=df1.columns)
features_df = normalized_df.drop(columns=['Appliances'])
target = normalized_df['Appliances']

x_train, x_test, y_train, y_test = train_test_split(features_df, target, test_size=0.3, random_state=42)

linear_model = LinearRegression()
linear_model.fit(x_train, y_train)
predicted_values_Linear = linear_model.predict(x_test)

mae = mean_absolute_error(y_test, predicted_values_Linear)
rounded_mae = round(mae, 2)  # the question asks for two decimal places
print(f"Mean Absolute Error: {rounded_mae}")

Mean Absolute Error: 0.05
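
Since the target was min-max scaled, the MAE above is expressed on a 0 to 1 scale. Min-max scaling is an affine map, so multiplying the scaled MAE by the target's range (max minus min) recovers the MAE in the original units. The sketch below checks this on made-up numbers; the arrays are hypothetical and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up target values in original units (Wh) and made-up predictions
y_true_wh = np.array([60.0, 50.0, 90.0, 40.0, 120.0])
y_pred_wh = np.array([55.0, 65.0, 80.0, 45.0, 100.0])

# Min-max scale both arrays with the target's own minimum and range
y_min = y_true_wh.min()
y_range = y_true_wh.max() - y_min
y_true_scaled = (y_true_wh - y_min) / y_range
y_pred_scaled = (y_pred_wh - y_min) / y_range

mae_scaled = mean_absolute_error(y_true_scaled, y_pred_scaled)
mae_wh = mean_absolute_error(y_true_wh, y_pred_wh)

# The two errors differ only by the range factor
print(mae_wh, mae_scaled * y_range)
```

In the notebook, the corresponding factor would be `df['Appliances'].max() - df['Appliances'].min()`, because the scaler was fitted on the full frame.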


### What is the Residual Sum of Squares (in two decimal places)?

In [5]:
from sklearn.metrics import mean_squared_error

rss = mean_squared_error(y_test, predicted_values_Linear) * len(y_test)
rounded_rss = round(rss, 2)
print(f"Residual Sum of Squares: {rounded_rss}")

Residual Sum of Squares: 45.35
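
The cell above obtains the RSS as MSE times the number of test samples. An equivalent and more direct route is to sum the squared residuals. The sketch below compares both on a few made-up values (hypothetical numbers, not notebook output).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions on a normalized scale
y_true_demo = np.array([0.10, 0.25, 0.40, 0.05])
y_hat_demo = np.array([0.12, 0.20, 0.35, 0.10])

# RSS as the sum of squared residuals
rss_direct = np.sum((y_true_demo - y_hat_demo) ** 2)

# RSS as mean squared error times the number of samples
rss_from_mse = mean_squared_error(y_true_demo, y_hat_demo) * len(y_true_demo)

print(rss_direct, rss_from_mse)
```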


### What is the Root Mean Squared Error (in three decimal places)?

In [14]:
mse = mean_squared_error(y_test, predicted_values_Linear)
rmse = (mse ** 0.5)
rounded_rmse = round(rmse, 3)
print(f"Root Mean Squared Error: {rounded_rmse}")

Root Mean Squared Error: 0.088


### What is the Coefficient of Determination (in two decimal places)?

In [7]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, predicted_values_Linear)
rounded_r2 = round(r2, 2)
print(f"Coefficient of Determination (R^2): {rounded_r2}")


Coefficient of Determination (R^2): 0.15
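
The coefficient of determination can also be written as 1 - RSS/TSS, where TSS is the total sum of squares around the mean of the true values. The sketch below verifies this against `r2_score` on made-up numbers (hypothetical values, not notebook output).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.5, 6.5, 9.5])

# Residual and total sums of squares
rss_demo = np.sum((y_true_demo - y_hat_demo) ** 2)
tss_demo = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1.0 - rss_demo / tss_demo
r2_sklearn = r2_score(y_true_demo, y_hat_demo)

print(r2_manual, r2_sklearn)
```

Read this way, an R^2 of 0.15 means the model's RSS on the test set is about 85% of the TSS.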


### Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

In [8]:
feature_weights = linear_model.coef_
sorted_indices = np.argsort(feature_weights)
sorted_weights = feature_weights[sorted_indices]

weights_df = pd.DataFrame({'Features': features_df.columns[sorted_indices], 'Linear Model Weights': sorted_weights})

weights_df

,Features,Linear Model Weights
0,RH_2,-0.456698
1,T_out,-0.32186
2,T2,-0.236178
3,T9,-0.189941
4,RH_8,-0.157595
5,RH_out,-0.077671
6,RH_7,-0.044614
7,RH_9,-0.0398
8,T5,-0.015657
9,T1,-0.003281


***Observation***
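- The lowest and highest weights belong to RH_2 and RH_1, respectively. The table above is sorted in ascending order and shows only the first ten rows, so RH_1 does not appear in it.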
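
Rather than reading the extremes off a sorted table, the lowest and highest weights can be looked up directly by labelling the coefficients with their feature names. The sketch below does this on synthetic data; the feature names `a`, `b`, `c` and the generating coefficients are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features and a target with known coefficients (assumed data)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.uniform(size=(200, 3)), columns=["a", "b", "c"])
y_demo = (2.0 * X_demo["a"] - 3.0 * X_demo["b"] + 0.1 * X_demo["c"]
          + rng.normal(0.0, 0.01, size=200))

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its feature name and sort ascending
weights = pd.Series(demo_model.coef_, index=X_demo.columns).sort_values()

lowest, highest = weights.idxmin(), weights.idxmax()
print(weights)
print(f"Lowest: {lowest}, highest: {highest}")
```

Applied to the notebook's objects, the same pattern would be `pd.Series(linear_model.coef_, index=features_df.columns)` followed by `idxmin()` and `idxmax()`.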

### Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [15]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=0.4)
ridge_model.fit(x_train, y_train)
predicted_values_Ridge = ridge_model.predict(x_test)

mse = mean_squared_error(y_test, predicted_values_Ridge)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")


Root Mean Squared Error (RMSE): 0.088


***Observation***
- There is no change in the RMSE at three decimal places: the linear model and the ridge model both score 0.088 on the test set.
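
Identical RMSE values at three decimal places do not mean the two models are identical. Ridge shrinks the coefficients slightly, and any difference may only show up at higher precision. The sketch below compares both models on synthetic data and prints more digits; the data and the generating coefficients are assumptions, and only `alpha=0.4`, the 70-30 split, and `random_state=42` are taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features in [0, 1] and a noisy linear target (assumed data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -3.0, 0.5, 0.0]) + rng.normal(0.0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

ols_demo = LinearRegression().fit(X_tr, y_tr)
ridge_demo = Ridge(alpha=0.4).fit(X_tr, y_tr)

rmse_ols = np.sqrt(mean_squared_error(y_te, ols_demo.predict(X_te)))
rmse_ridge = np.sqrt(mean_squared_error(y_te, ridge_demo.predict(X_te)))

# Print more digits than the three used in the notebook
print(f"OLS RMSE:   {rmse_ols:.6f}")
print(f"Ridge RMSE: {rmse_ridge:.6f}")
print(f"Difference: {rmse_ridge - rmse_ols:.6f}")
```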

### Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [10]:
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.001)
lasso_model.fit(x_train, y_train)

feature_weights = lasso_model.coef_
sorted_indices = np.argsort(feature_weights)
sorted_weights = feature_weights[sorted_indices]

weights_df_lasso = pd.DataFrame({'Features': features_df.columns[sorted_indices], 'Lasso_Model_Weight': sorted_weights})

weights_df_lasso


,Features,Lasso_Model_Weight
0,RH_out,-0.049557
1,RH_8,-0.00011
2,T1,0.0
3,Tdewpoint,0.0
4,Visibility,0.0
5,Press_mm_hg,-0.0
6,T_out,0.0
7,RH_9,-0.0
8,T9,-0.0
9,T8,0.0


***Observation***
- Four features have non-zero weights: RH_out, RH_8, Windspeed, and RH_1. The table above shows only the first ten rows of the sorted weights, so Windspeed and RH_1 do not appear in it.
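
The non-zero weights can be counted programmatically instead of by inspecting the table. The sketch below filters the coefficients on synthetic data; the feature names and the data are made up, and the toy `alpha=0.05` is an assumption chosen larger than the notebook's 0.001 so that shrinkage to exactly zero is more likely to be visible on such a small example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only "a" and "b" influence the target (assumed data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 2.0 * X_demo["a"] - 3.0 * X_demo["b"] + rng.normal(0.0, 0.1, size=200)

# alpha=0.05 is a toy value for this sketch, not the notebook's setting
lasso_demo = Lasso(alpha=0.05).fit(X_demo, y_demo)

# Label the coefficients and keep only those that are not exactly zero
lasso_weights = pd.Series(lasso_demo.coef_, index=X_demo.columns)
nonzero_weights = lasso_weights[lasso_weights != 0]

print(nonzero_weights)
print(f"Non-zero weights: {len(nonzero_weights)} of {len(lasso_weights)}")
```

With the notebook's objects, `np.count_nonzero(lasso_model.coef_)` would give the count directly.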

### What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

In [11]:
lasso_model = Lasso(alpha=0.001)
lasso_model.fit(x_train, y_train)
predicted_values_Lasso = lasso_model.predict(x_test)

mse = mean_squared_error(y_test, predicted_values_Lasso)
rmse = np.sqrt(mse)
rounded_rmse = round(rmse, 3)
print(f"Root Mean Squared Error (RMSE): {rounded_rmse}")


Root Mean Squared Error (RMSE): 0.094
