
# Dataset Information

The dataset for the remainder of this quiz (from question 18) is the Appliances Energy Prediction data. It contains readings at 10-minute intervals over about 4.5 months. House temperature and humidity were monitored with a ZigBee wireless sensor network. Each wireless node transmitted its temperature and humidity readings about every 3.3 minutes, and the readings were then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data using the date and time column. Two random variables are included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information is listed below.
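The 10-minute averaging described above can be sketched with pandas. The readings here are synthetic (a hypothetical `T1` series sampled every 200 seconds, roughly 3.3 minutes), and `resample` is only one plausible way to do the averaging. The pipeline actually used to build the dataset is not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings: one value every 200 s for one hour
idx = pd.date_range("2016-01-11 17:00", periods=18, freq=pd.Timedelta(seconds=200))
raw = pd.DataFrame({"T1": np.linspace(19.0, 20.7, 18)}, index=idx)

# Average the raw readings into 10-minute bins
ten_min = raw.resample("10min").mean()

print(ten_min)
```

Each 10-minute bin here averages three raw readings, so one hour of raw data collapses to six rows.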



## Attribute Information

Date, time year-month-day hour:minute:second

Appliances, energy use in Wh

lights, energy use of light fixtures in the house in Wh

T1, Temperature in kitchen area, in Celsius

RH_1, Humidity in kitchen area, in %

T2, Temperature in living room area, in Celsius

RH_2, Humidity in living room area, in %

T3, Temperature in laundry room area, in Celsius

RH_3, Humidity in laundry room area, in %

T4, Temperature in office room, in Celsius

RH_4, Humidity in office room, in %

T5, Temperature in bathroom, in Celsius

RH_5, Humidity in bathroom, in %

T6, Temperature outside the building (north side), in Celsius

RH_6, Humidity outside the building (north side), in %

T7, Temperature in ironing room, in Celsius

RH_7, Humidity in ironing room, in %

T8, Temperature in teenager room 2, in Celsius

RH_8, Humidity in teenager room 2, in %

T9, Temperature in parents room, in Celsius

RH_9, Humidity in parents room, in %

T_out, Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station), in mm Hg

RH_out, Humidity outside (from Chievres weather station), in %

Wind speed (from Chievres weather station), in m/s

Visibility (from Chievres weather station), in km

Tdewpoint (from Chievres weather station), °C

rv1, Random variable 1, nondimensional

rv2, Random variable 2, nondimensional



# Setup

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import statsmodels.api as sm


Using matplotlib backend: agg


# Data Importation

In [2]:
# Step 1: Load Data from CSV
data = pd.read_csv("/content/energydata_complete.csv")

# Data Preprocessing

In [3]:
# Step 2: Data Preprocessing (if needed)
# Example: Handling missing values, encoding categorical variables, etc.

In [4]:
#Preview Data Head
data.head(2)

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.48,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195


## (SIS) - Shape, Info & Stats Analysis

In [5]:
# Preview Data Shape
data.shape

(19735, 29)

In [6]:
# General Information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

In [7]:
# Descriptive Statistics
data.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,...,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,...,19.485828,41.552401,7.41258,755.522602,79.750418,4.039752,38.330834,3.760995,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,...,2.014712,4.151497,5.318464,7.399441,14.901088,2.451221,11.794719,4.195248,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,...,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,...,18.0,38.5,3.67,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,...,19.39,40.9,6.92,756.1,83.666667,3.666667,40.0,3.43,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,...,20.6,44.338095,10.4,760.933333,91.666667,5.5,40.0,6.57,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,...,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


## Numerical VS Categorical Variables

In [8]:
# Check data types of each column
print("Data Types:")
print(data.dtypes, end=",")
print()

Data Types:
date            object
Appliances       int64
lights           int64
T1             float64
RH_1           float64
T2             float64
RH_2           float64
T3             float64
RH_3           float64
T4             float64
RH_4           float64
T5             float64
RH_5           float64
T6             float64
RH_6           float64
T7             float64
RH_7           float64
T8             float64
RH_8           float64
T9             float64
RH_9           float64
T_out          float64
Press_mm_hg    float64
RH_out         float64
Windspeed      float64
Visibility     float64
Tdewpoint      float64
rv1            float64
rv2            float64
dtype: object,


In [9]:
# Separate numerical and categorical variables
numerical_vars = data.select_dtypes(include=['int64', 'float64']).columns
categorical_vars = data.select_dtypes(include=['object']).columns

# Display numerical and categorical variables
print("Numerical Variables:")
print(numerical_vars)
print()

print("Categorical Variables:")
print(categorical_vars)

Numerical Variables:
Index(['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4',
       'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9',
       'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility',
       'Tdewpoint', 'rv1', 'rv2'],
      dtype='object')

Categorical Variables:
Index(['date'], dtype='object')


## Missing Values

In [10]:
data.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

## Correlation

In [None]:
# Correlate numeric columns only; the `date` column holds strings
data[numerical_vars].corr()
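A correlation matrix is often easier to read when reduced to each feature's correlation with the target. The sketch below uses a small synthetic frame, not the real energy data. The column names are borrowed from the dataset for illustration only, and the duplicated `rv2` column mirrors the identical `rv1`/`rv2` values visible in the preview above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (values are made up for illustration)
toy = pd.DataFrame({
    "date": ["2016-01-11 17:00"] * n,   # string column, like `date`
    "T1": rng.normal(21, 1.5, n),
    "rv1": rng.uniform(0, 50, n),
})
toy["rv2"] = toy["rv1"]                  # exact duplicate of rv1
toy["Appliances"] = 40 * toy["T1"] + rng.normal(0, 5, n)

# Keep numeric columns only, then compute Pearson correlations
corr = toy.select_dtypes(include="number").corr()

# Rank features by absolute correlation with the target
target_corr = corr["Appliances"].drop("Appliances").abs().sort_values(ascending=False)
print(target_corr)
```

A pair of columns with correlation 1.0, such as `rv1` and `rv2` here, carries the same information twice.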

# Feature Classification

In [None]:
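# Step 3: Split Data into Features (independent variables, X) and Target (dependent variable, Y)
# "target_column" is a placeholder; replace it with the name of the column to predict
X = data.drop(columns=["target_column"])
Y = data["target_column"]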

# Data Splitting

In [None]:
# Step 4: Split Data into Training and Testing Sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
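To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a made-up array of 100 rows. The names `X_demo` and `y_demo` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows: row i holds [2i, 2i + 1] and has label i
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Rows of `X` and `y` are shuffled together, so each feature row stays paired with its own label.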


# Standardization

## Standard Scaler

In [None]:
# Standardize the input features (independent variables)
# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## MinMax Scaler

In [None]:
# Alternative to StandardScaler: run only one of the two scaling cells
# Initialize the MinMaxScaler
mm_scaler = MinMaxScaler()

# Fit on the training set, then transform both sets
X_train = mm_scaler.fit_transform(X_train)
X_test = mm_scaler.transform(X_test)
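The two scalers apply simple per-column formulas: standardization computes `(x - mean) / std`, and min-max scaling computes `(x - min) / (max - min)`. The NumPy sketch below reproduces both on a tiny made-up array, taking the statistics from the training rows only. It illustrates the arithmetic and is not a replacement for the scikit-learn classes.

```python
import numpy as np

# Made-up data: four training rows and one test row, two features each
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[5.0, 25.0]])

# Standardization: statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)          # population std (ddof=0)
train_std = (train - mean) / std
test_std = (test - mean) / std

# Min-max scaling: again fitted on the training rows only
lo = train.min(axis=0)
hi = train.max(axis=0)
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)

print(train_mm)
print(test_mm)
```

Because the minimum and maximum come from the training rows, a test value outside the training range (5.0 in the first column) maps outside the interval [0, 1].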



# Model Initialization

In [None]:
# Step 5: Initialize Regression Model
regression = LinearRegression(n_jobs=-1)

# Model Training

In [None]:
# Step 6: Train the Model on Training Data
regression.fit(X_train, Y_train)


# Slope/Coefficient & Intercept

In [None]:
#finding the coefficient/Slope
regression.coef_

In [None]:
#Calculating intercept
regression.intercept_

# Best Fit Line - Intercept Visualization

In [None]:
# Scatter plot of the training data (assumes X_train has a single feature)
plt.scatter(X_train, Y_train)
# Best-fit line predicted by the model
plt.plot(X_train, regression.predict(X_train))

# Model Prediction

## Prediction Theory

In [None]:
"""
Y = mX + c
Y = Prediction/Output
m = Slope/Coefficient
X = X_test
c = Intercept
Output = m * X_test + c
"""

In [None]:
# Step 7: Make Predictions on Test Data
y_pred = regression.predict(X_test)
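With several features, the slope becomes one coefficient per column, and a prediction is the dot product of the features with `coef_` plus `intercept_`. The sketch below checks this on noise-free synthetic data. The arrays and the chosen weights are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with known weights [2, -1, 0.5] and intercept 4
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0

reg = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: features times coefficients, plus the intercept
manual = X_demo @ reg.coef_ + reg.intercept_

print(reg.coef_, reg.intercept_)
```

The manual result agrees with `reg.predict(X_demo)` up to floating-point error.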


# Model Evaluation

In [None]:
# Step 8: Evaluate Model Performance on Test Data
mse = mean_squared_error(Y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(Y_test, y_pred)
r2 = r2_score(Y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R2):", r2)
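The four metrics can also be written out directly, which helps when checking what each one measures. This is a small NumPy sketch on made-up values, intended as a cross-check of the definitions rather than a replacement for `sklearn.metrics`.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat

mse_demo = np.mean(errors ** 2)        # mean of squared errors
rmse_demo = np.sqrt(mse_demo)          # same units as the target
mae_demo = np.mean(np.abs(errors))     # mean of absolute errors

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(mse_demo, rmse_demo, mae_demo, r2_demo)
```

RMSE penalizes large errors more heavily than MAE, which is why the two can differ noticeably on data with occasional large spikes.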


# OLS-Ordinary Least Squares Regression

In [None]:
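# sm.OLS takes the target first and the features second;
# add_constant adds the intercept column, which OLS omits by default
model = sm.OLS(Y_train, sm.add_constant(X_train)).fit()

prediction = model.predict(sm.add_constant(X_test))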

print(prediction)

In [None]:
print(model.summary())
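`statsmodels` OLS does not fit an intercept unless the design matrix contains a constant column. The sketch below shows the effect of that column using plain NumPy least squares on made-up points from the line y = 2x + 5. It uses `np.linalg.lstsq` rather than `statsmodels`, so it illustrates the idea only.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 5.0

# Without a constant column the fitted line is forced through the origin
A_no_const = x.reshape(-1, 1)
beta_no_const, *_ = np.linalg.lstsq(A_no_const, y, rcond=None)

# With a column of ones the intercept is estimated as well
A_const = np.column_stack([np.ones_like(x), x])
beta_const, *_ = np.linalg.lstsq(A_const, y, rcond=None)

print("no constant:", beta_no_const)
print("with constant (intercept, slope):", beta_const)
```

Without the constant column the slope is distorted (about 3.67 instead of 2) because it has to absorb the missing intercept.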

# New Data Prediction

In [None]:
# Replace the ? placeholder with one row of feature values, in the same column order as X
regression.predict(scaler.transform([[?]]))

# Cross Validation

In [None]:
# Step 9: Perform Cross-Validation
cv_scores = cross_val_score(regression, X, Y, cv=5, scoring="neg_mean_squared_error")

In [None]:
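# Step 10: Compute Average Cross-Validation Score
# The scores are negative MSE values, so the average is negative (closer to 0 is better)
avg_cv_score = np.mean(cv_scores)
print("Average Cross-Validation Score (negative MSE):", avg_cv_score)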
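With `scoring="neg_mean_squared_error"`, each fold's score is the MSE with its sign flipped, so that larger is always better. The sketch below runs a manual 5-fold loop in NumPy on synthetic data to show where those negative numbers come from. The contiguous folds and the least-squares fit are simplifying assumptions, not scikit-learn's exact internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with noise of standard deviation 0.5
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.5, size=100)

all_idx = np.arange(100)
folds = np.array_split(all_idx, 5)      # five contiguous folds of 20 rows

neg_mse_scores = []
for test_idx in folds:
    train_idx = np.setdiff1d(all_idx, test_idx)
    # Least-squares fit with an intercept column on the training folds
    A_train = np.column_stack([np.ones(len(train_idx)), X_demo[train_idx]])
    beta, *_ = np.linalg.lstsq(A_train, y_demo[train_idx], rcond=None)
    # Score on the held-out fold, with the sign flipped
    A_test = np.column_stack([np.ones(len(test_idx)), X_demo[test_idx]])
    mse_fold = np.mean((y_demo[test_idx] - A_test @ beta) ** 2)
    neg_mse_scores.append(-mse_fold)

# Flip the sign back before taking the square root
avg_rmse = np.sqrt(-np.mean(neg_mse_scores))
print(neg_mse_scores, avg_rmse)
```

Note that `avg_rmse` here is the square root of the mean MSE, which is not quite the same as the mean of the per-fold RMSE values.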


# Hyperparameter Tuning & Model Selection

In [None]:
# Step 11: Hyperparameter Tuning (if needed)
# Example: GridSearchCV or RandomizedSearchCV

# Step 12: Model Selection (Choose the best model)

# Repeat steps 6-12 as needed for different models or hyperparameters

# Additional Steps: Save the Trained Model, Make Predictions on New Data, etc.
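Step 11 could look roughly like the following sketch, which tunes the `alpha` of a Ridge model with `GridSearchCV`. The data is synthetic, and the candidate grid and the pipeline layout are assumptions chosen for illustration. Placing the scaler inside the pipeline means it is refitted on each training fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic regression data (not the energy dataset)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([4.0, 0.0, -3.0, 0.0, 1.0]) + rng.normal(size=200)

# Scaling and model in one estimator; step names are the lowercased class names
pipe = make_pipeline(MinMaxScaler(), Ridge())

# Hypothetical candidate values for the regularization strength
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
best_rmse = np.sqrt(-search.best_score_)
print(best_alpha, best_rmse)
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of candidates instead of trying every combination.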

# Questions 17 - 25

17

In [13]:

X = data[['T2']]  # Independent variable (living room temperature)
y = data['T6']     # Dependent variable (temperature outside)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fitting the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting the values of T6 using the model
y_pred = model.predict(X_test)

# Calculating RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Printing the RMSE rounded to three decimal places
print("Root Mean Squared Error (RMSE): {:.3f}".format(rmse))

Root Mean Squared Error (RMSE): 3.633


18

In [14]:


# Remove specified columns
data = data.drop(columns=["date", "lights"])

# Define target variable and features
X = data.drop(columns=["Appliances"])
y = data["Appliances"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalize the dataset
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a multiple linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict target variable for training set
y_train_pred = model.predict(X_train_scaled)

# Calculate Mean Absolute Error for training set
mae_train = mean_absolute_error(y_train, y_train_pred)

# Print Mean Absolute Error for training set rounded to three decimal places
print("Mean Absolute Error (training set): {:.3f}".format(mae_train))


Mean Absolute Error (training set): 53.743


19

In [16]:



# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Normalize the dataset
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a multiple linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict target variable for training set
y_train_pred = model.predict(X_train_scaled)

# Calculate Root Mean Squared Error for training set
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))

# Print Root Mean Squared Error for training set rounded to three decimal places
print("Root Mean Squared Error (training set): {:.3f}".format(rmse_train))


Root Mean Squared Error (training set): 95.215


20

In [17]:
# Predict target variable for test set
y_test_pred = model.predict(X_test_scaled)

# Calculate Mean Absolute Error for test set
mae_test = mean_absolute_error(y_test, y_test_pred)

# Print Mean Absolute Error for test set rounded to three decimal places
print("Mean Absolute Error (test set): {:.3f}".format(mae_test))


Mean Absolute Error (test set): 53.642


21

In [21]:

# Predict target variable for test set
y_test_pred = model.predict(X_test_scaled)

# Calculate Root Mean Squared Error for test set
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print Root Mean Squared Error for test set rounded to three decimal places
print("Root Mean Squared Error (test set): {:.3f}".format(rmse_test))


Root Mean Squared Error (test set): 93.641


22

In [27]:
# Print Root Mean Squared Error for training set
# (rmse_train and rmse_test were computed above with the linear regression model, not Lasso)
print("Root Mean Squared Error (training set) with linear regression model: {:.3f}".format(rmse_train))

# Print Root Mean Squared Error for test set
print("Root Mean Squared Error (test set) with linear regression model: {:.3f}".format(rmse_test))


Root Mean Squared Error (training set) with linear regression model: 95.215
Root Mean Squared Error (test set) with linear regression model: 93.641


23

In [20]:
from sklearn.linear_model import Ridge

# Train a Ridge regression model with default parameters
ridge_model = Ridge()
ridge_model.fit(X_train_scaled, y_train)

# Predict target variable for test set using the Ridge regression model
y_test_pred_ridge = ridge_model.predict(X_test_scaled)

# Calculate Root Mean Squared Error for test set with Ridge regression model
rmse_test_ridge = np.sqrt(mean_squared_error(y_test, y_test_pred_ridge))

# Print Root Mean Squared Error for test set with Ridge regression model rounded to three decimal places
print("Root Mean Squared Error (test set) with Ridge regression model: {:.3f}".format(rmse_test_ridge))


Root Mean Squared Error (test set) with Ridge regression model: 93.709


In [22]:
print("Root Mean Squared Error (test set) with linear regression model: {:.3f}".format(rmse_test))
print("Root Mean Squared Error (test set) with Ridge regression model: {:.3f}".format(rmse_test_ridge))


Root Mean Squared Error (test set) with linear regression model: 93.641
Root Mean Squared Error (test set) with Ridge regression model: 93.709


24

In [23]:
from sklearn.linear_model import Lasso

# Train a Lasso regression model with default parameters
lasso_model = Lasso()
lasso_model.fit(X_train_scaled, y_train)

# Get the feature weights from the trained Lasso regression model
feature_weights = lasso_model.coef_

# Count the number of features with non-zero feature weights
non_zero_features = sum(feature_weights != 0)

# Print the number of features with non-zero feature weights
print("Number of features with non-zero feature weights:", non_zero_features)


Number of features with non-zero feature weights: 4
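Beyond counting the non-zero weights, it is often useful to see which features Lasso kept. The sketch below pairs coefficients with feature names on synthetic data where only two of five features drive the target. The names `f0` to `f4` and the data are made up; on the real data the names would come from `X.columns`.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: the target depends on f0 and f3 only
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X_demo = rng.normal(size=(500, 5))
y_demo = 10.0 * X_demo[:, 0] - 5.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Split the feature names by whether their weight was shrunk to exactly zero
kept = [name for name, w in zip(feature_names, lasso.coef_) if w != 0]
dropped = [name for name, w in zip(feature_names, lasso.coef_) if w == 0]

print("kept:", kept)
print("dropped:", dropped)
```

The L1 penalty also shrinks the surviving weights toward zero, so their magnitudes come out smaller than the true values of 10 and -5.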


25

In [25]:
# Predict target variable for test set using the Lasso Regression model
y_test_pred_lasso = lasso_model.predict(X_test_scaled)

# Calculate Root Mean Squared Error for test set with Lasso Regression model
rmse_test_lasso = np.sqrt(mean_squared_error(y_test, y_test_pred_lasso))

# Print Root Mean Squared Error for test set with Lasso Regression model rounded to three decimal places
print("Root Mean Squared Error (test set) with Lasso Regression model: {:.3f}".format(rmse_test_lasso))


Root Mean Squared Error (test set) with Lasso Regression model: 99.424
