# Regression in Machine Learning
### Chrispine Tot
#### Student ID: 1637159abc01f000

### Question 1:
The percent of the total variation of the dependent variable Y explained by the set of independent variables X is measured by:

**Answer:** Coefficient of Determination
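
For reference, the coefficient of determination is commonly written as below. The symbols are notation introduced here for illustration: $y_i$ is an observed value, $\hat{y}_i$ the model's prediction for it, and $\bar{y}$ the mean of the observed values.

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

The numerator is the residual sum of squares and the denominator is the total sum of squares, so the ratio is the share of variation left unexplained.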

### Question 2:
How do you define a Residual?

**Answer:** y − ŷ (the observed value minus the predicted value)
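
A minimal sketch of that definition, using made-up numbers (the two lists are hypothetical and not taken from the energy dataset):

```python
# Hypothetical observed values and model predictions (illustrative numbers only)
y_observed = [3.0, 5.0, 7.5]
y_predicted = [2.5, 5.0, 8.0]

# A residual is the observed value minus the predicted value
residuals = [y - y_hat for y, y_hat in zip(y_observed, y_predicted)]
print(residuals)  # [0.5, 0.0, -0.5]
```

A positive residual means the model under-predicted that point; a negative one means it over-predicted.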

### Question 3:
For the straight-line graph of the equation Y = a + bX, the line is horizontal if:

**Answer:** b = 0

### Question 4:
Which of the following is true about heteroskedasticity?

**Answer:** Linear Regression with varying error terms

### Question 5:
Generally, which of the following method(s) is/are used for predicting continuous dependent variables?

1. Linear Regression

2. Logistic Regression

**Answer:** 1 only

### Question 6:
Which of the following is/are true about “Ridge” or “Lasso” regression methods with respect to feature selection?

**Answer:** Lasso regression uses subset selection of features

### Question 7:
Which of the following sentences is/are true about outliers in Linear Regression:

**Answer:** Linear regression is sensitive to outliers

### Question 8:
Which of the following metrics can be used for evaluating regression models?

1. R Squared

2. Adjusted R Squared

3. F Statistics

4. RMSE / MSE / MAE

**Answer:** 1, 2, 3 and 4
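
Most of these metrics can be computed by hand, which may help show how they relate to one another. The sketch below uses a hypothetical helper, `regression_metrics`, and made-up numbers; it is not part of the notebook, and it leaves out the F statistic.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute common regression metrics by hand (illustrative helper)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error
    mse = sum(e * e for e in errors) / n    # Mean Squared Error
    rmse = math.sqrt(mse)                   # Root Mean Squared Error

    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot

    # Adjusted R^2 penalizes extra predictors; assumes n > n_features + 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}

# Made-up observations and predictions from a one-feature model
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0], n_features=1)
print(metrics)
```

For these numbers the MAE is 0.25, R squared is 0.9, and adjusted R squared drops to 0.85 because of the penalty for the single predictor.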

### Question 9:
A best-fit line relating X and Y has an R-squared value of 0.75. How do I interpret this information?

**Answer:** 75% of the variance in Y is explained by X

### Question 10:
Which of the following measures is optimal for comparing the goodness of fit of competing regression models involving the same dependent variable?

**Answer:** Standard deviation of the residuals

### Question 11:
The Lasso can be interpreted as least-squares linear regression where:

**Answer:** Weights are regularized with the L1 norm
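
Written out, the Lasso objective looks like the expression below. This is the form scikit-learn documents for its `Lasso` estimator; other texts often drop the $1/(2n)$ factor. Here $X$ is the feature matrix, $y$ the target vector, $w$ the weight vector, $n$ the number of samples, and $\alpha$ the regularization strength (notation introduced here for illustration).

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The first term is the ordinary least-squares loss; the second is the L1 norm of the weights, which is what pushes some of them to exactly zero.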

### Question 12:
From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value to two decimal places?

**Answer:** 0.64

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (r2_score, mean_absolute_error, mean_squared_error)
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import (Ridge, Lasso)

In [2]:
data = pd.read_csv("energy.csv", parse_dates=["date"])
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| date | 2016-01-11 17:00:00 | 2016-01-11 17:10:00 | 2016-01-11 17:20:00 | 2016-01-11 17:30:00 | 2016-01-11 17:40:00 |
| Appliances | 60 | 60 | 50 | 50 | 60 |
| lights | 30 | 30 | 30 | 40 | 40 |
| T1 | 19.89 | 19.89 | 19.89 | 19.89 | 19.89 |
| RH_1 | 47.596667 | 46.693333 | 46.3 | 46.066667 | 46.333333 |
| T2 | 19.2 | 19.2 | 19.2 | 19.2 | 19.2 |
| RH_2 | 44.79 | 44.7225 | 44.626667 | 44.59 | 44.53 |
| T3 | 19.79 | 19.79 | 19.79 | 19.79 | 19.79 |
| RH_3 | 44.73 | 44.79 | 44.933333 | 45.0 | 45.0 |
| T4 | 19.0 | 19.0 | 18.926667 | 18.89 | 18.89 |


- Checking for missing values in our dataframe

In [3]:
data.isna().any()

date           False
Appliances     False
lights         False
T1             False
RH_1           False
T2             False
RH_2           False
T3             False
RH_3           False
T4             False
RH_4           False
T5             False
RH_5           False
T6             False
RH_6           False
T7             False
RH_7           False
T8             False
RH_8           False
T9             False
RH_9           False
T_out          False
Press_mm_hg    False
RH_out         False
Windspeed      False
Visibility     False
Tdewpoint      False
rv1            False
rv2            False
dtype: bool

- Splitting the data into training and testing data using a ***70-30*** train-test split

In [4]:
x = data.drop("T6", axis=1)
y = data["T6"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [5]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((13814, 28), (13814,), (5921, 28), (5921,))

- Instantiating our linear regression model
- Using the linear regression model to predict the target variable ***T6*** using the independent variable ***T2***

In [6]:
model = LinearRegression()
model.fit(x_train[["T2"]], y_train)
pred = model.predict(x_test[["T2"]])
val = round(r2_score(y_test, pred), 2)
print(f"The r2_score is {val}")

The r2_score is 0.64


### Multiple Linear Regression
###### Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:

- Creating a deep copy of our dataframe

In [7]:
d_copy = data.copy()

In [8]:
# Dropping the columns "date" and "lights"
d_copy = d_copy.drop(["date","lights"], axis=1)

# Instantiating min-max scaler
scaler = MinMaxScaler()

# Creating a pandas dataframe from the normalized data
# Original column names are restored via the kwarg "columns=d_copy.columns"
data_norm = pd.DataFrame(scaler.fit_transform(d_copy), columns= d_copy.columns)

features= data_norm.drop("Appliances", axis=1) # independent variables
target= data_norm["Appliances"] # target variable
x_train, x_test, y_train, y_test = train_test_split(features,target, test_size=0.3,random_state=42)
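
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] by subtracting the column minimum and dividing by the column range. A minimal sketch of that formula for one column follows; `min_max_scale` is a hypothetical helper written for illustration, and it assumes the column is not constant.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] (assumes max(values) > min(values))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up appliance readings, not taken from energy.csv
scaled = min_max_scale([60, 50, 110, 10])
print(scaled)
```

Because the target column is scaled as well, the error metrics reported below are in normalized units rather than in the original units of "Appliances".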

### Question 13:
What is the Mean Absolute Error (in two decimal places)?

**Answer:** 0.05

In [9]:
model_2 = LinearRegression()
model_2.fit(x_train, y_train)
preds= model_2.predict(x_test)
mae = round(mean_absolute_error(y_test, preds),2)
print(f"The Mean Absolute Error is {mae}")

The Mean Absolute Error is 0.05


### Question 14:
What is the Residual Sum of Squares (in two decimal places)?

**Answer:** 45.35

In [10]:
rss = round(np.sum(np.square(preds - y_test)), 2)
print(f"The Residual Sum of Squares is {rss}")

The Residual Sum of Squares is 45.35


### Question 15:
What is the Root Mean Squared Error (in three decimal places)?

**Answer:** 0.088

In [11]:
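# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
rmse = round(np.sqrt(mean_squared_error(y_test, preds)), 3)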
print(f"The Root Mean Squared Error is {rmse}")

The Root Mean Squared Error is 0.088
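
As a sanity check, the RMSE can be recovered from the Residual Sum of Squares, since MSE = RSS / n and RMSE is its square root. The sketch below plugs in the rounded values reported earlier in this notebook (RSS of 45.35 and 5921 test rows), so it is only an approximate cross-check.

```python
import math

# Values reported in the notebook outputs above (already rounded)
rss = 45.35      # Residual Sum of Squares on the test set
n_test = 5921    # number of test rows, from y_test.shape

mse = rss / n_test      # mean squared error
rmse = math.sqrt(mse)   # root mean squared error
print(round(rmse, 3))   # 0.088
```

The result agrees with the RMSE computed directly from the predictions.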


### Question 16:
What is the Coefficient of Determination (in two decimal places)?

**Answer:** 0.15

In [12]:
cod = round(r2_score(y_test, preds), 2)
print(f"The Coefficient of Determination is {cod}")

The Coefficient of Determination is 0.15


### Question 17:
Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

**Lowest Value:** RH_2

**Highest Value:** RH_1

In [13]:
def weight(model, feat, col_name):
    # Pair each coefficient with its feature name and sort ascending by weight
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    df = pd.DataFrame(weights).reset_index()
    df.columns = ["Feature", col_name]
    return df

model_weights = weight(model_2, x_train, "Weight")
model_weights

| | Feature | Weight |
|---|---|---|
| 0 | RH_2 | -0.456698 |
| 1 | T_out | -0.32186 |
| 2 | T2 | -0.236178 |
| 3 | T9 | -0.189941 |
| 4 | RH_8 | -0.157595 |
| 5 | RH_out | -0.077671 |
| 6 | RH_7 | -0.044614 |
| 7 | RH_9 | -0.0398 |
| 8 | T5 | -0.015657 |
| 9 | T1 | -0.003281 |


In [14]:
model_weights.iloc[[0,25]]

| | Feature | Weight |
|---|---|---|
| 0 | RH_2 | -0.456698 |
| 25 | RH_1 | 0.553547 |


### Question 18:
Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

**Answer:** No

In [15]:
ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(x_train, y_train)
ridge_pred = ridge_reg.predict(x_test)
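# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
rdg_rmse = round(np.sqrt(mean_squared_error(y_test, ridge_pred)), 3)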
print(f"The Ridge Regression Model has a RMSE of {rdg_rmse}\nThere is no change in the RMSE value")

The Ridge Regression Model has a RMSE of 0.088
There is no change in the RMSE value
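
For context, scikit-learn's `Ridge` minimizes a least-squares loss plus an L2 penalty on the weights. In the expression below, $X$, $y$, and $w$ denote the feature matrix, target vector, and weight vector, and $\alpha$ is the regularization strength (notation introduced here for illustration).

```latex
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
```

With $\alpha = 0.4$ and 13,814 training rows, the penalty is probably small relative to the data term, which would be consistent with the RMSE being unchanged at three decimal places.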


### Question 19:
Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

**Answer:** 4

In [16]:
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(x_train, y_train)

lasso_weights = weight(lasso_reg, x_train, "Lasso_Weight")
(lasso_weights[lasso_weights["Lasso_Weight"] != 0]).count()

Feature         4
Lasso_Weight    4
dtype: int64
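
One way to see why the L1 penalty produces exact zeros is the soft-thresholding operator. In the idealized case of orthonormal features, the Lasso solution is the least-squares weight shrunk toward zero by a threshold and set to zero once it falls inside it. The sketch below is illustrative only: the weights and the threshold `lam` are made up, and scikit-learn's `Lasso` solves the general case with coordinate descent rather than this one-step formula.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares weights and threshold (not from the notebook)
ols_weights = [0.55, -0.46, 0.02, -0.003]
lam = 0.05

lasso_like = [soft_threshold(w, lam) for w in ols_weights]
n_nonzero = sum(1 for w in lasso_like if w != 0)
print(lasso_like, n_nonzero)
```

Here the two small weights are zeroed out while the two large ones are merely shrunk, which mirrors the way only a handful of features keep non-zero weights in the Lasso model above.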

### Question 20:
What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

**Answer:** 0.094

In [17]:
lasso_pred = lasso_reg.predict(x_test)
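# Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
lass_rmse = round(np.sqrt(mean_squared_error(y_test, lasso_pred)), 3)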
print(f"The Lasso Regression RMSE is {lass_rmse}")

The Lasso Regression RMSE is 0.094
