# Q-4
# Regression

* Performed regression on the dataset of global active power values.

* Implemented a **Multilayer Perceptron (MLP)** and a **linear regression** model for this question.

* Compared the performance of both models on metrics such as **Root Mean Squared Error (RMSE)** and **Mean Absolute Percentage Error (MAPE)**.

* Considered only the **Global active power** field.

* Experimented with different architectures (number of hidden layers, activation functions, etc.) to see the impact on performance.

* Also experimented with longer windows of past power values and reported the performance (for example, a window of two hours instead of one).
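As a reference for the two metrics named above, here is a minimal sketch of RMSE and MAPE on made-up values. The helper names `rmse` and `mape` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import math

def rmse(y_true, y_pred):
    # Root of the mean of squared errors
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mape(y_true, y_pred):
    # Mean of absolute relative errors, in percent; undefined if a true value is 0
    relative = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(relative) / len(relative)

# Toy values (hypothetical), chosen only to make the arithmetic easy to follow
y_true = [2.0, 4.0, 5.0]
y_pred = [1.0, 4.0, 6.0]

rmse_value = rmse(y_true, y_pred)
mape_value = mape(y_true, y_pred)
print(rmse_value, mape_value)
```

RMSE is in the same units as the target (kilowatts here), while MAPE is scale-free, which is why the two are often reported together.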



## Import libraries

In [0]:
import sys
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler

def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE in percent; assumes y_true contains no zeros
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

## Load Data

In [0]:
path="/content/drive/My Drive/IIIT-H/Statistical_Methods_in_AI/Assignment-III/"
#path=sys.argv[1]
file_name=path+'household_power_consumption.txt'
dataset = pd.read_csv(file_name, sep=';', header=0, low_memory=False, infer_datetime_format=True,na_values=['nan','?'], parse_dates={'datetime':[0,1]}, index_col=['datetime'])
# summarize
print(dataset.shape)
print(dataset.head())

(2075259, 7)
                     Global_active_power  ...  Sub_metering_3
datetime                                  ...                
2006-12-16 17:24:00                4.216  ...            17.0
2006-12-16 17:25:00                5.360  ...            16.0
2006-12-16 17:26:00                5.374  ...            17.0
2006-12-16 17:27:00                5.388  ...            17.0
2006-12-16 17:28:00                3.666  ...            17.0

[5 rows x 7 columns]


In [0]:
df=dataset.copy()

## Data Analysis

In [0]:
df.describe()

Unnamed: 0,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
count,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0
mean,1.091615,0.1237145,240.8399,4.627759,1.121923,1.29852,6.458447
std,1.057294,0.112722,3.239987,4.444396,6.153031,5.822026,8.437154
min,0.076,0.0,223.2,0.2,0.0,0.0,0.0
25%,0.308,0.048,238.99,1.4,0.0,0.0,0.0
50%,0.602,0.1,241.01,2.6,0.0,0.0,1.0
75%,1.528,0.194,242.89,6.4,0.0,1.0,17.0
max,11.122,1.39,254.15,48.4,88.0,80.0,31.0


In [0]:
df.dtypes

Global_active_power      float64
Global_reactive_power    float64
Voltage                  float64
Global_intensity         float64
Sub_metering_1           float64
Sub_metering_2           float64
Sub_metering_3           float64
dtype: object

In [0]:
#data = df.resample('H').sum()
data=df.copy()
data2=df.copy()

In [0]:

data=data.drop(['Global_reactive_power', 'Voltage','Global_intensity','Sub_metering_2','Sub_metering_3','Sub_metering_1'],axis=1)
test_data=data.fillna(value=0)

In [0]:
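def fill_missing(data):
    # Replace each NaN, in place, with the value recorded a fixed offset earlier.
    # Note: 24*60*30 minutes is 30 days, although the variable is named one_day.
    one_day = 24*60*30
    for row in range(data.shape[0]):
        for col in range(data.shape[1]):
            if np.isnan(data[row, col]):
                data[row, col] = data[row-one_day, col]
fill_missing(data2.values)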
data_train=data2.drop(['Global_reactive_power', 'Voltage','Global_intensity','Sub_metering_2','Sub_metering_3','Sub_metering_1'],axis=1)
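The gap-filling step above can be illustrated on a tiny list. This is a sketch under assumptions: `fill_from_offset` is a hypothetical helper, the offset of 3 is arbitrary, and, unlike the notebook's loop (where a negative index wraps around to the end of the array), gaps in the first `offset` positions are left untouched.

```python
import math

def fill_from_offset(values, offset):
    """Replace each NaN with the value `offset` positions earlier (in place)."""
    for i, v in enumerate(values):
        # Skip early positions that have no earlier value to copy from
        if math.isnan(v) and i >= offset:
            values[i] = values[i - offset]
    return values

# Toy series with two gaps (hypothetical data)
series = [1.0, 2.0, 3.0, float("nan"), 5.0, float("nan")]
filled = fill_from_offset(series, offset=3)
print(filled)
```

Because the loop runs front to back, a value that was itself filled can be reused for a later gap, which matches the behavior of the notebook's loop.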

In [0]:
time_frame=60
window_size=1
data_train=np.array(data_train)
ws=time_frame*window_size
X_test=[]
y_true=[]
missing_index=[]
test_data=np.array(test_data)
for i in range(ws,data.shape[0]):
  if(test_data[i]==0):
    test_=data_train[i-ws:i]
    X_test.append(test_)
    y_true.append(data_train[i])
    missing_index.append(i)

Setting the window size to 2 hours (`window_size=2`, so `ws=120`), the input becomes

(t0, t1, t2, ..., t119)

and the output is

t120
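The windowing idea described above can be sketched on a toy series. The function name `make_windows` and the six-value series are assumptions made for illustration; the notebook builds the same kind of pairs with its own loop.

```python
def make_windows(series, ws):
    """Pair each run of `ws` consecutive values with the value that follows it."""
    X, y = [], []
    for i in range(len(series) - ws):
        X.append(series[i:i + ws])   # inputs: ws past values
        y.append(series[i + ws])     # target: the next value
    return X, y

# Toy series (hypothetical); a window of 3 stands in for ws = 60 or 120
series = [10, 11, 12, 13, 14, 15]
X, y = make_windows(series, ws=3)
print(X)
print(y)
```

A series of length n yields n - ws pairs, so a longer window gives slightly fewer training examples but more inputs per example.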


In [0]:

X_train, y_train = [], []
for i in range(0, data_train.shape[0]-ws):
    train_=data_train[i:i+ws]
    X_train.append(train_)
    y_train.append(data_train[i+ws])


In [0]:
X_train, y_train = np.asarray(X_train), np.asarray(y_train)
X_test= np.asarray(X_test)

### Model Implementation

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

In [0]:
X_train = X_train.reshape(X_train.shape[0],X_train.shape[1])

In [0]:
reg = LinearRegression()
#reg = Ridge(alpha=10)
reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
X_test.shape

(25979, 60, 1)

In [0]:
X_test = X_test.reshape(X_test.shape[0],X_test.shape[1])

In [0]:
X_test.min()

0.102

In [0]:
y_pred=reg.predict(X_test)

In [0]:
y_pred.min()

0.0891552446707519

In [0]:
print('Mean squared error: %f'
      % mean_squared_error(y_true,y_pred))
# The coefficient of determination: 1 is perfect prediction

print('R2 Score: %f'
      % r2_score(y_true, y_pred))


print('Mean Absolute Percentage Error: %f'
      % mean_absolute_percentage_error(y_true, y_pred))

Mean squared error: 0.054007
R2 Score: 0.941438
Mean Absolute Percentage Error: 10.915352
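For readers unfamiliar with the R2 score reported above, the following sketch computes it by hand as 1 minus the ratio of residual to total sum of squares. The helper `r2` and the toy values are illustrative assumptions; the notebook itself uses `sklearn.metrics.r2_score`.

```python
def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets (hypothetical)
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)        # predictions equal the targets
baseline = r2(y_true, [2.5] * 4)    # always predicting the mean of y_true
print(perfect, baseline)
```

A score near 1 means the model explains most of the variance in the target, while a score of 0 is no better than always predicting the mean.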



### Multi Layer Perceptron

In [0]:
X_train = X_train.reshape(X_train.shape[0], 1,ws)
X_test = X_test.reshape(X_test.shape[0],1, ws)

In [0]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,Dropout

In [0]:
reg = Sequential()
reg.add(Dense(ws,activation = 'relu',input_shape=(1,ws)))                         
reg.add(Dense(ws,activation = 'relu'))
reg.add(Dropout(0.2))
reg.add(Dense(1))

Created an MLP with:

* (window size) nodes in the first hidden layer
* (window size) nodes in the second hidden layer
* a 20% dropout layer
* a final output layer with 1 node

All layers are fully connected.
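To give a sense of the model's size, here is a rough count of trainable parameters, assuming the layer sizes in the code cell above (two `Dense(ws)` layers followed by a one-unit output) and `ws = 60`. The helper `dense_params` is hypothetical; dropout contributes no trainable parameters.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

ws = 60  # one-hour window of minute-level readings

layer_params = [
    dense_params(ws, ws),  # first hidden layer
    dense_params(ws, ws),  # second hidden layer
    dense_params(ws, 1),   # output layer
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Doubling the window to `ws = 120` roughly quadruples the hidden-layer weights, which is worth keeping in mind when comparing window sizes.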

In [0]:
reg.compile(loss='mean_squared_error', optimizer='adam')
reg.fit(X_train, y_train,shuffle=False,batch_size=200, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fda83d59b38>

In [0]:
y_pred = reg.predict(X_test)


In [0]:
y_pred=y_pred.reshape(y_pred.shape[0], 1)

In [0]:
print('Mean squared error: %f'
      % mean_squared_error(y_true,y_pred))
# The coefficient of determination: 1 is perfect prediction

print('R2 Score: %f'
      % r2_score(y_true, y_pred))


print('Mean Absolute Percentage Error: %f'
      % mean_absolute_percentage_error(y_true, y_pred))

Mean squared error: 0.051717
R2 Score: 0.943920
Mean Absolute Percentage Error: 9.937234


In [0]:
import sys
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
def fill_missing(data):
    one_day = 24*60
    for row in range(data.shape[0]):
        for col in range(data.shape[1]):
            if np.isnan(data[row, col]):
                data[row, col] = data[row-one_day, col]
def window(data_train, data_test, ws):
    # Build (window, next value) pairs for the train and test series
    X_train, y_train = [], []
    for i in range(ws, len(data_train)-ws-1):
        X_train.append(data_train[i:i+ws])
        y_train.append(data_train[i+ws])
    X_test, y_test = [], []
    for i in range(ws, len(data_test)-ws-1):
        X_test.append(data_test[i:i+ws])
        y_test.append(data_test[i+ws])
    return X_train,y_train,X_test,y_test

path=sys.argv[1]
file_name=path+'household_power_consumption.txt'
dataset = pd.read_csv(file_name, sep=';', header=0, low_memory=False, infer_datetime_format=True,na_values=['nan','?'], parse_dates={'datetime':[0,1]}, index_col=['datetime'])
df=dataset.copy()
data=df.copy()
data2=df.copy()
data=data.drop(['Global_reactive_power', 'Voltage','Global_intensity','Sub_metering_2','Sub_metering_3','Sub_metering_1'],axis=1)
test_data=data.fillna(value=0)
fill_missing(data2.values)
data_train=data2.drop(['Global_reactive_power', 'Voltage','Global_intensity','Sub_metering_2','Sub_metering_3','Sub_metering_1'],axis=1)
time_frame=60
window_size=1
data_train=np.array(data_train)
ws=time_frame*window_size
X_test=[]
y_true=[]
missing_index=[]
test_data=np.array(test_data)
for i in range(ws,data.shape[0]):
  if(test_data[i]==0):
    test_=data_train[i-ws:i]
    X_test.append(test_)
    y_true.append(data_train[i])
    missing_index.append(i)
X_train, y_train = [], []
for i in range(0, data_train.shape[0]-ws):
    train_=data_train[i:i+ws]
    X_train.append(train_)
    y_train.append(data_train[i+ws])
X_train, y_train = np.asarray(X_train), np.asarray(y_train)
X_test= np.asarray(X_test)
X_train = X_train.reshape(X_train.shape[0],X_train.shape[1])
reg = LinearRegression()
reg.fit(X_train, y_train)
X_test = X_test.reshape(X_test.shape[0],X_test.shape[1])
y_pred=reg.predict(X_test)
print(y_pred)

## Conclusion

After experimenting with various MLP architectures (adding extra layers, decreasing the batch size, increasing the number of epochs) and with linear regression with and without regularization, linear regression with a window size of 2 (two hours) gave the lowest mean squared error.