# <Font color = Blue> Appliances energy prediction Data Set

## <font color= purple> About Dataset 

<font color= purple>Data Set Information:

The data set is sampled at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions about every 3.3 minutes. The wireless data was then averaged over 10-minute periods.

The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets on the date and time column. Two random variables are included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters).</font>
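
The averaging and merging steps described above can be sketched as follows. This is only an illustration on made-up readings, not the procedure used to build the published file: the column names `T1` and `Appliances` follow the data set, while the values, the 200-second spacing (about 3.3 minutes) and the start time are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings arriving every 200 seconds (about 3.3 minutes).
sensor_index = pd.date_range("2016-01-11 17:00:00", periods=18,
                             freq=pd.Timedelta(seconds=200))
sensor = pd.DataFrame({"T1": np.linspace(19.0, 20.7, 18)}, index=sensor_index)

# Average the readings over 10-minute periods (three readings per period here).
sensor_10min = sensor.resample("10min").mean()

# Hypothetical energy log with one record every 10 minutes.
energy = pd.DataFrame(
    {"Appliances": [60, 60, 50, 50, 60, 50]},
    index=pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min"),
)

# Merge the two sources on their shared date-time index.
merged = energy.join(sensor_10min, how="inner")
print(merged)
```

An inner join keeps only the timestamps present in both sources, which is one reasonable choice when the two logs do not cover exactly the same period.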
    
Data set Link : https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv

## <font color= Darkgreen>Attribute Information:

<br>date, time in year-month-day hour:minute:second format<br>
Appliances, energy use in Wh<br>
lights, energy use of light fixtures in the house in Wh<br>
T1, Temperature in kitchen area, in Celsius<br>
RH_1, Humidity in kitchen area, in %<br>
T2, Temperature in living room area, in Celsius<br>
RH_2, Humidity in living room area, in %<br>
T3, Temperature in laundry room area, in Celsius<br>
RH_3, Humidity in laundry room area, in %<br>
T4, Temperature in office room, in Celsius<br>
RH_4, Humidity in office room, in %<br>
T5, Temperature in bathroom, in Celsius<br>
RH_5, Humidity in bathroom, in %<br>
T6, Temperature outside the building (north side), in Celsius<br>
RH_6, Humidity outside the building (north side), in %<br>
T7, Temperature in ironing room , in Celsius<br>
RH_7, Humidity in ironing room, in %<br>
T8, Temperature in teenager room 2, in Celsius<br>
RH_8, Humidity in teenager room 2, in %<br>
T9, Temperature in parents room, in Celsius<br>
RH_9, Humidity in parents room, in %<br>
T_out, Temperature outside (from Chievres weather station), in Celsius<br>
Pressure (from Chievres weather station), in mm Hg<br>
RH_out, Humidity outside (from Chievres weather station), in %<br>
Wind speed (from Chievres weather station), in m/s<br>
Visibility (from Chievres weather station), in km<br>
Tdewpoint (from Chievres weather station), in °C<br>
rv1, Random variable 1, nondimensional<br>
rv2, Random variable 2, nondimensional</font>
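
One way the two random variables could be used to screen out non-predictive attributes is to treat their correlation with the target as a noise baseline. The sketch below is an assumption about how such a filter might look, run on synthetic data; the column `signal`, the sample size and the seed are hypothetical and the rule is not taken from the data set authors.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic frame: one informative feature and two purely random columns.
signal = rng.normal(size=n)
frame = pd.DataFrame({
    "signal": signal,
    "rv1": rng.normal(size=n),
    "rv2": rng.normal(size=n),
})
frame["target"] = 3.0 * signal + rng.normal(size=n)

# Absolute correlation of every attribute with the target.
abs_corr = frame.corr()["target"].drop("target").abs()

# The strongest correlation reached by a random column is the noise baseline.
baseline = abs_corr[["rv1", "rv2"]].max()

# Keep only attributes that beat the baseline.
kept = abs_corr[abs_corr > baseline].index.tolist()
print(kept)
```

Correlation only captures linear association, so an attribute that fails this check may still be useful to a non-linear model.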

# <FONT COLOR = BLUE>READING THE DATA

In [None]:
import pandas as pd


In [None]:


df =pd.read_csv(r'D:\Hamoye Graded Quiz - Stage B\quiz Data\Appliances energy prediction Data Set_original\energydata_complete.csv',parse_dates=['date'])

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.columns = [x.lower() for x in df.columns]

In [None]:
df = df.set_index('date')

In [None]:
df.head()

In [None]:
df.info()

## <FONT COLOR = BLUE> DATA STRUCTURE 

In [None]:
# check missing values
df.isnull().sum()

In [None]:
df.describe()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
df.hist(bins=50,figsize=(20,15))
plt.savefig("attribute_histograms_plots")
plt.show()

In [None]:
df.corr()

In [None]:
import seaborn as sns


In [None]:
sns.pairplot(df)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(30,30))
sns.heatmap(df.corr(), annot = True, cmap= 'coolwarm')


In [None]:
sorted_appliances = df.sort_values('appliances',ascending=False)
sorted_appliances.head()

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
sorted_appliances = df.sort_values('appliances', ascending=False)
n_top = len(sorted_appliances) // 1000  # number of rows in the top 0.1%
print("The number of the top 0.1% values of the appliances' load is", n_top,
      "and they have a power load higher than",
      sorted_appliances.appliances.iloc[n_top], "Wh.")

# Boxplot of the appliances' load
sns.set(style="whitegrid")
ax = sns.boxplot(x=sorted_appliances.appliances)

In [None]:
# Remove outliers: drop rows with missing values, then drop rows where
# appliances is above 790 Wh or negative
df = df.dropna()
df = df.drop(df[(df.appliances > 790) | (df.appliances < 0)].index)
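
The cutoff used above comes from inspecting the top 0.1% of the sorted loads. A minimal sketch of that logic on made-up values is shown below; the series `loads` is hypothetical, and the rule "keep everything at or below the first value outside the top 0.1%" is an assumption about how the cutoff could be derived rather than hard-coded.

```python
import pandas as pd

# 1000 hypothetical load values: 10, 20, ..., 10000 Wh.
loads = pd.Series(range(10, 10010, 10), dtype=float)

# Size of the top 0.1% of the sample.
n_top = len(loads) // 1000

# First value that is NOT part of the top 0.1%, used as the cutoff.
cutoff = loads.sort_values(ascending=False).iloc[n_top]

# Keep only non-negative values at or below the cutoff.
filtered = loads[(loads <= cutoff) & (loads >= 0)]
print(n_top, cutoff, len(filtered))
```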

In [None]:
sorted_appliances = df.sort_values('appliances',ascending=False)
sorted_appliances.head()

In [None]:
df['hour']=df.index.hour
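# DatetimeIndex.week was removed in pandas 2.0; isocalendar() gives the ISO week number
df['week']=df.index.isocalendar().week.astype(int).values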
df['weekday']= df.index.weekday
df['month']=df.index.month

In [None]:
import numpy as np
df['log_appliances']=np.log(df.appliances)

In [None]:
df['house_temp']= (df.t1+df.t2+df.t3+df.t4+df.t5+df.t7+df.t8+df.t9)/8
df['house_hum']= (df.rh_1+df.rh_2+df.rh_3+df.rh_4+df.rh_5+df.rh_7+df.rh_8+df.rh_9)/8

In [None]:
df['house_temp'].head()

In [None]:
df.head()

In [None]:
# Add interaction terms to relax the additive assumption of the linear model
df['hour*lights']=df.hour*df.lights
df['t1rh1'] = df.t1 *df.rh_1
df['t2rh2'] = df.t2 *df.rh_2
df['t3rh3'] = df.t3 *df.rh_3
df['t4rh4'] = df.t4 *df.rh_4
df['t5rh5'] = df.t5 *df.rh_5
df['t6rh6'] = df.t6 *df.rh_6
df['t7rh7'] = df.t7 *df.rh_7
df['t8rh8'] = df.t8 *df.rh_8
df['t9rh9'] = df.t9 *df.rh_9
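
The temperature-humidity products above all follow the same naming pattern, so they could also be generated in a loop. The sketch below shows the idea on a tiny made-up frame with only two rooms; the values are hypothetical.

```python
import pandas as pd

# Hypothetical readings for two rooms, using the lower-cased column names.
demo = pd.DataFrame({
    "t1": [20.0, 21.0], "rh_1": [40.0, 45.0],
    "t2": [19.0, 18.0], "rh_2": [50.0, 55.0],
})

# One interaction column per room: temperature times relative humidity.
for i in (1, 2):
    demo[f"t{i}rh{i}"] = demo[f"t{i}"] * demo[f"rh_{i}"]

print(demo[["t1rh1", "t2rh2"]])
```

With the full data set the loop would run over `range(1, 10)` to cover all nine sensors.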

In [None]:
def code_mean(data,cat_feature,real_feature):
    return dict(data.groupby(cat_feature)[real_feature].mean())


In [None]:
df['weekday_avg']=list(map(code_mean(df[:],'weekday',"appliances").get,df.weekday))
df['hour_avg']=list(map(code_mean(df[:],'hour',"appliances").get,df.hour))
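
The two columns above replace each category with the mean target value of that category (mean encoding). The same result can be obtained with `groupby(...).transform('mean')`, as this small sketch on made-up values suggests; the frame `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical consumption values for two hours of the day.
demo = pd.DataFrame({
    "hour": [0, 0, 1, 1, 1],
    "appliances": [50.0, 70.0, 100.0, 120.0, 80.0],
})

# Mean encoding: every row receives the mean consumption of its hour.
demo["hour_avg"] = demo.groupby("hour")["appliances"].transform("mean")
print(demo)
```

Because the means are computed from the target itself, computing them on the training period only would avoid passing information from the test period into the features.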

In [None]:
df['weekday_avg'].head()

In [None]:
df['hour_avg'].head()

In [None]:
df.head()

In [None]:
df_hour=df.resample('1H').mean()
df_30min =df.resample('30min').mean()

In [None]:
df_hour.head()

In [None]:
df_30min.head()

In [None]:
# Flag records that fall well below or well above the mean consumption of their hour
# Tryout thresholds; the high-consumption rule is assumed to mirror the low-consumption one
df_hour['low_consum']=(df_hour.appliances+25<(df_hour.hour_avg))*1
df_hour['High_consum']=(df_hour.appliances-25>(df_hour.hour_avg))*1

df_30min['low_consum']=(df_30min.appliances+25<(df_30min.hour_avg))*1
df_30min['High_consum']=(df_30min.appliances-35>(df_30min.hour_avg))*1

In [None]:
def daily(x,df=df):
    return df.groupby('weekday')[x].mean()
def hourly(x,df=df):
    return df.groupby('hour')[x].mean()
def monthly_daily(x,df=df):
    by_day =df.pivot_table(index='weekday',columns=['month'],values=x,aggfunc='mean')
    return round(by_day,ndigits=2)

# <font color = REd> Daily consumption

In [None]:
# Plot the mean consumption for each hour of the day
hourly('appliances').plot(figsize=(10,9))
plt.xlabel('Hour')
plt.ylabel('Appliances consumption in Wh')
ticks = list(range(0,24,1))
plt.title('Mean energy consumption per hour of the day')
plt.xticks(ticks);

# <font color = Red>weekly consumption

In [None]:
# Plot the mean consumption for each day of the week

daily('appliances').plot(kind='bar',color=['pink','red','green','blue','cyan','yellow','orange'],figsize=(10,7))
ticks = list(range(0,7,1))
labels="Mon Tues Weds Thurs Fri Sat Sun".split()
plt.xlabel('Day')
plt.ylabel('Appliances consumption in Wh')
plt.title('Mean energy consumption per day of the week')
plt.xticks(ticks,labels);

# <font color = Red> Monthly Energy consumption

In [None]:
#monthly consumption
sns.set(rc={'figure.figsize':(10,8)})
ax = sns.heatmap(monthly_daily('appliances').T,cmap="PiYG",
                 xticklabels="Mon Tues Weds Thurs Fri Sat Sun".split(),
                 yticklabels="Jan Feb Mar Apr May".split(),
                 annot=True,fmt='g',
                 cbar_kws={'label':'consumption in Wh'})
ax.set_title("Mean appliances consumption (Wh) per weekday/month",fontsize=20)
plt.show()

In [None]:
f,axes = plt.subplots(1,2,figsize=(10,5))

sns.distplot(df_hour.appliances,hist=True,color='red',hist_kws={'edgecolor':'black'},ax=axes[0])
axes[0].set_title("Appliance's Consumption")
axes[0].set_xlabel("Appliances (Wh)")

sns.distplot(df_hour.log_appliances,hist=True,color='green',hist_kws={'edgecolor':"black"},ax=axes[1])
axes[1].set_title("Log Appliance's consumption")
axes[1].set_xlabel('Appliances Log(wH)')

In [None]:
col = ['log_appliances','lights','t1','rh_1','t2','rh_2','t3',
       'rh_3','t4','rh_4','t5','rh_5','t6','rh_6','t7','rh_7',
       't8','rh_8','t9','rh_9','t_out','press_mm_hg','rh_out',
       'windspeed','visibility','tdewpoint','hour']
corr =df[col].corr()
plt.figure(figsize=(15,15))
sns.set(font_scale=1)
sns.heatmap(corr,annot=True,cmap='RdYlGn',fmt='.2f',xticklabels=col,yticklabels=col)
plt.show()
       

In [None]:
col = ['t6','t2','rh_2','lights','hour','t_out','windspeed','tdewpoint']
sns.set(style='ticks',color_codes=True)
sns.pairplot(df[col])
plt.show()

# <font color =Blue> Training the Model

In [None]:
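# One-hot encode the categorical time features; the prefix keeps weekday and hour
# dummy columns from sharing the same names (both contain the values 0-6)
for cat_feature in ['weekday','hour']:
    df_hour =pd.concat([df_hour,pd.get_dummies(df_hour[cat_feature],prefix=cat_feature)],axis=1)
    df_30min =pd.concat([df_30min,pd.get_dummies(df_30min[cat_feature],prefix=cat_feature)],axis=1)
    df=pd.concat([df,pd.get_dummies(df[cat_feature],prefix=cat_feature)],axis=1)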


In [None]:
lin_model = ['low_consum','High_consum','hour','t6','rh_6','lights','hour*lights','windspeed','t6rh6']

In [None]:
df_hour.lights =df_hour.lights.astype(float)
df_hour.log_appliances =df_hour.log_appliances.astype(float)
df_hour.hour =df_hour.hour.astype(float)
df_hour.low_consum =df_hour.low_consum.astype(float)
df_hour.High_consum =df_hour.High_consum.astype(float)
df_hour.t6rh6 =df_hour.t6rh6.astype(float)

In [None]:
test_size=0.2
test_index = int(len(df_hour.dropna())*(1-test_size))
X1_train,X1_test= df_hour[lin_model].iloc[:test_index,],df_hour[lin_model].iloc[test_index:,]
y1_train,y1_test = df_hour.log_appliances.iloc[:test_index,],df_hour.log_appliances.iloc[test_index:,]

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X1_train)
X1_train = scaler.transform(X1_train)
X1_test = scaler.transform(X1_test)

In [None]:
from sklearn import linear_model

lin_model = linear_model.LinearRegression()
lin_model.fit(X1_train,y1_train)

# <font color = Blue> Model evaluation and selection

In [None]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn import metrics


In [None]:
def evaluate(model,test_features,test_labels):
    # Report MAPE, mean absolute error, R^2 and (100 - MAPE) on held-out data
    predictions = model.predict(test_features)
    errors = abs(predictions-test_labels)
    mape = 100* np.mean(errors/test_labels)
    r_score = 100*r2_score(test_labels, predictions)
    accuracy = 100-mape
    print(model,'\n')
    print('MAPE               :{:0.2f}%'.format(mape))
    # The mean absolute error is in the units of the target, not a percentage
    print('Average Error      :{:0.4f}'.format(np.mean(errors)))
    print('variance score R^2 :{:0.2f}%'.format(r_score))
    print('Accuracy           :{:0.2f}%'.format(accuracy))

In [None]:
evaluate(lin_model,X1_test,y1_test)
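
Because the model is fitted on `log_appliances`, the MAPE printed above is measured on the log scale. The sketch below, using three made-up values, shows how the same predictions give a different MAPE once they are transformed back to Wh with `np.exp`; the numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical true and predicted consumption, stored on the log scale.
y_true_log = np.log(np.array([50.0, 100.0, 200.0]))
y_pred_log = np.log(np.array([55.0, 90.0, 200.0]))

# MAPE computed directly on the log-transformed target.
mape_log = 100 * np.mean(np.abs(y_pred_log - y_true_log) / np.abs(y_true_log))

# MAPE after transforming both series back to Wh.
y_true = np.exp(y_true_log)
y_pred = np.exp(y_pred_log)
mape_wh = 100 * np.mean(np.abs(y_pred - y_true) / y_true)

print(round(mape_log, 2), round(mape_wh, 2))
```

The log-scale figure is smaller here, so an "accuracy" of 100 minus MAPE on the log scale should not be read as a percentage error in Wh.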

In [None]:
cv = TimeSeriesSplit(n_splits = 10)

print('Linear Model')
# neg_mean_absolute_error returns the negated MAE, so flip the sign; the target is log(Wh)
scores = cross_val_score(lin_model,X1_train,y1_train,cv=cv,scoring ='neg_mean_absolute_error')
print("MAE : %0.2f (+/- %0.2f) log(Wh)" % (-scores.mean(),scores.std()*2))
scores = cross_val_score(lin_model,X1_train,y1_train,cv=cv,scoring ='r2')
print("R^2 : %0.2f (+/- %0.2f)" % (scores.mean(),scores.std()*2))
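
`TimeSeriesSplit` builds folds in which every test block comes after its training block, which is why it suits time-ordered data like this. The sketch below prints the folds for 12 made-up, time-ordered samples and 3 splits; the sample count and number of splits are chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)  # 12 hypothetical, time-ordered observations

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(samples):
    # The training window grows; the test window always lies after it.
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Each fold trains on a longer history than the previous one, so the earliest folds are fitted on little data and their scores tend to vary more.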