<a href="https://www.kaggle.com/kamaljp/modeling-site-eui-wids?scriptVersionId=86680462" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Purpose of the Notebook:

Purely Data Science.

Data can be used to generate useful insights, but that is only the first part. The important part is ensuring the insights are statistically valid, and the best way to do that is to test them on real data.

We will check multiple ideas in this notebook and see how the results change with each idea. Every model is evaluated with the same metric, the root mean squared error (RMSE). [You will learn](#FinCon) that the insights and models that are finally presented come only after a lot of analysis.
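
Since every comparison in this notebook rests on the root mean squared error, here is a minimal sketch of the metric on made-up numbers. The helper name `rmse` and the toy values are assumptions for illustration; the notebook itself computes the same quantity with `np.sqrt(mean_squared_error(...))`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the WiDS data
toy_true = [100.0, 80.0, 60.0]
toy_pred = [90.0, 85.0, 70.0]

# Errors are 10, -5 and -10, so the mean squared error is 75
toy_rmse = rmse(toy_true, toy_pred)
print(toy_rmse)
```

The RMSE is in the same unit as the target, so a value of 56 means the predictions are off by roughly 56 units of `site_eui` on a typical row, with large misses weighted more heavily.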

### What to Expect


[Lasso](#lassoset) and [Decision Tree](#modDT) grid searches are explored, followed by [Neural Network modeling](#Nnet). The notebook takes a brute-force approach to extracting insight from the WiDS data.

After all this, the root mean square error is still above [**56**](#NNres). 


### Sneak Peek

The [Lasso grid search](#lassores) and the [Decision Tree grid search](#DTres) were interesting exercises, but neither has improved the result so far. A Neural Network was then fitted, which [gave an even worse result](#NNres).

PS: Use the blue links to jump to the exact location of the code and the related activity.

In [None]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import os
import numpy as np 
import pandas as pd 
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode
from plotly.subplots import make_subplots
import plotly.graph_objects as go
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 5000)
import warnings
warnings.filterwarnings("ignore")
#os.mkdir('/kaggle/working/individual_charts/')
import matplotlib.pyplot as plt
#Will come in handy to wrap the lengthy texts
import textwrap
#useful libraries and functions
from itertools import repeat
#Context manager for temporarily changing the pandas display options
from pandas import option_context

def long_sentences_seperate(sentence, width=30):
    try:
        splittext = textwrap.wrap(sentence,width)
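        text = '<br>'.join(splittext)#the wrapped lines are joined with HTML line breaks
        return text
    except (AttributeError, TypeError):
        #Non-string values (for example NaN) cannot be wrapped, so they are returned unchanged
        return sentence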

def load_csv(base_dir,file_name):
    """Loads a CSV file into a Pandas DataFrame"""
    file_path = os.path.join(base_dir,file_name)
    df = pd.read_csv(file_path,low_memory=False)
    return df    

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


## <a id='contents'>  Contents </a>

### [Missing Value assumptions](#misval)

   #### [Result Visualisation](#Aresvis)
    
   #### [Final prediction](#Afinpre)

### [Filling the Median Values in Energy Rating](#medianval)
    
   #### [Result Visualisation](#Bresvis)
    
   #### [Final prediction](#Bfinpre)
   
### [Lasso Grid Search results](#lassores)
### [Decision tree Grid Search results](#DTres)
### [Neural Networks results](#NNres)
### [XGB GridSearching](#xgb)
   #### [XGB Results](#xgbres)
### [Lessons of the Exercise ](#FinCon)


In [None]:
base_dir = '../input/widsdatathon2022/'
file_name = 'train.csv'
dataset_main = load_csv(base_dir,file_name)

In [None]:
#Single Factor Bar Graphs
def uni_factor(factor):
    grp_factor = dataset_main.groupby(factor)['floor_area'].count().reset_index()
    grp_factor.sort_values(by='floor_area',inplace=True,ascending = False)
    grp_factor[factor] = grp_factor[factor].astype('category')
    vis = px.bar(data_frame=grp_factor,y = factor,x ='floor_area',color= factor)
    vis.update_layout(yaxis = {'categoryorder' : 'total ascending'},
                      title = 'Number of Buildings based on ' + factor,
                     height = 800)
    vis.show()

#Single Factor Histogram Graphs    
def uni_hist_plot(independent,dependent):
    vis = px.histogram(data_frame=dataset_main,x=dependent,color=independent)
    vis.update_layout(title='Distribution of '+ dependent + ' based on '+ independent)
    vis.show()

#Single Factor Box Plot Graphs    
def uni_box_plot(independent,dependent):
    vis = px.box(data_frame=dataset_main,x=dependent,color=independent)
    vis.update_layout(title='Box plot of '+ dependent + ' based on '+ independent)
    vis.show()

#Two Factor Scatter Plot Graphs 
def two_factor(factor1, factor2,independent):
    vis = px.scatter(data_frame=dataset_main,y = factor1,x =factor2,color=independent,
                     facet_col=independent,facet_col_wrap=3)
    vis.update_layout(title = 'Relation between ' + factor1 + ' and ' +factor2 + ' in ' + independent + ' condition',
                     height = 1000)
    vis.show()

#Single factor target average Bar Graph    
def avg_on_factor(factor,target):
    grp_factor = dataset_main.groupby(factor)[target].mean().reset_index()
    grp_factor.sort_values(by=target,inplace=True,ascending = False)
    grp_factor[factor] = grp_factor[factor].astype('category')
    vis = px.bar(data_frame=grp_factor,y = factor,x =target,color= factor)
    vis.update_layout(yaxis = {'categoryorder' : 'total ascending'},
                      title = 'Average of '+target +' on basis of ' + factor,
                     height = 800)
    vis.show()
    

def corr_heat_map(df,title):

    #Building the dataset with column that are numerical
    df = df[df.columns[df.dtypes != 'object']]
    df_corr_mat = df.corr() #building the correlation matrix
    #Building the lower triangle of the correlation matrix
    #The builtin bool is used because the np.bool alias was removed from recent NumPy releases
    df_corr_mat_lt = df_corr_mat.where(np.tril(np.ones(df_corr_mat.shape)).astype(bool))
    vis = px.imshow(df_corr_mat_lt,text_auto=True, aspect="auto",
                    height=1000,color_continuous_scale='spectral',width=900)
    vis.update_layout(title=title)
    vis.show()

In [None]:
#Collecting garbage memory and deleting unwanted Dataframes, that have served their purpose earlier
import gc
gc.collect()

#### [Back to Contents](#contents)

#### <a id='misval'> Missing Value assumptions </a>

There are many ways to handle missing values; this analysis tries two of them.

    a) Drop the rows with missing values entirely
    
    b) Fill the missing values with the median value
    
Both options have to be checked against the train and test datasets before making a decision. Rows cannot be dropped from the test set, so some form of imputation is necessary there.

[1) Data prep](#Adata)

[2) Model Set](#Amodset)

[3) Result Visualisation](#Aresvis)

[4) Final prediction](#Afinpre)
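
The two options above can be sketched on a toy frame. This is only an illustration under stated assumptions: the frames `toy_train` and `toy_test` and their values are made up, and reusing the training median for the test rows is one possible design choice, not necessarily what the cells below do.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in energy_star_rating
toy_train = pd.DataFrame({
    "energy_star_rating": [50.0, np.nan, 70.0, 90.0, np.nan],
    "site_eui": [80.0, 120.0, 60.0, 40.0, 100.0],
})
toy_test = pd.DataFrame({"energy_star_rating": [np.nan, 65.0]})

# a) Drop every training row whose rating is missing
toy_dropped = toy_train.dropna(subset=["energy_star_rating"])

# b) Fill the gaps with the median of the ratings that are present
train_median = toy_train["energy_star_rating"].median()
toy_filled = toy_train.fillna({"energy_star_rating": train_median})

# Test rows cannot be dropped, so they are imputed (here with the training median)
toy_test_filled = toy_test.fillna({"energy_star_rating": train_median})

print(len(toy_train), len(toy_dropped), train_median)
```

Option a) keeps only observed ratings but shrinks the training set, while option b) keeps every row at the cost of assigning the same rating to all the buildings that had none.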

##### <a id='Adata'> Data prep</a>

In [None]:
#Let us start checking the missing value columns

dataset_withMissing = dataset_main[['direction_max_wind_speed','direction_peak_wind_speed',
                                    'max_wind_speed','days_with_fog','energy_star_rating']]

#Need to find an effective way to deal with all the columns, so lets try describe
dataset_withMissing.describe()

#direction_max_wind_speed,  direction_peak_wind_speed,  max_wind_speed all can be safely filled with 1

#days with fog is best to be filled with median values.

In [None]:
del dataset_withMissing

#The wind variables below are filled with a constant 1, assuming the sensors malfunctioned. This is a constant fill, not a forward fill

dataset_main.direction_max_wind_speed = dataset_main.direction_max_wind_speed.fillna(1)
dataset_main.direction_peak_wind_speed = dataset_main.direction_peak_wind_speed.fillna(1)
dataset_main.max_wind_speed = dataset_main.max_wind_speed.fillna(1)
dataset_main.days_with_fog = dataset_main.days_with_fog.fillna(dataset_main.days_with_fog.median())

#That leaves the Energy star rating. We need to check the test set provided for deciding

In [None]:
#loading test_set
test_main = load_csv(base_dir='../input/widsdatathon2022',file_name='test.csv')

#Need to find an effective way to deal with all the columns, so lets try describe
test_main.describe()

test_main.direction_max_wind_speed = test_main.direction_max_wind_speed.fillna(1)
test_main.direction_peak_wind_speed = test_main.direction_peak_wind_speed.fillna(1)
test_main.max_wind_speed = test_main.max_wind_speed.fillna(1)
#direction_max_wind_speed,  direction_peak_wind_speed,  max_wind_speed all can be safely filled with 1
test_main.days_with_fog = test_main.days_with_fog.fillna(0)

In [None]:
print('By removing the row with energy_star_rating null we lose {}% data'.format(100-(49048/75757)*100))
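#The target column is selected by name instead of by position, and only the two numeric columns are correlated
print('Site EUI, the target variable, has a {}% correlation with energy_star_rating'.format(dataset_main['site_eui'].corr(dataset_main['energy_star_rating'])*100))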

#### The correlation of energy_star_rating is considerable. So we try the predictive power of the model with the energy_star_rating null rows removed

In [None]:
#We take only the rows that have energy star rating available for the modeling.
#Year_Factor is dropped since the test set has only one year factor, so it will not be useful.
#drop returns a new frame here, which avoids modifying a slice of dataset_main in place
dataset_A = dataset_main[~dataset_main.energy_star_rating.isna()].drop('Year_Factor',axis=1)

In [None]:
# Segregating the columns with categorical value
data_Acat = dataset_A[dataset_A.columns[dataset_A.dtypes == 'object']]

data_Anum = dataset_A[dataset_A.columns[dataset_A.dtypes != 'object']]

#A simple and straight_forward one-hot encoding using Pandas' Get_Dummies
data_Acat = pd.get_dummies(data_Acat,columns=data_Acat.columns)

data_Amodel = pd.merge(left=data_Acat,right=data_Anum,left_index=True,right_index=True)
print(data_Amodel.shape)


In [None]:
#https://stackoverflow.com/a/46581125/16388185
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

In [None]:
#Creating the X the Independent variables and Y the Target or Dependent variables
data_Amodel = clean_dataset(data_Amodel)
#We lost another 1000 entries to the big number and infinity issues
X = data_Amodel.iloc[:,:-2]
Y = data_Amodel.site_eui

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor

# Error Metrics
from sklearn.metrics import mean_squared_error

In [None]:
# split out validation dataset for the end

validation_size = 0.2

#In case the data is not dependent on the time series, then train and test split randomly
seed = 7
# X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

#In case the data is dependent on the time series, then train and test split should be done based on sequential sample
#This can be done by selecting an arbitrary split point in the ordered list of observations and creating two new datasets.

train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]

In [None]:
# test options for regression
num_folds = 10
scoring = 'neg_mean_squared_error'
#scoring ='neg_mean_absolute_error'
#scoring = 'r2'

#### [Back to Contents](#contents)

##### <a id='Amodset'> Model Set </a>

In [None]:
# spot check the traditional Machine Learning algorithms
models = []
#models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
#models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
#models.append(('SVR', SVR()))

#Ensemble Models
# Boosting methods
#models.append(('ABR', AdaBoostRegressor()))
#models.append(('GBR', GradientBoostingRegressor()))
# Bagging methods
#models.append(('RFR', RandomForestRegressor()))
#models.append(('ETR', ExtraTreesRegressor()))

In [None]:
names = []
kfold_results = []
test_results = []
train_results = []
for name, model in models:
    names.append(name)
    
    ## K Fold analysis:
    
    #random_state is omitted because KFold only accepts it together with shuffle=True
    kfold = KFold(n_splits=num_folds)
    #converted mean square error to positive. The lower the better
    cv_results = -1* cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    kfold_results.append(cv_results)
    

    # Full Training period
    print('{} model fit Started'.format(model))
    res = model.fit(X_train, Y_train)
    print('{} model fit Completed'.format(model))
    #The error function is root of mean_squared_error
    train_result = np.sqrt(mean_squared_error(res.predict(X_train), Y_train))
    train_results.append(train_result)
    
    # Test results
    #The error function is root of mean_squared_error
    test_result = np.sqrt(mean_squared_error(res.predict(X_test), Y_test))
    test_results.append(test_result)
    
    msg = "%s: %f (%f) %f %f" % (name, cv_results.mean(), cv_results.std(), train_result, test_result)
    print(msg)

#### [Back to Contents](#contents)

##### <a id='Aresvis'> Result Visualisation </a>

In [None]:
kfold = pd.DataFrame(kfold_results,columns=range(1,11)).T
kfold.columns = ['LASSO','EN','CART']

In [None]:
visA = go.Figure()
for kf in kfold.columns:
    visA.add_trace(go.Box(x=kfold[kf],name=kf))
visA.update_xaxes(type='log')
visA.update_layout(title='Kfold Error Results')
visA.show()

In [None]:
# compare algorithms
fig = plt.figure()

ind = np.arange(len(names))  # the x locations for the groups
width = 0.35  # the width of the bars

fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.bar(ind - width/2, train_results,  width=width, label='Train Error')
plt.bar(ind + width/2, test_results, width=width, label='Test Error')
fig.set_size_inches(15,8)
plt.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)
plt.show()

#### [Back to Contents](#contents)

##### <a id='Afinpre'> Final Observation </a>

Removing the rows with null energy_star_rating has not significantly improved the results. Time to try the next idea.

The Random Forest regressors all overfit very badly.

The Decision Tree regressors have the lowest training error, which points to complete overfitting. If the number of leaves is pruned, the validation accuracy may improve. A grid search might help here.

In the same way, a grid search over the Lasso regressor might also lead to better validation accuracy. Before that, we will run another analysis with the energy star rating nulls filled with the median value.
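
The overfitting argument can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the WiDS data: the helper `fit_and_score`, the generated features, and the leaf limit of 20 are all made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: one informative feature plus noise
rng = np.random.default_rng(7)
X_toy = rng.uniform(0, 10, size=(400, 3))
y_toy = 5 * X_toy[:, 0] + rng.normal(0, 5, size=400)
X_tr, X_te = X_toy[:320], X_toy[320:]
y_tr, y_te = y_toy[:320], y_toy[320:]

def fit_and_score(**tree_kwargs):
    """Fit a tree and return (number of leaves, train RMSE, test RMSE)."""
    tree = DecisionTreeRegressor(random_state=7, **tree_kwargs).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    return tree.get_n_leaves(), train_rmse, test_rmse

# An unrestricted tree keeps splitting until it memorises the training rows
full_leaves, full_train, full_test = fit_and_score()
# Capping the number of leaves forces the tree to average over neighbours
pruned_leaves, pruned_train, pruned_test = fit_and_score(max_leaf_nodes=20)

print("full  :", full_leaves, full_train, full_test)
print("pruned:", pruned_leaves, pruned_train, pruned_test)
```

A training error near zero next to a much larger test error is the signature of overfitting described above; whether a leaf limit actually helps on the real data is exactly what a grid search has to establish.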

#### [Back to Contents](#contents)

#### <a id='medianval'> Updating the Energy Star nulls with Median values </a>

The data is prepared exactly as before, except that the missing energy star ratings are filled with the median value.

[1) Data prep](#Bdata)

[2) Model Set](#Bmodset)

[3) Result Visualisation](#Bresvis)

[4) Final Observations](#Bfinpre)

##### <a id='Bdata'> Data prep</a>

In [None]:
dataset_main.direction_max_wind_speed = dataset_main.direction_max_wind_speed.fillna(1)
dataset_main.direction_peak_wind_speed = dataset_main.direction_peak_wind_speed.fillna(1)
dataset_main.max_wind_speed = dataset_main.max_wind_speed.fillna(1)
dataset_main.days_with_fog = dataset_main.days_with_fog.fillna(dataset_main.days_with_fog.median())
dataset_main.loc[dataset_main.energy_star_rating.isna(),'energy_star_rating'] = dataset_main.loc[~dataset_main.energy_star_rating.isna(),'energy_star_rating'].median()

#The energy star rating nulls are now filled with the median of the available ratings

In [None]:
# Segregating the columns with categorical value
data_Bcat = dataset_main[dataset_main.columns[dataset_main.dtypes == 'object']]

#drop returns a new frame here, which avoids modifying a slice of dataset_main in place
data_Bnum = dataset_main[dataset_main.columns[dataset_main.dtypes != 'object']].drop('Year_Factor',axis=1)
#A simple and straight_forward one-hot encoding using Pandas' Get_Dummies
data_Bcat = pd.get_dummies(data_Bcat,columns=data_Bcat.columns)

data_Bmodel = pd.merge(left=data_Bcat,right=data_Bnum,left_index=True,right_index=True)
print(data_Bmodel.shape)


In [None]:
#Creating the X the Independent variables and Y the Target or Dependent variables
data_Bmodel = clean_dataset(data_Bmodel)
#We lost another 2500 entries to the big number and infinity issues
X = data_Bmodel.iloc[:,:-2]
Y = data_Bmodel.site_eui

In [None]:
# split out validation dataset for the end

validation_size = 0.2

#In case the data is not dependent on the time series, then train and test split randomly
seed = 7
# X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

#In case the data is dependent on the time series, then train and test split should be done based on sequential sample
#This can be done by selecting an arbitrary split point in the ordered list of observations and creating two new datasets.

train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]

In [None]:
# test options for regression
num_folds = 10
scoring = 'neg_mean_squared_error'
#scoring ='neg_mean_absolute_error'
#scoring = 'r2'

#### [Back to Contents](#contents)

##### <a id='Bmodset'> Model Set </a>

In [None]:
# spot check the traditional Machine Learning algorithms
models = []
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('CART', DecisionTreeRegressor()))

In [None]:
names = []
kfold_results = []
test_results = []
train_results = []
for name, model in models:
    names.append(name)
    
    ## K Fold analysis:
    
    #random_state is omitted because KFold only accepts it together with shuffle=True
    kfold = KFold(n_splits=num_folds)
    #converted mean square error to positive. The lower the better
    cv_results = -1* cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    kfold_results.append(cv_results)
    

    # Full Training period
    print('{} model fit Started'.format(model))
    res = model.fit(X_train, Y_train)
    print('{} model fit Completed'.format(model))
    #The error function is root of mean_squared_error
    train_result = np.sqrt(mean_squared_error(res.predict(X_train), Y_train))
    train_results.append(train_result)
    
    # Test results
    #The error function is root of mean_squared_error
    test_result = np.sqrt(mean_squared_error(res.predict(X_test), Y_test))
    test_results.append(test_result)
    
    msg = "%s: %f (%f) %f %f" % (name, cv_results.mean(), cv_results.std(), train_result, test_result)
    print(msg)

#### [Back to Contents](#contents)

##### <a id='Bresvis'> Result Visualisation </a>

In [None]:
kfold = pd.DataFrame(kfold_results,columns=range(1,11)).T
kfold.columns = ['LASSO','EN','CART']

In [None]:
visB = go.Figure()
for kf in kfold.columns:
    visB.add_trace(go.Box(x=kfold[kf],name=kf))
visB.update_xaxes(type='log')
visB.update_layout(title='Kfold Error Results')
visB.show()

In [None]:
# compare algorithms
fig = plt.figure()

ind = np.arange(len(names))  # the x locations for the groups
width = 0.35  # the width of the bars

fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.bar(ind - width/2, train_results,  width=width, label='Train Error')
plt.bar(ind + width/2, test_results, width=width, label='Test Error')
fig.set_size_inches(15,8)
plt.legend()
ax.set_xticks(ind)
ax.set_xticklabels(names)
plt.show()

#### [Back to Contents](#contents)

##### <a id='Bfinpre'> Final Observation </a>

Filling the null energy star ratings with the median value has not improved the results. We resort to the next idea: grid searching the Decision Tree and Lasso regressors.

#### <a id='cvlasso'> Searching the Parameters of Lasso model </a>

The dataset used here has its missing values filled as per the assumptions above.

[2) Grid searching Lasso Space](#lassoset)

[3) Grid Searching Result Visualisation](#GresSearch)

[4) Final Observations](#Lfinpre)
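
As a side note for readers on a recent scikit-learn: the `normalize` argument of `Lasso` used in the cells of this section was removed in scikit-learn 1.2. A minimal sketch of an alternative, assuming a `StandardScaler` step inside a `Pipeline`, is shown below on synthetic data. Standard scaling is not numerically identical to the old `normalize=True`, so the best `alpha` values will differ; all names prefixed with `toy_` are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1, 1])
y_toy = 3 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(0, 1, size=200)

# Scaling lives inside the pipeline, so it is refitted on each CV training fold
lasso_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Step parameters are addressed as <step name>__<parameter>
toy_grid = {"lasso__alpha": np.logspace(-4, 0, 10)}
toy_search = GridSearchCV(lasso_pipe, toy_grid, cv=5,
                          scoring="neg_root_mean_squared_error")
toy_search.fit(X_toy, y_toy)

best_alpha = toy_search.best_params_["lasso__alpha"]
best_rmse = -toy_search.best_score_  # the scorer is negated, so flip the sign
print(best_alpha, best_rmse)
```

Scoring with `neg_root_mean_squared_error` makes the search optimise the same metric that the rest of the notebook reports; without a `scoring` argument, `GridSearchCV` falls back to the estimator's R-squared.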

#### [Back to Contents](#contents)

##### <a id='lassoset'> Grid searching Lasso Space </a>

In [None]:
alpha_space = np.logspace(-4, 0, 30)   # Checking for alpha from .0001 to 1 and finding the best value for alpha

In [None]:
lasso_scores = []
lasso = Lasso(normalize = True)
for alpha in alpha_space:
    lasso.alpha = alpha
    val = np.mean(cross_val_score(lasso, X_train, Y_train, cv = 10))
    lasso_scores.append(val)

##### <a id='GresSearch'> Result of Grid Searching </a>

In [None]:
plt.figure(figsize=(8, 8))
plt.plot(alpha_space, lasso_scores, marker = 'D', label = "Lasso")
plt.legend()
plt.show()

In [None]:
# Performing GridSearchCV with Cross Validation technique on Lasso Regression and finding the optimum value of alpha

from sklearn.model_selection import GridSearchCV
np.set_printoptions(suppress=True)

params = {'alpha': (np.logspace(-8, 8, 100))} # It will check from 1e-08 to 1e+08
lasso = Lasso(normalize=True)
lasso_model = GridSearchCV(lasso, params, cv = 10)
lasso_model.fit(X_train, Y_train)
print(lasso_model.best_params_)
print(lasso_model.best_score_)

In [None]:
Training_Accuracy_Before = []
Testing_Accuracy_Before =[]

# Using the alpha value picked from the grid search above (about 1.35e-07) for Lasso Regression
lasso = Lasso(alpha = 1.3530477745798075e-07, normalize = True)
lasso.fit(X_train, Y_train)

train_score = lasso.score(X_train, Y_train)
print(train_score)
test_score = lasso.score(X_test, Y_test)
print(test_score)

test_result = np.sqrt(mean_squared_error(lasso.predict(X_test), Y_test))
print(test_result)

#### [Back to Contents](#contents)

#### <a id='ModUnd'> Model understanding </a>

In [None]:
coefficients = lasso.coef_
#The factors that have the highest influence, selected by the size of their coefficients
col1 = X_train.columns[coefficients > 30]
col2 = X_train.columns[coefficients < -40]

In [None]:
major_factor = col1.append(col2)

#Creating data using the parameters that have major impact.
X_lasso = X_train[major_factor]
X_lasso_test = X_test[major_factor]

#Grid Searching with updated data frame

params = {'alpha': (np.logspace(-8, 8, 100))} # It will check from 1e-08 to 1e+08
lasso = Lasso(normalize=True)
lasso_model = GridSearchCV(lasso, params, cv = 10)
lasso_model.fit(X_lasso, Y_train)
print(lasso_model.best_params_)
print(lasso_model.best_score_)

##### <a id='lassores'> Result of Lasso Grid Searching </a>

This result is obtained after keeping only the most influential parameters. Even so, the results have not improved significantly.

In [None]:
lasso = Lasso(alpha = 1e-08, normalize = True)
lasso.fit(X_lasso, Y_train)

test_result = np.sqrt(mean_squared_error(lasso.predict(X_lasso_test), Y_test))
print(test_result)

#### [Back to Contents](#contents)

1) The Lasso grid search, and the further work on optimising the model parameters, has not improved the validation result.

2) Neural networks will be tried to see whether they improve the validation result.

#### [Back to Contents](#contents)

#### <a id='modDT'> Setting up the Decision Tree Search </a>


We have an opportunity to perform an ensemble-style analysis here. Using the Lasso coefficients, we can select the parameters that have the highest impact, and use those for grid searching the Decision Tree models.
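
The idea of letting Lasso pick the features for a tree can also be written as a single pipeline. The sketch below swaps in scikit-learn's `SelectFromModel` for the manual coefficient thresholds used in this notebook, and runs on synthetic data; the `alpha`, the threshold, and every `toy` name are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only columns 0 and 2 drive the target
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(300, 6))
y_toy = 4 * X_toy[:, 0] - 3 * X_toy[:, 2] + rng.normal(0, 0.5, size=300)

select_then_tree = Pipeline([
    ("scale", StandardScaler()),
    # Keep the features whose Lasso coefficient is not shrunk to (almost) zero
    ("select", SelectFromModel(Lasso(alpha=0.5), threshold=1e-5)),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=7)),
])
select_then_tree.fit(X_toy, y_toy)

# Column positions that survived the Lasso selection step
kept_columns = np.flatnonzero(select_then_tree.named_steps["select"].get_support())
print(kept_columns)
```

Because the selection step sits inside the pipeline, a grid search over the tree parameters would redo the feature selection on each training fold instead of reusing a selection made once on the full training set.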

In [None]:
#Using the same dataset that is created using the major factors from Lasso Coefficients
X_DT = X_lasso
X_DT_test = X_lasso_test

In [None]:
dtm = DecisionTreeRegressor(max_depth=4,
                           min_samples_split=5,
                           max_leaf_nodes=10)

dtm.fit(X_DT,Y_train)
print("R-Squared on train dataset={}".format(dtm.score(X_DT,Y_train)))

#The test score uses the model fitted on the training data; the tree is not refitted on the test set
print("R-Squared on test dataset={}".format(dtm.score(X_DT_test,Y_test)))

##### <a id='setDT'> Grid searching Decision Tree </a>


In [None]:
param_grid = {"criterion": ["squared_error"], #named "mse" in scikit-learn releases before 1.0
              "min_samples_split": [10, 20, 40],
              "max_depth": [2,6,8], #[2, 6, 8],
              "min_samples_leaf": [20,40,100], #[20, 40, 100],
              "max_leaf_nodes": [5,20,100], #[5, 20, 100],
              }

## Comment in order to publish in kaggle.

grid_cv_dtm = GridSearchCV(dtm, param_grid, cv=5)

grid_cv_dtm.fit(X_DT,Y_train)

In [None]:
print("R-Squared::{}".format(grid_cv_dtm.best_score_))
print("Best Hyperparameters::\n{}".format(grid_cv_dtm.best_params_))

##### <a id='finDT'> Finalising Decision Tree Model </a>

In [None]:
decision_tree_results = pd.DataFrame(data=grid_cv_dtm.cv_results_)

fig,ax = plt.subplots()
sns.pointplot(data=decision_tree_results[['mean_test_score',
                                          'param_max_leaf_nodes',
                                          'param_max_depth']],
              y='mean_test_score',x='param_max_depth',
              hue='param_max_leaf_nodes',ax=ax)
ax.set(title="Effect of Depth and Leaf Nodes on Model Performance")

##### <a id='DTres'> Result of Decision Tree Grid Searching </a>

In [None]:
final_dtm = DecisionTreeRegressor(max_depth=8,min_samples_split=20,
                                  max_leaf_nodes=100,min_samples_leaf= 20)

final_dtm.fit(X_train,Y_train)

test_result = np.sqrt(mean_squared_error(final_dtm.predict(X_test), Y_test))

print(test_result)

The Decision Tree regressor and the Lasso regressor have both been grid searched, and the final results are not a big improvement.

The next option is to throw Neural Networks at this dataset, for their ability to fit the data in a different way.
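
One point worth checking before fitting a network is feature scaling, since gradient-based training tends to struggle when columns differ by orders of magnitude. The sketch below is a hedged illustration on synthetic data, using scikit-learn's `MLPRegressor` rather than the Keras model built in the next cells; the layer sizes and all `toy` names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features whose ranges differ by orders of magnitude
rng = np.random.default_rng(7)
X_toy = rng.uniform(0, 1, size=(500, 3)) * np.array([1.0, 1000.0, 1e5])
y_toy = (2 * X_toy[:, 0] + 0.003 * X_toy[:, 1] + 2e-5 * X_toy[:, 2]
         + rng.normal(0, 0.1, size=500))

# The scaler is fitted on the training rows only and reused at prediction time
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), batch_size=32,
                 max_iter=2000, random_state=7),
)
scaled_mlp.fit(X_toy[:400], y_toy[:400])

# R-squared on the 100 held-out rows
toy_r2 = scaled_mlp.score(X_toy[400:], y_toy[400:])
print(toy_r2)
```

The same scaling step could be applied to `X_train` and `X_test` before they are passed to the Keras model; whether it changes the result on this dataset is something only a run can show.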

#### [Back to Contents](#contents)

##### <a id='Nnet'> Setting Neural Nets </a>

In [None]:
#Libraries for Deep Learning Models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import LSTM
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.neural_network import MLPRegressor

In [None]:
#Builds the Keras network: three hidden Dense layers followed by a single linear output
def create_model(neurons=127, activation='relu', learn_rate = 0.01, momentum=0):
        # create model
        model = Sequential()
        model.add(Dense(neurons, input_dim=X_train.shape[1], activation=activation))
        #The number of hidden layers can be increased
        model.add(Dense(64, activation=activation))
        model.add(Dense(32, activation=activation))
        # Final output layer
        model.add(Dense(1, kernel_initializer='normal'))
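        # Compile model
        #NOTE: this SGD optimizer is built but never passed to compile(), so learn_rate and momentum have no effect
        optimizer = SGD(learning_rate=learn_rate, momentum=momentum)
        #The model is trained with the default Adam optimizer
        model.compile(loss='mean_squared_error', optimizer='adam')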
        return model

In [None]:
#The data set for NN is same as the full data with Energy rating with median values

neural_model = KerasRegressor(build_fn=create_model, epochs=100, batch_size=100, verbose=1)

neural_model_fit = neural_model.fit(X_train, 
                                    Y_train, validation_data=(X_test,Y_test),
                                    epochs=100, batch_size=72, 
                                    verbose=0, shuffle=False)

In [None]:
#Visual plot to check if the error is reducing
plt.plot(neural_model_fit.history['loss'], label='train')
plt.plot(neural_model_fit.history['val_loss'], label='test')
plt.legend()
plt.show()

##### <a id='NNres'> Result of Neural Network Modeling </a>

In [None]:
error_Train = np.sqrt(mean_squared_error(Y_train, neural_model.predict(X_train)))
predicted = neural_model.predict(X_test)
error_Test = np.sqrt(mean_squared_error(Y_test,predicted))
error_Test

#### [Back to Contents](#contents)

##### <a id='xgb'> Setting XGB </a>

In [None]:
#Creating the X the Independent variables and Y the Target or Dependent variables
data_XGB = clean_dataset(data_Bmodel)
#We lost another 2500 entries to the big number and infinity issues
X = data_XGB.iloc[:,:-2]
Y = data_XGB.site_eui

In [None]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

xgb1 = XGBRegressor()
parameters = {'nthread':[4], #when use hyperthread, xgboost may become slower
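              'objective':['reg:squarederror'], #current name of the deprecated 'reg:linear' objective
              'learning_rate': [.03, 0.05, .07], #so called `eta` value
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'verbosity': [0], #replaces the deprecated 'silent' flag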
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [500]}

xgb_grid = GridSearchCV(xgb1,parameters,cv = 2,n_jobs = 5,verbose=True)

xgb_grid.fit(X,Y) #Give the full dataset for running the grid search

print(xgb_grid.best_score_)
print(xgb_grid.best_params_)

model =  xgb_grid.best_estimator_

In [None]:
#loading test_set
test_main = load_csv(base_dir='../input/widsdatathon2022',file_name='test.csv')

#Need to find an effective way to deal with all the columns, so lets try describe
test_main.describe()

test_main.direction_max_wind_speed = test_main.direction_max_wind_speed.fillna(1)
test_main.direction_peak_wind_speed = test_main.direction_peak_wind_speed.fillna(1)
test_main.max_wind_speed = test_main.max_wind_speed.fillna(1)
#direction_max_wind_speed,  direction_peak_wind_speed,  max_wind_speed all can be safely filled with 1
test_main.days_with_fog = test_main.days_with_fog.fillna(0)
test_main.loc[test_main.energy_star_rating.isna(),'energy_star_rating'] = np.median(test_main.loc[~test_main.energy_star_rating.isna(),'energy_star_rating'])
test_main.loc[test_main.year_built.isna(),'year_built'] = np.median(test_main.loc[~test_main.year_built.isna(),'year_built'])

# Segregating the columns with categorical value
test_cat = test_main[test_main.columns[test_main.dtypes == 'object']]

#drop returns a new frame here, which avoids modifying a slice of test_main in place
test_num = test_main[test_main.columns[test_main.dtypes != 'object']].drop('Year_Factor',axis=1)
#A simple and straight_forward one-hot encoding using Pandas' Get_Dummies
test_cat = pd.get_dummies(test_cat,columns=test_cat.columns)

test_model = pd.merge(left=test_cat,right=test_num,left_index=True,right_index=True)

##### <a id='xgbres'> Result of XGB Grid Searching </a>
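
One-hot encoding the train and test frames separately can leave them with different dummy columns, because a category may appear in only one of the two files. A minimal sketch of aligning the columns with `reindex` follows; the toy frames and their category values are made up for illustration.

```python
import pandas as pd

# Made-up frames: "Retail" exists only in train, "Warehouse" only in test
toy_train = pd.DataFrame({"facility_type": ["Office", "Retail", "Office"],
                          "floor_area": [100.0, 250.0, 80.0]})
toy_test = pd.DataFrame({"facility_type": ["Office", "Warehouse"],
                         "floor_area": [120.0, 300.0]})

train_encoded = pd.get_dummies(toy_train, columns=["facility_type"])
test_encoded = pd.get_dummies(toy_test, columns=["facility_type"])

# Force the test frame onto the training columns, in the same order:
# dummies missing from test are filled with 0, dummies unseen in train are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

A model fitted on the training matrix expects exactly these columns in this order, so the aligned frame is the one that can be passed to `predict`.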

In [None]:
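#The test frame is aligned to the training columns: 'id' is left out and dummies missing from the test set are filled with 0
y_predict_test = xgb_grid.best_estimator_.predict(test_model.reindex(columns=X.columns, fill_value=0))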

my_submission = pd.DataFrame({'id': test_main.id, 'site_eui': y_predict_test})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

#### [Back to Contents](#contents)

##### <a id='FinCon'> Lessons of the Exercise </a>

There are many ways to solve a single problem, each with a different level of accuracy. The notebook explores a breadth of models and reports their results.

1) Science is challenging, and Data Science is more so. The exercise shows the challenge most of us face.

2) Real-world features do not provide insights directly. They need to be engineered to learn more about the data. The next exercise will apply feature engineering using Kernel PCA (KPCA) and t-SNE.

3) Neural Networks also need to be set up with proper understanding in order to use their full power on the data that we throw at them.

4) Be patient, and master the art of automation using functions and pipelines.
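
As an illustration of the last lesson, the spot-check loop that this notebook repeats for each dataset could be wrapped in one function. The sketch below is one possible shape, run on synthetic data; the function name `spot_check`, its return format, and the toy data are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

def spot_check(models, X_train, y_train, X_test, y_test, num_folds=5):
    """Return cross-validated, train and test RMSE for each (name, model) pair."""
    report = {}
    kfold = KFold(n_splits=num_folds)
    for name, model in models:
        # The scorer returns negative MSE, so the sign is flipped
        cv_mse = -cross_val_score(model, X_train, y_train, cv=kfold,
                                  scoring="neg_mean_squared_error")
        model.fit(X_train, y_train)
        report[name] = {
            "cv_rmse": float(np.sqrt(cv_mse.mean())),
            "train_rmse": float(np.sqrt(mean_squared_error(y_train, model.predict(X_train)))),
            "test_rmse": float(np.sqrt(mean_squared_error(y_test, model.predict(X_test)))),
        }
    return report

# Synthetic data, split sequentially like the cells in this notebook
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(250, 4))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 0.5, size=250)

toy_models = [("LASSO", Lasso()), ("EN", ElasticNet()),
              ("CART", DecisionTreeRegressor(random_state=7))]
toy_report = spot_check(toy_models, X_toy[:200], y_toy[:200], X_toy[200:], y_toy[200:])
for name, scores in toy_report.items():
    print(name, scores)
```

With a helper like this, trying a new imputation idea reduces to preparing `X` and `Y` and calling the function once, instead of copying the loop into a new cell.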