# Normalization and Tuning Neural Networks - Lab

## Introduction

For this lab on initialization and optimization, let's look at a slightly different type of neural network. This time, we will not perform a classification task as we've done before (Santa vs. not Santa, bank complaint types); instead, we'll look at a linear regression problem.

We can use deep learning networks for regression just as well as for classification. Do note that getting regression to work with neural networks can be hard because the output is unbounded ($\hat y$ can technically range from $-\infty$ to $+\infty$), and the models are especially prone to exploding gradients. This issue makes a regression exercise the perfect learning case!

## Objectives
You will be able to:
* Build a neural network using Keras
* Normalize your data to assist algorithm convergence
* Implement and observe the impact of various initialization techniques

In [2]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras import initializers
from keras import layers
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from keras import optimizers
from sklearn.model_selection import train_test_split

## Loading the data

The data we'll be working with relates to Facebook posts published during 2014 on the Facebook page of a renowned cosmetics brand. It includes 7 features known prior to post publication, and 12 features for evaluating the post impact. What we want to do is build a predictor for the number of "likes" for a post, taking into account the 7 features known prior to posting.

First, let's import the data set and delete any rows with missing data. Afterwards, briefly preview the data.

In [4]:
#Your code here; load the dataset and drop rows with missing values. Then preview the data.
from pandas_profiling import ProfileReport
df = pd.read_csv('dataset_Facebook.csv',sep=';', header=0)
df.dropna(inplace=True)
report = ProfileReport(df)
report

Dataset info
Number of variables,20
Number of observations,495
Total Missing (%),0.0%
Total size in memory,77.4 KiB
Average record size in memory,160.2 B

Variable types
Numeric,14
Categorical,1
Boolean,1
Date,0
Text (Unique),0
Rejected,4
Unsupported,0

[Per-variable statistics from the profile report are not reproduced here; the variable names did not survive the export.]

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,4,79.0,17.0,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,5,130.0,29.0,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,0,66.0,14.0,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,58,1572.0,147.0,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,19,325.0,49.0,393


## Initialization

## Normalize the Input Data

Let's look at our input data. We'll use the first 7 columns as our predictors. We'll do the following two things:
- Normalize the continuous variables --> you can do this using `np.mean()` and `np.std()`
- Make dummy variables of the categorical variables (you can do this by using `pd.get_dummies`)

We only count "Category" and "Type" as categorical variables. Note that you can argue that "Post month", "Post Weekday" and "Post Hour" can also be considered categories, but we'll just treat them as being continuous for now.

You'll then use these to define X and Y.

To summarize, X will be:
* Page total likes
* Post Month
* Post Weekday
* Post Hour
* Paid

along with dummy variables for:
* Type
* Category

Be sure to normalize your features by subtracting the mean and dividing by the standard deviation.

Finally, Y will simply be the "like" column.
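
As a warm-up, the two preprocessing steps can be sketched on a tiny, made-up DataFrame. The column names mirror the lab's data, but the values and the `toy` variable are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({
    "Post Hour": [3, 10, 3, 10],
    "Type": ["Photo", "Status", "Photo", "Link"],
})

# Standardize the continuous column: subtract the mean, divide by the standard deviation
hour = toy["Post Hour"].to_numpy(dtype=float)
toy["Post Hour"] = (hour - np.mean(hour)) / np.std(hour)

# One-hot encode the categorical column
toy = pd.get_dummies(toy, columns=["Type"])

print(toy.columns.tolist())
print(toy["Post Hour"].tolist())
```

After this, the continuous column has mean 0 and standard deviation 1, and `Type` has been replaced by one indicator column per category.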

In [39]:
#Your code here; define X and y.
X = df.iloc[:, :7].copy()  # copy so the assignments below don't write into a slice of df
Y = df.like

categories = ['Type', 'Category']

for col in X.columns:
    if col in categories:
        # mark as categorical so pd.get_dummies one-hot encodes it
        X[col] = X[col].astype('category')
    else:
        # standardize: subtract the mean, divide by the standard deviation
        X[col] = (X[col] - np.mean(X[col])) / np.std(X[col])
        print(f"{col}:\nMean: {round(np.mean(X[col], axis=0),0)}\nstd: {round(np.std(X[col], axis=0),0)}\n")
X = pd.get_dummies(X)
X.head()

Page total likes:
Mean: 0.0
std: 1.0

Post Month:
Mean: 0.0
std: 1.0

Post Weekday:
Mean: -0.0
std: 1.0

Post Hour:
Mean: 0.0
std: 1.0

Paid:
Mean: 0.0
std: 1.0



Unnamed: 0,Page total likes,Post Month,Post Weekday,Post Hour,Paid,Type_Link,Type_Photo,Type_Status,Type_Video,Category_1,Category_2,Category_3
0,1.00496,1.506154,-0.065724,-1.105878,-0.62486,0,1,0,0,0,1,0
1,1.00496,1.506154,-0.558655,0.492065,-0.62486,0,0,1,0,0,1,0
2,1.00496,1.506154,-0.558655,-1.105878,-0.62486,0,1,0,0,0,0,1
3,1.00496,1.506154,-1.051585,0.492065,1.60036,0,1,0,0,0,1,0
4,1.00496,1.506154,-1.051585,-1.105878,-0.62486,0,1,0,0,0,1,0


Our data set is fairly small. Let's just split the data up into a training set and a validation set! The next three code blocks are all provided for you; have a quick review, but there is no need to make edits!

In [46]:
#Code provided; defining training and validation sets
def train_val_split(X,Y):
    data_clean = pd.concat([X, Y], axis=1)
    np.random.seed(123)
    train, validation = train_test_split(data_clean, test_size=0.2)

    X_val = validation.iloc[:,0:12]
    Y_val = validation.iloc[:,12]
    X_train = train.iloc[:,0:12]
    Y_train = train.iloc[:,12]
    return X_train, X_val, Y_train, Y_val
X_train, X_val, Y_train, Y_val = train_val_split(X,Y)

In [41]:
#Code provided; building an initial model
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)



In [42]:
#Code provided; previewing the loss through successive epochs
hist.history['loss'][:10]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Did you see what happened? All the values for training and validation loss are "nan". There could be several reasons for that, but as we already mentioned, there is likely a vanishing or exploding gradient problem. Recall that we normalized our inputs. But how about the outputs? Let's have a look.

In [43]:
Y_train.head()

208     54.0
290     23.0
286     15.0
0       79.0
401    329.0
Name: like, dtype: float64

Yes, indeed. We didn't normalize them, and we should, as they take pretty high values. Let's rerun the model but make sure that the output is normalized as well!
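
One likely contributing factor can be illustrated with a small sketch. For an MSE loss, the gradient is proportional to the prediction error, so targets in the hundreds produce gradients hundreds of times larger than standardized targets do. The data below is synthetic, and a plain linear model stands in for the network; `X_demo`, `y_raw` and `mse_grad` are illustrative names, not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, standardized inputs (the lab's inputs are standardized too)
X_demo = rng.normal(size=(100, 3))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Made-up targets on a scale of hundreds, and their standardized version
y_raw = rng.normal(loc=180.0, scale=320.0, size=100)
y_std = (y_raw - y_raw.mean()) / y_raw.std()

def mse_grad(X, y, w):
    """Gradient of mean((X @ w - y)**2) with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

w0 = np.zeros(3)  # same starting weights for both cases
grad_raw = np.linalg.norm(mse_grad(X_demo, y_raw, w0))
grad_std = np.linalg.norm(mse_grad(X_demo, y_std, w0))

print("gradient norm, raw targets:         ", grad_raw)
print("gradient norm, standardized targets:", grad_std)
print("ratio:", grad_raw / grad_std, "| std of raw targets:", y_raw.std())
```

In this setup the first gradient is larger by a factor equal to the standard deviation of the raw targets. With a fixed learning rate, steps that large can push the weights far enough to overflow, which then shows up as `nan` losses.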

## Normalizing the output

Normalize Y as you did X by subtracting the mean and dividing by the standard deviation. Then, resplit the data into training and validation sets as we demonstrated above, and retrain a new model using your normalized X and Y data.

In [44]:
Y = (df['like']-np.mean(df['like']))/(np.std(df['like']))
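
Keep in mind that a model trained on the standardized target predicts in units of standard deviations, not likes. A minimal sketch of mapping predictions back to the original scale follows; it uses the five "like" values from the data preview and a few hypothetical network outputs (`preds_scaled` is made up for illustration).

```python
import numpy as np

likes = np.array([79.0, 130.0, 66.0, 1572.0, 325.0])  # "like" values from the five previewed rows
mu, sigma = likes.mean(), likes.std()

likes_scaled = (likes - mu) / sigma          # the scale the network is trained on
preds_scaled = np.array([-0.5, 0.0, 1.2])    # hypothetical network outputs

# Undo the standardization to express predictions as numbers of likes
preds_likes = preds_scaled * sigma + mu
print(preds_likes)
```

A prediction of 0 on the standardized scale corresponds to the mean number of likes.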

In [47]:
#Your code here; create training and validation sets as before. Use random seed 123.
X_train, X_val, Y_train, Y_val = train_val_split(X, Y)

In [48]:
#Your code here; rebuild a simple model using a relu layer followed by a linear layer. (See our code snippet above!)
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)

Finally, let's recheck our loss function. Not only should it be populated with numerical data as opposed to null values, but we also should expect to see the loss function decreasing with successive epochs, demonstrating optimization!

In [49]:
hist.history['loss'][:10]

[1.4818981940096074,
 1.1884846822781996,
 1.1148223625590103,
 1.0856416412074157,
 1.0701496577022052,
 1.057204525428589,
 1.0487986371824236,
 1.0412088521201202,
 1.0355518850112202,
 1.0304819912922503]

Great! The loss is now finite and decreasing. With that, let's investigate how well the model performed with our good old friend, mean squared error.

In [50]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)  

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

print("MSE_train:", MSE_train)
print("MSE_val:", MSE_val)

MSE_train: 0.94790494467108
MSE_val: 0.9512222332440762
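
To put these numbers in context, it helps to compare them with a trivial baseline. Because Y was standardized to mean 0 and variance 1, a "model" that always predicts 0 (the mean) scores an MSE of about 1. The sketch below checks this on synthetic data; `y_demo` is a made-up, skewed stand-in for the "like" column.

```python
import numpy as np

rng = np.random.default_rng(123)
y_demo = rng.exponential(scale=200.0, size=495)    # synthetic, skewed stand-in for "like"
y_demo = (y_demo - y_demo.mean()) / y_demo.std()   # standardize as in the lab

baseline_pred = np.zeros_like(y_demo)              # always predict the mean (0 after standardizing)
baseline_mse = np.mean((baseline_pred - y_demo) ** 2)
print(baseline_mse)
```

On the lab's data the same check could be run as `np.mean(Y_val**2)`; the value on a single split will differ somewhat from 1, because the mean and standard deviation were computed on the full data set. MSE values around 0.95 are therefore only slightly better than always predicting the mean.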


## Using Weight Initializers

## He Initialization

Let's try using a weight initializer. In the lecture, we've seen the He initializer, which initializes the weights to have a mean of 0 and a variance of $2/n$, where $n$ is the number of features feeding into a layer.
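
The rule itself is easy to sketch in NumPy. Note the assumption: this draws from a plain normal distribution, whereas Keras's `he_normal` is documented as using a truncated normal, so the numbers will not match Keras exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 12, 8   # layer shape used in this lab: 12 input features, 8 hidden units

# He initialization: zero-mean weights with variance 2 / n_in
he_std = np.sqrt(2.0 / n_in)
W_small = rng.normal(0.0, he_std, size=(n_in, n_out))

# With only 96 weights the sample variance is noisy, so draw a large
# matrix as well to check the 2 / n rule empirically
W_large = rng.normal(0.0, np.sqrt(2.0 / 1000), size=(1000, 1000))

print(W_small.shape)
print("sample variance:", W_large.var(), "| target 2/n:", 2.0 / 1000)
```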

In [51]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, kernel_initializer= "he_normal",
                activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val),verbose=0)

In [52]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [53]:
print(MSE_train)
print(MSE_val)

0.9180107661847383
0.9785988070348924


The initializer does not really help us to decrease the MSE. We know that initializers can be particularly helpful in deeper networks, and our network isn't very deep. What if we use the `LeCun` initializer with a `tanh` activation?

## LeCun Initialization

In [54]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, 
                kernel_initializer= "lecun_normal", activation='tanh'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose=0)

In [55]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [56]:
print(MSE_train)
print(MSE_val)

0.9374206532494092
0.9842173428760341


Not much of a difference, but a useful note to consider when tuning your network. Next, let's investigate the impact of various optimization algorithms.

## RMSprop

In [57]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "rmsprop" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)

In [58]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [59]:
print(MSE_train)
print(MSE_val)

0.9052797862575565
0.9300485086542933


## Adam

In [60]:
np.random.seed(123)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= "Adam" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)

In [61]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [62]:
print(MSE_train)
print(MSE_val)

0.9171959085796803
0.9233055914960455


## Learning Rate Decay with Momentum
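
Before running the cell below, it can help to see what the `decay` and `momentum` arguments do. To our understanding, the legacy Keras `SGD` optimizer applies time-based decay per weight update, `lr / (1 + decay * iterations)`, and a classical momentum update. The sketch below reproduces both in plain Python; the training-set size of 396 assumes an 80% split of the 495 rows, and the toy loss is purely illustrative.

```python
import math

lr0, decay, momentum = 0.03, 0.0001, 0.9     # values passed to optimizers.SGD in this section
n_train, batch_size, epochs = 396, 32, 100   # 396 assumes an 80% split of the 495 rows

def decayed_lr(iteration):
    """Time-based decay: lr0 / (1 + decay * number of updates so far)."""
    return lr0 / (1.0 + decay * iteration)

updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates, decayed_lr(total_updates))

# One momentum step on a toy quadratic loss f(w) = w**2 (gradient 2w)
w, velocity = 1.0, 0.0
grad = 2.0 * w
velocity = momentum * velocity - decayed_lr(0) * grad
w = w + velocity
print(w)
```

Under these assumptions the learning rate only shrinks from 0.03 to roughly 0.0265 over the whole run, so the decay is gentle; most of the difference from plain SGD likely comes from the larger base learning rate and the momentum term.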


In [63]:
np.random.seed(123)
sgd = optimizers.SGD(lr=0.03, decay=0.0001, momentum=0.9)
model = Sequential()
model.add(layers.Dense(8, input_dim=12, activation='relu'))
model.add(layers.Dense(1, activation = 'linear'))

model.compile(optimizer= sgd ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, Y_train, batch_size=32, 
                 epochs=100, validation_data = (X_val, Y_val), verbose = 0)

In [64]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_val).reshape(-1)

MSE_train = np.mean((pred_train-Y_train)**2)
MSE_val = np.mean((pred_val-Y_val)**2)

In [65]:
print(MSE_train)
print(MSE_val)

0.8254409471441183
0.9469809514078896


## Additional Resources
* https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb  

* https://catalog.data.gov/dataset/consumer-complaint-database  

* https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/  

* https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/  

* https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/  

* https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network  



## Summary  

In this lab, we began to practice some of the concepts regarding normalization and optimization for neural networks. In the final lab for this section, you'll practice these concepts on your own in order to tune a model that predicts individuals' loan payments.