#  Machine learning project notebook

### Student: G00219132 Susan Hudson - Module: Machine Learning & Statistics, GMIT

### Notebook structure

The notebook is split up into the following sections which are based on the project statement requirements.

* Section One - General setup and importation of necessary python libraries/packages and dataset
* Section Two - Descriptive Statistics
* Section Three - Inferential Statistics
* Section Four - Predictive Statistics
* Section Five - references and conclusion

## Section One - Importing libraries and dataset

Below is a list of python packages used in this notebook and the loading of the dataset and conversion to a pandas dataframe.


In [1]:
# Import all necessary python packages for this notebook

import matplotlib.pyplot as plt 
#Matplotlib is a Python plotting library and Pyplot is a matplotlib module which provides a MATLAB type interface.
plt.rcParams['figure.figsize'] = [10, 6]  #sets figure sizes for plots
%matplotlib inline  
#command above ensures plots display correctly in the notebook
import seaborn as sns  #Seaborn is a Python package used for plotting data.
import pandas as pd  #Pandas is a Python package for use with data frames.
import scipy.stats as ss #statistical functions package
import numpy as np #NumPy is a Python package for mathematical computing
import sklearn.datasets #dataset location
import keras as kr #deep learning library - used for predictive neural networks 

Using TensorFlow backend.


In [2]:
# dataset is imported and converted to a pandas dataframe
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version
from sklearn.datasets import load_boston
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

##  Section Two - Descriptive Statistics

Descriptive statistics looks at summary statistics for the population sample [1], to get a 'feel' for the data. I have chosen both visual and quantitative analysis methods.
 

In [None]:
# check data set shape and print first five rows 
print(boston.data.shape)
df.head()

In [None]:
#a table of the summary statistics for all columns in the dataset
df.describe().T

In [None]:
# below outlines the characteristics of the dataset and explains what each variable represents.
# Note - this dataset is from the 1970s and some attributes are a reflection of those times. For the purpose of the project
# I imported the dataset as a whole but will not be considering some of the variables when evaluating data subsets.

print(boston.DESCR)

#### Distribution of Median Values

In [None]:
#below I am looking at the distribution of the target variable, MEDV

sns.distplot(df["MEDV"], bins=40)
plt.show()


The plot of MEDV shows a roughly normal distribution. There appears to be some sort of price capping or banding, as a disproportionate number of properties have a median value of 50.
The box plot of MEDV below appears to back up this observation.

In [None]:
# a box plot of MEDV to look at the shape
sns.boxplot(df.MEDV)

#### Correlation
I then looked at whether there was any correlation between the individual dataset variables, and in particular whether any strong correlations existed between the target variable MEDV and the other variables.

I produced both a correlation heatmap for a visual display and a correlation table.

Correlation is a statistical measure of the degree to which changes in the value of one variable predict changes in the value of another. A coefficient close to 1 indicates a strong positive correlation, i.e. the two change in the same direction: as one increases the other increases, and as one decreases the other decreases. A value close to -1 indicates a strong negative correlation, i.e. as one variable increases the other decreases.
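
As a minimal sketch of what the coefficient measures, the Pearson correlation can be computed by hand. The function name `pearson_r` and the toy values are my own illustration and are not taken from the Boston dataset; `df.corr()` uses the Pearson method by default, so it should agree with this calculation for any pair of columns.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# hypothetical toy values, for illustration only
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price_up = [10.0, 14.0, 19.0, 25.0, 30.0]    # rises as rooms rise
price_down = [30.0, 25.0, 19.0, 14.0, 10.0]  # falls as rooms rise

print(round(pearson_r(rooms, price_up), 3))    # close to 1
print(round(pearson_r(rooms, price_down), 3))  # close to -1
```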

In [None]:
# a correlation heatmap of the dataset

correlation_heatmap = df.corr().round(2)
fig, ax = plt.subplots(figsize=(10,6))   
sns.heatmap(correlation_heatmap, annot=True,linewidths= 1, ax=ax)

plt.show()

In [None]:
# correlation table for the dataset
df.corr()

Looking at the correlation heatmap/table, the strongest correlations with MEDV are a positive correlation with RM (average number of rooms per dwelling) and a negative correlation with LSTAT (% lower status of the population).
This makes sense, as one would expect house prices to be higher for larger houses and lower in poorer areas. This is supported by the negative correlation between LSTAT and RM: the higher the % of 'lower status', the lower the number of rooms. There is also a negative correlation with PTRATIO (pupil-teacher ratio by town): the higher the PTRATIO, the lower the MEDV. This could imply that areas of lower socio-economic status have a higher PTRATIO, which could be attributed to funding or even to the prevalence of private schools in more affluent areas. The positive correlation between LSTAT and PTRATIO could support this observation.

#### Distribution plots
Below are distribution plots for each variable. Some appear skewed (CRIM, ZN and AGE, for example), while RM and MEDV appear approximately normally distributed. As CHAS is categorical, its plot merely shows the counts for each value, 1 or 0; it is worth noting that there are far more at 0.

In [None]:
df.hist(bins=10, figsize=(15,10), grid=False)
plt.show()

## Section Three - Inferential Statistics

Inferential statistics looks at the sample and infers trends about the larger population from which the sample was drawn. Where populations are large it would be impossible to get data for the entire population, so inferences are made based on the statistical sample. The project brief was to analyse whether there is a significant difference in median house prices between houses along the Charles river and those that aren't.

To do this I decided to create two subsets and run a two sample t test to see whether the mean of median values is the same for houses bordering the river and houses not near the river. The null hypothesis I am testing is that there is no difference in the mean of median values between houses bordering the river and houses not near the river.

The alternative hypothesis is that there is a difference in the mean of median values between houses bordering the river and houses not near the river.
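
To make the test statistic concrete, here is a rough sketch of the pooled (equal variance) two sample t statistic worked out by hand. The function name `pooled_t_statistic` and the sample values are hypothetical and chosen only for illustration; `scipy.stats.ttest_ind` with its default `equal_var=True` computes the same statistic and also returns the p value.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two sample t statistic assuming equal variances, with its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # sample variances (divide by n - 1)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof

# hypothetical median values (in thousands), not taken from the dataset
near = [30.0, 28.0, 35.0, 50.0, 27.0]
far = [22.0, 20.0, 25.0, 19.0, 24.0, 21.0]

t_stat, dof = pooled_t_statistic(near, far)
print(t_stat, dof)
```

A large absolute value of the statistic relative to the t distribution with `dof` degrees of freedom gives a small p value, which is what leads to rejecting the null hypothesis.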

In [None]:
# create subsets for the t test: dfnear (houses bordering the river) and dffar (houses away from the river)

# houses bordering the river
dfnear = df[df['CHAS'] == 1.0].reset_index()
#print(dfnear)

# houses bordering the river, keeping only MEDV below 40 (this drops the capped 50k values)
dfnearno_out = dfnear[dfnear['MEDV'] < 40]
#print(dfnearno_out)

# houses away from the river
dffar = df[df['CHAS'] == 0.0].reset_index()
print(dffar)

# t test: houses near the river vs houses away from the river
n = dfnear['MEDV']
f = dffar['MEDV']
result = ss.ttest_ind(n, f)
print('t test result for CHAS:  ', result)
# the result obtained causes me to reject the null hypothesis

# t test: houses near the river (MEDV below 40 only) vs houses away from the river
n_noout = dfnearno_out['MEDV']
result = ss.ttest_ind(n_noout, f)
print('t test result for CHAS, no 50k values included:  ', result)


The initial result causes me to reject the null hypothesis and conclude that the mean of median values for houses along the river is not equal to the mean of median values for houses away from the river. However, below I took a further look at the two subsets of data for CHAS. Looking at the distributions, I am not convinced that the t test is of any value, as I don't feel that the data meets the required condition of being approximately normal [2]. The sample away from the river is OK, but the sample near the river appears to have a distribution with outliers, mainly the 'capped' 50k values.

Having reviewed the plots I repeated the t test, this time removing the 50k values. The resulting p value of 0.51 would mean that one cannot conclude that a significant difference exists.
However, I am sticking with the original rejection on the basis that removal of the outliers is not appropriate in this instance: they are capped values, so in all likelihood are 50 and above, and they represent 20% of the total sample, so are not insignificant.




In [None]:
# seaborn distribution plots 

sns.distplot(dfnear["MEDV"], bins=10).set_title ("MEDV Distribution where houses near river")
plt.show()
sns.distplot(dffar["MEDV"], bins=10).set_title ("MEDV Distribution where houses away from river")
plt.show()
sns.distplot(dfnearno_out["MEDV"], bins=10).set_title ("MEDV Distribution where houses near river no 50k values")
plt.show()

#box plot with stripplot overlaid to show data points
df4 = pd.DataFrame(data = dfnear, columns = [ 'MEDV'] )
sns.boxplot(x="variable", y="value", data= pd.melt(df4), palette="bright").set_title("House values near River ")
sns.stripplot(x="variable", y="value", data= pd.melt(df4), palette="dark").set_title("House values near River ")
plt.ylabel('Median house cost')
plt.xlabel('')
plt.show()

#box plot with stripplot overlaid to show data points
df5 = pd.DataFrame(data = dffar, columns = [ 'MEDV'] )
sns.stripplot (x="variable", y="value", data= pd.melt(df5), palette="dark").set_title("House values away from River ")
sns.boxplot(x="variable", y="value", data= pd.melt(df5), palette="bright").set_title("House values away from River ")
plt.ylabel('Median house cost')
plt.xlabel('')
plt.show()

#box plot with stripplot overlaid to show data points
df6 = pd.DataFrame(data = dfnearno_out, columns = [ 'MEDV'] )
sns.boxplot(x="variable", y="value", data= pd.melt(df6), palette="bright").set_title("House values near River no 50k ")
sns.stripplot(x="variable", y="value", data= pd.melt(df6), palette="dark").set_title("House values near River no 50k ")
plt.ylabel('Median house cost')
plt.xlabel('')
plt.show()

## Section Four - Predictive Statistics

The brief for this section of the project was to use Keras to build a neural network that could predict the median house price (MEDV) based on the other values in the dataset. There were no other restrictions and I decided to take the following approach. 
* Build a neural network and train on all variables.
* Look at some pre processing of data and again train on all variables.
* Reduce the number of inputs and repeat the above approach.

Select all variables as inputs and MEDV as output

In [9]:
x = df.iloc[:,0:13]
y = df.iloc[:,13]

### Build neural network model
The neural network I built is a sequential model where layers were added one at a time.
I experimented with different layer densities, activation functions, initializers and optimizers in the model itself, and with different numbers of epochs and batch sizes when training the model. After much trial and error I settled on the model below: two layers of 64 neurons with relu activation and an output layer of 1 neuron with linear activation.

This gave the lowest loss values, although the loss values fluctuate. When I increased the batch size to the full dataset during training I saw fewer fluctuations, so I think they are a result of the number of variables and the differences across variables when batch sizes are relatively small.

In [10]:
from sklearn.model_selection import train_test_split 
from keras import models
from keras import layers
from keras.models import Sequential
from keras.layers import Dense, Activation

m = models.Sequential()
m.add(layers.Dense(64, activation='relu', input_dim =13))
m.add(layers.Dense(64, activation='relu'))
m.add(layers.Dense(1,kernel_initializer='normal',activation='linear',use_bias=False))
# note: 'accuracy' is a classification metric, so for this regression the 'loss' (mse) is the value to read
m.compile(loss='mse', optimizer='adam',metrics=['accuracy'])
    

In [8]:
# split the dataset into test(20%) and train (80%) values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
m.fit(x_train, y_train, epochs=1000, batch_size=25)


In [None]:
print(m.metrics_names)
m.evaluate(x, y)

Using the above trained model I will use the test input variables to predict the output variable MEDV for the test portion of the split dataset.

In [None]:

p = m.predict(x_test)
#m.summary()

predictval =  np.around(p.T,2)
print(predictval)
original = y_test.to_numpy().astype(np.float32)  # as_matrix() was removed in pandas 1.0
print(original)

In [None]:
m.evaluate(x_test, y_test)

Looking at the predictions versus the actual values, and remembering that the figures represent thousands, some predictions appear reasonable on the surface: 13 input variables and predictions within 5%. However, some are way off and there is no consistency in prediction versus actual. The next step was to do some pre processing and see whether the model output improved.
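
One way to put numbers on "within 5%" and "way off" is to summarise the prediction errors directly. The sketch below is only an illustration: the function name `regression_errors` and the actual/predicted values are hypothetical and are not output from the model above. It reports the mean absolute error, the root mean squared error (the square root of the 'mse' loss, so in the same units as MEDV) and the share of predictions within a relative tolerance.

```python
import math

def regression_errors(actual, predicted, tolerance=0.05):
    """Return (MAE, RMSE, share of predictions within `tolerance` of the actual value)."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # count predictions whose absolute error is at most tolerance * actual value
    within = sum(abs(e) <= tolerance * abs(a) for a, e in zip(actual, errors)) / n
    return mae, rmse, within

# hypothetical values in thousands, for illustration only
actual = [20.0, 30.0, 50.0, 10.0]
predicted = [20.5, 29.0, 40.0, 10.4]

mae, rmse, within = regression_errors(actual, predicted)
print(mae, rmse, within)
```

Because the RMSE squares each error before averaging, a single prediction that is far off (such as the capped value of 50 in the toy data) pulls it well above the MAE.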

#### reducing input variables 
I repeated the above using fewer input variables and was surprised that performance was worse, both in how well the model trained and in how well it predicted. The loss on evaluation was much higher (53 vs 12) for the prediction.

In [None]:
# now going to repeat the above using the three inputs of most interest
Y=df['MEDV']
#print(Y)
Xfour =df[['RM', 'LSTAT','PTRATIO' ]]
#print(Xfour)

In [None]:
# after an initial run I tweaked the neural network to try to reduce loss
m = models.Sequential()
m.add(layers.Dense(39, activation='relu', input_dim =3))
m.add(layers.Dense(39, activation='relu'))
m.add(Dense(1,kernel_initializer='normal',activation='linear',use_bias=False))

m.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

In [None]:
x_train, x_test, y_train, y_test = train_test_split(Xfour, Y, test_size=0.2)
m.fit(x_train, y_train, epochs=1000, batch_size=22)

In [None]:
print(m.metrics_names)
m.evaluate(Xfour, Y)

In [None]:
pred = m.predict(x_test)
#m.summary()

predictval =  np.around(pred.T,2)
print(predictval)
output = y_test.to_numpy().astype(np.float32)  # as_matrix() was removed in pandas 1.0
print(output)
# square root of the summed squared errors over the test set (not averaged)
round(np.sqrt(np.sum((pred.T -output)**2)))

### Pre Processing of Data for Keras
I will now investigate whether pre processing the data (using all 13 inputs) makes any significant difference.
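
As a rough sketch of what the scaling step does to a single column, each value has the column mean subtracted and is divided by the column standard deviation. The function name `standardize` and the toy values are hypothetical. The sketch divides by n (the population standard deviation), which is what scikit-learn's scaler uses, whereas pandas `.std()` divides by n - 1 by default, so the two standard deviations differ slightly.

```python
import math

def standardize(values):
    """Rescale values to zero mean and unit variance using the population std (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# hypothetical column values, for illustration only
tax = [200.0, 300.0, 400.0, 700.0]
z = standardize(tax)
print([round(v, 3) for v in z])
```

After scaling, every input is on a comparable scale, so a variable with large raw values such as TAX no longer dominates one with small raw values such as NOX.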

In [11]:
from sklearn.model_selection import train_test_split 

In [12]:
# preprocessing of the whole dataset
# scaling
import sklearn.preprocessing as pre
xscale = pd.DataFrame(pre.scale(x), columns = x.columns)
xscale
yscale = pd.DataFrame(pre.scale(y))
xscale, yscale

(         CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
 0   -0.419782  0.284830 -1.287909 -0.272599 -0.144217  0.413672 -0.120013   
 1   -0.417339 -0.487722 -0.593381 -0.272599 -0.740262  0.194274  0.367166   
 2   -0.417342 -0.487722 -0.593381 -0.272599 -0.740262  1.282714 -0.265812   
 3   -0.416750 -0.487722 -1.306878 -0.272599 -0.835284  1.016303 -0.809889   
 4   -0.412482 -0.487722 -1.306878 -0.272599 -0.835284  1.228577 -0.511180   
 ..        ...       ...       ...       ...       ...       ...       ...   
 501 -0.413229 -0.487722  0.115738 -0.272599  0.158124  0.439316  0.018673   
 502 -0.415249 -0.487722  0.115738 -0.272599  0.158124 -0.234548  0.288933   
 503 -0.413447 -0.487722  0.115738 -0.272599  0.158124  0.984960  0.797449   
 504 -0.407764 -0.487722  0.115738 -0.272599  0.158124  0.725672  0.736996   
 505 -0.415000 -0.487722  0.115738 -0.272599  0.158124 -0.362767  0.434732   
 
           DIS       RAD       TAX   PTRATIO         B     LST

#### fitting and transforming

In [13]:
scaler=pre.StandardScaler()
scaler.fit(x)
x
scaler.mean_ , x.std()

(array([3.61352356e+00, 1.13636364e+01, 1.11367787e+01, 6.91699605e-02,
        5.54695059e-01, 6.28463439e+00, 6.85749012e+01, 3.79504269e+00,
        9.54940711e+00, 4.08237154e+02, 1.84555336e+01, 3.56674032e+02,
        1.26530632e+01]), CRIM         8.601545
 ZN          23.322453
 INDUS        6.860353
 CHAS         0.253994
 NOX          0.115878
 RM           0.702617
 AGE         28.148861
 DIS          2.105710
 RAD          8.707259
 TAX        168.537116
 PTRATIO      2.164946
 B           91.294864
 LSTAT        7.141062
 dtype: float64)

In [None]:
xscale = pd.DataFrame(scaler.transform(x), columns = x.columns)
xscale

In [None]:
from keras import models
from keras import layers
from keras.models import Sequential
from keras.layers import Dense, Activation

m = models.Sequential()
m.add(layers.Dense(64, activation='relu', input_dim =13))
m.add(layers.Dense(64, activation='relu'))
m.add(Dense(1,kernel_initializer='normal',activation='linear',use_bias=False))

m.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(xscale, y, test_size = 0.2)
m.fit(X_train, Y_train, epochs=1000, batch_size=22)


In [18]:
print(m.metrics_names)
m.evaluate(xscale, y)

['loss', 'accuracy']


[2.685029376636852, 0.09288537502288818]

In [33]:
# evaluate prediction loss (X_test is already scaled, as it was split from xscale)
m.predict(X_test).round().T
Y_test.to_numpy().astype(np.float32)
m.evaluate(X_test, Y_test)



  


[12.47418212890625, 0.019607843831181526]

### whitening data

In [38]:
xwhite_train, xwhite_test, ywhite_train, ywhite_test = train_test_split(x, y, test_size = 0.2)
x.corr()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
CRIM,1.0,-0.200469,0.406583,-0.055892,0.420972,-0.219247,0.352734,-0.37967,0.625505,0.582764,0.289946,-0.385064,0.455621
ZN,-0.200469,1.0,-0.533828,-0.042697,-0.516604,0.311991,-0.569537,0.664408,-0.311948,-0.314563,-0.391679,0.17552,-0.412995
INDUS,0.406583,-0.533828,1.0,0.062938,0.763651,-0.391676,0.644779,-0.708027,0.595129,0.72076,0.383248,-0.356977,0.6038
CHAS,-0.055892,-0.042697,0.062938,1.0,0.091203,0.091251,0.086518,-0.099176,-0.007368,-0.035587,-0.121515,0.048788,-0.053929
NOX,0.420972,-0.516604,0.763651,0.091203,1.0,-0.302188,0.73147,-0.76923,0.611441,0.668023,0.188933,-0.380051,0.590879
RM,-0.219247,0.311991,-0.391676,0.091251,-0.302188,1.0,-0.240265,0.205246,-0.209847,-0.292048,-0.355501,0.128069,-0.613808
AGE,0.352734,-0.569537,0.644779,0.086518,0.73147,-0.240265,1.0,-0.747881,0.456022,0.506456,0.261515,-0.273534,0.602339
DIS,-0.37967,0.664408,-0.708027,-0.099176,-0.76923,0.205246,-0.747881,1.0,-0.494588,-0.534432,-0.232471,0.291512,-0.496996
RAD,0.625505,-0.311948,0.595129,-0.007368,0.611441,-0.209847,0.456022,-0.494588,1.0,0.910228,0.464741,-0.444413,0.488676
TAX,0.582764,-0.314563,0.72076,-0.035587,0.668023,-0.292048,0.506456,-0.534432,0.910228,1.0,0.460853,-0.441808,0.543993


In [44]:
import sklearn.decomposition as dec
pca = dec.PCA(n_components = 13, whiten = True)
pca.fit(xwhite_train)
x_whitenedtrain = pd.DataFrame(pca.transform(xwhite_train), columns=x.columns)
x_whitenedtrain

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,-0.554536,0.339165,-0.436310,-0.522168,-0.048450,-0.887684,-0.694780,-0.335010,0.536407,-0.038723,-0.041052,3.785653,-0.723778
1,-0.730177,-0.062618,-0.067972,-1.115359,0.147648,-0.356882,-0.076542,-0.151480,-0.728876,0.173575,0.682770,-0.172845,0.446336
2,1.664179,-1.516959,0.000787,0.025371,-1.923739,1.297532,3.195349,3.645164,-0.681283,-0.194521,-0.230173,0.037578,-1.277022
3,-0.817228,-0.005165,-0.345135,1.323194,-0.057734,0.371731,1.319661,-1.663135,-0.730084,-0.102496,0.467361,-0.645813,-1.621954
4,1.418404,-1.383860,-0.286932,0.307957,-0.067292,-0.259872,-0.544750,-0.677901,-0.065441,0.315039,-0.290046,-0.359325,0.692215
...,...,...,...,...,...,...,...,...,...,...,...,...,...
399,-0.630367,0.044950,1.217556,-2.741395,0.196047,0.533153,1.069602,-0.408614,-0.164883,-0.010950,-0.609372,-0.114435,-0.426412
400,1.439478,-1.177965,-0.172456,0.184495,-1.022794,-1.417373,-0.131307,-0.931688,-0.326040,0.196653,2.414885,-0.483383,1.759812
401,-0.787380,-0.024167,-0.441772,-0.630024,0.011187,-1.032520,-0.535877,1.303050,0.060603,-0.859208,-1.391527,-0.170314,-0.673719
402,1.979419,3.352224,-0.058084,0.360270,0.616302,1.891122,-1.389529,0.101597,0.172518,-0.317915,1.136648,0.078508,-2.281301


In [46]:
x_whitenedtrain.corr().round(),x_whitenedtrain.mean().round(),x_whitenedtrain.std().round()

(         CRIM   ZN  INDUS  CHAS  NOX   RM  AGE  DIS  RAD  TAX  PTRATIO    B  \
 CRIM      1.0  0.0   -0.0   0.0  0.0  0.0  0.0 -0.0  0.0  0.0     -0.0  0.0   
 ZN        0.0  1.0    0.0   0.0 -0.0  0.0 -0.0 -0.0  0.0 -0.0      0.0 -0.0   
 INDUS    -0.0  0.0    1.0  -0.0 -0.0 -0.0  0.0 -0.0 -0.0  0.0      0.0  0.0   
 CHAS      0.0  0.0   -0.0   1.0  0.0 -0.0  0.0  0.0 -0.0  0.0      0.0 -0.0   
 NOX       0.0 -0.0   -0.0   0.0  1.0  0.0  0.0  0.0 -0.0 -0.0      0.0  0.0   
 RM        0.0  0.0   -0.0  -0.0  0.0  1.0  0.0  0.0 -0.0  0.0      0.0 -0.0   
 AGE       0.0 -0.0    0.0   0.0  0.0  0.0  1.0 -0.0 -0.0  0.0     -0.0 -0.0   
 DIS      -0.0 -0.0   -0.0   0.0  0.0  0.0 -0.0  1.0 -0.0 -0.0     -0.0 -0.0   
 RAD       0.0  0.0   -0.0  -0.0 -0.0 -0.0 -0.0 -0.0  1.0  0.0      0.0 -0.0   
 TAX       0.0 -0.0    0.0   0.0 -0.0  0.0  0.0 -0.0  0.0  1.0      0.0 -0.0   
 PTRATIO  -0.0  0.0    0.0   0.0  0.0  0.0 -0.0 -0.0  0.0  0.0      1.0  0.0   
 B         0.0 -0.0    0.0  -0.0  0.0 -0

### build neural network model


In [None]:
from sklearn.model_selection import train_test_split 
Y=df['MEDV']
print(Y)
X =df[['ZN','RM', 'LSTAT','PTRATIO' ]]
print(X)

In [47]:
from keras import models
from keras import layers
from keras.models import Sequential
from keras.layers import Dense, Activation

m = models.Sequential()
m.add(layers.Dense(64, activation='relu', input_dim =13))
m.add(layers.Dense(64, activation='relu'))
m.add(Dense(1,kernel_initializer='normal',activation='linear',use_bias=False))

m.compile(loss='mse', optimizer='adam',metrics=['accuracy'])
    

In [49]:
#X_train, X_test, Y_train, Y_test = train_test_split(xscale, y, test_size = 0.2)
# train on the whitened inputs with the targets from the same split (ywhite_train)
m.fit(x_whitenedtrain, ywhite_train, epochs=100, batch_size=22)


Epoch 1/100
...
Epoch 100/100


<keras.callbacks.callbacks.History at 0x16fa97c8>

In [None]:
# continue training on the whitened data, holding back 20% for validation so that val_loss is recorded
history = m.fit(x_whitenedtrain, ywhite_train, validation_split=0.20, epochs=75, batch_size=25)
print(history.history.keys())



In [None]:
# whiten the test inputs with the PCA fitted on the training inputs
x_test_white = pca.transform(xwhite_test)
m.predict(x_test_white).round().T
ywhite_test.to_numpy().astype(np.float32)
m.evaluate(x_test_white, ywhite_test)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
m.summary()
print(m.predict(x_test_white).T)


In [None]:
#output = (Y_test)
#print(output)
#np.sqrt(np.sum((m.predict(X_test).T -output)**2))

## references


https://docs.scipy.org/doc/scipy/reference/stats.html
[2] https://stattrek.com/hypothesis-test/difference-in-means.aspx
https://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005
    
    
Pre Processing Data
https://keras.io/models/about-keras-models/
https://scikit-learn.org/stable/modules/preprocessing.html
https://keras.io/
    
### books
[1] Hand, D. J. (2008) Statistics: A Very Short Introduction
Python Data Analysis - Fandango, Armando
    