# Working with data
Data is one of the most important resources of the digital age. You and everyone around you create an astounding amount of data each day.

Machine learning makes sense of this data by training models that find patterns in it.

There is a large number of free datasets available for machine learning. In this lesson we will work with three datasets.

## Pima Indians diabetes dataset

In this tutorial, we are going to use the Pima Indians onset of diabetes dataset. This is a standard machine learning dataset from the UCI Machine Learning repository. It describes patient medical record data for Pima Indians and whether they had an onset of diabetes within five years.

As such, it is a binary classification problem (onset of diabetes as 1 or not as 0). All of the input variables that describe each patient are numerical. This makes it easy to use directly with neural networks that expect numerical input and output values, and ideal for our first neural network in Keras.
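
To make this layout concrete, here is a minimal sketch that uses a tiny made-up array in place of the real CSV file. The numbers, the names `toy`, `X_toy` and `y_toy`, and the 0.5 cutoff are all assumptions chosen for illustration; they are not taken from the Pima dataset. It shows how eight input columns are separated from the 0/1 label column, and how a probability can be turned into a class.

```python
import numpy as np

# Hypothetical data: three rows, eight numerical inputs followed by a 0/1 label.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1.0],
    [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0],
])

X_toy = toy[:, 0:8]  # columns 0..7 are the inputs
y_toy = toy[:, 8]    # column 8 is the label

# A sigmoid output is a probability between 0 and 1.
# Comparing it with a cutoff (0.5 is assumed here) turns it into a class.
probabilities = np.array([0.91, 0.12, 0.50])  # made-up model outputs
predicted = (probabilities >= 0.5).astype(int)

print(X_toy.shape, y_toy.shape)
print(predicted)
```

The same slicing pattern, `[:, 0:8]` for the inputs and `[:, 8]` for the label, is what the cell below applies to the full dataset.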

In [39]:
# Create your first MLP in Keras
from keras.models import Sequential
from keras.layers import Dense
import numpy
# fix random seed for reproducibility
numpy.random.seed(7)
# load pima indians dataset
dataset = numpy.loadtxt("../datasets/pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, epochs=150, batch_size=10)
# evaluate the model
scores = model.evaluate(X, Y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Epoch 1/150
Epoch 2/150
Epoch 3/150
...

In [41]:
# make predictions
from keras.models import Sequential
from keras.layers import Dense
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("../datasets/pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, epochs=150, batch_size=10, verbose=2)
# calculate predictions
predictions = model.predict(X)
# round predictions
rounded = [round(x[0]) for x in predictions]
print(rounded)



Epoch 1/150
0s - loss: 0.6774 - acc: 0.6510
Epoch 2/150
0s - loss: 0.6593 - acc: 0.6510
Epoch 3/150
0s - loss: 0.6471 - acc: 0.6510
Epoch 4/150
0s - loss: 0.6388 - acc: 0.6510
Epoch 5/150
0s - loss: 0.6320 - acc: 0.6510
Epoch 6/150
0s - loss: 0.6180 - acc: 0.6510
Epoch 7/150
0s - loss: 0.6181 - acc: 0.6510
Epoch 8/150
0s - loss: 0.6129 - acc: 0.6510
Epoch 9/150
0s - loss: 0.6082 - acc: 0.6510
Epoch 10/150
0s - loss: 0.6161 - acc: 0.6510
Epoch 11/150
0s - loss: 0.6055 - acc: 0.6510
Epoch 12/150
0s - loss: 0.6031 - acc: 0.6510
Epoch 13/150
0s - loss: 0.6001 - acc: 0.6510
Epoch 14/150
0s - loss: 0.6025 - acc: 0.6510
Epoch 15/150
0s - loss: 0.5978 - acc: 0.6510
Epoch 16/150
0s - loss: 0.5989 - acc: 0.6510
Epoch 17/150
0s - loss: 0.5982 - acc: 0.6510
Epoch 18/150
0s - loss: 0.6025 - acc: 0.6510
Epoch 19/150
0s - loss: 0.5967 - acc: 0.6510
Epoch 20/150
0s - loss: 0.5986 - acc: 0.6510
Epoch 21/150
0s - loss: 0.5956 - acc: 0.6510
Epoch 22/150
0s - loss: 0.5946 - acc: 0.6510
Epoch 23/150
0s - l

## The iris dataset
![Image of Iris](files/iris.jpg)
The iris dataset is a classic example dataset in machine learning.
The dataset consists of the sepal length, the sepal width, the petal length and the petal width, all in cm,
for three different types of Iris flowers: Iris Setosa, Iris Versicolour and Iris Virginica.
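
Because the three flower types are names rather than numbers, they have to be encoded before a neural network can use them. The following is a minimal sketch of one common approach, one-hot encoding, written with NumPy only. The label strings and the variable names are assumptions for illustration; the spelling in an actual CSV file may differ.

```python
import numpy as np

# Hypothetical class labels for four flowers.
species = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"])

# np.unique returns the sorted class names and, for each sample,
# the index of its class in that sorted list.
class_names, class_ids = np.unique(species, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(class_names))[class_ids]

print(class_ids)
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its class, which is the kind of target a three-unit softmax output layer is usually trained against.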


In [36]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

# pandas is a library for working with data in Python
# http://pandas.pydata.org/

# pandas is usually imported under the alias pd
import pandas as pd

# When we are working with machine learning we commonly use CSV files.
dataset = pd.read_csv("../datasets/iris.csv")



## add Labels

#Add labels to our dataset
#1. sepal length in cm 
#2. sepal width in cm 
#3. petal length in cm 
#4. petal width in cm 
#5. class: 
#-- Iris Setosa 
#-- Iris Versicolour 
#-- Iris Virginica
dataset.columns = ['SL','SW','PL','PW','Class']

# Turn the class names into the integers 0, 1 and 2.
# The names must match the spelling in the CSV file exactly
# ("Iris-virginica" is assumed to follow the same pattern as the other two).
class_to_int = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
dataset["Class"] = dataset["Class"].map(class_to_int)


data = dataset.values
X = data[:,0:4].astype(float)
Y = data[:,4]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)

# show the first five rows
dataset.head()

    SL   SW   PL   PW  Class
0  4.9  3.0  1.4  0.2      0
1  4.7  3.2  1.3  0.2      0
2  4.6  3.1  1.5  0.2      0
3  5.0  3.6  1.4  0.2      0
4  5.4  3.9  1.7  0.4      0

When we are working with machine learning we commonly use CSV files.
A CSV (comma-separated values) file is a plain-text file in which each line is one record and the values in a record are separated by commas.
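
As a small illustration of the format, the sketch below parses CSV text held in a string instead of a file on disk, so it runs without any dataset present. The header line and the use of `io.StringIO` are assumptions made for the example; the two data rows reuse values shown in the output above.

```python
import io

import pandas as pd

# Two records with a header line, written exactly as they would appear in a .csv file.
csv_text = (
    "SL,SW,PL,PW,Class\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# read_csv accepts any file-like object, so StringIO stands in for a real file here.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.dtypes)
```

pandas infers a numeric type for the four measurement columns and keeps the class names as text.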

## Test and training data

For all supervised learning tasks you need to separate your data into test and training data.
This can be done using the excellent pandas library.
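
One possible way to do this with pandas is sketched below. The tiny DataFrame, the 80/20 ratio and the `random_state` value are assumptions picked for the example, not settings used elsewhere in this lesson.

```python
import pandas as pd

# Hypothetical dataset with ten samples.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Draw 80% of the rows at random for training.
# A fixed random_state makes the split reproducible.
train = df.sample(frac=0.8, random_state=7)

# Every row that was not chosen for training goes into the test set.
test = df.drop(train.index)

print(len(train), len(test))
```

Dropping the training rows by index guarantees that no sample ends up in both sets.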


# Numpy
NumPy is a library for performing math on arrays in Python.
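
A short sketch of what that looks like in practice: the array below is made up, and centering columns is just one example of an operation that works on a whole array at once.

```python
import numpy as np

# Hypothetical measurements: three samples (rows), two features (columns).
measurements = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0]])

# axis=0 computes the mean down each column.
column_means = measurements.mean(axis=0)

# Broadcasting subtracts the matching column mean from every row.
centered = measurements - column_means

print(column_means)
print(centered)
```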

In [35]:
# Visualize training history
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import numpy as np

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# Remove missing values
dataset = dataset.dropna()

# shuffle the data
dataset = dataset.sample(frac=1)

# split into input (X) and output (Y) variables
X = dataset.loc[:, :'PW'].values.astype(float)
# encode the classes as integers 0, 1, 2; this works whether the column holds names or numbers
Y = LabelEncoder().fit_transform(dataset['Class'])

# Dropout - the fraction of inputs randomly set to zero during training; nothing is dropped when testing
# Batch size - the number of data points used for each weight update, affects training time
# Epochs - the number of complete passes over the training data

# create model
model = Sequential()
model.add(BatchNormalization(input_shape=(4,)))
model.add(Dropout(0.1))
model.add(Dense(100, kernel_initializer='uniform', activation='relu'))
model.add(Dense(100, kernel_initializer='uniform', activation='relu'))
model.add(Dense(100, kernel_initializer='uniform', activation='relu'))

# one output unit per class; softmax turns the three outputs into class probabilities
model.add(Dense(3, kernel_initializer='uniform', activation='softmax'))
# Compile model (sparse categorical cross-entropy expects integer class labels)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
history = model.fit(X, Y, validation_split=0.5, epochs=250, batch_size=25, verbose=0)
# list all data in history
print(history.history.keys())
# the accuracy key is 'acc' in older Keras versions and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
# summarize history for accuracy
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

## Data from brain sensors
![Image of Brain Sensor](files/brainsensor.jpg)
This data is from a brain sensor that uses lasers to detect blood oxygen activation in the brain.
The data from the brain sensor gives us a timestamp, twelve oxygenated and twelve deoxygenated values, and a state label.


In [50]:
# importing our data

#When we are working with machine learning we commonly use CSV files.
dataset = pd.read_csv("../datasets/taskdefault.csv")

# we start out by adding labels to our dataset
dataset.columns = ['Timestamp','Oxy1','DeOxy1','Oxy2','DeOxy2','Oxy3','DeOxy3', 'Oxy4','DeOxy4','Oxy5','DeOxy5','Oxy6','DeOxy6','Oxy7','DeOxy7','Oxy8','DeOxy8', 'Oxy9', 'DeOxy9', 'Oxy10', 'DeOxy10', 'Oxy11', 'DeOxy11', 'Oxy12', 'DeOxy12', 'State']

# check our first 50 data points using dataset.head
dataset.head(50)

# hmm, do you notice anything strange with this dataset
# how can we check if a dataset is any good?


Unnamed: 0,Timestamp,Oxy1,DeOxy1,Oxy2,DeOxy2,Oxy3,DeOxy3,Oxy4,DeOxy4,Oxy5,...,DeOxy8,Oxy9,DeOxy9,Oxy10,DeOxy10,Oxy11,DeOxy11,Oxy12,DeOxy12,State
0,1,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
1,2,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
2,3,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
3,4,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
4,5,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
5,6,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
6,7,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
7,8,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
8,9,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
9,10,97.9,32.47,16.99,5.813,17.44,5.903,17.45,5.905,93.98,...,6.02,72.21,22.5,16.48,5.596,20.35,6.853,12.94,4.45,1.0
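
One way to approach the question raised in the cell above is to look for columns that never change and for rows that repeat. The sketch below does this on a tiny hypothetical frame that mimics the repeated readings in the output; the frame, the column selection and the variable names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical readings: the sensor values repeat while only the timestamp advances.
readings = pd.DataFrame({
    "Timestamp": [1, 2, 3],
    "Oxy1": [97.9, 97.9, 97.9],
    "DeOxy1": [32.47, 32.47, 32.47],
})

# A column with a single distinct value carries no information for a model.
constant_columns = [c for c in readings.columns if readings[c].nunique() == 1]

# Count rows that repeat an earlier row once the timestamp is ignored.
duplicate_rows = int(readings.drop(columns="Timestamp").duplicated().sum())

print(constant_columns)
print(duplicate_rows)
```

If most columns turn out to be constant, or most rows are duplicates, the recording probably needs to be checked before it is used for training.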
