In [1]:
import keras
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.datasets import mnist
from keras.utils import to_categorical
import numpy as np

(x_set, y_set), (x_test, y_test) = mnist.load_data()
x_set = x_set/255

Using TensorFlow backend.


# Introduction

Because we will not be having class next week, and I cannot reasonably expect you to do work we have not yet covered in lecture, this week's sprint will be split into two smaller pieces, as was loosely voted on in class. This is part 1.

For this sprint you will be doing a process called K-Fold Cross Validation.

### Instructions

In class you were briefly introduced to Keras, a high-level machine learning library. It can be used to create everything from an introductory model, such as the one you will be building, to the very complex models used in industry every day for tasks ranging from chatbots to object detection.

### Section 1

In the last sprint you did some exploration that helped you understand the dataset and what was in it. This time you are going to prepare it for training.

In his lecture, Professor Memon talked about holding back some of your data so that you can later use it to validate whether your model is working.

For this section you will be responsible for implementing, in Python, an algorithm called K-Fold.

This will be worth **40** points of the sprint.
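
As a point of reference, one common way to build the folds is to shuffle the sample indices once and then cut them into K nearly equal pieces. The sketch below is only an illustration under stated assumptions: `make_folds` is a hypothetical helper name, the inputs are assumed to be NumPy arrays, and a small toy array stands in for MNIST. It is one possible approach, not the required solution.

```python
import numpy as np

def make_folds(x, y, k, seed=1):
    """Shuffle x and y together, then split them into k folds (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(x))          # one shared shuffle keeps x and y aligned
    index_groups = np.array_split(order, k)  # k index arrays whose sizes differ by at most 1
    x_folds = [x[idx] for idx in index_groups]
    y_folds = [y[idx] for idx in index_groups]
    return x_folds, y_folds

# Toy data standing in for MNIST: 10 samples, where each label equals the first feature.
x_toy = np.arange(20).reshape(10, 2)
y_toy = x_toy[:, 0].copy()
x_folds, y_folds = make_folds(x_toy, y_toy, k=5)
print([len(fold) for fold in x_folds])
```

Because the same index array selects from both `x` and `y`, every image stays paired with its label after the shuffle.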


### Section 2

Use K = 5 as the number of folds for everything below.

Now that you have properly segmented your data, you will have to train K-1 models and validate them. The code for the model has already been implemented, so you do not need to worry about that.

The general procedure is:

1. Split your dataset into K even sets of data (folds) using the k-fold algorithm.
2. Train a model on fold k = 0.
3. Validate the model on fold k = 1.
4. Repeat with the next pair of folds: train on fold k+1 and validate on fold k+2.
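
Read literally, the procedure pairs each fold with the fold that follows it. The short sketch below lists those pairs for K = 5; `fold_pairs` is a hypothetical helper name used only for illustration, and this pairing is one reading of the steps above.

```python
def fold_pairs(num_folds):
    """Return (train_fold, validation_fold) index pairs: (0, 1), (1, 2), ..."""
    return [(k, k + 1) for k in range(num_folds - 1)]

# With 5 folds this gives 4 pairs, which matches the K-1 models asked for above.
for train_k, val_k in fold_pairs(5):
    print(f"train on fold {train_k}, validate on fold {val_k}")
```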
    
**Note:** Training the models will take some time, depending on your computer. Each model will be saved, so once you are sure this part is working you should only have to train once. If you mess something up, you can delete the model files and start again.
    
This will be worth **40** points of the sprint.

### Section 3
Provide a few sentences about common pitfalls of k-fold cross-validation and of training models with it.

This will be worth **20** points of the sprint.

### Extra credit

There are many other validation methods for machine learning models. Find one and implement it.
This is worth **20** extra credit points for the sprint.
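
To make "another validation method" concrete, the sketch below shows a plain hold-out split, where a single fixed fraction of the data is set aside for validation. The helper name `holdout_split`, the 20% fraction, and the toy arrays are all assumptions chosen for illustration; other methods would qualify equally well.

```python
import numpy as np

def holdout_split(x, y, validation_fraction=0.2, seed=1):
    """Shuffle once, then reserve a fixed fraction of samples for validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(x))
    n_val = int(len(x) * validation_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Toy data standing in for MNIST: 10 samples with labels 0..9.
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)
(train_x, train_y), (val_x, val_y) = holdout_split(x_toy, y_toy)
print(len(train_x), len(val_x))
```

Unlike k-fold, a hold-out split validates on one subset only, so the result depends more heavily on which samples happen to land in that subset.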


#### Note:
Before you begin: you can use the same virtual environments you created last week, but you must pip install h5py into them. h5py is a Python library for reading and writing the HDF5 file format, which will be used to save the trained models.
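
A quick way to confirm the install worked is to check whether the package can be found from inside the active environment. The sketch below uses only the Python standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the module can be found in the active environment."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("h5py"):
    print("h5py is available")
else:
    print("h5py is missing; run: pip install h5py")
```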



In [69]:
def k_fold_split(x_set, y_set, folds=1):
    '''
    Inputs: The x_set data from mnist, the y_set labels from mnist,
            folds, the number of folds to split the data into
    Expected Output: The shuffled and K split datasets, as two lists of
            length folds (x folds are numpy arrays, y folds are lists of labels)
    '''
    print("within k fold split function")
    np.random.seed(1)  #fixed seed so the shuffle is reproducible
    
    #flatten each 28x28 image into a vector of 784 values
    x_set_temp = np.reshape(x_set, (-1, 784))
    
    print("x_set type is" + str(type(x_set)))
    print("y_set type is" + str(type(y_set)))
    print("X set is " + str(len(x_set)))
    print("Y set is " + str(len(y_set)))
    
    #pair each image with its label so they stay aligned while shuffling
    combined_list = [(x_set_temp[i], y_set[i]) for i in range(len(x_set))]
    np.random.shuffle(combined_list)
    
    #one list that holds `folds` lists, each taking every folds-th pair
    fold_lists = [combined_list[i::folds] for i in range(folds)]
    
    #each x fold is a numpy array with shape (fold size, 784)
    x_final_list = [np.array([tup[0] for tup in lst]) for lst in fold_lists]
    #each y fold is a plain list of labels, in the same order as its x fold
    y_final_list = [[tup[1] for tup in lst] for lst in fold_lists]
    
    print("len of x final list " + str(len(x_final_list)))
    print("len of y final list " + str(len(y_final_list)))
    print("x_final_type " + str(type(x_final_list)))
    print("y_final_type " + str(type(y_final_list)))
    print("x_final_type[0] " + str(type(x_final_list[0])))
    print("y_final_type[0] " + str(type(y_final_list[0])))
    print("length of y final[0] " + str(len(y_final_list[0])))
    print("type of y final[0] " + str(type(y_final_list[0])))
    print("x_final_type[0] shape " + str(x_final_list[0].shape))
    
    return x_final_list, y_final_list
    
x_folds, y_folds = k_fold_split(x_set, y_set, 5)

within k fold split function
x_set type is<class 'numpy.ndarray'>
y_set type is<class 'numpy.ndarray'>
X set is 60000
Y set is 60000
len of x final list 5
len of y final list 5
x_final_type <class 'list'>
y_final_type <class 'list'>
x_final_type[0] <class 'numpy.ndarray'>
y_final_type[0] <class 'list'>
length of y final[0] 12000
type of y final[0] <class 'list'>
x_final_type[0] shape (12000, 784)


In [70]:
def construct_model():
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(784,)))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='RMSprop', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

#An epoch is one full pass over the training dataset; 20 epochs is a good number
def train_model(model, train_dataset, validation_dataset, epochs, name):
    x_set, y_set = train_dataset
    model.fit(x_set, y_set, epochs=epochs, batch_size=128, validation_data=validation_dataset)
    model.save(f'./{name}')
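
The model above ends in a 10-unit softmax layer and is compiled with `categorical_crossentropy`, so it expects one row of 10 values per sample as its target, not a single integer label. The sketch below illustrates that one-hot encoding in plain NumPy; `one_hot` is a hypothetical helper written for illustration and is not part of Keras.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels into rows with a single 1 at the label's index."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

encoded = one_hot([3, 0, 9])
print(encoded.shape)  # one row per label, one column per class
```

The encoded targets keep one row per sample, which is the shape the final `Dense(10)` layer is checked against during training.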

In [94]:
#Hint: Neural networks can't handle the labels as they are; they need --categorical-- data
#Note: You must submit the trained models along with the notebook for full credit
def train_validate_k(x_folds , y_folds, num_folds):
    '''
        Inputs: x_folds, the x folds returned from the k_fold algorithm above, 
        y_folds the y folds returned from the k_fold algorithm above
        num_folds, the number of folds used to make the x_folds and y_folds
        Expected Output: Nothing, this function has no explicit output, 
        but there must be num_fold models trained and saved to disk
    '''
    #x_folds and y_folds are already split, so they are used directly.
    #Calling k_fold_split on them again would split the list of folds
    #instead of the samples and leave one row per "fold".
    
    #Section 2 asks for K-1 models: train on fold i, validate on fold i+1
    for i in range(num_folds-1):
        model = construct_model()
        
        x_train = x_folds[i]
        x_validate = x_folds[i+1]
        
        #one-hot encode the labels to shape (samples, 10) so they match the
        #10-unit softmax output layer; no reshape is needed afterwards
        y_train = to_categorical(y_folds[i], num_classes=10)
        y_validate = to_categorical(y_folds[i+1], num_classes=10)
        
        print("fold " + str(i) + " x train shape " + str(x_train.shape))
        print("fold " + str(i) + " y train shape " + str(y_train.shape))
        
        #give each model its own file name so later folds do not overwrite earlier ones
        train_model(model, (x_train, y_train), (x_validate, y_validate), 20, 'xy_tv_' + str(i))


In [95]:
train_validate_k(x_folds, y_folds, 5)

#### Section 3, write a few sentences below.