# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 5 Assignment: K-Fold Cross-Validation**

**Student Name: **

# Assignment Instructions

For this assignment you will use the **reg-30-spring-2018.csv** dataset.  This is a dataset that I generated specifically for this semester.  You can find the CSV file in the **data** directory of the class GitHub repository here: [reg-30-spring-2018.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/reg-30-spring-2018.csv).

You will train five neural networks, one for each fold of a 5-fold cross-validation, and return the out-of-sample predictions.  You will submit these predictions to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.
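
As a rough illustration of what "out-of-sample predictions" means here, the sketch below runs a 5-fold split on made-up data and fills in each row's prediction from the model that did not see that row during training. The toy data, the variable names, and the use of scikit-learn's `LinearRegression` in place of a neural network are all assumptions made to keep the example small; the assignment itself uses Keras.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Toy regression data (hypothetical; stands in for the assignment's CSV).
rng = np.random.RandomState(42)
x = rng.normal(size=(50, 3))
y = x @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

kf = KFold(5)  # default: no shuffling, so test folds cover the rows in order
oos_pred = np.zeros(len(y))

for fold, (train_idx, test_idx) in enumerate(kf.split(x), start=1):
    # Fit on four folds, then predict only the held-out fold.
    # LinearRegression is a stand-in for the neural network.
    model = LinearRegression().fit(x[train_idx], y[train_idx])
    oos_pred[test_idx] = model.predict(x[test_idx])
    fold_rmse = np.sqrt(np.mean((oos_pred[test_idx] - y[test_idx]) ** 2))
    print("Fold #{} RMSE: {:.4f}".format(fold, fold_rmse))

# Every row now has exactly one prediction, made by a model that never saw it.
oos_rmse = np.sqrt(np.mean((oos_pred - y) ** 2))
print("Out-of-sample RMSE: {:.4f}".format(oos_rmse))
```

Because each row is predicted exactly once, the combined out-of-sample RMSE covers the whole dataset without any row being scored by a model trained on it.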

Complete the following tasks:

* Normalize all numerics to zscores and all text/categoricals to dummies.  Do not normalize the *target*.
* Your target (y) is the field named *target*.
* If you find any missing values (NAs), replace them with the median value of that column.
* Use 5-fold cross-validation and return out-of-sample predictions.  Your RMSE will not be as good as in assignment #4, but this is because #4 was overfit.
* Your submission should contain the id (column name *id*), your prediction (column name *pred*), the expected value (from the **reg-30-spring-2018.csv** dataset, column name *y*), and the absolute value of the difference between the expected and predicted values (column name *diff*).
* You might get warnings about the means of your columns differing from mine.  Do not worry about small differences.  
* Your submitted dataframe will have these columns: id, y, pred, diff.
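
The preprocessing steps and submission layout listed above can be sketched on a tiny, made-up dataframe. The column names `width` and `region`, the data values, and the placeholder predictions are assumptions for illustration only; in the assignment the predictions come from your cross-validated networks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for this sketch.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "width": [2.0, np.nan, 4.0, 6.0],
    "region": ["a", "b", "a", "c"],
    "target": [10.0, 20.0, 30.0, 40.0],
})

# 1. Replace missing values with the column median.
df["width"] = df["width"].fillna(df["width"].median())

# 2. Z-score the numeric feature; the target is left untouched.
df["width"] = (df["width"] - df["width"].mean()) / df["width"].std()

# 3. Expand the categorical column into dummy columns (region-a, region-b, ...).
df = pd.get_dummies(df, columns=["region"], prefix_sep="-")

# Placeholder predictions standing in for out-of-sample model output.
pred = np.array([12.0, 18.0, 33.0, 35.0])

# Submission layout: id, y, pred, diff.
submit_df = pd.DataFrame({"id": df["id"], "y": df["target"], "pred": pred})
submit_df["diff"] = (submit_df["y"] - submit_df["pred"]).abs()
print(submit_df)
```

The median is filled in before the z-score so that the mean and standard deviation are computed over a complete column.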


# Helpful Functions

You will see these at the top of every module and assignment.  These are simply a set of reusable functions that we will make use of.  Each of them will be explained in greater detail as the course progresses.  Class 4 contains a complete overview of these functions.

In [2]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os
import requests
import base64


# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to integer indexes (LabelEncoder assigns 0, 1, 2, ... in sorted class order).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        # DataFrame.as_matrix was removed in pandas 1.0; .values works in old and new versions.
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is sd or more standard deviations from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    # Fill in each bound independently so a caller may supply only one of them.
    if data_low is None:
        data_low = min(df[name])
    if data_high is None:
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
               * (normalized_high - normalized_low) + normalized_low
        
# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# Assignment #5 Sample Code

The following code provides a starting point for this assignment.

In [3]:
import os
import pandas as pd
from scipy.stats import zscore
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
import io
import requests
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

# This is your student key that I emailed to you at the beginning of the semester.
key = ""  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
# file='/resources/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # IBM Data Science Workbench
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\t81_558_class1_intro_python.ipynb'  # Windows
# file='/Users/jeff/projects/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # Mac/Linux
file = ''

# Begin assignment
path = ""

filename_read = os.path.join(path,"reg-30-spring-2018.csv")
df = pd.read_csv(filename_read)

# Encode the feature vector
ids = df['id']
df.drop('id', axis=1, inplace=True)
encode_text_dummy(df, 'region')
encode_text_dummy(df, 'item')
missing_median(df,'width')
encode_numeric_zscore(df, 'distance', mean=None, sd=None)
encode_numeric_zscore(df, 'landings', mean=None, sd=None)
encode_numeric_zscore(df, 'number', mean=None, sd=None)
encode_numeric_zscore(df, 'pack', mean=None, sd=None)
encode_numeric_zscore(df, 'age', mean=None, sd=None)
encode_numeric_zscore(df, 'weight', mean=None, sd=None)
encode_numeric_zscore(df, 'volume', mean=None, sd=None)
encode_numeric_zscore(df, 'width', mean=None, sd=None)
encode_numeric_zscore(df, 'max', mean=None, sd=None)
encode_numeric_zscore(df, 'power', mean=None, sd=None)
encode_numeric_zscore(df, 'size', mean=None, sd=None)
# Encode to a 2D matrix for training
x,y = to_xy(df,'target')
# Cross-Validate
kf = KFold(5)
oos_y = []
oos_pred = []
fold = 0
for train, test in kf.split(x):
    fold+=1
    print("Fold #{}".format(fold))
        
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    
    model = Sequential()
    model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
    model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=0,epochs=1000)
    
    pred = model.predict(x_test)
    
    oos_y.append(y_test)
    oos_pred.append(pred)        
    
    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print("Fold score (RMSE): {}".format(score))
# Build the oos prediction list and calculate the error.  
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print("Final, out of sample score (RMSE): {}".format(score))    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
print(oos_y)
oos_pred = pd.DataFrame(oos_pred)
# Build the submission dataframe.  KFold does not shuffle by default, so the
# concatenated folds are in the same row order as ids.
oosDF = pd.DataFrame()
oosDF['id'] = ids
oosDF['y'] = oos_y
oosDF['pred'] = oos_pred
oosDF['diff'] = abs(oos_pred-oos_y)

# Save a copy if you like
oosDF.to_csv('5.csv',index=False)

# Submit assignment
submit(source_file=file,data=oosDF,key=key,no=5)



Fold #1
Epoch 00021: early stopping
Fold score (RMSE): 168.28468322753906
Fold #2
Epoch 00023: early stopping
Fold score (RMSE): 43.70139694213867
Fold #3
Epoch 00024: early stopping
Fold score (RMSE): 136.08485412597656
Fold #4
Epoch 00035: early stopping
Fold score (RMSE): 112.95545959472656
Fold #5
Epoch 00030: early stopping
Fold score (RMSE): 73.0551986694336
Final, out of sample score (RMSE): 115.65438079833984
               0
0      -8.165592
1     -21.348686
2     -26.013430
3     -20.795984
4     -21.365068
5     -10.866996
6     -17.973490
7     -21.344086
8     -36.634697
9      13.215101
10    -26.114847
11     -2.963598
12     39.961044
13   -125.353180
14     -2.048354
15     12.016312
16    -23.559896
17    -50.049709
18    -20.455130
19     32.706715
20     48.583778
21     -3.992370
22    -20.359522
23    -10.925120
24    234.530075
25     -9.960091
26    -34.372551
27     15.607453
28    -32.721043
29     29.529924
...          ...
1203  -51.202515
1204   24.545040
1

        id           y       pred
0        1   -8.165592  32.212528
1        2  -21.348686  30.675665
2        3  -26.013430  26.781134
3        4  -20.795984  -1.551021
4        5  -21.365068   5.017003
5        6  -10.866996  29.565313
6        7  -17.973490   8.494145
7        8  -21.344086  -9.466914
8        9  -36.634697  -0.452739
9       10   13.215101   9.478398
10      11  -26.114847   6.294805
11      12   -2.963598  11.996831
12      13   39.961044  39.323856
13      14 -125.353180   9.658208
14      15   -2.048354  21.567511
15      16   12.016312  -5.154353
16      17  -23.559896  -5.620616
17      18  -50.049709   6.583867
18      19  -20.455130  42.366825
19      20   32.706715  -4.593623
20      21   48.583778   6.241773
21      22   -3.992370  10.975469
22      23  -20.359522  -4.301799
23      24  -10.925120  -1.195865
24      25  234.530075   2.025083
25      26   -9.960091   6.544534
26      27  -34.372551   6.016118
27      28   15.607453  14.040166
28      29  -3