# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 8 Assignment: Building a Kaggle Submission File**

**Student Name: **

# Assignment Instructions

For this assignment you will use the [**reg-30-spring-2018.csv**](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/reg-30-spring-2018.csv) dataset to train a neural network and the [**reg-30-spring-2018-eval.csv**](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/reg-30-spring-2018-eval.csv) dataset as test data to build a submission file (similar to Kaggle).  The training code for this assignment will be identical to [Assignment 4](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class4.ipynb), and you are encouraged to use your Assignment 4 code as a starting point.  Refer to [Module 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class8_kaggle.ipynb) for instructions on producing a Kaggle-type submission file.  Please note that Module 8 provides an example of producing a classification (iris) submission file; you will need to convert this for regression.

The dataframe that you submit should have two columns: *id* and *target*.  The *id* column should match up with the *id* column of the test data file.  The *target* column is your prediction.  It is unlikely that the mean of *target* will match exactly with mine.
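
As a minimal sketch of that layout, the snippet below builds a two-column dataframe from made-up ids and predictions. The values are placeholders for illustration only; in the assignment the ids come from the test file and the predictions from your trained model.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predictions (not taken from the assignment data).
ids = pd.Series([1, 2, 3], name='id')
# Keras-style regression output has shape (n, 1).
pred = np.array([[10.5], [20.25], [30.0]], dtype=np.float32)

submit_df = pd.DataFrame({
    'id': ids,
    'target': pred.flatten(),  # flatten (n, 1) predictions into a 1-D column
})

# Serialize without the index so the output has exactly two columns.
csv_text = submit_df.to_csv(index=False)
print(csv_text)
```

Dropping the index with `index=False` keeps the output to the *id* and *target* columns only.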



# Helpful Functions

You will see these at the top of every module and assignment.  These are simply a set of reusable functions that we will make use of.  Each of them will be explained in greater detail as the course progresses.  Class 4 contains a complete overview of these functions.

In [12]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os
import requests
import base64


# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to indexes (i.e. [0],[1],[2] for blue,green,red; LabelEncoder sorts the classes).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        # .values replaces as_matrix(), which was removed in pandas 1.0
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is at least sd standard deviations from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    # Fill in each bound independently so that passing only one of them works.
    if data_low is None:
        data_low = min(df[name])
    if data_high is None:
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
               * (normalized_high - normalized_low) + normalized_low
        
# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))
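
To see what two of these helpers do to a dataframe, here is a small sketch on a made-up table. It repeats the two helper definitions so that it runs on its own; the column names and values are invented for illustration.

```python
import pandas as pd


def encode_text_dummy(df, name):
    # One new 0/1 column per distinct value, named "<name>-<value>".
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        df["{}-{}".format(name, x)] = dummies[x]
    df.drop(name, axis=1, inplace=True)


def encode_numeric_zscore(df, name, mean=None, sd=None):
    # Subtract the mean and divide by the (sample) standard deviation.
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd


# Toy data, not related to the assignment files.
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'red'],
                    'size': [1.0, 2.0, 3.0, 4.0]})
encode_text_dummy(toy, 'color')
encode_numeric_zscore(toy, 'size')

print(toy.columns.tolist())
print(toy)
```

After the calls, the text column is gone, one dummy column exists per color, and the numeric column has mean 0 and standard deviation 1.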

# Assignment #8 Sample Code

The following code provides a starting point for this assignment.

In [14]:
import os
import pandas as pd
from scipy.stats import zscore
from keras.models import Sequential
from keras.layers import Dense, Activation
import io
import requests
import numpy as np
from sklearn import metrics

# This is your student key that I emailed to you at the beginning of the semester.
key = ""  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
# file='/resources/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # IBM Data Science Workbench
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\t81_558_class1_intro_python.ipynb'  # Windows
file=''  # Mac/Linux

# Begin assignment
path = ""

filename_train = os.path.join(path,"reg-30-spring-2018.csv")
filename_test = os.path.join(path,"reg-30-spring-2018-eval.csv")
df = pd.read_csv(filename_train,na_values=['NA','?'])
ids = df['id']
df.drop('id',axis=1,inplace=True)
encode_text_dummy(df, 'region')
encode_text_dummy(df, 'item')
missing_median(df,'width')
encode_numeric_zscore(df, 'distance', mean=None, sd=None)
encode_numeric_zscore(df, 'landings', mean=None, sd=None)
encode_numeric_zscore(df, 'number', mean=None, sd=None)
encode_numeric_zscore(df, 'pack', mean=None, sd=None)
encode_numeric_zscore(df, 'age', mean=None, sd=None)
encode_numeric_zscore(df, 'weight', mean=None, sd=None)
encode_numeric_zscore(df, 'volume', mean=None, sd=None)
encode_numeric_zscore(df, 'width', mean=None, sd=None)
encode_numeric_zscore(df, 'max', mean=None, sd=None)
encode_numeric_zscore(df, 'power', mean=None, sd=None)
encode_numeric_zscore(df, 'size', mean=None, sd=None)
x,y=to_xy(df, 'target')
model = Sequential()
model.add(Dense(20, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x,y,verbose=2,epochs=250)
df_test = pd.read_csv(filename_test,na_values=['NA','?'])
encode_text_dummy(df_test, 'region')
encode_text_dummy(df_test, 'item')
missing_median(df_test,'width')
encode_numeric_zscore(df_test, 'distance', mean=None, sd=None)
encode_numeric_zscore(df_test, 'landings', mean=None, sd=None)
encode_numeric_zscore(df_test, 'number', mean=None, sd=None)
encode_numeric_zscore(df_test, 'pack', mean=None, sd=None)
encode_numeric_zscore(df_test, 'age', mean=None, sd=None)
encode_numeric_zscore(df_test, 'weight', mean=None, sd=None)
encode_numeric_zscore(df_test, 'volume', mean=None, sd=None)
encode_numeric_zscore(df_test, 'width', mean=None, sd=None)
encode_numeric_zscore(df_test, 'max', mean=None, sd=None)
encode_numeric_zscore(df_test, 'power', mean=None, sd=None)
encode_numeric_zscore(df_test, 'size', mean=None, sd=None)
ids = df_test['id']
df_test.drop('id',axis=1,inplace=True)
x = df_test.values.astype(np.float32)
pred = model.predict(x)

submit_df=pd.DataFrame()
submit_df['id'] = ids
submit_df['target'] = pred.flatten()  # model.predict returns shape (n, 1)
submit(source_file=file,data=submit_df,key=key,no=8)



Epoch 1/250
 - 0s - loss: 46216.8527
Epoch 2/250
 - 0s - loss: 18544.6846
Epoch 3/250
 - 0s - loss: 14919.7847
Epoch 4/250
 - 0s - loss: 14179.5961
Epoch 5/250
 - 0s - loss: 13678.1006
Epoch 6/250
 - 0s - loss: 13952.8411
Epoch 7/250
 - 0s - loss: 13842.1933
Epoch 8/250
 - 0s - loss: 13625.2021
Epoch 9/250
 - 0s - loss: 13217.3427
Epoch 10/250
 - 0s - loss: 13317.9201
Epoch 11/250
 - 0s - loss: 13699.2612
Epoch 12/250
 - 0s - loss: 13541.5521
Epoch 13/250
 - 0s - loss: 13453.8979
Epoch 14/250
 - 0s - loss: 14673.0618
Epoch 15/250
 - 0s - loss: 14107.1093
Epoch 16/250
 - 0s - loss: 14260.7881
Epoch 17/250
 - 0s - loss: 13026.4498
Epoch 18/250
 - 0s - loss: 13400.5247
Epoch 19/250
 - 0s - loss: 13243.7638
Epoch 20/250
 - 0s - loss: 13230.8938
Epoch 21/250
 - 0s - loss: 13287.7578
Epoch 22/250
 - 0s - loss: 14049.5080
Epoch 23/250
 - 0s - loss: 13876.4527
Epoch 24/250
 - 0s - loss: 13894.2363
Epoch 25/250
 - 0s - loss: 12952.5998
Epoch 26/250
 - 0s - loss: 13129.7035
Epoch 27/250
 - 0s - 

Epoch 214/250
 - 0s - loss: 13486.4786
Epoch 215/250
 - 0s - loss: 13934.6440
Epoch 216/250
 - 0s - loss: 13102.3276
Epoch 217/250
 - 0s - loss: 12732.7634
Epoch 218/250
 - 0s - loss: 13036.2645
Epoch 219/250
 - 0s - loss: 13465.9560
Epoch 220/250
 - 0s - loss: 13146.8408
Epoch 221/250
 - 0s - loss: 13409.8659
Epoch 222/250
 - 0s - loss: 12942.3512
Epoch 223/250
 - 0s - loss: 12865.2068
Epoch 224/250
 - 0s - loss: 12912.0583
Epoch 225/250
 - 0s - loss: 13297.3918
Epoch 226/250
 - 0s - loss: 12901.6835
Epoch 227/250
 - 0s - loss: 12687.5693
Epoch 228/250
 - 0s - loss: 13107.3034
Epoch 229/250
 - 0s - loss: 12739.4888
Epoch 230/250
 - 0s - loss: 13638.2314
Epoch 231/250
 - 0s - loss: 13017.3958
Epoch 232/250
 - 0s - loss: 12810.4333
Epoch 233/250
 - 0s - loss: 13673.8191
Epoch 234/250
 - 0s - loss: 12641.9918
Epoch 235/250
 - 0s - loss: 13040.8196
Epoch 236/250
 - 0s - loss: 13568.6945
Epoch 237/250
 - 0s - loss: 13493.5534
Epoch 238/250
 - 0s - loss: 12685.0058
Epoch 239/250
 - 0s - los



Success: Submitted Assignment #8 for q.yuan:
You have submitted this assignment 3 times. (this is fine)
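
The sample code z-scores the test file using the test file's own mean and standard deviation, and it builds the dummy columns separately for each file. One common alternative, sketched here on made-up data, is to compute the statistics on the training set and reuse them for the test set (the `mean` and `sd` parameters of `encode_numeric_zscore` can be used this way), and to reindex the test dummies to the training columns so that both matrices share the same layout. The toy frames, the `feature_cols` name, and the choice to fill missing dummy columns with 0 are assumptions for illustration, not a required part of the assignment.

```python
import pandas as pd

# Toy stand-ins for the training and evaluation files (values are invented).
train = pd.DataFrame({'region': ['a', 'b', 'c', 'a'],
                      'weight': [10.0, 20.0, 30.0, 40.0],
                      'target': [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({'region': ['a', 'b'],
                     'weight': [15.0, 35.0]})

# Compute the z-score statistics on the training data only ...
mean = train['weight'].mean()
sd = train['weight'].std()

# ... and apply the same statistics to both sets.
train['weight'] = (train['weight'] - mean) / sd
test['weight'] = (test['weight'] - mean) / sd

# Dummy-encode each set; the test set here lacks region 'c'.
train = pd.get_dummies(train, columns=['region'], prefix_sep='-')
test = pd.get_dummies(test, columns=['region'], prefix_sep='-')

# Align the test columns to the training feature columns, filling any
# dummy column that is absent from the test set with 0.
feature_cols = [c for c in train.columns if c != 'target']
test = test.reindex(columns=feature_cols, fill_value=0)

print(feature_cols)
print(test)
```

With this alignment the test matrix has the same number and order of columns as the matrix the network was trained on, even when a category is missing from the test file.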

