# T81-558: Applications of Deep Neural Networks
**Class 12: Deep Learning Applications**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

Tonight we will see how to apply deep learning networks to data science.  There are many applications of deep learning.  However, we will focus primarily upon data science.  For this class we will go beyond simple academic examples and see how to construct an ensemble that could potentially lead to a high score on a Kaggle competition.  We will see how to evaluate the importance of features and several ways to combine models.

Tonights topics include:

* Log Loss Error
* Evaluating Feature Importance
* The Biological Response Data Set
* Neural Network Bagging
* Nueral Network Ensemble

# Helpful Functions from Previous Classes

The following are utility functions from previous classes.

In [1]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os

# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df,name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name,x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the origional column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df,name,target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x)==str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name,tv)
        df[name2] = l
    
# Encode text values to indexes(i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df,name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Encode a numeric column as zscores
def encode_numeric_zscore(df,name,mean=None,sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name]-mean)/sd

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)

# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)

# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df,target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)

    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        return df.as_matrix(result).astype(np.float32),df.as_matrix([target]).astype(np.int32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32),df.as_matrix([target]).astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Regression chart, we will see more of this chart in the next class.
def chart_regression(pred,y):
    t = pd.DataFrame({'pred' : pred, 'y' : y_test.flatten()})
    t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
    
# Get a new directory to hold checkpoints from a neural network.  This allows the neural network to be
# loaded later.  If the erase param is set to true, the contents of the directory will be cleared.
def get_model_dir(name,erase):
    base_path = os.path.join(".","dnn")
    model_dir = os.path.join(base_path,name)
    os.makedirs(model_dir,exist_ok=True)
    if erase and len(model_dir)>4 and os.path.isdir(model_dir):
        shutil.rmtree(model_dir,ignore_errors=True) # be careful, this deletes everything below the specified path
    return model_dir

# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name]-df[name].mean())>=(sd*df[name].std()))]
    df.drop(drop_rows,axis=0,inplace=True)
    
# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low =-1, normalized_high =1, 
                         data_low=None, data_high=None):
    
    if data_low is None:
        data_low = min(df[name])
        data_high = max(df[name])
    
    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
                * (normalized_high - normalized_low) + normalized_low

# LogLoss Error

Log loss is an error metric that is often used in place of accuracy for classification.  Log loss allows for "partial credit" when a miss classification occurs.  For example, a model might be used to classify A, B and C.  The correct answer might be A, however if the classification network chose B as having the highest probability, then accuracy gives the neural network no credit for this classification.  

However, with log loss, the probability of the correct answer is added to the score.  For example, the correct answer might be A, but if the neural network only predicted .8 probability of A being correct, then the value -log(.8) is added.

$$ logloss = -\frac{1}{N}\sum^N_{i=1}\sum^M_{j=1}y_{ij} \log(\hat{y}_{ij}) $$

The following table shows the logloss scores that correspond to the average predicted accuracy for the correct item. The **pred** column specifies the average probability for the correct class.  The **logloss** column specifies the log loss for that probability.


In [3]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

loss = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.075, 0.05, 0.025, 1e-8 ]

df = pd.DataFrame({'pred':loss, 'logloss': -np.log(loss)},columns=['pred','logloss'])

display(df)

Unnamed: 0,pred,logloss
0,1.0,-0.0
1,0.9,0.105361
2,0.8,0.223144
3,0.7,0.356675
4,0.6,0.510826
5,0.5,0.693147
6,0.4,0.916291
7,0.3,1.203973
8,0.2,1.609438
9,0.1,2.302585


The table below shows the opposit.  For a given logloss, what is the average probability for the correct class.

In [4]:
import numpy as np
import pandas as pd
from IPython.display import display, HTML

loss = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.5, 2, 2.5, 3, 3.5, 4 ]

df = pd.DataFrame({'logloss':loss, 'pred': np.exp(np.negative(loss))},
                  columns=['logloss','pred'])

display(df)

Unnamed: 0,logloss,pred
0,0.1,0.904837
1,0.2,0.818731
2,0.3,0.740818
3,0.4,0.67032
4,0.5,0.606531
5,0.6,0.548812
6,0.7,0.496585
7,0.8,0.449329
8,0.9,0.40657
9,1.0,0.367879


# Evaluating Feature Importance

Feature importance tells us how important each of the features (from the feature/import vector are to the prediction of a neural network, or other model.  There are many different ways to evaluate feature importance for neural networks.  The following paper presents a very good (and readable) overview of the various means of evaluating the importance of neural network inputs/features.

Olden, J. D., Joy, M. K., & Death, R. G. (2004). [An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data](http://depts.washington.edu/oldenlab/wordpress/wp-content/uploads/2013/03/EcologicalModelling_2004.pdf). *Ecological Modelling*, 178(3), 389-397.

In summary, the following methods are available to neural networks:

* Connection Weights Algorithm
* Partial Derivatives
* Input Perturbation
* Sensitivity Analysis
* Forward Stepwise Addition 
* Improved Stepwise Selection 1
* Backward Stepwise Elimination
* Improved Stepwise Selection

For this class we will use the **Input Perturbation** feature ranking algorithm.  This algorithm will work with any regression or classification network.  implementation of the input perturbation algorithm for scikit-learn is given in the next section. This algorithm is implemented in a function below that will work with any scikit-learn model.

This algorithm was introduced by [Breiman](https://en.wikipedia.org/wiki/Leo_Breiman) in his seminal paper on random forests.  Although he presented this algorithm in conjunction with random forests, it is model-independent and appropriate for any supervised learning model.  This algorithm, known as the input perturbation algorithm, works by evaluating a trained model’s accuracy with each of the inputs individually shuffled from a data set.  Shuffling an input causes it to become useless—effectively removing it from the model. More important inputs will produce a less accurate score when they are removed by shuffling them. This process makes sense, because important features will contribute to the accuracy of the model.

The provided algorithm will use logloss to evaluate a classification problem and RMSE for regression.

In [2]:
from sklearn import metrics
import scipy as sp
import numpy as np
import math

def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds,y_test):
        x = row[0][row[1]]
        x = max(epsilon,x)
        x = min(1-epsilon,x)
        sum+=math.log(x)
    return( (-1/len(preds))*sum)

def perturbation_rank(model, x, y, names, regression):
    errors = []

    for i in range(x.shape[1]):
        hold = np.array(x[:, i])
        np.random.shuffle(x[:, i])
        
        if regression:
            # The following code is only needed until Google fixes SKCOMPAT
            # pred = model.predict(x)
            pred = list(model.predict(x_test, as_iterable=True))
            error = metrics.mean_squared_error(y, pred)
        else:
            # The following code is only needed until Google fixes SKCOMPAT
            # pred = model.predict_proba(x)
            pred = list(model.predict_proba(x_test, as_iterable=True))
            error = mlogloss(y, pred)
            
        errors.append(error)
        x[:, i] = hold
        
    max_error = np.max(errors)
    importance = [e/max_error for e in errors]
   
    data = {'name':names,'error':errors,'importance':importance}
    result = pd.DataFrame(data, columns = ['name','error','importance'])
    result.sort_values(by=['importance'], ascending=[0], inplace=True)
    return result

### Classification Input Perturbation Ranking

In [3]:
# Classification ranking

import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow.contrib.learn as learn
import numpy as np
from tensorflow.contrib.learn.python.learn.metric_spec import MetricSpec

# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.INFO)

path = "./data/"
    
filename = os.path.join(path,"iris.csv")    
df = pd.read_csv(filename,na_values=['NA','?'])

# Encode feature vector
encode_numeric_zscore(df,'petal_w')
encode_numeric_zscore(df,'petal_l')
encode_numeric_zscore(df,'sepal_w')
encode_numeric_zscore(df,'sepal_l')
species = encode_text_index(df,"species")
num_classes = len(species)

# Create x & y for training

# Create the x-side (feature vectors) of the training
x, y = to_xy(df,'species')
    
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=45)

# Get/clear a directory to store the neural network to
model_dir = get_model_dir('iris',True)

# Create a deep neural network with 3 hidden layers of 10, 20, 5
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x.shape[1])]
classifier = learn.DNNClassifier(
    model_dir= model_dir,
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1),
    hidden_units=[10, 20, 5], n_classes=num_classes, feature_columns=feature_columns)

# Might be needed in future versions of "TensorFlow Learn"
#classifier = learn.SKCompat(classifier) # For Sklearn compatibility

# Early stopping
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    x_test,
    y_test,
    every_n_steps=500,
    #metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=50)
    
# Fit/train neural network
classifier.fit(x_train, y_train,monitors=[validation_monitor],steps=10000)


INFO:tensorflow:Using config: {'_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x10b0effd0>, 'save_checkpoints_steps': None, '_environment': 'local', '_evaluation_master': '', '_is_chief': True, 'tf_random_seed': None, 'save_summary_steps': 100, 'keep_checkpoint_max': 5, '_num_ps_replicas': 0, 'save_checkpoints_secs': 1, '_master': '', '_task_id': 0, 'tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, 'keep_checkpoint_every_n_hours': 10000, '_task_type': None}
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
sepa

<tensorflow.contrib.learn.python.learn.estimators.dnn.DNNClassifier at 0x115120b38>

In [4]:
# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.ERROR)

# Rank the features
from IPython.display import display, HTML

names = df.columns.values[0:-1] # x column names
rank = perturbation_rank(classifier, x_test, y_test, names, False)
display(rank)

Unnamed: 0,name,error,importance
3,petal_w,1.166125,1.0
2,petal_l,1.145329,0.982166
1,sepal_w,0.397985,0.341289
0,sepal_l,0.154799,0.132747


### Regression Input Perturbation Ranking

In [25]:
import tensorflow as tf
import tensorflow.contrib.learn as learn
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

path = "./data/"

# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.INFO)

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])

# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')

# Encode to a 2D matrix for training
x,y = to_xy(df,'mpg')

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=42)

# Get/clear a directory to store the neural network to
model_dir = get_model_dir('mpg',True)

# Create a deep neural network with 3 hidden layers of 50, 25, 10
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x.shape[1])]
regressor = learn.DNNRegressor(
    model_dir= model_dir,
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1),
    feature_columns=feature_columns,
    hidden_units=[50, 25, 10])

# Might be needed in future versions of "TensorFlow Learn"
#classifier = learn.SKCompat(classifier) # For Sklearn compatibility

# Early stopping
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    x_test,
    y_test,
    every_n_steps=500,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=50)
    
# Fit/train neural network
regressor.fit(x_train, y_train,monitors=[validation_monitor],steps=10000)

INFO:tensorflow:Using config: {'_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x13f1b17f0>, 'save_checkpoints_steps': None, 'tf_random_seed': None, '_evaluation_master': '', '_environment': 'local', '_task_type': None, 'keep_checkpoint_every_n_hours': 10000, '_master': '', 'tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, 'keep_checkpoint_max': 5, '_is_chief': True, '_task_id': 0, 'save_checkpoints_secs': 1, '_num_ps_replicas': 0, 'save_summary_steps': 100}
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
sepa

DNNRegressor(optimizer=None, feature_columns=[_RealValuedColumn(column_name='', dimension=398, default_value=None, dtype=tf.float32, normalizer=None)], dropout=None, hidden_units=[50, 25, 10])

In [6]:
# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.ERROR)

# Rank the features
from IPython.display import display, HTML

names = df.columns.values[1:] # x column names
rank = perturbation_rank(regressor, x_test, y_test, names, True)
display(rank)

Unnamed: 0,name,error,importance
2,horsepower,25.487295,1.0
3,weight,23.591404,0.925614
5,year,13.226979,0.518964
1,displacement,10.212932,0.400707
4,acceleration,8.685765,0.340788
6,origin-1,8.395874,0.329414
7,origin-2,7.903455,0.310094
0,cylinders,7.881567,0.309235
8,origin-3,7.494909,0.294064


# The Biological Response Data Set

* [Biological Response Dataset at Kaggle](https://www.kaggle.com/c/bioresponse)
* [1st place interview for Boehringer Ingelheim Biological Response](http://blog.kaggle.com/2012/07/05/1st-place-interview-for-boehringer-ingelheim-biological-response/)

In [6]:
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from IPython.display import HTML, display

path = "./data/"

filename_train = os.path.join(path,"bio_train.csv")
filename_test = os.path.join(path,"bio_test.csv")
filename_submit = os.path.join(path,"bio_submit.csv")
df_train = pd.read_csv(filename_train,na_values=['NA','?'])
df_test = pd.read_csv(filename_test,na_values=['NA','?'])

activity_classes = encode_text_index(df_train,'Activity')

#display(df_train)

### Biological Response with Neural Network

In [38]:
import os
import pandas as pd
import tensorflow as tf
import tensorflow.contrib.learn as learn
from sklearn.model_selection import train_test_split
import tensorflow.contrib.learn as skflow
import numpy as np
import sklearn

# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.ERROR)

# Encode feature vector
x, y = to_xy(df_train,'Activity')
x_submit = df_test.as_matrix().astype(np.float32)
num_classes = len(activity_classes)

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42) 


# Get/clear a directory to store the neural network to
model_dir = get_model_dir('bio',True)

# Create a deep neural network with 4 hidden layers of [500, 250, 100, 50]
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x.shape[1])]
classifier = learn.DNNClassifier(
    model_dir= model_dir,
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=60),
    hidden_units=[500, 250, 100, 50], n_classes=num_classes, feature_columns=feature_columns)

# Might be needed in future versions of "TensorFlow Learn"
#classifier = learn.SKCompat(classifier) # For Sklearn compatibility

# Early stopping
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    x_test,
    y_test,
    every_n_steps=500,
    #metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=50)
    
# Fit/train neural network
print("Fitting/Training...")
classifier.fit(x_train, y_train,monitors=[validation_monitor],steps=10000)
print("Fitting done...")

# Give logloss error
pred = np.array(list(classifier.predict_proba(x_test, as_iterable=True)))
pred = pred[:,1]
# Clip so that min is never exactly 0, max never 1
pred = np.clip(pred,a_min=1e-6,a_max=(1-1e-6)) 
print("Validation logloss: {}".format(sklearn.metrics.log_loss(y_test,pred)))

# Evaluate success using accuracy
pred = list(classifier.predict(x_test, as_iterable=True))
score = metrics.accuracy_score(y_test, pred)
print("Validation accuracy score: {}".format(score))

# Build a submission file
pred_submit = np.array(list(classifier.predict_proba(x_submit, as_iterable=True)))
pred_submit = pred_submit[:,1]
# Clip so that min is never exactly 0, max never 1
pred = np.clip(pred,a_min=1e-6,a_max=(1-1e-6)) 
submit_df = pd.DataFrame({'MoleculeId':[x+1 for x in range(len(pred_submit))],'PredictedProbability':pred_submit})
submit_df.to_csv(filename_submit, index=False)


Fitting/Training...
Fitting done...
Validation logloss: 1.3024759304987723
Validation accuracy score: 0.7835820895522388


In [33]:
pred = np.array(list(classifier.predict_proba(x_test, as_iterable=True)))
pred = pred[:,1]


In [36]:
print(np.array(list(zip(pred,y_test))))

[[  1.00000000e+00   1.00000000e+00]
 [  9.99718249e-01   1.00000000e+00]
 [  6.02514367e-04   0.00000000e+00]
 ..., 
 [  1.00000000e+00   1.00000000e+00]
 [  1.70188039e-04   0.00000000e+00]
 [  4.79252189e-01   0.00000000e+00]]


# What Features/Columns are Important 

The following uses perturbation ranking to evaluate the neural network.

In [8]:
# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.ERROR)

# Rank the features
from IPython.display import display, HTML

names = df_train.columns.values[0:-1] # x column names
rank = perturbation_rank(classifier, x_test, y_test, names, False)
display(rank)

Unnamed: 0,name,error,importance
26,D26,1.927250,1.000000
50,D50,1.636884,0.849337
959,D959,1.636834,0.849311
1021,D1021,1.627036,0.844227
1066,D1066,1.623286,0.842281
200,D200,1.620021,0.840587
1370,D1370,1.619655,0.840397
1152,D1152,1.619063,0.840090
1187,D1187,1.618359,0.839725
978,D978,1.618107,0.839594


### Biological Response with Random Forest

In [37]:
# Random Forest

from sklearn.ensemble import RandomForestClassifier
import sklearn


x, y = to_xy(df_train,'Activity')
y = y.ravel() # Make y just a 1D array, as required by random forest
x_test = df_test.as_matrix().astype(np.float32)

rf = RandomForestClassifier(n_estimators=100)
rf.fit(x, y)
pred = rf.predict_proba(x_test)
pred = pred[:,1]
pred_insample = rf.predict_proba(x)
pred_insample = pred_insample[:,1]

submit_df = pd.DataFrame({'MoleculeId':[x+1 for x in range(len(pred))],'PredictedProbability':pred})
submit_df.to_csv(filename_submit, index=False)
print("Insample logloss: {}".format(sklearn.metrics.log_loss(y,pred_insample)))
#display(submit_df)

Insample logloss: 0.12571285789055234


# Neural Network Bagging

Neural networks will typically achieve better results when they are bagged.  Bagging a neural network is a process where the same neural network is trained over and over and the results are averaged together.

In [47]:
import numpy as np
import os
import pandas as pd
import math
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import tensorflow.contrib.learn as learn

PATH = "./data/"
SHUFFLE = False
FOLDS = 10

def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds,y_test):
        x = row[0][row[1]]
        x = max(epsilon,x)
        x = min(1-epsilon,x)
        sum+=math.log(x)
    return( (-1/len(preds))*sum)

def stretch(y):
    return (y - y.min()) / (y.max() - y.min())


def blend_ensemble(x, y, x_submit):

    folds = list(StratifiedKFold(y, FOLDS))
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x.shape[0])]

    models = [
        learn.DNNClassifier(hidden_units=[100, 50, 25, 5], n_classes=2, feature_columns=feature_columns),  # steps=1000
        learn.DNNClassifier(hidden_units=[100, 50, 25, 5], n_classes=2, feature_columns=feature_columns), # steps=500
        learn.DNNClassifier(hidden_units=[200, 100, 50, 25], n_classes=2, feature_columns=feature_columns), # steps=1000
        learn.DNNClassifier(hidden_units=[200, 100, 50, 25], n_classes=2, feature_columns=feature_columns), # steps=500
        learn.DNNClassifier(hidden_units=[50, 25, 5], n_classes=2, feature_columns=feature_columns)] #steps=500

    dataset_blend_train = np.zeros((x.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))

    for j, model in enumerate(models):
        print("Model: {} : {}".format(j, model) )
        fold_sums = np.zeros((x_submit.shape[0], len(folds)))
        total_loss = 0
        for i, (train, test) in enumerate(folds):
            x_train = x[train]
            y_train = y[train]
            x_test = x[test]
            y_test = y[test]
            model.fit(x_train, y_train,steps=10)
            pred = np.array(list(classifier.predict_proba(x_test, as_iterable=True)))
            # pred = model.predict_proba(x_test)
            dataset_blend_train[test, j] = pred[:, 1]
            pred2 = np.array(list(classifier.predict_proba(x_submit, as_iterable=True)))
            #fold_sums[:, i] = model.predict_proba(x_submit)[:, 1]
            fold_sums[:, i] = pred2[:, 1]
            loss = mlogloss(y_test, pred)
            total_loss+=loss
            print("Fold #{}: loss={}".format(i,loss))
        print("{}: Mean loss={}".format(model.__class__.__name__,total_loss/len(folds)))
        dataset_blend_test[:, j] = fold_sums.mean(1)

    print()
    print("Blending models.")
    blend = LogisticRegression()
    blend.fit(dataset_blend_train, y)
    return blend.predict_proba(dataset_blend_test)

if __name__ == '__main__':

    np.random.seed(42)  # seed to shuffle the train set

    print("Loading data...")
    filename_train = os.path.join(PATH, "bio_train.csv")
    df_train = pd.read_csv(filename_train, na_values=['NA', '?'])

    filename_submit = os.path.join(PATH, "bio_test.csv")
    df_submit = pd.read_csv(filename_submit, na_values=['NA', '?'])

    predictors = list(df_train.columns.values)
    predictors.remove('Activity')
    x = df_train.as_matrix(predictors)
    y = df_train['Activity']
    x_submit = df_submit.as_matrix()

    if SHUFFLE:
        idx = np.random.permutation(y.size)
        x = x[idx]
        y = y[idx]

    submit_data = blend_ensemble(x, y, x_submit)
    submit_data = stretch(submit_data)

    ####################
    # Build submit file
    ####################
    ids = [id+1 for id in range(submit_data.shape[0])]
    submit_filename = os.path.join(PATH, "bio_submit.csv")
    submit_df = pd.DataFrame({'MoleculeId': ids, 'PredictedProbability': submit_data[:, 1]},
                             columns=['MoleculeId','PredictedProbability'])
    submit_df.to_csv(submit_filename, index=False)

Loading data...
Model: 0 : <tensorflow.contrib.learn.python.learn.estimators.dnn.DNNClassifier object at 0x0000020598BE8630>
Fold #0: loss=0.41971067017404995
Fold #1: loss=0.33424993056806734
Fold #2: loss=0.3571527327864354
Fold #3: loss=0.2970918889468416
Fold #4: loss=0.3301752010517648
Fold #5: loss=0.3377331461482012
Fold #6: loss=0.20413984968653634
Fold #7: loss=0.33457021469956666
Fold #8: loss=0.34963052553945656
Fold #9: loss=0.4366637295322299
DNNClassifier: Mean loss=0.340111788913315
Model: 1 : <tensorflow.contrib.learn.python.learn.estimators.dnn.DNNClassifier object at 0x00000205985E7748>
Fold #0: loss=0.41971067017404995
Fold #1: loss=0.33424993056806734
Fold #2: loss=0.3571527327864354
Fold #3: loss=0.2970918889468416
Fold #4: loss=0.3301752010517648
Fold #5: loss=0.3377331461482012
Fold #6: loss=0.20413984968653634
Fold #7: loss=0.33457021469956666
Fold #8: loss=0.34963052553945656
Fold #9: loss=0.4366637295322299
DNNClassifier: Mean loss=0.340111788913315
Model: 2 :

# Neural Network Ensemble

A neural network ensemble combines neural network predictions with other models.  The exact blend of all of these models is determined by logistic regression.  The following code performs this blend for a classification.

In [51]:
import numpy as np
import os
import pandas as pd
import math
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import tensorflow.contrib.learn as learn
import tensorflow as tf

PATH = "./data/"
SHUFFLE = False
FOLDS = 10

def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds,y_test):
        x = row[0][row[1]]
        x = max(epsilon,x)
        x = min(1-epsilon,x)
        sum+=math.log(x)
    return( (-1/len(preds))*sum)

def stretch(y):
    return (y - y.min()) / (y.max() - y.min())


def blend_ensemble(x, y, x_submit):

    folds = list(StratifiedKFold(y, FOLDS))
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=x.shape[1])]

    models = [
        learn.DNNClassifier(hidden_units=[100, 50, 25, 5], n_classes=2, feature_columns=feature_columns),
        KNeighborsClassifier(n_neighbors=3),
        RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    dataset_blend_train = np.zeros((x.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))

    for j, model in enumerate(models):
        print("Model: {} : {}".format(j, model) )
        fold_sums = np.zeros((x_submit.shape[0], len(folds)))
        total_loss = 0
        for i, (train, test) in enumerate(folds):
            x_train = x[train]
            y_train = y[train]
            x_test = x[test]
            y_test = y[test]
            
            if type(model) == tf.contrib.learn.python.learn.estimators.dnn.DNNClassifier:
                model.fit(x_train, y_train,steps=10)
                pred = np.array(list(classifier.predict_proba(x_test, as_iterable=True)))
                pred2 = np.array(list(classifier.predict_proba(x_submit, as_iterable=True)))
            else:
                model.fit(x_train, y_train)
                pred = model.predict_proba(x_test)
                pred2 = model.predict_proba(x_submit)
                
            dataset_blend_train[test, j] = pred[:, 1]
            fold_sums[:, i] = pred2[:, 1]
            
            loss = mlogloss(y_test, pred)
            total_loss+=loss
            print("Fold #{}: loss={}".format(i,loss))
        print("{}: Mean loss={}".format(model.__class__.__name__,total_loss/len(folds)))
        dataset_blend_test[:, j] = fold_sums.mean(1)

    print()
    print("Blending models.")
    blend = LogisticRegression()
    blend.fit(dataset_blend_train, y)
    return blend.predict_proba(dataset_blend_test)

if __name__ == '__main__':

    np.random.seed(42)  # seed to shuffle the train set

    print("Loading data...")
    filename_train = os.path.join(PATH, "bio_train.csv")
    df_train = pd.read_csv(filename_train, na_values=['NA', '?'])

    filename_submit = os.path.join(PATH, "bio_test.csv")
    df_submit = pd.read_csv(filename_submit, na_values=['NA', '?'])

    predictors = list(df_train.columns.values)
    predictors.remove('Activity')
    x = df_train.as_matrix(predictors)
    y = df_train['Activity']
    x_submit = df_submit.as_matrix()

    if SHUFFLE:
        idx = np.random.permutation(y.size)
        x = x[idx]
        y = y[idx]

    submit_data = blend_ensemble(x, y, x_submit)
    submit_data = stretch(submit_data)

    ####################
    # Build submit file
    ####################
    ids = [id+1 for id in range(submit_data.shape[0])]
    submit_filename = os.path.join(PATH, "bio_submit.csv")
    submit_df = pd.DataFrame({'MoleculeId': ids, 'PredictedProbability': submit_data[:, 1]},
                             columns=['MoleculeId','PredictedProbability'])
    submit_df.to_csv(submit_filename, index=False)



Loading data...
Model: 0 : <tensorflow.contrib.learn.python.learn.estimators.dnn.DNNClassifier object at 0x00000205ACD40C50>
Fold #0: loss=0.41971067017404995
Fold #1: loss=0.33424993056806734
Fold #2: loss=0.3571527327864354
Fold #3: loss=0.2970918889468416
Fold #4: loss=0.3301752010517648
Fold #5: loss=0.3377331461482012
Fold #6: loss=0.20413984968653634
Fold #7: loss=0.33457021469956666
Fold #8: loss=0.34963052553945656
Fold #9: loss=0.4366637295322299
DNNClassifier: Mean loss=0.340111788913315
Model: 1 : KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
Fold #0: loss=3.606678388314123
Fold #1: loss=2.2197228940978317
Fold #2: loss=3.6717523663107237
Fold #3: loss=2.5045156203944594
Fold #4: loss=4.443553550438037
Fold #5: loss=4.410524301688227
Fold #6: loss=3.400455469543658
Fold #7: loss=3.0885474338547683
Fold #8: loss=2.1219335323249253
Fold #9: loss=3.0613772690497