# Lab Five: Wide and Deep Network Architectures

In this lab, you will select a prediction task to perform on your dataset, evaluate two different deep learning architectures and tune hyper-parameters for each architecture. If any part of the assignment is not clear, ask the instructor to clarify. 

This report is worth 10% of the final grade. Please upload a report (<b>one per team</b>) with all code used, visualizations, and text in a rendered Jupyter notebook. For any visualizations that cannot be embedded in the notebook, please provide screenshots of the output. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.

<b>Dataset Selection</b>

Select a dataset as you did in lab one; that is, the dataset must be tabular data. In terms of generalization performance, it is helpful to have a large dataset for building a wide and deep network. It is also helpful to have many different categorical features to create the embeddings and cross-product embeddings. It is fine to perform binary classification, multi-class classification, or regression.

## Preparation (4 pts)

<ul>
    <li>[<b>1 point</b>] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).</li>
    <li>[<b>1 point</b>] Identify groups of features in your data that should be combined into cross-product features. Provide justification for why these features should be crossed (or why some features should not be crossed).</li>
    <li>[<b>1 point</b>] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a <b>detailed argument for why this (these) metric(s) are appropriate on your data</b>. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.</li>
    <li>[<b>1 point</b>] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). <b>Explain why your chosen method is appropriate or use more than one method as appropriate</b>. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice.</li>
</ul>

### Dataset Preparation & Preprocessing

In [None]:
# Importing packages and reading in dataset
import numpy as np
import pandas as pd

print('Pandas:', pd.__version__)
print('Numpy:',  np.__version__)

In [None]:
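# read in csv file
merc_info = pd.read_csv('merc.csv')

# shuffle data (fixed seed so the notebook is reproducible)
merc_info = merc_info.sample(frac=1, random_state=123)

# no null values, therefore do not need to impute any data
# (show_counts replaces null_counts, which was removed in pandas 2.0;
#  info() prints directly, so no print() wrapper is needed)
merc_info.info(show_counts=True)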

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

merc_info["log_price"] = np.log(merc_info["price"])
del merc_info["price"]

merc_info["log_mileage"] = np.log(merc_info["mileage"])
del merc_info["mileage"]

# define objects that can encode each variable as integer
categorical = ["model", "transmission", "fuelType"]
encoders = dict()

# train all encoders
for cat in categorical:
    # integer encoded variables
    encoders[cat] = LabelEncoder() # save the encoder
    merc_info[cat+'_int'] = encoders[cat].fit_transform(merc_info[cat])

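# scale the numeric, continuous variables (log_price is the regression target)
numerical = ["year", "log_price", "log_mileage", "tax", "mpg", "engineSize"]

for num in numerical:
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    merc_info[num] = merc_info[num].astype(float)
    ss = StandardScaler()
    merc_info[num] = ss.fit_transform(merc_info[num].values.reshape(-1, 1)).ravel()
    
categorical_headers_ints = [x+'_int' for x in categorical]
# keep the target out of the feature list so it cannot leak into the model inputs
numerical_features = [x for x in numerical if x != "log_price"]
feature_columns = categorical_headers_ints + numerical_features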

# Define features and target
X = merc_info[feature_columns]
y = merc_info["log_price"]

merc_info.info(show_counts=True)  # null_counts was removed in pandas 2.0
merc_info

#### Explanation of Final Data Set
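The three categorical variables (`model`, `transmission`, `fuelType`) are integer-encoded with `LabelEncoder` into new `_int` columns so that they can feed embedding layers. The numeric variables `year`, `tax`, `mpg`, and `engineSize` are standardized with `StandardScaler`. `price` and `mileage` are replaced by their natural logs (`log_price`, `log_mileage`), which are standardized as well. `log_price` is the regression target.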
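
Because the target is a standardized log of price, predictions come out in standardized log units rather than in currency. A minimal sketch of mapping such values back to prices, using made-up prices and NumPy only (the names `mu`, `sigma`, and `to_price` are illustrative and do not appear elsewhere in this notebook):

```python
import numpy as np

# Hypothetical prices, for illustration only (not taken from merc.csv)
prices = np.array([12000.0, 18500.0, 25000.0, 40000.0])

# Forward transform, mirroring the preprocessing cell: log, then standardize
log_price = np.log(prices)
mu = log_price.mean()
sigma = log_price.std()  # population std (ddof=0), which StandardScaler also uses
z = (log_price - mu) / sigma

def to_price(z_pred, mu, sigma):
    """Undo the standardization, then undo the log."""
    return np.exp(z_pred * sigma + mu)

recovered = to_price(z, mu, sigma)
print(np.allclose(recovered, prices))
```

In this notebook, that would mean keeping the scaler fitted on `log_price` (for example in a dict keyed by column name), since the preprocessing loop reassigns `ss` on every iteration.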

### Split Dataset into Training and Testing

In [None]:
from sklearn.model_selection import train_test_split

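# merc_info is passed whole so the original string columns stay available for crossing later
X_train, X_test, y_train, y_test = train_test_split(merc_info, y, test_size=0.2, random_state=123)

# the split already returns DataFrames; copy them so new columns can be added safely
X_train_df = X_train.copy()
X_test_df = X_test.copy()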

#### Explanation of Dataset Split
Our dataset is fairly large, with about 13,100 instances. Therefore, we can split it into training and testing data at an 80-20 ratio and still keep roughly 2,600 instances for testing.
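
A single 80-20 split gives only one estimate of test error. The assignment also asks for a cross validation method when comparing architectures, so here is a minimal sketch of 5-fold splitting with scikit-learn's `KFold` on synthetic data (the array `X_toy` is a placeholder, not the Mercedes data, and 5 folds is an assumption chosen to match the 80-20 ratio):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the feature matrix (100 rows, 3 features)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(100, 3))

kf = KFold(n_splits=5, shuffle=True, random_state=123)

fold_sizes = []
seen = []
for train_idx, val_idx in kf.split(X_toy):
    # each fold trains on 80% of the rows and validates on the remaining 20%
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen.extend(val_idx.tolist())

print(fold_sizes)
```

Every row lands in exactly one validation fold, so the per-fold scores can be averaged into one estimate.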

### Embedding Categorical Data

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.layers import Embedding, Flatten, concatenate
from tensorflow.keras.models import Model

X_train = X_train_df[categorical_headers_ints].to_numpy() 
X_test = X_test_df[categorical_headers_ints].to_numpy() 

embed_branches = []
all_branch_outputs = [] # this is where we will keep track of output of each branch

# feed in the entire matrix of categorical variables
input_branch = Input(shape=(X_train.shape[1],), 
                     dtype='int64', 
                     name='categorical')

# for each categorical variable
for idx,col in enumerate(categorical_headers_ints):
    
    # what is the maximum integer value for this variable?
    # which is the same as the number of categories
    N = max(X_train_df[col].max(),X_test_df[col].max())+1 
    
    # this line of code does this: input_branch[:,idx]
    x = tf.gather(input_branch, idx, axis=1)
    
    # now use an embedding to deal with integers as if they were one hot encoded
    x = Embedding(input_dim=N, 
                  output_dim=int(np.sqrt(N)), 
                  input_length=1, name=col+'_embed')(x)
    
    # save these outputs in list to concatenate later
    all_branch_outputs.append(x)
    
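# now concatenate the outputs and add a fully connected layer
final_branch = concatenate(all_branch_outputs, name='concat_1')
# linear output: the target (standardized log_price) is continuous and can be
# negative, so a sigmoid, which is bounded to (0, 1), cannot represent it
final_branch = Dense(units=1,
                     activation='linear', 
                     name='combined')(final_branch)

model = Model(inputs=input_branch, outputs=final_branch)

# accuracy is not meaningful for regression; track mean absolute error instead
model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=['mean_absolute_error'])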

model.summary()
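
The comment in the cell above says an embedding treats integers "as if they were one hot encoded". A minimal NumPy sketch of that equivalence, with a random matrix `W` standing in for learned embedding weights (all names and sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 9                          # hypothetical number of categories
dim = int(np.sqrt(N))          # same sizing rule as the cell above, gives 3
W = rng.normal(size=(N, dim))  # stand-in for a learned embedding matrix

codes = np.array([0, 4, 8, 4])  # integer-encoded category values

# embedding lookup: select rows of W by index
looked_up = W[codes]

# equivalent dense computation: one-hot encode, then multiply by W
one_hot = np.eye(N)[codes]
via_matmul = one_hot @ W

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

The lookup avoids materializing the one-hot matrix, which matters when a variable such as `model` has many categories.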

### Making Crossed Columns

In [None]:
cross_columns = [['model', 'fuelType'],
                 ['transmission','fuelType'],
                 ['model','transmission']
                ]

# cross each set of columns in the list above
cross_col_df_names = []
for cols_list in cross_columns:
    # encode as ints for the embedding
    enc = LabelEncoder()
    
    # 1. create crossed labels by join operation
    X_crossed_train = X_train_df[cols_list].apply(lambda x: '_'.join(x), axis=1)
    X_crossed_test = X_test_df[cols_list].apply(lambda x: '_'.join(x), axis=1)
    
    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)
    
    # 2. encode as integers
    enc.fit(np.hstack((X_crossed_train.to_numpy(),  X_crossed_test.to_numpy())))
    
    # 3. Save into dataframe with new name
    X_train_df[cross_col_name] = enc.transform(X_crossed_train)
    X_test_df[cross_col_name] = enc.transform(X_crossed_test)
    
    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name)     
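
To make the crossing step concrete, here is a small sketch on three made-up rows (the values are illustrative and not taken from `merc.csv`). It joins two string columns into one crossed label and then integer-encodes that label, as the loop above does for each pair:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows, for illustration only
toy = pd.DataFrame({
    "model": ["A Class", "C Class", "A Class"],
    "fuelType": ["Petrol", "Diesel", "Diesel"],
})

# 1. join the two string columns into one crossed label per row
crossed = toy["model"] + "_" + toy["fuelType"]

# 2. integer-encode the crossed labels (LabelEncoder sorts the classes)
enc = LabelEncoder()
crossed_int = enc.fit_transform(crossed)

print(list(enc.classes_))
print(crossed_int.tolist())
```

Each distinct combination gets its own integer, so the wide branch can memorize interactions such as a specific model paired with a specific fuel type.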

#### Explanation of Evaluation Metrics
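We use mean squared error (MSE) because this is a regression task: the target, `log_price`, is continuous, and MSE penalizes large errors more heavily than small ones.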
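
As a worked example of the metric, the sketch below computes MSE by hand on four made-up values, with RMSE and mean absolute error alongside for comparison (the arrays are illustrative, not model output):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

errors = y_pred - y_true          # [0.5, 0.0, -1.0, 0.5]
mse = np.mean(errors ** 2)        # (0.25 + 0 + 1 + 0.25) / 4
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(errors))     # (0.5 + 0 + 1 + 0.5) / 4

print(mse, rmse, mae)
```

The single error of 1.0 contributes two thirds of the MSE but only half of the MAE, which shows how squaring weights large misses more heavily.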

## Modeling (5 pts)

<ul>
    <li>[<b>2 points</b>] Create at least three combined wide and deep networks to classify your data using Keras. Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations. Note: use the "history" return parameter that is part of Keras "fit" function to easily access this data. </li> 
    <li>[<b>2 points</b>] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two different number of layers. Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to select the number of layers that performs superiorly. </li>
    <li>[<b>1 point</b>] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). For classification tasks, use the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations. Use a proper statistical method to compare the performance of different models.</li>
</ul>
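
For the regression comparison, a Bland-Altman analysis plots the difference between predicted and true values against their mean, and marks the mean difference (bias) and the limits of agreement at bias ± 1.96 standard deviations. A minimal numeric sketch on made-up values (the plotting call is omitted and all names are illustrative):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

means = (y_pred + y_true) / 2.0   # x-axis of the Bland-Altman plot
diffs = y_pred - y_true           # y-axis of the Bland-Altman plot

bias = diffs.mean()               # systematic over- or under-prediction
sd = diffs.std(ddof=1)            # sample standard deviation of the differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

residual_variance = diffs.var(ddof=1)

print(bias, lower, upper, residual_variance)
```

Computing these quantities for the wide and deep network and for the MLP on the same test rows gives two biases and two residual variances that can then be compared.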

## Exceptional Work (1 pt)

<ul>
    <li> 5000 students: You have free rein to provide additional analyses. </li> 
    <li>One idea (<b>required for 7000 level students</b>): Capture the embedding weights from the deep network and (<b>if needed</b>) perform dimensionality reduction on the output of these embedding layers (<b>only if needed</b>). That is, pass the observations into the network, save the embedded weights (called embeddings), and then perform dimensionality reduction in order to visualize results. Visualize and explain any clusters in the data. </li>
</ul>