# Artificial Neural Networks with Sci-kit Learn

## The Gist of Neural Nets

A neural network is a supervised classification algorithm. With your help, it kind of teaches itself how to make better classifications.

For a basic neural net, you have three primary components: an input layer, a hidden layer, and an output layer, each consisting of nodes. The nodes of the input layer are basically your input variables; the nodes of the hidden layer are neurons that contain some function that operates on your input data; and there is one output node, which uses a function on the values given by the hidden layer, and then there is a final output given by this calculation. If this isn't making much sense yet, don't worry. 

Each node is connected to every other node in the layers in front of it, so in other words, your input nodes aren't connected to each other, but they will be connected to every node in the hidden layer, and every node in the hidden layer will be connected to the output node.

![basic neural net](http://www.texample.net/media/tikz/examples/PNG/neural-network.png)

The gray lines connecting input nodes to neurons (nodes in the hidden layer) are all weighted. These weights are some value between 0 and 1, and will be multiplied with whatever the input value is. Any node in the hidden layer — let's say $Node_A$ — will essentially have two functions; a combination function and an activation function. The combination function will likely take the summation of all of the input nodes times their respective weights. Where W is weight and X is input:

**Summation function: **

$Net_A = \sum W_{iA}X_{iA} = W_{0A}(1) + W_{1A}X_{1A} + W_{2A}X_{2A} + W_{3A}X_{3A}$

This is basically saying that for the first node in the hidden layer (which we've called $Node_A$), every connection to it will be summed up. So the first input and its weight is denoted $W_{1A}X_{1A}$. The second input that connects to $Node_A$ and its weight is denoted $W_{2A}X_{2A}$, and so forth. The first term $W_{0A}$ will always be constant ```1```, where this term is a constant factor, much like the intercept in regression models.

If we make up some inputs and weights, the equation will look something like this:

$ Net_A = (1)(0.5) + (0.4)(0.6) + (0.2)(0.8) + (0.7)(0.6) = 1.32$

The resulting ```1.32``` would then be input into the activation function (likely sigmoid).

**Sigmoid function: **

$y = \frac{1}{1 + e^{-x}}$

$y = \frac{1}{1 + e^{-1.32}} = 0.7892$

This value is then given to the output node, $Node_Z$. $Node_Z$ then combines these outputs from Nodes A, B, etc. into a weighted sum (using the weights associated with the connections of these nodes). Now, $X_i$ is treated as the outputs from each node in the hidden layer, and the formula from above is used again.

$Net_Z = \sum W_{iZ} = W_{0Z} + W_{AZ}X_{AZ} + W_{BZ}X_{BZ}$

The sigmoid is used again on the output of $Net_Z$, producing the true output value of the neural network's first run. Then it's run again and again for however many data points have been defined.

The weights are what make and break the accuracy of the entire neural network. When the NN is initialized, these weights will be randomized. The neural net then operates and creates its output value, and this value is matched against what it *should* be. The error is taken, and then the neural net uses some user-defined method to go back through the net to adjust the weights so that the accuracy is maximized, and the error is minimized.

I don't really expect anybody to fully grasp what is happening from these simple descriptions. At the bottom of the notebook, I will link some other resources that are useful.

In [1]:
import numpy as np
import pandas as pd
import tabulate

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.neural_network import MLPClassifier #you will probably need to update sklearn/conda
from sklearn.model_selection import train_test_split

from IPython.display import display, HTML
pd.set_option('display.notebook_repr_html', True)

In [2]:
df = pd.read_csv("Clem3Training.txt")

In [3]:
display(df.head())

Unnamed: 0,age,workclass,demogweight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K.
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K.
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K.
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K.
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K.


## Data Preparation

In [5]:
#Creates two columns that will be used as their categorical counterparts
df['marital-status-cats'] = df['marital-status'].copy()
df['workclass-cats'] = df['workclass'].copy()

#This dictionary is interpreted as; in column of df, the key will be replaced by the value
category_replacement = {'marital-status-cats' : {'Married-civ-spouse': 'y', 'Married-AF-spouse': 'y', 'Married-spouse-absent': 'y',
                                                'Divorced': 'n', 'Widowed': 'n', 'Separated': 'n', 'Never-married': 'n'},
                        'workclass-cats': {'Federal-gov': 'Gov', 'Local-gov': 'Gov', 'State-gov': 'Gov', 'Self-emp-inc': 'Self',
                                           'Self-emp-not-inc': 'Self'}}
#Reduces the number of categories
df.replace(category_replacement, inplace=True)

print(df['workclass-cats'].unique())
print(df['marital-status-cats'].unique())
print(df['race'].unique())

print(df.race.value_counts())
print(df['marital-status-cats'].value_counts())
print(df['workclass-cats'].value_counts())

['Gov' 'Self' 'Private' '?' 'Without-pay' 'Never-worked']
['n' 'y']
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']
White                 21391
Black                  2379
Asian-Pac-Islander      775
Amer-Indian-Eskimo      241
Other                   214
Name: race, dtype: int64
n    13215
y    11785
Name: marital-status-cats, dtype: int64
Private         17385
Gov              3367
Self             2835
?                1399
Without-pay         9
Never-worked        5
Name: workclass-cats, dtype: int64


### Encoding for Categorical Variables

In [6]:
#####################################

##CODE BLOCK FOR VARIABLE ENCODINGS##

#####################################

#Encoding Income
enc = LabelEncoder()

label_encoder = enc.fit(df['income'])
print ("Categorical classes:", label_encoder.classes_)

integer_classes = label_encoder.transform(label_encoder.classes_)
print ("Integer classes:", integer_classes)


#Encoding Marital-Status
label_encoder = enc.fit(df['marital-status-cats'])
integer_classes = label_encoder.transform(label_encoder.classes_)
df['marital-encoded'] = label_encoder.transform(df['marital-status-cats'])

#Encoding race
label_encoder = enc.fit(df['race'])
integer_classes = label_encoder.transform(label_encoder.classes_)
df['race-encoded'] = label_encoder.transform(df['race'])

#Encoding sex
label_encoder = enc.fit(df['sex'])
integer_classes = label_encoder.transform(label_encoder.classes_)
df['sex-encoded'] = label_encoder.transform(df['sex'])

#Encoding workclass
label_encoder = enc.fit(df['workclass-cats'])
integer_classes = label_encoder.transform(label_encoder.classes_)
df['workclass-encoded'] = label_encoder.transform(df['workclass-cats'])

Categorical classes: ['<=50K.' '>50K.']
Integer classes: [0 1]


Note: I should perhaps use OneHotEncoder on some of these variables that have more than two unique values. When there is a range [0,4], categories that are far from each other could be misunderstood by our model.

### Min-Max Standardization for Continuous Variables

In [7]:
##########################################

##CODE BLOCK FOR MIN-MAX TRANSFORMATIONS##

##########################################

#Standardizing age so numeric values aren't misrepresented in calculations
df['age_mm'] = (df['age'] - (df['age'].min()) / (df['age'].max() - df['age'].min()))

df['education-num_mm'] = (df['education-num'] - (df['education-num'].min()) / (df['education-num'].max() - df['education-num'].min()))
df['capital-gain_mm'] = (df['capital-gain'] - (df['capital-gain'].min()) / (df['capital-gain'].max() - df['capital-gain'].min()))
df['capital-loss_mm'] = (df['capital-loss'] - (df['capital-loss'].min()) / (df['capital-loss'].max() - df['capital-loss'].min()))
df['hours-per-week_mm'] = (df['hours-per-week'] - (df['hours-per-week'].min()) / (df['hours-per-week'].max() - df['hours-per-week'].min()))

In [8]:
display(df.head())

Unnamed: 0,age,workclass,demogweight,education,education-num,marital-status,occupation,relationship,race,sex,...,workclass-cats,marital-encoded,race-encoded,sex-encoded,workclass-encoded,age_mm,education-num_mm,capital-gain_mm,capital-loss_mm,hours-per-week_mm
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,...,Gov,0,4,1,1,38.767123,12.933333,2174.0,0.0,39.989796
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,Self,1,4,1,4,49.767123,12.933333,0.0,0.0,12.989796
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,...,Private,0,4,1,3,37.767123,8.933333,0.0,0.0,39.989796
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,...,Private,1,2,1,3,52.767123,6.933333,0.0,0.0,39.989796
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,...,Private,1,2,0,3,27.767123,12.933333,0.0,0.0,39.989796


In [9]:
#A little bit of dataframe tidying

#Dropping unnecessary columns
to_drop = ['age', 'hours-per-week', 'capital-loss', 'capital-gain', 'education-num', 'demogweight', 'education', 'relationship', 'native-country', 'marital-status', 'marital-status-cats', 'workclass-cats', 'workclass', 'occupation', 'sex', 'race']
df = df.drop(to_drop, axis = 1)

#Reordering the columns to make it easier to use model_selection function
cols = ['income', 'race-encoded', 'sex-encoded', 'capital-gain_mm', 'capital-loss_mm', 'education-num_mm', 'age_mm','hours-per-week_mm', 'marital-encoded', 'workclass-encoded']
df = df[cols]
display(df.head())

Unnamed: 0,income,race-encoded,sex-encoded,capital-gain_mm,capital-loss_mm,education-num_mm,age_mm,hours-per-week_mm,marital-encoded,workclass-encoded
0,<=50K.,4,1,2174.0,0.0,12.933333,38.767123,39.989796,0,1
1,<=50K.,4,1,0.0,0.0,12.933333,49.767123,12.989796,1,4
2,<=50K.,4,1,0.0,0.0,8.933333,37.767123,39.989796,0,3
3,<=50K.,2,1,0.0,0.0,6.933333,52.767123,39.989796,1,3
4,<=50K.,2,0,0.0,0.0,12.933333,27.767123,39.989796,1,3


### Partitioning the Data into Training and Test Sets

In [10]:
df_x = df.iloc[:,1:] #All of the input variables, from race-ended onward
df_y = df.iloc[:, 0] #The target variable, income

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size = .2, random_state = 1)

The data must be partitioned into training and test sets because neural networks are a supervised learning method. You have to feed the model pre-classified data (the training set), and then its classifications are judged on how well they predict the test data. The ```train_test_split``` function makes this super convenient.

## Building our Neural Net

In [11]:
nn = MLPClassifier(activation = 'logistic', solver = 'sgd', hidden_layer_sizes = (7,), max_iter = 2000, random_state = 3)
%time nn.fit(x_train, y_train)
print("Neural net accuracy: " + str(nn.score(x_test, y_test, sample_weight=None)))

Wall time: 656 ms
Neural net accuracy: 0.781


### Neural Net Parameters

1. ```random_state``` ensures that there is some consistency in sampling every time you run the neural net (which is useful because I'm giving multiple examples here). 

2. ```max_iter``` being set to 2000 ensures that I can run neural nets with many hidden layers, each with many neurons, otherwise my examples below might cause an iteration error. 

3. Logistic ```activation``` is saying that the NN uses a sigmoid activation function. 

4. ```solver``` determines how the algorithm is going to go through the neural net to adjust the weights (for the sake of minimizing error and increasing accuracy), and for this exercise I've used 'sgd' or 'stochastic gradient descent' because it's a quicker method than the standard gradient descent. The standard descent goes through every data item, while its stochastic counterpart uses a random sample.

5. ```hidden_layer_sizes``` is a beast deserving of its own section.

  * **Number of hidden layers:** Looking at hidden_layer_size in the table below, you may see one number, e.g. the first column is (5, ). Some of the values hold two numbers (5, 5), or more. If you see one number, that means there is one hidden layer. So (5, ) represents a single hidden layer, while (5, 5) represents two hidden layers, and so forth. 

  * **Number of neurons in the hidden layer:** The actual number that you're seeing (like 5) is how many neurons sit within that hidden layer. In the table below, in the first column, there are 5 neurons in the hidden layer. In the second column, there are 5 neurons in the first hidden layer, and 5 neurons in the second hidden layer. Go to the last column, there are 100 neurons in the first hidden layer, 100 in the second hidden layer, and 100 in the third hidden layer. Though I've used consistent numbers throughout each hidden layer, you could just as well have variations, like (15, 10).

In [12]:
table = [["hidden_layer_size", "(5, )", "(5, 5)", "(15, 15)", "(20, 20)", "(100, 100)", "(20, 20, 20, 20)", "(60, 60, 60, 60)", "(100, 100, 100)"],
         ["Processing time", "1.12 s", "5.39 s", "2.92 s", "3.95 s", "29.8 s", "458 ms", "710 ms", "18.4 s"],
         ["Accuracy of neural net", 0.7630, 0.7630, 0.7788, 0.7788, 0.8106, 0.7630, 0.763, 0.7788]]
         
display(HTML(tabulate.tabulate(table, tablefmt='html')))

0,1,2,3,4,5,6,7,8
hidden_layer_size,"(5, )","(5, 5)","(15, 15)","(20, 20)","(100, 100)","(20, 20, 20, 20)","(60, 60, 60, 60)","(100, 100, 100)"
Processing time,1.12 s,5.39 s,2.92 s,3.95 s,29.8 s,458 ms,710 ms,18.4 s
Accuracy of neural net,0.763,0.763,0.7788,0.7788,0.8106,0.763,0.763,0.7788


### An Optimum Number of Layers and Nodes

Looking to the table above, you see some peculiarities:

1. (5, 5) takes longer to compute than (20, 20) and even (60, 60, 60, 60)
2. Accuracy dramatically drops when going from (100, 100) to (100, 100, 100)

There are more, but the second interestingly points us to some important rules when determining the number of layers and nodes to use in the model. Increasing the number of hidden layers beyond 2 is arbitrary and decreases the power of back propagation (the algorithm that helps neural networks shine, by going backwards and altering input weights for increased accuracy). As for speed, I'm not quite sure why that happens, but a guess would be that when you have more hidden layers, the libraries do better in utilizing more CPU cores than less. If you use other libraries like Tensorflow, you can opt to use the GPU to run the network instead.

[jj\_](https://stats.stackexchange.com/a/180052/163011) from StackExchange shares Jeff Heaton's criteria for choosing how many neurons to use:

**How many hidden nodes/neurons should I use?**


>The number of hidden neurons should be between the size of the input layer and the size of the output layer.

>The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.

>The number of hidden neurons should be less than twice the size of the input layer.


We can calculate these easily—visually, even, by looking at our dataframe header and counting the input variables. From 'race-encoded' to 'workclass-encoded', we have 9 input nodes. Neural nets always have one output node. Using the second criterion, (9) * (2 / 3) + 1 = 7. All of the listed criteria are fulfilled.

Shown above, using one hidden layer and 7 neurons in that layer, we get an accuracy of 0.781. It might be tempting to hack at the neural net to try to get a higher accuracy, but you would be lying to yourself. It isn't healthy to have the neural net learn the training dataset *too* completely, because then you have a neural net that is really close to the heart of your dataset, but can't be generalized to new data. This is known as **overfitting.**

#### Resources:

[Sklearn Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) MLPClassifier is one of Sklearn's neural network models, in which MLP stands for multi-layer perceptron.

[Understanding Neural Networks with Tensorflow Playground](https://cloud.google.com/blog/big-data/2016/07/understanding-neural-networks-with-tensorflow-playground) This is an awesome resource for gaining an intuition about how neural nets work. You can play with their model, adding and subtracting hidden layers and neurons, to see how the data's dimensions are reduced, and how its values are transformed in space. Visualizing what the sigmoid function is actually doing is super helpful.

[Visualizing Representations](https://colah.github.io/posts/2015-01-Visualizing-Representations/) This set of visualizations is linked in the previous resource but in case anyone glosses over it, it's also helpful in that there are real-world examples regarding language and textual data.