# Neural Network Basics

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
!pip install tensorflow

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report, balanced_accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, Input
from tensorflow.keras.losses import SparseCategoricalCrossentropy

## Neural Networks 

In this notebook, we will go over how to train neural network models for supervised machine learning using TensorFlow. The goal is mostly to demonstrate how the code works and how one goes about building these types of models. A more detailed explanation of what is going on behind the scenes, and of the math behind the implementation, is reserved for another class.



### Prediction Example

Let's take a look at a quick example of doing some prediction. The `ncbirths` dataset has information on births in North Carolina, including information about the mother, weeks of pregnancy, and whether the baby was a low birthweight baby or not.

In [None]:
ncbirths = pd.read_csv("ncbirths.csv").dropna()
ncbirths.head()


To make some of our later assessments easier, we're going to "dummify" some of the categorical variables. For a variable with K categories, this generates K-1 columns (one for each category except the first, which serves as the reference), each with a 1 if that observation is a member of that category, or a 0 otherwise.
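
As a small illustration of what dummifying does, here is a sketch on a made-up table. The values in `toy` are hypothetical and are not taken from `ncbirths`; only the `pd.get_dummies` call mirrors the one used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not from ncbirths)
toy = pd.DataFrame({"habit": ["nonsmoker", "smoker", "nonsmoker"],
                    "gender": ["female", "male", "male"]})

# drop_first=True drops the first (alphabetical) category of each variable,
# so each two-category variable becomes a single 0/1 column
toy_dummies = pd.get_dummies(toy, columns=["habit", "gender"],
                             drop_first=True, dtype="int")
print(toy_dummies)
```

Here `habit_smoker` is 1 for smokers and 0 for the dropped reference category (`nonsmoker`), and `gender_male` works the same way.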

In [None]:
dummified = pd.get_dummies(ncbirths, columns=['lowbirthweight','racemom','habit','gender'],  
                            drop_first=True, dtype='int')

In [None]:
dummified.head()

We'll try a very simple example of predicting the low birthweight status of the baby using the number of weeks that the pregnancy lasted. If we were to take a look at the relationship with a graph, it might look like the following. Note that the outcome column is `lowbirthweight_not low`, so `1` refers to not low birthweight while `0` refers to low birthweight.

In [None]:
dummified.plot.scatter(y = 'lowbirthweight_not low', x = 'weeks')

So, how do we use `weeks` to predict the low birthweight status? Well, using a straight line to show the relationship wouldn't make sense, because the outcome can only take one of two values: `0` or `1`. So, instead, we try to create a curved function that is constrained between 0 and 1 and represents the **probability** that the outcome equals `1` at different values of `weeks`.

To do this, we'll model the relationship using a logistic or sigmoid function, which follows an S-shaped curve that is constrained between 0 and 1. You can see this function by running the commands below:

In [None]:
# use float endpoints: tf.linspace expects floating-point start and stop values
x = tf.linspace(-10.0, 10.0, 500)
x = tf.cast(x, tf.float32)
plt.plot(x, tf.math.sigmoid(x))
plt.ylim((-0.1,1.1))
plt.title("Sigmoid function");

In a logistic regression model, we attempt to find the parameters of this function that would be most likely to have generated the observed data.
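
To make "the parameters of this function" concrete, here is a minimal sketch of a logistic curve with an intercept and a slope. The values of `intercept` and `slope` are hypothetical, picked only for illustration; they are not the fitted values for `ncbirths`.

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen for illustration only (not fitted values)
intercept, slope = -20.0, 0.6

weeks = np.array([28, 32, 36, 40])
prob = sigmoid(intercept + slope * weeks)
print(prob)
```

Changing the slope makes the S-curve steeper or flatter, and changing the intercept shifts it left or right. Fitting the model amounts to searching for the pair of values that best matches the data.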

## Running the model

Just like we did in the previous class, we'll try to avoid overfitting our data by splitting up the data into a training set and an evaluation set. 

In [None]:
y_data = dummified['lowbirthweight_not low'].tolist()

x_data = dummified.drop(columns=['lowbirthweight_not low'])

# this is a random process, so you want to set a random seed!
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
                                     test_size=0.20, # share of observations held out for testing
                                     random_state=500)




Let's take a look at what happens when we fit a logistic regression model.

In [None]:
pred_cols =['weeks']

logit = LogisticRegression()
logit.fit(x_train[pred_cols], y_train)

We can see how the model is operating by looking at some predictions at different values of "weeks":

In [None]:

pred_by_week = pd.DataFrame({'weeks':x_test.weeks.sort_values().unique()})
pred_by_week['preds'] =logit.predict_proba(pred_by_week)[:,1]
fig, axes = plt.subplots(figsize=(8,6))
pred_by_week.plot.line('weeks','preds', ax = axes)


Now, we just want to use the model to predict our held-out data and then report the results. We'll use the same set of metrics we used in the previous class. First, we'll look at the confusion matrix to compare predictions to the actual labels:

In [None]:
preds = logit.predict_proba(x_test[pred_cols])[:,1]


pd.crosstab(y_test, preds>=.5,  margins=True).rename_axis(index = 'Truth', columns='Predictions')
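
The thresholding step used above can be seen on a handful of made-up values. This is only a sketch: `truth` and `probs` are hypothetical and do not come from the model.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and predicted probabilities, for illustration only
truth = np.array([0, 0, 1, 1, 1, 1])
probs = np.array([0.30, 0.70, 0.90, 0.40, 0.80, 0.60])

# Thresholding at 0.5 turns probabilities into True/False class predictions
predicted = probs >= 0.5

# Rows are the true labels, columns are the thresholded predictions
table = pd.crosstab(truth, predicted).rename_axis(index="Truth", columns="Predictions")
print(table)
```

Each cell counts the observations with that combination of true label and prediction, so the diagonal holds the correct predictions.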

And then we'll look at some summary statistics for our predictions:

In [None]:
print(classification_report(y_test, preds>=.5, 
                            # add target_names to show labels in the report:
                            target_names=['Low Birth Weight', 'Not Low Birthweight']))

# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test, preds>=.5))
print("balanced accuracy: ", balanced_accuracy_score(y_test, preds>=.5))
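
Balanced accuracy and Cohen's kappa can both be worked out by hand from a confusion matrix. The sketch below uses hypothetical counts in `cm` (not the results from this model) to show the arithmetic.

```python
import numpy as np

# Hypothetical confusion-matrix counts (rows: truth, columns: prediction)
cm = np.array([[10, 10],    # truth = 0
               [5, 75]])    # truth = 1
n = cm.sum()

# Recall per class: correct predictions divided by the true class size
recall = np.diag(cm) / cm.sum(axis=1)
balanced_accuracy = recall.mean()

# Cohen's kappa: observed agreement compared with agreement expected by chance
observed = np.trace(cm) / n
expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
kappa = (observed - expected) / (1 - expected)

print(observed, balanced_accuracy, kappa)
```

In this made-up example the plain accuracy (`observed`) is 0.85, but balanced accuracy is lower because the smaller class is only predicted correctly half of the time.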

The predictions here are ok overall. Still, while the model does well on some observations, there are lots of points that it does poorly on. That's probably because we're just using one feature (variable) to make a prediction. In reality, we might want to use many features, and we might suspect that they have a complicated set of relationships with each other. Their effect on birthweight might be curvilinear, or be different depending on the presence or absence of some other characteristic.

That's where something like neural networks might come in. The above logistic regression is an example of what one **node** in a neural network might look like. Neural networks essentially work by combining lots of these types of simple relationships to create a complex model that makes predictions. The key advantage of a neural network is that a sufficiently complex network can approximate **any functional relationship** between the predictors and the outcome that you could specify. See [here](http://neuralnetworksanddeeplearning.com/chap4.html) for a mostly visual introduction to this concept.

![Neural Network](neural_network.png)

*Source: https://towardsdatascience.com/simple-introduction-to-neural-networks-ac1d7c3d7a2c*
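
To illustrate the idea that one node looks like a logistic regression and that a network combines several of them, here is a rough sketch in plain NumPy. All inputs, weights, and biases are hypothetical numbers, and the tiny two-node network is only for illustration; it is not the model we train below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs and weights, for illustration only
x = np.array([0.5, -1.0, 2.0])           # one observation with three features

# A single node: weighted sum plus bias, passed through the sigmoid.
# This is the same functional form as a logistic regression.
w, b = np.array([0.2, -0.4, 0.1]), 0.3
single_node = sigmoid(x @ w + b)

# A small network: two hidden nodes feed one output node
W_hidden = np.array([[0.2, -0.5],
                     [-0.4, 0.3],
                     [0.1, 0.8]])        # shape (3 features, 2 nodes)
b_hidden = np.array([0.3, -0.1])
hidden = sigmoid(x @ W_hidden + b_hidden)
output = sigmoid(hidden @ np.array([1.5, -2.0]) + 0.1)

print(single_node, hidden, output)
```

The first hidden node uses the same weights as `single_node`, so it produces the same value. The output node then treats the hidden nodes' outputs as its own inputs.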

To make our neural network, we'll start by converting our data into a format that TensorFlow can use. This time we'll use `weeks` along with two additional predictors, `gender_male` and `habit_smoker`:

In [None]:
cols = ['weeks','gender_male', 'habit_smoker']


In [None]:
x_train_tensor=tf.convert_to_tensor(x_train[cols])
x_test_tensor=tf.convert_to_tensor(x_test[cols])
y_train_tensor=tf.convert_to_tensor(y_train)
y_test_tensor = tf.convert_to_tensor(y_test)


Next, we'll create a normalization layer that will rescale our variables by subtracting their mean and then dividing them by their standard deviation. Neural networks generally train more reliably when the inputs are on a similar scale, because no single feature dominates the weight updates simply due to its units.

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(x_train_tensor))

In [None]:
# check out what this does on the first few rows of data:
normalizer(x_train_tensor[:3])
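
The rescaling itself is simple arithmetic. Here is a sketch of the same idea in NumPy on a hypothetical feature matrix `X`; the Keras layer may differ in small implementation details.

```python
import numpy as np

# Hypothetical feature matrix: the first column is on a much larger scale
X = np.array([[39.0, 1.0, 0.0],
              [36.0, 0.0, 1.0],
              [41.0, 1.0, 0.0],
              [40.0, 0.0, 0.0]])

# Column-wise mean and standard deviation, computed from the data
mean = X.mean(axis=0)
std = X.std(axis=0)

# Subtract the mean and divide by the standard deviation
X_scaled = (X - mean) / std
print(X_scaled)
```

After rescaling, every column has mean 0 and standard deviation 1, so all features are on a comparable scale.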


Now, we'll build a neural network by stacking several layers in a sequence. The `normalizer` layer just takes our input data and normalizes it. Each `Dense` layer is a collection of "nodes": every node takes a weighted sum of its inputs, adds a bias term, and passes the result through an activation function (here, `relu`). The final layer `Dense(1, activation='sigmoid')` is the output layer that converts the outputs from the nodes into a prediction that follows the same functional form as the logistic regression we performed earlier.

In [None]:
input_dim = x_train_tensor.shape[1]

model = Sequential([
  # declare the input shape up front so the model is built before model.summary()
  Input(shape=(input_dim,)),
  normalizer,
  Dense(64, activation='relu'),
  Dense(32, activation='relu'),
  Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
model.summary()
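
The number of trainable parameters in each `Dense` layer can be worked out by hand. This is a minimal sketch assuming the three input features in `cols`; `dense_params` is a hypothetical helper written for this example, not a Keras function.

```python
def dense_params(n_inputs, n_nodes):
    # one weight per input for every node, plus one bias per node
    return n_inputs * n_nodes + n_nodes

n_features = 3   # weeks, gender_male, habit_smoker

counts = [dense_params(n_features, 64),   # first hidden layer
          dense_params(64, 32),           # second hidden layer
          dense_params(32, 1)]            # output layer
total = sum(counts)
print(counts, total)
```

The normalization layer stores the feature means and variances rather than trainable weights, so it should not add to this total.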

Now we'll train our model for 10 epochs. Since our classes are imbalanced, we'll also use a `class_weight` so that the low birth weight cases (class `0`) get more consideration when training the model. This usually helps the network make better predictions when there is an imbalance between groups.

In [None]:


history =  model.fit(x_train_tensor, y_train_tensor, 
                     # the number of passes over the training data. This may need to go higher, especially for complex models!
                     epochs=10,
                     validation_data = (x_test_tensor, y_test_tensor),
                     # controls the amount of output that is printed
                     verbose=1,
                     # weight class 0 (low birth weight) five times as heavily as class 1
                     class_weight =   {0:5., 1:1.}
                    )
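
Roughly speaking, a class weight multiplies each observation's contribution to the loss. The sketch below shows that idea in NumPy with hypothetical labels `y` and predicted probabilities `p`; the exact way Keras averages the weighted loss may differ.

```python
import numpy as np

# Hypothetical labels and predicted probabilities of class 1
y = np.array([0, 0, 1, 1])
p = np.array([0.6, 0.2, 0.9, 0.7])

# Per-observation binary cross-entropy
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same weights as in the fit call: class 0 counts five times as much
weights = np.where(y == 0, 5.0, 1.0)
weighted_loss = weights * loss

print(loss.mean(), weighted_loss.mean())
```

Mistakes on class `0` now cost five times as much, which pushes the optimizer to pay more attention to the rarer class.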

In [None]:
model.evaluate(x_test_tensor,  y_test_tensor, verbose=2)

Now, we'll get our predictions and compare the results:

In [None]:
preds = model.predict(x_test_tensor)


In [None]:
pd.crosstab(y_test_tensor, preds.flatten()>=.5,  margins=True).rename_axis(index = 'Truth', columns='Predictions')

In [None]:
print(classification_report(y_test_tensor, preds>=.5, 
                            # add target_names to show labels in the report:
                              target_names=['Low Birth Weight', 'Not Low Birthweight']))


# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test_tensor, preds>=.5))
print("balanced accuracy: ", balanced_accuracy_score(y_test_tensor, preds>=.5))

<h2 style="color:red;font-weight:bold">Question</h2>

<span style="color:red;font-weight:bold">See if you can improve on the accuracy of the model above. Try adding another "layer", increasing the number of nodes, or adding additional features.</span>

Is this better or worse than the prior model? Any thoughts as to why it might perform better?

### MNIST Data

Let's look at an example of applying neural network modeling to some image data. The MNIST dataset that comes with the `tensorflow` package contains images of handwritten digits (numbers). Our goal is to train a neural network that is able to accurately determine what number is written based on the data from the image. In other words, we want to build a neural network that is able to recognize numbers that have been handwritten.

The data itself is structured so that it is in a 2-dimensional format for each observation. Each observation is 28 by 28, with the values within each cell representing the intensity of the pixel. These values make up the **features**, or variables that we use to predict/classify the observation as one of the 10 numerical digits. 


In [None]:
# Load the data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale it so that the values are between 0 and 1
x_train, x_test = x_train / 255.0, x_test / 255.0

To visually see what the data look like, let's graph some of the observations. 

In [None]:
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_train[i], cmap=plt.cm.binary)
    plt.xlabel(y_train[i])

plt.show()

We specify the neural network model using `Sequential` and adding the layers in a list. 

Let's take a look at the layers one by one.
- `Flatten(input_shape=(28, 28))`: This flattens the 28 by 28 data into a 1-D format. There isn't anything being done to values at this step -- all that is happening is that the 2-D shape is being changed to a 1-D shape so that all of the same values are in a vector format.
- `Dense(128, activation='relu')` / `Dense(64, activation='relu')` : This is a dense layer, with the first argument specifying how many nodes there are. We have two dense layers in this neural network: one with 128 nodes and one with 64 nodes. You can imagine all of the features (variables) in our data feeding into every single one of the 128 nodes, and the outputs of those 128 nodes feeding into the 64 nodes in the next step. 
- `Dense(10)`: This is an **output layer**. Since we are trying to predict the image as being one of ten different categories (that is, the individual digit values from 0 to 9), we need a layer with 10 nodes. No activation is specified, so this layer outputs raw scores (logits) rather than probabilities.
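
The shapes described in the list above can be checked with a short NumPy sketch. The array `image` is a stand-in filled with counting numbers, not a real MNIST digit, and the parameter counts follow the usual "weights plus biases" rule for dense layers.

```python
import numpy as np

# Stand-in for one 28 by 28 image (values are just 0, 1, 2, ...)
image = np.arange(28 * 28).reshape(28, 28)

# Flattening keeps the same values and only changes the shape to a vector of 784
flat = image.reshape(-1)

# Sizes of the input vector and of each dense layer in the model
layer_sizes = [784, 128, 64, 10]

# Each dense layer has (inputs x nodes) weights plus one bias per node
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(flat.shape, params, sum(params))
```

Most of the parameters sit in the first dense layer, because every one of the 784 pixel values connects to every one of its 128 nodes.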

In [None]:
model = Sequential([
  Flatten(input_shape=(28, 28)),
  Dense(128, activation='relu'),
  Dense(64, activation='relu'),
  Dense(10)
])

Then, we need to compile the model, specify the loss function, and give it the metric we will use to evaluate how it is doing. 

In [None]:
model.compile(optimizer='adam',
              loss=SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
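
Because the output layer produces raw scores, the loss is told to expect logits (`from_logits=True`). The following sketch shows, in NumPy, the usual softmax and cross-entropy arithmetic behind that idea. The `logits` are hypothetical and use three classes for brevity, where MNIST would have ten.

```python
import numpy as np

# Hypothetical logits for one image over three classes
logits = np.array([2.0, 1.0, 0.1])
label = 0   # integer ("sparse") label of the true class

# Softmax converts logits to probabilities that sum to 1
# (subtracting the max first is a common trick for numerical stability)
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

# Cross-entropy: negative log-probability assigned to the true class
loss = -np.log(probs[label])
print(probs, loss)
```

The loss is small when the true class gets a high probability and grows as that probability shrinks.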

Finally, we fit the model by giving it our data. Since we are using the training set to build our model, we give it the x and y data from the training set. We also set the batch size to 32 and the number of epochs to 10. The **batch size** refers to how many observations are used for each update of the model's weights. An **epoch** is one full pass of the training data through the neural network, so 10 epochs means the full data is sent through 10 times.

In [None]:
model.fit(x_train, y_train, batch_size = 32, epochs=10)
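
Batch size and epochs together determine how many weight updates happen during training. A quick sketch of the arithmetic, using the 60,000 images in the standard MNIST training set:

```python
import math

n_train = 60000      # size of the standard MNIST training set
batch_size = 32
epochs = 10

# Each epoch processes the data in batches; the last batch may be smaller
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```

This is why the progress bar during training counts up to 1875 in each epoch rather than to 60,000.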

## Evaluation

Now, let's take a look at how this would do on new data. We can use the `evaluate` method to apply our trained model to the test set and see how accurate it actually is.

In [None]:
model.evaluate(x_test,  y_test, verbose=2)

In [None]:
predictions = model.predict(x_test)
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_test[i], cmap=plt.cm.binary)
    if np.argmax(predictions[i]) == y_test[i]:
        color = 'green'
    else:
        color = 'red'
    plt.xlabel(f'Predicted: {np.argmax(predictions[i])}, Actual: {y_test[i]}', color = color)

plt.show()