# Neural Networks

In [None]:
%pip install keras_nlp
%pip install tensorflow_datasets
%pip install transformers
%pip install tensorflow-hub
%pip install scikeras

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report, balanced_accuracy_score

We previously used methods like linear and logistic regression for supervised machine learning. You can think of a logistic regression model as a kind of network that takes a weighted sum of the inputs and then applies a non-linear transformation to produce a predicted probability of y = 1. In neural network parlance, you'll see this kind of transformation referred to as an [activation function](https://en.wikipedia.org/wiki/Activation_function#Table_of_activation_functions).

<div style='display: block;margin-left: auto;margin-right: auto; width: 50%;'>
<img src="logit_network_1.png" alt="drawing" width="600"/>
</div>
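To make the "weighted sum plus activation" idea concrete, here is a minimal sketch of a logistic regression prediction written out by hand with NumPy. The weights, intercept, and input values are made up purely for illustration; they are not estimated from any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs and weights, chosen only for illustration
x = np.array([1.0, 2.0])      # two input features, x_1 and x_2
w = np.array([0.5, -0.25])    # one weight per feature
b = 0.0                       # intercept (bias) term

z = np.dot(w, x) + b          # weighted sum: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
p = sigmoid(z)                # predicted probability that y = 1

print(z, p)                   # prints: 0.0 0.5
```

A weighted sum of exactly zero lands on the midpoint of the sigmoid curve, so the predicted probability is 0.5.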

But what happens if the effect of $x_1$ depends on the value of $x_2$? Or what if the effect of $x_1$ is curvilinear? Or what if $x_1$ and $x_2$ have an interactive effect on the outcome? Because the logistic regression model applies its activation function to a simple weighted sum of the inputs, unless we explicitly add terms like $x_1 \times x_2$ or $x_1^2$ as features, we'll fail to capture any of these effects, which might lead us to some seriously flawed predictions. This problem becomes even more daunting when we consider models like the ones we used for text analysis, where we have thousands of predictors with all sorts of complicated relationships. This sort of problem is where neural networks can have a real advantage over simpler approaches.

Like a logistic regression model, a neural network has an input layer and an output layer. Unlike a logistic regression, a neural network also includes one or more "hidden" layers. Each node in a hidden layer learns its own set of weights, passes the weighted sum of its inputs through an activation function, and then sends the result on to the next layer, until the output layer generates a prediction:

<div style='display: block;margin-left: auto;margin-right: auto; width: 50%;'>
<figure>
    <img src="neural_network_1.png" alt="drawing" width="600"/>
</figure>
</div>
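As a rough sketch of what happens inside a network like the one pictured, the code below runs a single forward pass through one hidden layer by hand. The layer sizes, the random weights, and the choice of a ReLU activation for the hidden layer are all assumptions for illustration; a real network would learn its weights from data.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keeps positive values, sets negatives to 0."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden = 2, 3

# Randomly initialized (untrained) weights and zero biases
W1 = rng.normal(size=(n_features, n_hidden))   # input  -> hidden
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))            # hidden -> output
b2 = np.zeros(1)

def forward(X):
    hidden = relu(X @ W1 + b1)         # weighted sums, then hidden activation
    return sigmoid(hidden @ W2 + b2)   # weighted sum, then output activation

X = np.array([[1.0, 2.0],
              [-0.5, 0.3]])            # two made-up observations
probs = forward(X)
print(probs.shape)                     # (2, 1): one probability per observation
```

Note that the output layer is just a logistic regression applied to the hidden layer's values instead of the raw inputs.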

Including these hidden layers allows a neural network to model all sorts of complexity. In fact, a sufficiently complex network can theoretically approximate **[any continuous functional relationship](https://en.wikipedia.org/wiki/Universal_approximation_theorem)** between the predictors and the outcome that you could come up with. (See [here](http://neuralnetworksanddeeplearning.com/chap4.html) for a mostly visual introduction to this concept.) This ability to model non-linear relationships is the key advantage of neural networks, and is the central reason they're so useful when modeling really complex problems like classifying images or generating text.

## Building a model

Let's take a look at a quick example of doing some prediction. The `ncbirths` dataset has information on births in North Carolina, including information about the mother and baby. Our goal here will be to predict `lowbirthweight` status using some of the variables below.


| Variable       | Description                                                                      |
|----------------|----------------------------------------------------------------------------------|
| fage           | Father's age                                                                     |
| mage           | Mother's age                                                                     |
| mature         | Maturity status of mother                                                        |
| weeks          | Weeks of gestation                                                               |
| premie         | Whether the baby was born prematurely (<37 weeks)                                |
| visits         | Number of hospital visits during pregnancy                                       |
| marital        | Mother's marital status                                                          |
| racemom        | Mother's race                                                                    |
| hispmom        | Hispanic origin of mother.                                                       |
| gained         | Weight gained during pregnancy                                                   |
| weight         | Baby's weight                                                                    |
| lowbirthweight | Whether the baby was below 2500 grams                                            |
| gender         | Baby's gender                                                                    |
| habit          | Whether the mother smoked cigarettes                                             |

In [None]:
ncbirths = pd.read_csv("ncbirths.csv").dropna()
ncbirths.head()


We'll select some features to use in our model. For this analysis, we'll use the ages of both parents, the number of doctor visits, the number of weeks of gestation, and whether the mother smoked cigarettes.

In [None]:
features = ['mage', 'fage', 'weeks', 'visits', 'habit']
labels = 'lowbirthweight'

X_train, X_test, y_train, y_test = train_test_split(ncbirths[features], 
                                                    ncbirths[labels], 
                                                    test_size=0.2, 
                                                    random_state=42)

You'll notice that we have a mixture of categorical and numeric data here. We'll need to do some pre-processing to make the categorical columns suitable for use in a machine learning model. We'll use a `ColumnTransformer` to handle the data pre-processing. It takes a list of tuples, where the first element is just a name, the second element is a transformer from `sklearn.preprocessing`, and the third element is a list of the column names that we want to transform.

In this setup, we'll standardize the numeric features and one-hot (dummy) code the categorical features. The `remainder='passthrough'` argument means that any additional features that are not listed here will just be passed through to the next step without any modification.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data_prep  = ColumnTransformer([
    ("standardizer", StandardScaler(), ["fage", "mage", "visits", 'weeks']),
    ("onehot_encoder", OneHotEncoder(handle_unknown='ignore'), ["habit"])
    ],
    remainder = 'passthrough'
    )

We'll also create a function to build a neural network model. 

For now, you can ignore most of the code below. The most important part is the list of layers passed to `Sequential`. We'll discuss each layer after the code.

In [None]:
from scikeras.wrappers import KerasClassifier
from typing import Dict, Iterable, Any

def create_model(meta: Dict[str, Any]):
    inp = meta["n_features_in_"]
    model = Sequential([
        Input(shape = (inp, ), name='Input'),            # 1. input layer
        Dense(16,  activation='relu', name='hidden'),    # 2. hidden layer
        Dense(1, activation='sigmoid', name='output')    # 3. output layer
        ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model


keras_model = KerasClassifier(model=create_model,  epochs=20, batch_size=64, verbose=0,  random_state=42)

1. The first layer is an `Input` layer. This is just going to take our features and pass them to the next layer. The only important argument here is `shape`, which should indicate the number of features we're including in the model. However, in this function, the number of features is being determined dynamically from the data.
2. The next `Dense` layer is the hidden layer. The first argument means we'll have 16 nodes in this layer, and `activation='relu'` means they'll use a rectified linear unit (ReLU) activation function.
3. Finally, the last `Dense` layer is our output layer. It has a single node that takes the results from the hidden layer and applies a sigmoid function to convert them into a probability.
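The two activation functions named above are simple enough to write out by hand. The sketch below defines both with NumPy and also counts the weights a network of this shape would need to learn. The helper `count_dense_params` is a hypothetical function written for this example, and the feature count of 6 is an assumption (four standardized columns plus two dummy columns for `habit`).

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # values between 0 and 1; sigmoid(0) is exactly 0.5

def count_dense_params(n_inputs, n_units):
    """Hypothetical helper: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 6                                       # assumed number of input columns
hidden_params = count_dense_params(n_features, 16)   # 6*16 + 16 = 112
output_params = count_dense_params(16, 1)            # 16*1 + 1  = 17
total_params = hidden_params + output_params
print(total_params)                                  # 129
```

Even this small network has far more parameters than a logistic regression on the same six columns, which would have seven (six weights plus an intercept).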

Next, we'll create a complete pipeline by combining our data processing steps with the Keras model:

In [None]:
from sklearn.pipeline import Pipeline 

keras_pipe = Pipeline( [("data_prep", data_prep), 
                        ("neural_net", keras_model)
                         ] )

And now we'll use `fit` to fit the model to our training data, and then `predict` to generate predictions for the test set:

In [None]:
keras_pipe.fit(X_train, y_train)

In [None]:
preds = keras_pipe.predict(X_test)

We'll put this together in a confusion matrix and create a classification report to assess the results:

In [None]:
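# rows are predicted classes, columns are observed classes
pd.crosstab(preds, y_test, rownames=['predicted'], colnames=['actual'])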

In [None]:
print(classification_report(y_test, preds, 
                            # add target_names to show labels in the report:
                            target_names=['negative', 'positive']))

# add cohen's kappa and balanced accuracy
print("cohens kappa: ", cohen_kappa_score(y_test, preds))
print("balanced accuracy: ", balanced_accuracy_score(y_test, preds))
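Cohen's kappa and balanced accuracy are worth checking alongside plain accuracy when one class is much rarer than the other. The sketch below uses made-up labels, not our model's predictions: a "model" that always predicts the majority class looks decent on accuracy but not on the other two metrics.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, cohen_kappa_score

# Made-up example: 8 negatives, 2 positives, and a model that always predicts 0
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)               # 8 of 10 correct
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)  # mean of per-class recall: (1.0 + 0.0) / 2
kappa = cohen_kappa_score(y_true_toy, y_pred_toy)          # agreement beyond chance

print(acc, bal_acc, kappa)
```

Accuracy is 0.8 here even though the model never identifies a positive case, while balanced accuracy is 0.5 and kappa is 0, which both signal that the model is doing no better than chance.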

<h2 style="color:red;font-weight:bold">Try it out</h2>


The code chunk below just replicates what we ran before. Try making some of the following changes, then write a brief assessment of how your results compare to the results from the previous step. Are there noticeable improvements? Did the model take longer to train? Did you notice any errors or warnings?

- Increase the number of  `epochs`
- Increase the number of nodes in the hidden layer to 64
- Add an additional `Dense` hidden layer with 16 nodes
- Change the activation function on the hidden layer to use the `sigmoid` or  `tanh` function
- Add additional predictors


In [None]:
features = ['mage', 'fage', 'weeks', 'visits', 'habit']
labels = 'lowbirthweight'
data_prep  = ColumnTransformer([
    ("standardizer", StandardScaler(), ["fage", "mage", "visits", 'weeks']),
    ("onehot_encoder", OneHotEncoder(handle_unknown='ignore'), ["habit"])
    ],
    remainder = 'passthrough'
    )
def create_model(meta: Dict[str, Any]):
    inp = meta["n_features_in_"]
    model = Sequential([
        Input(shape = (inp, )),
        Dense(16,  activation='relu'),
        Dense(1, activation='sigmoid')
        ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model
keras_model = KerasClassifier(model=create_model,  epochs=20, batch_size=64, verbose=0,  random_state=100)

keras_pipe_modified = Pipeline([
    ("data_prep", data_prep), 
    ("neural_net", keras_model)
    ])

In [None]:
# your code

Your comments:

## Note:

You can try using stratified k-fold cross-validation on your model to compare accuracy scores across multiple runs of this process:

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5, random_state=100, shuffle=True)
scores = cross_val_score(keras_pipe_modified, ncbirths[features], ncbirths[labels], cv=skf, scoring='accuracy')
scores
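One way to summarize the fold-by-fold scores is to report their mean and standard deviation. The sketch below shows the pattern on simulated data with a logistic regression standing in for the Keras pipeline, so that it runs quickly on its own. The data, the stand-in model, and the `demo_` names are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Simulated binary classification data (not ncbirths)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=100)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
demo_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                              cv=skf_demo, scoring="accuracy")

# One accuracy per fold, summarized by the mean and the spread
print(demo_scores)
print(f"mean accuracy: {demo_scores.mean():.3f} (sd: {demo_scores.std():.3f})")
```

A large standard deviation relative to the differences between two models suggests that an apparent improvement may just reflect which observations landed in which fold.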