# deep learning

## intro

it's hard to overstate the pervasiveness and success of deep learning methods in recent years. knowledge of deep learning techniques is a must for modern data scientists.

it's so important, in fact, that GU offers an entire class on it: [Math 514: intro to neural networks](https://myaccess.georgetown.edu/pls/bninbp/bwckctlg.p_display_courses?term_in=201910&one_subj=MATH&sel_crse_strt=514&sel_crse_end=514&sel_subj=&sel_levl=&sel_schd=&sel_coll=&sel_divs=&sel_dept=&sel_attr=#_ga=2.146187319.2080322458.1542654672-115011657.1531320772). this is not that course! you should consider taking it.

what follows is merely an incredibly hand-wavy introduction to deep neural nets. I hope to give you enough understanding and context that you feel comfortable executing simple deep learning code.

### deep learning vs. deep neural nets

for starters, a bit of nomenclature: the lecture is called "deep learning" but I will often be talking about "deep neural nets" instead. they are related:

+ **deep learning** is a family of statistical modelling approaches that attempt to "learn" the underlying structure or most convenient representation of data in order to make a specific sort of prediction
    + "learning" happens through exposure to successive examples: a model that has already been trained should become a better model when it is exposed to a new example
    + the predictions made in deep learning are typically supervised (real-world targets), but not necessarily so (autoencoders)
+ **deep neural nets** are a sub-family of *deep learning* models that are specifically constructed out of inter-connected "neurons", computation steps that perform a linear transformation and then a subsequent nonlinear transformation

it's a minor distinction, but there are things that are **deep learning** that are not **deep neural nets**; we're not going to talk about them!

### introduction to deep neural nets

let's talk about what a deep neural net is

#### one neuron / node
the fundamental element of a neural net is the neuron. this is so named due to long-standing analogies to the way neurons work in a brain.

I think this analogy is more confusing than it is worth. *just watch me* call them nodes instead of neurons.

the ~~neuron~~ node is a two-step operation: you do one *linear* transformation with a vector of weights and a bias value, then you do one *nonlinear* transformation with some function (called the **activation function**).

what weights? what bias value? what function? to be determined!

suppose we have a record of data with two features $x_1$ and $x_2$. a neuron that can act on that record would have two weight values ($w_1$ and $w_2$, one for each feature), a bias value $b$, and an activation function $f$

<br><div align="center"><img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-09-at-3-42-21-am.png?w=568&h=303" width="600px"></div>

the first step is the *linear* transformation. symbolically, this is:

$$
W \cdot x + b = \sum_i W_i x_i + b
$$

geometrically, this is a measurement of how large the vector $x$ is when projected along the direction of the weight vector $W$ (scaled by the length of $W$, then shifted by the bias $b$)

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1DYpFUfRxuuAhn66402TJOMbyXCzf4Zcw" width="600px"></div>

basically, we are *linearizing* the input by converting every incoming record in whatever space to one single number measuring the amount of that vector pointing in some specific direction.

after we have linearized the input, we add a non-linearity using an **activation function**. this activation function takes one input value (the linearized input value) and outputs something that is specifically non-linear.
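to make the two steps concrete, here is a minimal `numpy` sketch of a single node. the feature, weight, and bias values are made up for illustration, and `node` is a hypothetical helper name (not part of any library). it uses the sigmoid as the activation function.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def node(x, w, b, activation=sigmoid):
    """One node: a linear step (w . x + b) followed by a nonlinear activation."""
    z = np.dot(w, x) + b   # the "linearized" input: a single number
    return activation(z)

# hypothetical values, chosen only for illustration
x = np.array([2.0, -1.0])   # one record with features x_1, x_2
w = np.array([0.5, 1.0])    # one weight per feature
b = 0.0                     # bias

output = node(x, w, b)
print(output)  # 0.5, because w . x + b = 0 and sigmoid(0) = 0.5
```

changing `w` and `b` changes which direction the node "listens" in and where it switches on; the activation function decides what shape that switch has.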

there are [a lot of these functions](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions), but the most common are

the **sigmoid**, $\sigma(x) = \dfrac{1}{1 + \exp(-x)}$

In [28]:
import numpy as np
import plotly.offline, plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
x = np.linspace(-10, 10, 100)
y = 1 / (1 + np.exp(-x))
data = [go.Scatter(x=x, y=y)]
plotly.offline.iplot(data)

the **ReLU** (**Re**ctified **L**inear **U**nit),

$$
\operatorname{relu}(x) = \left\{\begin{array}{ll}
0 & x \leq 0 \\
x & x > 0
\end{array}
\right.
$$

In [29]:
y = np.where(x <= 0, 0, x)
data = [go.Scatter(x=x, y=y)]
plotly.offline.iplot(data)

the **leaky ReLU**,

$$
\operatorname{leakyrelu}(x) = \left\{\begin{array}{ll}
0.01 x & x \leq 0 \\
x & x > 0
\end{array}
\right.
$$

In [30]:
y = np.where(x <= 0, 0.01 * x, x)
data = [go.Scatter(x=x, y=y)]
plotly.offline.iplot(data)

#### a stack of neurons

once we understand what one neuron is doing, we could take a whole stack of $N$ of them. each could have different $W$ and $b$ values. they could have different activation functions (but typically don't). and they all could take one input record and create $N$ output values. 

this is often visualized as a "net", where the inputs and neurons (nodes) are drawn as circles, and the "weights" are represented as edges (that is, the edge from input $x_i$ to a node represents that node's $w_i$ value)

<br><img src="https://draftin.com/images/34466?token=YFsmpDuQfD3DDylinRD8F4sLOgjCFm4Aow1gIWoCY5KED3bnQKs17RaTja95OIQQWdr25dqS2fxq_6mDwwdcs9Y"></img>

so with a stack of $N$ nodes we can convert an input record $x$ into an $N$-dimensional output record $z$
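in code, a stack of $N$ nodes is just a matrix: if row $i$ of a matrix holds the weights of node $i$, then all $N$ linear steps happen in one matrix-vector product. a rough sketch, assuming a ReLU activation and made-up numbers (`layer` is a hypothetical helper name):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(x, W, b, activation=relu):
    """A stack of N nodes: row i of W holds the weights of node i, b[i] its bias."""
    return activation(W @ x + b)

# hypothetical numbers: 2 input features, N = 3 nodes
x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.5, 0.0])

z = layer(x, W, b)
print(z)  # three values: 1.0, 2.5 and 0.0 (the third node's -3.0 is clipped by the relu)
```

this is the reason the "giant pile of linear algebra" description fits: every layer is one matrix multiplication plus an element-wise function.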

#### a stack of stacks of neurons

the output of one stack of neurons is a new record. it's in some crazy $N$-dimensional space which is determined by the weights and biases of the previous layer, but it's basically now just a new record.

so we could do the same thing with *that* record that we did with our $x$ records, and feed it into a *new* stack of nodes

this is the "deep" neural net -- it's a neural net with hidden layers, so it's become "deep"

<br><img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg"></img>
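stacking layers is nothing more than a loop: the output record of one layer becomes the input record of the next. a minimal sketch, assuming random (untrained) weights, ReLU activations everywhere, and layer sizes picked only for illustration (`forward` is a hypothetical helper name):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Pass one record through a list of (W, b) layers, one after another."""
    record = x
    for W, b in layers:
        record = relu(W @ record + b)  # the output of one layer is the next layer's input
    return record

rng = np.random.default_rng(0)  # arbitrary seed so the sketch is repeatable
sizes = [3, 4, 4, 2]  # hypothetical: 3 features, two hidden layers of 4 nodes, 2 outputs
layers = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = np.array([0.5, -1.0, 2.0])
out = forward(x, layers)
print(out.shape)  # (2,)
```

note that the shapes have to chain: each weight matrix has as many columns as the previous layer has nodes.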

#### finally, an output

remember, we started down this path because the model we are constructing should be able to *predict* something. so we need a final layer that will take... whatever it is that we've created -- whatever that representation is -- and predict a value.

in practice, this is usually a logistic function (for binary predictions), a softmax (for categorical predictions), a linearization-only node (for regression), or a collection of logistics (for multi-category predictions)
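of those output options, the softmax is probably the least familiar, so here is a small `numpy` sketch of it. the scores are made-up numbers standing in for the output of a final linear step over three categories.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)        # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores, one per category
probs = softmax(scores)
predicted_class = int(np.argmax(probs))  # the category with the highest probability
print(probs, predicted_class)
```

the largest score always gets the largest probability, so the predicted category here is the first one (index 0).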

**<div align="center">what are your questions so far?</div>**

#### summary

so a neural net is: a series of **layers**, where each **layer** is a stack of some number of **neurons**, and each **neuron** is a linearization (defined by a weight $w$ and a bias $b$) followed by an **activation** function.

#### why it works

if I just gave you a neural net with a random number of layers, with layers of random node size, and with random weights and biases throughout, it would be *terrible* at making predictions. so it's not the *structure* that is making good predictions.

rather, this particular way of arranging things has some special properties that make it easy to figure out how to tweak weights to incrementally improve those predictions. the process whereby we tweak weights is called *backpropagation* and is, at its heart, just the chain rule applied to millions of variables.

slowly but surely, and with enough input data, we can update the weights in our deep neural net to **learn** the ways of representing our data (the elements that come out of each layer of nodes) that are **optimal** for making our predictions.
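here is what "tweak the weights using the chain rule" looks like for the smallest possible network: one sigmoid node, one training example. this is only a sketch under simplifying assumptions (a squared-error loss, a made-up example, and an arbitrary learning rate), not how a real library organizes the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical single training example and starting parameters
x = np.array([1.0, 2.0])
y = 1.0                      # the target we want the node to predict
w = np.array([0.0, 0.0])
b = 0.0
learning_rate = 0.5          # arbitrary step size

losses = []
for step in range(100):
    # forward pass
    z = np.dot(w, x) + b
    a = sigmoid(z)
    losses.append(0.5 * (a - y) ** 2)

    # backward pass: chain rule, one factor per step of the forward pass
    dloss_da = a - y             # d(loss)/d(a)
    da_dz = a * (1.0 - a)        # d(sigmoid)/d(z)
    dloss_dz = dloss_da * da_dz
    grad_w = dloss_dz * x        # dz/dw = x
    grad_b = dloss_dz            # dz/db = 1

    # nudge the parameters downhill
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b

print(losses[0], losses[-1])  # the loss starts at 0.125 and shrinks from there
```

a deep net does the same thing, except the chain of derivatives runs backwards through every layer, which is where the name *backpropagation* comes from.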

in a way, it's almost like cheating -- we know we want to make predictions, and we have a clever way of mashing together our input features such that what comes out is some $N$ dimensional vector that we can pass to a logistic regression and get amazing results

#### why we care (in this class)

so why go through the hassle of covering this in "advanced math and statistical computing" when it's the topic of an entirely different course?

because there's so much computing action focused on deep learning!

we spent the bulk of last lecture talking about how anything that can be parallelized is a good candidate for `gpu` analytics and acceleration, and in particular linear algebra.

well, as you saw above, deep neural nets are a giant pile of linear algebra. recent advancements in `gpu` availability (price and number) as well as speed have caused an explosion of `gpu` deep learning application development, including ours in this course.

## higher-level deep learning `api`s

let's figure out how we can code deep neural network models!

recall the deep learning stack picture from above:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1M3LZQRI8nfCscnyL_h7xjKi4i9e8lo1t"></div>

this is called a stack because each level is providing a new interface and possibly new functionality "on top of" the level below it. the bottom three green levels are all very low level, and represent the particular stack made available by `nvidia`.

the very lowest level of that diagram (`gpu`) is the hardware level. you could have several types of `gpu`, but in this class we will focus on `nvidia` brand `gpu`s.

the level above that (`cuda`) is a set of `c++` libraries which allow programmers to write code that can be executed on the `gpu`. for `nvidia` `gpu`s, developers at `nvidia` have done the heavy lifting here, creating the bridge from the hardware (extremely low-level instructions!) to `c++` (a full OOP language).

they have also used that set of `c++` libraries to create a second set of libraries that use the lower-level `c++` `cuda` code base to implement functionality that is specifically meant to be used in deep learning applications.

at this point, if you are an extremely `l33t h4x0r` `c++` programmer, feel free to hop into your `ide` and bang out some deep neural nets.

for the rest of us mere mortals, we will focus on even-higher-level apis in `python`. fortunately, a few exist

the orange (`tensorflow`) and red (`keras`) boxes represent two levels of abstraction available to `python` coders for interacting with `gpu`s.

+ `tensorflow` is a numerical computation framework that defines complicated computations as a directed graph of smaller computations
    + we define a "graph" of operations (nodes) that we connect by their inputs and outputs (edges)
    + the graph defines how to get from one first node (e.g. loading the ultimate input) to any step downstream in the graph
    + this is *not* deep learning specific (we could create another crappy alarm clock script with `tensorflow`, e.g.), but it was created with deep learning in mind

perhaps that raises the question: how can I use these existing `c++` libraries from within `python`?

you look for a `python` `api` that wraps them

+ `keras` is a deep-learning-focused framework
    + this is a high-level way of describing deep neural networks and training methods in simple `python` code
    + it has *backends*, internal libraries which are used to *implement* the higher-level `keras` framework

#### deep learning `api`s

each of these libraries is an *interface* to *something* beneath it. they provide developers a set of functions in some runtime (e.g. `python`) that hide the difficult, messy internal implementation details, so that someone who wants to use `tensorflow` to do *whatever* it is that `tensorflow` does won't need to know or care about what's happening a level below.

like any good interface, each of them **can** be an interface to the layer below

+ `tensorflow` is a `python` interface to the `c++` libraries `cuda` and `cudnn` (really, it "goes through" the `tensorflow` `c++` libraries, for an extra layer of interface-y goodness)
+ `keras` is a `python` interface to `tensorflow`

but they also **are not required** to be using that particular lower-level piece 

+ `tensorflow` can use non-`nvidia` or non-`gpu` lower-level libraries (that is, it can work on different `gpu`s, or `cpu`s, or google's proprietary `tpu`s, or `android` phones)
+ `keras` can use other `python` deep learning libraries (e.g. `theano`, `cntk`, or apache `mxnet`)

you've used at least one subject-matter-specific `api` library before in this class: the `scikit-learn` library is a framework for creating machine learning models. you are used to relying on the same sort of `api` for `sklearn` models:

```python
model = sklearn.somemodeltype.MySpecialClassifier(param1, param2)
model.fit(X_train, y_train)
model.score(X_test, y_test)
model.predict(X_test)
```

all that changes from model to model is the actual `model` object you create, but there are standard ways of creating those models. then you assume they all have the same methods
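a concrete version of that pattern, as a sketch: two very different `sklearn` model types driven through identical method calls. the choice of the iris dataset, the two model types, and the split parameters are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# two very different model types, driven through the exact same methods
scores = {}
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)  # mean accuracy
    predictions = model.predict(X_test)

print(scores)
```

the loop body never needs to know which kind of model it is holding. that is the whole point of a shared `api`.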

finally, if you weren't confused enough yet, the `keras` library is available as a standalone `python` package but *also* as a module within the `tensorflow` package:

In [35]:
import keras
help(keras)

Using TensorFlow backend.


Help on package keras:

NAME
    keras

PACKAGE CONTENTS
    activations
    applications (package)
    backend (package)
    callbacks
    constraints
    datasets (package)
    engine (package)
    initializers
    layers (package)
    legacy (package)
    losses
    metrics
    models
    objectives
    optimizers
    preprocessing (package)
    regularizers
    utils (package)
    wrappers (package)

DATA
    absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...

VERSION
    2.2.4

FILE
    /Users/zach.lamberty/miniconda3/envs/bullshit/lib/python3.6/site-packages/keras/__init__.py




In [37]:
import tensorflow as tf
help(tf.keras)

Help on package tensorflow._api.v1.keras in tensorflow._api.v1:

NAME
    tensorflow._api.v1.keras - Implementation of the Keras API meant to be a high-level API for TensorFlow.

DESCRIPTION
    Detailed documentation and user guides are available at
    [keras.io](https://keras.io).

PACKAGE CONTENTS
    activations (package)
    applications (package)
    backend (package)
    callbacks (package)
    constraints (package)
    datasets (package)
    estimator (package)
    initializers (package)
    layers (package)
    losses (package)
    metrics (package)
    models (package)
    optimizers (package)
    preprocessing (package)
    regularizers (package)
    utils (package)
    wrappers (package)

VERSION
    2.1.6-tf

FILE
    /Users/zach.lamberty/miniconda3/envs/bullshit/lib/python3.6/site-packages/tensorflow/_api/v1/keras/__init__.py




note: these are not the same version!!

let's take a step back and re-focus on what we want to do, to help illuminate what these `api`s (`tensorflow` and `keras`) are doing for us.

we want to create deep neural network models, and we'd like to be able to use `gpu`s to accelerate our computation when we choose to

`tensorflow` can help us do this

+ **if** we can define our model as a directed graph of computation nodes and input / output edges, **then** `tensorflow` will handle the implementation on different lower-level hardwares (`gpu`, `cpu`, `tpu`, `android`) for us
+ **if** we have a novel or experimental neural network architecture we want to try, **then** `tensorflow` provides us with all of the necessary infrastructure to create and train that model in the same way we would train any other model. we should be able to build pretty much *whatever* deep learning model we want
+ **if** we want to train a fairly straightforward model type (`dnn`, `cnn`, `rnn`, `lstm`), **then** we may have to work a little bit harder than we'd like to define that model (see `keras`)

also, `keras` can help us do this

+ **if** we have a backend which implements neural net computation methods (e.g. `tensorflow` or `theano`), **then** `keras` gives us a *much* simpler interface for writing that code
+ **if** we want to use a different backend (`tensorflow` on one computer and `mxnet` on another), **then** `keras` will handle the implementation details for each without requiring code changes

if you are just starting out and want to do some simple neural network development, I **strongly** encourage you to start with `keras`.

in particular, the author of the `keras` library (François Chollet) is a prolific author and blogger. his [`keras` blog](https://blog.keras.io/author/francois-chollet.html) is one of the best resources out there for tutorials on how to write deep learning models in `keras`. additionally, he wrote a great book: https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438

**<div align="center">what are your questions so far?</div>**

### alternatives

the *thing* `keras` gives us is a high-level backend-agnostic interface for creating most types of deep neural net architectures. alternative options include apache `mxnet`, which is a high-level framework with implementations in multiple different languages (and, coincidentally, one of the *backends* to `keras` to boot)

the *thing* `tensorflow` gives us is a computation environment with all the basic building blocks of deep neural net models (e.g. activation functions, loss functions, gradient descent algorithms) and supporting implementation on various different hardware types. the main alternative to `tensorflow` for this at this time is `pytorch`.

many people prefer `pytorch` to `tensorflow`, so this is by no means a settled dispute. that being said, one of the people that prefers `tensorflow` is `google`, so I feel pretty confident that project will keep advancing.

## hands-on

enough yakking, more key clacking. let's build some models

**<div align="center">exercise: install `tensorflow` and `keras`</div>**

on some machine where you have `conda` and some disk space, let's run

```sh
conda install -y tensorflow keras
```

verify it worked by running (in a `python` or `ipython` session)

```python
import keras
import tensorflow as tf
```

note: we also could have used `docker` to create a `container` with `tensorflow` (and therefore `keras`, via `tf.keras`) pre-installed. look at https://hub.docker.com/r/tensorflow/tensorflow/ for details, but the basic commands are

```sh
# pull (if you haven't) and run the latest py v3 tensorflow
# container
docker run --rm -it -p 8888:8888 tensorflow/tensorflow:latest-py3

# open a jupyter notebook at localhost:8888
```

### using `tensorflow`

the [`tensorflow` documentation](https://www.tensorflow.org/tutorials/) is the definitive source for information on how to write `tensorflow` code, and this is no replacement. I simply want to cover the high-level concepts of working with `tensorflow`

#### the execution graph

the fundamental object in `tensorflow` is the execution graph: a directed graph connecting *computation* nodes (think `add`, `subtract`, `multiply`, etc) with edges that symbolize inputs and outputs.

you build this graph up as an object by adding operations to it, such as an `add` operation. importantly, when you write

```python
mysum = tf.add(1, 1)
```

you **are not *performing* that computation**. you are creating an `add` operation object which has as its inputs two constant values (edges) with values of 1.

In [43]:
import tensorflow as tf

mysum_op = tf.add(1, 1)
mysum_op

<tf.Tensor 'Add_1:0' shape=() dtype=int32>

that operation was automatically added to the "graph" of computations as a node (alongside a constant node for each of its two inputs):

In [44]:
g = tf.get_default_graph()
g.get_operations()

[<tf.Operation 'Add/x' type=Const>,
 <tf.Operation 'Add/y' type=Const>,
 <tf.Operation 'Add' type=Add>,
 <tf.Operation 'Add_1/x' type=Const>,
 <tf.Operation 'Add_1/y' type=Const>,
 <tf.Operation 'Add_1' type=Add>]

if you actually want to get any information out of any operation you need to *evaluate* that node within a *session* -- you create a context in which `tensorflow` knows what inputs it should expect (you define them when you `run` the session!) and which outputs to return (you are `run`-ing operations)

In [46]:
with tf.Session() as sess:
    mysum_value = sess.run(mysum_op)
mysum_value

2
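if the "build first, run later" idea still feels slippery, here is a toy version of it in plain `python`. to be clear, this is **not** how `tensorflow` is implemented, and `Node`, `constant`, `add`, and `run` are hypothetical names invented for this sketch. it only shows the separation between *describing* a computation and *performing* it.

```python
class Node:
    """One operation in a toy computation graph (not tensorflow's real classes)."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func        # what to compute when the node is finally run
        self.inputs = inputs    # upstream nodes, i.e. the incoming edges

def constant(value):
    return Node("Const", lambda: value)

def add(a, b):
    # building the node only records *what* to do; nothing is added yet
    return Node("Add", lambda x, y: x + y, inputs=(a, b))

def run(node):
    """Evaluate a node by first evaluating everything upstream of it."""
    upstream_values = [run(n) for n in node.inputs]
    return node.func(*upstream_values)

mysum_op = add(constant(1), constant(1))   # a description of a computation
mysum_value = run(mysum_op)                # the computation itself
print(mysum_op.name, mysum_value)          # Add 2
```

here `run` plays the role of `sess.run`: until it is called, `mysum_op` is only a recipe.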

#### eager execution

when developing code, this extremely structured way of doing things (build a computation graph, then run it in a session) can be... pretty annoying.

the google developers created an "eager execution" functionality to address exactly this problem. you can

1. develop your code in "eager execution" mode -- get the results of your operation immediately
1. remove one line of code from the beginning of your developed file and put everything inside a `tf.Session` to get the "production" behavior

In [4]:
# NOTE:
# you must restart your kernel if you want to do this!!!
import tensorflow as tf
tf.enable_eager_execution()

mysum_op = tf.add(1, 1)
mysum_op

<tf.Tensor: id=15, shape=(), dtype=int32, numpy=2>

In [5]:
mysum_op.numpy()

2

+ hands-on
    + exercise: install tensorflow
    + demo
        + make a logistic classifier for iris
        + at least in tf, maybe in all
    + aws gpu instance
        + nvidia has an available ami

+ spec out the pricing of this!