![](./images/DL-NLP-intro.png)

# Agenda

- What is Deep Learning?
- The Rise of Deep Learning
- Convolutional Neural Networks
- Deep Learning for NLP
- Homework Review & Troubleshooting
- Assessment

![Ng](./images/Ng-quote.png)

# Deep Learning

![brain-network](https://ville.montreal.qc.ca/idmtl/en/wp-content/uploads/sites/2/2017/04/inf2.jpg)

## Mimics the neocortex of our brains

![](https://media.nature.com/full/nature-assets/neuro/journal/v19/n3/images/nn.4244-F1.jpg)

Deep-learning software attempts to mimic the activity in layers of neurons in the neocortex, the wrinkly 80 percent of the brain where thinking occurs. The software learns, in a very real sense, to recognize patterns in digital representations of sounds, images, and other data.

The basic idea—that software can simulate the neocortex’s large array of neurons in an artificial “neural network”—is decades old, and it has led to as many disappointments as breakthroughs. 

![](http://www.nlpacademy.co.uk/images/uploads/whatisnlp.jpg)

# Perceptron = Single Layer

![](./images/perceptron.png)

The neurons in our brains, which transmit electrochemical signals, inspired the artificial neuron called a perceptron.

The perceptron is one of the earliest supervised learning algorithms for classification, and it has only a single layer.
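To make the single-layer idea concrete, here is a minimal sketch of a perceptron in plain Python. The toy data (the logical AND of two inputs), the learning rate, and the function names are assumptions chosen for illustration; they are not from the slides.

```python
def predict(weights, bias, x):
    """Fire (1) if the weighted sum of the inputs is above zero, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Classic perceptron rule: the weights only move when a prediction is wrong."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Toy data (assumed for illustration): the logical AND of two inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)  # [0, 0, 0, 1]
```

AND is linearly separable, which is why one layer is enough here; a single perceptron cannot learn a problem such as XOR.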


# Classification

# Input > Classifier > Output
![](./images/classifiers.png)

Classification: assign each object to a category, using only a few basic data features that describe it, in order to predict its label.

This is done by building a model based on one or more numerical or categorical variables (predictors, attributes, or features). 



## ML vs DL
- Most current machine learning works well because of human-designed representations and input features
- Machine learning becomes just optimizing weights to best make a final prediction
- Representation learning attempts to automatically learn good features or representations
- Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction

## Linear Classifiers

![](./images/linear-classifiers.png)

Regression vs Classification

The simplest model to start from in ML is linear regression; a linear classifier uses the same kind of linear function to separate classes.

The formula for a straight line is y = mx + b, where x is the input, m is the slope of the line, b is the y-intercept, and y is the output for that value of x.

The only values we can adjust, or train, are m and b. Nothing else can change the line, since all that is left is our input and our output.

In machine learning there are many m's, one for each feature.


![](./images/weights-biases.png)

These values are collected into a matrix denoted W, the weights matrix.
The b's are collected together as the biases.

The prediction accuracy of a neural net depends on its weights and biases. 

Each iteration or cycle of updating the weights and biases (improving accuracy) is called one training step.

Each edge has its own weight and each node has its own bias, so every node computes a different combination of its inputs, which explains why the nodes fire differently.
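As a rough sketch of what one training step looks like, the code below fits a linear model (one weight per feature plus a bias) with gradient descent on mean squared error. The toy data, the learning rate, and the choice of loss are assumptions for illustration only.

```python
def predict(weights, bias, features):
    """Linear model: one weight per feature plus a single bias (y = w.x + b)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mse(weights, bias, data):
    """Mean squared error of the model over the whole dataset."""
    return sum((predict(weights, bias, f) - t) ** 2 for f, t in data) / len(data)


def training_step(weights, bias, data, lr=0.05):
    """One gradient-descent update of the weights and the bias."""
    n = len(data)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for features, target in data:
        error = predict(weights, bias, features) - target
        for i, x in enumerate(features):
            grad_w[i] += 2 * error * x / n
        grad_b += 2 * error / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b


# Toy data (assumed): points that lie exactly on the line y = 2x + 1.
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]

weights, bias = [0.0], 0.0
loss_before = mse(weights, bias, data)
for _ in range(2000):
    weights, bias = training_step(weights, bias, data)
loss_after = mse(weights, bias, data)

print(weights, bias)            # close to [2.0] and 1.0
print(loss_before, loss_after)  # the loss shrinks as training proceeds
```

With more features, `weights` simply gets more entries; the update rule stays the same.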


# A neural network = running several logistic regressions at the same time

![neural net](https://cdn-images-1.medium.com/max/1600/0*IUWJ5oJ_z6AiG7Ja.jpg)

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs...

- which we can feed into another logistic regression function
- It is the training criterion that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.
- Before we know it, we have a multilayer neural network…
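The idea of stacking logistic regressions can be sketched in a few lines of plain Python. All weights and biases below are hypothetical, hand-picked numbers; in a real network they would be learned from the training criterion.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_layer(inputs, weight_rows, biases):
    """Run several logistic regressions on the same input vector.

    Each row of weights (plus its bias) is one logistic regression unit,
    so the layer maps a vector of inputs to a vector of outputs.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]


# Hypothetical parameters, for illustration only.
hidden_weights = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]  # 3 units, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.2, 0.5]]                      # 1 unit, 3 inputs
output_biases = [0.05]

x = [1.0, 2.0]
hidden = logistic_layer(x, hidden_weights, hidden_biases)       # vector of 3 values
output = logistic_layer(hidden, output_weights, output_biases)  # vector of 1 value
print(hidden, output)
```

The second call takes the first layer's outputs as its inputs, which is all that "multilayer" means here.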

# Rise of Deep Learning

## Why Now?

Despite prior investigation and understanding of many of the algorithmic techniques, training deep architectures was largely unsuccessful before 2006.

### What has changed?
- Faster machines and more data help DL more than other algorithms
- New methods for unsupervised pre-training have been developed
- More efficient parameter estimation methods
- Better understanding of model regularization



![](https://cdn-images-1.medium.com/max/2000/1*3IXgb4fIFFJbIOgMzl9RGA.jpeg)

Jensen Huang, NVIDIA CEO

![](https://image.slidesharecdn.com/introductiontomulti-gpudeeplearningwithdigits2-mikewang-150820102631-lva1-app6891/95/introduction-to-multi-gpu-deep-learning-with-digits-2-mike-wang-22-638.jpg?cb=1440074781)

### Deep Learning in the Cloud

![](https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2015/11/nvidia-deep-learning.jpg)

# Convolutional Neural Networks (CNNs)


### What is a Convolution?

![](./images/cnn-artithmetic.gif)

Source: https://github.com/vdumoulin/conv_arithmetic

The arithmetic being performed can be thought of as a spotlight - extracting features, such as edges, from an image one layer at a time. These layers are often called filters.

![](./images/convolution-schematic.gif)

Convolution with 3×3 Filter. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution


For me, the easiest way to understand a convolution is to think of it as a sliding window function applied to a matrix. That’s a mouthful, but it becomes quite clear when you look at a visualization.

Imagine that the matrix on the left represents a black-and-white image. Each entry corresponds to one pixel, 0 for black and 1 for white (typically it’s between 0 and 255 for grayscale images). The sliding window is called a kernel, filter, or feature detector. Here we use a 3×3 filter, multiply its values element-wise with the original matrix, then sum them up. To get the full convolution we do this for each element by sliding the filter over the whole matrix.
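The sliding-window description translates almost directly into code. The sketch below uses a small 5×5 binary image and a 3×3 filter whose values are assumed for illustration; it uses no padding and a stride of 1.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    At each position, multiply element-wise and sum. Like most deep learning
    libraries, this does not flip the kernel (strictly, it is cross-correlation).
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


# Toy black-and-white image and filter (values assumed for illustration).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5×5 input and a 3×3 filter give a 3×3 output, because the filter fits in only three positions along each axis.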

### What are Convolutional Neural Networks?


![](https://www.mathworks.com/content/mathworks/www/en/discovery/convolutional-neural-network/jcr:content/mainParsys/image_copy.adapt.full.high.jpg/1508999490138.jpg)

CNNs are basically just several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results. In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That’s also called a fully connected layer, or affine layer. In CNNs we don’t do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results.

A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and implemented on a hardware level on GPUs. 

# Deep Learning for Natural Language Processing

### So, how does any of this apply to NLP?

![](https://cdn-images-1.medium.com/max/1600/0*DAcgw-fqaYq2Ppm1.png)

### CNN Applied to NLP

![](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM.png)

Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

Instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word.

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2-5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this.

![](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-8.03.47-AM.png)

Source: Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification

Convolutional Filters learn good representations automatically, without needing to represent the whole vocabulary. 
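As a minimal sketch of the text case, the code below slides one filter over a sentence matrix. The filter is as wide as the embeddings and covers two words at a time, and it is followed by ReLU and max-over-time pooling. The embeddings and filter values are made up for illustration; a real model learns them and uses many filters of several region sizes.

```python
def text_convolution(sentence_matrix, text_filter, bias=0.0):
    """Slide a filter that spans the full embedding width over word windows.

    sentence_matrix: one row (embedding vector) per token.
    text_filter: region_size rows, each as wide as an embedding.
    Returns one ReLU-activated feature per window position.
    """
    region_size = len(text_filter)
    features = []
    for start in range(len(sentence_matrix) - region_size + 1):
        window = sentence_matrix[start:start + region_size]
        total = bias + sum(
            w * x
            for filter_row, word_vec in zip(text_filter, window)
            for w, x in zip(filter_row, word_vec)
        )
        features.append(max(0.0, total))  # ReLU
    return features


def max_over_time(features):
    """Keep only the strongest response, wherever it occurs in the sentence."""
    return max(features)


# Hypothetical 3-dimensional embeddings for a 5-token sentence.
sentence = [
    [0.1, 0.3, -0.2],   # "I"
    [0.8, -0.1, 0.4],   # "like"
    [0.2, 0.9, 0.1],    # "this"
    [-0.3, 0.5, 0.7],   # "movie"
    [0.0, 0.2, -0.1],   # "."
]
# One filter with region size 2; the values are made up.
bigram_filter = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
]

feature_map = text_convolution(sentence, bigram_filter)
pooled = max_over_time(feature_map)
print(len(feature_map), pooled)  # 4 window positions, one pooled value
```

Pooling turns a sentence of any length into a single number per filter, which is also why the position of the match is lost.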

The most natural fit for CNNs seems to be classification tasks, such as Sentiment Analysis, Spam Detection or Topic Categorization. Convolutions and pooling operations lose information about the local order of words, so sequence tagging, as in PoS Tagging or Entity Extraction, is a bit harder to fit into a pure CNN architecture.

### Labeled vs Unlabeled Data

- Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)
    
    * But almost all data is unlabeled
    
    

- Most information must be acquired unsupervised

    * Fortunately, a good model of observed data can really help you learn classification decisions

# NLP Pipeline

![](./images/nlp-pipeline.png)

How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to the above question lies in creating a representation for words that captures their meanings, their semantic relationships, and the different types of contexts they are used in.

Many Machine Learning algorithms and almost all Deep Learning architectures are incapable of processing strings or plain text in their raw form. They require numbers as inputs to perform any sort of job, be it classification, regression, etc. And with the huge amount of data available as text, it is imperative to extract knowledge from it and build applications.

## Representing Words

### Word Embeddings

![](https://deeplearning4j.org/img/word2vec_translation.png)

This is done with word embeddings: numerical representations of text that computers can handle. Word embeddings are text converted into numbers, and the same text can have several different numerical representations.

### Atomic symbols

- Large vocabulary size (~1,000,000 words in English)
- Joint distributions impossible to infer
    
   ![](http://www.cse.unsw.edu.au/~billw/dictionaries/pix/parsetree.gif)


Current NLP systems are incredibly fragile because of their atomic symbol representations

### Structure Corresponds to Meaning

![](./images/structure-meaning.png)

### Words could be represented by Vectors

- [0,0,0,0,1,0,0] 
    * Also known as "one-hot" representation...
        ![](https://adeshpande3.github.io/assets/NLP8.png)

**What is a word vector?** 

At one level, it’s simply a vector of weights. In a simple 1-of-N (or ‘one-hot’) encoding every element in the vector is associated with a word in the vocabulary. The encoding of a given word is simply the vector in which the corresponding element is set to one, and all other elements are zero.

Using such an encoding, there’s no meaningful comparison we can make between word vectors other than equality testing.
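A small sketch makes the limitation visible. The five-word vocabulary below is an assumption for illustration; the dot product of two one-hot vectors is 1 for the same word and 0 for any other pair, so nothing beyond equality can be read from them.

```python
vocabulary = ["king", "queen", "man", "woman", "apple"]  # toy vocabulary (assumed)
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


king, queen, apple = one_hot("king"), one_hot("queen"), one_hot("apple")
print(king)              # [1, 0, 0, 0, 0]
print(dot(king, king))   # 1: a word only matches itself
print(dot(king, queen))  # 0
print(dot(king, apple))  # 0: "queen" is no closer to "king" than "apple" is
```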

## Word2vec (2013) - Word Vector Representations

   - Google 
   - Publicly Available
    ![](https://lh6.googleusercontent.com/proxy/Akd9MtQYM3jzzZQrysgNzLoawPRw_xveviWzvKXS7hxih1b-iWLA5ijHLgtP07tkMhaOOse635CKPF_cS-s4tg=w5000-h5000)

    
    

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

## Principle of Word Vectorization

![](./images/word-company.png)

**Similar words should have similar vector representations...**

![](./images/coocurrance-matrix.png)

In word2vec, a distributed representation of a word is used. Take a vector with several hundred dimensions (say 1000). Each word is represented by a distribution of weights across those elements. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words.

Such a vector comes to represent in some abstract way the ‘meaning’ of a word. 


**Advantages of Co-occurrence Matrix**

- It preserves the semantic relationship between words, e.g. man and woman tend to be closer than man and apple.
- It uses factorization, which is a well-defined problem and can be solved efficiently.
- It has to be computed only once and can then be reused at any time. In this sense, it is faster than other methods.

**Disadvantages of Co-Occurrence Matrix**

It requires huge memory to store the co-occurrence matrix.
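To show where that memory goes, here is a rough sketch that builds a co-occurrence matrix from a three-sentence toy corpus. The corpus and the window size of one word are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` tokens."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (assumed for illustration).
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]

counts = cooccurrence_counts(corpus)
vocab = sorted({word for pair in counts for word in pair})
matrix = [[counts.get((a, b), 0) for b in vocab] for a in vocab]

print(vocab)
for word, row in zip(vocab, matrix):
    print(f"{word:>8}", row)
```

The matrix has one row and one column per vocabulary word, so it grows with the square of the vocabulary size even though most entries are zero.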

![](https://www.datascience.com/hs-fs/hubfs/Resources/Articles/nn_embed.png?t=1516924658193&width=1414&height=694&name=nn_embed.png)

Google's word2vec is one of the most widely used implementations due to its training speed and performance. Word2vec is a predictive model, which means that instead of utilizing word counts à la latent Dirichlet allocation (LDA), it is trained to predict a target word from the context of its neighboring words. The model first encodes each word using one-hot encoding, then feeds it into a hidden layer using a matrix of weights; the output of this process is the target word. The word embedding vectors are actually the weights of this fitted model.

# Continuous Bag of Words (CBOW)

![](https://adriancolyer.files.wordpress.com/2016/04/word2vec-cbow.png)

CBOW predicts the probability of a word given its context. A context may be a single word or a group of words.

The context words form the input layer. Each word is encoded in one-hot form, so if the vocabulary size is V these will be V-dimensional vectors with just one of the elements set to one, and the rest all zeros. There is a single hidden layer and an output layer.
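The forward pass of this architecture can be sketched as follows. The vocabulary, the hidden size, and the random untrained weights are all assumptions; the code only shows how data flows from one-hot context words through the hidden layer to a probability for each vocabulary word, not how the weights are trained.

```python
import math
import random

vocabulary = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumed)
V, N = len(vocabulary), 3                         # vocabulary size, hidden size
index = {w: i for i, w in enumerate(vocabulary)}

random.seed(0)  # untrained random weights: this only illustrates the data flow
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]   # V x N
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # N x V


def one_hot(word):
    vec = [0.0] * V
    vec[index[word]] = 1.0
    return vec


def cbow_forward(context_words):
    """Return P(target word | context) for every word in the vocabulary."""
    # Hidden layer: average of (one-hot vector x W_in) over the context words.
    hidden = [0.0] * N
    for word in context_words:
        x = one_hot(word)
        for n in range(N):
            hidden[n] += sum(x[v] * W_in[v][n] for v in range(V)) / len(context_words)
    # Output layer: one score per vocabulary word, turned into probabilities.
    scores = [sum(hidden[n] * W_out[n][v] for n in range(N)) for v in range(V)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = cbow_forward(["the", "sat"])  # context around the missing word "cat"
print(dict(zip(vocabulary, probs)))
```

Because each input is one-hot, multiplying by `W_in` just selects one row of it; after training, those rows are the word vectors.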



**Advantages of CBOW:**

- Being probabilistic in nature, it generally performs better than deterministic methods.
- It is low on memory. It does not have the huge RAM requirements of a co-occurrence matrix, which needs to store three huge matrices.
 

**Disadvantages of CBOW:**

- CBOW takes the average of the context of a word (as seen above in the calculation of the hidden activation). For example, Apple can be both a fruit and a company, but CBOW averages both contexts and places the word between the clusters for fruits and companies.
- Training a CBOW from scratch can take forever if not properly optimized.

## GloVe (2014) - Word Vector Representations

    
   - Stanford NLP Pipeline
   - Publicly Available
   
    ![](https://nlp.stanford.edu/projects/glove/images/man_woman.jpg)

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

### Linear Relationships

![](./images/word-linear-relationships.png)

The vectors are very good at answering analogy questions of the form "a is to b as c is to ?". For example, "man is to woman as uncle is to ?" (aunt) can be answered using a simple vector offset method based on cosine distance.

This kind of vector composition also lets us answer the question “King – Man + Woman = ?” and arrive at the result “Queen”! All of which is truly remarkable when you think that all of this knowledge simply comes from looking at lots of words in context, with no other information provided about their semantics.
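The vector offset method can be sketched with a handful of hand-made vectors. The 3-dimensional values below are invented so that the arithmetic is easy to follow; real word2vec or GloVe vectors are learned and have hundreds of dimensions, and the function names are hypothetical.

```python
import math

# Hand-made vectors, purely illustrative.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' with the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king"))  # queen
```

Excluding the three query words from the candidates is the usual convention, since the offset vector often lands closest to one of them.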



# Latest Advancements in DL & NLP

## Visual Grounding

![](./images/visual-grounding.png)

Map sentences and images into a joint space...

## Image Text Embedding

![](./images/image-text-embedding.png)

## Image Text Generation

![](http://www.stat.ucla.edu/~zyyao/projects/I2T/diagram_semanticweb.gif)

## Automatic Text Generation

![](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/07/Automatic-Text-Generation-Example-of-Shakespeare.png)

Automatic Text Generation Example of Shakespeare
Source: [Andrej Karpathy blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

## Computer Generated Handwriting

![](./images/computer-handwriting.png)

Source: [Generating Sequences With Recurrent Neural Networks](https://arxiv.org/pdf/1308.0850.pdf)

Interactive Handwriting Generation Demo: 
http://www.cs.toronto.edu/~graves/handwriting.cgi?text=Accel+AI+Institute&style=&bias=0.15&samples=3

## Deep Learning Image, Video, and Audio Generation

[![](https://assets.pcmag.com/media/images/547605-president-obama-lip-synced-speech-example.jpg?thumb=y&width=810&height=456)](https://youtu.be/9Yq67CjDqvw)

Source: Synthesizing Obama: Learning Lip Sync from Audio

# Questions to Ponder... 

- How will the law evolve to distinguish between computer & human output?
- How will the law evolve to protect against false replications of individuals? 
- How will the law evolve to help citizens distinguish between real and fake news?
