In [2]:
# default_exp nlp
# default_cls_lvl 2

# Neural Networks that understand Language: King - Man + Woman == ?
> 

In this Chapter:
- Natural Language Processing
- Supervised NLP
- Capturing Word Correlation in Input Data
- Intro to an Embedding Layer
- Neural Architecture
- Comparing Word Embeddings
- Filling in the Blank
- Meaning is derived from Loss
- Word Analogies

> "Man is a Slow, Sloppy, and Brilliant Thinker; Computers are Fast, Accurate, and Stupid!" — John Pfeiffer, Fortune, 1961

## What does it mean to understand language?
### What kinds of predictions do people make about language?

- Different datasets often justify different styles of neural network training according to the challenges hidden in the data.

<div style="text-align:center;"><img style="width:33%;" src="static/imgs/11/domains_intersections.png"/></div>

- **NLP: Natural Language Processing is a much older field that overlaps deep learning**
- This field is dedicated exclusively to the automated task of understanding human language.

## Natural Language Processing (NLP)
### NLP is divided into a collection of tasks and challenges

- Here are a few types of classification problems that are common in NLP:
    - Using the Characters of a document to predict where words start and end.
    - Using the Words of a document to predict where sentences start and end.
    - Using the Words in a sentence to predict the part of speech of each word.
    - Using the Words of a sentence to predict where phrases start and end.
    - Using words in a sentence to predict where named entities (person, place, thing) references start and end.
    - Using sentences in a document to predict which pronouns refer to the same person/place/thing.
    - Using words in a sentence to predict the sentiment of a sentence.
- NLP tasks seek to do one of three things:
    - **label a region of text**.
    - **Link two or more regions of Text**.
    - **Try to fill in missing information based on Context**.
- Until recently, most of the state-of-the-art (SoTA) NLP algorithms were **advanced, probabilistic, non-parametric** models (but not Deep Learning).
- The recent development and popularization of two major neural algorithms have swept the field of NLP:
    - **Neural Word Embeddings**.
    - **Recurrent Neural Networks**.
- NLP plays a very special role in **AGI** (Artificial General Intelligence), because **language is the bedrock of conscious logic and communication in humans**.

## Supervised NLP
### Words go in, & predictions come out

- Up until now, we represented inputs as numbers.
    - **But NLP uses text as input. How do you process it?**
- Because NNs only map input numbers to output numbers, we need to convert our words into their corresponding numerical representation.
    - As it turns out, how we do this is extremely important!
    
<div style="text-align:center;"><img style="width:75%;" src="static/imgs/11/input_text.png"/></div>

- To find the optimal numerical representation for text, we need to look at the underlying input-to-output problem. Let's take an example:

## IMDB Movie Reviews Dataset
### You can predict whether people post positive/negative reviews

- The IMDB Reviews Dataset is a collection of Review/Rating Pairs that often looks like the following:

> "This Movie was terrible, The Plot was Dry, The acting unconvincing, and I spilled popcorn on my shirt!" — Rating: 1 Stars.

- The entire dataset consists of around 50K reviews
    - The **Input Reviews are usually a few sentences** & the **Output rating is between 1 and 5 stars**.
    - It should be obvious that this sentiment dataset might be very different from other sentiment datasets, such as product reviews or hospital patient reviews.
- Data Processing:
    - You'll adjust the range of stars from 1 to 5 into 0 to 1.
        - So you can use Binary Softmax (Sigmoid).
    - The input data is a list of characters, which presents a few problems:
        - The input data is text instead of numbers.
        - **Input is Variable-Length Text.**
- "What about the Input Text will have Correlation with the Output?"
    - Representing that property might work well.
    - I wouldn't expect any characters (in a list of characters) to have correlation with the output (rating).
    - But several words would have a bit of correlation with the output rating
        - "Terrible", "Unconvincing" are such word examples.
    - These words have significant **negative** correlation with the rating.
        - By **Negative**, I mean as the frequency of these words increases, ratings tend to decrease in number of stars.
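
As a rough illustration of the points above, the sketch below rescales star ratings onto the 0 to 1 range and measures how the presence of a word correlates with the rescaled rating. The four reviews and their ratings are invented for illustration (they are not taken from the IMDB data), and `presence` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical toy reviews with made-up star ratings (not from the IMDB data).
reviews = [
    ("terrible plot and unconvincing acting", 1),
    ("terrible pacing but decent acting", 2),
    ("decent plot and fine acting", 4),
    ("wonderful plot and wonderful acting", 5),
]

stars = np.array([s for _, s in reviews], dtype=float)
targets = (stars - 1) / 4  # rescale 1..5 stars onto 0..1

def presence(word):
    """Return 1.0 for each review that contains `word`, else 0.0."""
    return np.array([float(word in text.split()) for text, _ in reviews])

for word in ("terrible", "wonderful"):
    r = np.corrcoef(presence(word), targets)[0, 1]
    print(word, round(float(r), 2))  # negative for "terrible", positive for "wonderful"
```

On data like this, a word whose presence coincides with low ratings gets a negative coefficient, which is the sense of "negative correlation" used above.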

## Capturing Word Correlation in Input Data
### Bag of words: Given a review's Vocabulary, predict the sentiment

In [58]:
import numpy as np
import re
import pandas as pd

In [8]:
IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv'

In [25]:
!ls $IMDB_PATH

/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv


In [32]:
df = pd.read_csv(IMDB_PATH, encoding="ISO-8859-1", index_col=0)  # added encoding to fix error
df.head(7)

Unnamed: 0,type,review,label,file
0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt
5,test,"A funny thing happened to me while watching ""M...",neg,10004_2.txt
6,test,This German horror film has to be one of the w...,neg,10005_2.txt


In [38]:
# let's take a look at one review:
df.loc[0].review, df.loc[0].label

("Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",
 'neg')

- What's commonly done in this case is to create a matrix where each row represents a review.
- **Each column represents whether a review contains a particular word in the vocabulary.**
- To create a vector for a review, you just loop over the content and put $1$s in places where the corresponding vocabulary words are present in the review.
- How big are these vectors?
    - Well, it depends on the global vocabulary of the reviews.
    - If you have 2,000 unique words, you need vectors of length 2,000.
- This form of storage, called **one-hot encoding**, is the most common way to store binary information, in our case, the presence/absence of particular vocabulary words from the text of a review.
- If our vocabulary has only 4 words, then the one-hot encoding might look like this:

In [39]:
one_hots = {}
one_hots['cat'] = np.array([1, 0, 0, 0])
one_hots['the'] = np.array([0, 1, 0, 0])
one_hots['dog'] = np.array([0, 0, 1, 0])
one_hots['sat'] = np.array([0, 0, 0, 1])

<div style="text-align:center;"><img style="width:75%;" src="static/imgs/11/one-hots.png"/></div>

In [40]:
sentence = ['the', 'cat', 'sat']
x = one_hots[sentence[0]] + one_hots[sentence[1]] + one_hots[sentence[2]]
print('Sent Encoding:' + str(x))

Sent Encoding:[1 1 0 1]


- We create a vector for each term in the vocabulary.
- This allows you to use vector addition to represent a set of words present in a sentence.
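
The same idea can be wrapped in a small helper that works for any sentence. This is a minimal sketch, assuming the four-word vocabulary used above; `encode` is a hypothetical name. It marks presence rather than adding vectors, so a repeated word still yields a 1 (plain vector addition would yield a 2).

```python
import numpy as np

vocab = ['cat', 'the', 'dog', 'sat']  # same order as the one-hot vectors above
word_to_index = {w: i for i, w in enumerate(vocab)}

def encode(words):
    """Return a binary presence vector over `vocab` for a list of words."""
    vec = np.zeros(len(vocab), dtype=int)
    for w in words:
        idx = word_to_index.get(w)
        if idx is not None:  # ignore out-of-vocabulary words
            vec[idx] = 1     # presence, not count
    return vec

print(encode(['the', 'cat', 'sat']))
print(encode(['the', 'dog', 'sat', 'the']))
```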

## Predicting Movie Reviews
### With the Previous Strategy and the previous network, you can predict sentiment

- Using the strategy we just identified, you can build a vector for each word in the sentiment dataset and use the previous two-layer network to predict the sentiment.
- I strongly recommend attempting this from memory.
- Open a new Jupyter Notebook, load in the dataset, build your one-hot vectors, and then build a neural network to predict the rating of each movie review (positive or negative).

In [1]:
import re
import numpy as np
import pandas as pd
from collections import Counter

IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/imdb_master.csv'

In [2]:
df = pd.read_csv(IMDB_PATH, encoding="ISO-8859-1", index_col=0)
df = df[df['label'].isin(['neg', 'pos'])]
all_reviews_text = " ".join(df.review.tolist())

In [3]:
# we get unique tokens
all_tokens = all_reviews_text.split(" ")
unique_tokens = [v for (v, _) in Counter(all_tokens).most_common(10000)]
len(all_tokens), len(unique_tokens)

(11557297, 10000)

In [37]:
# create a function out of it
def get_tokens(text):
    return list(set(text.split(" ")))

In [4]:
# create one-hot representations of each token
word_to_index, index_to_word = {}, {}
for i, word in enumerate(unique_tokens):
    word_to_index[word], index_to_word[i] = i, word

In [5]:
df['words_count'] = df['review'].apply(lambda x: len(x.split(" "))) 

In [6]:
df.describe()

Unnamed: 0,words_count
count,50000.0
mean,231.14594
std,171.326419
min,4.0
25%,126.0
50%,173.0
75%,280.0
max,2470.0


- **We will use one-hot vectors of size 10,000 (one position per vocabulary word).**
- This way the review length doesn't matter: we combine the one-hot vectors of the words in a review to get a single 10,000-dimensional representation of that review:

- Let's Preprocess the training data:

In [7]:
train = df[df.type == 'train'].copy()  # explicit copy, so later edits don't act on a slice of df

In [8]:
# we drop the columns we're not interested in
train = train.drop(columns=['type', 'file', 'words_count'])

In [9]:
# now we transform the label into a number (1 = positive, 0 = negative)
train['y'] = (train['label'] == 'pos').astype(int)
train = train.drop(columns=['label'])
  


In [11]:
# shuffle train now ..
df = train.sample(frac=1).reset_index(drop=True)

In [43]:
x, y = [], []
for _, r in df.iterrows():
    review, label = r['review'], r['y']
    one_hot = np.zeros(10000)
    tokens = get_tokens(review)
    for token in tokens:
        if token in word_to_index:
            one_hot[word_to_index[token]] = int(1)
    x.append(one_hot)
    y.append(label)

In [44]:
x, y = np.array(x), np.array(y)

In [49]:
x.shape, y.shape

((25000, 10000), (25000,))

Now we have the representations we need to move forward and create a dense neural network to train.

Do it From Memory Later..
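
For reference, here is one way such a dense network could look. It is a minimal sketch on a hypothetical four-word toy dataset rather than the real `(25000, 10000)` matrix, and the layer sizes, learning rate, cross-entropy loss, and full-batch updates are all assumptions chosen for brevity, not the setup from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: columns are presence flags for
# ['great', 'terrible', 'movie', 'plot']; labels are 1 = positive, 0 = negative.
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
y = np.array([[1.0], [1.0], [0.0], [0.0]])

hidden_size, lr, steps = 8, 0.5, 500
W0 = rng.uniform(-0.1, 0.1, size=(X.shape[1], hidden_size))
W1 = rng.uniform(-0.1, 0.1, size=(hidden_size, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(inputs):
    hidden = np.maximum(0.0, inputs @ W0)  # linear layer + ReLU
    return hidden, sigmoid(hidden @ W1)    # linear layer + sigmoid

losses = []
for _ in range(steps):
    hidden, pred = forward(X)
    # binary cross-entropy, averaged over the batch
    losses.append(float(-np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))))
    delta_out = (pred - y) / len(X)                   # gradient at the output pre-activation
    delta_hidden = (delta_out @ W1.T) * (hidden > 0)  # backpropagate through the ReLU
    W1 -= lr * hidden.T @ delta_out
    W0 -= lr * X.T @ delta_hidden

print(losses[0], losses[-1])  # the loss should shrink as training proceeds
```

Swapping `X` and `y` for the one-hot matrix and labels built above (and iterating in mini-batches) would be the natural next step.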

## Intro to an embedding layer
### Here is one more trick to make the network faster

<div style="text-align:center;"><img style="width:33%" src="static/imgs/11/dumb_network.png"></div>

- The first layer is the **dataset**.
- This is followed by what's called a **linear layer**.
- This is followed by a **ReLU layer**.
- Then comes another linear layer.
- Finally comes the output, which is the **prediction layer**.
- As it turns out, you can take a bit of a **shortcut** to **layer 1** by **replacing the first linear layer with an embedding layer.**
- Multiplying a vector of 1s and 0s by a matrix is mathematically equivalent to **summing the rows of that matrix** selected by the 1s.
- **We just sum the rows of W_0 that correspond to the words present in the review to form the "embedding layer"**.
- Thus, **it's much more efficient to select the relevant rows of W_0 and sum them than to do a big vector-matrix multiplication**.

<div style="text-align:center;"><img style="width:50%" src="static/imgs/11/embedding_layer.png"></div>

- Because the sentiment vocabulary is on the order of 70k words, most of the vector-matrix multiplication is spent multiplying zeros in the input vector by weights before summing them. Selecting and summing only the relevant rows skips that wasted work.
- The result is the same; the only difference is that summing a handful of rows is much **faster**.
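
The equivalence claimed above is easy to check numerically. The sketch below uses a small random matrix and hypothetical word indices (the sizes are arbitrary assumptions) and compares the full vector-matrix product with the row lookup and sum.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4
W0 = rng.normal(size=(vocab_size, hidden_size))

word_indices = [1, 4, 7]  # words present in a hypothetical review
one_hot = np.zeros(vocab_size)
one_hot[word_indices] = 1

dense = one_hot @ W0                     # full vector-matrix multiplication
embedded = W0[word_indices].sum(axis=0)  # select the rows and sum them

print(np.allclose(dense, embedded))  # True
```

The lookup touches 3 rows instead of all 10; with a vocabulary of tens of thousands of words and reviews of a few hundred words, the saving becomes substantial.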

In [3]:
import numpy as np
import sys

IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'

In [4]:
with open(IMDB_PATH, mode='r') as f:
    raw_reviews = f.readlines()

In [5]:
with open(IMDB_LABEL_PATH, mode='r') as f:
    raw_labels = f.readlines()

In [6]:
len(raw_reviews), len(raw_labels)

(25000, 25000)

In [7]:
# python's map object is an iterator
# you can also convert map objects to lists, tuples, ..
tokens = list(map(lambda x: set(x.split(" ")), raw_reviews))

In [8]:
# let's extract the vocab
vocab = set()
for sent in tokens:
    for word in sent:
        if (len(word)>0):
            vocab.add(word)
vocab = list(vocab)

In [9]:
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

In [10]:
# transform all reviews to vectors
input_dataset = list()
for sent in tokens:
    sent_indices = list()
    for word in sent:
        # skip tokens missing from the vocabulary (e.g. the empty strings filtered out above)
        if word in word2index:
            sent_indices.append(word2index[word])
    input_dataset.append(list(set(sent_indices)))

In [11]:
# same for target data
target_dataset = list()
for label in raw_labels:
    if label == "positive\n":
        target_dataset.append(1)
    else:
        target_dataset.append(0)

In [12]:
import numpy as np

np.random.seed(1)

In [13]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

In [14]:
lr, epochs = 0.01, 2
embedding_layer_size = 100

In [15]:
W0 = 0.2 * np.random.random((len(vocab), embedding_layer_size)) - 0.1
W1 = 0.2 * np.random.random((embedding_layer_size, 1)) - 0.1

In [16]:
# training loop
correct, total = (0, 0)

for epoch in range(epochs):
    
    # leave last 1000 for testing
    for i in range(len(input_dataset) - 1000):
        # Forward Propagation
        x, y = input_dataset[i], target_dataset[i]
        layer_1 = sigmoid(np.sum(W0[x], axis=0))
        layer_2 = sigmoid(layer_1.dot(W1))
        
        # Gradients Calc
        layer_2_delta = (layer_2 - y)
        layer_1_delta = layer_2_delta.dot(W1.T)
        
        # Backpropagation
        W0[x] -= layer_1_delta*lr  # update only corresponding embeddings (w/o attached input to gradient).
        W1 -= np.outer(layer_1, layer_2_delta) * lr
        
        # training accuracy
        if(np.abs(layer_2_delta) < 0.5):
            correct += 1
        total += 1
        
        if (i%1000 == 0):
            progress = str(i/float(len(input_dataset)))
            print('\rIter:', str(epoch), ' Progress:', progress[2:4], '.', progress[4:6], '% Training Accuracy:', str(correct/float(total)), '%')
            
    # test set evaluation
    correct, total = (0, 0)
    for i in range(len(input_dataset) - 1000, len(input_dataset)):
        x, y = input_dataset[i], target_dataset[i]
        layer_1 = sigmoid(np.sum(W0[x], axis=0))
        layer_2 = sigmoid(layer_1.dot(W1))
        if(np.abs(layer_2-y) < 0.5):
            correct += 1
        total += 1
    print("Test Accuracy: ", str(correct/float(total)))

Iter: 0  Progress: 0 .  % Training Accuracy: 0.0 %
Iter: 0  Progress: 04 .  % Training Accuracy: 0.45854145854145856 %
Iter: 0  Progress: 08 .  % Training Accuracy: 0.591704147926037 %
Iter: 0  Progress: 12 .  % Training Accuracy: 0.6691102965678107 %
Iter: 0  Progress: 16 .  % Training Accuracy: 0.7000749812546864 %
Iter: 0  Progress: 2 .  % Training Accuracy: 0.7204559088182364 %
Iter: 0  Progress: 24 .  % Training Accuracy: 0.7382102982836194 %
Iter: 0  Progress: 28 .  % Training Accuracy: 0.7546064847878875 %
Iter: 0  Progress: 32 .  % Training Accuracy: 0.7671541057367829 %
Iter: 0  Progress: 36 .  % Training Accuracy: 0.7763581824241751 %
Iter: 0  Progress: 4 .  % Training Accuracy: 0.785921407859214 %
Iter: 0

## Interpreting the Output
### What did the Neural Network learn along the way?

- The Network was looking for correlation between the input data points and the output data points.
- It's extremely beneficial to know what kind of patterns the network detected while training and took as signal for predicting sentiment.
- Just because the network was able to find correlation between the input and the output doesn't mean that it found every pattern of language.
- Understanding the difference between what the network can currently learn from datasets and what it should learn to truly understand language is essential for making progress toward artificial general intelligence.
- What about language was our network able to learn?
    - Let's start by considering what was **presented to the network**:
        - We presented each review's vocabulary and asked the network to classify it as positive or negative.
    - You'd expect the network to know which words have strong correlation with negative opinions and which are positive.
- But this isn't the whole story.

## Neural Architecture
### How did the choice of architecture affect what the network was able to learn?

- Hidden layers are about grouping input data points coming from the previous layer into n groups.
    - Each hidden neuron takes in a data point and asks "is this data point in my group?"
    - As the hidden layer learns, it searches for useful groupings.
- What are useful groupings?
    - The grouping must be useful in the prediction of the output label.
        - If it's not useful in predicting the output label, correlation summarization will never lead the layer to find that grouping.
    - A grouping is useful if it finds hidden and interesting structure in the data.
        - Bad groupings just memorize data.
        - Good groupings capture phenomena that are useful linguistically.
- For example, understanding the difference between "terrible" and "not terrible" is a powerful grouping.
- But because the input to the network is a vocabulary and not a sequence, "It is great, not terrible" will be interpreted exactly like "It is terrible, not great".
- If you can construct two examples that produce the same hidden-layer activations, where the pattern is present in the first example and absent in the second, then the network can't detect the pattern you're interested in.
- Two data points (movie reviews) get the same prediction if they subscribe to most of the same trained groupings.
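
The loss of word order can be seen directly in the input representation. This minimal sketch assumes a set-based tokenization similar to the one used in the training code above; `bag` is a hypothetical helper.

```python
def bag(sentence):
    """Lower-case, strip commas, and reduce a sentence to its set of words."""
    return set(sentence.lower().replace(",", "").split())

a = bag("It is great, not terrible")
b = bag("It is terrible, not great")

print(a == b)  # True: both sentences reach the network as the same input
```

Since the two inputs are identical, every layer after them is identical too, so no amount of training can make the network tell these reviews apart.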

### What should you see in the weights connecting words to hidden neurons?

- Words that have similar predictive power should subscribe to similar groups.

<div style="text-align:center;"><img style="width:25%" src="static/imgs/11/embedding_weights.png"></div>

- Words that have similar labels (positive or negative) will have similar weights.
- Words that subscribe to similar groups, having similar weights, will have similar linguistic meaning with regards to the task at hand (sentiment analysis).

## Comparing Word Embeddings
### How Can you Visualize Weight Similarity?

- You can get the embedding of each word by simply extracting the corresponding row from the first weight matrix.
- You do word-to-word comparison by simply calculating the Euclidean distance between the two vectors.

In [17]:
from collections import Counter
import math

In [20]:
def similar(target='beautiful'):
    target_index = word2index[target]
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = W0[index] - W0[target_index]
        squared_difference = raw_difference**2
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)

- This lets you easily find the words most similar to a target word. Examples:

In [21]:
print(similar('beautiful'))

[('beautiful', -0.0), ('beautifully', -0.7234740527429394), ('episodes', -0.7431900463968468), ('realistic', -0.757880033742247), ('beauty', -0.759955286643346), ('atmosphere', -0.7601443881195916), ('recommended', -0.7610825019283651), ('fun', -0.7701812509518946), ('great', -0.7834113731660364), ('bit', -0.7858516072936133)]


In [22]:
print(similar('terrible'))

[('terrible', -0.0), ('disappointment', -0.7148509282587314), ('annoying', -0.7671560931689875), ('boring', -0.7899973527799934), ('worse', -0.796032596035763), ('laughable', -0.797488012028028), ('avoid', -0.8111283880874369), ('dull', -0.8210389457432301), ('mess', -0.833565219519962), ('disappointing', -0.8393998496880327)]


In [23]:
print(similar('average'))

[('average', -0.0), ('manipulative', -0.6258354687765075), ('barbershop', -0.6512567886916747), ('gyneth', -0.6527740640628336), ('kolya', -0.6560375177956016), ('neverheless', -0.6569103115275596), ('broker', -0.6607965580519881), ('ghatak', -0.6613206616357246), ('triumf', -0.6639680365540314), ('shawn', -0.6645095432239714)]


In [24]:
print(similar('love'))

[('love', -0.0), ('know', -0.6807296076265122), ('classic', -0.7058610625390093), ('although', -0.7089204276967219), ('genius', -0.7132786925108467), ('touched', -0.7141661782554475), ('delicious', -0.7160007781267519), ('packed', -0.7161515900219529), ('satire', -0.7171565782119897), ('carrie', -0.7175034627599344)]


- This is a standard phenomenon in correlation summarization.
- The network seeks to create similar latent representations for words that point to the same target label, compressing the information it needs to arrive at the correct prediction.
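
The nearest-neighbour search can also be written without a Python loop over the vocabulary. The sketch below is one possible vectorized variant, run on hand-picked 2-D embeddings that are purely hypothetical (they are not the trained `W0`); `nearest` is an assumed name.

```python
import numpy as np

# Hypothetical 2-D embeddings, chosen by hand for illustration only.
words = ['good', 'great', 'bad', 'awful']
E = np.array([[0.9, 0.1],
              [1.0, 0.2],
              [-0.8, 0.0],
              [-1.0, 0.1]])
index = {w: i for i, w in enumerate(words)}

def nearest(target, k=2):
    """Return the k words closest to `target` (excluding itself) by Euclidean distance."""
    dists = np.linalg.norm(E - E[index[target]], axis=1)  # one distance per row
    order = np.argsort(dists)
    return [(words[i], float(dists[i])) for i in order if words[i] != target][:k]

print(nearest('good'))
print(nearest('awful'))
```

Broadcasting the subtraction over the whole matrix computes all distances in one call, which matters once the vocabulary has tens of thousands of rows.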

## What is the meaning of a neuron?
### Meaning is entirely based on the target labels being predicted

- Notice that "Beautiful" & "Atmosphere" are nearly identical, but **only in the context of sentiment prediction**.
    - On the other hand, their meanings are quite different (one is an adjective & the other is a noun).
- The meaning of a neuron in the network depends entirely on the target labels.
- The Neural Network is entirely ignorant of any other meaning outside the task it was trained on.
- How do you make the meaning of a neuron more broad?
    - Well, if you give it a task that requires broad understanding of language, it will learn new complexities and its neurons will become much more general.
- The Task you'll use to learn more interesting word embeddings is the "fill in the blank" task.
    - There is nearly infinite training data (the internet).
        - Which means infinite signal to the network.
    - Learning to fill in the blank requires at least some understanding of language in context.

## Filling in the Blank
### Learn Richer Word Meanings by having A Richer Signal

- This example uses almost the same previous architecture with minor modifications.
- You'll split the text into five-word phrases, then remove one word (the focus term), and train the network to predict the focus term.
- You'll use a technique called **negative sampling** to make the network train a bit faster.
- Consider that in order to predict the focus term, you need one label for each possible word.
    - This would require several thousand labels, which would cause the network to train slowly.
    - To overcome this, let's randomly ignore most of the labels for each forward propagation.
        - Although this seems crude, it's a technique that works well in practice.
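
The label-sampling step described above can be sketched in isolation. This is an illustrative version under stated assumptions: `sample_targets` is a hypothetical helper, the token ids are made up, and negatives are drawn uniformly from the token stream, so frequent words are picked more often and a negative can occasionally coincide with the true word.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_targets(true_index, corpus_indices, n_negative=5):
    """Return the true word index followed by n_negative randomly drawn indices."""
    negatives = rng.choice(corpus_indices, size=n_negative)
    return np.concatenate(([true_index], negatives))

corpus = np.array([0, 1, 1, 2, 2, 2, 3])  # hypothetical token ids
targets = sample_targets(true_index=3, corpus_indices=corpus)

labels = np.zeros(len(targets))
labels[0] = 1  # only the first entry is the true focus term

print(targets, labels)
```

Each training step then scores only these six candidates instead of every word in the vocabulary.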

In [1]:
import sys, random, math
from collections import Counter
import numpy as np

IMDB_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/reviews.txt'
IMDB_LABEL_PATH = '/Users/mohamedakramzaytar/data/2019/Q2/kaggle/IMDB/labels.txt'

In [2]:
np.random.seed(1)

In [3]:
random.seed(1)

In [4]:
with open(IMDB_PATH, 'r') as f:
    raw_reviews = f.readlines()

In [5]:
len(raw_reviews)

25000

In [6]:
tokens = list(map(lambda x: x.split(" "), raw_reviews))
len(tokens[0]), len(tokens[1]), len(tokens[2])

(185, 127, 537)

In [7]:
word_counter = Counter()

In [8]:
for review in tokens:
    for token in review:
        word_counter[token] -= 1

In [9]:
_ = word_counter.most_common()  # least common in this case.

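`most_common()` with no argument returns every token sorted from the highest count to the lowest; pass `n` (for example `most_common(10)`) to keep only the top `n`. Because the counts above are decremented rather than incremented, the order is reversed here: the least frequent tokens come first.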
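
A small, self-contained check of that behaviour, using a made-up sentence rather than the review data:

```python
from collections import Counter

tokens = "the cat sat on the mat the end".split()

counts = Counter(tokens)
top_two = counts.most_common(2)    # only the 2 most frequent entries
everything = counts.most_common()  # all entries, most frequent first

# Decrementing instead of incrementing flips the order:
negated = Counter()
for token in tokens:
    negated[token] -= 1
rarest_first = negated.most_common()  # counts closest to zero (rarest) come first

print(top_two)
print(rarest_first[-1])  # the most frequent token ends up last
```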

In [10]:
vocab = list(set(map(lambda x: x[0], word_counter.most_common())))

In [11]:
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

In [12]:
concatenated = list()
input_dataset = list()
for review in tokens:
    review_indices = list()
    for token in review:
        # every token is in the vocabulary by construction; the check is only a safeguard
        if token in word2index:
            review_indices.append(word2index[token])
            concatenated.append(word2index[token])
    input_dataset.append(review_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)

In [13]:
lr, epochs = (.05, 2)
hidden_size, window, negative = 50, 2, 5

In [14]:
W0 = (np.random.rand(len(vocab), hidden_size) - 0.5) * 0.2
W1 = np.random.rand(len(vocab), hidden_size)*0

In [15]:
W0.shape, W1.shape

((74075, 50), (74075, 50))

`W1` could simply be initialized with `np.zeros((len(vocab), hidden_size))` (note that the shape is passed as a single tuple).

In [16]:
layer_2_target = np.zeros(negative+1)
layer_2_target[0] = 1

In [17]:
def similar(target='beautiful'):
    target_index = word2index[target]
    
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = W0[index] - W0[target_index]
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)

In [18]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [19]:
for review_i, review in enumerate(input_dataset * epochs):
    for target_i in range(len(review)):
        # predict only a random subset, because it's really expensive to predict every vocab
        # We can't do a softmax over all possible words, we will predict for the target word + a subset of the total vocab
        target_samples = [review[target_i]] + list(concatenated[(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])
        
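        # get tokens to the left & right of the target word
        # note: the right slice ends at target_i+window, so it holds at most window-1 tokens;
        # use target_i+window+1 for a window that is symmetric with the left side
        left_context = review[max(0, target_i-window):target_i]
        right_context = review[target_i+1: min(len(review), target_i+window)]
        
        # feed forward
        # the hidden layer is the mean (rather than the sum) of the context embeddings, target word excluded
        layer_1 = np.mean(W0[left_context+right_context], axis=0)
        # each sampled word gets its own sigmoid (target vs. not target) instead of a softmax over the whole vocab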
        layer_2 = sigmoid(layer_1.dot(W1[target_samples].T))
        layer_2_delta = layer_2 - layer_2_target
        layer_1_delta = layer_2_delta.dot(W1[target_samples])
        
        # update weights
        W0[left_context+right_context] -= layer_1_delta*lr
        W1[target_samples] -= np.outer(layer_2_delta, layer_1)*lr
        
    if(review_i % 500 == 0):
        print('\rProgress:'+str(review_i/float(len(input_dataset)*epochs)) + " " + str(similar('terrible')))
print(similar('terrible'))

Progress:0.0 [('terrible', -0.0), ('cognac', -0.3738881216269329), ('derisive', -0.38706720946311435), ('grease', -0.38977976159778677), ('accessory', -0.39072454239376825), ('constructs', -0.3918349061764808), ('misreads', -0.3923191534753908), ('rambles', -0.39428401242305783), ('mecha', -0.39500026848095804), ('starlets', -0.39594037298290563)]
Progress:0.01 [('terrible', -0.0), ('ill', -0.6255941821256622), ('awesome', -0.6375994436921958), ('competent', -0.6407977227851339), ('troubled', -0.6531069241561729), ('jr', -0.6534538265536488), ('blockbuster', -0.6563001235939745), ('brief', -0.6606079018272691), ('marty', -0.6607438906298844), ('aubrey', -0.6627754935036241)]
Progress:0.02 [('terrible', -0.0), ('superb', -1.0578247838536894), ('awesome', -1.2031760650904362), ('compelling', -1.203498585352637), ('pointless', -1.2124866217538417), ('solid', -1.216145731204235), ('fantastic', -1.2247588874668456), ('sad', -1.2297465716290998), ('okay', -1.2341042974148884), ('perfect', -1

- The Word Embeddings get trained according to the task the neural network is trained on:
    - Sentiment Analysis: Embeddings are grouped together depending on how Positive/Negative they are
        - Or depending on how they affect a review being good or bad.
    - Filling the Blank: Embeddings are grouped together depending on how close/far they are when filling blanks.
        - Solve: "*I ___ You so much!*"
            - Possible Solution — "I hate You so much!"
            - Possible Solution — "I love You so much!"
        - In this sense, hate & love are pretty close!

## Meaning is Derived from Loss

<div style="text-align:center;"><img style="width:50%" src="static/imgs/11/loss_matters.png"></div>

- Before, words were clustered according to the likelihood that the review is positive/negative.
- Now, they are clustered based on the likelihood that they will occur in the same phrase.
    - Regardless of sentiment.
- The key takeaway is that even though you are training on the same dataset, using a very similar network architecture, you can influence what the network learns by changing the loss function (task).
- Even though it's looking at the same information, you can alter its learning behavior by simply changing the input/output structure.
- Let's call the process of choosing what the network should learn **Intelligence Targeting**.
- You can also change how the network measures error, its architecture, and its regularization; these are also ways of performing intelligence targeting.
- In deep learning research, all of the above techniques fall under the umbrella term **loss function construction**.

### Neural Networks don't really **LEARN** Data; they minimize Loss Functions
### The Choice of Loss Function Determines the Network's Knowledge

- Viewing learning as minimizing a loss function gives a broader understanding of how neural networks work.
- Different kinds of layers, activations, regularization techniques, and datasets aren't really that different.
    - For example: if the network is overfitting, you can augment the loss function by choosing simpler non-linearities, adding dropout, enforcing regularization, adding more data, and so on.
    - All of these techniques will have a similar effect on the loss function and the learning behavior.
- With learning, everything is contained within the loss function.
    - & **If something is going wrong, remember that the solution is in the Loss Function**.
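
To make "augmenting the loss function" concrete, here is a minimal sketch of one such change: adding an L2 penalty on the weights to a mean-squared-error term. The choice of MSE, the penalty strength `l2`, and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def loss(pred, target, weights, l2=0.0):
    """Mean squared error plus an optional L2 penalty on the weights."""
    data_term = np.mean((pred - target) ** 2)
    penalty = l2 * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty

pred, target = np.array([0.8, 0.3]), np.array([1.0, 0.0])
weights = [np.array([[0.5, -0.5]])]

print(loss(pred, target, weights))          # data term only
print(loss(pred, target, weights, l2=0.1))  # same predictions, larger loss
```

With the penalty in place, the optimizer is rewarded for keeping weights small as well as for fitting the data, which changes what the network ends up learning without touching the dataset.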

## King - Man + Woman ~= Queen
### Word Analogies are an interesting consequence of the previously trained network

- This represents one of the famous properties of word embeddings (or trained vectors).
- The task of filling in the blank creates an interesting property called "word analogies":
    - You can take the embeddings of different words and perform algebraic operations on them.

In [20]:
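def analogy(positive=('terrible', 'good'), negative=('bad',)):
    norms = np.sum(W0*W0, axis=1)  # squared L2 norm of each embedding
    norms.resize((norms.shape[0], 1))
    # note: this scales each row by its squared norm; dividing by np.sqrt(norms)
    # would give unit-length rows instead
    normed_weights = W0 * norms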
    query_vect = np.zeros(W0.shape[1])
    for word in positive:
        query_vect += normed_weights[word2index[word]]
    for word in negative:
        query_vect -= normed_weights[word2index[word]]
    
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = W0[index] - query_vect
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)[1:]

In [26]:
analogy(['elizabeth', 'he'], ['she'])

[('christopher', -196.14901513962437),
 ('tom', -196.63100891837615),
 ('william', -196.71139862402313),
 ('mr', -196.78191513069368),
 ('simon', -196.82182610970864),
 ('it', -196.82971818242368),
 ('him', -196.83517620210762),
 ('been', -196.85949829130325),
 ('gary', -196.90340914909922)]

## Word Analogies
### Linear Compression of an Existing Property in the Data

- Even though the discovery of word analogies was initially very exciting, the deep learning NLP paradigm didn't build on it to discover new features of this kind. Instead, current language models rely on ~~Recurrent Neural Networks to do language modeling~~
    - *The book was released before ELMo, BERT, GPT-2, etc., which is why the author considers RNNs to be the SoTA in language modeling.*
- Nevertheless, we need to understand why this property emerged from training the network to fill in the blank.
- If you imagine the word embeddings as having two dimensions, it becomes easier to see why word analogies work:

<div style="text-align:center;"><img style="width:33%" src="static/imgs/11/word_analogies.png"></div>

In [27]:
king = [.6, .1]
man = [.5, .0]
woman = [.0, .8]
queen = [.1, 1.0]

In [34]:
[x_i - y_i for (x_i, y_i) in zip(king, man)]

[0.09999999999999998, 0.1]

In [35]:
[x_i - y_i for (x_i, y_i) in zip(queen, woman)]

[0.1, 0.19999999999999996]

- The relative usefulness to the final prediction between "Man"/"King" & "Woman"/"Queen" is similar. Why?
    - The difference between "King" and "Man" leaves a vector of royalty.
    - **There are a bunch of male/female related words in one grouping, & a bunch of king/queen related words in another grouping.**
        - Because the relative distance between the two groups is roughly constant, the offsets between corresponding items in each grouping are roughly the same.
    - This can be traced back to the chosen loss.
- **This is more about the properties of language than deep learning**.
    - Any linear compression of these co-occurrence statistics will yield the same results.
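
Using the four toy 2-D vectors defined above, the analogy can be worked through end to end. This is a sketch on those hand-picked values only; the nearest-neighbour search over a four-word dictionary stands in for the search over the full vocabulary.

```python
import numpy as np

vectors = {
    'king':  np.array([0.6, 0.1]),
    'man':   np.array([0.5, 0.0]),
    'woman': np.array([0.0, 0.8]),
    'queen': np.array([0.1, 1.0]),
}

# king - man + woman lands at roughly [0.1, 0.9]
query = vectors['king'] - vectors['man'] + vectors['woman']

# Euclidean distance from the query to every word
distances = {w: float(np.linalg.norm(v - query)) for w, v in vectors.items()}
best = min(distances, key=distances.get)

print(query, best)  # the closest word is 'queen'
```

The query does not hit "queen" exactly, because the "royalty" offset differs slightly between the two pairs, but "queen" is still the nearest vector.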

## Summary
### You've learned a lot about Word embeddings & the impact of loss on learning

- We've unpacked the principles of using neural networks to model language.
- **I encourage you to build the examples of this chapter from scratch.**