<a href="https://colab.research.google.com/github/ShaunakSen/NLP/blob/master/Forked_Word_Representations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An introduction to NLP, Pre-processing, and Word Representations

### Workshop contents:
- [A very brief overview of pre-neural NLP](#Introduction-to-NLP)
- [How do we get computers to represent our data?](#How-do-we-get-computers-to-represent-our-natural-language-data?)
- [Distributed Representations](#Distributed-Representations)
- [Pre-processing and tokenization: Cleaning your corpus](#Pre-processing-and-tokenization:-Cleaning-your-corpus)
- [Word2Vec](#Word2Vec)
    - [Skipgram](#Skipgram)
- [SpaCy](#SpaCy)

# Introduction to NLP

_Natural language processing_ (NLP) is an interdisciplinary field concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP covers a wide variety of tasks, such as translation, natural language generation, entity detection, and sentiment classification.

Research into NLP started in the 1950s. Even though today's systems are vastly better than their predecessors, NLP is still considered "unsolved" and modern research is evolving at a rapid pace. One reason it is challenging is the ambiguity of language. For example, the word "mean" has multiple definitions: "signify", "unkind", and "average". We also use idioms a lot: phrases whose meaning does not follow directly from the sequence of words (e.g. "over the moon"). In translation specifically, word order differences are also a problem:
- DE: Gestern bin ich in London gewesen
- Word-By-Word EN: ‘Yesterday have I to London been’
- Ground Truth EN: Yesterday I have been to London 

The history of NLP can essentially be broken down into three approaches: rule-based, statistical, and neural:
- Rule-based (1950 - 1999)
    - Hand-crafted rules to model linguistic intuitions
- Corpus-based
    - Example-based (EBMT): machine translation by analogy. If this segment has been translated before, copy its translation
    - Statistical (2000-2015)
        - Statistical models used to push the “translation by analogy” paradigm to its extreme
        - Language-independent
        - Low cost
    - Neural, a.k.a. deep learning (2014-)
        - Learning __representations (features) & models__ from data
   
Neural approaches to NLP in particular allow us to solve the following kinds of problems:
![types_rnn](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/types_rnn.png?raw=1)

# How do we get computers to represent our natural language data?
Let's assume we're working with a many-to-one classification problem, for example classifying user-written movie reviews on a scale of 1 to 5. How can we feed raw text into a model?

![1_raw_text_model](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/1_raw_text_model.png?raw=1)


One possible solution is to one-hot encode our **corpus** based on our **vocabulary**:
![2_corpus_vocab](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/2_corpus_vocab.png?raw=1)

Let's build this:


In [1]:
corpus=[["this movie had a brilliant story line with great action"],
["some parts were not so great but overall pretty ok"],
["my dog went to sleep watching this piece of trash"]] # we'll cover how to deal with emoji later

# Let's tokenize our corpus.
# Tokenization involves splitting the corpus from the sentence level to the word level

def get_tokenized_corpus(corpus):
    return [sentence[0].split(" ") for sentence in corpus]

tokenized_corpus = get_tokenized_corpus(corpus)
print(tokenized_corpus)

[['this', 'movie', 'had', 'a', 'brilliant', 'story', 'line', 'with', 'great', 'action'], ['some', 'parts', 'were', 'not', 'so', 'great', 'but', 'overall', 'pretty', 'ok'], ['my', 'dog', 'went', 'to', 'sleep', 'watching', 'this', 'piece', 'of', 'trash']]


Below is some code to load in a __vocabulary__. A vocabulary is simply a __list of unique words__ which we can perform lookups against.

In [2]:
# Use a context manager so the file is closed once we've read it
with open("google-10000-english.txt", "r") as vocab_file:
    vocabulary = [word.strip() for word in vocab_file]
print("First five entries of vocabulary:", vocabulary[0:5])

First five entries of vocabulary: ['the', 'of', 'and', 'to', 'a']


![3d_onehot](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/one_hot31010000.png?raw=1)

In [3]:
# Let's one-hot the tokenized corpus
import numpy as np

def get_onehot_corpus(corpus, tokenized_corpus, vocabulary):

    CORPUS_SIZE = len(corpus)
    MAX_LEN_SEQUENCE = max(len(x) for x in tokenized_corpus)
    VOCAB_LEN = len(vocabulary)

    ##Create a 3-d array of zeros
    onehot_corpus = np.zeros((CORPUS_SIZE, MAX_LEN_SEQUENCE, VOCAB_LEN))
    for corpus_idx, tokenized_sentence in enumerate(tokenized_corpus):
        print (corpus_idx, tokenized_sentence)
        for sequence_idx, token in enumerate(tokenized_sentence):
            ##find the token index
            ##change the relevant value in the numpy array to one
            token_vocab_idx = vocabulary.index(token)
            onehot_corpus[corpus_idx, sequence_idx, token_vocab_idx] = 1
            
    return onehot_corpus

onehot_corpus = get_onehot_corpus(corpus, tokenized_corpus, vocabulary)
print(onehot_corpus.shape) # 3, 10, vocab_len

#3 for the corpus length i.e we have 3 sentences
#10 for the max sentence length

onehot_corpus[2, 5] # the one-hot entry for the 6th word in 3rd sentence


0 ['this', 'movie', 'had', 'a', 'brilliant', 'story', 'line', 'with', 'great', 'action']
1 ['some', 'parts', 'were', 'not', 'so', 'great', 'but', 'overall', 'pretty', 'ok']
2 ['my', 'dog', 'went', 'to', 'sleep', 'watching', 'this', 'piece', 'of', 'trash']
(3, 10, 10000)


array([0., 0., 0., ..., 0., 0., 0.])

In [0]:
new_reviews = [["good movie"], ["this movie had a significant amount of flaws"]]
corpus.extend(new_reviews)

tokenized_corpus = get_tokenized_corpus(corpus)
onehot_corpus = get_onehot_corpus(corpus, tokenized_corpus, vocabulary)


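Oh... `vocabulary.index` raises a `ValueError` as soon as it meets a token that isn't in the vocabulary. How do you think we can get around this? Discuss with someone around you the issues of one-hot encoding.

A second issue is similarity. Every pair of distinct one-hot vectors is equally far apart, so "dog" is exactly as close to "hound" as it is to "potato".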
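
The sketch below makes the dog/hound/potato comparison concrete. It one-hot encodes a tiny vocabulary and measures the Euclidean distance between pairs of words. The three-word `toy_vocabulary` and the helpers `one_hot` and `euclidean` are assumptions made purely for illustration; they are not part of the workshop code.

```python
import math

# Hypothetical three-word vocabulary, chosen only for this illustration
toy_vocabulary = ["dog", "hound", "potato"]

def one_hot(word, vocabulary):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dog, hound, potato = (one_hot(w, toy_vocabulary) for w in toy_vocabulary)

dog_hound = euclidean(dog, hound)
dog_potato = euclidean(dog, potato)
print(dog_hound, dog_potato)
```

Both distances come out identical, because any two distinct one-hot vectors differ in exactly two positions. The encoding tells a model that words are different, but nothing about how different they are.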

# Distributed Representations
__"You shall know a word by the company it keeps" - Firth (1957).__
What does this mean?

Before we focus on words themselves, let's look at concepts with a 2-dimensional representation.
Animal cuteness and size:  

<table>
  <tr>
    <th>Animal</th>
    <th>Cuteness</th>
    <th>Size</th>
  </tr>
  <tr>
    <td>Lion</td>
    <td>80</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Elephant</td>
    <td>75</td>
    <td>95</td>
  </tr>
  <tr>
    <td>Hyena</td>
    <td>10</td>
    <td>30</td>
  </tr>
  <tr>
    <td>Mouse</td>
    <td>60</td>
    <td>8</td>
  </tr>
  <tr>
    <td>Pig</td>
    <td>30</td>
    <td>30</td>
  </tr>
  <tr>
    <td>Horse</td>
    <td>50</td>
    <td>65</td>
  </tr>
  <tr>
    <td>Dolphin</td>
    <td>90</td>
    <td>45</td>
  </tr>
  <tr>
    <td>Wasp</td>
    <td>1</td>
    <td>1</td>
  </tr>
  <tr>
    <td>Giraffe</td>
    <td>60</td>
    <td>80</td>
  </tr>
  <tr>
    <td>Dog</td>
    <td>95</td>
    <td>20</td>
  </tr>
  <tr>
    <td>Alligator</td>
    <td>8</td>
    <td>40</td>
  </tr>
  <tr>
    <td>Mole</td>
    <td>30</td>
    <td>12</td>
  </tr>
  <tr>
    <td>Black Widow</td>
    <td>100</td>
    <td>30</td>
  </tr>
</table>



In [5]:
import plotly.graph_objects as go

animal_labels = ["Lion", "Elephant", "Hyena", "Mouse", "Pig", "Horse", "Dolphin", "Wasp", "Giraffe", "Dog", "Alligator", "Mole", "Scarlett Johansson"]
animal_cuteness = [80, 75, 10, 60, 30, 50, 90, 1, 60, 95, 8, 30, 100]
animal_size = [50, 95, 30, 8, 30, 65, 45, 1, 80, 20, 40, 12, 30]



fig = go.Figure(data=[go.Scatter(
    x=animal_cuteness, y=animal_size,
    text=animal_labels,
    mode='markers+text',
    marker_size=70)
])

fig.update_layout(
    title="Animal Cuteness vs Animal Size",
    xaxis_title="Animal Cuteness",
    yaxis_title="Animal Size",
)

fig.show()


The plot shows us that the closest animal to the Alligator is the Hyena based on size and cuteness. Let's calculate the (Euclidean) distance between the two!


In [0]:
import math
def distance_2d(x1, y1, x2, y2):
    return math.sqrt((x1-x2)**2 + (y1-y2)**2)

In [0]:
# Helper function to return us the animal information we need
def get_animal_info(animal_name):
    if animal_name in animal_labels:
        animal_idx = animal_labels.index(animal_name)
        animal_cuteness_ = animal_cuteness[animal_idx]
        animal_size_ = animal_size[animal_idx]
        return animal_cuteness_, animal_size_
    else:
        return False

In [9]:
alligator_cuteness, alligator_size = get_animal_info("Alligator")
hyena_cuteness, hyena_size = get_animal_info("Hyena")
elephant_cuteness, elephant_size = get_animal_info("Elephant")


print("DISTANCE BETWEEN ALLIGATOR AND HYENA", distance_2d(alligator_cuteness, alligator_size, hyena_cuteness, hyena_size))
print("DISTANCE BETWEEN ALLIGATOR AND ELEPHANT", distance_2d(alligator_cuteness, alligator_size, elephant_cuteness, elephant_size))

DISTANCE BETWEEN ALLIGATOR AND HYENA 10.198039027185569
DISTANCE BETWEEN ALLIGATOR AND ELEPHANT 86.68333173107735


This visual representation allows us to reason about many things. We can ask, for example, what's halfway between a Wasp and a Horse (roughly a Pig). We can also ask about differences. For example, the difference between a Mole and a Mouse is 30 units of cuteness and a few units of size.

The concept of difference allows us to reason about analogies. This means that we can say that animal_1 is to animal_2 the way that animal_3 is to animal_4. For example, we can say `Pig is to Horse as Mouse is to ???` - Lion


In [10]:
horse_cuteness, horse_size = get_animal_info("Horse")
pig_cuteness, pig_size = get_animal_info("Pig")

fig.add_trace(go.Scatter(x=[pig_cuteness, horse_cuteness], y=[pig_size, horse_size]))
fig.update_layout(showlegend=False)
fig.show()

In [11]:
# let's get the x and y difference between Pig and Horse
# (signed, Horse minus Pig, so the offset keeps its direction)
pig_horse_diff_cuteness = horse_cuteness - pig_cuteness
pig_horse_diff_size = horse_size - pig_size

# now let's apply the analogy to Mouse:
# Pig is to Horse as Mouse is to ???
mouse_cuteness, mouse_size = get_animal_info("Mouse")
## plot the new analogy
fig.add_trace(go.Scatter(x=[mouse_cuteness, mouse_cuteness+pig_horse_diff_cuteness], y=[mouse_size, mouse_size+pig_horse_diff_size]))
fig.update_layout(showlegend=False)
fig.show()

# Lion!
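
Rather than reading the answer off the plot, we can also compute it. The following is a minimal sketch under a few assumptions: the `animals` dictionary repeats the cuteness and size values of twelve animals from the plotting cell, and `solve_analogy` is a hypothetical helper that adds the signed offset `b - a` to `c` and returns the nearest remaining animal.

```python
import math

# (cuteness, size) values repeated from the plotting cell above
animals = {
    "Lion": (80, 50), "Elephant": (75, 95), "Hyena": (10, 30), "Mouse": (60, 8),
    "Pig": (30, 30), "Horse": (50, 65), "Dolphin": (90, 45), "Wasp": (1, 1),
    "Giraffe": (60, 80), "Dog": (95, 20), "Alligator": (8, 40), "Mole": (30, 12),
}

def solve_analogy(a, b, c):
    """Answer 'a is to b as c is to ???' with the animal nearest to c + (b - a)."""
    target = tuple(
        c_i + (b_i - a_i)
        for a_i, b_i, c_i in zip(animals[a], animals[b], animals[c])
    )
    # The three animals in the question are not allowed as the answer
    candidates = {name: xy for name, xy in animals.items() if name not in (a, b, c)}
    return min(candidates, key=lambda name: math.dist(candidates[name], target))

answer = solve_analogy("Pig", "Horse", "Mouse")
print(answer)
```

Excluding the three query animals matters: without it, the nearest point to the target is often one of the inputs themselves. The same exclusion appears later when we search for the closest word vector.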

These concepts also work in higher dimensions. Before focusing on words themselves, let's briefly add another dimension to our dataset and look at one or two more concepts:


<table>
  <tr>
    <th>Animal</th>
    <th>Cuteness</th>
    <th>Size</th>
    <th>Ferocity</th>
  </tr>
  <tr>
    <td>Lion</td>
    <td>80</td>
    <td>50</td>
    <td>85</td>
  </tr>
  <tr>
    <td>Elephant</td>
    <td>75</td>
    <td>95</td>
    <td>20</td>
  </tr>
  <tr>
    <td>Hyena</td>
    <td>10</td>
    <td>30</td>
    <td>90</td>
  </tr>
  <tr>
    <td>Mouse</td>
    <td>60</td>
    <td>8</td>
    <td>1</td>
  </tr>
  <tr>
    <td>Pig</td>
    <td>30</td>
    <td>30</td>
    <td>10</td>
  </tr>
  <tr>
    <td>Horse</td>
    <td>50</td>
    <td>65</td>
    <td>30</td>
  </tr>
  <tr>
    <td>Dolphin</td>
    <td>90</td>
    <td>45</td>
    <td>20</td>
  </tr>
  <tr>
    <td>Wasp</td>
    <td>1</td>
    <td>1</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Giraffe</td>
    <td>60</td>
    <td>80</td>
    <td>65</td>
  </tr>
  <tr>
    <td>Dog</td>
    <td>95</td>
    <td>20</td>
    <td>15</td>
  </tr>
  <tr>
    <td>Alligator</td>
    <td>8</td>
    <td>40</td>
    <td>90</td>
  </tr>
  <tr>
    <td>Mole</td>
    <td>30</td>
    <td>12</td>
    <td>15</td>
  </tr>
  <tr>
    <td>Black Widow</td>
    <td>100</td>
    <td>30</td>
    <td>69</td>
  </tr>
</table>


In [13]:
# just to remind us ;)
animal_labels = animal_labels
animal_cuteness = animal_cuteness
animal_size = animal_size
animal_ferocity = [85, 20, 90, 1, 10, 30, 20, 100, 65, 15, 90, 15, 69]


# nothing particularly important... just used for visualisation purposes
import statistics
animal_mean_stats = [statistics.mean(k) for k in zip(animal_cuteness, animal_size, animal_ferocity)]
animal_mean_stats, animal_labels

([71.66666666666667,
  63.333333333333336,
  43.333333333333336,
  23,
  23.333333333333332,
  48.333333333333336,
  51.666666666666664,
  34,
  68.33333333333333,
  43.333333333333336,
  46,
  19,
  66.33333333333333],
 ['Lion',
  'Elephant',
  'Hyena',
  'Mouse',
  'Pig',
  'Horse',
  'Dolphin',
  'Wasp',
  'Giraffe',
  'Dog',
  'Alligator',
  'Mole',
  'Scarlett Johansson'])

In [14]:
fig = go.Figure(data=[go.Scatter3d(
    x=animal_cuteness, y=animal_size, z=animal_ferocity,
    text=animal_labels,
    mode='markers+text',
    marker=dict(
        size=12,
        color=animal_mean_stats,                # set color to an array/list of desired values
        colorscale='Viridis',   # choose a colorscale
        opacity=0.8
    ))
])

fig.update_layout(title="Animal Cuteness vs Animal Size vs Animal Ferocity",
    scene = dict(
    xaxis_title='Animal Cuteness',
    yaxis_title='Animal Size',
    zaxis_title='Animal Ferocity')
)


fig.show()


In [0]:
# Let's redefine this function to return ferocity as well
def get_animal_info(animal_name):
    if animal_name in animal_labels:
        animal_idx = animal_labels.index(animal_name)
        animal_cuteness_ = animal_cuteness[animal_idx]
        animal_size_ = animal_size[animal_idx]
        animal_ferocity_ = animal_ferocity[animal_idx]
        return animal_cuteness_, animal_size_, animal_ferocity_
    else:
        return False

We'll demonstrate how to calculate the distance between vectors in 3d space, show the closest `n` points to a given point (animal) and also show how analogies work in 3 dimensions. The point of this exercise is to give you an intuition for how things work analogously in higher-dimensional space.

![nd_distance](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/nd_distance.png?raw=1)

In [25]:
# Euclidean distance
def distance(coord1, coord2):
    # sum the squared per-dimension differences, then take the square root
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(coord1, coord2)))
    

distance((10, 30, 90), (20, 40, 90))

14.142135623730951

In [16]:
hyena_coords = get_animal_info("Hyena")
hyena_coords

(10, 30, 90)

In [26]:
hyena_coords = get_animal_info("Hyena")
alligator_coords = get_animal_info("Alligator")
print("Distance between Hyena and Alligator:", distance(hyena_coords, alligator_coords))

dog_coords = get_animal_info("Dog")
print("Distance between Hyena and Dog:", distance(hyena_coords, dog_coords))


Distance between Hyena and Alligator: 10.198039027185569
Distance between Hyena and Dog: 113.79806676741042


In [0]:
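# Closest
# Materialise the zip as a list: a bare zip iterator is exhausted after one pass,
# so a second call to closest_to would otherwise see no animals at all
animal_info = list(zip(animal_labels, animal_cuteness, animal_size, animal_ferocity))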

def closest_to(animal_name, n=3):
    primary_animal_stats = get_animal_info(animal_name)
    distances_from_animal = []
    for label, cuteness, size, ferocity in animal_info:
        
        if label==animal_name:
            continue
            
        secondary_animal_stat = (cuteness, size, ferocity)
        distances_from_animal.append((label, distance(primary_animal_stats, secondary_animal_stat)))
        
    sorted_distances_from_animal = sorted(distances_from_animal, key=lambda x: x[1])
    
    return sorted_distances_from_animal[:n]
    
closest_to("Horse")

What if we could do the same with words? Before introducing the methodology for obtaining these vectors, it's worth discussing why word vectors are effective. One of the main reasons adoption is so widespread is that they can be **pretrained**. This means we can take word vectors which have been trained on any corpus (e.g. Wikipedia, Twitter, medical journals, ancient religious texts etc.) and use them in a potentially domain-specific downstream task. For example, word vectors obtained by training on ancient religious texts might be used to classify ancient religious texts by religion, and vectors trained on Twitter data might be used to build conversational agents. Each word vector is just a vector representing that word, and the training algorithm learns the embeddings for all the words from their context in the training corpus. The same word can therefore end up with different vectors on different corpora: the "meaning" of the word _lit_ learned from a religious text would differ from the one learned from Twitter.

Before the methodology, let's analyse some of these pre-trained embeddings. We'll use King, Queen, Man, Woman, Water, and Earth as an example.

In [0]:
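# Each line of the GloVe file is a token followed by its vector components
with open("glove_50d_TRUNCATED.txt", "r", encoding="utf8") as glove_file:
    glove_vectors_list = [word_and_vector.strip() for word_and_vector in glove_file]
# np.float was removed in NumPy 1.24; the built-in float is the equivalent dtype
glove_vectors = {obj.split()[0]: np.asarray(obj.split()[1:], dtype=float) for obj in glove_vectors_list}
print(glove_vectors["the"])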

In [0]:
king_vector = glove_vectors["king"]
queen_vector = glove_vectors["queen"]
man_vector = glove_vectors["man"]
woman_vector = glove_vectors["woman"]
water_vector = glove_vectors["water"]
earth_vector = glove_vectors["earth"]

In [0]:
from plotly.subplots import make_subplots

# Let's visualise these vectors and colour them based on their elemental values
# Each vector is min-max normalised, so red is its lowest value, white the midpoint, blue its highest

# vectors is a list of tuples: [(vector_name, vector)]
def plot_vectors(vectors):
    
    fig = make_subplots(rows=len(vectors), cols=1)
    
    for i, vector_tuple in enumerate(vectors):
        vector_name = vector_tuple[0]
        vector = vector_tuple[1]

        normalized_vector = (vector-np.min(vector))/(np.ptp(vector))
        x = ["dimension_"+str(_) for _ in range(50)]
        y = [1] * 50
        
        fig.add_trace(
            go.Bar(x=x, y=y, marker=dict(
            color=normalized_vector,                # set color to an array/list of desired values
            colorscale='rdbu',   # choose a colorscale
            opacity=0.8,
        )),
            row=i+1, col=1
        )
        
        fig.update_yaxes(title_text=vector_name, row=i+1, col=1)
        
    
    fig.update_layout(height=175*len(vectors), xaxis_showgrid=False, yaxis_showgrid=False, showlegend=False)
    fig.update_yaxes(showticklabels=False)
    fig.update_xaxes(showticklabels=False)
    fig.show()

plot_vectors([("King", king_vector), ("Queen", queen_vector), ("Man", man_vector), ("Woman", woman_vector), ("Water", water_vector), ("Earth", earth_vector)])

Similar to what was done with the animals previously, we can apply analogies to word vectors. In high-dimensional space it is preferable to use **cosine** distance, which compares the direction of two vectors rather than their magnitude.

![cosine_distance](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/cosine_distance.png?raw=1)
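
To see what the formula does, here is a minimal pure-Python sketch of cosine distance, defined as one minus the cosine of the angle between two vectors. This is the same definition that `scipy.spatial.distance.cosine` uses; the function name `cosine_distance` and the example vectors are assumptions chosen for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v): 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

same_direction = cosine_distance([1, 2, 3], [2, 4, 6])   # parallel, different lengths
orthogonal = cosine_distance([1, 0], [0, 1])             # perpendicular
opposite = cosine_distance([1, 0], [-1, 0])              # pointing in opposite directions
print(same_direction, orthogonal, opposite)
```

Note that the first pair has a distance of (approximately) zero even though one vector is twice as long as the other. Only the direction matters, which is why cosine distance is a popular choice for comparing word vectors.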

Before applying/finding any analogies, we need a function which finds the closest vector to another vector:

In [0]:
# Use the cosine distance instead of the euclidean distance we coded up earlier on
from scipy.spatial.distance import cosine

def find_closest(to_find_closest, exclude_list):
    closest_distance = float("inf")
    closest_token = None

    for token, vector in glove_vectors.items():
        distance = cosine(to_find_closest, vector)
        if closest_distance > distance and token not in exclude_list:
            closest_distance = distance
            closest_token = token

    return closest_token

Find out the answer to "king - man + woman"

In [0]:
##

What about.. London is to England as Paris is to X?

In [0]:
##

Alongside analogies, we are also able to semantically 'reason':

In [0]:
print(find_closest(glove_vectors["france"] + glove_vectors["capital"], ["france", "capital"]))

#### The takeaway: Distributed vectors group similar words/objects together

# Pre-processing and tokenization: Cleaning your corpus
Now that we know how we want to represent words, let's do it! Before we do so, we need to clean our dataset. We're going to create our vocabulary from scratch using our corpus, so that no token in the corpus is out-of-vocabulary (OOV).

In [0]:
# our corpus is now in a format closer to what you would find in production: a flat list of strings
corpus = [sentence[0] for sentence in corpus]

more_movie_reviews = [
    "this was a BRILLIANT 👍 movie",
    "💩 film",
    "this was a terrible movie",
    "this was a treribel movie",
    "this was a good 👍 movie",
    "A moving story about U.S. wildlife.",
    "Wow. I had not expected wildlife in the US to be so diverse",
    "Wow - what a MOVIE 👍",
    "Us here at The Movie Reviewers found this movie exceedingly average",
    "The Polish people in this film... stunning 👍",
    "A bit of a polish to this movie would have made it great",
    "This film didn't exite me as much as much as the prequel.",
    "this movie doesn't live up to the hype of its trailer",
    "It's rare that a film is this good??",
    "This movie was 👍 💩"
]

corpus.extend(more_movie_reviews)
corpus

Like we did at the beginning of this notebook, we need to tokenize this corpus. Code up a function to do that.

In [0]:
def tokenize(corpus):
    ##split the reviews on whitespace
    return [review.split() for review in corpus]  # one possible solution

tokenized_corpus = tokenize(corpus)
print(tokenized_corpus)

What issues can you see? Let's count all the distinct tokens now:

In [0]:
distinct_tokens_count = {}
for t_review in tokenized_corpus:
    for token in t_review:
        if token not in distinct_tokens_count.keys():
            distinct_tokens_count[token] = 1
        else:
            distinct_tokens_count[token] += 1
            
for token, count in distinct_tokens_count.items():
    print("{:<14s} {:<10d}".format(token, count))

Ok.. that's a lot of information, let's just look at the key points:

In [0]:
for token, count in distinct_tokens_count.items():
    if "movie" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "film" in token.lower():
        print(token, count)        

for token, count in distinct_tokens_count.items():
    if "good" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "polish" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "wildlife" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "us" in token.lower() or "u.s." in token.lower():
        print(token, count)

Ok so let's quickly run through the issues here. Firstly, we see `movie`, `Movie` and `MOVIE`. How can we resolve this during our tokenization step? We could lowercase everything, but what issues can you see that causing? It erases semantic differences such as `Polish` vs `polish`. What about the two different instances of `film` (`film` and `film...`)? We could remove punctuation, but wouldn't this ruin the `U.S.` acronym? Another solution would be to count `...` as a token, or remove it, but keep a `.` as part of the token if it is immediately followed by a letter (i.e. no space).

This part of the pre-processing pipeline is called normalization. Let's do some basic normalization for now, and later on we'll use a library to do this for us (regex is a beast in itself). Our normalization step will be to lowercase everything and split on punctuation, which removes it from the tokens.

In [0]:
import re # regex

# A raw string (r"...") stops Python from interpreting the backslashes itself, so \s and \- reach the regex engine intact
re_punctuation_string = r"[\s,/.?\-']"  # split on spaces (\s), commas (,), slash (/), full stop (.), question marks (?), hyphens (-), and apostrophe (').
tokenized_corpus = []
for review in corpus:
    tokenized_review = re.split(re_punctuation_string, review) # [...] is a character class: it matches any single character listed inside it
    tokenized_review = list(filter(None, tokenized_review)) # remove empty strings from list 
    tokenized_corpus.append([token.lower() for token in tokenized_review]) # Lowercasing everything
        
print(tokenized_corpus)
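
As an aside, splitting on every full stop breaks `U.S.` into `u` and `s`. One way to avoid that is to match tokens rather than split on separators. The sketch below is an illustration under stated assumptions, not the approach this notebook goes on to use: `TOKEN_PATTERN` and `tokenize_keep_acronyms` are hypothetical names, and the pattern only covers the cases seen in our reviews.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"   # acronyms written with full stops, e.g. U.S.
    r"|\.{3}"               # an ellipsis kept as a single token
    r"|\w+(?:'\w+)?"        # words, optionally with an apostrophe part (didn't)
    r"|[^\w\s]"             # any other single non-space symbol (punctuation, emoji)
)

def tokenize_keep_acronyms(review):
    """Return every non-overlapping match of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(review)

acronym_tokens = tokenize_keep_acronyms("A moving story about U.S. wildlife.")
ellipsis_tokens = tokenize_keep_acronyms("The Polish people in this film... stunning 👍")
print(acronym_tokens)
print(ellipsis_tokens)
```

Here punctuation is kept as separate tokens instead of being thrown away, so a later step can decide whether to drop it. Casing is also left untouched, which keeps `Polish` and `polish` distinct.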

In [0]:
distinct_tokens_count = {}
for t_review in tokenized_corpus:
    for token in t_review:
        if token not in distinct_tokens_count.keys():
            distinct_tokens_count[token] = 1
        else:
            distinct_tokens_count[token] += 1

In [0]:
for token, count in distinct_tokens_count.items():
    if "movie" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "film" in token.lower():
        print(token, count)        

for token, count in distinct_tokens_count.items():
    if "good" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "polish" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "wildlife" in token.lower():
        print(token, count)

for token, count in distinct_tokens_count.items():
    if "us" in token.lower() or "u.s." in token.lower():
        print(token, count)

There are still some problems, e.g.: `['a', 'moving', 'story', 'about', 'u', 's', 'wildlife']` and the letters that were split off after apostrophes, e.g.: `s` and `t`. We'll leave this as is for now, and sort this out later. What do you think we need to do with the emoji?

In [0]:
print(f"U+{ord('😂'):X}")  # Unicode code points are conventionally written in hexadecimal
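
Since each emoji here is a single Unicode code point, one option is to keep it as an ordinary token and, where a readable label is needed, derive one from its official Unicode name. The following is a minimal sketch using the standard-library `unicodedata` module; `emoji_to_alias` is a hypothetical helper, and treating every character in the "So" (Symbol, other) category as an emoji is a simplifying assumption.

```python
import unicodedata

def emoji_to_alias(token):
    """Replace a single symbol character with a :lower_snake_case: alias of its Unicode name."""
    if len(token) == 1 and unicodedata.category(token) == "So":
        return ":" + unicodedata.name(token).lower().replace(" ", "_") + ":"
    return token  # ordinary words pass through unchanged

tokens = ["this", "movie", "was", "\U0001f44d", "\U0001f4a9"]
aliases = [emoji_to_alias(token) for token in tokens]
print(aliases)
```

The aliases follow the official character names, so they can differ from hand-picked labels.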

# [Word2Vec](https://arxiv.org/abs/1301.3781)


Representing words as vectors is often attributed to the quote at the beginning of the Distributed Representations section - __you shall know a word by the company it keeps__. Above, we demonstrated how this quote is manifested by the idea of distributed representations. This section discusses and implements one way of obtaining these representations.

Word2Vec refers to a family of neural network based algorithms which obtain these word vectors/embeddings. One part of this family consists of two different model architectures: Continuous bag-of-words and Skip-gram. The other part of this family consists of two different approaches of dealing with a vocabulary in the order of **millions**: Hierarchical Softmax and Negative Sampling.

N.B. Word2Vec is not the only algorithm used to obtain word vectors. [GloVe](https://nlp.stanford.edu/projects/glove/), [FastText](https://fasttext.cc/), and [pre-trained language models](https://arxiv.org/abs/1810.04805) are alternatives. We will be looking at pre-trained language models later in this course. We focus on **Word2Vec** because it provides a simple intuition for how we can use neural networks to reduce dimensionality and then use the lower-dimensional vectors in downstream tasks. Using parameters from one model in another is also part of a paradigm known as _transfer learning_.

Recall the problem we are trying to solve - representing our sequences, movie reviews, in a way that we can feed to a (classification) model. Earlier on, we loaded in pre-trained embeddings. Each vector we obtained for a given word was the representation of this word. Let's discuss how to get this representation.

Looking at the quote again, we have this notion of "company". What do you think this means? It leads to what we call a context window: the words which lie in the range from **negative** $c$ to **positive** $c$ positions away from the focus word $w_t$.

![windows](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/context_window.png?raw=1)

There are two variants: Continuous Bag of Words (CBOW) and Skipgram. We'll focus on skipgram because it has been shown to outperform CBOW on larger corpora. After we implement skipgram below, see if you're able to implement CBOW yourself.

![image.png](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/cbow_skipgram.png?raw=1)

## Skipgram

With the knowledge that the input and output are one-hot, let's derive the objective. Instead of considering the problem where we have four gold labels, let's look at the case where we have one:

![skipgram_NN](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/skipgram_NN.png?raw=1)

![skipgram_concepts](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/skipgram_concepts.png?raw=1)

![skipgram_maths](https://github.com/ShaunakSen/NLP/blob/master/1_WordRepresentations/images/skipgram_maths.png?raw=1)
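
To make the figures concrete, here is a minimal pure-Python sketch of a single skipgram forward pass for one (focus, context) pair. It assumes the standard formulation with a full softmax over the vocabulary, which may use different notation from the figures. The sizes and the weight values in `W_in` and `W_out` are made up for illustration.

```python
import math

# Toy sizes, chosen only for this illustration
vocab_size, hidden_size = 4, 2

# Hypothetical weights: W_in has one row per vocabulary word, W_out one column per word
W_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # vocab_size x hidden_size
W_out = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]      # hidden_size x vocab_size

def matvec(vector, matrix):
    """Row vector times matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(len(matrix[0]))]

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

focus_idx, context_idx = 2, 1
one_hot_focus = [1 if i == focus_idx else 0 for i in range(vocab_size)]

hidden = matvec(one_hot_focus, W_in)          # a one-hot input simply selects row focus_idx
probabilities = softmax(matvec(hidden, W_out))
loss = -math.log(probabilities[context_idx])  # cross-entropy against the one-hot context word
print(hidden, probabilities, loss)
```

The hidden vector is exactly the row of `W_in` belonging to the focus word. That row is the word's embedding, which is why the projection weights are what we keep after training.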

Now we know what our neural network is going to look like - we just need to figure out how to feed our data into the model. To convert our input to one-hot we'll need to build a vocabulary dictionary which maps each word to an integer. We're going to use our corpus to build our vocabulary. The reason we didn't do that before was because we loaded in an external vocabulary file.

In [0]:
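##extract a vocabulary (i.e. a list/set of unique tokens)
# one possible solution: collect the unique tokens and sort them so the order is reproducible
vocabulary = sorted({token for review in tokenized_corpus for token in review})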

print("LENGTH OF VOCAB:", len(vocabulary), "\nVOCAB:", vocabulary)

In [0]:
# Let's map these to the aforementioned dictionaries
word2idx = {}
n_words = 0

for token in vocabulary:
    if token not in word2idx:
        word2idx[token] = n_words
        n_words += 1
        
assert len(word2idx) == len(vocabulary)

In [0]:
print(word2idx)

**At this point, we should extract our contexts and focus words:**
- Let's say we're considering the sentence: `['this', 'movie', 'had', 'a', 'brilliant', 'story', 'line', 'with', 'great', 'action']`
- For every word in the sentence, we want to get the words which are `window_size` around it.
- So if `window_size==2`, for the word `this`, we obtain: `[['this', 'movie'], ['this', 'had']]`
- For the word `movie`, we obtain: `[['movie', 'this'], ['movie', 'had'], ['movie', 'a']]`
- For the word `had`, we obtain: `[['had', 'this'], ['had', 'movie'], ['had', 'a'], ['had', 'brilliant']]`

In [0]:
def get_focus_context_pairs(tokenized_corpus, window_size=2):
    focus_context_pairs = []
    for sentence in tokenized_corpus:

        for token_idx, token in enumerate(sentence):
            for w in range(-window_size, window_size+1):
                context_word_pos = token_idx + w

                # skip the focus word itself and any position outside the sentence
                if w == 0 or context_word_pos >= len(sentence) or context_word_pos < 0:
                    continue

                focus_context_pairs.append([token, sentence[context_word_pos]])
    
    return focus_context_pairs
                
focus_context_pairs = get_focus_context_pairs(tokenized_corpus)
print(focus_context_pairs)

In [0]:
# Let's map these to our indices in preparation to one-hot
def get_focus_context_idx(focus_context_pairs):
    idx_pairs = []
    for pair in focus_context_pairs:
        idx_pairs.append([word2idx[pair[0]], word2idx[pair[1]]])
    
    return idx_pairs

idx_pairs = get_focus_context_idx(focus_context_pairs)
print(idx_pairs)

In [0]:
def get_one_hot(indicies, vocab_size=len(vocabulary)):
    oh_matrix = np.zeros((len(indicies), vocab_size))
    for i, idx in enumerate(indicies):
        oh_matrix[i, idx] = 1

    return torch.Tensor(oh_matrix)

Time to build our neural network!

In [0]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

from tqdm import tqdm
import random

writer = SummaryWriter('runs/word2vec')

In [0]:
class Word2Vec(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim_size):
        super().__init__()
        
        # Why do you think we don't have an activation function here?
        ##initialize the hidden layer and output layer
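        # One possible completion: two linear layers with no non-linearity in between
        self.projection = nn.Linear(input_size, hidden_dim_size, bias=False)
        self.output = nn.Linear(hidden_dim_size, output_size)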
        
    def forward(self, input_token):
        x = self.projection(input_token)
        output = self.output(x)
        return output

In [0]:
# Tensorboard doesn't handle emoji in its labels.
# So while we can train our model on emoji (as we do below),
# we have to convert their unicode strings to something displayable on Tensorboard

word2idx[":pile_of_poo:"] = word2idx.pop("\U0001f4a9")
word2idx[":thumbs_up:"] = word2idx.pop("\U0001f44d")
word2idx = {k: v for k, v in sorted(word2idx.items(), key=lambda item: item[1])} # sort dictionary


In [0]:
def train(word2vec_model, idx_pairs, state_dict_filename, early_stop=False, num_epochs=10, lr=1e-3):

    word2vec_model.train()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(word2vec_model.parameters(), lr=lr)

    for epoch in tqdm(range(num_epochs)):

        random.shuffle(idx_pairs)

        for focus, context in idx_pairs:
            
            oh_inputs = get_one_hot([focus], len(vocabulary))
            target = torch.LongTensor([context])

            pred_outputs = word2vec_model(oh_inputs)

            loss = criterion(pred_outputs, target)

            loss.backward()
            optimizer.step()
            word2vec_model.zero_grad()
            
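            ### These lines stop training after a single update when early_stop is set
            ### (note: this also skips the checkpoint and embedding logging below)
            if early_stop:
                break
        if early_stop:
            break
        ###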

        torch.save(word2vec_model.state_dict(), state_dict_filename)
        writer.add_embedding(word2vec_model.projection.weight.T,
                             metadata=word2idx.keys(), global_step=epoch)

In [0]:
word2vec = Word2Vec(len(vocabulary), len(vocabulary), 10)
train(word2vec, idx_pairs, "word2vec.pt")

# SpaCy

We've covered how to write some basic pre-processing rules from scratch, and we've seen some of the issues that rule-based pre-processing can cause. Sometimes it simply takes too long to code up rules for every edge case. In downstream tasks, we might also want to augment the data with more information, for example each word's Part-of-Speech (PoS) tag: an identifier describing the type of word it is (e.g. verb, adjective).

SpaCy is an easy-to-use NLP library which gives us access to neural models for various linguistic features. In this section we're going to use its tokenizer to pre-process a larger corpus of text and train a word2vec model on this corpus. Then we're going to use TensorBoard to visualise the embeddings we've just trained.

In [0]:
import spacy
import pickle
import os
from tqdm import tqdm

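# The "en" shortcut was removed in spaCy v3; load the small English pipeline by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")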
GUTENBERG_DIR = "gutenberg/"

In [0]:
gutenberg_books = []
for i, book_name in enumerate(os.listdir(GUTENBERG_DIR)):
    # latin-1 can decode any byte sequence, so reading never fails on odd characters
    with open(os.path.join(GUTENBERG_DIR, book_name), encoding="latin-1") as book_file:
        gutenberg_books.append(book_file.read())
    # Only load the first four books to keep tokenization fast
    if i == 3:
        break

gutenberg_book_lines = []
for book in tqdm(gutenberg_books):
    book_lines = book.split("\n")
    book_lines = list(filter(lambda x: x != "", book_lines))
    gutenberg_book_lines.append(book_lines)
    
# print(gutenberg_book_lines[0])

In [0]:
# Tokenizing is slow, so reuse the cached result if we have one
if os.path.exists("tokenized_corpus_gutenberg.pkl"):
    with open("tokenized_corpus_gutenberg.pkl", "rb") as f:
        tokenized_corpus = pickle.load(f)
else:
    tokenized_corpus = []
    for book_lines in tqdm(gutenberg_book_lines):
        for line in tqdm(book_lines):
            doc = nlp(line)
            # Lowercase every token and drop punctuation
            tokenized_corpus.append([token.text.lower()
                                     for token in doc if not token.is_punct])

    print(tokenized_corpus[0:5])
    with open("tokenized_corpus_gutenberg.pkl", "wb") as f:
        pickle.dump(tokenized_corpus, f)


One thing we haven't mentioned yet is the discrepancy between the words seen during training and the words which may appear at test/inference time. Words that appear only at test time are known as __out of vocabulary__ tokens. A simple strategy to deal with this is to replace every word in our training set which occurs fewer times than a certain threshold with an `<OOV>` token. At test time, if a given word isn't in the vocabulary that the model was trained on, we simply replace it with the `<OOV>` token.

In [0]:
def get_vocabulary(tokenized_corpus, cutoff_frequency=5):
    vocab_freq_dict = dict()
    for sentence in tokenized_corpus:
        for token in sentence:
            if token not in vocab_freq_dict.keys():
                vocab_freq_dict[token] = 0

            vocab_freq_dict[token] += 1

    vocabulary = set()
    ## for each token in our corpus,
    ## add that token to our vocabulary if it appears at least cutoff_frequency times
                
    return vocabulary

vocabulary = get_vocabulary(tokenized_corpus)
print("LENGTH OF VOCAB:", len(vocabulary), "\nVOCAB:", vocabulary)
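
One possible way to fill in the blank in `get_vocabulary`, as a hedged sketch rather than the official solution: count every token, then keep those that occur at least `cutoff_frequency` times, so that rarer tokens later fall through to `<OOV>`. The function name `get_vocabulary_sketch`, the use of `collections.Counter`, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def get_vocabulary_sketch(tokenized_corpus, cutoff_frequency=5):
    """Hypothetical completion: keep tokens seen at least `cutoff_frequency` times."""
    token_counts = Counter(token for sentence in tokenized_corpus for token in sentence)
    return {token for token, count in token_counts.items() if count >= cutoff_frequency}


# Toy corpus (made up): "the" occurs 3 times, "sat" twice, every other word once.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
toy_vocabulary = get_vocabulary_sketch(toy_corpus, cutoff_frequency=2)
print(sorted(toy_vocabulary))  # ['sat', 'the']
```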

In [0]:
OOV_token = "<OOV>"
vocabulary.add(OOV_token)
word2idx = {}
n_words = 0

tokenized_corpus_with_OOV = []
for sentence in tokenized_corpus:

    tokenized_sentence_with_OOV = []
    for token in sentence:
        if token in vocabulary:
            tokenized_sentence_with_OOV.append(token)
        else:
            tokenized_sentence_with_OOV.append(OOV_token)
    tokenized_corpus_with_OOV.append(tokenized_sentence_with_OOV)


In [0]:
for token in vocabulary:
    if token not in word2idx:
        word2idx[token] = n_words
        n_words += 1

assert len(word2idx) == len(vocabulary)
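
The same replacement has to happen at test time, when a sentence may contain words the model never saw. A minimal sketch of that lookup follows; the helper name `encode_with_oov` and the toy mapping are hypothetical and stand in for the `word2idx` built above.

```python
def encode_with_oov(tokens, word2idx, oov_token="<OOV>"):
    """Hypothetical helper: map tokens to indices, falling back to the OOV index."""
    oov_idx = word2idx[oov_token]
    return [word2idx.get(token, oov_idx) for token in tokens]


# Toy mapping (made up for illustration).
toy_word2idx = {"<OOV>": 0, "the": 1, "whale": 2}
toy_indices = encode_with_oov(["the", "white", "whale"], toy_word2idx)
print(toy_indices)  # [1, 0, 2]
```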

In [0]:
focus_context_pairs = get_focus_context_pairs(tokenized_corpus_with_OOV)

print(focus_context_pairs[0:20])

In [0]:
idx_pairs = get_focus_context_idx(focus_context_pairs)

print(idx_pairs[0:20])

In [0]:
writer = SummaryWriter('runs/word2vec_gutenberg')

In [0]:
w2v_gutenberg = Word2Vec(len(vocabulary), len(vocabulary), 10)
train(w2v_gutenberg, idx_pairs, "word2vec_gutenberg.pt", early_stop=True)

Let's visualise these embeddings using TensorBoard's projector!

### How do we use these word embeddings?

Our word embeddings __are__ the `projection` weight matrix that we trained earlier. To use them in downstream tasks, we can save this weight matrix and initialise the embeddings of our downstream network with the weights we've obtained. This is something we'll do in the next session, but extracting the weight matrix is as simple as:

In [0]:
weights_matrix = w2v_gutenberg.projection.weight.T
print(weights_matrix.shape)
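
As a rough preview of what initialising a downstream network could look like, the sketch below loads a weight matrix into a PyTorch embedding layer with `nn.Embedding.from_pretrained`. The random `toy_weights` tensor is a stand-in for the trained `weights_matrix`, and the sizes and index values are made up for illustration.

```python
import torch
from torch import nn

# Stand-in for the trained `weights_matrix`: 5 words, 3-dimensional embeddings.
toy_weights = torch.randn(5, 3)

# `freeze=False` lets the downstream task keep fine-tuning the embeddings.
toy_embedding = nn.Embedding.from_pretrained(toy_weights.clone(), freeze=False)

# Look up the vectors for a "sentence" given as word indices.
toy_sentence = torch.tensor([0, 3, 3])
toy_vectors = toy_embedding(toy_sentence)
print(toy_vectors.shape)  # torch.Size([3, 3])
```

With the real matrix, passing `weights_matrix.detach().clone()` is a cautious choice, since it avoids sharing storage or gradient history with the trained word2vec model.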

# Thank you!
Next up: LSTMs, Language Modelling, and Translation