[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AISaturdaysKigali/intro-to-dl/blob/master/tutorials/W8_TimeSeriesAndNLP/W8_Tutorial1.ipynb)

# Tutorial 1: Modeling sequences and encoding text

**Week 8: Modern RNNs**


----
# Tutorial objectives

Before exploring how RNNs excel at modelling sequences, we will look at some of the other ways we can model sequences, encode text, and make meaningful measurements using such encodings and embeddings.

---
## Setup

In [None]:
# @title Install dependencies

# @markdown There may be `Errors`/`Warnings` reported during the installation. However, they are to be ignored.
!pip install torchtext==0.4.0 --quiet
!pip install --upgrade gensim --quiet
!pip install unidecode --quiet
!pip install hmmlearn --quiet
!pip install fasttext --quiet
!pip install nltk --quiet
!pip install pandas --quiet
!pip install python-Levenshtein --quiet

In [None]:
# Imports
import time
import fasttext
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.nn import functional as F

from hmmlearn import hmm
from scipy.sparse import dok_matrix

from torchtext import data, datasets
from torchtext.vocab import FastText

import nltk
from nltk import FreqDist
from nltk.corpus import brown
from nltk.tokenize import word_tokenize

from gensim.models import Word2Vec

from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder

from tqdm import tqdm_notebook as tqdm

In [None]:
# @title Figure Settings
import ipywidgets as widgets
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/AISaturdaysKigali/content-creation/master/ai6kigali.mplstyle")

In [None]:
# @title  Load Dataset from `nltk`
# no critical warnings are expected, so we suppress them
import warnings
warnings.simplefilter("ignore")

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
nltk.download('webtext')

In [None]:
# @title Helper functions

import requests

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between vec_a and vec_b"""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


def tokenize(sentences):
  #Tokenize the sentence
  #from nltk.tokenize library use word_tokenize
  token = word_tokenize(sentences)

  return token


def plot_train_val(x, train, val, train_label, val_label, title, y_label,
                   color):
  plt.plot(x, train, label=train_label, color=color)
  plt.plot(x, val, label=val_label, color=color, linestyle='--')
  plt.legend(loc='lower right')
  plt.xlabel('epoch')
  plt.ylabel(y_label)
  plt.title(title)


def load_dataset(emb_vectors, sentence_length=50, seed=522):
  TEXT = data.Field(sequential=True,
                    tokenize=tokenize,
                    lower=True,
                    include_lengths=True,
                    batch_first=True,
                    fix_length=sentence_length)
  LABEL = data.LabelField(dtype=torch.float)

  train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

  TEXT.build_vocab(train_data, vectors=emb_vectors)
  LABEL.build_vocab(train_data)

  # `random.seed` returns None, so seed first and pass the generator state
  random.seed(seed)
  train_data, valid_data = train_data.split(split_ratio=0.7,
                                            random_state=random.getstate())
  train_iter, valid_iter, test_iter = data.BucketIterator.splits((train_data,
                                                                  valid_data,
                                                                  test_data),
                                                                  batch_size=32,
                                                                  sort_key=lambda x: len(x.text),
                                                                  repeat=False,
                                                                  shuffle=True)
  vocab_size = len(TEXT.vocab)

  print(f'Data are loaded. sentence length: {sentence_length} '
        f'seed: {seed}')

  return TEXT, vocab_size, train_iter, valid_iter, test_iter


def download_file_from_google_drive(id, destination):
  URL = "https://docs.google.com/uc?export=download"

  session = requests.Session()

  response = session.get(URL, params={ 'id': id }, stream=True)
  token = get_confirm_token(response)

  if token:
    params = { 'id': id, 'confirm': token }
    response = session.get(URL, params=params, stream=True)

  save_response_content(response, destination)


def get_confirm_token(response):
  for key, value in response.cookies.items():
    if key.startswith('download_warning'):
      return value

  return None


def save_response_content(response, destination):
  CHUNK_SIZE = 32768

  with open(destination, "wb") as f:
    for chunk in response.iter_content(CHUNK_SIZE):
      if chunk: # filter out keep-alive new chunks
        f.write(chunk)

In [None]:
# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# for DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')

# In case that `DataLoader` is used
def seed_worker(worker_id):
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)

In [None]:
# @title Set device (GPU or CPU). Execute `set_device()`

# inform the user if the notebook uses GPU or CPU.

def set_device():
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("WARNING: For this notebook to perform best, "
        "if possible, in the menu under `Runtime` -> "
        "`Change runtime type.`  select `GPU` ")
  else:
    print("GPU is enabled in this notebook.")

  return device

In [None]:
DEVICE = set_device()
SEED = 2021
set_seed(seed=SEED)

---
# Section 2: Word Embeddings

*Time estimate: ~60mins*


Words or subword units such as morphemes are the basic units that we use to express meaning  in language. The technique of mapping words to vectors of real numbers is known as word embedding. 

Word2vec is based on theories of distributional semantics - words that appear around each other are more likely to mean similar things than words that do not appear around each other. Keeping this in mind, our job is to create a high dimensional space where these semantic relations are preserved. The innovation in word2vec is the realisation that we can use unlabelled, running text in sentences as inputs for a supervised learning algorithm--as a self-supervision task. It is supervised because we use the words in a sentence to serve as positive and negative examples. Let’s break this down:

... "use the kitchen knife to chop the vegetables"…

**C1   C2   C3   T   C4   C5   C6   C7**

Here, the target word (T) is *knife*, and the context words (C1 to C7) are the words in the window around it.
The first word2vec method we’ll see is called skip-gram, where the task is to assign a probability for how likely it is that the context window appears around the target word. In the training process, positive examples are samples of words and their context words, and negative examples are created by sampling from pairs of words that do not appear near one another.

This method of implementing word2vec is called skip-gram with negative sampling. While the algorithm learns which context words are likely to appear around a target word, it adjusts the embedded representation of every word so that each is located optimally (e.g., with minimal semantic distortion). In this process of adjusting embedding values, the algorithm brings semantically similar words close together in the resulting high-dimensional space and pushes dissimilar words far apart.
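To make the positive and negative examples concrete, here is a minimal sketch in plain Python. The function names, the window of 2, and the three extra vocabulary words are illustrative assumptions, not part of gensim or of the original word2vec code. Real implementations draw negatives from a smoothed unigram distribution rather than uniformly, so treat this only as an illustration of the data that the training task sees.

```python
import random


def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every position in `tokens`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def negative_samples(target, positives, vocab, k, rng):
    """Draw up to k words never observed as context of `target` (uniform sampling, an assumption)."""
    candidates = [w for w in vocab
                  if w != target and (target, w) not in positives]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [(target, w) for w in chosen]


sentence = "use the kitchen knife to chop the vegetables".split()
positive_pairs = skipgram_pairs(sentence, window=2)

# Positive examples whose target is "knife"
knife_pairs = [p for p in positive_pairs if p[0] == "knife"]
print(knife_pairs)

# Hypothetical extra words stand in for the rest of a larger vocabulary
vocab = sorted(set(sentence) | {"piano", "cloud", "tax"})
rng = random.Random(0)
neg = negative_samples("knife", set(positive_pairs), vocab, k=2, rng=rng)
print(neg)
```

The classifier is then trained to score the positive pairs high and the sampled negative pairs low.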

Another word2vec training method, Continuous Bag of Words (CBOW), works in a similar fashion and tries to predict the target word given its context. This is the converse of skip-gram, which tries to predict the context given the target word. Skip-gram represents rare words and phrases well, though it often requires more data for stable representations, while CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words in its prediction task. The popular gensim implementation of word2vec includes both methods.
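The difference between the two prediction tasks can be sketched by looking at the training examples each one produces from the same sentence. The helper names and the window of 2 below are assumptions made for illustration; gensim builds these examples internally.

```python
def cbow_examples(tokens, window):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


def skipgram_examples(tokens, window):
    """Skip-gram: one (target, context_word) example per context word."""
    return [(target, c)
            for context, target in cbow_examples(tokens, window)
            for c in context]


tokens = "use the kitchen knife to chop the vegetables".split()
cbow = cbow_examples(tokens, window=2)
sg = skipgram_examples(tokens, window=2)

print(cbow[3])             # the whole context predicts "knife" at once
print(len(cbow), len(sg))  # CBOW yields fewer, larger examples
```

CBOW makes one prediction per position while skip-gram makes one per (target, context) pair, which is one way to see why CBOW tends to train faster.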

## Section 2.1: Creating Word Embeddings

We will create embeddings for a subset of categories in the [Brown corpus](https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html). To achieve this, we will use the [gensim](https://radimrehurek.com/gensim/) library to create word2vec embeddings. Gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words.
Calling `Word2Vec(sentences, epochs=1)` will run two passes over the sentences iterator (or, in general, `epochs + 1` passes). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model.
`Word2Vec` accepts several parameters that affect both training speed and quality.

One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:

`model = Word2Vec(sentences, min_count=10)  # default value is 5`


A reasonable value for `min_count` is between 0 and 100, depending on the size of your dataset.
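The effect of `min_count` can be sketched with a plain frequency count. The `prune_vocab` helper and the toy sentences are hypothetical; gensim performs the equivalent pruning internally while it builds its vocabulary.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Keep only words that occur at least `min_count` times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: c for word, c in counts.items() if c >= min_count}


# Toy corpus: "catt" is a typo that appears only once
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "catt", "ran"],
]

kept = prune_vocab(toy_sentences, min_count=2)
print(kept)
```

On a corpus this small, a threshold of 2 already discards most words, which is why the right value depends on the size of the dataset.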

Another parameter is the size of the NN layers, which corresponds to the “degrees” of freedom the training algorithm has:

`model = Word2Vec(sentences, vector_size=200)  # default value is 100`


Bigger `vector_size` values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

The last of the major parameters (full list [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:

`model = Word2Vec(sentences, workers=4)  # workers=1 means no parallelization`

In [None]:
category = ['editorial', 'fiction', 'government', 'mystery', 'news', 'religion',
            'reviews', 'romance', 'science_fiction']

In [None]:
def create_word2vec_model(category='news', size=50, sg=1, min_count=5):
  try:
    sentences = brown.sents(categories=category)
    model = Word2Vec(sentences, vector_size=size, sg=sg, min_count=min_count)

  except (AttributeError, TypeError):
      raise AssertionError('Input variable "category" should be a string or list,'
      '"size", "sg", "min_count" should be integers')

  return model

def model_dictionary(model):
  words = list(model.wv.key_to_index)
  return words

def get_embedding(word, model):
  if word in model.wv.key_to_index:
    return model.wv[word]
  else:
    return None

In [None]:
all_categories = brown.categories()

In [None]:
all_categories

In [None]:
w2vmodel = create_word2vec_model(all_categories)

In [None]:
print(model_dictionary(w2vmodel))

In [None]:
print(get_embedding('weather', w2vmodel))

## Section 2.2: Visualizing Word Embedding

We can now obtain the word embeddings for any word in the dictionary using word2vec. Let's visualize these embeddings to get an intuition of what they mean. The word embeddings obtained from the word2vec model live in a high-dimensional space. We will use `tSNE` (t-distributed stochastic neighbor embedding), a statistical method for dimensionality reduction that allows us to visualize high-dimensional data in a 2D or 3D space. Here, we will use `tSNE` from the [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) module (if you are not familiar with this method, think of `PCA`) to project our high-dimensional embeddings into 2D space.


For each word in `keys`, we pick the top 10 similar words (using cosine similarity) and plot them.  

 What should the arrangement of similar words be?
 What should the arrangement of the key clusters be with respect to each other?
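As a rough sketch of what picking the most similar words involves, the snippet below ranks a handful of made-up 3-dimensional vectors by cosine similarity to a query word. The vectors and the `most_similar` helper are illustrative assumptions; in the notebook the ranking is done by `w2vmodel.wv.most_similar` on the trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up embeddings: two loose "topics" along different axes
toy_vectors = {
    "voters":   [0.9, 0.1, 0.0],
    "election": [0.8, 0.2, 0.1],
    "ballot":   [0.7, 0.3, 0.0],
    "magic":    [0.0, 0.9, 0.4],
    "wizard":   [0.1, 0.8, 0.5],
}

neighbours = most_similar("voters", toy_vectors, topn=2)
print(neighbours)
```

Words that share a direction in the embedding space rank highest, which is the property the t-SNE plot tries to preserve in 2D.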
 

In [None]:
keys = ['voters', 'magic', 'love', 'God', 'evidence', 'administration', 'governments']

In [None]:
def get_cluster_embeddings(keys):
  embedding_clusters = []
  word_clusters = []

  # find closest words and add them to cluster
  for word in keys:
    embeddings = []
    words = []
    if word not in w2vmodel.wv.key_to_index:
      print('The word ', word, 'is not in the dictionary')
      continue

    for similar_word, _ in w2vmodel.wv.most_similar(word, topn=10):
      words.append(similar_word)
      embeddings.append(w2vmodel.wv[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)

  # project the embeddings of the words in each cluster to 2D with t-SNE
  embedding_clusters = np.array(embedding_clusters)
  n, m, k = embedding_clusters.shape
  tsne_model_en_2d = TSNE(perplexity=10, n_components=2, init='pca', n_iter=3500, random_state=32)
  embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

  return embeddings_en_2d, word_clusters

In [None]:
def tsne_plot_similar_words(title, labels, embedding_clusters,
                            word_clusters, a, filename=None):
  plt.figure(figsize=(16, 9))
  colors = cm.rainbow(np.linspace(0, 1, len(labels)))
  for label, embeddings, words, color in zip(labels, embedding_clusters, word_clusters, colors):
    x = embeddings[:, 0]
    y = embeddings[:, 1]
    plt.scatter(x, y, color=color, alpha=a, label=label)
    for i, word in enumerate(words):
      plt.annotate(word,
                   alpha=0.5,
                   xy=(x[i], y[i]),
                   xytext=(5, 2),
                   textcoords='offset points',
                   ha='right',
                   va='bottom',
                   size=10)
  plt.legend(loc="lower left")
  plt.title(title)
  plt.grid(True)
  if filename:
    plt.savefig(filename, format='png', dpi=150, bbox_inches='tight')
  plt.show()

In [None]:
embeddings_en_2d, word_clusters = get_cluster_embeddings(keys)
tsne_plot_similar_words('Similar words from Brown Corpus', keys, embeddings_en_2d, word_clusters, 0.7)

## Section 2.3: Exploring meaning with word embeddings

While word2vec was the method that started it all, research has since boomed, and we now have more sophisticated ways to represent words. One such method is FastText, developed at Facebook AI Research, which breaks words into sub-words; this technique also allows us to create embedding representations for unseen words. In this section, we will explore how semantics and meaning are captured using embeddings, after downloading a pre-trained FastText model. Downloading pre-trained models is a way for us to plug in word embeddings and explore them without training them ourselves.
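The sub-word idea can be sketched as extracting character n-grams from a word wrapped in boundary markers. The `char_ngrams` helper is a hypothetical simplification: the FastText paper uses n-grams of length 3 to 6, and the real model additionally keeps the full word and hashes n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Trigrams only, to keep the output short
trigrams = char_ngrams("where", 3, 3)
print(trigrams)

# A word missing from the vocabulary still shares n-grams with a known one
shared = set(char_ngrams("kingdom")) & set(char_ngrams("kingdoms"))
print(sorted(shared))
```

Because a word vector is built from its n-gram vectors, an unseen word such as a plural or a misspelling can still receive a sensible embedding from the n-grams it shares with known words.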

In [None]:
# # @title Download FastText English Embeddings of dimension 100
# import os, io, zipfile
# from urllib.request import urlopen

# zipurl = 'https://osf.io/w9sr7/download'
# print(f"Downloading and unzipping the file... Please wait.")
# with urlopen(zipurl) as zipresp:
#   with zipfile.ZipFile(io.BytesIO(zipresp.read())) as zfile:
#     zfile.extractall('.')
# print("Download completed!")

In [None]:
# Load 100 dimension FastText Vectors using FastText library
ft_en_vectors = fasttext.load_model('cc.en.100.bin')

In [None]:
print(f"Length of the embedding is: {len(ft_en_vectors.get_word_vector('king'))}")
print(f"Embedding for the word King is: {ft_en_vectors.get_word_vector('king')}")

Cosine similarity is used to measure the similarity between words. It is a scalar between -1 and 1, where values close to 1 mean the two vectors point in nearly the same direction.

Now find the 10 most similar words to "King"

In [None]:
ft_en_vectors.get_nearest_neighbors("king", 10)  # Most similar by key

### Word Similarity

More on similarity between words. Let's check how similar different pairs of words are. Feel free to play around.
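For reference, the `cosine_similarity` helper defined in the setup computes the following quantity. The symbols $\mathbf{a}$ and $\mathbf{b}$ are notation introduced here for the two word vectors:

$$
\text{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

As a small worked example with made-up vectors $\mathbf{a} = (1, 2, 2)$ and $\mathbf{b} = (2, 1, 2)$:

$$
\mathbf{a} \cdot \mathbf{b} = 2 + 2 + 4 = 8, \qquad \lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = \sqrt{9} = 3, \qquad \text{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{8}{9} \approx 0.89
$$

Only the direction of the vectors matters: scaling either vector by a positive constant leaves the result unchanged.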



In [None]:
def getSimilarity(word1, word2):
  v1 = ft_en_vectors.get_word_vector(word1)
  v2 = ft_en_vectors.get_word_vector(word2)
  return cosine_similarity(v1, v2)

print("Similarity between the words King and Queen: ", getSimilarity("king", "queen"))
print("Similarity between the words King and Knight: ", getSimilarity("king", "knight"))
print("Similarity between the words King and Rock: ", getSimilarity("king", "rock"))
print("Similarity between the words King and Twenty: ", getSimilarity("king", "twenty"))

## Try the same for two more pairs
# print("Similarity between the words ___ and ___: ", getSimilarity(...))
# print("Similarity between the words ___ and ___: ", getSimilarity(...))

# print("Similarity between the words ___ and ___: ", getSimilarity(...))
# print("Similarity between the words ___ and ___: ", getSimilarity(...))

### Homonym Words$^\dagger$

Find the similarity for homonym words with their different meanings. The first one has been implemented for you.


$^\dagger$: Two or more words having the same spelling or pronunciation but different meanings and origins are called *homonyms*. E.g., *cricket* can refer to an insect or to a sport.

In [None]:
#######################     Words with multiple meanings     ##########################
print("Similarity between the words Cricket and Insect: ", getSimilarity("cricket", "insect"))
print("Similarity between the words Cricket and Sport: ", getSimilarity("cricket", "sport"))

## Try the same for two more pairs
# print("Similarity between the words ___ and ___: ", getSimilarity(...))
# print("Similarity between the words ___ and ___: ", getSimilarity(...))

# print("Similarity between the words ___ and ___: ", getSimilarity(...))
# print("Similarity between the words ___ and ___: ", getSimilarity(...))

### Word Analogies

Embeddings can be used to find word analogies.
Let's try it:
1.  Man : Woman  ::  King : _____
2.  Germany : Berlin  ::  France : _____
3.  Leaf : Tree  ::  Petal : _____
4.  Hammer : Nail  ::  Comb : _____
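Analogies of the form A : B :: C : ? are commonly solved with vector arithmetic: compute B - A + C and look for the nearest word. The sketch below does this on made-up 3-dimensional vectors; the `analogy` helper, the toy vectors, and the choice to exclude the three query words from the candidates are assumptions for illustration, not the FastText implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by ranking words against vec(b) - vec(a) + vec(c)."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves (an assumed convention)
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))


# Made-up embeddings: axis 1 loosely encodes gender, axis 2 royalty
toy = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
    "apple": [0.1, 0.2, 0.0],
}

answer = analogy("man", "woman", "king", toy)
print(answer)
```

In the toy vectors the offset from "man" to "woman" is identical to the offset from "king" to "queen"; in real embeddings the offsets are only approximately parallel, so analogies do not always resolve cleanly.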

In [None]:
## Use the get_analogies() function. The words have to be in the order Positive, Negative, Positive

# Man : Woman  ::  King : _____
# Positive=(woman, king), Negative=(man)
print(ft_en_vectors.get_analogies("woman", "man", "king",1))

# Germany: Berlin :: France : ______
# Positive=(berlin, france), Negative=(germany)
print(ft_en_vectors.get_analogies("berlin", "germany", "france",1))

# Leaf : Tree  ::  Petal : _____
# Positive=(tree, petal), Negative=(leaf)
print(ft_en_vectors.get_analogies("tree", "leaf", "petal",1))

# Hammer : Nail  ::  Comb : _____
# Positive=(nail, comb), Negative=(hammer)
print(ft_en_vectors.get_analogies("nail", "hammer", "comb",1))

But, does it always work?


1.   Poverty : Wealth  :: Sickness : _____
2.   train : board :: horse : _____

In [None]:
# Poverty : Wealth  :: Sickness : _____
print(ft_en_vectors.get_analogies("wealth", "poverty", "sickness",1))

# train : board :: horse : _____
print(ft_en_vectors.get_analogies("board", "train", "horse",1))

---
# Section 3: Neural Net with word embeddings

*Time estimate: ~16mins*

Let's use the pretrained FastText embeddings to train a neural network on the IMDB dataset. 

To recap, the data consists of reviews and the sentiments attached to them. It is a binary classification task. As a simple preview of the upcoming neural networks, we are going to introduce a neural net with word embeddings. We'll see more detailed networks in the next tutorial.




## Coding Exercise 3.1: Simple Feed Forward Net

This will load 300 dim FastText embeddings. It will take around 2-3 minutes.

Define a vanilla neural network with linear layers. Then average the word embeddings to get an embedding for the entire review.
The neural net will have one hidden layer of size 128.
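Written out, the computation this exercise asks for is roughly the following. The symbols are notation introduced here, not names from the code: $\mathbf{e}_t \in \mathbb{R}^{300}$ is the embedding of the $t$-th token, $T$ is the fixed review length (50 by default in `load_dataset`), $W_1 \in \mathbb{R}^{128 \times 300}$ and $W_2 \in \mathbb{R}^{2 \times 128}$ are the weights of the two linear layers, and $\mathbf{b}_1$, $\mathbf{b}_2$ are their biases.

$$
\bar{\mathbf{e}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{e}_t, \qquad
\mathbf{h} = \mathrm{ReLU}\left(W_1 \bar{\mathbf{e}} + \mathbf{b}_1\right), \qquad
\hat{\mathbf{y}} = \log \mathrm{softmax}\left(W_2 \mathbf{h} + \mathbf{b}_2\right)
$$

Averaging discards word order, so the network sees each review as a single 300-dimensional point regardless of how the words were arranged.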

In [None]:
# @title Download embeddings and clear old variables to clean memory.
# @markdown #### Execute this cell!
if 'ft_en_vectors' in locals():
  del ft_en_vectors
if 'w2vmodel' in locals():
  del w2vmodel

embedding_fasttext = FastText('simple')

In [None]:
# @markdown Load the Dataset
TEXT, vocab_size, train_iter, valid_iter, test_iter = load_dataset(embedding_fasttext, seed=SEED)

In [None]:
class NeuralNet(nn.Module):
  def __init__(self, output_size, hidden_size, vocab_size, embedding_length,
               word_embeddings):
    super(NeuralNet, self).__init__()

    self.word_embeddings = nn.Embedding(vocab_size, embedding_length)
    self.word_embeddings.weight = nn.Parameter(word_embeddings,
                                               requires_grad=False)
    self.fc1 = nn.Linear(embedding_length, hidden_size)
    self.fc2 = nn.Linear(hidden_size, output_size)


  def forward(self, inputs):

    input = self.word_embeddings(inputs)  # convert text to embeddings
    ####################################################################
    # Fill in missing code below (...)
    raise NotImplementedError("Fill in the Neural Net")
    ####################################################################
    # Average the word embeddings in a sentence
    # Use torch.nn.functional.avg_pool2d to compute the averages
    pooled = ...

    # Pass the embeddings through the neural net
    # A fully-connected layer
    x = ...
    # ReLU activation
    x = ...
    # Another fully-connected layer
    x = ...
    output = F.log_softmax(x, dim=1)

    return output

# Uncomment to check your code
# nn_model = NeuralNet(2, 128, 100, 300, TEXT.vocab.vectors)
# print(nn_model)

Solution not provided yet

```
NeuralNet(
  (word_embeddings): Embedding(100, 300)
  (fc1): Linear(in_features=300, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=2, bias=True)
)
```

In [None]:
# @title Training and Testing Functions

# @markdown #### `train(model, device, train_iter, valid_iter, epochs, learning_rate)`
# @markdown #### `test(model, device, test_iter)`

def train(model, device, train_iter, valid_iter, epochs, learning_rate):
  criterion = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

  train_loss, validation_loss = [], []
  train_acc, validation_acc = [], []

  for epoch in range(epochs):
    # train
    model.train()
    running_loss = 0.
    correct, total = 0, 0
    steps = 0

    for idx, batch in enumerate(train_iter):
      text = batch.text[0]
      # labels are loaded as floats; CrossEntropyLoss expects integer class indices
      target = batch.label.long()
      text, target = text.to(device), target.to(device)

      # reset gradients, run the forward pass, backpropagate, and update
      optimizer.zero_grad()
      output = model(text)
      loss = criterion(output, target)
      loss.backward()
      optimizer.step()
      steps += 1
      running_loss += loss.item()

      # get accuracy
      _, predicted = torch.max(output, 1)
      total += target.size(0)
      correct += (predicted == target).sum().item()
    train_loss.append(running_loss/len(train_iter))
    train_acc.append(correct/total)

    print(f'Epoch: {epoch + 1}, '
          f'Training Loss: {running_loss/len(train_iter):.4f}, '
          f'Training Accuracy: {100*correct/total: .2f}%')

    # evaluate on validation data
    model.eval()
    running_loss = 0.
    correct, total = 0, 0

    with torch.no_grad():
      for idx, batch in enumerate(valid_iter):
        text = batch.text[0]
        target = batch.label.long()
        text, target = text.to(device), target.to(device)

        # no optimizer call is needed here: gradients are disabled during evaluation
        output = model(text)

        loss = criterion(output, target)
        running_loss += loss.item()

        # get accuracy
        _, predicted = torch.max(output, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

    validation_loss.append(running_loss/len(valid_iter))
    validation_acc.append(correct/total)

    print (f'Validation Loss: {running_loss/len(valid_iter):.4f}, '
           f'Validation Accuracy: {100*correct/total: .2f}%')

  return train_loss, train_acc, validation_loss, validation_acc


def test(model, device, test_iter):
  model.eval()
  correct = 0
  total = 0
  with torch.no_grad():
    for idx, batch in enumerate(test_iter):
      text = batch.text[0]
      target = batch.label.long()
      text, target = text.to(device), target.to(device)

      outputs = model(text)
      _, predicted = torch.max(outputs, 1)
      total += target.size(0)
      correct += (predicted == target).sum().item()

    acc = 100 * correct / total
    return acc

In [None]:
# Model hyperparameters
learning_rate = 0.0003
output_size = 2
hidden_size = 128
embedding_length = 300
epochs = 15
word_embeddings = TEXT.vocab.vectors
vocab_size = len(TEXT.vocab)

# Model set-up
nn_model = NeuralNet(output_size,
                     hidden_size,
                     vocab_size,
                     embedding_length,
                     word_embeddings)
nn_model.to(DEVICE)
nn_start_time = time.time()
set_seed(522)
nn_train_loss, nn_train_acc, nn_validation_loss, nn_validation_acc = train(nn_model,
                                                                           DEVICE,
                                                                           train_iter,
                                                                           valid_iter,
                                                                           epochs,
                                                                           learning_rate)
print(f"--- Time taken to train = {(time.time() - nn_start_time)} seconds ---")
test_accuracy = test(nn_model, DEVICE, test_iter)
print(f'\n\nTest Accuracy: {test_accuracy}%')

In [None]:
# Plot accuracy curves
plt.figure()
plt.subplot(211)
plot_train_val(np.arange(0, epochs), nn_train_acc, nn_validation_acc,
               'train accuracy', 'val accuracy',
               'Neural Net on IMDB text classification', 'accuracy',
               color='C0')
plt.legend(loc='upper left')
plt.subplot(212)
plot_train_val(np.arange(0, epochs), nn_train_loss,
               nn_validation_loss,
               'train loss', 'val loss',
               '',
               'loss [a.u.]',
               color='C0')
plt.legend(loc='upper left')
plt.show()

---
# Summary

In this tutorial, we explored two different concepts linked to sequences, and text in particular, that will be the conceptual foundation for Recurrent Neural Networks.

The first concept was that of sequences and probabilities. We saw how we can model language as sequences of text, and use this analogy to generate text. Such a setup is also used to classify text or identify parts of speech. We can either build chains manually using simple Python and numerical computation, or use a package such as ```hmmlearn``` that makes training such models much easier. These notions of sequences and probabilities (i.e., creating language models!) are key to the internals of a recurrent neural network as well.

The second concept is that of word embeddings, now a mainstay of natural language processing. By training neural networks to predict the context of words, these networks learn internal representations of words that are a decent approximation of semantic meaning (i.e., embeddings!). We saw how these embeddings can be visualised, as well as how they capture meaning. We finally saw how they can be integrated into neural networks to better classify text documents.