<a href="https://colab.research.google.com/github/Eminent01/AMMI-WORK/blob/main/NLP_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gensim: a package to train and use word vectors

Gensim is a Python package for training and using word vectors.
It also implements many functions for analyzing word vectors.

In [None]:
import gensim
import time

In [None]:
import gensim.downloader as api
import warnings

warnings.filterwarnings("ignore", "Conversion of the second argument of issubdtype from", FutureWarning)
warnings.filterwarnings("ignore", "This function is deprecated, use smart_open.open instead. See the migration notes for ", UserWarning)
warnings.filterwarnings("ignore", "arrays to stack must be passed as a", FutureWarning)

# This will load word vectors for a large vocabulary 
# It should take < 5 minutes
wv = api.load('glove-wiki-gigaword-50')



In the following cell, we download a list of analogies.

In [None]:
!wget https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt

--2019-12-02 18:30:46--  https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 603955 (590K) [text/plain]
Saving to: ‘questions-words.txt’


2019-12-02 18:30:46 (15.5 MB/s) - ‘questions-words.txt’ saved [603955/603955]



In the next cell, open the downloaded file and take a look to see what it contains.
Print the first 5 lines.

In [None]:
# Use a context manager so the file is closed once we are done with it.
with open('questions-words.txt', 'r') as f:
    for line in f.readlines()[:5]:
        print(line)

: capital-common-countries

Athens Greece Baghdad Iraq

Athens Greece Bangkok Thailand

Athens Greece Beijing China

Athens Greece Berlin Germany



Note that:
- Athens is to Greece what Baghdad is to Iraq
- Athens is to Greece what Bangkok is to Thailand
- Athens is to Greece what Beijing is to China

and so on.

Remember the word analogy task:
A is to B what C is to D

As we saw before, we can try to guess D from the word vectors of A, B and C!

We use the formula :
$$B-A + C \approx D$$

For each analogy (A, B, C, D) described in the file "questions-words.txt", we can:
- compute $\tilde{D} = B - A + C$ with the word vectors,
- compute the nearest neighbors of $\tilde{D}$ among the word vectors,
- check the nearest neighbor: if it is $D$, we have answered the analogy correctly; otherwise we have not.

Averaging the correct and incorrect answers over all analogies gives us a metric (an accuracy). This metric can be used as an indicator of the quality of the word embeddings.
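
To make the procedure concrete, here is a minimal sketch of it on toy data. The 2-d vectors are hypothetical and hand-picked so that the capital-to-country offset is identical for every pair; real embeddings only satisfy this approximately. The helper names (`solve_analogy`, `analogy_accuracy`) are made up for this illustration, and excluding A, B and C from the candidates is an assumption of the sketch.

```python
import numpy as np

# Hypothetical 2-d vectors, chosen so that capital -> country is always (0, 2).
toy_vectors = {
    "athens": np.array([1.0, 1.0]),
    "greece": np.array([1.0, 3.0]),
    "baghdad": np.array([4.0, 1.0]),
    "iraq": np.array([4.0, 3.0]),
    "bangkok": np.array([7.0, 1.0]),
    "thailand": np.array([7.0, 3.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to B - A + C, excluding A, B and C themselves."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(analogies, vectors):
    """Fraction of (A, B, C, D) analogies whose predicted word equals D."""
    correct = sum(solve_analogy(a, b, c, vectors) == d for a, b, c, d in analogies)
    return correct / len(analogies)

toy_analogies = [
    ("athens", "greece", "baghdad", "iraq"),
    ("athens", "greece", "bangkok", "thailand"),
]
print(solve_analogy("athens", "greece", "baghdad", toy_vectors))
print(analogy_accuracy(toy_analogies, toy_vectors))
```

On real embeddings the offsets are noisy, which is why the accuracy is usually well below 1.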

In the next cell, we will ask Gensim to compute this metric for us.

In [None]:
# This should take approximately 5 minutes

start = time.time()
results = wv.evaluate_word_analogies(analogies='questions-words.txt')
print(f"Accuracy on the word analogy task {results[0]}")
print(f"Evaluating this took {time.time() - start}")

Accuracy on the word analogy task 0.463717540798522
Evaluating this took 276.7814600467682


### Similar words

Each word is associated with an index value (between 0 and $n_{words} - 1$).
If two words have similar spellings, this information is lost once the words are replaced with their indices.
However, the model still discovers similar meanings automatically: for the two words "car" and "cars", it has placed the vectors close together.

Can you guess why?

We can use Gensim to find the words closest to a given word.


In [None]:
print(wv.most_similar(positive=['car'], topn=5))

[('truck', 0.9208585619926453), ('cars', 0.8870190382003784), ('vehicle', 0.8833684325218201), ('driver', 0.8464018702507019), ('driving', 0.8384189009666443)]


Question: What does this do ?

In [None]:
print(wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))


car


## Train your own word vectors

Download and extract the IMDB Sentiment classification corpus


In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xzf aclImdb_v1.tar.gz

--2019-12-02 18:35:28--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2019-12-02 18:35:29 (67.6 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In the following cell, we define a class called GensimCorpus. Can you guess what it does? The code looks complicated...

In [None]:
from gensim import utils
import os
from os.path import join

class GensimCorpus(object):
    """An iterable that yields sentences (lists of str)."""

    def __init__(self, path="aclImdb/train/unsup"):
        self.path = path
        self.filenames = [fname for fname in os.listdir(path) if fname.endswith(".txt")]
        

    def __iter__(self):
        for fname in self.filenames:
            # Assume each file holds one document, stored on a single line.
            with open(join(self.path, fname), "r", encoding="utf8") as f:
                lines = f.readlines()
                assert len(lines) == 1
                line = lines[0].strip()
                yield utils.simple_preprocess(line)


In order to understand what it does, we will try to use it. Defining ```__iter__``` makes the object iterable, so we can loop over it. Let's see what it yields!

In [None]:
sentences = GensimCorpus()

#TODO: look at a few examples 
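
One way to peek at an iterable without reading all of it is `itertools.islice`. The sketch below uses a hypothetical `ToyCorpus` class as a stand-in for GensimCorpus (it splits on whitespace instead of calling `simple_preprocess`), so it runs without the IMDB files. It also shows why the corpus is a class with ```__iter__``` rather than a one-shot generator: it can be traversed several times, which matters because Word2Vec makes more than one pass over the data.

```python
from itertools import islice

class ToyCorpus:
    """Hypothetical stand-in for GensimCorpus: yields one token list per document."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            yield doc.lower().split()

toy = ToyCorpus(["A great movie", "Not my favourite film", "Would watch again"])

# islice stops after the first two items without consuming the whole corpus.
first_two = list(islice(toy, 2))
print(first_two)

# Each loop calls __iter__ again, so the corpus can be traversed repeatedly.
n_first_pass = sum(1 for _ in toy)
n_second_pass = sum(1 for _ in toy)
print(n_first_pass, n_second_pass)
```

The same `islice` call should work on `sentences` in the cell above.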

Gensim can learn word embeddings ( == word vectors) from a text corpus. Let's learn word embeddings on our GensimCorpus.

In [None]:
import gensim.models

# This should take approximately 3 minutes
start = time.time()
# Note: in gensim >= 4.0 this argument is named `vector_size` instead of `size`.
model = gensim.models.Word2Vec(sentences=sentences, size=50)
print("Took %.2f" % (time.time() - start))

Took 150.86


The model we learned, ```model``` contains word vectors ```model.wv```. We can use ```evaluate_word_analogies``` as before to evaluate the quality of our newly learned word vectors.

In [None]:
#TODO: evaluate word analogies 

What about nearest neighbors ?

In [None]:
#TODO: look at nearest neighbors of common words such as king, queen, etc. 

Question: Is this model as accurate as the previous one ?

In [None]:
print(model.wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car


## Text classification with the bag of words model

We have trained our own word vector model on data from IMDB. 
We now want to perform sentiment analysis: predict whether a review is positive or negative. 
First, start by creating a dataset with the positive and negative sentences.

In [None]:
positives = GensimCorpus(path=...)
negatives = GensimCorpus(path=...)


In [None]:
import numpy as np

def create_embeddings(corpus, word_vectors):
  # TODO: create a function that takes a corpus and some gensim word vectors and returns a matrix with the average embedding of each sentence
  embeddings = []

  for sentence in corpus:

    embeddings.append(...)

  return np.stack(embeddings)
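
As a hint for the averaging step, here is a minimal sketch on toy data. The dictionary `toy_wv` and the helper `average_embedding` are hypothetical; in the notebook the vectors would come from `model.wv`, which also supports `word in model.wv` and `model.wv[word]`. Skipping unknown words and returning a zero vector for an empty sentence are design choices of this sketch, not the only option.

```python
import numpy as np

# Hypothetical 3-d word vectors standing in for model.wv.
toy_wv = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
}

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

sentence_vec = average_embedding(["good", "movie", "unseenword"], toy_wv, dim=3)
empty_vec = average_embedding(["unseenword"], toy_wv, dim=3)
print(sentence_vec)
print(empty_vec)
```

Averaging discards word order, which is why this representation is called a bag of words.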


In [None]:
X_pos = create_embeddings(positives, model.wv)
X_neg = create_embeddings(negatives, model.wv)

In [None]:
print(X_pos.shape)
print(X_neg.shape)

(12500, 50)
(12500, 50)


In [None]:
n_pos = X_pos.shape[0]
n_neg = X_neg.shape[0]

X = np.concatenate([X_pos, X_neg])
y = np.zeros((n_pos + n_neg, ), dtype=int)

y[:n_pos] = 1

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
#TODO: Sklearn linear classifier
from sklearn.svm import LinearSVC

linear_model = LinearSVC()
linear_model.fit(X_train, y_train)

print(linear_model.score(X_train, y_train))
print(linear_model.score(X_test, y_test))

In [None]:
#TODO: find the best regularization constant C of the linear SVM when you train with only n_samp samples

n_samp = 100

# Language modeling

Language modeling is a way to train models to generate text.
We will see how to use PyTorch models to do language modeling.

First, let's look at a few steps that are necessary for text processing.


## Tokenization

Tokenization creates a dictionary that contains all words, and assigns an index to each word.
Look at the classes Dictionary and Corpus below.
What do they do?

In [None]:
import os
from io import open
import torch

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        print(path)
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids
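
To see the word-to-index mapping in action, here is a small sketch that runs without any data files or PyTorch. The Dictionary class is repeated from the cell above so the snippet is self-contained, and `toy_lines` is a hypothetical two-line corpus. The loop mirrors what `Corpus.tokenize` does for each line, without building tensors.

```python
class Dictionary(object):
    """Same mapping logic as the Dictionary class above."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

toy_lines = ["the cat sat", "the dog sat"]  # hypothetical corpus
dictionary = Dictionary()

ids = []
for line in toy_lines:
    # Every line is terminated by an explicit end-of-sentence token.
    for word in line.split() + ['<eos>']:
        ids.append(dictionary.add_word(word))

# Mapping the ids back through idx2word recovers the token stream.
decoded = [dictionary.idx2word[i] for i in ids]
print(ids)
print(decoded)
print(len(dictionary))
```

Repeated words reuse their index, so the vocabulary size is the number of distinct tokens, including `<eos>`.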

## RNN and LSTM models

We define a class for our RNN model. Fill in the "..." in the ```__init__``` and ```forward``` methods, using the PyTorch documentation of ```nn.RNN``` and ```nn.LSTM```.

In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, num_token, embedding_dim, hidden_dim, num_layers, dropout=0.5):
        super(RNNModel, self).__init__()

        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(num_token, embedding_dim)

        if rnn_type == 'LSTM':
            self.rnn = ...
        elif rnn_type == 'RNN':
            self.rnn = ...
        else:
            raise NotImplementedError("""Only RNN and LSTM are implemented yet""")
            
        self.decoder = nn.Linear(hidden_dim, num_token)

        self.init_weights()

        self.rnn_type = rnn_type
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        emb = self.drop(self.encoder(input))

        output, hidden = ...

        output = self.drop(output)

        decoded = ...

        return decoded, hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.num_layers, bsz, self.hidden_dim),
                    weight.new_zeros(self.num_layers, bsz, self.hidden_dim))
        else:
            return weight.new_zeros(self.num_layers, bsz, self.hidden_dim)

## Train your own language model

We are all set to train our language model! Let's get some data first

In [None]:
!wget -O train.txt https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/train.txt
!wget -O valid.txt https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/valid.txt
!wget -O test.txt https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/test.txt

--2022-03-16 18:10:15--  https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10797148 (10M) [text/plain]
Saving to: ‘train.txt’


2022-03-16 18:10:15 (170 MB/s) - ‘train.txt’ saved [10797148/10797148]

--2022-03-16 18:10:15--  https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/valid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1121681 (1.1M) [text/plain]
Saving to: ‘valid.txt’



We define the arguments that we will need to train our model.

In [None]:
import argparse
import time
import math
import os
import torch
import torch.nn as nn
import torch.onnx

args = argparse.Namespace(
  data='.',
  model='LSTM',
  emsize=200,
  nhid=200,
  nlayers=2,
  lr=20,
  clip=0.25,
  epochs=20,
  batch_size=20,
  bptt=35,
  dropout=0.2,
  seed=1111,
  cuda=True,
  log_interval=200,
  save='model.pt'
)

torch.manual_seed(args.seed)

# Fall back to the CPU when CUDA is requested but not available.
if args.cuda and torch.cuda.is_available():
    device = "cuda:0"
else:
    device = "cpu"

Look at the function batchify(). What does it do?

In [None]:
###############################################################################
# Load data
###############################################################################

corpus = Corpus(args.data)

# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e.g. 'g' on 'f' cannot be learned, but allows more efficient
# batch processing.

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

eval_batch_size = 10
train_data = batchify(corpus.train, args.batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

./train.txt
./valid.txt
./test.txt
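
The alphabet example from the comment can be reproduced directly. The sketch below assumes PyTorch is installed and uses a hypothetical helper, `batchify_cpu`, which applies the same reshaping as `batchify` but skips the device transfer so that it runs anywhere.

```python
import string
import torch

def batchify_cpu(data, bsz):
    """Same reshaping as batchify above, minus the transfer to `device`."""
    nbatch = data.size(0) // bsz
    # Drop the trailing elements that do not fill a complete column.
    data = data.narrow(0, 0, nbatch * bsz)
    # Split into bsz contiguous pieces, then make each piece a column.
    return data.view(bsz, -1).t().contiguous()

letters = string.ascii_lowercase      # 26 "tokens"
ids = torch.arange(len(letters))      # token ids 0..25
batched = batchify_cpu(ids, 4)        # 26 // 4 = 6 rows; 'y' and 'z' are trimmed

# Map the ids back to letters to compare with the diagram in the comment.
rows = ["".join(letters[i] for i in row.tolist()) for row in batched]
for row in rows:
    print(" ".join(row))
```

Reading down a column gives consecutive tokens of the original sequence, which is what the recurrent model consumes step by step.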


Build the model

In [None]:
ntokens = len(corpus.dictionary)
model = RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout).to(device)
criterion = nn.CrossEntropyLoss()


Fill in the TODOs in the train() function: pass the current batch through the model, compute the loss, and compute the gradients.

In [None]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


# get_batch subdivides the source data into chunks of length args.bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two tensors for i = 0:
# ┌ a g m s ┐ ┌ b h n t ┐
# └ b h n t ┘ └ c i o u ┘
# Note that despite the name of the function, the subdivision of data is not
# done along the batch dimension (i.e. dimension 1), since that was handled
# by the batchify function. The chunks are along dimension 0, corresponding
# to the seq_len dimension in the LSTM.

def get_batch(source, i):
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(model, data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    # The hidden state must be initialised before the first forward pass;
    # the batch size is the number of columns produced by batchify.
    hidden = model.init_hidden(data_source.size(1))

    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, args.bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            output_flat = output.view(-1, ntokens)
            # Weight each chunk's mean loss by its length.
            total_loss += len(data) * criterion(output_flat, targets).item()

    return total_loss / (len(data_source) - 1)


def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(args.batch_size)

    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        hidden = repackage_hidden(hidden)

        #TODO: forward through the model and compute the loss and the gradients

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
        # Plain SGD step: p <- p - lr * grad.
        with torch.no_grad():
            for p in model.parameters():
                p.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % args.log_interval == 0 and batch > 0:
            cur_loss = total_loss / args.log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // args.bptt, lr,
                elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()


### Training loop

Training for 20 epochs should take ~15 minutes.

In [None]:
# Loop over epochs.
lr = args.lr
best_val_loss = None

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, args.epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(model, val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if best_val_loss is None or val_loss < best_val_loss:
            with open(args.save, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')



In [None]:
# Run on test data.
test_loss = evaluate(model, test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)


| End of training | test loss  4.68 | test ppl   107.28
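
The "ppl" value printed next to the loss is the perplexity, computed as `math.exp(loss)`, where the loss is the mean cross-entropy per token. The sketch below illustrates this relationship on hypothetical probabilities. The function `perplexity` and the vocabulary size are assumptions made for the example.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability given to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

vocab_size = 1000  # hypothetical vocabulary size

# A model that guesses uniformly has a perplexity equal to the vocabulary size.
uniform_ppl = perplexity([1.0 / vocab_size] * 50)

# A model that always gives probability 0.5 to the correct word has perplexity 2.
confident_ppl = perplexity([0.5] * 50)

print(round(uniform_ppl, 2), round(confident_ppl, 2))
```

Read this way, a test perplexity of about 107 means the model is roughly as uncertain as if it were choosing uniformly among 107 words at each step.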


## Generate sentences with a trained language model

In [None]:
args.words = 200
args.temperature = 1

model.eval()

ntokens = len(corpus.dictionary)

hidden = model.init_hidden(1)
input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)

text = []
with torch.no_grad():  # no tracking history
    for i in range(args.words):
        output, hidden = model(input, hidden)
        word_weights = output.squeeze().div(args.temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        input.fill_(word_idx)

        word = corpus.dictionary.idx2word[word_idx]
        text.append(word)

        if i % args.log_interval == 0:
            print('| Generated {}/{} words'.format(i, args.words))

print(" ".join(text))

| Generated 0/200 words
Man . In 1867 , Fu Riata recorded the most archaeological civic battery of the series . As its system to the United States was exhausted , the Soviet Union was financed because of a duty as attention to the world ; he gave several new works portraying her house and the details involved in an existing theatre . The city was a <unk> @-@ to @-@ answer industrial network three hours before Comair 's arrival in 2005 . In Mogadishu , Howard he worked The conclave of Germany 4 months after his death in 1975 . The county also began dating to Bristol Gardens to Los Angeles in 2011 , taking out in 1859 . In August 2014 , an article for the O Tempo <unk> was led to Eugène Monsen , William A. Humboldt and <unk> Gurion Street , based on his debut and then set up in 2010 . Shortly sales made to <unk> all of the popular <unk> drinking mechanics , and later in its first year of dense the cause of testing , Chulachomklao individuals match despite the <unk> of an million people t
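
The generation loop divides the model outputs by `args.temperature` before exponentiating and sampling. The pure-Python sketch below shows the effect of that parameter on a hypothetical set of three scores. The helper `temperature_probs` is made up for this illustration, and it normalises the weights explicitly, whereas `torch.multinomial` accepts unnormalised weights.

```python
import math
import random

def temperature_probs(logits, temperature):
    """Normalised exp(logit / temperature), i.e. a softmax with temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three words

sharp = temperature_probs(toy_logits, 0.5)  # low temperature: favours the top word
plain = temperature_probs(toy_logits, 1.0)  # temperature 1: ordinary softmax
flat = temperature_probs(toy_logits, 2.0)   # high temperature: closer to uniform

# Draw one word index according to the temperature-1 distribution.
random.seed(0)
sampled_idx = random.choices(range(len(toy_logits)), weights=plain, k=1)[0]

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in plain])
print([round(p, 3) for p in flat])
```

Lower temperatures make the generated text more repetitive but safer; higher temperatures make it more varied but less coherent.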

## Demo of the best language models

https://transformer.huggingface.co/