In [1]:
import re
import numpy as np
from tqdm import tqdm

In [None]:
inp_data = """Since the so-called "statistical revolution"[15][16] in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora (the plural form of corpus, is a set of documents, possibly with human or computer annotations) of typical real-world examples.
Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. These algorithms take as input a large set of "features" that are generated from the input data. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature (complex-valued embeddings,[17] and neural networks in general have also been proposed, for e.g. speech[18]). Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
Since the neural turn, statistical methods in NLP research have been largely replaced by neural networks. However, they continue to be relevant for contexts in which statistical interpretability and transparency is required.
Neural networks
Further information: Artificial neural network
A major drawback of statistical methods is that they require elaborate feature engineering. Since 2015,[19] the field has thus largely abandoned statistical methods and shifted to neural networks for machine learning. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT)."""

In [None]:
# remove special characters
inp_data2 = re.sub('[^A-Za-z0-9]+', ' ', inp_data)

# remove 1 letter words
inp_data2 = re.sub(r'(?:^| )\w(?:$| )', ' ', inp_data2).strip()

# lower all characters
inp_data2 = inp_data2.lower()
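To see what the three cleaning steps do, they can be run on a short string. This is a minimal sketch: the helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def clean_text(text):
    # Hypothetical helper bundling the three cleaning steps from the cell above.
    text = re.sub('[^A-Za-z0-9]+', ' ', text)               # non-alphanumerics -> one space
    text = re.sub(r'(?:^| )\w(?:$| )', ' ', text).strip()   # drop one-character words
    return text.lower()

sample = 'Since the "statistical revolution"[15] is a turn, NLP changed.'
cleaned = clean_text(sample)
print(cleaned)
```

One caveat: the single-character pattern consumes the space on both sides of a match, so in a run of adjacent one-character words such as `'x y z'` the middle one is not matched and survives.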

In [None]:
inp_data2

'the speed of transmission is an important point of difference between the two viruses influenza has shorter median incubation period the time from infection to appearance of symptoms and shorter serial interval the time between successive cases than covid 19 virus the serial interval for covid 19 virus is estimated to be 6 days while for influenza virus the serial interval is days this means that influenza can spread faster than covid 19 further transmission in the first 5 days of illness or potentially pre symptomatic transmission transmission of the virus before the appearance of symptoms is major driver of transmission for influenza in contrast while we are learning that there are people who can shed covid 19 virus 24 48 hours prior to symptom onset at present this does not appear to be major driver of transmission the reproductive number the number of secondary infections generated from one infected individual is understood to be between and 5 for covid 19 virus higher than for in

In [None]:
tokens = inp_data2.split()
vocab = set(tokens)

In [None]:
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for i, word in enumerate(vocab)}
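Python randomizes string hashes per process by default, so iterating over a `set` of words can give a different order, and therefore different indices, from one run to the next. A minimal sketch of one way to get a reproducible mapping, assuming a sorted vocabulary is acceptable (the toy token list is illustrative):

```python
# Toy token list, for illustration only
tokens = "the speed of transmission is an important point of difference".split()

# Sorting the unique tokens fixes the index assignment across runs
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}

print(word_to_index)
```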

In [None]:
context_size = 4  # Hyper-parameter
batch_size = 8
embedding_size = 10  # Hyper-parameter
vocab_size = len(vocab)  # Number of unique tokens in the vocabulary

In [None]:
embedding_size, vocab_size

(10, 97)

In [None]:
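data = []

# context_size is the total number of context words per target,
# so take half of them from each side of the target.
half_window = context_size // 2

for i in range(half_window, len(tokens) - half_window):
  target = tokens[i]
  context = []
  for j in range(1, half_window + 1):
    context += [tokens[i-j], tokens[i+j]]
  data.append((context, target))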

print(data[:5])

[(['speed', 'transmission', 'the', 'is'], 'of'), (['of', 'is', 'speed', 'an'], 'transmission'), (['transmission', 'an', 'of', 'important'], 'is'), (['is', 'important', 'transmission', 'point'], 'an'), (['an', 'point', 'is', 'of'], 'important')]
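The pairing of each target word with its neighbours is easier to follow on a toy sentence. In this sketch the names `build_pairs` and `half_window` are illustrative assumptions; the context order (nearest left, nearest right, then one step further out) mirrors the output above.

```python
def build_pairs(tokens, half_window):
    # For each position with enough neighbours on both sides, collect
    # half_window words to the left and to the right of the target.
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = []
        for j in range(1, half_window + 1):
            context += [tokens[i - j], tokens[i + j]]
        pairs.append((context, tokens[i]))
    return pairs

toy = "the speed of transmission is an".split()
pairs = build_pairs(toy, half_window=2)
for context, target in pairs:
    print(context, '->', target)
```

A sequence of `n` tokens yields `n - 2 * half_window` pairs, because the first and last `half_window` tokens never serve as targets.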


In [None]:
len(data)

182

In [None]:
# Initialize Weights and Bias matrices
def initialize_weights(vocab_size, embedding_size):
  W1 = np.random.rand(embedding_size, vocab_size)
  W2 = np.random.rand(vocab_size, embedding_size)
  b1 = np.random.rand(embedding_size, 1)
  b2 = np.random.rand(vocab_size, 1)
  return W1, b1, W2, b2

In [None]:
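def softmax_activation(z):
  # Subtract the column-wise max before exponentiating to avoid overflow;
  # the result is unchanged because softmax is shift-invariant.
  e = np.exp(z - np.max(z, axis=0, keepdims=True))
  return e/np.sum(e, axis=0, keepdims=True)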

In [None]:
# Forward Propagation:
# 1. Linear Hidden State: h = W1.X + b1
# 2. Output Layer:        z = W2.h + b2
# Softmax is applied to z separately via softmax_activation.

def forward_propagation(X, W1, W2, b1, b2):
  # Hidden layer: average of the context-word embeddings plus bias
  h = np.dot(W1, X) + b1
  # Output layer: unnormalized scores (logits) over the vocabulary
  z = np.dot(W2, h) + b2
  return z, h
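A shape walk-through of the forward pass on toy sizes can help when debugging. This is a sketch under assumed dimensions (5 words, 3 embedding dimensions, a batch of 2), with inputs made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size, batch = 5, 3, 2   # toy sizes, chosen for illustration

W1 = rng.random((embedding_size, vocab_size))
b1 = rng.random((embedding_size, 1))
W2 = rng.random((vocab_size, embedding_size))
b2 = rng.random((vocab_size, 1))

# Each column of X is one example: an averaged bag of two context words
X = np.zeros((vocab_size, batch))
X[[0, 2], 0] = 0.5
X[[1, 4], 1] = 0.5

h = W1 @ X + b1                                # (embedding_size, batch)
z = W2 @ h + b2                                # (vocab_size, batch)
e = np.exp(z - z.max(axis=0, keepdims=True))   # column-wise softmax
y_hat = e / e.sum(axis=0, keepdims=True)

print(h.shape, z.shape, y_hat.sum(axis=0))
```

Each column of `y_hat` is a probability distribution over the vocabulary, so the column sums should all be 1.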

In [None]:
def transform_data(context_words, target_word, vocab_size):
  # Averaged bag-of-words vector for the context, one-hot vector for the target
  input_vector = np.zeros(vocab_size)
  target_vector = np.zeros(vocab_size)

  for word in context_words:
    input_vector[word_to_index[word]] += 1

  input_vector = input_vector/len(context_words)

  target_vector[word_to_index[target_word]] = 1

  # Return flat (vocab_size,) vectors so they can be assigned to columns of X and y
  return input_vector, target_vector
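The encoding can be checked by hand on a five-word vocabulary. This sketch uses a made-up `word_to_index` mapping and a hypothetical `encode` helper that follows the same averaging idea.

```python
import numpy as np

# Toy mapping, for illustration only
word_to_index = {'an': 0, 'is': 1, 'of': 2, 'speed': 3, 'the': 4}

def encode(context_words, target_word, vocab_size):
    # Context: count each word, then divide by the number of context words
    x = np.zeros(vocab_size)
    for w in context_words:
        x[word_to_index[w]] += 1
    x /= len(context_words)
    # Target: one-hot vector
    y = np.zeros(vocab_size)
    y[word_to_index[target_word]] = 1
    return x, y

x, y = encode(['speed', 'is', 'the', 'an'], 'of', vocab_size=5)
print(x)
print(y)
```

With four distinct context words each entry is 0.25; a word that appears twice in the context would get 0.5.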

In [None]:
def transform_batch_data(batch_data, vocab_size):
  X=np.zeros((vocab_size, len(batch_data)))
  y=np.zeros((vocab_size, len(batch_data)))

  for i in range(len(batch_data)):
    context_words = batch_data[i][0]
    target_word = batch_data[i][1]

    X[:, i], y[:, i] = transform_data(context_words, target_word, vocab_size)

  return X, y

In [None]:
def get_batch_indices(data, batch_size):
  batch_indices = []

  for i in range(0, len(data), batch_size):
    if (i+batch_size <= len(data)):
      batch_indices.append((i, i+batch_size))
    else:
      batch_indices.append((i, len(data)))
  return batch_indices

In [None]:
batch_indices = get_batch_indices(data, batch_size)

batch_indices

[(0, 8),
 (8, 16),
 (16, 24),
 (24, 32),
 (32, 40),
 (40, 48),
 (48, 56),
 (56, 64),
 (64, 72),
 (72, 80),
 (80, 88),
 (88, 96),
 (96, 104),
 (104, 112),
 (112, 120),
 (120, 128),
 (128, 136),
 (136, 144),
 (144, 152),
 (152, 160),
 (160, 168),
 (168, 176),
 (176, 182)]
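The same index pairs can be produced more compactly by clamping the end of each slice to the data length. This is an alternative sketch, not the notebook's function; the name `batch_ranges` is hypothetical, and the sizes 182 and 8 are taken from the outputs above.

```python
def batch_ranges(n_items, batch_size):
    # (start, end) pairs covering range(n_items); the last pair may be shorter
    return [(i, min(i + batch_size, n_items)) for i in range(0, n_items, batch_size)]

ranges = batch_ranges(182, 8)
print(len(ranges), ranges[0], ranges[-1])
```

The final batch holds only 6 examples, which matters wherever a quantity is averaged over the batch.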

In [None]:
# compute_cost: cross-entropy cost function
def compute_cost(y, yhat, batch_size):
    # Sum -log(probability assigned to the true word) over the batch, then average
    logprobs = np.multiply(np.log(yhat),y)
    cost = - 1/batch_size * np.sum(logprobs)
    cost = np.squeeze(cost)
    return cost
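A useful reference point for this cost: a model that predicts a uniform distribution over the vocabulary scores exactly log(vocab_size). The sketch below checks that for a vocabulary of 97 words and a batch of 8 (sizes taken from the cells above); the function name `cross_entropy` is an illustrative assumption.

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Average negative log-probability of the true word over the batch
    m = y.shape[1]
    return -np.sum(y * np.log(y_hat)) / m

vocab_size, m = 97, 8
y = np.zeros((vocab_size, m))
y[np.arange(m), np.arange(m)] = 1                    # arbitrary one-hot targets
y_hat = np.full((vocab_size, m), 1.0 / vocab_size)   # uniform prediction

cost = cross_entropy(y, y_hat)
print(cost, np.log(vocab_size))
```

Both values are about 4.57, so a training cost well below that indicates the model is doing better than guessing uniformly.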

In [None]:
def back_propagation(X, y_hat, y, h, W1, W2, b1, b2, batch_size):
    # Error at the output layer (gradient of the cost w.r.t. z, before averaging)
    dz = y_hat - y
    # Error propagated back to the hidden layer
    l1 = np.dot(W2.T, dz)

    grad_W1 = (1/batch_size)*np.dot(l1, X.T)
    grad_W2 = (1/batch_size)*np.dot(dz, h.T)

    # Bias gradients sum the error terms over the batch, not the weight gradients
    grad_b1 = (1/batch_size)*np.sum(l1, axis=1, keepdims=True)
    grad_b2 = (1/batch_size)*np.sum(dz, axis=1, keepdims=True)

    return grad_W1, grad_W2, grad_b1, grad_b2
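Analytic gradients for a network like this can be validated with a finite-difference check. The sketch below is self-contained and uses toy sizes; the helper names (`loss`, `analytic_grads`, `numeric_grad`) are illustrative assumptions. It compares the closed-form gradients of the linear-hidden-layer plus softmax model against central differences of the cross-entropy cost.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def loss(params, X, y):
    W1, b1, W2, b2 = params
    h = W1 @ X + b1
    y_hat = softmax(W2 @ h + b2)
    return -np.sum(y * np.log(y_hat)) / X.shape[1]

def analytic_grads(params, X, y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    h = W1 @ X + b1
    dz = softmax(W2 @ h + b2) - y   # output-layer error
    l1 = W2.T @ dz                  # hidden-layer error
    # Order matches params: W1, b1, W2, b2
    return [l1 @ X.T / m,
            l1.sum(axis=1, keepdims=True) / m,
            dz @ h.T / m,
            dz.sum(axis=1, keepdims=True) / m]

def numeric_grad(params, k, X, y, eps=1e-6):
    # Central difference on every entry of params[k]
    grad = np.zeros_like(params[k])
    for idx in np.ndindex(params[k].shape):
        old = params[k][idx]
        params[k][idx] = old + eps
        up = loss(params, X, y)
        params[k][idx] = old - eps
        down = loss(params, X, y)
        params[k][idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
V, E, m = 6, 3, 4                   # toy vocabulary, embedding and batch sizes
params = [rng.random((E, V)), rng.random((E, 1)),
          rng.random((V, E)), rng.random((V, 1))]
X = rng.random((V, m))
X /= X.sum(axis=0)                  # columns sum to 1, like averaged contexts
y = np.eye(V)[:, rng.integers(0, V, size=m)]   # one-hot target columns

analytic = analytic_grads(params, X, y)
max_err = max(np.abs(a - numeric_grad(params, k, X, y)).max()
              for k, a in enumerate(analytic))
print(max_err)
```

A maximum discrepancy many orders of magnitude below the gradient values suggests the formulas agree; a large discrepancy in only the bias terms would point to a mistake in how the errors are summed over the batch.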

In [None]:
alpha = 0.3

W1, b1, W2, b2 = initialize_weights(vocab_size, embedding_size)

batch_indices = get_batch_indices(data, batch_size)

for i in range(1000):
  for start_index, end_index in batch_indices:
    X, y = transform_batch_data(data[start_index:end_index], vocab_size)
    m = X.shape[1]  # actual batch size; the last batch may be smaller
    z, h = forward_propagation(X, W1, W2, b1, b2)
    y_hat = softmax_activation(z)
    cost = compute_cost(y, y_hat, m)

    grad_W1, grad_W2, grad_b1, grad_b2 = back_propagation(X, y_hat, y, h, W1, W2, b1, b2, m)
          
    # Update weights and biases
    W1 -= alpha*grad_W1 
    W2 -= alpha*grad_W2
    b1 -= alpha*grad_b1
    b2 -= alpha*grad_b2
  print("Cost: ", cost, " after Iteration#: ", i+1)

Cost:  4.860873924003341  after Iteration#:  1
Cost:  4.639227311257319  after Iteration#:  2
Cost:  4.519472093866707  after Iteration#:  3
Cost:  4.444629739184146  after Iteration#:  4
Cost:  4.3932501319903  after Iteration#:  5
Cost:  4.3550904165667  after Iteration#:  6
Cost:  4.324736289432322  after Iteration#:  7
Cost:  4.2991473722883  after Iteration#:  8
Cost:  4.276523291297011  after Iteration#:  9
Cost:  4.2557433382090455  after Iteration#:  10
Cost:  4.236076826004041  after Iteration#:  11
Cost:  4.2170273981094475  after Iteration#:  12
Cost:  4.198246506302962  after Iteration#:  13
Cost:  4.179484125168107  after Iteration#:  14
Cost:  4.160560296126622  after Iteration#:  15
Cost:  4.141348676351108  after Iteration#:  16
Cost:  4.121767138279637  after Iteration#:  17
Cost:  4.101772608632421  after Iteration#:  18
Cost:  4.081358688029921  after Iteration#:  19
Cost:  4.060555523435427  after Iteration#:  20
Cost:  4.039431928228413  after Iteration#:  21
Cost:

In [None]:
W1.T.shape, W2.shape

((97, 10), (97, 10))

In [None]:
((W1.T + W2) / 2).shape

(97, 10)
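Averaging `W1.T` and `W2` gives one vector per word, and a common way to inspect such vectors is a cosine-similarity lookup. The sketch below uses hand-made toy matrices in place of trained weights, and `most_similar` is a hypothetical helper; the neighbours it prints say nothing about what the trained model would return.

```python
import numpy as np

def most_similar(word, embeddings, word_to_index, index_to_word, k=3):
    # Rank all other words by cosine similarity to the query word's vector
    v = embeddings[word_to_index[word]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v)
    sims = embeddings @ v / np.maximum(norms, 1e-12)
    order = np.argsort(-sims)
    ranked = [(index_to_word[int(i)], float(sims[i])) for i in order]
    return [pair for pair in ranked if pair[0] != word][:k]

# Toy stand-ins for the trained weights (values made up for illustration)
words = ['neural', 'networks', 'models', 'rules']
word_to_index = {w: i for i, w in enumerate(words)}
index_to_word = {i: w for w, i in word_to_index.items()}
W1 = np.array([[1.0, 0.9, 0.8, -1.0],
               [0.0, 0.1, 0.3,  1.0]])   # (embedding_size, vocab_size)
W2 = W1.T.copy()                          # (vocab_size, embedding_size)

embeddings = (W1.T + W2) / 2              # one row per word
neighbours = most_similar('neural', embeddings, word_to_index, index_to_word, k=2)
print(neighbours)
```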