<a href="https://colab.research.google.com/github/RodrigoRoman/ml_ai_portafolio/blob/main/recurrent_neural_net/rnn_from_zero_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Recurrent Neural Network Language Model Using Numpy</h1>
<h3>A Ground-Up Approach</h3>
<p>This project seeks to build a basic language model using newspaper articles, constructing a dictionary for word indexing. Entirely coded with numpy for clarity in the underlying mathematical processes, the model generates word embeddings that elucidate complex word relationships. The model faces limitations such as the vanishing gradient issue, which constrains the training to a limited set of articles. This self-contained approach offers insight into language model mechanics and neural network training subtleties.</p>

In [77]:
import numpy as np
import random

In [78]:
from google.colab import drive

<p>We need access to a file that is located in my GDrive. The file is set public, so you can download it and just change the route below as needed.</p>
<p>The link to the file is the following: https://drive.google.com/file/d/1JhKE-1NJKM033tOf8OYoN3vwvoydHhVK/view?usp=drive_link</p>

In [79]:
news_s_path = "/content/drive/MyDrive/newsSpace"

<h2>Data Preprocessing Functions</h2>
<h3>Data cleaning</h3>
<p>We focus on extracting the text of each article and segmenting it into individual words in order to create our word dictionary. During this process, we also refine each word by removing any special characters and converting them to lowercase. This ensures a standardized and clean dataset for our model, facilitating more effective training and analysis.</p>

In [80]:
import re
def is_url(s):
    # A simple regex to check for a basic URL structure
    return re.match(r'https?://', s) is not None
def tokenize_article(line):
  url_index = next((i for i, item in enumerate(line) if is_url(item)), None)
  if url_index is not None:
    return re.split(r'[ ,.;:!?()]+', ' '.join(line[url_index+1:]))
  return None


def process_file(filepath, num_articles):
  articles = []
  vocabulary = set()
  pattern = re.compile(r'[ ,.;:!?()]+')
  word_pattern = re.compile(r"\b[A-Za-z]+'?[A-Za-z]*(?=\s|\b)")
  try:
    with open(filepath, encoding='ISO-8859-1') as file:
      data = file.read()
      print(data)
      pattern = re.compile(r"\((Reuters|AP)\)[\t\n]+(.*?)[\t\n]+\d+[\t\n]+[0-9]{4}-[0-9]{2}-[0-9]{2}", re.DOTALL)
      raw_articles = pattern.findall(data)
      print("amount of articles")
      print(len(raw_articles))

      for article in raw_articles:
        if len(articles) < num_articles:
          article_text = article[1].strip()
          # Cleaning and processing the article text
          words = word_pattern.findall(article_text.lower())
          # cleaned_article = ' '.join(words)
          articles.append(words)
          # Update vocabulary
          vocabulary.update(words)
  except IOError as e:
    print("Error opening or reading the file:", e)
    return [], set()
  return articles, vocabulary



<h2>Data Loading</h2>
<p>Here we have the specific details of how many articles we will be working with</p>


In [81]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [82]:
num_articles = 15
news_s_path = "/content/drive/MyDrive/newsSpace"

data_articles, vocabulary = process_file(news_s_path, num_articles)
print(data_articles)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



amount of articles
57469
[['none', 'business', 'reuters', 'wall', "street's", 'long', 'playing', 'drama', 'waiting', 'for', 'google', 'is', 'about', 'to', 'reach', 'its', 'final', 'act', 'but', 'its', 'stock', 'market', 'debut', 'is', 'ending', 'up', 'as', 'more', 'of', 'a', 'nostalgia', 'event', 'than', 'the', 'catalyst', 'for', 'a', 'new', 'era'], ['none', 'business', 'reuters', 'short', 'sellers', 'wall', "street's", 'dwindling', 'band', 'of', 'ultra', 'cynics', 'are', 'seeing', 'green', 'again'], ['none', 'business', 'reuters', 'private', 'investment', 'firm', 'carlyle', 'group', 'which', 'has', 'a', 'reputation', 'for', 'making', 'well', 'timed', 'and', 'occasionally', 'controversial', 'plays', 'in', 'the', 'defense', 'industry', 'has', 'quietly', 'placed', 'its', 'bets', 'on', 'another', 'part', 'of', 'the', 'market'], ['none', 'business', 'reuters', 'soaring', 'crude', 'prices', 'plus', 'worries', 'about', 'the', 'economy', 'and', 'the', 'outlook', 'for', 'earnings', 'are', 'exp

In [83]:
#Check out how each article is loaded as a list
data_articles

[['none',
  'business',
  'reuters',
  'wall',
  "street's",
  'long',
  'playing',
  'drama',
  'waiting',
  'for',
  'google',
  'is',
  'about',
  'to',
  'reach',
  'its',
  'final',
  'act',
  'but',
  'its',
  'stock',
  'market',
  'debut',
  'is',
  'ending',
  'up',
  'as',
  'more',
  'of',
  'a',
  'nostalgia',
  'event',
  'than',
  'the',
  'catalyst',
  'for',
  'a',
  'new',
  'era'],
 ['none',
  'business',
  'reuters',
  'short',
  'sellers',
  'wall',
  "street's",
  'dwindling',
  'band',
  'of',
  'ultra',
  'cynics',
  'are',
  'seeing',
  'green',
  'again'],
 ['none',
  'business',
  'reuters',
  'private',
  'investment',
  'firm',
  'carlyle',
  'group',
  'which',
  'has',
  'a',
  'reputation',
  'for',
  'making',
  'well',
  'timed',
  'and',
  'occasionally',
  'controversial',
  'plays',
  'in',
  'the',
  'defense',
  'industry',
  'has',
  'quietly',
  'placed',
  'its',
  'bets',
  'on',
  'another',
  'part',
  'of',
  'the',
  'market'],
 ['none',
  

<h3>Split data into train and test</h3>
<p>Divide the dataset into two distinct sets: training and testing. The training set is used to teach the model, while the testing set evaluates its performance. Additionally, we organize the data into pairs of inputs and targets, where each input is a word from the text, and its corresponding target is the subsequent word. This structure is fundamental in training the model to predict the next word in a sequence based on the current word.</p>
<p>To facilitate word processing and retrieval in our model, we establish two dictionaries. The first dictionary, known as the word-to-index dictionary, enables us to convert a given word from our corpus into its corresponding numerical index. Conversely, the second dictionary, the index-to-word dictionary, allows us to retrieve the original word from its index. These dictionaries translate between words and their numerical representations, an important process for the computational handling of text data.</p>

In [84]:

# Preprocess articles to create input-target pairs
def create_input_target(articles):
  article_targets = []
  for article in articles:
    target_article = []
    for i in range(len(article) - 1):
      target_word = article[i + 1]
      target_article.append(target_word)
    article_targets.append(target_article)
  return article_targets

# Split data into training and test sets
def split_data(data, test_percentage):
  split_point = int(len(data) * test_percentage)
  test_set = data[:split_point]
  training_set = data[split_point:]
  return training_set, test_set


# Example usage
test_percentage = 0.2

# Create input-target pairs
article_targets  = create_input_target(data_articles)

# Vocabulary word-to-index and index-to-word
word_to_idx = {ch:i for (i,ch) in enumerate(list(vocabulary))}
idx_to_word = {i:ch for (i,ch) in enumerate(list(vocabulary))}

# Take input-target as x and y
x_train_words, x_test_words = split_data(data_articles,test_percentage)
y_train_words, y_test_words = split_data(article_targets,test_percentage)

# Change the data to their index versions
x_train = [[word_to_idx[word] for word in article if word in word_to_idx] for article in x_train_words]
y_train = [[word_to_idx[word] for word in article if word in word_to_idx] for article in y_train_words]
x_test = [[word_to_idx[word] for word in article if word in word_to_idx] for article in x_test_words]
y_test = [[word_to_idx[word] for word in article if word in word_to_idx] for article in y_test_words]


In [85]:
print(x_train_words[:100])
print(y_train_words[:100])

[['none', 'business', 'reuters', 'soaring', 'crude', 'prices', 'plus', 'worries', 'about', 'the', 'economy', 'and', 'the', 'outlook', 'for', 'earnings', 'are', 'expected', 'to', 'hang', 'over', 'the', 'stock', 'market', 'next', 'week', 'during', 'the', 'depth', 'of', 'the', 'summer', 'doldrums'], ['none', 'business', 'reuters', 'authorities', 'have', 'halted', 'oil', 'export', 'flows', 'from', 'the', 'main', 'pipeline', 'in', 'southern', 'iraq', 'after', 'intelligence', 'showed', 'a', 'rebel', 'militia', 'could', 'strike', 'infrastructure', 'an', 'oil', 'official', 'said', 'on', 'saturday'], ['none', 'business', 'reuters', 'stocks', 'ended', 'slightly', 'higher', 'on', 'friday', 'but', 'stayed', 'near', 'lows', 'for', 'the', 'year', 'as', 'oil', 'prices', 'surged', 'past', 'a', 'barrel', 'offsetting', 'a', 'positive', 'outlook', 'from', 'computer', 'maker', 'dell', 'inc', 'dell', 'o'], ['none', 'business', 'ap', 'assets', 'of', 'the', "nation's", 'retail', 'money', 'market', 'mutual', 

In [86]:
print(x_train[0])
print(y_train[0])

[3, 88, 30, 61, 293, 96, 21, 253, 51, 254, 162, 234, 254, 181, 215, 161, 301, 66, 82, 294, 89, 254, 218, 130, 247, 224, 243, 254, 235, 228, 254, 261, 222]
[88, 30, 61, 293, 96, 21, 253, 51, 254, 162, 234, 254, 181, 215, 161, 301, 66, 82, 294, 89, 254, 218, 130, 247, 224, 243, 254, 235, 228, 254, 261, 222]


<h3>The RNN Model</h3>
<p>This class, RNN, defines a basic recurrent neural network (RNN) for language modeling. It's initialized with parameters like hidden layer size, vocabulary size, and learning rate. The model utilizes matrices (W_e, W_y, W_h) and biases (bh, by) for computations. The softmax function converts raw scores to probabilities, while cross-entropy measures prediction accuracy. The forward pass calculates hidden states and output predictions, and the backward pass adjusts weights through gradient descent. Training involves iterating over epochs, processing input sequences, and updating model parameters based on the loss calculated from the predicted and actual next words. The model's accuracy is evaluated by comparing the predicted words with actual targets, demonstrating its ability to learn from textual data.</p>

In [87]:
class RNN:
  def __init__(self, hidden_size,vocab_size,learning_rate):
    self.hidden_size = hidden_size
    self.vocab_size = vocab_size
    # self.embedding_size = embedding_size
    self.learning_rate = learning_rate

    # Model parameters
    self.W_e = np.random.uniform(-np.sqrt(1./vocab_size), np.sqrt(1./vocab_size), (hidden_size, vocab_size))
    self.W_y = np.random.uniform(-np.sqrt(1./hidden_size), np.sqrt(1./hidden_size), (vocab_size, hidden_size))
    self.W_h = np.random.uniform(-np.sqrt(1./hidden_size), np.sqrt(1./hidden_size), (hidden_size, hidden_size))
    self.bh = np.zeros((hidden_size, 1)) # bias for hidden layer
    self.by = np.zeros((vocab_size, 1)) # bias for output


  # Convert values to probabilies
  def softmax(self,x):
    shift_x = x - np.max(x)
    exp_shift_x = np.exp(shift_x)
    softmax_output = exp_shift_x / np.sum(exp_shift_x)
    return softmax_output

  # Cross-entropy loss measures the difference between a predicted probability distribution and the correct distribution
  def cross_entropy(self, probs,targets):
    loss = 0
    epsilon = 1e-9  # Small constant for numerical stability
    clipping_threshold = 1e-5  # Threshold for clipping probabilities
    for t in range(len(targets)):
        # Clipping the probability to avoid extremely small values
        prob = max(min(probs[t][targets[t]][0], 1 - clipping_threshold), epsilon)
        loss += -np.log(prob)
    return loss

  # Compute the forward pass given  a series of inputs
  # Return dictionaries for the state of embedded words, hidden layer and output layer.
  def forward(self,inputs,hprev):
    es,hs,ys = {},{},{}
    ps = {i: 0 for i in range(self.vocab_size)}
    hs[-1] = np.copy(hprev)
    for t in range(len(inputs)):
      es[t] = np.zeros((self.vocab_size,1))
      es[t][inputs[t]] = 1 # one hot encoding , 1-of-k
      hs[t] = np.tanh(np.dot(self.W_e,es[t]) + np.dot(self.W_h,hs[t-1]) + self.bh) # hidden state
      ps[t] = np.dot(self.W_y,hs[t]) + self.by # unnormalised log probs for next char
      ys[t] = self.softmax(ps[t])
    return es,hs,ps,ys

  # Compute backpropagation of the network
  def backward(self,es,hs,ps,targets):
      dW_e, dW_h, dW_y =  np.zeros_like(self.W_e),np.zeros_like(self.W_h),np.zeros_like(self.W_y)
      dbh, dby =  np.zeros_like(self.bh),np.zeros_like(self.by)
      dh_next = np.zeros_like(hs[0])
      for t in reversed(range(len(targets))):
        # Gradients
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1
        dW_y += np.dot(dy,hs[t].T)
        dby += dy
        dh = np.dot(self.W_y.T, dy) + dh_next
        # Ouput before applying softmax
        dh_rec = (1 - hs[t] * hs[t]) * dh
        dbh += dh_rec
        dW_e += np.dot(dh_rec, es[t].T)
        dW_h += np.dot(dh_rec, hs[t-1].T)
        dh_next = np.dot(self.W_h, dh_rec)
      return dW_e, dW_h, dW_y, dh, dy

  def update_parameters(self, dW_e, dW_h, dW_y, dbh, dby):
    self.W_e -= self.learning_rate * dW_e
    self.W_h -= self.learning_rate * dW_h
    self.W_y -= self.learning_rate * dW_y
    self.bh -= self.learning_rate * dbh
    self.by -= self.learning_rate * dby

  def train(self,x_train,y_train,epochs):
    for epoch in range(epochs):
      print("Epoch -> ", epoch)
      rand_print =  random.randint(1, len(x_train)-1)
      for batch_idx, inputs in enumerate(x_train):
        total_loss = 0
        correct_predictions = 0
        total_predictions = 0
        if batch_idx==0:
          h_prev = np.zeros((self.hidden_size,1))
        targets = y_train[batch_idx]
        es,hs,ps,ys = self.forward(inputs, h_prev)
        dW_e, dW_h, dW_y, dh, dy = self.backward(es, hs, ps, targets)
        loss = self.cross_entropy(ps, targets)
        self.update_parameters(dW_e, dW_h, dW_y, dh, dy)
        h_prev = hs[len(inputs)-1]
        for t in range(len(targets)):
          predicted_index = np.argmax(ys[t])
          correct_predictions += (predicted_index == targets[t])
          total_predictions += 1
        if(batch_idx == rand_print):
          print("Input sentence:")
          print([idx_to_word[i] for i in targets])
          print("Predicted:")
          print([idx_to_word[np.argmax(ys[i])] for i in range(len(targets))])
          print("Amount of correct predictions")
          print(correct_predictions)
          accuracy = correct_predictions/total_predictions
          print("Accuracy -> ", accuracy)
          print("Loss -> ", loss)


In [88]:
print(len(vocabulary))

305


In [None]:
epochs = 100
vocabulary_size = len(vocabulary)
embedding_size = 400
hidden_size = 1000
rnn = RNN(hidden_size=hidden_size, vocab_size=vocabulary_size,learning_rate=0.01)
rnn.train(x_train,y_train,epochs)

Epoch ->  0
Input sentence:
['sci', 'tech', 'reuters', 'a', 'group', 'of', 'consumer', 'electronics', 'makers', 'said', 'on', 'wednesday', 'they', 'approved', 'the', 'format', 'for', 'a', 'new', 'generation', 'of', 'discs', 'that', 'can', 'store', 'five', 'times', 'the', 'data', 'of', 'dvds', 'at', 'the', 'same', 'cost', 'enough', 'to', 'put', 'a', 'full', 'season', 'of', 'the', 'sopranos', 'on', 'one', 'disc']
Predicted:
['business', 'sci', 'reuters', 'hang', 'firm', 'business', 'the', 'occasionally', 'sci', 'are', 'halfway', 'when', 'sci', 'fans', 'a', 'struck', 'business', 'brcm', 'america', 'internet', 'even', 'ahead', 'business', 'sci', 'sci', 'stock', 'sci', 'pc', 'interview', 'business', 'the', 'halted', 'sci', 'with', 'sci', 'its', 'past', 'plays', 'than', 'contained', 'maker', 'sci', 'the', 'europe', 'playing', 'when', 'sell']
Amount of correct predictions
2
Accuracy ->  0.0425531914893617
Loss ->  453.604850481159
Epoch ->  1
Input sentence:
['sci', 'tech', 'ap', 'the', 'norw

<h3>Conclusions</h3>
<p>Our model achieved a remarkable 100% accuracy in predicting sentences within its training data. This high accuracy, however, may not extend to unseen articles, highlighting the model's limitations in handling new data. The primary value of this exercise lies in the generation of word embeddings. These embeddings are crucial for tasks such as classifying the genre of new articles or aiding in word completion. They represent a significant step forward in context-based word representation, demonstrating the potential of our model in various natural language processing applications.</p>