# Text Generator
This notebook implements a character-level text generation model from scratch using a decoder-only transformer.\
Steps:
1. Tokenization
2. Vectorization
3. Positional encoding
4. Masking
5. Self-attention
6. Decoder stack
7. Predicting token probabilities

## Creating Training Data

In [None]:
#conda install pytorch torchvision torchaudio -c pytorch

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import math
import pandas as pd

In [2]:
df = pd.read_csv('medium_articles.csv')

In [3]:
display(df.head())

| Unnamed: 0 | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |


In [9]:
text = df['text'][:100]
display(text.head())

0    Photo by Josh Riemer on Unsplash\n\nMerry Chri...
1    Your Brain On Coronavirus\n\nA guide to the cu...
2    Mind Your Nose\n\nHow smell training can chang...
3    Passionate about the synergy between science a...
4    You’ve heard of him, haven’t you? Phineas Gage...
Name: text, dtype: object

In [10]:
text.to_csv('training_data.csv')

## Tokenization

In [24]:
class Tokenizer:
    """Character-level tokenizer with a small fixed vocabulary."""

    def __init__(self):
        self.dictionary = {}          # character -> token id
        self.reverse_dictionary = {}  # token id -> character

        # Special tokens: <pad> gets id 0 and <unk> gets id 1
        self.__add_to_dict('<pad>')
        self.__add_to_dict('<unk>')

        # Digits 0-9
        for i in range(10):
            self.__add_to_dict(str(i))

        # Letters, interleaved as a, A, b, B, ..., z, Z
        for i in range(26):
            self.__add_to_dict(chr(ord('a') + i))
            self.__add_to_dict(chr(ord('A') + i))

        # Space, newline, and basic punctuation
        for char in ['.', ' ', ',', '!', '?', '\n']:
            self.__add_to_dict(char)

    def __add_to_dict(self, character):
        # Assign the next free id to a character not yet in the vocabulary
        if character not in self.dictionary:
            index = self.size()
            self.dictionary[character] = index
            self.reverse_dictionary[index] = character

    def tokenize(self, text):
        # Characters outside the vocabulary map to <unk>
        return [self.dictionary.get(c, self.dictionary['<unk>']) for c in text]

    def character_to_token(self, character):
        return self.dictionary[character]

    def token_to_character(self, token):
        return self.reverse_dictionary[token]

    def size(self):
        return len(self.dictionary)
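
To make the mapping concrete, here is a minimal standalone sketch that rebuilds the same vocabulary order as the class above and round-trips a short string. The names `build_vocab`, `encode`, and `decode` are hypothetical helpers for illustration and are not part of the notebook's `Tokenizer`.

```python
import string

# Hypothetical helper: rebuild the vocabulary in the same insertion order
# as the Tokenizer class (special tokens, digits, a/A ... z/Z, punctuation).
def build_vocab():
    chars = ['<pad>', '<unk>']
    chars += [str(d) for d in range(10)]
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
        chars += [lower, upper]
    chars += ['.', ' ', ',', '!', '?', '\n']
    return {ch: i for i, ch in enumerate(chars)}

vocab = build_vocab()
reverse_vocab = {i: ch for ch, i in vocab.items()}

def encode(text):
    # Unknown characters fall back to the <unk> id
    return [vocab.get(c, vocab['<unk>']) for c in text]

def decode(tokens):
    return ''.join(reverse_vocab[t] for t in tokens)

print(len(vocab))                       # 70 entries in total
print(encode("Photo"))                  # [43, 26, 40, 50, 40]
print(decode(encode("Photo by Josh")))  # Photo by Josh
print(encode("’"))                      # [1], the curly apostrophe is not in the vocabulary
```

Because every unknown character collapses to the single `<unk>` id, decoding is lossy for text containing characters such as curly quotes or apostrophes.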

In [15]:
training_data = pd.read_csv('training_data.csv')
training_data = training_data['text']

In [16]:
training_data.head()

0    Photo by Josh Riemer on Unsplash\n\nMerry Chri...
1    Your Brain On Coronavirus\n\nA guide to the cu...
2    Mind Your Nose\n\nHow smell training can chang...
3    Passionate about the synergy between science a...
4    You’ve heard of him, haven’t you? Phineas Gage...
Name: text, dtype: object

In [26]:
# instantiating tokenizer
tokenizer = Tokenizer()
tokenized_data = training_data.apply(tokenizer.tokenize)
tensor_data = [torch.tensor(token) for token in tokenized_data]

In [31]:

# Print only the first article to verify the tokenization
print(f"Original: {training_data[0]}")
print(f"Tokenized: {tokenized_data[0]}")
print(f"Tensor: {tensor_data[0]}")
print()

Original: Photo by Josh Riemer on Unsplash

Merry Christmas and Happy Holidays, everyone!

We just wanted everyone to know how much we appreciate everyone and how thankful we are for all our readers and writers here. We wouldn’t be anywhere without you, so thank you all for bringing informative, vulnerable, and important pieces that destigmatize mental illness and mental health.

Without further ado, here are ten of our top stories from last week, all of which were curated:

“Just as the capacity to love and inspire is universal so is the capacity to hate and discourage. Irrespective of gender, race, age or religion none of us are exempt from aggressive proclivities. Those who are narcissistically disordered, and accordingly repress deep seated feelings of inferiority with inflated delusions of grandeur and superiority, are more prone to aggression and violence. They infiltrate our interactions in myriad environments from home, work, school and the cyber world. Hence, bullying does not

In [32]:
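max_sequence_length = 20
# tensor_data is a list of tensors, so each insert prepends one bare integer
# <pad> token (id 0) to the list itself; the per-article tensors are unchanged.
for _ in range(max_sequence_length):
    tensor_data.insert(0, tokenizer.character_to_token('<pad>'))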
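
The cell above prepends the pad tokens to the outer list rather than to each article. If the goal is instead to left-pad every tokenized article, a minimal sketch could look like the following. `left_pad` and the sample token lists are hypothetical, and plain Python lists are used so the example runs without PyTorch; wrapping each padded list in `torch.tensor` afterwards would mirror the notebook.

```python
# Hypothetical helper: left-pad one token sequence with `count` pad tokens.
def left_pad(tokens, pad_token, count):
    return [pad_token] * count + list(tokens)

max_sequence_length = 20
pad_token = 0  # assumption: '<pad>' has id 0, as in the Tokenizer above

# Made-up token lists standing in for two tokenized articles
tokenized_articles = [[43, 26, 40], [61, 40, 52, 46]]

padded_articles = [
    left_pad(tokens, pad_token, max_sequence_length)
    for tokens in tokenized_articles
]

for padded in padded_articles:
    # Each article now starts with 20 pad tokens followed by its own tokens
    print(len(padded), padded[max_sequence_length:])
```

Padding each sequence separately keeps every list element the same type, which avoids the mix of integers and tensors produced by inserting into the outer list.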

In [37]:
print(tensor_data)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, tensor([43, 26, 40,  ..., 28, 50, 26]), tensor([61, 40, 52,  ..., 20, 48,  1]), tensor([37, 28, 38,  ..., 28, 38, 64]), tensor([43, 12, 48, 48, 28, 40, 38, 12, 50, 20, 65, 12, 14, 40, 52, 50, 65, 50,
        26, 20, 65, 48, 60, 38, 20, 46, 24, 60, 65, 14, 20, 50, 56, 20, 20, 38,
        65, 48, 16, 28, 20, 38, 16, 20, 65, 12, 38, 18, 65, 50, 20, 16, 26, 38,
        40, 34, 40, 24, 60, 65, 50, 40, 65, 42, 46, 40, 54, 28, 18, 20, 65, 14,
        20, 50, 50, 20, 46, 65, 16, 12, 46, 20, 64, 65, 17, 26, 20, 16, 32, 65,
        40, 52, 50, 65, 36, 60, 65, 38, 20, 56, 48, 34, 20, 50, 50, 20, 46,  1,
        65, 48, 16, 28, 20, 38, 16, 20, 22, 40, 46, 46, 20, 12, 34, 64, 48, 52,
        14, 48, 50, 12, 16, 32, 64, 16, 40, 36, 65,  1, 69, 69, 23, 40, 34, 34,
        40, 56]), tensor([61, 40, 52,  ..., 40, 46, 64]), tensor([37, 20, 38,  ..., 52, 46,  1]), tensor([27, 40, 56,  ..., 48, 50, 67]), tensor([19, 46, 65,  ..., 36, 48, 64]), t