<a href="https://colab.research.google.com/github/Beitner/Text_Generation/blob/main/Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN for text generation


In this notebook, we'll unleash the hidden creativity of a computer by letting it generate song lyrics. We'll train a character-level RNN-based language model and use it to generate new songs.

## RNN for Text Generation
In this section, we'll use an LSTM to generate new songs. 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [None]:
df = pd.read_parquet(r'/content/drive/MyDrive/YANDEX/DeepLearning/NLP/HW1_NLPIntro/metrolyrics.parquet')
# selected_text = df[df.genre == 'Country'].lyrics.values

In [None]:
df.genre.unique()

array(['Pop', 'Hip-Hop', 'Rock', 'Country', 'Metal'], dtype=object)

## Text and Reference Preparation

In [None]:
df['lyrics'] = df['lyrics'].str.lower()

# Trim brackets and parentheses at the ends, then drop all remaining punctuation.
df['lyrics'] = df['lyrics'].str.strip('[]')
df['lyrics'] = df['lyrics'].str.strip('()')
# regex=True is required: recent pandas versions treat the pattern as a literal string by default.
df['lyrics'] = df['lyrics'].str.replace(r'[^\w\s]', '', regex=True)
# Remove section markers ("chorus", "verse") and repeat counts such as "x2".
df['lyrics'] = df['lyrics'].str.replace(r'chorus|verse|x[1-9]', '', regex=True)
# Drop any non-ASCII characters.
df['lyrics'] = df['lyrics'].str.encode('ascii', 'ignore').str.decode('ascii')



In [None]:
df

Unnamed: 0,song,year,artist,genre,lyrics,num_chars,sent,num_words
204182,fully-dressed,2008,annie,Pop,healy\nspoken this is bert healy saying \nsing...,1041,healy spoken this bert healy saying singing he...,826
6116,surrounded-by-hoes,2006,50-cent,Hip-Hop,repeat 2x even when im tryin to be on the low...,1392,chorus repeat x even i tryin low i recognized ...,884
166369,taste-the-tears-thunderpuss-remix,2006,amber,Pop,how could you cause me so much pain\nand leave...,1113,how could cause much pain and leave heart rain...,756
198416,the-truth-will-set-me-free,2006,glenn-hughes,Rock,in a scarlet vision\nin a velvet room\ni come ...,779,in scarlet vision in velvet room i come decisi...,583
127800,the-last-goodbye,2008,aaron-pritchett,Country,sprintime in savannah\nit dont get much pretti...,881,sprintime savannah it dont get much prettier b...,639
...,...,...,...,...,...,...,...,...
33205,give-it-all-up-for-love,2007,bananarama,Pop,to all the men i knew before\nold love letters...,1159,to men i knew old love letters drawer mean not...,712
194149,all-i-m-thinking-about-is-you,2000,billy-ray-cyrus,Rock,well its a twentyfive mile drive from here to ...,1094,well twenty five mile drive town ther gray ski...,676
11649,bonsoir-mon-amour,2015,dalida,Pop,tu viens de partir pour de longs mois cest lon...,455,tu viens de partir pour de longs mois c est lo...,426
252283,i-m-not-gonna-miss-you,2014,glen-campbell,Pop,im still here but yet im gone\ni dont play gui...,527,i still yet i gone i play guitar sing songs th...,344


Use the following cells to select a single genre and, if the GPU can't handle the training, to reduce the size of the text.

In [None]:
selected_text = df[df['genre'] == 'Rock']
selected_text = selected_text.lyrics.values

In [None]:
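# test_size is the fraction of songs kept in small_text_data (0.99 keeps 99%); lower it to shrink the text.
_, small_text_data = train_test_split(selected_text, random_state=1, test_size=0.99)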

In [None]:
lyrics_data = [y for x in small_text_data for y in x]

In [None]:
len(small_text_data)

11616
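
As a minimal sketch of the subsampling idea, here is a standard-library stand-in for `train_test_split`. The helper name `subsample`, its `keep_fraction` parameter, and the toy song list are assumptions made for illustration; they are not part of the notebook.

```python
import random

def subsample(items, keep_fraction, seed=1):
    """Return a random subset holding roughly keep_fraction of the items."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * keep_fraction))
    return rng.sample(list(items), k)

# Toy stand-in for the array of lyrics (hypothetical data).
songs = [f"song {i}" for i in range(100)]
small = subsample(songs, keep_fraction=0.1)
print(len(small), "of", len(songs), "songs kept")
```

In the notebook's own call, the second value returned by `train_test_split` is the "test" split, so `test_size` is the fraction that ends up in `small_text_data`.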

#### Split the dataset into characters and create mappings

In [None]:
# lyrics_data is already a flat list of single characters, so it can be used directly.
text = lyrics_data
char_set = set(text)

In [None]:
len(text)

10978321

Create Mappings

In [None]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted)

text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype=np.int32)

print('Text encoded shape: ', text_encoded.shape)
print(text[:15], '     == Encoding ==> ', text_encoded[:15])
print('\n')
print(text_encoded[15:21], ' == Reverse  ==> ', ''.join(char_array[text_encoded[15:21]]))

Text encoded shape:  (10978321,)
['i', ' ', 'k', 'n', 'o', 'w', ' ', 'a', ' ', 'c', 'a', 'r', 'p', 'e', 'n']      == Encoding ==>  [22  2 24 27 28 36  2 14  2 16 14 31 29 18 27]


[33 18 31  2 36 21]  == Reverse  ==>  ter wh
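
The mapping can be sanity-checked with a round trip on a short string. This is a small sketch that uses a toy vocabulary built from one phrase rather than the full lyrics; `int2char` and the `prompt` string are names introduced here for illustration.

```python
text = "i know a carpenter"

# Build the vocabulary and both lookup directions.
chars_sorted = sorted(set(text))
char2int = {ch: i for i, ch in enumerate(chars_sorted)}
int2char = dict(enumerate(chars_sorted))

# Encode to integers and decode back to the original string.
encoded = [char2int[ch] for ch in text]
decoded = "".join(int2char[i] for i in encoded)
print(encoded)
print(decoded)

# Characters outside the vocabulary cannot be encoded, so check prompts first.
prompt = "i know!"
unknown = [ch for ch in prompt if ch not in char2int]
print("unknown characters:", unknown)
```

The same check is useful before sampling, because a prompt containing a character that was removed during cleaning (punctuation, for example) would raise a `KeyError` in `char2int`.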


### Split the text into fixed-size chunks for training and prediction

In [None]:
seq_length = 50
# Each chunk holds seq_length + 1 characters. The Dataset uses the first seq_length
# characters as input and the same characters shifted by one position as targets.
chunk_size = seq_length + 1

# One chunk per starting position (sliding window with stride 1).
text_chunks = [text_encoded[i:i+chunk_size]
               for i in range(len(text_encoded)-chunk_size+1)]

## inspection:
for seq in text_chunks[:1]:
    input_seq = seq[:seq_length]
    target = seq[seq_length] 
    print(input_seq, ' -> ', target)
    print(repr(''.join(char_array[input_seq])), 
          ' -> ', repr(''.join(char_array[target])))

[22  2 24 27 28 36  2 14  2 16 14 31 29 18 27 33 18 31  2 36 21 28  2 21
 14 17  2 14  2 17 31 18 14 26  1 24 22 25 25 18 17  2 33 21 18  2 26 14
 27  2]  ->  15
'i know a carpenter who had a dream\nkilled the man '  ->  'b'
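
To make the windowing concrete, the following sketch applies the same slicing to a tiny integer sequence and splits the first chunk the way the Dataset defined later in the notebook does (`chunk[:-1]` as input, `chunk[1:]` as target). The helper name `make_chunks` and the toy sequence are assumptions for illustration.

```python
def make_chunks(encoded, seq_length):
    """Slide a window of seq_length + 1 items over the sequence, one step at a time."""
    chunk_size = seq_length + 1
    return [encoded[i:i + chunk_size]
            for i in range(len(encoded) - chunk_size + 1)]

encoded = list(range(8))          # toy stand-in for text_encoded
chunks = make_chunks(encoded, seq_length=4)

# Input is every item but the last; target is the same window shifted by one.
inputs, targets = chunks[0][:-1], chunks[0][1:]
print(len(chunks), "chunks")
print(inputs, "->", targets)
```

Because the stride is 1, consecutive chunks overlap in all but one character, which is why the number of chunks is almost equal to the number of characters.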


## Create Dataset and DataLoaders

In [None]:
del df, small_text_data
import gc
gc.collect()

397

In [None]:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
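        # Each chunk has seq_length + 1 characters: the input is every character but the last,
        # and the target is the same sequence shifted one position ahead.
        return text_chunk[:-1].long(), text_chunk[1:].long()
    
# Stack the chunks into one array first; building a tensor from a list of arrays is very slow.
seq_dataset = TextDataset(torch.from_numpy(np.array(text_chunks)))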

In [None]:
from torch.utils.data import DataLoader

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
batch_size = 8

torch.manual_seed(1)
# drop_last=True discards the final batch when it is smaller than batch_size.
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

## LSTM Model

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, num_layers, drop_prob):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.num_layers = num_layers
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size,
                           batch_first=True, num_layers=num_layers, dropout=drop_prob)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        # x: (batch,) character indices for a single time step.
        # Keep the embedding output attached to the graph so the embedding weights are trained.
        out = self.embedding(x).unsqueeze(1)  # (batch, 1, embed_dim)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)  # (batch, vocab_size)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        # Zero initial states, created on the same device as the model parameters.
        param_device = next(self.parameters()).device
        hidden = torch.zeros(self.num_layers, batch_size, self.rnn_hidden_size, device=param_device)
        cell = torch.zeros(self.num_layers, batch_size, self.rnn_hidden_size, device=param_device)
        return hidden, cell
        

### Parameters

In [None]:
vocab_size = len(char_array)
embed_dim = 512
rnn_hidden_size = 256
num_layers = 2
num_epochs = 3000 
loss_fn = nn.CrossEntropyLoss()
drop_prob = 0.2


## Training and Prediction

In [None]:
from torch.optim import lr_scheduler
torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, num_layers, drop_prob)
model = model.to(device)
# Alternative: Adam with a step schedule.
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# CyclicLR takes over the learning rate and cycles it between base_lr and max_lr.
scheduler = lr_scheduler.CyclicLR(optimizer, base_lr=0.01, max_lr=0.1)

torch.manual_seed(1)
model.train()

# Each "epoch" here is a single optimization step on one randomly drawn batch.
for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch = seq_batch.to(device)
    target_batch = target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    # Feed one character at a time and accumulate the loss over the sequence.
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    scheduler.step()
    loss = loss.item() / seq_length  # mean loss per character
    if epoch % 10 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')



Epoch 0 loss: 3.6973
Epoch 10 loss: 3.0600
Epoch 20 loss: 2.7697
Epoch 30 loss: 2.5731
Epoch 40 loss: 2.3871
Epoch 50 loss: 2.3802
Epoch 60 loss: 2.4137
Epoch 70 loss: 2.0850
Epoch 80 loss: 2.2583
Epoch 90 loss: 2.0907
Epoch 100 loss: 2.0446
Epoch 110 loss: 2.1849
Epoch 120 loss: 2.2020
Epoch 130 loss: 2.0144
Epoch 140 loss: 2.1534
Epoch 150 loss: 1.9271
Epoch 160 loss: 2.0273
Epoch 170 loss: 2.1423
Epoch 180 loss: 2.1431
Epoch 190 loss: 2.3875
Epoch 200 loss: 1.9641
Epoch 210 loss: 2.0063
Epoch 220 loss: 2.0386
Epoch 230 loss: 2.0288
Epoch 240 loss: 1.8256
Epoch 250 loss: 2.1129
Epoch 260 loss: 1.8152
Epoch 270 loss: 1.7784
Epoch 280 loss: 1.9005
Epoch 290 loss: 1.8259
Epoch 300 loss: 1.9332
Epoch 310 loss: 1.8370
Epoch 320 loss: 1.9518
Epoch 330 loss: 1.9239
Epoch 340 loss: 1.9729
Epoch 350 loss: 1.9346
Epoch 360 loss: 1.8457
Epoch 370 loss: 1.9233
Epoch 380 loss: 1.8566
Epoch 390 loss: 1.8122
Epoch 400 loss: 1.6432
Epoch 410 loss: 1.8513
Epoch 420 loss: 1.7429
Epoch 430 loss: 1.9034
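
The printed value is the cross-entropy summed over the sequence and divided by `seq_length`, so it is a mean loss per character in nats. A rough way to read such a number is sketched below: exponentiating gives the perplexity, and `log(vocab_size)` is the loss of a model that guesses uniformly. The helper names `perplexity` and `uniform_baseline_loss` and the vocabulary size of 40 are assumptions for illustration; the notebook does not print its actual `vocab_size`.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean per-character cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

def uniform_baseline_loss(vocab_size):
    """Cross-entropy of a model that gives every character equal probability."""
    return math.log(vocab_size)

# Losses copied from the training log above.
for step, loss in [(0, 3.6973), (430, 1.9034)]:
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")

# Hypothetical vocabulary size, for illustration only.
print(f"uniform baseline for 40 characters: {uniform_baseline_loss(40):.4f}")
```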

In [None]:
from torch.distributions.categorical import Categorical

def sample(model, text,
           len_generated_text=500,
           scale_factor=1):
    # Every character of `text` must appear in char2int (lowercase, no punctuation).
    encoded_input = torch.tensor([char2int[s] for s in text]).to(device)
    encoded_input = torch.reshape(encoded_input, (1, -1))

    generated_str = text

    model.eval()
    hidden, cell = model.init_hidden(1)
    with torch.no_grad():
        # Warm up the hidden state on all but the last character of the prompt.
        for c in range(len(text)-1):
            _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)

        last_char = encoded_input[:, -1]
        for i in range(len_generated_text):
            logits, hidden, cell = model(last_char.view(1), hidden, cell)
            logits = torch.squeeze(logits, 0)
            # scale_factor > 1 sharpens the distribution; < 1 makes sampling more random.
            scaled_logits = logits * scale_factor
            m = Categorical(logits=scaled_logits)
            last_char = m.sample()
            generated_str += str(char_array[last_char.item()])

    return generated_str

torch.manual_seed(1)
model.to(device)
# print(sample(model, 'have a little love on a little honeymoon, you got a little dish and you got a little spoon. a little bitty house and a little bitty yard'))
print(sample(model, 'i love you'))



i love youre she look in my heart ad day
to less just wanna creep pieting of the light you
mets gonna be for lets now just and a love to mind
to fever go
you live collid lonely to be unschoken one go
always do out that a youve just till be bited
so ivr heek the destill sg for a just we have to a gleme to te desion
its please my body get the but i waiting it for youh
its waiting we wont hard to tlose to be come you looker womblie beautiful here
you more as well it squence away
law
was a shialonel intellen 
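
The `scale_factor` argument of `sample` multiplies the logits before sampling, which acts like an inverse temperature. The sketch below shows the effect on a toy distribution in plain Python; the `softmax` helper and the example logits are assumptions for illustration and do not come from the trained model.

```python
import math

def softmax(logits, scale_factor=1.0):
    """Softmax over logits multiplied by scale_factor (numerically stabilised)."""
    scaled = [x * scale_factor for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a three-character vocabulary (hypothetical values).
logits = [2.0, 1.0, 0.1]
for scale in (0.5, 1.0, 2.0):
    probs = softmax(logits, scale)
    print(scale, [round(p, 3) for p in probs])
```

Larger values concentrate probability on the most likely character, which tends to give more repetitive but more word-like text; smaller values spread probability out and give more varied, noisier text.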


In [None]:
print(sample(model,'we are the champions'))



we are the champions
and a retude
needlity mey goodbye is weve got to know where a plues mind
as the heaving meners
du im joken
all the body baby know
where the matter meetybody sky
firso
got see home one kir
hurda fact to osterd the were to know
hide we nothing i wish how i cant be we go for the long
somelone everybody to dead long
i would preast red twont truth was becore call my cestle one just everywhere some hard
when the flew yi swear case lose our breaks at the pety guess something now
go do out to were new 
