# <center>Style Specific Hinglish Text Generator</center>

The datasets used to pretrain the model are available at https://github.com/google-research-datasets/Hinglish-TOP-Dataset

These datasets were combined into a single file, PreTrain.csv, which keeps only the Hinglish sentences. It contains 136459 sentences.

The dataset used to fine-tune the model was collected from the Discord messages of one specific user and contains 6289 sentences.

Our goal is to build a word-level text generator whose sentences resemble the style of the user from whom the fine-tuning dataset was collected.

## Importing Libraries

In [1]:
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
import string
from model import new_model
import time


## Loading Datasets
There are two datasets: one for pretraining and one for fine-tuning.

In [2]:
pretrain_df = pd.read_csv('PreTrain.csv')
finetune_df = pd.read_csv('FineTune.csv')
pretrain_df = pretrain_df['content'].squeeze()
finetune_df = finetune_df['content'].squeeze()

In [3]:
print(pretrain_df.head(), '\nshape = ' + str(pretrain_df.shape))
print(finetune_df.head(), '\nshape = ' + str(finetune_df.shape))

0    is mahine buffalo ny me konse concerts aarahe hai
1           is weekend naperville mei free ice skating
2            kya el paso me kabhi tornadoes aa rahe he
3    holidays ke liye wilkes barre me karne ke liye...
4                        latest coldplay song ko bajao
Name: content, dtype: object 
shape = (136454,)
0    tumlog karlena baad mein
1                rehne do fir
2       mai kal raat ko jaara
3             mai kal jaaraha
4          waha kya karra hai
Name: content, dtype: object 
shape = (6165,)


Creating a combined dataset, full_df, that contains every sentence from the pretraining and fine-tuning datasets.
It is used to build the model's vocabulary.

In [4]:
full_df = pd.concat([pretrain_df, finetune_df])
print(full_df.head(), '\nshape = ' + str(full_df.shape))

0    is mahine buffalo ny me konse concerts aarahe hai
1           is weekend naperville mei free ice skating
2            kya el paso me kabhi tornadoes aa rahe he
3    holidays ke liye wilkes barre me karne ke liye...
4                        latest coldplay song ko bajao
Name: content, dtype: object 
shape = (142619,)


In [5]:
tokenizer = Tokenizer(lower = False)
tokenizer.fit_on_texts(full_df)
vocab = tokenizer.word_index
idx2word = {u: v for v, u in vocab.items()}
vocab_size = len(vocab) + 1
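
To make the indexing concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index similar to `tokenizer.word_index` can be built. It is an approximation for illustration, not the Keras implementation: `build_word_index` is a hypothetical helper, it splits on whitespace only, and the toy sentences are just a few lines in the style of the fine-tuning data.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Index 0 is left unused so that it can serve as the padding value later.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_sentences = ["mai kal jaaraha", "mai kal raat ko jaara", "rehne do fir"]
toy_vocab = build_word_index(toy_sentences)
toy_idx2word = {index: word for word, index in toy_vocab.items()}
toy_vocab_size = len(toy_vocab) + 1  # +1 accounts for the reserved index 0
```

Because index 0 is never assigned to a word, the vocabulary size is one larger than the number of distinct words, which is why the cell above adds 1 to `len(vocab)`.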

## Pretraining the Model
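Tokenizing the sentences of the pretraining dataset and expanding each one into its n-gram prefix sequences: a sentence of n tokens yields n - 1 sequences, each ending in the word to be predicted.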

In [6]:
def get_sequences(tokenizer, df):
    sequences = []
    for sentence in df:
        token_list = tokenizer.texts_to_sequences([sentence])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            sequences.append(n_gram_sequence)
    return sequences
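
As a quick illustration of what `get_sequences` produces, the sketch below applies the same prefix expansion to a single hand-written token list. The helper name `prefix_sequences` and the token ids are assumptions made for the example.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the inner loop of get_sequences."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

toy_tokens = [12, 113, 1073, 570]  # hypothetical token ids for a 4-word sentence
toy_prefixes = prefix_sequences(toy_tokens)
for prefix in toy_prefixes:
    # Everything but the last token is the context; the last token is the target.
    print(prefix[:-1], '->', prefix[-1])
```

A one-word sentence contributes no sequences at all, since there is no context to predict from.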

In [77]:
sequences = get_sequences(tokenizer, pretrain_df)
print(sequences[:10], '\nlength = ' + str(len(sequences)))

[[12, 113], [12, 113, 1073], [12, 113, 1073, 570], [12, 113, 1073, 570, 6], [12, 113, 1073, 570, 6, 187], [12, 113, 1073, 570, 6, 187, 123], [12, 113, 1073, 570, 6, 187, 123, 4143], [12, 113, 1073, 570, 6, 187, 123, 4143, 2], [12, 37], [12, 37, 2759]] 
length = 950273


Padding the pretraining token sequences on the left with zeros so that they all have the length of the longest sequence.

In [78]:
sequences = tf.keras.utils.pad_sequences(sequences)
print(sequences[:10], '\nlength = ' + str(len(sequences)))

[[   0    0    0 ...    0   12  113]
 [   0    0    0 ...   12  113 1073]
 [   0    0    0 ...  113 1073  570]
 ...
 [   0    0    0 ...  123 4143    2]
 [   0    0    0 ...    0   12   37]
 [   0    0    0 ...   12   37 2759]] 
length = 950273
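
The effect of left padding can be reproduced on a toy input with plain NumPy. This is a sketch of the behaviour seen in the output above, not the Keras code; `pad_left` is a hypothetical helper and the token ids are taken from the first printed sequences.

```python
import numpy as np

def pad_left(sequences, maxlen=None):
    """Left-pad integer sequences with zeros to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep the most recent tokens if a sequence is too long
        padded[row, maxlen - len(trimmed):] = trimmed
    return padded

toy_padded = pad_left([[12, 113], [12, 113, 1073], [12, 37]])
print(toy_padded)
```

Padding on the left keeps the word to be predicted in the last column of every row, which the next step relies on.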


Splitting each padded sequence into the input (X), which holds every token except the last, and the word to predict (y), which is the last token.

In [79]:
X, y = sequences[:, :-1], sequences[:, -1]
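
On a small hand-made array, the same slicing looks like this (the values are illustrative only):

```python
import numpy as np

toy_sequences = np.array([[0, 12, 113],
                          [12, 113, 1073],
                          [0, 12, 37]])
# Every column except the last is the context; the last column is the word to predict.
toy_X, toy_y = toy_sequences[:, :-1], toy_sequences[:, -1]
print(toy_X.shape, toy_y)
```

The input therefore has one column fewer than the padded matrix, and that width is what gets passed to the model as `seq_length`.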

Converting the array of predicted words into one-hot vectors.

In [10]:
y = tf.keras.utils.to_categorical(y, vocab_size)
print(y.shape)

(950273, 19207)
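
One-hot encoding replaces each target id with a vector as long as the vocabulary, which is what makes the target matrix so large. The sketch below shows the encoding on three toy targets and then estimates the size of the full matrix at the shape printed above, assuming 4-byte floats. `one_hot` is a hypothetical NumPy helper, not the Keras function.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Turn integer class ids into float32 one-hot rows."""
    encoded = np.zeros((len(indices), num_classes), dtype=np.float32)
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded

toy_targets = np.array([3, 0, 2])
toy_one_hot = one_hot(toy_targets, num_classes=4)

# Size of the full one-hot target matrix, assuming 4 bytes per entry.
dense_bytes = 950273 * 19207 * 4
print(f"{dense_bytes / 1e9:.1f} GB")
```

If the loss used in `model.py` (not shown here) were a sparse categorical cross-entropy, the integer targets could probably be used directly and this matrix avoided; whether that suits the model is an assumption.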


Because the dataset is very large, it cannot all be passed to the model at once. We therefore create a data generator that feeds the data to the model in batches.

In [11]:
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    """Serves (X, y) to Keras one batch at a time."""
    def __init__(self, X, y, batch_size):
        self.x, self.y = X, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller than batch_size.
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Slice out the idx-th batch of inputs and targets.
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

train_gen = DataGenerator(X, y, 1024)
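
The batch arithmetic in `__len__` and `__getitem__` can be checked without TensorFlow. The sketch below lists the slice bounds each batch would use; `batch_bounds` is a hypothetical helper, and the sample count is the sequence length printed earlier.

```python
import math

def batch_bounds(num_samples, batch_size):
    """Yield the (start, stop) slice bounds used for each batch index."""
    num_batches = math.ceil(num_samples / batch_size)
    for idx in range(num_batches):
        # Slicing past the end of an array is clipped, which min() mimics here.
        yield idx * batch_size, min((idx + 1) * batch_size, num_samples)

bounds = list(batch_bounds(num_samples=950273, batch_size=1024))
print(len(bounds), bounds[0], bounds[-1])
```

With 950273 samples and a batch size of 1024 there are 929 batches per epoch, and the final batch holds a single sample.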

In [12]:
model = new_model(vocab_size = vocab_size, seq_length = X.shape[1])
print(model.summary())

2023-01-15 11:58:01.941621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7425 MB memory:  -> device: 0, name: Tesla M60, pci bus id: 0001:00:00.0, compute capability: 5.2


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 213, 64)           1229312   
                                                                 
 spatial_dropout1d (SpatialD  (None, 213, 64)          0         
 ropout1D)                                                       
                                                                 
 bidirectional (Bidirectiona  (None, 213, 256)         197632    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 256)              394240    
 nal)                                                            
                                                                 
 dense (Dense)               (None, 19207)             4936199   
                                                        

In [13]:
model.fit(train_gen, epochs = 20)

Epoch 1/20
...
Epoch 20/20


2023-01-15 11:58:09.649007: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8201


<keras.callbacks.History at 0x7ff3497eaca0>

In [15]:
model.save_weights('weights_pretrain.h5')

## Fine-tuning the Model
Tokenizing the sentences of the fine-tuning dataset into n-gram prefix sequences, in the same way as for pretraining.

In [80]:
sequences = get_sequences(tokenizer, finetune_df)
print(sequences[:10], '\nlength = ' + str(len(sequences)))

[[8423, 17628], [8423, 17628, 212], [8423, 17628, 212, 151], [620, 73], [620, 73, 316], [32, 25], [32, 25, 42], [32, 25, 42, 3], [32, 25, 42, 3, 11121], [32, 25]] 
length = 27383


Padding the fine-tuning token sequences on the left with zeros to the length of the longest fine-tuning sequence.

In [81]:
sequences = tf.keras.utils.pad_sequences(sequences)
print(sequences[:10], '\nlength = ' + str(len(sequences)))

[[    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0  8423 17628]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
   8423 17628   212]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0  8423
  17628   212   151]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0   620    73]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
    620    73   316]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0
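
The rows printed above appear to be 27 tokens wide, whereas the model summary reports an input length of 213. If the network needs fixed-length input (an assumption, since `model.py` is not shown), the fine-tuning matrix would have to be brought to the pretraining width before X and y are split. A minimal NumPy sketch of that widening, with `widen_left` as a hypothetical helper and toy values:

```python
import numpy as np

def widen_left(padded, target_len):
    """Add zero columns on the left so an already padded matrix reaches target_len columns."""
    extra = target_len - padded.shape[1]
    if extra < 0:
        raise ValueError("matrix is already wider than target_len")
    return np.pad(padded, ((0, 0), (extra, 0)), mode="constant", constant_values=0)

toy_finetune = np.array([[0, 8423, 17628],
                         [8423, 17628, 212]])
toy_widened = widen_left(toy_finetune, target_len=6)
print(toy_widened)
```

In the notebook, `target_len` would be the width of the padded pretraining matrix. Passing that width as `maxlen` to `pad_sequences` is another way to get the same shape.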

Splitting each padded sequence into the input (X) and the word to predict (y), as before.

In [88]:
X, y = sequences[:, :-1], sequences[:, -1]

Converting the array of predicted words into one-hot vectors.

In [89]:
y = tf.keras.utils.to_categorical(y, vocab_size)
print(y.shape)

(27382, 19207)


In [90]:
finetune_gen = DataGenerator(X, y, 1024)

In [91]:
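# This call trains on the full in-memory arrays; finetune_gen from the previous cell is not used here.
model.fit(X, y, epochs = 100)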

2023-01-15 16:17:38.700055: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 2103704296 exceeds 10% of free system memory.
2023-01-15 16:17:41.346892: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 2103704296 exceeds 10% of free system memory.


Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x7fe164a240d0>

In [92]:
model.save_weights('weights_finetuned.h5')

Creating a function that generates sentences.

In [93]:
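def generate_sentence(model, length, seq_length = 213):
    # Next-word probabilities with the padding index (0) masked out.
    def next_word_probs(context, exclude = None):
        probs = model.predict(context, verbose = 0).flatten().astype('float64')
        probs[0] = 0.0
        if exclude is not None:
            probs[exclude] = 0.0  # never repeat the previous word
        return probs / probs.sum()

    # Seed word: the model's most likely continuation of a random context.
    start = np.random.randint(1, vocab_size, size = (1, seq_length))
    seed_idx = int(np.argmax(next_word_probs(start)))
    start = np.zeros((1, seq_length))
    start[0][-1] = seed_idx
    word = idx2word[seed_idx]
    for i in range(length):
        # Sample the next word over all output indices, excluding the word just generated.
        probs = next_word_probs(start, exclude = int(start[0][-1]))
        new_idx = int(np.random.choice(len(probs), p = probs))
        word += ' ' + idx2word[new_idx]
        # Shift the context window left and append the new word.
        start[0][:-1] = start[0][1:]
        start[0][-1] = new_idx
    print(word)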

In [101]:
generate_sentence(model, 10)

quiz night dijiye bhi toofan ice hoga iska traffic chota yahi
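
The two steps at the heart of `generate_sentence`, drawing a next word that differs from the previous one and shifting the context window, can be tried in isolation. The sketch below is NumPy-only and uses a made-up four-word distribution instead of a model; `sample_excluding` and `slide_window` are hypothetical helper names.

```python
import numpy as np

def sample_excluding(probs, excluded, rng):
    """Sample an index from probs after zeroing the excluded indices and renormalising."""
    masked = np.asarray(probs, dtype=np.float64).copy()
    masked[list(excluded)] = 0.0
    masked /= masked.sum()
    return int(rng.choice(len(masked), p=masked))

def slide_window(context, new_idx):
    """Drop the oldest token and append new_idx at the end of a 1-D context array."""
    shifted = np.roll(context, -1)
    shifted[-1] = new_idx
    return shifted

rng = np.random.default_rng(0)
toy_probs = [0.1, 0.4, 0.3, 0.2]   # made-up distribution over 4 token ids
toy_context = np.array([0, 0, 1])  # the last generated token is 1
# Exclude the padding id (0) and the previous token (1).
next_idx = sample_excluding(toy_probs, excluded={0, 1}, rng=rng)
toy_context = slide_window(toy_context, next_idx)
```

Zeroing an entry and renormalising gives the same distribution as redrawing until a different index comes up, but needs only one draw per word.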


## Possible Future Improvements
- Using a char-level RNN for the model
- Tuning the network architecture and hyperparameters
- Using word embeddings