# Recurrent Neural Networks and Language Models

You are probably very excited about ChatGPT.  In today's class, we will implement a very simple language model, which is the same kind of model that underlies ChatGPT, but built with a simple LSTM instead of a Transformer.  You may be surprised that it is not difficult at all.

The paper we base this on is *Regularizing and Optimizing LSTM Language Models*: https://arxiv.org/abs/1708.02182
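
As a reminder of what "language model" means here, the usual formulation factors the probability of a token sequence into next-token predictions. The notation below ($w_t$ for the $t$-th token, $T$ for the sequence length) is introduced only for illustration and is not taken from the notebook:

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
$$

An LSTM language model estimates each factor from a hidden state that summarizes the preceding tokens.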

In [16]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext, datasets, math

In [17]:
from datasets import load_dataset

# Download the dataset. load_dataset comes from the Hugging Face datasets library;
# the dataset itself is published by codeparrot.
train = load_dataset("codeparrot/github-jupyter-code-to-text", split="train")
test = load_dataset("codeparrot/github-jupyter-code-to-text", split="test")

Using custom data configuration codeparrot--github-jupyter-code-to-text-cf9b56d996fd17e1
Found cached dataset parquet (/root/.cache/huggingface/datasets/codeparrot___parquet/codeparrot--github-jupyter-code-to-text-cf9b56d996fd17e1/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Using custom data configuration codeparrot--github-jupyter-code-to-text-cf9b56d996fd17e1
Found cached dataset parquet (/root/.cache/huggingface/datasets/codeparrot___parquet/codeparrot--github-jupyter-code-to-text-cf9b56d996fd17e1/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [18]:
# Check the features.
train

Dataset({
    features: ['repo_name', 'path', 'license', 'content'],
    num_rows: 47452
})

In [19]:
# Peek at what a sample looks like.
# Looks good.

datasets_sample = train["content"][0]
print(datasets_sample)

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

"""
Explanation: Simple MNIST convnet
Author: fchollet<br>
Date created: 2015/06/19<br>
Last modified: 2020/04/21<br>
Description: A simple convnet that achieves ~99% test accuracy on MNIST.
Setup
End of explanation
"""


# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test

In [28]:
# Each sample contains many lines. If we pass whole samples to the tokenizer,
# the token sequences can exceed the model's maximum length, so we split every
# sample into its non-empty lines first.

train_transform = list()
test_transform = list()

# Transform the train data: keep every non-empty line
for text in train["content"]:
    for sent in text.split("\n"):
        if sent != "":
            train_transform.append(sent)

# Transform the test data: keep every non-empty line
for text in test["content"]:
    for sent in text.split("\n"):
        if sent != "":
            test_transform.append(sent)
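
The train and test loops do the same work, so they could share one helper. The sketch below is one way to write it; the function name `split_nonempty_lines` and the toy `sample_docs` list are hypothetical and not part of the notebook.

```python
from typing import Iterable, List


def split_nonempty_lines(documents: Iterable[str]) -> List[str]:
    """Flatten documents into one list containing their non-empty lines.

    Hypothetical helper: it mirrors the loops in the cell above, where a
    line is dropped only if it is exactly the empty string.
    """
    lines = []
    for text in documents:
        for sent in text.split("\n"):
            if sent != "":
                lines.append(sent)
    return lines


# Toy input standing in for train["content"] (assumed data, for illustration).
sample_docs = ["import numpy as np\n\nx = 1", "print(x)\n"]
print(split_nonempty_lines(sample_docs))
```

With this helper, the cell above would reduce to `train_transform = split_nonempty_lines(train["content"])` and the same call for `test`. Note that lines containing only whitespace are still kept, exactly as in the original loops.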


In [29]:
# Check the transformed data
print(train_transform[0])
print(len(train_transform))


import numpy as np
11367363


In [32]:
from transformers import AutoTokenizer

# Load the pretrained GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenized_dataset = tokenizer(train_transform)
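
Tokenizing more than eleven million lines in a single call builds one very large result in memory. A possible alternative is to feed the tokenizer in chunks and keep only the token ids. The sketch below assumes `tokenize_fn` is any callable that takes a list of strings and returns a mapping with an `"input_ids"` entry (the Hugging Face tokenizer above behaves this way); the names `iter_batches`, `tokenize_in_batches`, and the default `batch_size` are hypothetical choices.

```python
from typing import Callable, Iterator, List, Mapping


def iter_batches(lines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `lines` with at most `batch_size` items."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def tokenize_in_batches(
    lines: List[str],
    tokenize_fn: Callable[[List[str]], Mapping[str, list]],
    batch_size: int = 10_000,
) -> List[list]:
    """Tokenize `lines` chunk by chunk and collect only the input ids.

    Assumption: tokenize_fn(batch)["input_ids"] holds one list of token ids
    per input string, in the same order as the batch.
    """
    input_ids = []
    for batch in iter_batches(lines, batch_size):
        input_ids.extend(tokenize_fn(batch)["input_ids"])
    return input_ids
```

In the notebook this would be called as `tokenize_in_batches(train_transform, tokenizer)`. Whether it actually lowers peak memory depends on the tokenizer implementation, so treat it as something to measure rather than a guaranteed improvement.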