# Starter code for the skeleton notebook

If you are running this notebook **locally**, make sure you use the Python interpreter from the `conda` environment you created for this project. \
If you are running this notebook on **Google Colab**, make sure you are using a GPU runtime.


Check out [data/StartingDataset.py](data/StartingDataset.py) for the dataset processing code. \
Check out [networks/StartingNetwork.py](networks/StartingNetwork.py) for the neural network code. \
Check out [train_functions/starting_train.py](train_functions/starting_train.py) for the training code.

### Mount Drive (Google Colab)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Grab scripts from GitHub Repo

In [2]:
!git clone https://github.com/LLeon360/aiprojects-nlp-quora-questions scripts
!mv  -v scripts/* .

Cloning into 'scripts'...
remote: Enumerating objects: 208, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 208 (delta 1), reused 3 (delta 0), pack-reused 203
Receiving objects: 100% (208/208), 94.71 KiB | 1.21 MiB/s, done.
Resolving deltas: 100% (95/95), done.
renamed 'scripts/acmprojects.yml' -> './acmprojects.yml'
renamed 'scripts/constants.py' -> './constants.py'
renamed 'scripts/data' -> './data'
renamed 'scripts/kaggle.json' -> './kaggle.json'
renamed 'scripts/main.ipynb' -> './main.ipynb'
renamed 'scripts/networks' -> './networks'
renamed 'scripts/README.md' -> './README.md'
renamed 'scripts/train_functions' -> './train_functions'


### Imports

In [3]:
import os

import constants

from data.StartingDataset import StartingDataset
from networks.StartingNetwork import StartingNetwork
from train_functions.lstm_train import lstm_train

from data.EmbeddingDataset import EmbeddingDataset
from networks.LSTMEncoder import LSTMEncoder

import torch
from torch.utils.data import random_split

import pandas as pd

import csv
import numpy as np

from sklearn.model_selection import train_test_split

### Constants

In [4]:
# EPOCHS = 100
# BATCH_SIZE = 32
# N_EVAL = 100
# VAL_SPLIT = 0.1

from constants import EPOCHS, BATCH_SIZE, N_EVAL, VAL_SPLIT
VAL_SPLIT = 0.05  # override the imported value for this run


### GPU Support


In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Define hyperparameters

In [6]:
hyperparameters = {"epochs": EPOCHS, "batch_size": BATCH_SIZE}

### Load Embeddings

You need to have the GloVe embeddings file (`glove.6B.300d.txt`) downloaded and stored at the file path used below.

In [7]:
full_content = pd.read_csv('/content/drive/MyDrive/AI/quora_nlp/glove.6B.300d.txt', delim_whitespace = True, quoting=csv.QUOTE_NONE)
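One thing worth double-checking in the cell above: `pd.read_csv` treats the first row as a header by default, so the first GloVe entry may end up as column names instead of data. The vocabulary size of 400001 printed later (after adding two special tokens) is consistent with one of the 400,000 GloVe entries being consumed this way. The following is a minimal sketch on a tiny in-memory stand-in file, not the notebook's actual loading code; `sample`, `toy_content`, `toy_words`, and `toy_vectors` are hypothetical names used only for illustration.

```python
import csv
import io

import pandas as pd

# Tiny stand-in for a GloVe file: one token followed by its vector per line.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "null 0.4 0.5 0.6\n"
    '" 0.7 0.8 0.9\n'
)

toy_content = pd.read_csv(
    sample,
    sep=r"\s+",              # whitespace-delimited columns
    header=None,             # no header row, so the first token stays in the data
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary tokens
    keep_default_na=False,   # keep tokens such as "null" as strings, not NaN
)

# Same separation of words and vectors as in the cells that follow.
toy_words = toy_content.iloc[:, 0].to_numpy()
toy_vectors = toy_content.iloc[:, 1:].to_numpy(dtype=float)
print(toy_words, toy_vectors.shape)
```

If you apply the same options to the real file, expect the printed shapes further down to grow by one row.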

In [8]:
# full_content.head()

In [9]:
#separate words and embeddings
i_word = full_content.iloc[:,0]
i_embeddings = full_content.iloc[:,1:]

In [10]:
# from series to numpy
vocab_npa = np.array(i_word)
embs_npa = np.array(i_embeddings)

In [11]:
# prepend special padding token and unknown token
vocab_npa = np.insert(vocab_npa, 0, '<pad>')
vocab_npa = np.insert(vocab_npa, 1, '<unk>')

In [12]:
pad_emb_npa = np.zeros((1, embs_npa.shape[1]))
unk_emb_npa = np.mean(embs_npa, axis=0, keepdims=True)

#insert embeddings for pad and unk tokens to embs_npa.
embs_npa = np.vstack((pad_emb_npa,unk_emb_npa,embs_npa))

In [13]:
print(vocab_npa.shape)
print(embs_npa.shape)

(400001,)
(400001, 300)
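To make the index layout concrete, here is a small sketch of the same padding/unknown-token construction on a toy vocabulary. The names `toy_vocab`, `toy_embs`, and `word_to_idx` are hypothetical and not part of the repository. The resulting matrix has the same layout the notebook passes to the model as `pretrained_embeddings`; a matrix like this could also be given to `torch.nn.Embedding.from_pretrained`, though whether `LSTMEncoder` does so is an assumption.

```python
import numpy as np

# Toy vocabulary and 2-dimensional embeddings (object dtype so that longer
# tokens such as "<pad>" are not truncated when inserted).
toy_vocab = np.array(["the", "cat", "sat"], dtype=object)
toy_embs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Prepend the special tokens: index 0 is padding, index 1 is unknown.
toy_vocab = np.insert(toy_vocab, 0, "<pad>")
toy_vocab = np.insert(toy_vocab, 1, "<unk>")

# Padding gets a zero vector; unknown gets the mean of all known vectors.
pad_emb = np.zeros((1, toy_embs.shape[1]))
unk_emb = np.mean(toy_embs, axis=0, keepdims=True)
toy_embs = np.vstack((pad_emb, unk_emb, toy_embs))

# Row i of the matrix is the vector for toy_vocab[i].
word_to_idx = {word: idx for idx, word in enumerate(toy_vocab)}
print(toy_vocab.shape, toy_embs.shape)
print(toy_embs[word_to_idx["<unk>"]])
```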


### Split data

In [14]:
entire_df = pd.read_csv("/content/drive/MyDrive/AI/quora_nlp/train.csv")

In [15]:
# pull out negative and positives
negative_df = entire_df[entire_df["target"] == 0]
positive_df = entire_df[entire_df["target"] == 1]
print(len(negative_df))
print(len(positive_df))

1225312
80810


In [16]:
positive_df.head()

Unnamed: 0,qid,question_text,target
22,0000e91571b60c2fb487,Has the United States become the largest dicta...,1
30,00013ceca3f624b09f42,Which babies are more sweeter to their parents...,1
110,0004a7fcb2bf73076489,If blacks support school choice and mandatory ...,1
114,00052793eaa287aff1e1,I am gay boy and I love my cousin (boy). He is...,1
115,000537213b01fd77b58a,Which races have the smallest penis?,1


In [17]:
# Build a balanced training dataframe. This is temporary: the final goal is to
# balance the classes during sampling rather than cut down the overall dataset.
train_df = pd.concat([negative_df[:5000], positive_df[:5000]])

len(train_df)

10000
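The comment in the cell above mentions balancing classes at sampling time instead of truncating the dataset. A minimal sketch of that idea, using only the standard library and a toy label list, is shown below. The per-sample weights it computes are the kind of input `torch.utils.data.WeightedRandomSampler` accepts, but wiring that into `lstm_train` is not shown here and would be an assumption about the training code.

```python
import random
from collections import Counter

# Toy imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10

# Weight each sample by the inverse frequency of its class, so that each
# class carries the same total weight.
class_counts = Counter(labels)
weights = [1.0 / class_counts[y] for y in labels]

# Draw indices with replacement according to those weights.
rng = random.Random(0)
sampled = rng.choices(range(len(labels)), weights=weights, k=10_000)

positive_fraction = sum(labels[i] for i in sampled) / len(sampled)
print(f"positive fraction in sampled batches: {positive_fraction:.3f}")
```

With weights like these, roughly half of the drawn samples come from the minority class, while every original row remains available for sampling.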

In [18]:
train_df, val_df = train_test_split(train_df, test_size=VAL_SPLIT)
test_df = pd.read_csv("/content/drive/MyDrive/AI/quora_nlp/test.csv")

In [19]:
print(len(train_df))
print(len(val_df))
print(len(test_df))

9500
500
375806


### Initialize datasets and model


In [20]:
config = {
    #model configurations
    'batch_size':32,
    'max_seq_length':100,
    'lr':1e-3,
    'label_count':2,
    'dropout_prob':2e-1,
    'hidden_size':256,
    'lstm_unit_cnt':2,

    #embeddings configurations
    'pretrained_embeddings':embs_npa,
    'freeze_embeddings':True,
    'vocab':vocab_npa,
    'pad_token':'<pad>',
    'unk_token':'<unk>',

    #data
    'train_df': train_df,
    'val_df': val_df,
    'test_df': test_df,

    'device': device,
}

In [21]:
# starting fc network, ignore for embeddings and lstm
# data_path = "mini_train.csv"

# train_dataset = StartingDataset(data_path)
# #val split
# generator1 = torch.Generator().manual_seed(42)
# train_dataset, val_dataset = random_split(train_dataset, [1-VAL_SPLIT, VAL_SPLIT], generator = generator1)
# model = StartingNetwork()


In [22]:
# print(len(train_dataset))
# print(len(val_dataset))

In [23]:
model = LSTMEncoder(config)
model.to(device)

LSTMEncoder(
  (embedding): Embedding(400001, 300)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True)
  (fc1): Linear(in_features=256, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
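For readers without the repository at hand, the following sketch builds a module with the same layers as the printout above and runs a toy batch through it. It is not the actual `LSTMEncoder`: the class name `TinyLSTMClassifier`, the small vocabulary size, the use of the last layer's final hidden state, and the ReLU between the two linear layers are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyLSTMClassifier(nn.Module):
    """Hypothetical stand-in mirroring the printed layer shapes."""

    def __init__(self, vocab_size=50, emb_dim=300, hidden_size=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_size, num_layers=num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 128)
        self.fc2 = nn.Linear(128, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        embedded = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embedded)          # h_n: (num_layers, batch, hidden)
        last_hidden = h_n[-1]                      # final state of the top layer
        hidden = torch.relu(self.fc1(last_hidden)) # assumed activation
        return self.sigmoid(self.fc2(hidden)).squeeze(-1)  # (batch,)


toy_model = TinyLSTMClassifier()
toy_batch = torch.randint(0, 50, (4, 100))  # 4 sequences of length 100
toy_probs = toy_model(toy_batch)
print(toy_probs.shape)
```

The single sigmoid output means each question gets one probability, which fits a binary target column.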

In [24]:
train_dataset = EmbeddingDataset(
    df = config['train_df'],
    vocab = config['vocab'],
    max_seq_length = config['max_seq_length'],
    pad_token = config['pad_token'],
    unk_token = config['unk_token']
)

val_dataset = EmbeddingDataset(
    df = config['val_df'],
    vocab = config['vocab'],
    max_seq_length = config['max_seq_length'],
    pad_token = config['pad_token'],
    unk_token = config['unk_token']
)
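The arguments passed to `EmbeddingDataset` suggest that each question is turned into a fixed-length sequence of vocabulary indices. A minimal sketch of such an encoding step follows; `encode_question` is a hypothetical helper, and the lowercasing and whitespace tokenization are assumptions, since the actual preprocessing lives in `data/EmbeddingDataset.py`.

```python
def encode_question(text, word_to_idx, max_seq_length,
                    pad_token="<pad>", unk_token="<unk>"):
    """Map a question to exactly max_seq_length vocabulary indices."""
    # Assumed tokenization: lowercase and split on whitespace, then truncate.
    tokens = text.lower().split()[:max_seq_length]
    # Words missing from the vocabulary fall back to the unknown token.
    ids = [word_to_idx.get(tok, word_to_idx[unk_token]) for tok in tokens]
    # Right-pad short questions with the padding index.
    ids += [word_to_idx[pad_token]] * (max_seq_length - len(ids))
    return ids


toy_word_to_idx = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3}
print(encode_question("The cat flies", toy_word_to_idx, max_seq_length=5))
```

Fixed-length sequences let the default batching stack questions into a single tensor per batch.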


### Train model

Before you start, check out [train_functions/starting_train.py](train_functions/starting_train.py). You might have to do something to get the training loop running properly. Note that the cell below calls `lstm_train`, which is imported from `train_functions.lstm_train`.

In [None]:
lstm_train(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    model=model,
    hyperparameters=hyperparameters,
    n_eval=N_EVAL,
    device=device
)


Epoch 1 of 100


100%|██████████| 297/297 [00:55<00:00,  5.35it/s]



Epoch 2 of 100


100%|██████████| 297/297 [00:53<00:00,  5.52it/s]



Epoch 3 of 100


100%|██████████| 297/297 [00:58<00:00,  5.05it/s]



Epoch 4 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 5 of 100


100%|██████████| 297/297 [00:58<00:00,  5.12it/s]



Epoch 6 of 100


100%|██████████| 297/297 [00:57<00:00,  5.19it/s]



Epoch 7 of 100


100%|██████████| 297/297 [00:56<00:00,  5.24it/s]



Epoch 8 of 100


100%|██████████| 297/297 [00:56<00:00,  5.27it/s]



Epoch 9 of 100


100%|██████████| 297/297 [00:55<00:00,  5.34it/s]



Epoch 10 of 100


100%|██████████| 297/297 [00:52<00:00,  5.62it/s]



Epoch 11 of 100


100%|██████████| 297/297 [00:54<00:00,  5.44it/s]



Epoch 12 of 100


100%|██████████| 297/297 [00:54<00:00,  5.45it/s]



Epoch 13 of 100


100%|██████████| 297/297 [00:55<00:00,  5.38it/s]



Epoch 14 of 100


100%|██████████| 297/297 [00:55<00:00,  5.40it/s]



Epoch 15 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 16 of 100


100%|██████████| 297/297 [00:55<00:00,  5.35it/s]



Epoch 17 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 18 of 100


100%|██████████| 297/297 [00:55<00:00,  5.30it/s]



Epoch 19 of 100


100%|██████████| 297/297 [00:57<00:00,  5.17it/s]



Epoch 20 of 100


100%|██████████| 297/297 [00:54<00:00,  5.44it/s]



Epoch 21 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 22 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 23 of 100


100%|██████████| 297/297 [00:53<00:00,  5.57it/s]



Epoch 24 of 100


100%|██████████| 297/297 [00:55<00:00,  5.32it/s]



Epoch 25 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 26 of 100


100%|██████████| 297/297 [00:54<00:00,  5.47it/s]



Epoch 27 of 100


100%|██████████| 297/297 [00:55<00:00,  5.38it/s]



Epoch 28 of 100


100%|██████████| 297/297 [00:55<00:00,  5.40it/s]



Epoch 29 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 30 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]



Epoch 31 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 32 of 100


100%|██████████| 297/297 [00:53<00:00,  5.53it/s]



Epoch 33 of 100


100%|██████████| 297/297 [00:55<00:00,  5.34it/s]



Epoch 34 of 100


100%|██████████| 297/297 [00:51<00:00,  5.72it/s]



Epoch 35 of 100


100%|██████████| 297/297 [00:55<00:00,  5.38it/s]



Epoch 36 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]



Epoch 37 of 100


100%|██████████| 297/297 [00:55<00:00,  5.39it/s]



Epoch 38 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 39 of 100


100%|██████████| 297/297 [00:53<00:00,  5.50it/s]



Epoch 40 of 100


100%|██████████| 297/297 [00:54<00:00,  5.40it/s]



Epoch 41 of 100


100%|██████████| 297/297 [00:54<00:00,  5.45it/s]



Epoch 42 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 43 of 100


100%|██████████| 297/297 [00:53<00:00,  5.53it/s]



Epoch 44 of 100


100%|██████████| 297/297 [00:53<00:00,  5.54it/s]



Epoch 45 of 100


100%|██████████| 297/297 [00:53<00:00,  5.55it/s]



Epoch 46 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 47 of 100


100%|██████████| 297/297 [00:56<00:00,  5.22it/s]



Epoch 48 of 100


100%|██████████| 297/297 [00:59<00:00,  5.02it/s]



Epoch 49 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 50 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 51 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 52 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 53 of 100


100%|██████████| 297/297 [00:52<00:00,  5.70it/s]



Epoch 54 of 100


100%|██████████| 297/297 [00:53<00:00,  5.54it/s]



Epoch 55 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 56 of 100


100%|██████████| 297/297 [00:57<00:00,  5.21it/s]



Epoch 57 of 100


100%|██████████| 297/297 [00:52<00:00,  5.65it/s]



Epoch 58 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 59 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]



Epoch 60 of 100


100%|██████████| 297/297 [00:52<00:00,  5.71it/s]



Epoch 61 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 62 of 100


100%|██████████| 297/297 [00:52<00:00,  5.69it/s]



Epoch 63 of 100


100%|██████████| 297/297 [00:55<00:00,  5.40it/s]



Epoch 64 of 100


100%|██████████| 297/297 [00:51<00:00,  5.74it/s]



Epoch 65 of 100


100%|██████████| 297/297 [00:54<00:00,  5.50it/s]



Epoch 66 of 100


100%|██████████| 297/297 [00:54<00:00,  5.45it/s]



Epoch 67 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 68 of 100


100%|██████████| 297/297 [00:53<00:00,  5.55it/s]



Epoch 69 of 100


100%|██████████| 297/297 [00:57<00:00,  5.20it/s]



Epoch 70 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 71 of 100


100%|██████████| 297/297 [00:54<00:00,  5.43it/s]



Epoch 72 of 100


100%|██████████| 297/297 [00:52<00:00,  5.64it/s]



Epoch 73 of 100


100%|██████████| 297/297 [00:55<00:00,  5.35it/s]



Epoch 74 of 100


100%|██████████| 297/297 [00:56<00:00,  5.29it/s]



Epoch 75 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 76 of 100


100%|██████████| 297/297 [00:56<00:00,  5.28it/s]



Epoch 77 of 100


100%|██████████| 297/297 [00:52<00:00,  5.67it/s]



Epoch 78 of 100


100%|██████████| 297/297 [00:53<00:00,  5.51it/s]



Epoch 79 of 100


100%|██████████| 297/297 [00:53<00:00,  5.60it/s]



Epoch 80 of 100


100%|██████████| 297/297 [00:56<00:00,  5.27it/s]



Epoch 81 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 82 of 100


100%|██████████| 297/297 [00:55<00:00,  5.31it/s]



Epoch 83 of 100


100%|██████████| 297/297 [00:56<00:00,  5.29it/s]



Epoch 84 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 85 of 100


100%|██████████| 297/297 [00:55<00:00,  5.31it/s]



Epoch 86 of 100


100%|██████████| 297/297 [00:55<00:00,  5.31it/s]



Epoch 87 of 100


100%|██████████| 297/297 [00:56<00:00,  5.28it/s]



Epoch 88 of 100


100%|██████████| 297/297 [00:57<00:00,  5.21it/s]



Epoch 89 of 100


100%|██████████| 297/297 [00:59<00:00,  4.98it/s]



Epoch 90 of 100


100%|██████████| 297/297 [00:56<00:00,  5.26it/s]



Epoch 91 of 100


100%|██████████| 297/297 [00:55<00:00,  5.34it/s]



Epoch 92 of 100


100%|██████████| 297/297 [00:56<00:00,  5.25it/s]



Epoch 93 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 94 of 100


100%|██████████| 297/297 [00:54<00:00,  5.43it/s]



Epoch 95 of 100


100%|██████████| 297/297 [00:56<00:00,  5.26it/s]



Epoch 96 of 100


100%|██████████| 297/297 [00:55<00:00,  5.39it/s]



Epoch 97 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 98 of 100


100%|██████████| 297/297 [00:54<00:00,  5.40it/s]



Epoch 99 of 100


100%|██████████| 297/297 [00:53<00:00,  5.55it/s]



Epoch 100 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]
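The `297/297` shown in each progress bar is consistent with 9500 training rows and a batch size of 32, assuming the final partial batch is kept. The helper below, `batches_per_epoch`, is a hypothetical function written only to check that arithmetic.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches one pass over the data produces."""
    if drop_last:
        # Discard the final partial batch.
        return num_samples // batch_size
    # Keep the final partial batch.
    return math.ceil(num_samples / batch_size)


print(batches_per_epoch(9500, 32))                  # 297
print(batches_per_epoch(9500, 32, drop_last=True))  # 296
```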





