# Starter code for the skeleton notebook

If you are running this notebook on **Google Colab**, make sure you are using a GPU runtime.

This notebook mounts Google Drive to load the embeddings and data; you can find them on Kaggle at https://www.kaggle.com/competitions/quora-insincere-questions-classification/data

When running on Colab, it automatically grabs the scripts from
https://github.com/LLeon360/aiprojects-nlp-quora-questions

Check out [data/starting_dataset.py](data/EmbeddingsDataset.py) for the dataset processing code. \
Check out [networks/StartingNetwork.py](networks/LSTMEncoder.py) for the neural network code. \
Check out [train_functions/starting_train.py](train_functions/lstm_train.py) for the training code.

### Mount Drive (Google Colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Grab scripts from GitHub Repo

In [None]:
!git clone https://github.com/LLeon360/aiprojects-nlp-quora-questions scripts
!mv  -v scripts/* .

Cloning into 'scripts'...
remote: Enumerating objects: 229, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 229 (delta 10), reused 21 (delta 7), pack-reused 203
Receiving objects: 100% (229/229), 104.97 KiB | 3.62 MiB/s, done.
Resolving deltas: 100% (104/104), done.
renamed 'scripts/acmprojects.yml' -> './acmprojects.yml'
renamed 'scripts/constants.py' -> './constants.py'
renamed 'scripts/data' -> './data'
renamed 'scripts/kaggle.json' -> './kaggle.json'
renamed 'scripts/main.ipynb' -> './main.ipynb'
renamed 'scripts/networks' -> './networks'
renamed 'scripts/README.md' -> './README.md'
renamed 'scripts/train_functions' -> './train_functions'


### Imports

In [None]:
import os

import constants

from data.StartingDataset import StartingDataset
from networks.StartingNetwork import StartingNetwork
from train_functions.lstm_train import lstm_train

from data.EmbeddingDataset import EmbeddingDataset
from networks.LSTMEncoder import LSTMEncoder

import torch
from torch.utils.data import random_split, WeightedRandomSampler, BatchSampler

import pandas as pd

import csv
import numpy as np

from sklearn.model_selection import train_test_split

### Constants

In [None]:
# EPOCHS = 100
# BATCH_SIZE = 32
# N_EVAL = 100
# VAL_SPLIT = 0.1

from constants import EPOCHS, BATCH_SIZE, N_EVAL, VAL_SPLIT
VAL_SPLIT = 0.05  # override the value imported from constants.py


### GPU Support


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Define hyperparameters

In [None]:
hyperparameters = {"epochs": EPOCHS, "batch_size": BATCH_SIZE}

### Load Embeddings

You need to have the embeddings downloaded and stored at the matching file path.

In [None]:
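# Note: without header=None, pandas uses the file's first line (the first word and its vector) as column names, so that word is dropped.
full_content = pd.read_csv('/content/drive/MyDrive/AI/quora_nlp/glove.6B.300d.txt', delim_whitespace = True, quoting=csv.QUOTE_NONE)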
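
The effect of the `header` argument when reading a GloVe-style text file can be checked on a tiny in-memory sample. This is only a sketch: the three-word `sample` string is made up for illustration, and `sep=r"\s+"` is used here as the whitespace separator instead of `delim_whitespace`.

```python
import csv
import io

import numpy as np
import pandas as pd

# Made-up three-line sample in the "word v1 v2 v3" layout of a GloVe text file.
sample = "the 0.1 0.2 0.3\nof 0.4 0.5 0.6\nand 0.7 0.8 0.9\n"

# Default behaviour: the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(sample), sep=r"\s+", quoting=csv.QUOTE_NONE)

# header=None keeps every line as a data row.
no_header = pd.read_csv(
    io.StringIO(sample), sep=r"\s+", quoting=csv.QUOTE_NONE, header=None
)

words = no_header.iloc[:, 0].to_numpy()
vectors = no_header.iloc[:, 1:].to_numpy(dtype=np.float32)

print(len(with_header), len(no_header))
print(words, vectors.shape)
```

With the default settings the sample loses its first word, which is why the row count differs between the two reads.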

In [None]:
# full_content.head()

In [None]:
#separate words and embeddings
i_word = full_content.iloc[:,0]
i_embeddings = full_content.iloc[:,1:]

In [None]:
# from series to numpy
vocab_npa = np.array(i_word)
embs_npa = np.array(i_embeddings)

In [None]:
# prepend special padding token and unknown token
vocab_npa = np.insert(vocab_npa, 0, '<pad>')
vocab_npa = np.insert(vocab_npa, 1, '<unk>')

In [None]:
pad_emb_npa = np.zeros((1, embs_npa.shape[1]))
unk_emb_npa = np.mean(embs_npa, axis=0, keepdims=True)

#insert embeddings for pad and unk tokens to embs_npa.
embs_npa = np.vstack((pad_emb_npa,unk_emb_npa,embs_npa))

In [None]:
print(vocab_npa.shape)
print(embs_npa.shape)

(400001,)
(400001, 300)
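
As a rough illustration of how a vocabulary array and an embedding matrix like the ones above can feed a PyTorch embedding layer, here is a minimal sketch. The toy vocabulary, the 4-dimensional vectors, and the names `toy_vocab`, `toy_embs`, and `word_to_idx` are assumptions made for the example; the repository's `LSTMEncoder` may wire this up differently.

```python
import numpy as np
import torch
from torch import nn

# Toy stand-ins for vocab_npa / embs_npa: <pad>, <unk>, then three real words.
toy_vocab = np.array(["<pad>", "<unk>", "the", "of", "and"])
toy_embs = np.vstack([
    np.zeros((1, 4)),                               # <pad> row is all zeros
    np.full((1, 4), 0.5),                           # <unk> row (a mean vector in the notebook)
    np.random.default_rng(0).normal(size=(3, 4)),   # made-up word vectors
])

# Map each word to its row index in the embedding matrix.
word_to_idx = {str(word): idx for idx, word in enumerate(toy_vocab)}

# Build a frozen embedding layer from the matrix.
embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(toy_embs).float(),
    freeze=True,
    padding_idx=word_to_idx["<pad>"],
)

# Unknown words fall back to the <unk> index.
tokens = ["the", "quora", "<pad>"]
ids = torch.tensor([word_to_idx.get(t, word_to_idx["<unk>"]) for t in tokens])
vectors = embedding(ids)
print(ids.tolist(), vectors.shape)
```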


### Split data

In [None]:
entire_df = pd.read_csv("/content/drive/MyDrive/AI/quora_nlp/train.csv")
# entire_df = pd.read_csv("train.csv")

In [None]:
train_df, val_df = train_test_split(entire_df, test_size=VAL_SPLIT)
test_df = pd.read_csv("/content/drive/MyDrive/AI/quora_nlp/test.csv")

In [None]:
print(len(train_df))
print(len(val_df))
# print(len(test_df))

1240815
65307


#### Class imbalance

In [None]:
# pull out negative and positives
negative_df = entire_df[entire_df["target"] == 0]
positive_df = entire_df[entire_df["target"] == 1]
print(len(negative_df))
print(len(positive_df))
print(len(negative_df) / len(positive_df))

1225312
80810


#### Weighted Sampler

There is a significant class imbalance: about 15 negative questions for every positive one. A weighted sampler is used so the model trains on a roughly balanced mix of both classes.

In [None]:
weights = np.ones(len(train_df))
weights[train_df.target==1] *= 15
weights /= (len(train_df)) # The PyTorch docs say the weights don't have to sum to 1, but sampling didn't work without this normalization :(

sampler = WeightedRandomSampler(weights=weights, num_samples=len(train_df), replacement=True)
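
To sanity-check the choice of 15 as the positive-class weight, the expected share of positives under weighted sampling can be computed and compared against an actual sampler on toy data. This is a sketch: `expected_positive_fraction` is a helper introduced here, the full-dataset counts are the ones printed in the class imbalance cell (the training split will differ slightly), and the toy targets are made up.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler


def expected_positive_fraction(n_neg, n_pos, pos_weight):
    """Share of draws that are positive when positives carry pos_weight and negatives carry 1."""
    return pos_weight * n_pos / (pos_weight * n_pos + n_neg)


# Counts printed for the whole dataset in the class imbalance cell.
p_full = expected_positive_fraction(n_neg=1225312, n_pos=80810, pos_weight=15)

# Toy check with a real sampler: 1500 negatives, 100 positives.
torch.manual_seed(0)
toy_targets = np.array([0] * 1500 + [1] * 100)
toy_weights = np.ones(len(toy_targets))
toy_weights[toy_targets == 1] *= 15
toy_weights /= len(toy_targets)  # same normalization as the notebook cell

toy_sampler = WeightedRandomSampler(
    weights=toy_weights, num_samples=20000, replacement=True
)
drawn = toy_targets[list(toy_sampler)]
observed = drawn.mean()

print(round(p_full, 3), round(float(observed), 3))
```

Both numbers should land close to one half, which is the balance the weighting is aiming for.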

### Initialize datasets and model


In [None]:
config = {
    #model configurations
    'batch_size':32,
    'max_seq_length':100,
    'lr':1e-3,
    'label_count':2,
    'dropout_prob':2e-1,
    'hidden_size':256,
    'lstm_unit_cnt':2,

    #embeddings configurations
    'pretrained_embeddings':embs_npa,
    'freeze_embeddings':True,
    'vocab':vocab_npa,
    'pad_token':'<pad>',
    'unk_token':'<unk>',

    #data
    'train_df': train_df,
    'val_df': val_df,
    'test_df': test_df,

    'device': device,
}

In [None]:
# starting fc network, ignore for embeddings and lstm
# data_path = "mini_train.csv"

# train_dataset = StartingDataset(data_path)
# #val split
# generator1 = torch.Generator().manual_seed(42)
# train_dataset, val_dataset = random_split(train_dataset, [1-VAL_SPLIT, VAL_SPLIT], generator = generator1)
# model = StartingNetwork()


In [None]:
# print(len(train_dataset))
# print(len(val_dataset))

In [None]:
model = LSTMEncoder(config)
model.to(device)

LSTMEncoder(
  (embedding): Embedding(400001, 300)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True)
  (fc1): Linear(in_features=256, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
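
For readers who want to see how the printed layers might fit together, here is a minimal sketch with the same layer shapes. The class name `TinyLSTMEncoder`, the ReLU between the two linear layers, and the use of the last layer's final hidden state are assumptions; the actual forward pass lives in `networks/LSTMEncoder.py` and may differ.

```python
import torch
from torch import nn


class TinyLSTMEncoder(nn.Module):
    """Hypothetical model mirroring the printed layer shapes, not the repository's code."""

    def __init__(self, vocab_size, emb_dim=300, hidden_size=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_size, num_layers=num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 128)
        self.fc2 = nn.Linear(128, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)        # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embedded)           # h_n: (num_layers, batch, hidden_size)
        features = torch.relu(self.fc1(h_n[-1]))    # final hidden state of the top layer
        return self.sigmoid(self.fc2(features)).squeeze(-1)  # (batch,) probabilities


# Small vocabulary so the example runs quickly on CPU.
toy_model = TinyLSTMEncoder(vocab_size=50)
toy_batch = torch.randint(0, 50, (4, 100))  # 4 sequences of length 100
toy_probs = toy_model(toy_batch)
print(toy_probs.shape)
```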

In [None]:
train_dataset = EmbeddingDataset(
    df = config['train_df'],
    vocab = config['vocab'],
    max_seq_length = config['max_seq_length'],
    pad_token = config['pad_token'],
    unk_token = config['unk_token']
)

val_dataset = EmbeddingDataset(
    df = config['val_df'],
    vocab = config['vocab'],
    max_seq_length = config['max_seq_length'],
    pad_token = config['pad_token'],
    unk_token = config['unk_token']
)
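
The dataset arguments above suggest that each question is turned into a fixed-length list of vocabulary indices. A minimal sketch of that idea follows. `ToyEmbeddingDataset`, the whitespace tokenizer, and the `"input_ids"` key are assumptions; only the `"labels"` key and the `question_text` / `target` column names appear elsewhere in this notebook.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class ToyEmbeddingDataset(Dataset):
    """Hypothetical stand-in for EmbeddingDataset, for illustration only."""

    def __init__(self, df, vocab, max_seq_length, pad_token, unk_token):
        self.texts = df["question_text"].tolist()
        self.labels = df["target"].tolist()
        self.word_to_idx = {str(w): i for i, w in enumerate(vocab)}
        self.max_seq_length = max_seq_length
        self.pad_idx = self.word_to_idx[pad_token]
        self.unk_idx = self.word_to_idx[unk_token]

    def __len__(self):
        return len(self.texts)

    def encode(self, text):
        # Lowercase, split on whitespace, truncate, map to indices, then pad.
        tokens = text.lower().split()[: self.max_seq_length]
        ids = [self.word_to_idx.get(t, self.unk_idx) for t in tokens]
        ids += [self.pad_idx] * (self.max_seq_length - len(ids))
        return ids

    def __getitem__(self, index):
        return {
            "input_ids": torch.tensor(self.encode(self.texts[index]), dtype=torch.long),
            "labels": torch.tensor(self.labels[index], dtype=torch.float),
        }


toy_df = pd.DataFrame({"question_text": ["Who is the richest man"], "target": [0]})
toy_dataset = ToyEmbeddingDataset(
    df=toy_df,
    vocab=["<pad>", "<unk>", "who", "is", "the"],
    max_seq_length=6,
    pad_token="<pad>",
    unk_token="<unk>",
)
item = toy_dataset[0]
print(item["input_ids"].tolist(), item["labels"].item())
```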


### Test Sampler

In [None]:
# print(sampler.weights[:30])

tensor([8.0592e-07, 8.0592e-07, 8.0592e-07, 8.0592e-07, 8.0592e-07, 8.0592e-07,
        8.0592e-07, 8.0592e-07, 8.0592e-07, 8.0592e-07, 8.0592e-07, 8.0592e-07,
        8.0592e-07, 1.2089e-05, 8.0592e-07], dtype=torch.float64)


In [None]:
# train_df.head(30)

| | qid | question_text | target |
|---|---|---|---|
| 897869 | afec8f46143fa9dde146 | How did you classify your virtual assistants? | 0 |
| 620341 | 797b6e6e4fd0ce072ee1 | If I took the SSD out of my 2014 Mac, how do I... | 0 |
| 331676 | 41050ea09d6898c2a53a | Who is the richest man in Kerala? | 0 |
| 789575 | 9ab381c37a2dc0027542 | How do I recover my SBI user name? | 0 |
| 1085331 | d4b0971e2bdb09608a33 | Is it a good idea to share school project on G... | 0 |
| 506219 | 631ea439c7b684686346 | What else to learn during learning competitive... | 0 |
| 1118660 | db381f17391866eb173f | What is the extra preparation needed for clear... | 0 |
| 519213 | 65a67943ec8a9354420d | What can I do when I can't concentrate on anyt... | 0 |
| 19521 | 03d2ad28073a4dee4006 | What is the line between loving oneself and na... | 0 |
| 304228 | 3b9650a4120295059802 | Why does physical comfort feel so beautiful? | 0 |



In [None]:
# train_loader = torch.utils.data.DataLoader(
#     train_dataset, batch_sampler=BatchSampler(sampler,32, True)
# )

In [None]:
# batch = next(iter(train_loader))

In [None]:
# print(batch["labels"])

tensor([1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1.,
        1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 0.])
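
The commented-out cells above wrap the weighted sampler in a `BatchSampler` and pull one batch to inspect its labels. The same check can be reproduced on toy data, as sketched below. `ToyDictDataset` and the toy labels are made up for the example; the dict-style item with a `"labels"` key follows the `batch["labels"]` access shown above.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, WeightedRandomSampler


class ToyDictDataset(Dataset):
    """Made-up dataset that returns dict items, like the batch inspected above."""

    def __init__(self, labels):
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return {"labels": torch.tensor(self.labels[index], dtype=torch.float)}


torch.manual_seed(0)
toy_labels = [0] * 155 + [1] * 10  # 165 items, heavily imbalanced
toy_weights = [15.0 if y == 1 else 1.0 for y in toy_labels]

toy_sampler = WeightedRandomSampler(
    toy_weights, num_samples=len(toy_labels), replacement=True
)
# drop_last=True discards the trailing partial batch (165 = 5 * 32 + 5).
toy_loader = DataLoader(
    ToyDictDataset(toy_labels),
    batch_sampler=BatchSampler(toy_sampler, batch_size=32, drop_last=True),
)

batches = list(toy_loader)
first_labels = batches[0]["labels"]
print(len(batches), first_labels.shape, first_labels.mean().item())
```

Passing `batch_sampler` means `batch_size`, `shuffle`, `sampler`, and `drop_last` must not also be set on the `DataLoader` itself.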


### Train model

Before you start, check out [train_functions/starting_train.py](train_functions/starting_train.py); the LSTM run below calls `lstm_train` from `train_functions/lstm_train.py` instead. You might have to do something to get the training loop running properly.
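
For orientation, a training function with the same call signature could look roughly like the sketch below. Everything in it is an assumption rather than the repository's code: the names `sketch_train` and `evaluate`, the Adam optimizer with a learning rate of 1e-3, the binary cross-entropy loss, and the tuple-style batches (the notebook's dataset appears to return dicts instead).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler


def evaluate(model, loader, loss_fn, device):
    """Average loss and accuracy over a loader (hypothetical helper)."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            probs = model(inputs)
            total_loss += loss_fn(probs, labels).item() * len(labels)
            correct += ((probs > 0.5).float() == labels).sum().item()
            count += len(labels)
    return {"loss": total_loss / count, "accuracy": correct / count}


def sketch_train(train_dataset, val_dataset, sampler, model, hyperparameters, n_eval, device):
    """Hypothetical stand-in for lstm_train; evaluates every n_eval steps."""
    batch_size = hyperparameters["batch_size"]
    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCELoss()  # assumes the model already ends in a sigmoid
    history, step = [], 0
    for epoch in range(hyperparameters["epochs"]):
        for inputs, labels in train_loader:
            model.train()
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
            step += 1
            if step % n_eval == 0:
                history.append(evaluate(model, val_loader, loss_fn, device))
    return history


# Toy run on random features so the sketch executes quickly on CPU.
torch.manual_seed(0)
features = torch.randn(200, 8)
targets = (features[:, 0] > 1.0).float()
toy_train = TensorDataset(features[:160], targets[:160])
toy_val = TensorDataset(features[160:], targets[160:])
toy_weights = 1.0 + 4.0 * targets[:160]  # positives are 5x more likely to be drawn
toy_sampler = WeightedRandomSampler(toy_weights, num_samples=160, replacement=True)
toy_model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid(), nn.Flatten(0))

history = sketch_train(
    toy_train, toy_val, toy_sampler, toy_model,
    hyperparameters={"epochs": 2, "batch_size": 32}, n_eval=5, device=torch.device("cpu"),
)
print(len(history))
```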

In [None]:
lstm_train(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    sampler = sampler,
    model=model,
    hyperparameters=hyperparameters,
    n_eval=N_EVAL,
    device=device
)


Epoch 1 of 100


100%|██████████| 297/297 [00:55<00:00,  5.35it/s]



Epoch 2 of 100


100%|██████████| 297/297 [00:53<00:00,  5.52it/s]



Epoch 3 of 100


100%|██████████| 297/297 [00:58<00:00,  5.05it/s]



Epoch 4 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 5 of 100


100%|██████████| 297/297 [00:58<00:00,  5.12it/s]



Epoch 6 of 100


100%|██████████| 297/297 [00:57<00:00,  5.19it/s]



Epoch 7 of 100


100%|██████████| 297/297 [00:56<00:00,  5.24it/s]



Epoch 8 of 100


100%|██████████| 297/297 [00:56<00:00,  5.27it/s]



Epoch 9 of 100


100%|██████████| 297/297 [00:55<00:00,  5.34it/s]



Epoch 10 of 100


100%|██████████| 297/297 [00:52<00:00,  5.62it/s]



Epoch 11 of 100


100%|██████████| 297/297 [00:54<00:00,  5.44it/s]



Epoch 12 of 100


100%|██████████| 297/297 [00:54<00:00,  5.45it/s]



Epoch 13 of 100


100%|██████████| 297/297 [00:55<00:00,  5.38it/s]



Epoch 14 of 100


100%|██████████| 297/297 [00:55<00:00,  5.40it/s]



Epoch 15 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 16 of 100


100%|██████████| 297/297 [00:55<00:00,  5.35it/s]



Epoch 17 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 18 of 100


100%|██████████| 297/297 [00:55<00:00,  5.30it/s]



Epoch 19 of 100


100%|██████████| 297/297 [00:57<00:00,  5.17it/s]



Epoch 20 of 100


100%|██████████| 297/297 [00:54<00:00,  5.44it/s]



Epoch 21 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 22 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 23 of 100


100%|██████████| 297/297 [00:53<00:00,  5.57it/s]



Epoch 24 of 100


100%|██████████| 297/297 [00:55<00:00,  5.32it/s]



Epoch 25 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 26 of 100


100%|██████████| 297/297 [00:54<00:00,  5.47it/s]



Epoch 27 of 100


100%|██████████| 297/297 [00:55<00:00,  5.38it/s]



Epoch 28 of 100


100%|██████████| 297/297 [00:55<00:00,  5.40it/s]



Epoch 29 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 30 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]



Epoch 31 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 32 of 100


100%|██████████| 297/297 [00:53<00:00,  5.53it/s]



Epoch 33 of 100


100%|██████████| 297/297 [00:55<00:00,  5.34it/s]



Epoch 34 of 100


100%|██████████| 297/297 [00:51<00:00,  5.72it/s]



Epoch 35 of 100


100%|██████████| 297/297 [00:55<00:00,  5.38it/s]



Epoch 36 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]



Epoch 37 of 100


100%|██████████| 297/297 [00:55<00:00,  5.39it/s]



Epoch 38 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 39 of 100


100%|██████████| 297/297 [00:53<00:00,  5.50it/s]



Epoch 40 of 100


100%|██████████| 297/297 [00:54<00:00,  5.40it/s]



Epoch 41 of 100


100%|██████████| 297/297 [00:54<00:00,  5.45it/s]



Epoch 42 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 43 of 100


100%|██████████| 297/297 [00:53<00:00,  5.53it/s]



Epoch 44 of 100


100%|██████████| 297/297 [00:53<00:00,  5.54it/s]



Epoch 45 of 100


100%|██████████| 297/297 [00:53<00:00,  5.55it/s]



Epoch 46 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 47 of 100


100%|██████████| 297/297 [00:56<00:00,  5.22it/s]



Epoch 48 of 100


100%|██████████| 297/297 [00:59<00:00,  5.02it/s]



Epoch 49 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 50 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 51 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 52 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 53 of 100


100%|██████████| 297/297 [00:52<00:00,  5.70it/s]



Epoch 54 of 100


100%|██████████| 297/297 [00:53<00:00,  5.54it/s]



Epoch 55 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 56 of 100


100%|██████████| 297/297 [00:57<00:00,  5.21it/s]



Epoch 57 of 100


100%|██████████| 297/297 [00:52<00:00,  5.65it/s]



Epoch 58 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 59 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]



Epoch 60 of 100


100%|██████████| 297/297 [00:52<00:00,  5.71it/s]



Epoch 61 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 62 of 100


100%|██████████| 297/297 [00:52<00:00,  5.69it/s]



Epoch 63 of 100


100%|██████████| 297/297 [00:55<00:00,  5.40it/s]



Epoch 64 of 100


100%|██████████| 297/297 [00:51<00:00,  5.74it/s]



Epoch 65 of 100


100%|██████████| 297/297 [00:54<00:00,  5.50it/s]



Epoch 66 of 100


100%|██████████| 297/297 [00:54<00:00,  5.45it/s]



Epoch 67 of 100


100%|██████████| 297/297 [00:53<00:00,  5.56it/s]



Epoch 68 of 100


100%|██████████| 297/297 [00:53<00:00,  5.55it/s]



Epoch 69 of 100


100%|██████████| 297/297 [00:57<00:00,  5.20it/s]



Epoch 70 of 100


100%|██████████| 297/297 [00:53<00:00,  5.58it/s]



Epoch 71 of 100


100%|██████████| 297/297 [00:54<00:00,  5.43it/s]



Epoch 72 of 100


100%|██████████| 297/297 [00:52<00:00,  5.64it/s]



Epoch 73 of 100


100%|██████████| 297/297 [00:55<00:00,  5.35it/s]



Epoch 74 of 100


100%|██████████| 297/297 [00:56<00:00,  5.29it/s]



Epoch 75 of 100


100%|██████████| 297/297 [00:54<00:00,  5.49it/s]



Epoch 76 of 100


100%|██████████| 297/297 [00:56<00:00,  5.28it/s]



Epoch 77 of 100


100%|██████████| 297/297 [00:52<00:00,  5.67it/s]



Epoch 78 of 100


100%|██████████| 297/297 [00:53<00:00,  5.51it/s]



Epoch 79 of 100


100%|██████████| 297/297 [00:53<00:00,  5.60it/s]



Epoch 80 of 100


100%|██████████| 297/297 [00:56<00:00,  5.27it/s]



Epoch 81 of 100


100%|██████████| 297/297 [00:54<00:00,  5.48it/s]



Epoch 82 of 100


100%|██████████| 297/297 [00:55<00:00,  5.31it/s]



Epoch 83 of 100


100%|██████████| 297/297 [00:56<00:00,  5.29it/s]



Epoch 84 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 85 of 100


100%|██████████| 297/297 [00:55<00:00,  5.31it/s]



Epoch 86 of 100


100%|██████████| 297/297 [00:55<00:00,  5.31it/s]



Epoch 87 of 100


100%|██████████| 297/297 [00:56<00:00,  5.28it/s]



Epoch 88 of 100


100%|██████████| 297/297 [00:57<00:00,  5.21it/s]



Epoch 89 of 100


100%|██████████| 297/297 [00:59<00:00,  4.98it/s]



Epoch 90 of 100


100%|██████████| 297/297 [00:56<00:00,  5.26it/s]



Epoch 91 of 100


100%|██████████| 297/297 [00:55<00:00,  5.34it/s]



Epoch 92 of 100


100%|██████████| 297/297 [00:56<00:00,  5.25it/s]



Epoch 93 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 94 of 100


100%|██████████| 297/297 [00:54<00:00,  5.43it/s]



Epoch 95 of 100


100%|██████████| 297/297 [00:56<00:00,  5.26it/s]



Epoch 96 of 100


100%|██████████| 297/297 [00:55<00:00,  5.39it/s]



Epoch 97 of 100


100%|██████████| 297/297 [00:55<00:00,  5.37it/s]



Epoch 98 of 100


100%|██████████| 297/297 [00:54<00:00,  5.40it/s]



Epoch 99 of 100


100%|██████████| 297/297 [00:53<00:00,  5.55it/s]



Epoch 100 of 100


100%|██████████| 297/297 [00:54<00:00,  5.46it/s]





