<a href="https://colab.research.google.com/github/Rheddes/recsys-twitter/blob/master/recsys_twitter_gpu_in_batches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Necessary imports & definitions

Mount Google Drive so the notebook can read and write files there. Copying files from Drive to the local disk is optional; it is also possible to work directly from Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Get validation set

Get the validation/prediction set from the challenge.

In [None]:
!wget -O val.tsv "https://elasticbeanstalk-us-west-2-800068098556.s3.amazonaws.com/challenge-website/public_data/val.tsv?AWSAccessKeyId=AKIA3UR6GLH6F73MJVWF&Signature=uURRfbcpN3%2BW7tWrUaL6Av8ZX5c%3D&Expires=1588061161"
!cp val.tsv ./drive/My\ Drive/RecSys/val.tsv

In [None]:
# !cp ./drive/My\ Drive/RecSys/train_updated.tsv train_updated.tsv
# !cp ./drive/My\ Drive/RecSys/sample.tsv sample.tsv 
# !cp bert_classification_features.csv  ./drive/My\ Drive/RecSys/bert_22500.csv

## Set correct batch index

To circumvent runtime timeouts, we process the dataset in batches.
The validation set has been split into 4 files:

```
./drive/My Drive/RecSys/val.{1..4}.tsv
```

So the notebook needs to be run 4 times, once per file, changing the `TRANSFORM_ITERATION` constant each time.

In [1]:
TRANSFORM_ITERATION=1

In [None]:
pred_file = './drive/My Drive/RecSys/val.{}.tsv'.format(TRANSFORM_ITERATION)
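The notebook does not show how the validation file was split. As a minimal sketch, assuming a plain line-wise split into contiguous parts, something like the following would produce files matching the `val.{}.tsv` pattern. The function name `split_file` and the demo file contents are hypothetical.

```python
import os
import tempfile


def split_file(src_path, n_parts, out_pattern):
    """Split src_path into n_parts contiguous files, preserving line order."""
    # First pass: count lines so the parts are as equal in size as possible.
    with open(src_path, "r", encoding="utf-8") as src:
        total = sum(1 for _ in src)
    chunk = -(-total // n_parts)  # ceiling division

    paths = []
    # Second pass: copy `chunk` lines into each output file in turn.
    with open(src_path, "r", encoding="utf-8") as src:
        for part in range(1, n_parts + 1):
            path = out_pattern.format(part)
            with open(path, "w", encoding="utf-8") as dst:
                for _ in range(chunk):
                    line = src.readline()
                    if not line:
                        break
                    dst.write(line)
            paths.append(path)
    return paths


# Demonstration on a throwaway file with 10 made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "val.tsv")
    with open(source, "w", encoding="utf-8") as handle:
        handle.writelines("row{}\n".format(i) for i in range(10))
    parts = split_file(source, 4, os.path.join(tmp, "val.{}.tsv"))
    sizes = []
    for part_path in parts:
        with open(part_path, "r", encoding="utf-8") as handle:
            sizes.append(sum(1 for _ in handle))

print(sizes)
```

Keeping the parts contiguous means the per-part outputs can later be concatenated in order to recover the original row order.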

## Install transformers (for BERT models)


In [None]:
!pip install transformers tqdm

## Install helpers from GitHub

To simplify this notebook several helper functions have been abstracted to separate python files in the git repo.

In [None]:
!rm -rf recsys-twitter helpers
!git clone https://github.com/Rheddes/recsys-twitter.git
!cp -r recsys-twitter/helpers helpers

## Nvidia stats & info

In [None]:
# !nvcc --version
!nvidia-smi

## Imports

In [None]:
import pandas as pd
import numpy as np
import gc
import torch
from helpers.dataset import PredictionDataset
from helpers.bert_functions import make_bert_model, get_bert_classification_vectors, create_attention_mask_from
from torch.utils.data import DataLoader
from tqdm import tqdm
import pickle

## Tell pytorch to use cuda if available

In [None]:
use_cuda = True

print("Cuda is available: ", torch.cuda.is_available())
device = torch.device("cuda:0" if use_cuda and torch.cuda.is_available() else "cpu")

print("using device: ", device)

## Load pretrained models

In [None]:
model = make_bert_model()
print('done')

# Read the desired dataset

This code creates dataset and loader objects that stream the dataset from disk, so that it does not occupy too much memory.

## Create dataset

Custom dataset type that iterates through the prediction file and also performs some preprocessing (see `helpers/dataset.py` for details).

In [None]:
iterable_dataset = PredictionDataset(pred_file, 512)
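The actual implementation lives in `helpers/dataset.py` and is not shown here. The sketch below only illustrates the general idea of streaming a file line by line and padding each token sequence to a fixed length. It assumes that fields are separated by `\x01`, that the first field holds tab-separated token ids, and that `0` is the padding id; the function name `iter_padded_tokens` is hypothetical.

```python
import os
import tempfile


def iter_padded_tokens(path, max_len=512, pad_id=0):
    """Yield one fixed-length list of token ids per line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            # Assumed layout: \x01 between fields, tabs between token ids.
            token_field = line.rstrip("\n").split("\x01")[0]
            ids = [int(tok) for tok in token_field.split("\t") if tok]
            ids = ids[:max_len]  # naive truncation; a real tokenizer may handle this differently
            yield ids + [pad_id] * (max_len - len(ids))


# Demonstration on a throwaway file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.tsv")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("101\t7592\t102\x01rest\n101\t102\x01rest\n")
    rows = list(iter_padded_tokens(sample, max_len=4))

print(rows)
```

Because the generator yields one record at a time, memory use stays roughly constant regardless of file size.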

## Read dataset with pandas

Read the `tweet_id` and engaging user id columns, which together form a primary key for every record.

In [None]:
all_features = ["text_tokens", "hashtags", "tweet_id", "present_media", "present_links", "present_domains",
                "tweet_type", "language", "tweet_timestamp", "engaged_with_user_id", "engaged_with_user_follower_count",
                "engaged_with_user_following_count", "engaged_with_user_is_verified",
                "engaged_with_user_account_creation", "enaging_user_id", "enaging_user_follower_count",
                "enaging_user_following_count",
                "enaging_user_is_verified", "enaging_user_account_creation", "engagee_follows_engager"]
selected_features = ['tweet_id', 'enaging_user_id']
unused_features = list(set(all_features) - set(selected_features))

validation = pd.read_csv(pred_file, header=None, sep="\x01")
validation.columns = all_features

for unused_feature in unused_features:
  del validation[unused_feature]
gc.collect()

np_tweet_ids = validation['tweet_id'].to_numpy()
np_engaging_ids = validation['enaging_user_id'].to_numpy()
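The cell above loads all 20 columns and then deletes 18 of them. As a lower-memory alternative, the two key columns can be read directly. The sketch below does this with the standard `csv` module, assuming the column order given in `all_features` (so `tweet_id` is column 2 and the engaging user id is column 14); the function name `read_keys` is hypothetical. pandas' `read_csv` also accepts a `usecols` argument that serves the same purpose.

```python
import csv
import os
import tempfile

TWEET_ID_COL = 2           # position of "tweet_id" in all_features
ENGAGING_USER_ID_COL = 14  # position of "enaging_user_id" in all_features


def read_keys(path):
    """Return a list of (tweet_id, engaging_user_id) pairs, one per record."""
    keys = []
    with open(path, "r", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters in tweets as ordinary text.
        reader = csv.reader(handle, delimiter="\x01", quoting=csv.QUOTE_NONE)
        for row in reader:
            keys.append((row[TWEET_ID_COL], row[ENGAGING_USER_ID_COL]))
    return keys


# Demonstration on a throwaway file with one made-up 20-column record.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.tsv")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("\x01".join("f{}".format(i) for i in range(20)) + "\n")
    keys = read_keys(sample)

print(keys)
```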

## Create loader

The loader reads records from the dataset and exposes them as an iterable of batches.

In [None]:
loader = DataLoader(iterable_dataset, batch_size=150)
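Conceptually, batching groups consecutive records from the stream into fixed-size chunks. The following is a simplified illustration of that idea in plain Python, not how `DataLoader` is implemented: the real loader additionally collates each batch into tensors. The helper name `batched` is hypothetical.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size consecutive items from iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven records in batches of three: the last batch is smaller.
batches = list(batched(range(7), 3))
print(batches)
```

With the default `drop_last=False`, `DataLoader` likewise keeps the final, smaller batch.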

# Model 1: (distil)BERT

This model transforms the ordered list of BERT token ids into a feature vector on which we can use regular classifiers (e.g. logistic regression or kNN).

### Clean GPU memory

After running the model, some data is left in GPU memory. This cell attempts to free as much of it as possible, although the cleanup is not perfect.

In [None]:
# Clean GPU cache
if use_cuda:
  gc.collect()
  torch.cuda.empty_cache()

## Run model on DataLoader (automatically batched)

In order to work on larger datasets we can work in batches.

Indices:
```
TOKENS_INDEX = 0
REPLIED_INDEX = 1
RETWEETED_INDEX = 2
RETWEETED_WITH_COMMENT_INDEX = 3
LIKE_INDEX = 4
```

In [None]:
features = None
with torch.no_grad():
  # Move the model to the selected device (falls back to CPU when CUDA is unavailable)
  model.to(device)
  for batch in tqdm(loader):
    batch_ids = batch[0]    # Input text_tokens
    mask = create_attention_mask_from(batch_ids)

    batch_ids, mask = batch_ids.to(device), mask.to(device)

    last_hidden_states = model(batch_ids, attention_mask=mask)
    last_features = get_bert_classification_vectors(last_hidden_states, use_cuda)

    # Append this batch's feature vectors to the ones collected so far
    features = np.concatenate((features, last_features)) if features is not None else last_features

# One row per record: tweet id, engaging user id, then the BERT feature vector
export_features = np.c_[np_tweet_ids, np_engaging_ids, features]

# Open the file in binary write mode ('wb'); 'rb' opens it read-only and the dump fails
with open('bert_classification_val.p', 'wb') as f:
  pickle.dump(export_features, f)
!cp bert_classification_val.p  ./drive/My\ Drive/RecSys/bert_classification_val.p
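The helpers `create_attention_mask_from` and `get_bert_classification_vectors` are defined in the repository and not shown in this notebook. The sketch below shows, with plain lists, what such helpers typically compute. It rests on two assumptions: padding uses token id `0`, and the classification vector is the hidden state at position 0 (the `[CLS]` token). The function names here are hypothetical.

```python
def attention_mask(batch_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding, per sequence in the batch."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in batch_ids]


def cls_vectors(hidden_states):
    """Pick the hidden state of the first token of every sequence."""
    return [sequence[0] for sequence in hidden_states]


# Two made-up padded sequences of length 5.
ids = [[101, 7592, 102, 0, 0], [101, 2088, 999, 102, 0]]
mask = attention_mask(ids)
print(mask)

# Made-up hidden states: 2 sequences x 3 positions x 2 dimensions.
hidden = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
          [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]]
vectors = cls_vectors(hidden)
print(vectors)
```

The mask tells the model to ignore padded positions, so sequences of different lengths can share one batch.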

### Done transforming data

Done for now. The generated features can easily be loaded into other models.
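As a rough sketch of how a downstream model might consume the export, the rows can be loaded with `pickle` and indexed by their primary key. The example uses plain lists with the same column layout (tweet id, engaging user id, then feature values) so that it runs without NumPy; the ids, values, file name, and the helper `index_features` are all made up for illustration.

```python
import os
import pickle
import tempfile


def index_features(rows):
    """Map (tweet_id, engaging_user_id) to the feature values of each row."""
    return {(row[0], row[1]): list(row[2:]) for row in rows}


# Made-up rows in the exported layout: two key columns, then features.
rows = [["tweet_a", "user_1", 0.1, 0.2], ["tweet_b", "user_2", 0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.p")
    with open(path, "wb") as handle:   # binary write mode
        pickle.dump(rows, handle)
    with open(path, "rb") as handle:   # binary read mode
        loaded = pickle.load(handle)

lookup = index_features(loaded)
print(lookup[("tweet_a", "user_1")])
```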