<a href="https://colab.research.google.com/github/Rheddes/recsys-twitter/blob/master/recsys_twitter_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Necessary imports & definitions

Mount Google Drive. Copying files from Drive to the local disk is optional; it is also possible to work directly from Drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

## Get validation set

Get the validation/prediction set from the challenge.

In [0]:
!wget -O val.tsv "https://elasticbeanstalk-us-west-2-800068098556.s3.amazonaws.com/challenge-website/public_data/val.tsv?AWSAccessKeyId=AKIA3UR6GLH6F73MJVWF&Signature=uURRfbcpN3%2BW7tWrUaL6Av8ZX5c%3D&Expires=1588061161"
!cp val.tsv ./drive/My\ Drive/RecSys/val.tsv

In [0]:
# !cp ./drive/My\ Drive/RecSys/train_updated.tsv train_updated.tsv
# !cp ./drive/My\ Drive/RecSys/sample.tsv sample.tsv 
# !cp bert_classification_features.csv  ./drive/My\ Drive/RecSys/bert_22500.csv

Set the file path variables to the correct locations.

In [0]:
# train_file = './drive/My Drive/RecSys/sample.tsv'
# pred_file = './val.tsv'

# OR IF USING THE ONE FROM DRIVE
pred_file = './drive/My Drive/RecSys/val.tsv'

## Install transformers (for BERT models)


In [0]:
!pip install transformers tqdm

## Install helpers from GitHub

To simplify this notebook, several helper functions have been moved to separate Python files in the git repo.

In [4]:
!rm -rf recsys-twitter helpers
!git clone https://github.com/Rheddes/recsys-twitter.git
!cp -r recsys-twitter/helpers helpers

Cloning into 'recsys-twitter'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 62 (delta 31), reused 33 (delta 11), pack-reused 0
Unpacking objects: 100% (62/62), done.


## Nvidia stats & info

In [0]:
# !nvcc --version
!nvidia-smi

## Imports

In [0]:
import pandas as pd
import numpy as np
import csv
import math
import gc
import torch
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from helpers.dataset import MyIterableDataset, PredictionDataset
from helpers.bert_functions import make_bert_model, get_bert_classification_vectors, create_attention_mask_from
from torch.utils.data import DataLoader
from itertools import islice
from tqdm import tqdm

## Tell pytorch to use cuda if available

In [12]:
use_cuda = True

print("Cuda is available: ", torch.cuda.is_available())
device = torch.device("cuda:0" if use_cuda and torch.cuda.is_available() else "cpu")

print("using device: ", device)

Cuda is available:  True
using device:  cuda:0


## Load pretrained models

In [7]:
model = make_bert_model()
print('done')

done


# Read the desired dataset

This code creates dataset and loader objects that stream the dataset from disk, so that it does not occupy too much memory.

## Create dataset

Custom dataset type to iterate through the prediction file; it also performs some preprocessing (see `helpers/dataset.py` for details).

In [0]:
iterable_dataset = PredictionDataset(pred_file, 512)

## Create loader

The loader reads batches from the dataset and yields them as an iterable.

In [0]:
loader = DataLoader(iterable_dataset, batch_size=150)
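
The idea of stream reading can be illustrated without PyTorch. The following is a minimal sketch, not the code in `helpers/dataset.py`: the function `iter_batches` and the in-memory `fake_file` are hypothetical, and the `\x01` separator is taken from the `read_csv` call used later in this notebook.

```python
import csv
import io
from itertools import islice

def iter_batches(lines, batch_size, sep="\x01"):
    """Yield lists of parsed rows, pulling at most batch_size rows at a time."""
    reader = csv.reader(lines, delimiter=sep)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical in-memory stand-in for a large file on disk.
fake_file = io.StringIO("\n".join(f"tweet{i}\x01user{i}" for i in range(5)))
batches = list(iter_batches(fake_file, batch_size=2))
batch_sizes = [len(b) for b in batches]
```

Only one batch of rows is held in memory at a time, which is the property the dataset and loader above rely on.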

# Model 1: (distil)BERT

This model transforms the ordered list of BERT token ids into a feature vector on which we can use regular classifiers (e.g. logistic regression or kNN).
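
A common way to obtain such a feature vector is to keep only the hidden state at the first position, where BERT-style models place the `[CLS]` token. The sketch below assumes that `get_bert_classification_vectors` does something similar; the helper's actual implementation lives in the repo and is not shown here. The random array is a hypothetical stand-in for a model output, and the hidden size of 768 is an assumption.

```python
import numpy as np

# Hypothetical stand-in for a model output: (batch, sequence_length, hidden_size).
batch_size, seq_len, hidden_size = 3, 8, 768
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep only the hidden state at position 0 of every sequence.
cls_vectors = last_hidden_state[:, 0, :]
```

Each tweet is then represented by one fixed-length row of `cls_vectors`, regardless of how many tokens it contained.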

### Clean GPU memory

After running the model, some data remains in GPU memory. This cell attempts to free as much of it as possible, although the cleanup is certainly not perfect.

In [0]:
# Clean GPU cache
if use_cuda:
  gc.collect()
  torch.cuda.empty_cache()

## Run model on DataLoader (automatically batched)

In order to work on larger datasets we can work in batches.

Indices:
```
TOKENS_INDEX = 0
REPLIED_INDEX = 1
RETWEETED_INDEX = 2
RETWEETED_WITH_COMMENT_INDEX = 3
LIKE_INDEX = 4
```

In [0]:
features = None
with torch.no_grad():  # inference only, no gradients needed
  model.to(device)  # uses the CPU when CUDA is unavailable
  for batch in tqdm(loader):
    batch_ids = batch[0]    # Input text_tokens (TOKENS_INDEX)
    mask = create_attention_mask_from(batch_ids)

    # Move the inputs to the same device as the model
    batch_ids, mask = batch_ids.to(device), mask.to(device)

    last_hidden_states = model(batch_ids, attention_mask=mask)
    last_features = get_bert_classification_vectors(last_hidden_states, use_cuda)

    # Append this batch's feature vectors to the running array
    features = np.concatenate((features, last_features)) if features is not None else last_features

pd.DataFrame(features).to_csv("./bert_classification_val.csv")
!cp bert_classification_val.csv  ./drive/My\ Drive/RecSys/bert_classification_val.csv
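
The attention mask tells the model which positions hold real tokens and which are padding. The sketch below shows the usual construction under the assumption that padded positions hold token id 0; `attention_mask_sketch` is a hypothetical name and is not the `create_attention_mask_from` helper from the repo.

```python
import numpy as np

PAD_ID = 0  # assumption: padded positions hold token id 0

def attention_mask_sketch(batch_ids):
    """Return 1 where a real token is present and 0 at padded positions."""
    return (np.asarray(batch_ids) != PAD_ID).astype(np.int64)

# Two hypothetical token id sequences, the first padded to length 5.
batch_ids = [[101, 7592, 102, 0, 0],
             [101, 2023, 2003, 1037, 102]]
mask = attention_mask_sketch(batch_ids)
```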

### Done transforming data

Done for now. The generated features can easily be loaded into other models.

### Debugging

Code used to inspect output from BERT model.

In [0]:
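# Requires train_file to be set (it is commented out in the path cell above).
# Note: if the CSV was written by to_csv with its default index, the first column read here is that index.
features = pd.read_csv('./bert_classification_features.csv').values
# Column 23 is NaN when there was no engagement: map NaN to 0 and anything else to 1.
labels = pd.read_csv(train_file, header=None, sep="\x01")[23].apply(lambda x: 1 if not math.isnan(x) else 0)[:len(features)]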
print(len(labels))

225000
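
The label rule used in the cell above can be illustrated on a few values. This is a small sketch with hypothetical timestamps; `engagement_label` and `example_labels` are illustrative names that do not appear elsewhere in the notebook.

```python
import math

def engagement_label(timestamp):
    """Return 1 if an engagement timestamp is present, 0 if it is NaN."""
    return 0 if math.isnan(timestamp) else 1

# Hypothetical column values: a float timestamp means the engagement happened.
timestamps = [1581262691.0, float("nan"), float("nan"), 1581264000.0]
example_labels = [engagement_label(t) for t in timestamps]
```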


In [0]:
print(len(features))
print(len(labels))

# print(features)
# print(labels)

225000
225000


# Model 2: Logistic regression classifier
Now that we have the output from the BERT model, we can train a logistic regression classifier to classify tweet engagements.

First we split our data into a train and a test set.

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

Next we train our logistic regression classifier.

In [0]:
lr_clf = LogisticRegression(C=100.0)
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Evaluating classifier

Now that we have our trained classifier, let's see how it performs.

In [0]:
lr_clf.score(test_features, test_labels)

0.5986133333333333

# Model 2b - SVM

Let's classify the same features with a Support Vector Machine.

In [0]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LinearSVC(), parameters)
grid_search.fit(train_features[:200], train_labels[:200])

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

best parameters:  {'C': 100.0}
best score:  0.5349999999999999
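
The grid above is linearly spaced, so almost all candidates are large values of `C`. Regularization strengths are often searched on a logarithmic grid instead. The sketch below only compares the two grids; the logarithmic grid is one common alternative and is not what this notebook uses.

```python
import numpy as np

linear_grid = np.linspace(0.0001, 100, 20)   # the grid used in the cell above
log_grid = np.logspace(-4, 2, 7)             # 1e-4, 1e-3, ..., 1e2

# Count how many candidates fall below C = 1 in each grid.
linear_below_one = int((linear_grid < 1).sum())
log_below_one = int((log_grid < 1).sum())
```

With the linear grid only the first candidate is below 1, whereas the logarithmic grid spreads its candidates evenly across orders of magnitude.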




In [0]:
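# Note: C=1000.0 lies outside the grid searched above, whose best value (C=100.0) was its upper bound.
svm_clf = LinearSVC(C=1000.0)
svm_clf.fit(train_features, train_labels)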




LinearSVC(C=1000.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [0]:
from sklearn import linear_model
# Ridge regression with built-in cross-validation over alpha, fitted on the 0/1 labels
reg = linear_model.RidgeCV()
reg.fit(train_features, train_labels)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=None, fit_intercept=True,
        gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)

In [0]:
# RidgeCV is a regressor, so score() returns R^2 rather than accuracy
reg.score(test_features, test_labels)

0.09021386654969878

## Evaluating SVM classifier

Let's see how it performs.

In [0]:
svm_clf.score(test_features, test_labels)

0.43196444444444443

# Dummy compare

Let's compare that to a dummy classifier.

In [0]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))



Dummy classifier score: 0.522 (+/- 0.01)


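So the logistic regression classifier (0.599 on the test split) currently performs about 8 percentage points better than the dummy classifier (0.522, cross-validated on the training split). The linear SVM (0.432) does not beat the dummy classifier.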
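
The comparison can be made explicit with a little arithmetic on the scores printed above. This is a small sketch; `gain_over_baseline` is a hypothetical helper name. Note that the dummy score is a cross-validated mean on the training split, while the other two scores come from the held-out split, so the comparison is only approximate.

```python
lr_accuracy = 0.5986133333333333     # logistic regression, held-out split
svm_accuracy = 0.43196444444444443   # LinearSVC, held-out split
dummy_accuracy = 0.522               # DummyClassifier, cross-validated mean

def gain_over_baseline(score, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (score - baseline) * 100
    relative = (score - baseline) / baseline * 100
    return absolute, relative

lr_abs, lr_rel = gain_over_baseline(lr_accuracy, dummy_accuracy)
svm_abs, svm_rel = gain_over_baseline(svm_accuracy, dummy_accuracy)
print(f"logistic regression: {lr_abs:+.1f} points ({lr_rel:+.1f}%)")
print(f"linear SVM:          {svm_abs:+.1f} points ({svm_rel:+.1f}%)")
```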