<a href="https://colab.research.google.com/github/Rheddes/recsys-twitter/blob/master/recsys_twitter_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Necessary imports & definitions

Optionally copy the files from Drive to the local disk. This is not required; it is also possible to work directly from Drive.

In [0]:
# !cp ./drive/My\ Drive/RecSys/train_updated.tsv train_updated.tsv
# !cp ./drive/My\ Drive/RecSys/sample.tsv sample.tsv 

Set the train file variable to the correct path.

In [0]:
train_file = './drive/My Drive/RecSys/sample.tsv'

## Install transformers (for BERT models)


In [0]:
!pip install transformers

## Nvidia stats & info

In [2]:
# !nvcc --version
!nvidia-smi

Sun Apr 26 14:59:53 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

## Imports

In [0]:
import pandas as pd
import numpy as np
import csv
import torch
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from helpers.dataset import MyIterableDataset
from helpers.bert_functions import make_bert_model, get_bert_classification_vectors, create_attention_mask_from
from torch.utils.data import DataLoader
from itertools import islice

## Tell PyTorch to use CUDA if available

In [11]:
use_cuda = True

print("Cuda is available: ", torch.cuda.is_available())
device = torch.device("cuda:0" if use_cuda and torch.cuda.is_available() else "cpu")

print("using device: ", device)

Cuda is available:  True
using device:  cuda:0


## Load pretrained models

In [13]:
model = make_bert_model()
print('done')

done


# Read the desired dataset

This piece of code wraps the desired dataset in a `MyIterableDataset` and a `DataLoader` that yields batches of 12 rows.

In [5]:
iterable_dataset = MyIterableDataset('../data/sample.tsv')
loader = DataLoader(iterable_dataset, batch_size=12)

done


# Model 1: (distil)BERT

This model transforms the ordered list of BERT token ids into a feature vector on which we can use regular classifiers (e.g. logistic regression or kNN).
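
As an illustration of what "feature vector" means here: a BERT-style model returns one hidden vector per token, and a common choice is to keep only the vector at position 0, the `[CLS]` token. We assume `get_bert_classification_vectors` does something similar, but its source is not shown in this notebook. The sketch below uses a random NumPy array as a stand-in for the model output, and `cls_vectors` is a hypothetical name.

```python
import numpy as np

def cls_vectors(last_hidden_state):
    """Return the hidden vector of the first token for every sequence.

    Assumes shape (batch, seq_len, hidden), with [CLS] at position 0.
    """
    return last_hidden_state[:, 0, :]

# Stand-in for a model output: 12 tweets, 16 tokens each, 768 hidden units.
rng = np.random.default_rng(0)
fake_hidden = rng.normal(size=(12, 16, 768))
features_demo = cls_vectors(fake_hidden)
print(features_demo.shape)  # (12, 768)
```

Each row of `features_demo` is then a fixed-size vector that a regular classifier can consume, regardless of tweet length.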

### Clean GPU memory

After running the model, some data is left in GPU memory. The cell below attempts to free as much of it as possible, although the cleanup is certainly not perfect.

In [0]:
# Clean GPU cache
if use_cuda:
  torch.cuda.empty_cache()

## Run model on DataLoader (automatically batched)

To work with larger datasets, we process the data in batches.

In [None]:
features = None
labels = None
with torch.no_grad():
  # Move the model to the selected device (GPU if available, otherwise CPU)
  model.to(device)
  # Only process the first 2 batches; enumerate supplies the batch index
  for index, batch in enumerate(islice(loader, 2)):
    batch_ids = batch[0]
    batch_labels = batch[4]
    mask = create_attention_mask_from(batch_ids)

    # Inputs must live on the same device as the model
    batch_ids, mask = batch_ids.to(device), mask.to(device)

    last_hidden_states = model(batch_ids, attention_mask=mask)
    last_features = get_bert_classification_vectors(last_hidden_states)
    # Append this batch's features and labels to the running arrays
    features = np.concatenate((features, last_features)) if features is not None else last_features
    labels = np.concatenate((labels, batch_labels)) if labels is not None else batch_labels
    print(index)
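
`create_attention_mask_from` lives in the repository's helpers and is not shown here. A minimal sketch of what such a helper typically does, assuming sequences are right-padded with token id 0 (the `[PAD]` id in the common BERT vocabularies): mark real tokens with 1 and padding with 0 so the model ignores the padding. `attention_mask_from` and `pad_id` are hypothetical names.

```python
import numpy as np

def attention_mask_from(token_ids, pad_id=0):
    """Return 1 where token_ids holds a real token and 0 where it holds padding."""
    return (np.asarray(token_ids) != pad_id).astype(np.int64)

# Two sequences padded to length 5; 101 and 102 stand for [CLS] and [SEP].
batch = [[101, 7592, 2088, 102, 0],
         [101, 7592, 102, 0, 0]]
print(attention_mask_from(batch))
# [[1 1 1 1 0]
#  [1 1 1 0 0]]
```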

### Get output from model

Check that the number of feature vectors matches the number of labels.

In [None]:
print(len(features))
print(len(labels))
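
Calling `np.concatenate` inside the loop copies everything collected so far on every batch. For larger runs, a common alternative is to append each batch to a list and concatenate once at the end; the length check then becomes a single assertion. The sketch below uses made-up arrays in place of real BERT output, and `collect` is a hypothetical helper.

```python
import numpy as np

def collect(batches):
    """Stack (features, labels) pairs from an iterable of batches."""
    feature_parts, label_parts = [], []
    for batch_features, batch_labels in batches:
        feature_parts.append(np.asarray(batch_features))
        label_parts.append(np.asarray(batch_labels))
    features = np.concatenate(feature_parts)
    labels = np.concatenate(label_parts)
    # Every feature vector must have exactly one label.
    assert len(features) == len(labels)
    return features, labels

# Two fake batches of 12 items with 768-dimensional features.
fake_batches = [(np.zeros((12, 768)), np.ones(12, dtype=int)) for _ in range(2)]
all_features, all_labels = collect(fake_batches)
print(all_features.shape, all_labels.shape)  # (24, 768) (24,)
```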

# Model 2: Logistic regression classifier

Now that we have the output from the BERT model, we can train our logistic regression classifier to actually classify tweet engagements.

First we split our data into a train set and a test set.

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

Next, we train our logistic regression classifier.

In [20]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

# Evaluating classifier

Now that we have our trained classifier, let's see how it performs.

In [21]:
lr_clf.score(test_features, test_labels)

0.3333333333333333

Let's compare that to a dummy classifier.

In [22]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.600 (+/- 0.75)


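With the outputs above, the logistic regression classifier scores 0.333 on the test set while the dummy classifier averages 0.600 in cross-validation, so on this run we do not beat the dummy classifier. We only processed 2 batches of 12 samples, so both scores are very noisy (note the +/- 0.75 on the dummy score), and they are not computed on the same data: one uses the held-out test set, the other cross-validation on the training set.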
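
For a like-for-like comparison, the baseline and the classifier should be scored on the same held-out labels. Below is a minimal sketch of a majority-class baseline in plain Python, which is one of several strategies a dummy classifier can use. The labels are invented for illustration and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Invented labels: 1 = engagement, 0 = no engagement.
demo_train = [0, 0, 0, 1, 0, 1, 0, 0]
demo_test = [0, 1, 0, 0]
print(majority_baseline_accuracy(demo_train, demo_test))  # 0.75
```

Comparing `lr_clf.score(test_features, test_labels)` against this number on the same `test_labels` would show whether the BERT features add anything over guessing the most common class.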