<a href="https://colab.research.google.com/github/Rheddes/recsys-twitter/blob/master/recsys_twitter_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Necessary imports & definitions

Optionally copy the data files from Google Drive to the local disk. This step is not required, since the notebook can also read directly from Drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# !cp ./drive/My\ Drive/RecSys/train_updated.tsv train_updated.tsv
# !cp ./drive/My\ Drive/RecSys/sample.tsv sample.tsv 

Set the `train_file` variable to the correct path.

In [0]:
train_file = './drive/My Drive/RecSys/sample.tsv'

## Install transformers (for BERT models)


In [0]:
!pip install transformers

## Install helpers from GitHub

To keep this notebook short, several helper functions have been moved to separate Python files in the Git repository.

In [0]:
#!rm -rf recsys-twitter helpers
!git clone https://github.com/Rheddes/recsys-twitter.git
!cp -r recsys-twitter/helpers helpers

## Nvidia stats & info

In [0]:
# !nvcc --version
!nvidia-smi

## Imports

In [0]:
import pandas as pd
import numpy as np
import csv
import gc
import torch
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from helpers.dataset import MyIterableDataset
from helpers.bert_functions import make_bert_model, get_bert_classification_vectors, create_attention_mask_from
from torch.utils.data import DataLoader
from itertools import islice

## Tell PyTorch to use CUDA if available

In [6]:
use_cuda = True

print("Cuda is available: ", torch.cuda.is_available())
device = torch.device("cuda:0" if use_cuda and torch.cuda.is_available() else "cpu")

print("using device: ", device)

Cuda is available:  True
using device:  cuda:0


## Load pretrained models

In [7]:
model = make_bert_model()
print('done')


done


# Read the desired dataset

The code below creates dataset and loader objects that stream the dataset from disk, so that it does not occupy too much memory.

## Create dataset

A custom dataset type iterates through the training file and also performs some preprocessing (see `helpers/dataset.py` for details).

In [0]:
iterable_dataset = MyIterableDataset(train_file)

## Create loader

The loader reads batches from the dataset and exposes them as an iterable.

In [0]:
loader = DataLoader(iterable_dataset, batch_size=100)
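
As a rough illustration of the streaming idea, the sketch below reads a tab-separated file row by row and groups the rows into fixed-size batches using only the standard library. It is not the implementation in `helpers/dataset.py`; the function names `stream_rows` and `batched` and the in-memory sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def stream_rows(handle, delimiter="\t"):
    """Yield one parsed row at a time so the whole file never sits in memory."""
    yield from csv.reader(handle, delimiter=delimiter)


def batched(rows, batch_size):
    """Group an iterator of rows into lists of at most batch_size items."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical in-memory stand-in for a TSV file with 5 rows.
fake_file = io.StringIO("\n".join(f"tweet{i}\t{i % 2}" for i in range(5)))

batches = list(batched(stream_rows(fake_file), batch_size=2))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

The last batch may be smaller than `batch_size`, which is also what `DataLoader` does unless `drop_last=True` is passed.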

# Model 1: (distil)BERT

This model transforms the ordered list of BERT token ids into a feature vector on which we can train regular classifiers (e.g. logistic regression or kNN).
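
The two helper steps around the model call can be pictured with plain NumPy. This is a sketch under assumptions, not the code in `helpers/bert_functions.py`: it assumes the padding token id is 0, that the attention mask marks non-padding positions, and that the feature vector is the hidden state at position 0 (the `[CLS]` token). The token ids and the random "hidden states" are made up for illustration.

```python
import numpy as np

PAD_ID = 0  # assumption: padding uses token id 0

# Two hypothetical token-id sequences, right-padded to length 5.
batch_ids = np.array([
    [101, 7592, 2088, 102, 0],
    [101, 7592, 102, 0, 0],
])

# 1 for real tokens, 0 for padding.
attention_mask = (batch_ids != PAD_ID).astype(np.int64)

# Random stand-in for the model output: (batch, sequence_length, hidden_size).
hidden_size = 768  # hidden size of the base DistilBERT checkpoints
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(*batch_ids.shape, hidden_size))

# Keep only the vector at position 0 of every sequence.
cls_vectors = last_hidden_state[:, 0, :]

print(attention_mask)
print(cls_vectors.shape)
```

Each tweet is thus reduced to a single fixed-length vector, regardless of how many tokens it contains.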

### Clean GPU memory

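After running the model, some tensors remain in GPU memory. The cell below attempts to free as much of it as possible, although it is certainly not perfect: `torch.cuda.empty_cache()` only releases cached memory that is no longer referenced by live tensors.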

In [0]:
# Clean GPU cache
if use_cuda:
  torch.cuda.empty_cache()

## Run model on DataLoader (automatically batched)

To handle larger datasets, we run the model in batches. The cell below processes only the first `number_of_iterations` batches.

In [15]:
features = None
labels = None
index = 0
number_of_iterations = 10
with torch.no_grad():
  model.to(device)  # `device` already falls back to the CPU when CUDA is unavailable
  for batch in islice(loader, number_of_iterations):
    batch_ids = batch[0]    # Input text_tokens
    batch_labels = batch[4] # Likes
    mask = create_attention_mask_from(batch_ids)

    # Move the inputs to the same device as the model
    batch_ids, mask = batch_ids.to(device), mask.to(device)

    last_hidden_states = model(batch_ids, attention_mask=mask)
    last_features = get_bert_classification_vectors(last_hidden_states, use_cuda)

    features = np.concatenate((features, last_features)) if features is not None else last_features
    labels = np.concatenate((labels, batch_labels)) if labels is not None else batch_labels

    print(index)
    index += 1

0
1
2
3
4
5
6
7
8
9
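
The loop above calls `np.concatenate` on every batch, which copies the whole accumulated array each time. For a larger number of batches, one possible alternative is to collect the per-batch arrays in lists and concatenate once at the end. The sketch below uses a hypothetical `fake_batches` generator with random data in place of the BERT output.

```python
import numpy as np

rng = np.random.default_rng(0)


def fake_batches(n_batches, batch_size=100, hidden_size=768):
    """Hypothetical stand-in for the (features, labels) pairs produced per batch."""
    for _ in range(n_batches):
        batch_features = rng.normal(size=(batch_size, hidden_size))
        batch_labels = rng.integers(0, 2, size=batch_size)
        yield batch_features, batch_labels


feature_chunks, label_chunks = [], []
for batch_features, batch_labels in fake_batches(10):
    feature_chunks.append(batch_features)
    label_chunks.append(batch_labels)

# A single copy at the end instead of one copy per batch.
features = np.concatenate(feature_chunks)
labels = np.concatenate(label_chunks)

print(features.shape, labels.shape)
```

With 10 batches of 100 rows the difference is negligible, but the per-batch variant does a quadratic amount of copying as the number of batches grows.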


### Debugging

Code used to inspect the output from the BERT model.

In [0]:
print(len(features))
print(len(labels))

print(features)
print(labels)

# Model 2: Logistic regression classifier
Now that we have the output of the BERT model, we can train a logistic regression classifier to predict tweet engagements (here: likes).

First, we split the extracted features into a training set and a test set.

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
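
By default `train_test_split` holds out 25% of the rows and shuffles them with a different seed on every run. If reproducible numbers are wanted, a variant such as the following could be used; it fixes `random_state` and uses `stratify` so both splits keep the same class balance. The data here is synthetic and only serves to make the sketch runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 60% negative and 40% positive labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 8))
labels = np.array([0] * 600 + [1] * 400)

train_features, test_features, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.25,     # same as the default, written out for clarity
    random_state=42,    # fixed seed so the split is reproducible
    stratify=labels,    # keep the class ratio identical in both splits
)

print(len(train_labels), len(test_labels), test_labels.mean())
```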

Next, we train the logistic regression classifier.

In [18]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
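
The warning in the output above suggests raising `max_iter` or scaling the data. One way to do both is sketched below: a pipeline that standardizes the features before fitting, with a higher iteration limit. The value `max_iter=1000` and the synthetic data from `make_classification` are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the BERT feature vectors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Standardize each feature, then fit logistic regression with a higher iteration cap.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# n_iter_ reports how many solver iterations were actually used.
n_iter = clf.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, also when the pipeline is passed to `cross_val_score`.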

# Evaluating classifier

Now that we have a trained classifier, let's see how it performs on the test set.

In [21]:
lr_clf.score(test_features, test_labels)

0.648

Let's compare that to a dummy classifier.

In [22]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.511 (+/- 0.14)
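
For a like-for-like comparison, the baseline can be scored on the same held-out split as the real model. The sketch below does this on synthetic data. It names the strategy explicitly (`"most_frequent"` is just one possible choice), because the default strategy of `DummyClassifier` has differed between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the BERT feature vectors and like labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The baseline always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both classifiers are scored on the same held-out rows.
baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"baseline: {baseline_acc:.3f}, model: {model_acc:.3f}")
```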




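So we currently score roughly 14 percentage points higher than the dummy classifier (0.648 vs. 0.511 accuracy). The comparison is only approximate, because the dummy score is a cross-validated average on the training split while the logistic regression score comes from the held-out test split.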