<a href="https://colab.research.google.com/github/Rheddes/recsys-twitter/blob/master/recsys_twitter_cpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Necessary imports & definitions

Mount Google Drive. Copying the files from Drive to the local disk (the commented-out `cp` commands below) is optional, since it is also possible to work directly from Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# !cp ./drive/My\ Drive/RecSys/train_updated.tsv train_updated.tsv
# !cp ./drive/My\ Drive/RecSys/sample.tsv sample.tsv 

Set the `train_file` variable to the correct path.

In [0]:
train_file = './drive/My Drive/RecSys/sample.tsv'

## Install transformers (for BERT models)


In [0]:
!pip install transformers

## Imports

In [0]:
import pandas as pd
import numpy as np
import csv
import math
import torch
import gc
import transformers as ppb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

## Load pretrained models

In [3]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-multilingual-cased')

## Want BERT instead of distilBERT? Uncomment the following line:
# model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-multilingual-cased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights, do_lower_case=False)
model = model_class.from_pretrained(
    pretrained_weights,
    output_attentions = False,
    output_hidden_states = True,
)
model.eval()

print('done')

done


# Read the desired dataset

This piece of code can be used to read the desired dataset into memory as a Pandas dataframe.

In [4]:
all_features = ["text_tokens", "hashtags", "tweet_id", "present_media", "present_links", "present_domains",
                "tweet_type", "language", "tweet_timestamp", "engaged_with_user_id", "engaged_with_user_follower_count",
                "engaged_with_user_following_count", "engaged_with_user_is_verified",
                "engaged_with_user_account_creation", "enaging_user_id", "enaging_user_follower_count",
                "enaging_user_following_count",
                "enaging_user_is_verified", "enaging_user_account_creation", "engagee_follows_engager",
                "reply_timestamp", "retweet_timestamp", "retweet_with_comment_timestamp", "like_timestamp"]
                
dataset = pd.read_csv(train_file, delimiter="\x01", encoding='utf-8', header=None)
dataset.columns = all_features
print("done")

done
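
To see how the `\x01` delimiter behaves without loading the full dataset, the following minimal sketch parses two made-up rows from an in-memory string. The three column names and all values are hypothetical stand-ins, not the full challenge schema, and `quoting=csv.QUOTE_NONE` is an assumption that fields are never quoted.

```python
import csv
import io

import pandas as pd

# Hypothetical miniature of the file: fields are separated by \x01,
# token IDs inside the first field are separated by tabs.
raw = "101\t56898\t137\x01tweet_a\x01\n101\t137\x01tweet_b\x011581262691\n"
columns = ["text_tokens", "tweet_id", "like_timestamp"]

toy = pd.read_csv(
    io.StringIO(raw),
    sep="\x01",
    header=None,
    names=columns,
    quoting=csv.QUOTE_NONE,  # assumption: fields are never quoted
)

print(toy.shape)
# An empty timestamp field is read as NaN, which later means "no engagement".
print(toy.like_timestamp.isna().tolist())
```

The tabs inside `text_tokens` survive because only `\x01` is treated as a field separator.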


# Model 1: (distil)BERT

This model transforms the ordered list of BERT IDs into a feature vector on which we can use regular classifiers (e.g. logistic regression or kNN).

## Prepare model for distilBERT

Make a small batch that easily fits in memory and includes only the necessary data. It becomes a DataFrame with two columns: `text_tokens`, which contains NumPy arrays of BERT IDs, and `like_timestamp`, which contains `1` if the tweet was liked and `0` if not.

### Create batch

To avoid running out of memory, we work on one small batch of the dataset at a time.

In [5]:
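# Take the first 200 rows and only the two columns we need.
# copy() gives an independent DataFrame, so the assignments below do not touch `dataset`.
batch_1 = dataset[:200][['text_tokens', 'like_timestamp']].copy()

# Parse the tab-separated BERT IDs into an integer array.
batch_1.text_tokens = batch_1.text_tokens.apply(lambda x: np.fromstring(x, dtype=int, sep="\t"))
# A missing like_timestamp (NaN) means the tweet was not liked.
batch_1.like_timestamp = batch_1.like_timestamp.apply(lambda x: 0 if math.isnan(x) else 1)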

print(batch_1)

                                           text_tokens  like_timestamp
0    [101, 56898, 137, 33909, 10107, 11490, 10288, ...               0
1    [101, 137, 74039, 12436, 12396, 12436, 27746, ...               1
2    [101, 13229, 21885, 10681, 10380, 31747, 71309...               1
3    [101, 14820, 100, 188, 83279, 10142, 10751, 10...               0
4    [101, 56898, 137, 12001, 10731, 20498, 20467, ...               0
..                                                 ...             ...
195  [101, 56898, 137, 156, 11703, 10162, 11447, 34...               0
196  [101, 56898, 137, 58768, 28558, 68748, 131, 12...               0
197  [101, 11518, 45632, 10192, 10312, 43330, 10107...               1
198  [101, 59533, 12028, 89512, 184, 11703, 10164, ...               0
199  [101, 1894, 5900, 2226, 2179, 100775, 3365, 20...               1

[200 rows x 2 columns]
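
The notebook only processes the first batch. As a minimal sketch of how the same slicing could be repeated over a whole DataFrame, the hypothetical helper `iter_batches` below yields consecutive slices; it is not part of the original notebook, and the 450-row frame is a stand-in for the real dataset.

```python
import pandas as pd


def iter_batches(frame, batch_size):
    """Yield consecutive row slices of `frame`, each at most `batch_size` rows."""
    for start in range(0, len(frame), batch_size):
        yield frame.iloc[start:start + batch_size]


# Hypothetical stand-in for the real dataset: 450 rows.
toy = pd.DataFrame({"row": range(450)})
sizes = [len(batch) for batch in iter_batches(toy, 200)]
print(sizes)  # [200, 200, 50]
```

Each yielded slice could then go through the same parsing, padding and masking steps as `batch_1`.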


### Optional

Get some stats about the batch

In [6]:
batch_1.like_timestamp.value_counts()

0    121
1     79
Name: like_timestamp, dtype: int64

### Pad text tokens vector

For the BERT model to work with the dataset, all rows of the input matrix must have the same size, so all `text_tokens` arrays have to be padded to the same length. We therefore first calculate the maximum array length in the `text_tokens` column.

In [7]:
max_len = max(len(i) for i in batch_1.text_tokens.values)

print(max_len)

125


With `max_len` we can pad all arrays in `text_tokens` to the same length and export the result to a 2D NumPy array.

In [8]:
padded = np.array([np.concatenate([i, np.zeros(max_len-len(i), dtype=int)]) for i in batch_1.text_tokens.values])
print(padded)


[[  101 56898   137 ...     0     0     0]
 [  101   137 74039 ...     0     0     0]
 [  101 13229 21885 ...     0     0     0]
 ...
 [  101 11518 45632 ...     0     0     0]
 [  101 59533 12028 ...     0     0     0]
 [  101  1894  5900 ...     0     0     0]]


In [9]:
np.array(padded).shape

(200, 125)
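
The padding step can be checked on a toy input. This is a minimal sketch under the assumption that `0` is the padding ID, as in the cell above; `pad_to_matrix` and the three short sequences are hypothetical and only serve as illustration.

```python
import numpy as np


def pad_to_matrix(sequences, pad_id=0):
    """Right-pad variable-length ID arrays with `pad_id` into one 2D array."""
    max_len = max(len(seq) for seq in sequences)
    matrix = np.full((len(sequences), max_len), pad_id, dtype=int)
    for row, seq in enumerate(sequences):
        matrix[row, :len(seq)] = seq
    return matrix


# Hypothetical token sequences of lengths 3, 2 and 5.
toy_tokens = [
    np.array([101, 7, 102]),
    np.array([101, 102]),
    np.array([101, 5, 6, 7, 102]),
]
toy_padded = pad_to_matrix(toy_tokens)
print(toy_padded.shape)  # (3, 5)
print(toy_padded)
```

Preallocating the matrix avoids building one temporary array per row, which the list-comprehension version does; for 200 rows the difference is negligible.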

### Masking
If we sent `padded` to BERT directly, it would treat the padding as real input. We therefore create another variable that tells the model to ignore (mask) the padding we added when it processes its input. That is what `attention_mask` is: 1 for real tokens and 0 for padding.

In [10]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(200, 125)
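
A small worked example makes the mask concrete. The two rows below are hypothetical; the sketch assumes, as above, that padding is encoded as `0`.

```python
import numpy as np

# Two hypothetical padded rows: the first has 3 real tokens, the second 5.
toy_padded = np.array([
    [101, 7, 102, 0, 0],
    [101, 5, 6, 7, 102],
])

# 1 where there is a real token, 0 where there is padding.
toy_mask = (toy_padded != 0).astype(int)
print(toy_mask)

# Summing each row of the mask recovers the original sequence lengths.
lengths = toy_mask.sum(axis=1)
print(lengths)  # [3 5]
```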

## Run model to get hidden states

Running the model yields a 768-dimensional vector for each row in the dataset.

### Load tensors and Run

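This creates the tensors from the input data and runs the model. In this CPU version of the notebook everything stays on the CPU. Gradients are disabled with `torch.no_grad()` because we only need the forward pass.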

In [0]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
  last_hidden_states = model(input_ids, attention_mask=attention_mask)
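
The cell above runs on the CPU. If a GPU is available, the model and tensors could be moved to it first. The following is a minimal sketch of that pattern, assuming a small `torch.nn.Linear` layer as a hypothetical stand-in for the BERT model so that it runs without downloading weights.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the BERT model.
toy_model = torch.nn.Linear(4, 2).to(device)
toy_model.eval()

# Inputs must live on the same device as the model.
toy_inputs = torch.ones(3, 4).to(device)

with torch.no_grad():
    toy_outputs = toy_model(toy_inputs)

# Bring the result back to the CPU before converting it to NumPy.
toy_features = toy_outputs.cpu().numpy()
print(toy_features.shape)  # (3, 2)
```

In the notebook this would correspond to calling `.to(device)` on `model`, `input_ids` and `attention_mask`, and `.cpu()` on the output before `.numpy()`.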

### Get output from model

Next we extract the classification features from the BERT output as a 2D matrix: for every row we keep the last-layer hidden state at position 0, which belongs to the `[CLS]` token. For more details on this see: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

In [12]:
features = last_hidden_states[0][:,0,:].numpy()
print(features)

[[ 0.17406203 -0.0433972  -0.18433005 ...  0.5506898   0.18670747
  -0.198578  ]
 [-0.11964368 -0.26719     0.14500006 ...  0.30625767  0.03193825
  -0.04966999]
 [ 0.10105906 -0.01484996  0.16492397 ...  0.29651228  0.28076926
   0.03425452]
 ...
 [-0.13193937  0.15147121  0.02599217 ...  0.28464457  0.00857005
  -0.0662995 ]
 [ 0.06861539 -0.06619139 -0.10860651 ...  0.31563112  0.14870207
  -0.04414   ]
 [-0.2608274  -0.06339766  0.00664555 ...  0.30480263  0.01772955
  -0.18186548]]
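
The slice `[:, 0, :]` is easier to follow on a small array. This sketch uses a NumPy array with toy sizes as a stand-in for the model output; the real hidden size is 768 and the real values are floats.

```python
import numpy as np

# Toy sizes: 2 rows, 4 token positions, 3 hidden units.
batch_size, seq_len, hidden_size = 2, 4, 3
toy_hidden = np.arange(batch_size * seq_len * hidden_size).reshape(
    batch_size, seq_len, hidden_size
)

# Keep every row, only position 0 (the [CLS] token), and all hidden units.
toy_cls = toy_hidden[:, 0, :]
print(toy_cls.shape)  # (2, 3)
print(toy_cls)
```

The result has one row per input and one column per hidden unit, which is the shape a scikit-learn classifier expects.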


Now that we have the feature set, it is good to construct our label set as well.

In [0]:
labels = batch_1.like_timestamp

In [15]:
print(len(features))
print(len(labels))

200
200


In [20]:
print(labels)

0      0
1      1
2      1
3      0
4      0
      ..
195    0
196    0
197    1
198    0
199    1
Name: like_timestamp, Length: 200, dtype: int64


# Model 2: Logistic regression classifier

Now that we have the output from the BERT model, we can train a logistic regression classifier to classify tweet engagements.

First we split our data into a training set and a test set.

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

Next we train our logistic regression classifier.

In [22]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

# Evaluating classifier

Now that we have our trained classifier, let's see how it performs on the test set.

In [23]:
lr_clf.score(test_features, test_labels)

0.6

Let's compare that to a dummy classifier

In [24]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.493 (+/- 0.10)
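
Another reference point is the accuracy of always predicting the most frequent class. The sketch below derives it from the class counts printed for `batch_1` earlier (121 not liked, 79 liked). It is computed on the whole batch rather than on the test split, so it is only a rough guide.

```python
import numpy as np

# Class counts reported for batch_1 above: 121 not liked, 79 liked.
counts = np.array([121, 79])

# Accuracy obtained by always predicting the majority class.
majority_baseline = counts.max() / counts.sum()
print(round(majority_baseline, 3))  # 0.605
```

A classifier has to beat this number, not just a random guesser, before its predictions add information on an imbalanced batch.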


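So we currently score about 10 percentage points higher than the dummy classifier (0.60 on the test set versus 0.49 cross-validated on the training set). With only 50 tweets in the test split, this is a rough indication rather than a reliable estimate.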


# Other stuff (utilities) - not needed to execute

These utilities were needed to prepare the data for reading (downloading, cleaning, partitioning and sampling). They were useful earlier but do not have to be run to use the models above.


## Downloading and uploading to GCP

The following code blocks download training data from, and upload it to, the GCP storage bucket.

In [0]:
# Setup GCP credentials
project_id = 'positive-nuance-274811'
bucket_name = 'recsys-twitter-challenge-us-jku-quarantwits'

from google.colab import auth
auth.authenticate_user()

!gcloud config set project {project_id}

In [0]:
# Download training data
!gsutil cp gs://{bucket_name}/train_updated.tsv train_updated.tsv

In [0]:
# Upload partitioned training data to GCP

# !gsutil cp train.001.tsv gs://{bucket_name}/train.001.tsv
# !gsutil cp train.002.tsv gs://{bucket_name}/train.002.tsv
# !gsutil cp train.003.tsv gs://{bucket_name}/train.003.tsv
# !gsutil cp train.004.tsv gs://{bucket_name}/train.004.tsv
# !gsutil cp train.005.tsv gs://{bucket_name}/train.005.tsv
# !gsutil cp train.006.tsv gs://{bucket_name}/train.006.tsv

# !gsutil cp train_updated.tsv gs://{bucket_name}/train_updated.tsv

## Download deleted tweet & user IDs

Tweets and users can be removed over time, and we have to remove the corresponding tweet engagements to adhere to GDPR. This piece of code downloads all these IDs so that they can be removed from our dataset in the next step.

In [0]:
# !wget -O tweetIDs_1.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_engaged_with_tweet_id/2020/03/01'
# !wget -O tweetIDs_2.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_engaged_with_tweet_id/2020/03/08'
# !wget -O tweetIDs_3.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_engaged_with_tweet_id/2020/03/15'
# !wget -O tweetIDs_4.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_engaged_with_tweet_id/2020/03/22'
# !wget -O tweetIDs_5.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_engaged_with_tweet_id/2020/03/29'
# !wget -O tweetIDs_6.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_engaged_with_tweet_id/2020/04/12'

# !wget -O userIDs_1.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_user_id/2020/03/01'
# !wget -O userIDs_2.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_user_id/2020/03/08'
# !wget -O userIDs_3.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_user_id/2020/03/15'
# !wget -O userIDs_4.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_user_id/2020/03/22'
# !wget -O userIDs_5.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_user_id/2020/03/29'
# !wget -O userIDs_6.txt 'https://elasticbeanstalk-us-west-2-800068098556.s3-us-west-2.amazonaws.com/challenge-website/public_data/training/diffs/tsv_deleted_user_id/2020/04/12'

## Removing deleted user & tweet IDs

After getting the removed IDs in the previous step, this piece of code removes those IDs from the dataset and stores the result as a new file. NOTE: this currently has to be run manually for each partition, by running it 6 times and changing the `00x` in the file names passed to `pd.read_csv` and `to_csv` to the number of the partition.

In [0]:
all_features = ["text_tokens", "hashtags", "tweet_id", "present_media", "present_links", "present_domains",
                "tweet_type", "language", "tweet_timestamp", "engaged_with_user_id", "engaged_with_user_follower_count",
                "engaged_with_user_following_count", "engaged_with_user_is_verified",
                "engaged_with_user_account_creation", "enaging_user_id", "enaging_user_follower_count",
                "enaging_user_following_count",
                "enaging_user_is_verified", "enaging_user_account_creation", "engagee_follows_engager",
                "reply_timestamp", "retweet_timestamp", "retweet_with_comment_timestamp", "like_timestamp"]

new_dataset = pd.read_csv('./train.006.tsv', delimiter="\x01", encoding='utf-8', header=None)
new_dataset.columns = all_features

# print(new_dataset.tweet_id)
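# Filter against all six deleted-tweet ID files (tweetIDs_1.txt to tweetIDs_6.txt).
for i in range(1, 7):
  removed_tweet_ids = pd.read_csv("tweetIDs_{}.txt".format(i), header=None)
  new_dataset = new_dataset[~new_dataset.tweet_id.isin(removed_tweet_ids[0])]

# Filter against all six deleted-user ID files (userIDs_1.txt to userIDs_6.txt).
for i in range(1, 7):
  removed_user_ids = pd.read_csv("userIDs_{}.txt".format(i), header=None)
  # Drop an engagement when either side of it is a removed user.
  new_dataset = new_dataset[~new_dataset.engaged_with_user_id.isin(removed_user_ids[0])]
  new_dataset = new_dataset[~new_dataset.enaging_user_id.isin(removed_user_ids[0])]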

new_dataset.to_csv('./train_updated.006.tsv', sep="\x01", header=False, index=False, quoting=csv.QUOTE_NONE)
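
The filtering logic and the manual renaming can be sketched on toy data. The helper `drop_removed`, the toy IDs and the generated file-name pairs below are hypothetical illustrations, not part of the original notebook; the column names follow the ones used above.

```python
import pandas as pd


def drop_removed(frame, removed_tweet_ids, removed_user_ids):
    """Keep only engagements whose tweet and both users still exist."""
    keep = (
        ~frame.tweet_id.isin(removed_tweet_ids)
        & ~frame.engaged_with_user_id.isin(removed_user_ids)
        & ~frame.enaging_user_id.isin(removed_user_ids)
    )
    return frame[keep]


# Hypothetical engagements: only the first row should survive.
toy = pd.DataFrame({
    "tweet_id": ["t1", "t2", "t3", "t4"],
    "engaged_with_user_id": ["u1", "u2", "u3", "u4"],
    "enaging_user_id": ["u5", "u6", "u7", "u8"],
})
cleaned = drop_removed(toy, removed_tweet_ids={"t2"}, removed_user_ids={"u3", "u8"})
print(cleaned.tweet_id.tolist())  # ['t1']

# Input/output file names for partitions 1 to 6, instead of editing `00x` by hand.
partition_files = [
    ("./train.{:03d}.tsv".format(n), "./train_updated.{:03d}.tsv".format(n))
    for n in range(1, 7)
]
print(partition_files[5])  # ('./train.006.tsv', './train_updated.006.tsv')
```

Looping over `partition_files` and applying `drop_removed` to each partition would replace the six manual runs.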


## Unix tools used to split the dataset into partitions

To get more manageable pieces, the dataset is partitioned into parts of 10GB so that each part can be read into memory. The last command concatenates the cleaned partitions back into a single file.

In [0]:
# !chmod +x split.sh
# !apt update && apt install -y bc
# !./split.sh
# !wc -l train.001.tsv
!cat train_updated.001.tsv train_updated.002.tsv train_updated.003.tsv train_updated.004.tsv train_updated.005.tsv train_updated.006.tsv > train_updated.tsv
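
The contents of `split.sh` are not shown here. As a rough illustration of the idea only, the sketch below splits a file into numbered parts by line count, which is an assumption: the actual script targets parts of about 10GB. The helper `split_by_lines` and the toy file are hypothetical.

```python
import os
import tempfile


def split_by_lines(path, lines_per_part, prefix):
    """Write consecutive chunks of `path` to prefix.001.tsv, prefix.002.tsv, ..."""
    part_paths = []
    out = None
    # newline="" keeps line endings exactly as they are in the source file.
    with open(path, encoding="utf-8", newline="") as source:
        for index, line in enumerate(source):
            if index % lines_per_part == 0:
                if out is not None:
                    out.close()
                part_path = "{}.{:03d}.tsv".format(prefix, len(part_paths) + 1)
                part_paths.append(part_path)
                out = open(part_path, "w", encoding="utf-8", newline="")
            out.write(line)
    if out is not None:
        out.close()
    return part_paths


# Toy demonstration in a temporary directory with a 5-line file.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "train.tsv")
with open(source_path, "w", encoding="utf-8", newline="") as handle:
    handle.write("".join("row{}\n".format(i) for i in range(5)))

parts = split_by_lines(source_path, lines_per_part=2, prefix=os.path.join(workdir, "train"))
print([os.path.basename(p) for p in parts])
```

Concatenating the parts in order reproduces the original file, which is what the `cat` command relies on.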

## Dataset sampler

Can be used to create a random sample from a large training set. `sample_frequency` is the fraction of rows to keep (here one sixth) and can be set to any value between 0 and 1.

In [0]:
sample_frequency = 1/6

sampled_set = dataset.sample(frac=sample_frequency, random_state=4173141592)
sampled_set.to_csv('./sample.tsv', sep="\x01", header=False, index=False, quoting=csv.QUOTE_NONE)
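
The effect of `frac` and `random_state` can be checked on a toy frame. This is a minimal sketch; the 600-row DataFrame and the seed `42` are arbitrary illustrative choices.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
toy = pd.DataFrame({"row": range(600)})

# The same seed gives the same sample, so the subset is reproducible.
sample_a = toy.sample(frac=1/6, random_state=42)
sample_b = toy.sample(frac=1/6, random_state=42)

print(len(sample_a))  # 100
print(sample_a.index.equals(sample_b.index))  # True
```

Fixing `random_state` means everyone working on the project gets the same `sample.tsv` from the same input file.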


## Playground

Playground based on the code snippet by the RecSys challenge itself.
Can be used to efficiently read the first few lines of the dataset.

In [0]:
all_features = ["text_tokens", "hashtags", "tweet_id", "present_media", "present_links", "present_domains",
                "tweet_type", "language", "tweet_timestamp", "engaged_with_user_id", "engaged_with_user_follower_count",
                "engaged_with_user_following_count", "engaged_with_user_is_verified",
                "engaged_with_user_account_creation", "enaging_user_id", "enaging_user_follower_count",
                "enaging_user_following_count",
                "enaging_user_is_verified", "enaging_user_account_creation", "engagee_follows_engager"]

all_features_to_idx = dict(zip(all_features, range(len(all_features))))
labels_to_idx = {"reply_timestamp": 20, "retweet_timestamp": 21, "retweet_with_comment_timestamp": 22,
                 "like_timestamp": 23}


def print_features(features):
    print(len(features))
    print(features)
    print('-----------------')
    for feature, idx in all_features_to_idx.items():
        print("feature {} has value {}".format(feature, features[idx]))

    for label, idx in labels_to_idx.items():
        print("label {} has value {}".format(label, features[idx]))

with open(train_file, encoding="utf-8") as fileobject:
    i = 0
    for line in fileobject:
        if i == 10:
            break
        line = line.strip()
        features = line.split("\x01")
        i += 1
        print_features(features)

        

In [0]:
dataset = sampled_set

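## Vectorizing text_tokens

Run only one of the two cells below, not both in sequence: the first splits each `text_tokens` string into a list of string tokens, while the second parses the raw tab-separated string into a NumPy integer array.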

In [0]:
dataset['text_tokens'] = dataset['text_tokens'].str.split("\t")
print(dataset)

In [0]:
dataset['text_tokens'] = dataset['text_tokens'].apply(lambda x: np.fromstring(x, dtype=int, sep="\t"))