

# **Step-by-step guide: Smiley prediction on Twitter data**

In this notebook we will fine-tune CT-BERT (COVID-Twitter-BERT) for sentiment (emoji) classification.

Learn more about the project [here](https://github.com/CS-433/cs-433-project-2-mlakes/).

### **Before proceeding**
Create a copy of this notebook by going to:
 `File → Save a Copy in Drive`

### **Contents**
See the table of contents in the sidebar to the left.

# Colab set-up

## 0.1 Training with a GPU

Make sure to change the runtime type to GPU under:

`Runtime → Change runtime type`

Select GPU and, if possible, High-RAM (Colab Pro).

Verify that a GPU is in use; otherwise training will take a very long time.

In [None]:
import torch
if torch.cuda.is_available():    
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available :( using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


## 0.2 Mounting Google Drive

In [None]:
# Mounting the Drive allows us to save and access our files
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [None]:
# Go to main directory
import os
os.chdir("/content/drive/My Drive")

## 0.3 Cloning the repository


In [None]:
# Clone our repo 
!git clone https://paola-md:<>@github.com/CS-433/cs-433-project-2-mlakes MLProject2

fatal: destination path 'MLProject2' already exists and is not an empty directory.


In [None]:
# Move to repository and visualize files (similar to cd MLProject2 and ls)
os.chdir("/content/drive/My Drive/MLProject2")
os.listdir()

['.git',
 '.gitignore',
 'Dockerfile-notebook',
 'README.md',
 'data',
 'docs',
 'models',
 'notebooks',
 'predictions',
 'requirements.txt',
 'src',
 'test']

# Download Dataset
Download the data from [here](https://www.aicrowd.com/challenges/epfl-ml-text-classification/dataset_files) and place it in the `data` folder. You can also download our [copy](https://drive.google.com/file/d/1ve0X5Mj6RAhtb4XFZUa5W_A9HIt4-bgE/view?usp=sharing) from Drive.

In [None]:
!unzip data/twitter-datasets.zip -d data/

Archive:  data/twitter-datasets.zip
  inflating: data/twitter-datasets/sample_submission.csv  
  inflating: data/twitter-datasets/test_data.txt  
  inflating: data/twitter-datasets/train_neg_full.txt  
  inflating: data/twitter-datasets/train_neg.txt  
  inflating: data/twitter-datasets/train_pos_full.txt  
  inflating: data/twitter-datasets/train_pos.txt  


In [None]:
# Move to main data folder
!mv data/twitter-datasets/train_neg.txt data/train_neg.txt 
!mv data/twitter-datasets/train_pos.txt data/train_pos.txt 
!mv data/twitter-datasets/train_neg_full.txt data/train_neg_full.txt 
!mv data/twitter-datasets/train_pos_full.txt data/train_pos_full.txt 
!mv data/twitter-datasets/test_data.txt data/test_data.txt

# Install and import libraries
Install the required dependencies.

In [None]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git
!pip install emoji
!pip install unidecode
!pip install flair

In [None]:
import sys
sys.path.append("/content/drive/My Drive/MLProject2")

In [None]:
import pandas as pd
import numpy as np

from src.preprocessing import apply_preprocessing, apply_preprocessing_bert
from src.data_loading import load_tweets, load_test_tweets, split_data, seed_everything, split_data_bert
from src.models.bi_lstm import run_bidirectional_lstm
from src.models.machine_learning_models import run_tfidf_ml_model
from src.models.few_shot import run_zero_shot
from src.models.bert import run_bert, predict_bert

# Training

In [None]:
# Global variables
model_name = 'digitalepidemiologylab/covid-twitter-bert'

# For reproducibility
seed_everything()
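
The body of `seed_everything` lives in `src.data_loading` and is not shown in this notebook. As a rough illustration only, a helper of this kind typically seeds every random number generator the pipeline touches. The function name `seed_everything_sketch` and the default seed below are assumptions, not the project's actual code.

```python
import os
import random


def seed_everything_sketch(seed: int = 42) -> None:
    """Hypothetical stand-in: seed the common sources of randomness."""
    random.seed(seed)
    # Only affects hash randomization of Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value reproduces the same random draw.
seed_everything_sketch(0)
first = random.random()
seed_everything_sketch(0)
second = random.random()
print(first == second)
```

Even with fixed seeds, some GPU operations can remain non-deterministic, so small run-to-run differences in the reported metrics are possible.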

## 1.1 Load the training set

In [None]:
os.chdir("/content/drive/My Drive/MLProject2/src")

To demonstrate the training pipeline, we use only a small fraction of the dataset.
Please note that the final model was trained on ALL the data.

In [None]:
#tweets = load_tweets(sample=False, frac=1)
tweets = load_tweets(sample=True, frac=0.01)

Positive tweets: 97902
Negative tweets: 99068
Most frequent label model: 0.503


## 1.2 Preprocess the training set

Preprocessing does two things:
1. cleans the tweets
2. tokenizes the cleaned tweets by:
  * adding [CLS] and [SEP] tokens
  * adding padding (maximum length 100)
  * creating attention masks
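
The tokenization steps listed above can be sketched in plain Python. This is a toy illustration, not the project's `apply_preprocessing_bert`: the `encode` function and `toy_vocab` are hypothetical, and the sketch splits on whitespace, whereas a real BERT tokenizer splits text into WordPiece sub-words.

```python
from typing import Dict, List, Tuple


def encode(words: List[str], vocab: Dict[str, int],
           max_len: int = 100) -> Tuple[List[int], List[int]]:
    """Add [CLS]/[SEP], pad to max_len, and build the attention mask."""
    # Map words to ids, leaving room for the two special tokens.
    body = [vocab.get(w, vocab["[UNK]"]) for w in words][: max_len - 2]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    n_pad = max_len - len(ids)
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [vocab["[PAD]"]] * n_pad
    return input_ids, attention_mask


# Toy vocabulary; the word ids are made up for illustration.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "translation": 1, "please": 2}

input_ids, attention_mask = encode("translation please now".split(),
                                   toy_vocab, max_len=8)
print(input_ids)       # [101, 1, 2, 100, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The word "now" is missing from the toy vocabulary, so it maps to the `[UNK]` id.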




In [None]:
# Example of the cleaning function.
# NOTE: clean_text is not imported above; we assume it lives in src.preprocessing.
from src.preprocessing import clean_text

# Print each raw tweet followed by its cleaned version
for idx in (9, 17):
    raw_tweet = tweets.tweet[idx]
    cleaned_tweet = clean_text(raw_tweet)
    print(raw_tweet)
    print(cleaned_tweet)

<user> translation pleaseee for international fans
translation please for international fans
<user> i #believe one day u will notice me ..  ...
i believe one day you will notice me .. ...


In [None]:
tweets = apply_preprocessing_bert(tweets)

After preprocessing, each example is a tuple of three tensors:
1. Tokens (input ids)
2. Attention mask
3. Label

See the example below:

In [None]:
tweets[17]

(tensor([ 101, 1045, 2903, 2028, 2154, 2017, 2097, 5060, 2033, 1012, 1012, 1012,
         1012, 1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 tensor(1))

## 1.3 Transformer training



For this example we use 90% of the data for training and 10% for validation. Note that the final model was trained on 100% of the data.

In [None]:
train_tweets, val_tweets = split_data_bert(tweets, ratio=0.9)
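
The implementation of `split_data_bert` is not shown here. As a minimal sketch of what a 90/10 split involves, the hypothetical `split_train_val` below shuffles the example indices with a fixed seed and cuts them at the requested ratio. It works on any sequence and is an assumption about the general technique, not the project's code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_val(items: Sequence[T], ratio: float = 0.9,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle indices reproducibly, then cut them at `ratio`."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(len(items) * ratio)
    train = [items[i] for i in indices[:cut]]
    val = [items[i] for i in indices[cut:]]
    return train, val


train, val = split_train_val(list(range(100)), ratio=0.9)
print(len(train), len(val))  # 90 10
```

Shuffling before cutting matters when the data is ordered by label, because an unshuffled cut could leave the validation set with a single class.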


In [None]:
run_bert(train_tweets=train_tweets,
         val_tweets=val_tweets,
         save_model=True,
         learning_rate=5e-6,
         model_name='bert',
         epochs=3)


----------------------------------------------------------------------------------------------------
MODEL TO RUN: Neural network with bert tokens
----------------------------------------------------------------------------------------------------

Creating batches...


Some weights of the model checkpoint at digitalepidemiologylab/covid-twitter-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassifi

Saving model...


Iteration:   0%|          | 0/552 [00:00<?, ?it/s]

./../models/bert_0
Done!


Iteration: 100%|██████████| 552/552 [02:46<00:00,  3.31it/s]
Iteration:   0%|          | 0/4962 [00:00<?, ?it/s]

EPOCH: 0. Losses: train = 0.33622857421059993, val = 0.31632226176451944.             Accuracy: 0.8713654891304348


Iteration: 100%|██████████| 4962/4962 [1:18:05<00:00,  1.06it/s]


Saving model...


Iteration:   0%|          | 0/552 [00:00<?, ?it/s]

./../models/bert_1
Done!


Iteration: 100%|██████████| 552/552 [02:46<00:00,  3.32it/s]
Iteration:   0%|          | 0/4962 [00:00<?, ?it/s]

EPOCH: 1. Losses: train = 0.21659452503930712, val = 0.3242368784739865.             Accuracy: 0.8776947463768117


Iteration: 100%|██████████| 4962/4962 [1:18:04<00:00,  1.06it/s]


Saving model...


Iteration:   0%|          | 0/552 [00:00<?, ?it/s]

./../models/bert_2
Done!


Iteration: 100%|██████████| 552/552 [02:46<00:00,  3.32it/s]

EPOCH: 2. Losses: train = 0.1304507609538423, val = 0.42078522598350665.             Accuracy: 0.8735054347826087
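
In the log above, the training loss keeps falling while the validation loss rises after epoch 0, which is a common sign of overfitting. A minimal sketch for comparing the saved checkpoints is shown below. The metrics are copied from the log and rounded to four decimals, and the sketch assumes that the suffix in `bert_0`, `bert_1`, `bert_2` matches the epoch number.

```python
# Metrics copied from the training log (rounded to 4 decimals).
history = [
    {"epoch": 0, "train_loss": 0.3362, "val_loss": 0.3163, "accuracy": 0.8714},
    {"epoch": 1, "train_loss": 0.2166, "val_loss": 0.3242, "accuracy": 0.8777},
    {"epoch": 2, "train_loss": 0.1305, "val_loss": 0.4208, "accuracy": 0.8735},
]

# Pick the checkpoint with the lowest validation loss ...
best_by_loss = min(history, key=lambda row: row["val_loss"])
# ... and the one with the highest reported accuracy.
best_by_acc = max(history, key=lambda row: row["accuracy"])

print(f"lowest val loss:  bert_{best_by_loss['epoch']}")   # bert_0
print(f"highest accuracy: bert_{best_by_acc['epoch']}")    # bert_1
```

The two criteria disagree on this small sample, and the accuracy differences are well under one percentage point, so either checkpoint is a defensible choice here.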





# Testing
Now we use the trained model to make predictions.

You can also use our model trained on all the data. Download it [here](https://drive.google.com/drive/folders/1aLWxJdPFwOyqvNkkc_QyzhBC9ofY9tgS?usp=sharing) and place it in the `models` folder.

## 2.1 Load the testing set

In [None]:
test_tweets = load_test_tweets()
# Store the tweet ids in the 'polarity' column so that the same
# preprocessing function can be reused for the test set
test_tweets['polarity'] = pd.to_numeric(test_tweets.id)

## 2.2 Preprocess the testing set

In [None]:
dataset = apply_preprocessing_bert(test_tweets)

## 2.3 Make the predictions with our model

In [None]:
test_ids_list, binary_preds_list = predict_bert(dataset, 'bert_2')

Iteration: 100%|██████████| 288/288 [01:26<00:00,  3.32it/s]


## 2.4 Format and save the predictions

In [None]:
test_ids = np.concatenate(test_ids_list).ravel()
binary_preds = np.concatenate(binary_preds_list).ravel()
binary_preds = np.where(binary_preds==0, -1, binary_preds) 
results = pd.DataFrame({'Id': test_ids, 'Prediction': binary_preds})
results.to_csv("./../predictions/predictions.csv", index=False)
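
Before uploading, it can help to sanity-check the saved file. The sketch below is a hypothetical helper (`check_submission` is not part of the project) that verifies the `Id,Prediction` header written above, that every prediction is `-1` or `1`, and that no id appears twice. It uses only the standard library and reads from a string so that it runs without the real file.

```python
import csv
import io


def check_submission(csv_text: str) -> int:
    """Validate an Id,Prediction CSV and return the number of rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Id", "Prediction"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    seen_ids = set()
    for row in reader:
        if row["Prediction"] not in {"-1", "1"}:
            raise ValueError(f"bad prediction in row: {row}")
        if row["Id"] in seen_ids:
            raise ValueError(f"duplicate id: {row['Id']}")
        seen_ids.add(row["Id"])
    return len(seen_ids)


sample = "Id,Prediction\n1,-1\n2,-1\n3,1\n"
n_rows = check_submission(sample)
print(n_rows)  # 3
```

To check the real file, pass its contents instead, for example `check_submission(open("./../predictions/predictions.csv").read())`.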

In [None]:
results.head()

   Id  Prediction
0   1          -1
1   2          -1
2   3           1
3   4           1
4   5          -1


# Run models with embeddings

The BiLSTM can be trained with GloVe and Word2Vec embeddings.


## 3.1 Word2Vec
This step constructs a vocabulary of the words that appear at least 5 times.
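
The shell scripts that build the vocabulary are not shown in this notebook. As a rough Python equivalent of the idea, the hypothetical `build_vocab` below counts whitespace-separated words and keeps those that reach the minimum count. The function name and the toy data are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(tweets: Iterable[str], min_count: int = 5) -> List[str]:
    """Return the sorted words that occur at least `min_count` times."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return sorted(word for word, n in counts.items() if n >= min_count)


# 'good' appears 5 times, 'day' 7 times, 'bad' only twice.
toy_tweets = ["good day"] * 5 + ["bad day"] * 2
vocab = build_vocab(toy_tweets)
print(vocab)  # ['day', 'good']
```

Dropping rare words keeps the embedding matrix small and avoids learning vectors from too few examples.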


In [None]:
os.chdir("/content/drive/My Drive/MLProject2")

In [None]:
!sh src/preprocessing_glove/build_vocab.sh
!sh src/preprocessing_glove/cut_vocab.sh
!python src/preprocessing_glove/pickle_vocab.py


In [None]:
os.chdir("/content/drive/My Drive/MLProject2/src")

In [None]:
tweets = load_tweets(sample=True, frac=0.001)
tweets = apply_preprocessing(tweets)
run_bidirectional_lstm(tweets=tweets[['tweet']],
                        labels=tweets[['polarity']],
                        save_model=False,
                        embeddings='word2vec')

## 3.2 GloVe

In [None]:
os.chdir("/content/drive/My Drive/MLProject2")

In [None]:
!wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
!mv glove.twitter.27B.zip data/embeddings/glove.twitter.27B.zip
!unzip data/embeddings/glove.twitter.27B.zip -d data/embeddings
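
GloVe vectors are distributed as plain text, with one word per line followed by its space-separated vector components. How `run_bidirectional_lstm` reads them is not shown here. The hypothetical `parse_glove_lines` below is only a sketch of how such a file can be turned into a dictionary, and the file name in the usage note is an assumption about the archive contents.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ...' lines into a word -> vector dictionary."""
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Toy lines in the same layout as a GloVe text file.
toy_lines = ["happy 0.1 -0.2 0.3\n", "sad -0.4 0.5 0.6\n"]
vectors = parse_glove_lines(toy_lines)
print(vectors["happy"])  # [0.1, -0.2, 0.3]
```

For a real file, pass an open file handle instead of the toy list, for example `parse_glove_lines(open("data/embeddings/glove.twitter.27B.100d.txt", encoding="utf-8"))`, assuming that file exists after unzipping.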

In [None]:
os.chdir("/content/drive/My Drive/MLProject2/src")

In [None]:
run_bidirectional_lstm(tweets=tweets[['tweet']],
                        labels=tweets[['polarity']],
                        save_model=False,
                        embeddings='glove')