# README 

Hello and good morning/afternoon/night! Let us explain a few things about this Jupyter notebook:
It was originally created to be run with Google Colab, as it involves large models and data files.
All paths refer to the following Google Drive folder, which has public access: https://drive.google.com/drive/folders/11-iqSDHChz9ihD_9gY5L3SKspiwuwyil?usp=sharing


Simply click the small arrow to the right of the title, "ML_Project2", and add it as a shortcut to your personal drive, at the /MyDrive/ level. It contains everything you need to run this notebook and to create, train and test the models. You can then run bert.ipynb in Google Colab as usual.

This notebook is divided into multiple sections:
- Initialisation and library installation
- Data Import & preprocessing
- Bertweet & RoBERTa
- Creation of the submission dataset
- Result view 

Libraries are imported and installed in the first section, whereas paths to the various files are defined in their corresponding sections, using CAPS.

Enjoy your reading!

# Initialisation and library installation 

In this part we simply import the libraries needed by the rest of the code.

In [None]:
# Imports 
import sys
import os
import torch
import random
import numpy as np 
import pandas as pd
import csv
import pickle
import re
from sklearn.model_selection import train_test_split

# Mount Google Drive in Colab (drive.mount is the public API)
from google.colab import drive 
drive.mount("/content/drive")

In [None]:
# Installations
# The -q argument (quiet) prevents the display of all install steps
!pip install simpletransformers -q 
!pip install emoji -q
from simpletransformers.classification import ClassificationModel
from simpletransformers.classification import ClassificationArgs
import emoji 

In [None]:
# Check if cuda (GPU) is available 
cuda = torch.cuda.is_available()
if cuda:
  print("Cuda available - Uses GPU")
else: 
  print("Cuda unavailable - Uses CPU")

Cuda available - Uses GPU


In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Dec 21 09:51:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    34W / 250W |   8737MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Data import & preprocessing

Here we import the data contained in the train_pos and train_neg files, as well as the test_data used below for AIcrowd submissions. We lightly process the first two datasets to make them suitable for model training.

In [None]:
POS_DATA = "/content/drive/MyDrive/ML_Project2/twitter-datasets/train_pos_full.txt"
NEG_DATA = "/content/drive/MyDrive/ML_Project2/twitter-datasets/train_neg_full.txt"
TEST_DATA = "/content/drive/MyDrive/ML_Project2/twitter-datasets/test_data.txt"

In [None]:
pos = pd.read_csv(POS_DATA, names = ["text"], sep='\n', header = None, dtype='str', quoting = csv.QUOTE_NONE)
neg = pd.read_csv(NEG_DATA, names = ["text"], sep='\n', header = None, dtype='str', quoting = csv.QUOTE_NONE)

In [None]:
# we concatenate the two and add labels
pos["labels"] = 1
neg["labels"] = 0
df = pd.concat([pos, neg])

Let's take a look at how many tweets we have.

In [None]:
pos.shape[0] + neg.shape[0]

2500000

With 2.5 million tweets (negative and positive), we will probably have to choose a subset to avoid extremely long runtimes.

In [None]:
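# We randomly pick a subset of the dataset to train & test the models on,
# as using the whole dataset would take too long
DATA_SIZE = 100000

# a random seed is set; random.seed() returns None, so the integer itself
# is kept in SEED to be passed as random_state below
SEED = 123
random.seed(SEED)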

df = df.sample(n = DATA_SIZE, random_state = SEED)
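
As a side note on reproducibility, `random.seed()` seeds Python's generator in place and returns `None`, so the integer seed has to be kept in its own variable if it is to be reused as a `random_state`. The sketch below illustrates this with the standard library only; the helper name `sample_subset` and the toy data are made up for the example and are not part of the notebook.

```python
import random

# random.seed() works in place: its return value is None, not the seed.
print(random.seed(123))  # prints None

SEED = 123  # keep the integer itself so it can be reused


def sample_subset(items, n, seed):
    """Return n items drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # independent generator, global state is untouched
    return rng.sample(items, n)


tweets = [f"tweet {i}" for i in range(1000)]  # hypothetical stand-in for the dataset
first = sample_subset(tweets, 5, SEED)
second = sample_subset(tweets, 5, SEED)
print(first == second)  # the same seed gives the same subset
```

With a seed of `None`, two successive draws would generally differ, which makes accuracies hard to compare between runs.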

Here is a quick view of the data at this stage. 

We also have to "normalize" the data so that it matches the format used by the pre-trained models. To do so, we replace the `<user>` and `<url>` tags. We also "demojize" the remaining emojis. Note that both pre-trained tweet models used in this notebook were trained with this type of tags.

In a second step, we also decided to replace every digit with a single symbol. This way we hope to keep the format of numbers without keeping the diversity of digits.


In [None]:
def adapt_normalizer(tweet, numbers = False):
  """
  Adapts current data normalization to the desired one. 
  @tweet: tweet to normalize
  @numbers: whether to normalize numbers or not
  return: normalized tweet 
  """
  normalized_tweet = tweet
  if "<user>" in tweet:
    normalized_tweet = normalized_tweet.replace("<user>","@USER")
  if "<url>" in tweet:
    normalized_tweet = normalized_tweet.replace("<url>","HTTPURL")
  # emojis are converted to their textual description
  normalized_tweet = emoji.demojize(normalized_tweet)
  # only if we also want to convert numbers: each digit becomes "§"
  if numbers:
    normalized_tweet = re.sub(r"\d", "§", normalized_tweet)
  return normalized_tweet
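
To illustrate the effect of this normalization, here is a standalone sketch that reproduces the tag and digit replacement on a made-up tweet. It is an approximation only: the emoji step is left out so that it runs with the standard library, and the name `normalize_sketch` is hypothetical.

```python
import re


def normalize_sketch(tweet, numbers=False):
    """Standalone approximation of the normalization, without the emoji step."""
    out = tweet.replace("<user>", "@USER").replace("<url>", "HTTPURL")
    if numbers:
        out = re.sub(r"\d", "§", out)  # every digit becomes the same placeholder
    return out


example = "<user> meet me at 10:30 , details here <url>"  # made-up tweet
print(normalize_sketch(example))
print(normalize_sketch(example, numbers=True))
```

With `numbers=True`, `10:30` becomes `§§:§§`: the shape of the number is kept, but not its value.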

In [None]:
df["text"] = df["text"].apply(lambda x : adapt_normalizer(x))
df

| | text | labels |
|---|---|---|
| 543273 | best part of the day #goodnight everyone n swe... | 1 |
| 143254 | " @USER only you can understand why ill be try... | 0 |
| 828767 | in the darkest depths of cornwall sat in the c... | 1 |
| 636681 | rt @USER @USER congrats on your 4 billboard aw... | 1 |
| 260418 | droppin a piece of slate on ur foot is not the... | 0 |
| ... | ... | ... |
| 647110 | only just going to sleep after finishing colle... | 0 |
| 1167867 | @USER i can spell , it's just hard to type on ... | 1 |
| 1021564 | ive changed a lot of my ways of thinking , loo... | 1 |
| 383261 | i feel stupid now #ifonly i didnt hate life , ... | 0 |


In [None]:
# we split in train & test sets
TEST_SIZE = 0.2
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["labels"], test_size = TEST_SIZE, random_state=SEED)
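
As a sanity check on the sizes involved, the sketch below computes how many tweets end up in each set and how many batches that represents. The batch size of 8 is an assumption inferred from the progress bars shown in this notebook (10000 steps for 80000 training tweets); it is not set explicitly in the code.

```python
import math

DATA_SIZE = 100000
TEST_SIZE = 0.2
BATCH_SIZE = 8  # assumption: inferred from the progress bars, not set in this notebook

n_test = round(DATA_SIZE * TEST_SIZE)
n_train = DATA_SIZE - n_test

train_steps = math.ceil(n_train / BATCH_SIZE)  # batches per training epoch
eval_steps = math.ceil(n_test / BATCH_SIZE)    # batches per prediction pass

print(n_train, n_test, train_steps, eval_steps)
```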

In [None]:
# we also create a few "list" format variables for further needs 
def conv_list(series):
  return series.to_numpy().tolist()

X_train_l = conv_list(X_train)
X_test_l = conv_list(X_test)
y_train_l = conv_list(y_train)
y_test_l = conv_list(y_test)

In [None]:
# we join back everything
train_df = pd.concat([X_train, y_train], axis = 1)
test_df = pd.concat([X_test, y_test], axis = 1)

# quick view of the data at this stage : 
train_df

| | text | labels |
|---|---|---|
| 665765 | from mommy wars to doggie wars in the campaign... | 0 |
| 876716 | really wanna go to the summer time ball | 0 |
| 311174 | after the summer , i'm starting my locs ) | 1 |
| 132388 | the boy in the striped pajamas is such a great... | 0 |
| 269996 | @USER thanks ! hope you have a wonderful day ! | 1 |
| ... | ... | ... |
| 1183753 | @USER ooh the life good times | 1 |
| 859668 | anyone looking for a job , my work is hiring | 1 |
| 454234 | im willing to bet money that buddy will be fam... | 1 |
| 459959 | @USER im sorry im not as cool as you and gets ... | 0 |


# Bertweet & RoBERTa

First, let us briefly explain our process. We found a couple of "community models" already trained on tweet-specific tasks. These models are pre-trained on tweets rather than on general text. This is particularly useful given the specific form and spelling of tweets.

## Vinai - No train 
We chose an existing model, namely "vinai/bertweet-base". Its advantage is that it was pre-trained specifically on Twitter data (850M tweets). It was originally trained for word prediction, so further training will probably be required to fit our dataset.

In [None]:
# we use the bertweet model
# for the amount of data we have, we do not want the model to be saved at every step
# same goes on for epochs 
bert = ClassificationModel("bertweet", "vinai/bertweet-base", use_cuda = cuda)

Some weights of the model checkpoint at vinai/bertweet-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at vinai/bertweet-base and are newly initialized: 

In [None]:
preds, model_outputs = bert.predict(X_test_l)

  0%|          | 0/2000 [00:00<?, ?it/s]

  0%|          | 0/250 [00:00<?, ?it/s]

In [None]:
accuracy = np.count_nonzero(preds == y_test)/y_test.size
print("Accuracy using vinai with no training is {}".format(accuracy))

Accuracy using vinai with no training is 0.5065


Indeed, this accuracy is not really better than random. We will therefore retrain this model.

## Pysentimiento - No train

Note that pysentimiento is a further training of Vinai, intended to be usable as a "black box" by everyone, including people who know very little about ML. It is nevertheless an interesting model, because it is trained specifically for our purpose (deciding whether a sentiment is positive or negative). 
In addition, it has the interesting particularity of also identifying "neutral" tweets.
Let's see how well it performs. 

In [None]:
bert2 = ClassificationModel("bertweet", "finiteautomata/bertweet-base-sentiment-analysis", use_cuda = cuda)

Downloading:   0%|          | 0.00/890 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/515M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/824k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/17.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/295 [00:00<?, ?B/s]

In [None]:
preds2, model_outputs2 = bert2.predict(X_test_l)

  0%|          | 0/20000 [00:00<?, ?it/s]

  0%|          | 0/2500 [00:00<?, ?it/s]

### Removing neutrals. 
Here, `model_outputs2` gives the "probabilities" for each tweet to be (negative, neutral, positive). We drop the neutral score and keep whichever of negative and positive is larger.

In [None]:
def ps_accuracy(model_outputs, y):
  """Computes the accuracy for the Pysentimiento model (PS), based on 
  model outputs. 
  @model_outputs: outputs of model 
  @y: expected values of y
  return: accuracy"""
  pos_neg_outputs = np.delete(model_outputs, 1, 1)  # delete the second (neutral) column of the outputs
  zo_preds = np.argmax(pos_neg_outputs, axis = 1) # 0 if the negative "probability" is larger, 1 if the positive one is
  accuracy = np.count_nonzero(zo_preds == y)/y.size
  return accuracy
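
To make the neutral-removal step concrete, here is a small worked example on made-up scores. It is a sketch in plain Python (so that it runs without numpy), and the function name and toy values are hypothetical; it mirrors the logic of dropping the middle column and taking the larger of the two remaining scores.

```python
def drop_neutral_and_predict(model_outputs):
    """For each row (negative, neutral, positive), ignore the neutral score
    and return 0 if the negative score is at least as large, 1 otherwise."""
    preds = []
    for negative, _neutral, positive in model_outputs:
        preds.append(1 if positive > negative else 0)
    return preds


# Made-up scores: the second tweet would be "neutral" for a 3-class argmax,
# but is forced to positive once the neutral column is ignored.
toy_outputs = [
    [2.0, 0.1, -1.0],
    [-0.5, 3.0, 0.2],
    [0.1, 0.2, 1.5],
]
toy_labels = [0, 1, 0]

toy_preds = drop_neutral_and_predict(toy_outputs)
toy_accuracy = sum(p == y for p, y in zip(toy_preds, toy_labels)) / len(toy_labels)
print(toy_preds, toy_accuracy)
```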

In [None]:
print("Accuracy for Pysentimiento is {}".format(ps_accuracy(model_outputs2, y_test)))

Accuracy for Pysentimiento is 0.6455


### Accuracy for non-neutral predictions. 

Let's also check the accuracy for tweets not predicted as neutral. This may not be directly useful for our task, but it is still interesting to see how the model performs.

In [None]:
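def non_neutral_acc(pred, y):
  """Checks the accuracy EXCLUDING neutral tweets. This means it will
  only compare the tweets predicted as negative (0) or positive (2)
  with their expected labels.
  @pred: predictions with values in {0, 1, 2}
  @y: expected labels with values in {0, 1}
  return: accuracy over the non-neutral predictions"""
  similar = 0
  non_neutrals = 0
  for i in range(len(pred)):
    if pred[i] != 1:
      non_neutrals += 1
      # in predictions, positive sentiment is represented by 2, whereas it is 
      # represented by 1 in y_test 
      if (pred[i] == 0 and y[i] == 0) or (pred[i] == 2 and y[i] == 1):
        similar += 1
  return similar/non_neutrals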
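
As a quick illustration of this metric, the sketch below applies the same rule to a handful of made-up predictions. The function name and the toy values are hypothetical, and it additionally returns `None` when every prediction is neutral, which is a choice made for this example only.

```python
def non_neutral_accuracy_sketch(preds, labels):
    """Accuracy over predictions that are not neutral (1).
    Predictions use 0/1/2 (negative/neutral/positive), labels use 0/1."""
    kept = [(p, y) for p, y in zip(preds, labels) if p != 1]
    if not kept:
        return None  # nothing to evaluate when every prediction is neutral
    correct = sum(1 for p, y in kept if (p == 0 and y == 0) or (p == 2 and y == 1))
    return correct / len(kept)


toy_preds = [0, 1, 2, 2, 0]   # made-up 3-class predictions
toy_labels = [0, 1, 1, 0, 1]  # made-up binary labels
print(non_neutral_accuracy_sketch(toy_preds, toy_labels))
```

Here the second tweet is ignored, and two of the four remaining predictions agree with their label.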
  
    

In [None]:
non_neutral_accuracy = non_neutral_acc(preds2, y_test_l)
print("Accuracy for non-neutrals in Pysentimiento is {}".format(non_neutral_accuracy))

Accuracy for non-neutrals in Pysentimiento is 0.7583612040133779


The accuracies obtained are not bad, but further training could strongly increase them. We will therefore train Vinai and Pysentimiento further. 

We will now train all the previous models, as well as RoBERTa, a model pre-trained on a different but larger dataset (160GB). It is not trained specifically on tweets, but should also perform quite well on our task. 

## Vinai - train

We start by defining an output directory for the trained model. 

In [None]:
MODEL_DIR_VINAI= "/content/drive/MyDrive/ML_Project2/vinai_models100kNEW"

In [None]:
bert.train_model(train_df, output_dir = MODEL_DIR_VINAI)

  0%|          | 0/80000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/10000 [00:00<?, ?it/s]

(10000, 0.32966543171405793)

## Pysentimiento - train

In [None]:
MODEL_DIR_PS = "/content/drive/MyDrive/ML_Project2/pysentimiento_models100k"

Before training, we need to adapt our dataframe to the requirements of pysentimiento, i.e. positive sentiments are mapped to 2 instead of 1. 

In [None]:
# we do a deep copy to avoid modifying the initial data
ps_train_df = train_df.copy(deep = True)
# .loc avoids chained assignment and the SettingWithCopyWarning it triggers
ps_train_df.loc[ps_train_df["labels"] == 1, "labels"] = 2


In [None]:
bert2.train_model(ps_train_df, output_dir = MODEL_DIR_PS)

  0%|          | 0/80000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/10000 [00:00<?, ?it/s]



(10000, 0.34718909907341006)

## RoBERTa - train

In [None]:
bert3 = ClassificationModel("roberta", "roberta-base", use_cuda = cuda)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

In [None]:
MODEL_DIR_ROBERTA= "/content/drive/MyDrive/ML_Project2/roberta_models100k"

In [None]:
bert3.train_model(train_df, output_dir = MODEL_DIR_ROBERTA)

  0%|          | 0/80000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/10000 [00:00<?, ?it/s]

(10000, 0.398639911031723)

## Check results from previous model

Here we check the results from the previous models when they have been saved. 

In [None]:
MODEL_DIR_VINAI = "/content/drive/MyDrive/ML_Project2/vinai_models100k/checkpoint-10000-epoch-1"
MODEL_DIR_ROBERTA = "/content/drive/MyDrive/ML_Project2/roberta_models100k/checkpoint-10000-epoch-1"
MODEL_DIR_PS= "/content/drive/MyDrive/ML_Project2/pysentimiento_models100k/checkpoint-10000-epoch-1"

In [None]:
def compute_acc(array1, array2, model = "Vinai"):
  """This function computes the accuracy of our model. 
  The calculation depends on the model computed, as the 
  prediction is directly contained in "pred" for Vinai, 
  but we have to compute it from the model outputs for Pysentimiento (PS)
  due to the "neutral" case
  @array1: either the preds (for vinai) or the model outputs (for PS)
  @array2: expected y values
  @model: chosen model. either "Vinai" or "PS"
  return: accuracy
  """
  if model == "Vinai":
    return np.count_nonzero(array1 == array2)/array2.size
  elif model == "PS": 
    return ps_accuracy(array1, array2)

### Vinai


In [None]:
vinai = ClassificationModel("bertweet", MODEL_DIR_VINAI, use_cuda = cuda)

In [None]:
preds, model_outputs = vinai.predict(X_test_l)

  0%|          | 0/20000 [00:00<?, ?it/s]

  0%|          | 0/2500 [00:00<?, ?it/s]

In [None]:
accuracy = compute_acc(preds, y_test)
print("Accuracy for Vinai is {}".format(accuracy))

Accuracy for Vinai is 0.89595


### Pysentimiento

In [None]:
ps = ClassificationModel("bertweet", MODEL_DIR_PS, use_cuda = cuda)

In [None]:
# predictions will be computed from model_outputs2
_, model_outputs2 = ps.predict(X_test_l)

  0%|          | 0/20000 [00:00<?, ?it/s]

  0%|          | 0/2500 [00:00<?, ?it/s]

In [None]:
accuracy = compute_acc(model_outputs2, y_test, "PS")
print("Accuracy for Pysentimiento is {}".format(accuracy))

Accuracy for Pysentimiento is 0.8952


### RoBERTa


In [None]:
roberta = ClassificationModel("roberta", MODEL_DIR_ROBERTA, use_cuda = cuda)
preds3, model_outputs3 = roberta.predict(X_test_l)

  0%|          | 0/20000 [00:00<?, ?it/s]

  0%|          | 0/2500 [00:00<?, ?it/s]

In [None]:
accuracy = compute_acc(preds3, y_test)
print("Accuracy for Roberta is {}".format(accuracy))

Accuracy for Roberta is 0.86505


## Saving accuracy

We then save the obtained accuracy for later comparison. To do so, we append it to a csv file that also records the size of the dataset it was obtained on. 

In [None]:
DIR_ACCURACY = "/content/drive/MyDrive/ML_Project2/accuracy/accuracy.csv"

We then add some information to identify which model and dataset the accuracy was obtained with. 

In [None]:
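# `accuracy` holds the result of the last evaluation cell that was run: adapt the model name accordingly
accuracy_infos = ["100k", "Vinai", accuracy]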

In [None]:
# newline='' lets the csv module handle line endings itself
with open(DIR_ACCURACY, 'a', newline='') as csvfile:
    write = csv.writer(csvfile, delimiter = ',')
    write.writerow(accuracy_infos)
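
Since the results are read back later by column name, the file needs a header row. The following is a minimal sketch of an append helper that writes the header only when the file is new or empty. The helper name `append_accuracy` and the temporary path are assumptions made for the example; the column names and values are taken from the results table of this notebook.

```python
import csv
import os
import tempfile


def append_accuracy(path, set_name, model_name, accuracy):
    """Append one result row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        if is_new:
            writer.writerow(["Set", "Model", "Accuracy"])
        writer.writerow([set_name, model_name, accuracy])


# Demonstration in a temporary directory (hypothetical path, not the Drive folder)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "accuracy.csv")
    append_accuracy(demo_path, "100k", "Vinai", 0.89595)
    append_accuracy(demo_path, "100k", "Roberta", 0.86505)
    with open(demo_path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)
```

Passing the model name as an argument also makes it harder to store an accuracy under the wrong label.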

# Run on **test** (submission) dataset

We run our models on the dataset and output the prediction csv. 

In [None]:
test = pd.read_csv(TEST_DATA, names = ["text"], sep='\n', header = None, dtype='str', quoting = csv.QUOTE_NONE)

In [None]:
# change the model name depending on which one you want to use 
# predict is given a flat list of strings here, hence the column selection
preds_test, model_outputs = vinai.predict(test["text"].tolist())

We have to make the predictions fit the format expected for AIcrowd submissions. We put everything in a dataframe with the corresponding indexes, replace 0 by -1, and rename the columns.

In [None]:
def to_df(preds):
  """Here we convert the predictions to a dataframe
  @preds: predictions as an array of [0, 1] values
  return: dataframe, index starting at 1, [-1, 1] values"""
  df_preds = pd.DataFrame(preds, columns= ["Prediction"])
  df_preds.index += 1
  df_preds.index.rename("Id", inplace = True)
  df_preds.loc[df_preds["Prediction"] == 0, "Prediction"] = -1
  return df_preds
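
For reference, the expected layout (Ids starting at 1, values in {-1, 1}) can be sketched without pandas as follows. The function name `to_submission_rows` and the toy predictions are made up for the illustration.

```python
import csv
import io


def to_submission_rows(preds):
    """Map 0/1 predictions to (Id, Prediction) rows with Ids from 1 and -1/1 values."""
    return [(i, 1 if p == 1 else -1) for i, p in enumerate(preds, start=1)]


rows = to_submission_rows([0, 0, 1, 1, 0])  # toy predictions

# Write to an in-memory buffer instead of a file, to show the resulting csv text
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Id", "Prediction"])
writer.writerows(rows)
print(buffer.getvalue())
```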

## Vinai & RoBERTa

In [None]:
# use to construct the df for vinai or roberta
df_preds = to_df(preds_test)

## Pysentimiento

In [None]:
# use to construct the df for pysentimiento 
pos_neg_outputs = np.delete(model_outputs, 1, 1)  # delete the second (neutral) column; model_outputs must come from the pysentimiento model
ps_preds = np.argmax(pos_neg_outputs, axis = 1) # take argmax over the remaining (negative, positive) columns
df_preds = to_df(ps_preds)

## Check & exports

Here is how the df looks at this stage.

In [None]:
df_preds

| Id | Prediction |
|---|---|
| 1 | -1 |
| 2 | -1 |
| 3 | 1 |
| 4 | 1 |
| 5 | -1 |
| ... | ... |
| 9996 | 1 |
| 9997 | -1 |
| 9998 | -1 |
| 9999 | 1 |


In [None]:
# change export name depending on the model used 
# here it is set up with the csv that achieved the best results
PRED_CSV_PATH = "/content/drive/MyDrive/ML_Project2/result_csv/bert_vinai_100k.csv"
df_preds.to_csv(PRED_CSV_PATH)

# Results view

We load our results from the previously saved accuracy table. Set indicates the size of the (re)training set, where +N indicates that numbers were also pre-processed. 

In [None]:
df_results = pd.read_csv(DIR_ACCURACY).sort_values(["Model", "Set"])
df_results

| | Set | Model | Accuracy |
|---|---|---|---|
| 5 | 100k | Pysentimiento | 0.8952 |
| 4 | 10k | Pysentimiento | 0.88 |
| 2 | 100k | Roberta | 0.86505 |
| 0 | 10k | Roberta | 0.8525 |
| 6 | 100k | Vinai | 0.89595 |
| 3 | 100k+N | Vinai | 0.88855 |
| 7 | 10k | Vinai | 0.886 |
| 1 | 500k | Vinai | 0.89948 |
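
To read the table at a glance, one can keep the best run per model. The sketch below does so in plain Python, with the rows copied from the table above; the variable names are only illustrative.

```python
# (set, model, accuracy) rows copied from the table above
results = [
    ("100k", "Pysentimiento", 0.8952),
    ("10k", "Pysentimiento", 0.88),
    ("100k", "Roberta", 0.86505),
    ("10k", "Roberta", 0.8525),
    ("100k", "Vinai", 0.89595),
    ("100k+N", "Vinai", 0.88855),
    ("10k", "Vinai", 0.886),
    ("500k", "Vinai", 0.89948),
]

# Keep, for each model, the (set, accuracy) pair with the highest accuracy
best = {}
for set_name, model, acc in results:
    if model not in best or acc > best[model][1]:
        best[model] = (set_name, acc)

for model, (set_name, acc) in sorted(best.items()):
    print(f"{model}: {acc} (set {set_name})")
```

In these runs, a larger training set gave a higher accuracy for every model, and normalizing numbers (+N) did not help Vinai at 100k.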
