# Fine-Tuning a RoBERTa Model for Text Classification

### Download the data from Kaggle: 
 - https://www.kaggle.com/c/nlp-getting-started/data
 
In this competition, you are challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren't. You'll have access to a dataset of 10,000 hand-classified tweets.

In [1]:
import pandas as pd
import numpy as np

In [2]:
%%capture
!pip install wandb

In [3]:
import wandb

wandb.login()

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

In [4]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [5]:
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
df_train.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
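
The class counts above show a mild imbalance. As a rough sketch (not part of the original notebook), the snippet below derives two numbers from them: the accuracy of a trivial majority-class predictor, which any model should beat, and inverse-frequency class weights of the kind that could be passed to the optional `weight` argument of `ClassificationModel`. The weighting formula is one common heuristic and an assumption here, and the counts are for the full training file rather than the later train/validation split.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
n_classes = len(counts)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
# This is a common heuristic, not something the notebook itself uses.
class_weights = [total / (n_classes * counts[label]) for label in sorted(counts)]

print(f"rows: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
print("class weights:", [round(w, 3) for w in class_weights])
```

The rarer class (label 1) receives the larger weight, so its errors would count more in the loss if such weights were used.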

In [7]:
df_train.keyword.isnull().sum()/df_train.shape[0]*100

0.8012610009194797

In [8]:
df_train.location.isnull().sum()/df_train.shape[0]*100

33.27203467752528

In [9]:
df_train.sample(10)['text'].tolist()

["Deadpool is already one of my favourite marvel characters and all I know is he wears a red suit so the bad guys can't tell if he's bleeding",
 "Did you miss the #BitCoin explosion - Don't miss out - #Hangout tonight at 8:30PM EST ===&gt;&gt;&gt; http://t.co/qKaHXwLWXa",
 'Death certificates safes weapons and Teslas: DEF CON 23 #Security http://t.co/KMDQm3NlnS',
 '#fitness Knee Damage Solution http://t.co/pUMbrNeBJE',
 'PHOTOS: Green Line derailment near Cottage Grove and Garfield: http://t.co/4d9Cd4mnVh http://t.co/UNhqCQ6Bex',
 '@KabarMesir @badr58 \nNever dies a big Crime like RABAA MASSACRE as long the revolution is being observed.\n#rememberRABAA',
 '@RAYCHIELOVESU On the block we hear sirens&amp; stories of kids getting Lemonade only to see their life get minute made. we talking semi paid',
 'love 106.1 The Twister @1061thetwister  and Maddie and Tae #OKTXDUO',
 "@zaynmalik don't overwork yourself. Your album is gonna be fire just don't overwork or stress! I love you take care"]

## Cleaning
 - Remove the `#` symbol while keeping the hashtag text
 - Remove usernames starting with `@`
 - Remove links
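
To make the three steps concrete, here is a minimal standard-library sketch. It only approximates what the notebook does with `tweet-preprocessor`: the regular expressions and the name `simple_clean` are illustrative assumptions, and the real library may strip further token types (for example emojis or numbers) depending on its options.

```python
import re

# Illustrative patterns; these are assumptions, not the library's own rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def simple_clean(text: str) -> str:
    """Approximate the cleaning steps listed above."""
    text = text.replace("#", "")      # drop the '#' but keep the hashtag word
    text = URL_RE.sub("", text)       # remove links
    text = MENTION_RE.sub("", text)   # remove @usernames
    return " ".join(text.split())     # collapse leftover whitespace


print(simple_clean("#fitness Knee Damage Solution http://t.co/pUMbrNeBJE"))
print(simple_clean("@zaynmalik don't overwork yourself."))
```

Removing `#` before the other steps matters: the hashtag word then looks like ordinary text and survives cleaning.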

In [10]:
! pip install tweet-preprocessor
import preprocessor as p

def clean_text(text):
  text = text.replace("#","")
  return p.clean(text)

Collecting tweet-preprocessor
  Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [11]:
from tqdm.notebook import tqdm
tqdm.pandas()

df_train['clean_text'] = df_train['text'].astype(str).progress_map(clean_text)
df_test['clean_text'] = df_test['text'].astype(str).progress_map(clean_text)

  0%|          | 0/7613 [00:00<?, ?it/s]

  0%|          | 0/3263 [00:00<?, ?it/s]

In [12]:
# Split the labelled data into training and validation sets
X = df_train['clean_text']
y = df_train['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [13]:
train_df = pd.DataFrame(X_train)
train_df['target'] = y_train

eval_df = pd.DataFrame(X_test)
eval_df['target'] = y_test

In [14]:
train_df.shape, eval_df.shape

((6090, 2), (1523, 2))
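
The shapes above follow from simple split arithmetic. The check below is a sketch that assumes scikit-learn rounds a fractional `test_size` up to a whole number of rows; the variable names are illustrative. It also estimates how many positive tweets a stratified split should place in the validation set.

```python
import math

n_samples = 7613      # rows in train.csv
n_positive = 3271     # rows with target == 1
test_size = 0.2

# Assumption: the held-out size is the ceiling of test_size * n_samples.
n_eval = math.ceil(test_size * n_samples)
n_train = n_samples - n_eval

# With stratify=y, each split keeps roughly the overall positive share.
expected_eval_positive = round(n_eval * n_positive / n_samples)

print(n_train, n_eval, expected_eval_positive)
```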

In [15]:
# simpletransformers: a high-level wrapper around Hugging Face Transformers
!pip install -U simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.63.3-py3-none-any.whl (247 kB)

In [16]:
from simpletransformers.classification import ClassificationModel
import pandas as pd
import logging
import sklearn


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
'''
args = {
   'output_dir': 'outputs/',
   'cache_dir': 'cache/',
   'fp16': True,
   'fp16_opt_level': 'O1',
   'max_seq_length': 256,
   'train_batch_size': 8,
   'eval_batch_size': 8,
   'gradient_accumulation_steps': 1,
   'num_train_epochs': 3,
   'weight_decay': 0,
   'learning_rate': 4e-5,
   'adam_epsilon': 1e-8,
   'warmup_ratio': 0.06,
   'warmup_steps': 0,
   'max_grad_norm': 1.0,
   'logging_steps': 50,
   'evaluate_during_training': False,
   'save_steps': 2000,
   'eval_all_checkpoints': True,
   'use_tensorboard': True,
   'overwrite_output_dir': True,
   'reprocess_input_data': False,
}

'''

# Create a ClassificationModel (RoBERTa base).
# Class weights can be set with the optional `weight` argument.
model = ClassificationModel(
    "roberta",
    "roberta-base",
    args={
        "num_train_epochs": 10,
        "train_batch_size": 32,
        "max_seq_length": 128,
        "wandb_project": "robertasimple",
        "wandb_kwargs": {"name": "roberta-base"},
    },
)

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df, acc=sklearn.metrics.accuracy_score)


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi


  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/6090 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_roberta_128_2_2


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Initializing WandB run for training.
wandb: Currently logged in as: huma_ (use `wandb login --relogin` to force relogin)


Running Epoch 0 of 10:   0%|          | 0/191 [00:00<?, ?it/s]


INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.
  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1523 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_roberta_128_2_2


Running Evaluation:   0%|          | 0/191 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Initializing WandB run for evaluation.



| Metric | History |
|---|---|
| Training loss | ███▇▆▅▆▆▆▅▃▄▃▃▃▂▃▂▃▃▄▅▂▁▃▃▁▃▃▁▁▁▃▂▁▂▁▄ |
| global_step | ▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███ |
| lr | ▄▇███▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁ |

| Metric | Final value |
|---|---|
| Training loss | 0.20071 |
| global_step | 1900.0 |
| lr | 0.0 |


INFO:simpletransformers.classification.classification_model:{'mcc': 0.6311063270623622, 'tp': 522, 'tn': 725, 'fp': 144, 'fn': 132, 'auroc': 0.8739359100234724, 'auprc': 0.837260233203642, 'acc': 0.8187787261982928, 'eval_loss': 1.0221283376528956}
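
The step counts and the rise-then-fall shape of the logged learning rate can be reproduced with a little arithmetic. The sketch below assumes a linear warmup followed by linear decay, and it takes `learning_rate`, `warmup_ratio` and `logging_steps` from the commented-out `args` dictionary earlier in the notebook; whether the run actually used these values is an assumption, and `linear_warmup_decay` is an illustrative helper, not a library function.

```python
import math

n_train = 6090        # training rows after the split
batch_size = 32       # train_batch_size used for the run
epochs = 10           # num_train_epochs used for the run

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Assumed values, taken from the commented-out args dictionary.
peak_lr = 4e-5
warmup_ratio = 0.06
logging_steps = 50

warmup_steps = math.ceil(total_steps * warmup_ratio)
last_logged_step = (total_steps // logging_steps) * logging_steps


def linear_warmup_decay(step: int) -> float:
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps, warmup_steps, last_logged_step)
print(f"lr at last logged step: {linear_warmup_decay(last_logged_step):.2e}")
```

Under these assumptions the last logged step falls just short of the end of training, where the learning rate has almost reached zero.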


In [17]:
result

{'acc': 0.8187787261982928,
 'auprc': 0.837260233203642,
 'auroc': 0.8739359100234724,
 'eval_loss': 1.0221283376528956,
 'fn': 132,
 'fp': 144,
 'mcc': 0.6311063270623622,
 'tn': 725,
 'tp': 522}
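
The confusion-matrix counts in `result` are enough to derive the other standard metrics by hand. The following sketch recomputes accuracy and MCC from those counts and adds precision, recall and F1, which the notebook does not report; F1 is worth checking because it is the metric this Kaggle competition is scored on.

```python
import math

# Counts copied from the evaluation result above.
tp, tn, fp, fn = 522, 725, 144, 132

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient for the binary case.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(
    f"accuracy={accuracy:.4f} precision={precision:.4f} "
    f"recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f}"
)
```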

In [18]:
predictions, raw_outputs = model.predict(df_test.clean_text.tolist())

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3263 [00:00<?, ?it/s]

  0%|          | 0/408 [00:00<?, ?it/s]
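
`model.predict` returns hard labels together with the raw model outputs for each text. Assuming those raw outputs are unnormalized per-class scores (logits), a softmax turns them into probabilities, which is useful for inspecting confidence or choosing a custom threshold. The logits below are made-up values for illustration, not outputs from this run.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)                         # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw output for one tweet: [score for class 0, score for class 1].
example_logits = [-1.2, 2.3]

probs = softmax(example_logits)
predicted_label = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], predicted_label)
```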

In [20]:
# Fill the sample submission with the predicted labels and save it
sample_sub = pd.read_csv("sample_submission.csv")
sample_sub['target'] = predictions

sample_sub.to_csv("submission_roberta_base.csv", index=False)