# Fine-Tuning an XLNet Model for Text Classification

### Download the data from Kaggle:
 - https://www.kaggle.com/c/nlp-getting-started/data

In this competition, you’re challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [3]:
df_train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
df_train.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
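
The counts above show a mild imbalance towards class 0. As a rough worked example (not part of the original notebook), the sketch below turns those counts into class shares and into the accuracy of a trivial model that always predicts the majority class. That figure is a useful floor when reading the fine-tuned model's accuracy later.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}
total = sum(counts.values())

# Share of each class in the training data.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total)
print({label: round(share, 4) for label, share in shares.items()})
print(majority_label, round(baseline_accuracy, 4))
```

Any classifier worth keeping should clear this majority-class baseline of roughly 57% accuracy.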

In [5]:
df_train.keyword.isnull().sum()/df_train.shape[0]*100

0.8012610009194797

In [6]:
df_train.location.isnull().sum()/df_train.shape[0]*100

33.27203467752528
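
To make the two percentages above more concrete, this small sketch converts them back into row counts. It assumes the training set has 7,613 rows, which is the sum of the two class counts printed earlier.

```python
n_rows = 7613  # training rows, i.e. 4342 + 3271 from the class counts

# Percentages printed by the two cells above.
pct_missing = {"keyword": 0.8012610009194797, "location": 33.27203467752528}

# Convert each percentage back to a whole number of rows.
missing_rows = {col: round(pct / 100 * n_rows) for col, pct in pct_missing.items()}
print(missing_rows)
```

Under that assumption, `keyword` is missing in only a few dozen rows, while `location` is missing in roughly a third of them, which is one reason to rely on the tweet text alone.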

In [7]:
df_train.sample(10)['text'].tolist()

['Japan Marks 70th Anniversary of Hiroshima Atomic Bombing http://t.co/93vqkdFgnr',
 'Video:  Fire burns two apartment buildings and blows up car in Manchester http://t.co/5BGcw3EzB5',
 '@Bloodbath_TV favourite YouTube channel going right now.\nLove everything you guys do and thank you introducing me to Dude Bro Party Massacre',
 "Don't tell the bride gives me the fear",
 "@CacheAdvance besides your nasty thunderstorm or snowstorm nah. Can't say that I have.",
 'AUTH LOUIS VUITTON BROWN SAUMUR 35 CROSS BODY SHOULDER BAG MONOGRAM 7.23 419-3 - Full read\x89Û_ http://t.co/HCDiwE5flc http://t.co/zLvEbEoavG',
 'Petition | Heartless owner that whipped horse until it collapsed is told he can KEEP his animal! Act Now! http://t.co/ym3cWw28dJ',
 'MP train derailment: Village youth saved many lives\nhttp://t.co/lTYeFJdM3A #IndiaTV http://t.co/0La1aw9uUd',
 "Our tipster previews Chelsea v Swansea &amp; there's a 48/1 double! http://t.co/PFSrYJS1pc \n#Chelsea #Hazard http://t.co/SKdBot7TGF",
 '@bet

## Cleaning
 - Strip the `#` symbol so the hashtag word itself stays in the text
 - Remove usernames (mentions starting with `@`)
 - Remove links (URLs)
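
The three cleaning steps listed above can be approximated with plain regular expressions. The sketch below is only an illustration: the patterns and the `simple_clean` name are hypothetical and are not the rules that `tweet-preprocessor` applies internally. The example URL is made up as well.

```python
import re

# Hypothetical patterns for illustration; NOT the rules used by tweet-preprocessor.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def simple_clean(text: str) -> str:
    """Approximate the three cleaning steps with plain regexes."""
    text = text.replace("#", "")          # keep the hashtag word, drop the symbol
    text = URL_PATTERN.sub("", text)      # remove links
    text = MENTION_PATTERN.sub("", text)  # remove @usernames
    return " ".join(text.split())         # collapse leftover whitespace


print(simple_clean("@CacheAdvance Forest #fire near La Ronge http://t.co/abc123"))
```

The `#` is removed before anything else so that a word like `#fire` survives as `fire` instead of being dropped along with the hashtag.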

In [8]:
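! pip install tweet-preprocessor
import preprocessor as p

def clean_text(text):
  # Drop the '#' symbol first so the hashtag word itself is kept
  text = text.replace("#","")
  # p.clean strips URLs and @mentions (among other tweet artifacts)
  return p.clean(text)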

Collecting tweet-preprocessor
  Downloading tweet_preprocessor-0.6.0-py3-none-any.whl (27 kB)
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [9]:
from tqdm.notebook import tqdm
tqdm.pandas()

df_train['clean_text'] = df_train['text'].astype(str).progress_map(clean_text)
df_test['clean_text'] = df_test['text'].astype(str).progress_map(clean_text)

  0%|          | 0/7613 [00:00<?, ?it/s]

  0%|          | 0/3263 [00:00<?, ?it/s]

In [10]:
# Split the labelled data into training and evaluation sets, stratified on the target
X = df_train['clean_text']
y = df_train['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [11]:
train_df = pd.DataFrame(X_train)
train_df['target'] = y_train

eval_df = pd.DataFrame(X_test)
eval_df['target'] = y_test
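
During training, Simple Transformers logs "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels." because the frames built here use the column names `clean_text` and `target`. The fallback works, but renaming the columns to `text` and `labels` makes the intent explicit. A minimal sketch on a toy frame (the two rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for train_df; the real frame has the same two columns.
toy_df = pd.DataFrame(
    {"clean_text": ["Forest fire near La Ronge", "I love fruits"], "target": [1, 0]}
)

# Simple Transformers looks for the column names "text" and "labels".
renamed_df = toy_df.rename(columns={"clean_text": "text", "target": "labels"})

print(list(renamed_df.columns))
```

The same `rename` call could be applied to both `train_df` and `eval_df` before training.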

In [12]:
train_df.shape, eval_df.shape

((6090, 2), (1523, 2))
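
The two shapes follow directly from `test_size=0.2`. As a quick check, and assuming scikit-learn rounds a fractional test size up and gives the remaining rows to the training split, the arithmetic looks like this:

```python
import math

n_samples = 7613   # rows in train.csv
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the evaluation split is rounded up and the
# training split receives whatever rows are left.
n_eval = math.ceil(test_size * n_samples)
n_train = n_samples - n_eval

print(n_train, n_eval)
```

Because the split is stratified on `y`, both parts should keep roughly the same class ratio as the full training set.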

In [13]:
# simpletransformers: a high-level wrapper around Hugging Face Transformers pretrained models
!pip install -U simpletransformers 

Collecting simpletransformers
  Downloading simpletransformers-0.63.3-py3-none-any.whl (247 kB)
Collecting datasets
  Downloading datasets-1.17.0-py3-none-any.whl (306 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting streamlit
  Downloading streamlit-1.3.1-py2.py3-none-any.whl (9.2 MB)
Collecting transformers>=4.6.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting wandb>=0.10.32
  Downloading wandb-0.12.9-py2.py3-none-any.whl (1.7 MB)

In [14]:
%%capture
!pip install wandb

In [15]:
import wandb

wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [16]:
from simpletransformers.classification import ClassificationModel
import logging
# Import the submodule explicitly; a bare `import sklearn` does not guarantee sklearn.metrics is loaded
import sklearn.metrics


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
# Reference only: a fuller set of arguments, kept as an inert string (never passed to the model)
'''
args = {
   'output_dir': 'outputs/',
   'cache_dir': 'cache/',
   'fp16': True,
   'fp16_opt_level': 'O1',
   'max_seq_length': 256,
   'train_batch_size': 8,
   'eval_batch_size': 8,
   'gradient_accumulation_steps': 1,
   'num_train_epochs': 3,
   'weight_decay': 0,
   'learning_rate': 4e-5,
   'adam_epsilon': 1e-8,
   'warmup_ratio': 0.06,
   'warmup_steps': 0,
   'max_grad_norm': 1.0,
   'logging_steps': 50,
   'evaluate_during_training': False,
   'save_steps': 2000,
   'eval_all_checkpoints': True,
   'use_tensorboard': True,
   'overwrite_output_dir': True,
   'reprocess_input_data': False,
   "wandb_project": "Question Answer Application",
    "wandb_kwargs": {"name": model_name},
}

'''

# Create a ClassificationModel
# You can set class weights by using the optional weight argument
model = ClassificationModel(
    'xlnet',
    'xlnet-base-cased',
    args={
        'num_train_epochs': 10,
        'train_batch_size': 32,
        'max_seq_length': 128,
        'wandb_project': 'xlnetsimple',
        'wandb_kwargs': {'name': 'xlnet-base-cased'},
    },
)

# Train the model
model.train_model(train_df)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(eval_df, acc=sklearn.metrics.accuracy_score)

Downloading:   0%|          | 0.00/760 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/445M [00:00<?, ?B/s]

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'logits_proj.bias', 'sequence_summary.summary.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

Downloading:   0%|          | 0.00/779k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/6090 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_xlnet_128_2_2


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Initializing WandB run for training.
wandb: Currently logged in as: huma_ (use `wandb login --relogin` to force relogin)


Running Epoch 0 of 10:   0%|          | 0/191 [00:00<?, ?it/s]



Running Epoch 1 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/191 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Training of xlnet model complete. Saved to outputs/.
  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1523 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_xlnet_128_2_2


Running Evaluation:   0%|          | 0/191 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Initializing WandB run for evaluation.



metric,history
Training loss,▆█▇▄▇▄▆▄▄▄▃▂▃▃▁▃▄▂▁▁▃▁▁▅▁▃▁▁▁▂▁▁▃▃▁▁▃▁
global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
lr,▄▇███▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁

metric,final value
Training loss,0.00199
global_step,1900.0
lr,0.0


INFO:simpletransformers.classification.classification_model:{'mcc': 0.618871783785417, 'tp': 529, 'tn': 707, 'fp': 162, 'fn': 125, 'auroc': 0.8809653262388136, 'auprc': 0.879675707961266, 'acc': 0.8115561391989494, 'eval_loss': 1.070433967010513}
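
The progress-bar totals in the log above (191 steps per training epoch, 191 evaluation batches) can be reproduced from the dataset sizes and batch sizes. This is a back-of-the-envelope sketch. The training batch size of 32 comes from the `args` passed to the model, while the evaluation batch size of 8 is an assumption that happens to match the totals shown.

```python
import math


def n_batches(n_examples: int, batch_size: int) -> int:
    """Number of batches needed to cover n_examples (the last one may be partial)."""
    return math.ceil(n_examples / batch_size)


train_steps_per_epoch = n_batches(6090, 32)     # train_batch_size from the args
total_train_steps = train_steps_per_epoch * 10  # num_train_epochs
eval_batches = n_batches(1523, 8)               # assumed eval batch size of 8
predict_batches = n_batches(3263, 8)            # same assumption, for test.csv

print(train_steps_per_epoch, total_train_steps, eval_batches, predict_batches)
```

The total of 1,910 optimizer steps is consistent with the last logged `global_step` of 1900 if metrics are logged every 50 steps, which is again an assumption about the logging interval.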


In [17]:
result

{'acc': 0.8115561391989494,
 'auprc': 0.879675707961266,
 'auroc': 0.8809653262388136,
 'eval_loss': 1.070433967010513,
 'fn': 125,
 'fp': 162,
 'mcc': 0.618871783785417,
 'tn': 707,
 'tp': 529}
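
The `result` dictionary reports the raw confusion-matrix counts, so the other common metrics can be derived by hand. The sketch below recomputes accuracy and MCC as a consistency check and adds precision, recall, and F1, which the dictionary does not include. F1 is worth a look because the Kaggle competition scores submissions with it.

```python
import math

# Confusion-matrix counts copied from the result above.
tp, tn, fp, fn = 529, 707, 162, 125

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient for a binary confusion matrix.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} mcc={mcc:.4f}")
```

The recomputed accuracy and MCC should agree with the `acc` and `mcc` entries reported by `eval_model`.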

In [18]:
predictions, raw_outputs = model.predict(df_test.clean_text.tolist())

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/3263 [00:00<?, ?it/s]

  0%|          | 0/408 [00:00<?, ?it/s]
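
`model.predict` returns both hard labels (`predictions`) and the raw model scores (`raw_outputs`). If class probabilities are needed, for example to tune a decision threshold, the raw scores can be passed through a softmax. The sketch below uses made-up logits, not real model output, and a hand-written `softmax` helper so that it runs without extra dependencies.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two tweets (columns: class 0, class 1); not real model output.
example_raw_outputs = [[2.1, -1.3], [-0.4, 0.9]]

probabilities = [softmax(row) for row in example_raw_outputs]
predicted = [row.index(max(row)) for row in probabilities]
print(predicted)
```

Taking the index of the largest probability in each row gives the same label as taking the index of the largest raw score, so this step only matters when the probabilities themselves are used.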

In [20]:
sample_sub=pd.read_csv("sample_submission.csv")
sample_sub['target'] = predictions

sample_sub.to_csv("submission_09092020_xlnet_base.csv", index=False)