Hi everyone, this is my first notebook. Please show your support, and let me know if you have any remarks or questions.
Enjoy! XD


In this tutorial, I'll show you how to fine-tune the pretrained XLNet model with the Hugging Face `transformers` library to perform sentiment analysis on a Tunisian dataset that is already available on Kaggle.

Hugging Face initially supported only PyTorch, but TensorFlow 2.0 is now well supported too, and that's what I use here.

The general pipeline for any transformer model is:
**Tokenizer definition → Tokenization of documents → Model definition → Model training → Inference**


1. Import libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sns
import transformers

import nltk
import re


from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

plt.style.use('seaborn')

In [2]:
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

2.1.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


This is the link to the dataset: https://www.kaggle.com/naim99/tunisian-texts

In [4]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tunisian-texts/tun.xlsx


In [6]:
dt= pd.read_excel('../input/tunisian-texts/tun.xlsx',header=0)

In [7]:
dt

Unnamed: 0,texts,data_labels
0,[المجلس الأعلى للأمن الجزائر برئاسة الرئيس عبد...,1
1,[سنتخذ إجراءات لحماية حدودنا وإقليمنا وسنفعل د...,1
2,[الصين تطلق الصاروخ الفضائي القوي (longue marc...,1
3,[المصدرfrance 24],1
4,[هذا عام ينقصي أعمارنا وعام يقبل علينا فماذا ف...,1
...,...,...
32813,[علم موقع نسمة، باخرة إيطالية، وصلت ساعة متأخر...,0
32814,[رجع الهم، زايد بلاد بلا راجل، موش سكرتو الحدو...,0
32815,[نداء الى رئيس الجمهورية],0
32816,[هبط الجيش واقفل الحدود وأعلن الحالة القصوى وك...,0


**Text cleaning:**
- remove strange characters
- replace URLs with a `www` placeholder (they don't tell us much)
- replace usernames with a bare "@" character
- remove line breaks
- remove the "#" symbol and lowercase the text

In [13]:
def clean_text(text):
    """Clean a pandas Series of strings and return the cleaned Series."""
    # Remove dots, except those directly followed by a word character
    reg = re.compile(r'\.+?(?=\B|$)')
    clean = text.apply(lambda r: re.sub(reg, string=r, repl=''))
    # Replace a mis-encoded character sequence with a space
    reg = re.compile(r'\x89Û_')
    clean = clean.apply(lambda r: re.sub(reg, string=r, repl=' '))
    # Decode the HTML entity for "&" (with or without the trailing ";")
    reg = re.compile(r'&amp;?')
    clean = clean.apply(lambda r: re.sub(reg, string=r, repl='&'))
    # Replace line breaks with spaces
    reg = re.compile(r'\n')
    clean = clean.apply(lambda r: re.sub(reg, string=r, repl=' '))

    # Remove the hashtag symbol (#)
    clean = clean.apply(lambda r: r.replace('#', ''))

    # Replace user names with a bare "@"
    reg = re.compile(r'@[a-zA-Z0-9_]+')
    clean = clean.apply(lambda r: re.sub(reg, string=r, repl='@'))

    # Replace URLs with the placeholder "www"
    reg = re.compile(r'https?\S+(?=\s|$)')
    clean = clean.apply(lambda r: re.sub(reg, string=r, repl='www'))

    # Lowercase
    clean = clean.apply(lambda r: r.lower())
    return clean

In [14]:
dt['clean'] = clean_text(dt['texts'])
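
To see what these rules do to a single string, here is a minimal sketch that uses only `re`. It covers a subset of the substitutions in `clean_text` (line breaks, hashtags, usernames, URLs, lowercasing); `clean_one` and the sample text are made up for illustration.

```python
import re

def clean_one(text):
    """Apply a subset of the clean_text rules to one string (hypothetical helper)."""
    text = re.sub(r'\n', ' ', text)                    # line breaks -> space
    text = text.replace('#', '')                       # drop the hashtag symbol
    text = re.sub(r'@[a-zA-Z0-9_]+', '@', text)        # usernames -> bare "@"
    text = re.sub(r'https?\S+(?=\s|$)', 'www', text)   # URLs -> "www" placeholder
    return text.lower()

# Made-up sample, not taken from the dataset
sample = "Merci @Some_User\n#Tunisie https://example.com/page"
cleaned = clean_one(sample)
print(cleaned)  # merci @ tunisie www
```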

**Tokenizer Definition**

Every transformer-based model has its own tokenization technique and its own use of special tokens. The `transformers` library takes care of this for us: it provides a matching tokenizer for every model it supports.

In [11]:
from transformers import TFXLNetModel, XLNetTokenizer

We are going to fine-tune a pretrained transformer model on our custom dataset.

In [18]:
xlnet_model = 'xlnet-large-cased'
xlnet_tokenizer = XLNetTokenizer.from_pretrained(xlnet_model)

In [15]:
def create_xlnet(mname):
    """ Creates the model. It is composed of the XLNet main block, on top of
    which a classification head is added.
    """
    # Define token ids as inputs
    word_inputs = tf.keras.Input(shape=(120,), name='word_inputs', dtype='int32')

    # Call XLNet model
    xlnet = TFXLNetModel.from_pretrained(mname)
    xlnet_encodings = xlnet(word_inputs)[0]

    # CLASSIFICATION HEAD 
    # Collect last step from last hidden state (CLS)
    doc_encoding = tf.squeeze(xlnet_encodings[:, -1:, :], axis=1)
    # Apply dropout for regularization
    doc_encoding = tf.keras.layers.Dropout(.1)(doc_encoding)
    # Final output 
    outputs = tf.keras.layers.Dense(1, activation='sigmoid', name='outputs')(doc_encoding)

    # Compile model
    model = tf.keras.Model(inputs=[word_inputs], outputs=[outputs])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

    return model

In [16]:
xlnet = create_xlnet(xlnet_model)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: unexpected indent (<unknown>, line 71)


In [17]:
xlnet.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
word_inputs (InputLayer)     [(None, 120)]             0         
_________________________________________________________________
tfxl_net_model (TFXLNetModel ((None, 120, 1024),)      360268800 
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 1, 1024)]         0         
_________________________________________________________________
tf_op_layer_Squeeze (TensorF [(None, 1024)]            0         
_________________________________________________________________
dropout_73 (Dropout)         (None, 1024)              0         
_________________________________________________________________
outputs (Dense)              (None, 1)                 1025      
Total params: 360,269,825
Trainable params: 360,269,825
Non-trainable params: 0
_______________________________________________

Select the cleaned texts and labels, then split the data

In [19]:
tweets = dt['clean']
labels = dt['data_labels']

X_train, X_test, y_train, y_test = train_test_split(tweets, labels, test_size=0.15, random_state=196)

The next step is to tokenize the documents, using either the tokenizer's `encode()` or `encode_plus()` method.

XLNet requires specifically formatted inputs. For each tokenized input sentence, we need to create:

**input ids**: a sequence of integers mapping each input token to its index in the XLNet tokenizer vocabulary

**segment mask**: (optional) a sequence of 1s and 0s that marks whether the input is one sentence or two sentences long (for two-sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence)

**attention mask**: (optional) a sequence of 1s and 0s, with 1s for all input tokens and 0s for all padding tokens

**max_length**: the maximum length of any sentence to tokenize; it's a hyperparameter (BERT, for comparison, was originally limited to 512 tokens)

**pad_to_max_length**: pads every sequence up to `max_length`

**add_special_tokens**: adds the model's special tokens, such as `<sep>` and `<cls>`, to the sequence
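
To make the relation between padding and the attention mask concrete, here is a small sketch in plain Python. It pads on the left, which as far as I know is what the XLNet tokenizer does by default (and why the model above reads the last position). `pad_left`, the token ids, and `pad_id=0` are made up for illustration; the real ids come from the tokenizer.

```python
def pad_left(token_ids, max_len, pad_id=0):
    """Left-pad a list of token ids and build the matching attention mask.

    Hypothetical helper: truncation is deliberately not handled here.
    """
    if len(token_ids) > max_len:
        raise ValueError("sequence longer than max_len")
    n_pad = max_len - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids
    # 0 marks padding positions, 1 marks real tokens
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return input_ids, attention_mask

# Made-up ids standing in for a tokenized sentence
padded_ids, mask = pad_left([17, 42, 8, 99], max_len=6)
print(padded_ids)  # [0, 0, 17, 42, 8, 99]
print(mask)        # [0, 0, 1, 1, 1, 1]
```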

In [20]:
def get_inputs(tweets, tokenizer, max_len=120):
    """ Gets tensors from text using the tokenizer provided"""
    inps = [tokenizer.encode_plus(t, max_length=max_len, pad_to_max_length=True, add_special_tokens=True) for t in tweets]
    inp_tok = np.array([a['input_ids'] for a in inps])
    ids = np.array([a['attention_mask'] for a in inps])
    segments = np.array([a['token_type_ids'] for a in inps])
    return inp_tok, ids, segments

def warmup(epoch, lr):
    """Increases the learning rate slowly (by 1e-6 per epoch, never returning less than 2e-5);
    this tends to give better convergence. However, as we are fine-tuning for only a few
    epochs, it's not crucial.
    """
    return max(lr +1e-6, 2e-5)

def plot_metrics(pred, true_labels):
    """Plots a ROC curve with the accuracy and the AUC"""
    acc = accuracy_score(true_labels, np.array(pred.flatten() >= .5, dtype='int'))
    fpr, tpr, thresholds = roc_curve(true_labels, pred)
    auc = roc_auc_score(true_labels, pred)

    fig, ax = plt.subplots(1, figsize=(8,8))
    ax.plot(fpr, tpr, color='red')
    ax.plot([0,1], [0,1], color='black', linestyle='--')
    ax.set_title(f"AUC: {auc}\nACC: {acc}");
    return fig

Create the input data (tensors)

In [21]:
inp_tok, ids, segments = get_inputs(X_train, xlnet_tokenizer)

### Training

In [22]:
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=4, min_delta=0.02, restore_best_weights=True),
    tf.keras.callbacks.LearningRateScheduler(warmup, verbose=0),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=1e-6, patience=2, verbose=0, mode='auto', min_delta=0.001, cooldown=0, min_lr=1e-6)
]
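
To see what the `warmup` schedule actually returns, this sketch replays it by hand for a few epochs. It assumes the rate starts at 2e-5 (the value given to Adam) and ignores the other two callbacks, so it only shows the schedule in isolation.

```python
def warmup(epoch, lr):
    # Same rule as the helper defined earlier
    return max(lr + 1e-6, 2e-5)

lr = 2e-5  # assumed starting rate, taken from the optimizer setting
schedule = []
for epoch in range(4):
    lr = warmup(epoch, lr)
    schedule.append(lr)

formatted = [f"{v:.1e}" for v in schedule]
print(formatted)  # ['2.1e-05', '2.2e-05', '2.3e-05', '2.4e-05']
```

With a single epoch, as used below, only the first value matters.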

I trained for one epoch, but ideally you can try 4.

In [23]:
hist = xlnet.fit(x=inp_tok, y=y_train, epochs=1, batch_size=16, validation_split=.15, callbacks=callbacks)

Train on 23710 samples, validate on 4185 samples
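
The sample counts in this log follow from `validation_split=.15`. As a quick check, assuming Keras places the split point at `int(n * (1 - validation_split))` and holds out the rows after it for validation (my reading of its behavior, not something the log states):

```python
validation_split = 0.15

# Rows passed to fit(), recovered from the log line above
n_total = 23710 + 4185

# Assumed rule: floor the split point, keep the tail for validation
split_at = int(n_total * (1 - validation_split))
n_fit = split_at
n_val = n_total - split_at

print(n_fit, n_val)  # 23710 4185
```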


### Testing

In [24]:
text=['المنوج باهي برشا ومحلاه']
inp_tok, ids, segments = get_inputs(text, xlnet_tokenizer)
preds = xlnet.predict(inp_tok, verbose=True)



In [25]:
preds

array([[0.79322964]], dtype=float32)

In [29]:
text=['كانت الخدمة super!']
inp_tok, ids, segments = get_inputs(text, xlnet_tokenizer)
preds = xlnet.predict(inp_tok, verbose=True)
preds



array([[0.8066348]], dtype=float32)
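
The model outputs a sigmoid probability, so turning it into a class label needs a threshold. Here is a minimal sketch that uses the same 0.5 cut-off as `plot_metrics`; `to_label` is a hypothetical helper, and which sentiment the label 1 stands for depends on how `data_labels` is encoded in the dataset.

```python
def to_label(prob, threshold=0.5):
    """Map a sigmoid output to a 0/1 label (hypothetical helper)."""
    return int(prob >= threshold)

# The two predictions printed above
probs = [0.79322964, 0.8066348]
labels = [to_label(p) for p in probs]
print(labels)  # [1, 1]
```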