# NLP from bag of words to transformer



This notebook uses the __commonlitreadabilityprize__ data to showcase different approaches to solving an NLP regression problem.
In a real-world application this would be only the last part of a long data science pipeline.
Here we focus only on the most common modelling techniques, going from the most basic to the most complex.

## Models
1. Bag of words
2. TF-IDF
3. Word2Vec
4. Decision tree & ensemble
5. Support vector machine
6. Transformer

## Pre Processing

The first step is to import the packages needed for the NLP prediction process.
Specifically, for the different machine learning tasks we import:

* NLTK (data preprocessing)
* scikit-learn (model implementation)
* re (regular expressions)
* gensim (word2vec)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.tokenize import word_tokenize as wt 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import gensim
from gensim.models import Word2Vec

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Loading data from the challenge:

In [None]:
dataset=pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")
dataset=dataset[['target','excerpt']]
dataset.head()

The first step in our process is cleaning the data and making it machine-readable.
To this end, we perform:

* Stemming
* Stop-word removal
* Tokenization
* Lowercase standardization

In [None]:
data = []

# Build the stop-word set once instead of once per token
stop_words = set(stopwords.words('english'))

for i in range(dataset.shape[0]):

    excerpt = dataset.iloc[i, 1]

    # replace non-alphabetic characters with spaces
    excerpt = re.sub('[^A-Za-z]', ' ', excerpt)

    # lowercase, so that "Go" and "go" are counted as the same word
    excerpt = excerpt.lower()

    # tokenize
    tokens = wt(excerpt)

    # remove stop words and stem the remaining tokens
    processed = [stemmer.stem(word) for word in tokens if word not in stop_words]

    data.append(" ".join(processed))

Next, we build the feature matrix used by the subsequent models. Machine learning models take as input a matrix in which each row is a vector representing one excerpt.

We don't limit ourselves to a single vectorization. We build the matrix with three different approaches, so that we can compare them and avoid tying our conclusions to one representation.
Specifically, we use:

1. Count (bag of words)
2. TF-IDF
3. Word2Vec

Note that word2vec is applied to the raw excerpts, without stemming or stop-word removal.

We will now compare the different vectorization approaches using a simple linear regression.
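To make the count representation concrete, here is a minimal sketch on two toy sentences. The sentences and the names `toy_docs`, `toy_vectorizer`, `toy_counts` and `vocab` are hypothetical and not part of the competition data. Each row of the matrix is one document, each column one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "excerpts" (hypothetical, not taken from the competition data).
toy_docs = ["the cat sat on the mat", "the dog sat"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)       # column labels
print(toy_counts)  # one row per document, one column per word
```

Word order is lost in this representation: only the number of occurrences of each word survives.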


## 1. Bag of words & linear regression


The first model is a simple linear regression trained on the bag of words representation.
First we transform the data into vectors, then we proceed with the classical train/test split.

In [None]:
# creating the feature matrix 
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000) #Bag of words
X = vectorizer.fit_transform(data).toarray()
y = dataset.iloc[:, 0]

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Then we initialize the linear regression, train it, and evaluate it with 5-fold cross-validation on the training set.

In [None]:
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Cross-validated RMSE on the training set (5 folds)
scores = cross_val_score(regr, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())
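The cell above reports a cross-validated RMSE computed on the training split, while the hold-out RMSE printed later comes from the test split. The sketch below shows both numbers side by side on synthetic data. The data and the names `X_toy`, `y_toy`, `toy_model`, `cv_rmse` and `holdout_rmse` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem (hypothetical stand-in for the text features).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
true_coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_toy = X_toy @ true_coefs + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
toy_model = LinearRegression()

# scikit-learn maximizes scores, so RMSE is returned negated; flip the sign.
cv_scores = cross_val_score(toy_model, X_tr, y_tr, cv=5,
                            scoring="neg_root_mean_squared_error")
cv_rmse = -cv_scores.mean()

# Hold-out RMSE: fit once on the training split, score on the test split.
toy_model.fit(X_tr, y_tr)
holdout_rmse = np.sqrt(mean_squared_error(y_te, toy_model.predict(X_te)))

print(f"CV RMSE: {cv_rmse:.4f}  hold-out RMSE: {holdout_rmse:.4f}")
```

When the two values are close, the cross-validation estimate is a reasonable proxy for performance on unseen data.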

From the linear regression we can extract the coefficients, which indicate how strongly each word pushes the prediction up or down.

In [None]:
coef_table = pd.DataFrame(list(vectorizer.get_feature_names())).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regr.coef_.transpose())
coef_table.nlargest(25,'Coefs')

Test-set RMSE and distribution of labels versus predictions:

In [None]:
print(f' RMSE: {mean_squared_error(y_test, y_pred, squared=False):.4f}')
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_pred, ax=ax, label='Prediction')
ax.legend()
plt.show()

## 2. TF-IDF


The second model is a linear regression trained on the TF-IDF representation.
We follow the same approach as for the previous model.
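As a rough illustration of what TF-IDF adds over raw counts, the sketch below recomputes the weights by hand on a toy corpus and compares them with `TfidfVectorizer` under its default settings (smoothed IDF, L2 row normalization). The corpus and the variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not taken from the competition data).
toy_docs = ["the cat sat on the mat", "the dog sat", "the dog barked"]

counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 norm (default norm='l2').
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```

Words that occur in every document, such as "the" here, receive the smallest IDF and therefore contribute less than rarer words.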

In [None]:
# creating the feature matrix 
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000) #TFIDF

X = vectorizer.fit_transform(data).toarray()
y = dataset.iloc[:, 0]

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Cross-validated RMSE on the training set (5 folds)
scores = cross_val_score(regr, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())

In [None]:
print(f' RMSE: {mean_squared_error(y_test, y_pred, squared=False):.4f}')
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_pred, ax=ax, label='Prediction')
ax.legend()
plt.show()

## 3. Word2Vec


The final linear regression presented is based on the word2vec representation.

First of all, we load the pretrained word embedding model.

In [None]:
word2vecModel = gensim.models.KeyedVectors.load_word2vec_format("/kaggle/input/google-pretrain-model/GoogleNews-vectors-negative300.bin.gz", binary=True)

Then we create a function to compute the vector that will represent each excerpt.

In [None]:
def avg_feature_vector(sentence, model, num_features):
    words = sentence.replace('\n', " ").replace(',', ' ').replace('.', " ").split()
    feature_vec = np.zeros((num_features,), dtype="float32")
    n_missing = 0  # words without a pretrained embedding
    for word in words:
        try:
            feature_vec = np.add(feature_vec, model[word])
        except KeyError:
            n_missing += 1
    # average over the words that were found; guard against division by zero
    n_found = len(words) - n_missing
    if n_found > 0:
        feature_vec = np.divide(feature_vec, n_found)
    return feature_vec
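The idea behind the function above can be checked on a tiny hand-made embedding table. In this sketch `toy_embeddings` and `mean_embedding` are hypothetical stand-ins for the pretrained model and for the averaging function. Words without an embedding are skipped, and a text with no known words maps to the zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the pretrained model.
toy_embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
    "sat": np.array([0.0, 0.0, 1.0]),
}


def mean_embedding(text, embeddings, dim):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    if not vectors:
        # No known word: fall back to the zero vector instead of dividing by 0.
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0).astype("float32")


known = mean_embedding("cat sat quietly", toy_embeddings, 3)
unknown = mean_embedding("quietly softly", toy_embeddings, 3)
print(known, unknown)
```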

Then we compute the vector for every excerpt in the dataset.

In [None]:
word2vec_train = np.zeros((len(dataset.index),300),dtype="float32")

for i in range(len(dataset.index)):
    word2vec_train[i] = avg_feature_vector(dataset["excerpt"][i],word2vecModel, 300)

We proceed with the classical train/test split.

In [None]:
# creating the feature matrix 

X = word2vec_train
y = dataset.iloc[:, 0]

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Finally we compute the last linear regression on the word2vec data.

In [None]:
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Cross-validated RMSE on the training set (5 folds)
scores = cross_val_score(regr, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())

In [None]:
print(f' RMSE: {mean_squared_error(y_test, y_pred, squared=False):.4f}')
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_pred, ax=ax, label='Prediction')
ax.legend()
plt.show()

Comparing the RMSE values above, the word2vec representation outperforms both the bag of words and TF-IDF, so it is the representation we use for the next models.


## 4. Decision tree & ensemble


Having chosen Word2Vec as the representation, we now move beyond linear regression.
In this section we evaluate tree-based models.

### Regression Tree


Training and evaluation of a regression tree.

In [None]:
from sklearn.tree import DecisionTreeRegressor 
  
# create a regressor object
Tree = DecisionTreeRegressor(random_state = 42) 
  
# Train the model using the training sets
Tree.fit(X_train, y_train)

# Make predictions using the testing set
y_predT = Tree.predict(X_test)

scores = cross_val_score(Tree, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())

In [None]:
print(f' RMSE: {mean_squared_error(y_test, y_predT, squared=False):.4f}')
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_predT, ax=ax, label='Prediction')
ax.legend()
plt.show()

### Regression Forest


Training and evaluation of a regression forest.

In [None]:
from sklearn.ensemble import RandomForestRegressor 
  
# create a regressor object
Forest = RandomForestRegressor(random_state = 42) 
  
# Train the model using the training sets
Forest.fit(X_train, y_train)

# Make predictions using the testing set
y_predF = Forest.predict(X_test)

scores = cross_val_score(Forest, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())

In [None]:
print(f' RMSE: {mean_squared_error(y_test, y_predF, squared=False):.4f}')

fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_predF, ax=ax, label='Prediction')
ax.legend()
plt.show()

### Boosted Tree


Training and evaluation of a boosted tree.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor 
  
# create a regressor object
Boost = GradientBoostingRegressor(random_state = 42) 
  
# Train the model using the training sets
Boost.fit(X_train, y_train)

# Make predictions using the testing set
y_predB = Boost.predict(X_test)

scores = cross_val_score(Boost, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())

In [None]:
#Error Distribution
print(f' RMSE: {mean_squared_error(y_test, y_predB, squared=False):.4f}')

sns.scatterplot(
    x=y_test, y= y_predB,
    palette=sns.color_palette("hls", 10),
    legend="full")

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_predB, ax=ax, label='Prediction')
ax.legend()
plt.show()

## 5. Support Vector machine

Training and evaluation of a support vector machine.

In [None]:
from sklearn.svm import SVR
  
# create a regressor object
Support = SVR(kernel = 'rbf') 
  
# Train the model using the training sets
Support.fit(X_train, y_train)

# Make predictions using the testing set
y_predS = Support.predict(X_test)

scores = cross_val_score(Support, X_train, y_train, cv=5,scoring='neg_root_mean_squared_error')
-(scores.mean())

In [None]:
print(f' RMSE: {mean_squared_error(y_test, y_predS, squared=False):.4f}')

fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.distplot(y_test, ax=ax, label='Label')
sns.distplot(y_predS, ax=ax, label='Prediction')
ax.legend()
plt.show()

The support vector machine seems to be the model that performs best on this task, so we optimize its hyperparameters with a grid search.

In [None]:
from sklearn.model_selection import GridSearchCV
  
#defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000], 
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']} 
  
grid = GridSearchCV(SVR(), param_grid, refit = True, verbose = 0)
  
#fitting the model for grid search
grid.fit(X, y)

In [None]:
#print best parameter after tuning
print(grid.best_params_)
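One way to carry the result of a grid search into the next step is to unpack `best_params_` directly into a new estimator. The sketch below does this on synthetic data. The data, the smaller grid, the explicit `scoring` argument and the names `toy_grid` and `tuned_svr` are assumptions made for illustration, not the settings used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic non-linear regression problem (hypothetical).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=120)

# Score with negated RMSE so the search optimizes the competition metric.
toy_grid = GridSearchCV(
    SVR(),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

# Reuse the selected parameters instead of copying them by hand.
tuned_svr = SVR(**toy_grid.best_params_)
tuned_svr.fit(X_toy, y_toy)

print(toy_grid.best_params_, f"CV RMSE: {-toy_grid.best_score_:.4f}")
```

Unpacking the dictionary avoids a mismatch between the parameters the search selected and those typed into the final model.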


We now take the SVM with the parameters found by the grid search and evaluate it with five-fold cross-validation.

In [None]:
# KFold with n_splits=5
from sklearn.model_selection import KFold
y_train_num=dataset.target.to_numpy()

fold = KFold(n_splits=5, shuffle=True, random_state=42)
cv=list(fold.split(word2vec_train, y_train_num))

In [None]:
rmses = []
for tr_idx, val_idx in cv: 
    x_tr, x_va = word2vec_train[tr_idx], word2vec_train[val_idx]
    y_tr, y_va = y_train_num[tr_idx], y_train_num[val_idx]
        
    # Training
    model = SVR(kernel = 'rbf',gamma=1,C=1) 
    model.fit(x_tr, y_tr)    
    y_pred = model.predict(x_va)
    rmse =  np.sqrt(mean_squared_error(y_va, y_pred))
    rmses.append(rmse)
    
    
    fig, ax = plt.subplots(1, 1, figsize=(20, 6))
    sns.distplot(y_va, ax=ax, label='Label')
    sns.distplot(y_pred, ax=ax, label='Prediction')
    ax.legend()
    plt.show()
        
print("\n", "Mean Fold RMSE:", np.mean(rmses))    


## 6. Transformer


After training classical models on our dataset, we push further with a model based on transfer learning.
In this case we use the "roberta-base" checkpoint, the base-size variant of [RoBERTa](https://arxiv.org/abs/1907.11692).

#### Import libraries from transformers and define useful functions for the training phase.

In [None]:
import tensorflow as tf
import tensorflow.keras.layers as L
import tensorflow.keras.backend as K
from tensorflow.keras import optimizers, losses, metrics, Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
from transformers import TFAutoModelForSequenceClassification, TFAutoModel, AutoTokenizer
from transformers import RobertaTokenizer, RobertaModel
from sklearn.metrics import mean_squared_error

# Utility functions
def custom_standardization(text):
    text = text.lower() # if encoder is uncased
    text = text.strip()
    return text


def sample_target(features, target):
    mean, stddev = target
    sampled_target = tf.random.normal([], mean=tf.cast(mean, dtype=tf.float32), 
                                      stddev=tf.cast(stddev, dtype=tf.float32), dtype=tf.float32)
    
    return (features, sampled_target)
    

def get_dataset(pandas_df, tokenizer, labeled=True, ordered=False, repeated=False, 
                is_sampled=False, batch_size=32, seq_len=128):
    """
        Return a Tensorflow dataset ready for training or inference.
    """
    text = [custom_standardization(text) for text in pandas_df['excerpt']]
    
    # Tokenize inputs
    tokenized_inputs = tokenizer(text, max_length=seq_len, truncation=True, 
                                 padding='max_length', return_tensors='tf')
    
    if labeled:
        dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': tokenized_inputs['input_ids'], 
                                                      'attention_mask': tokenized_inputs['attention_mask']}, 
                                                      (pandas_df['target'], pandas_df['standard_error'])))
        if is_sampled:
            dataset = dataset.map(sample_target, num_parallel_calls=tf.data.AUTOTUNE)
    else:
        dataset = tf.data.Dataset.from_tensor_slices({'input_ids': tokenized_inputs['input_ids'], 
                                                      'attention_mask': tokenized_inputs['attention_mask']})
        
    if repeated:
        dataset = dataset.repeat()
    if not ordered:
        dataset = dataset.shuffle(1024)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    
    return dataset


def plot_metrics(history):
    metric_list = list(history.keys())
    size = len(metric_list)//2
    fig, axes = plt.subplots(size, 1, sharex='col', figsize=(20, size * 5))
    axes = axes.flatten()
    
    for index in range(len(metric_list)//2):
        metric_name = metric_list[index]
        val_metric_name = metric_list[index+size]
        axes[index].plot(history[metric_name], label='Train %s' % metric_name)
        axes[index].plot(history[val_metric_name], label='Validation %s' % metric_name)
        axes[index].legend(loc='best', fontsize=16)
        axes[index].set_title(metric_name)

    plt.xlabel('Epochs', fontsize=16)
    sns.despine()
    plt.show()

#### Learning strategy

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f'Running on TPU {tpu.master()}')
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

AUTO = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync

#### Hyperparameters

In [None]:
# Model hyperparameters
BATCH_SIZE = 16
LEARNING_RATE = 1e-5
EPOCHS = 5
SEQ_LEN = 256 
# Number of cross-validation folds
N_FOLDS = 5
BASE_MODEL = "/kaggle/input/huggingface-roberta/roberta-base" 

#### Model

In [None]:
def model_fn(encoder, seq_len=256):
    input_ids = L.Input(shape=(seq_len,), dtype=tf.int32, name='input_ids')
    input_attention_mask = L.Input(shape=(seq_len,), dtype=tf.int32, name='attention_mask')
    
    outputs = encoder({'input_ids': input_ids, 
                       'attention_mask': input_attention_mask})
    
    model = Model(inputs=[input_ids, input_attention_mask], outputs=outputs)

    optimizer = optimizers.Adam(learning_rate=LEARNING_RATE)
    model.compile(optimizer=optimizer, 
                  loss=losses.MeanSquaredError(), 
                  metrics=[metrics.RootMeanSquaredError()])
    
    return model


with strategy.scope():
    encoder = TFAutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
    model = model_fn(encoder, SEQ_LEN)
    
model.summary()

#### Training

In [None]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
skf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=42)
oof_pred = []; oof_labels = []; history_list = []; test_pred = []
train=pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")
test=pd.read_csv("/kaggle/input/commonlitreadabilityprize/test.csv")
train.drop(['url_legal', 'license'], axis=1, inplace=True)
test.drop(['url_legal', 'license'], axis=1, inplace=True)

for fold,(idxT, idxV) in enumerate(skf.split(train)):
    if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)
    print(f'\nFOLD: {fold+1}')
    print(f'TRAIN: {len(idxT)} VALID: {len(idxV)}')

    # Model
    K.clear_session()
    with strategy.scope():
        encoder = TFAutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1,hidden_dropout_prob=0.1)
        model = model_fn(encoder, SEQ_LEN)
        
    model_path = f'model_{fold}.h5'
    es = EarlyStopping(monitor='val_root_mean_squared_error', mode='min', 
                       patience=2, restore_best_weights=True, verbose=1)
    checkpoint = ModelCheckpoint(model_path, monitor='val_root_mean_squared_error', mode='min', 
                                 save_best_only=True, save_weights_only=True)

    # Train
    history = model.fit(x=get_dataset(train.loc[idxT], tokenizer, repeated=True, is_sampled=True, 
                                      batch_size=BATCH_SIZE, seq_len=SEQ_LEN), 
                        validation_data=get_dataset(train.loc[idxV], tokenizer, ordered=True, 
                                                    batch_size=BATCH_SIZE, seq_len=SEQ_LEN), 
                        steps_per_epoch=100, 
                        callbacks=[es, checkpoint], 
                        epochs=EPOCHS,  
                        verbose=2).history
      
    history_list.append(history)
    # Load the best weights saved by the checkpoint callback
    model.load_weights(model_path)
    
    # Results
    print(f"#### FOLD {fold+1} OOF RMSE = {np.min(history['val_root_mean_squared_error']):.4f}")

    # OOF predictions
    valid_ds = get_dataset(train.loc[idxV], tokenizer, ordered=True, batch_size=BATCH_SIZE, seq_len=SEQ_LEN)
    oof_labels.append([target[0].numpy() for sample, target in iter(valid_ds.unbatch())])
    x_oof = valid_ds.map(lambda sample, target: sample)
    oof_pred.append(model.predict(x_oof)['logits'])

    # Test predictions
    test_ds = get_dataset(test, tokenizer, labeled=False, ordered=True, batch_size=BATCH_SIZE, seq_len=SEQ_LEN)
    x_test = test_ds.map(lambda sample: sample)
    test_pred.append(model.predict(x_test)['logits'])

#### Training performance

In [None]:
y_true = np.concatenate(oof_labels)
y_preds = np.concatenate(oof_pred)


for fold, history in enumerate(history_list):
    print(f"FOLD {fold+1} RMSE: {np.min(history['val_root_mean_squared_error']):.4f}")
    
print(f'OOF RMSE: {mean_squared_error(y_true, y_preds, squared=False):.4f}')

In [None]:
for fold, history in enumerate(history_list):
    print(f'\nFOLD: {fold+1}')
    plot_metrics(history)

# Conclusion

The transformer outperforms the support vector machine, so that is the model we submit to the challenge.
There is still room for improvement by fine-tuning the transformer further, with techniques such as:

1. Testing different hyperparameters
2. Gradient clipping
3. Differential learning rates
4. Scaling up to a larger RoBERTa model
5. Ensemble learning over multiple instances of the model
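As an illustration of one item in the list, gradient clipping, here is a minimal NumPy sketch of clipping by global norm. The function `clip_by_global_norm` and the toy gradients are hypothetical and independent of the training loop above. In practice Keras optimizers expose clipping arguments such as `clipnorm`, so this only shows the arithmetic involved.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm <= max_norm:
        # Already small enough: return the gradients unchanged.
        return [g.copy() for g in grads], global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm


# Toy gradients whose joint norm is sqrt(3^2 + 4^2 + 12^2) = 13.
toy_grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=1.0)

norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Clipping keeps the direction of the update and only limits its size, which can stabilize fine-tuning when a few batches produce very large gradients.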

#### Submission

In [None]:
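# copy to avoid pandas' SettingWithCopyWarning when adding a column
submission = test[['id']].copy()
# average the fold predictions and flatten (n, 1) logits to a 1-D column
submission['target'] = np.mean(test_pred, axis=0).ravel()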
submission.to_csv('./submission.csv', index=False)
display(submission.head(10))