# CommonLit: Detailed Guide

<a id="section-zero"></a>

<p style="font-family: Arial; font-size:1.4em;color:gold;"> Learning NLP </p>
# TABLE OF CONTENTS


* [Library Importations](#section-library-importations)
* [Loading Datasets](#section-loading-datasets)
* [Exploratory Data Analysis](#section-EDA)
* [Data Preprocessing](#section-preprocessing)
    - [Data Cleaning](#subsection-datacleaning) 
    - [Stemming](#subsection-stemming) 
    - [Lemmatization](#subsection-lemmatization)
* [Part-of-Speech Tagging](#section-pos)
* [Named Entity Recognition](#section-ner)
* [Bag of Words + Models](#section-bow)
    - [Linear Regression](#subsection-bow-lr) 
    - [Ridge Regression](#subsection-bow-ridge)  
    - [Extreme Gradient Boosting](#subsection-bow-xgb)  
* [TF-IDF + Models](#section-tdidf)
    - [Linear Regression](#section-tdidf) 
    - [Ridge Regression](#section-tdidf)  
    - [Extreme Gradient Boosting](#section-tdidf)  
    - [Lasso Regression](#section-tdidf) 
    - [Tweedie Regression](#section-tdidf)  
    - [Huber Regression](#section-tdidf)  
* [Embedding + Models](#section-wordembeddings)
    - [Simple Embedding](#subsection-embedding)
    - [Convolutional Neural Networks](#subsection-CNN)
    - [Gated Recurrent Units](#subsection-GRU)
    - [Single Long Short Term Memory](#subsection-LSTM)
    - [Multiple Long Short Term Memory](#subsection-multiple-LSTM)
* [Hyperparameter Tuning](#section-hyperparametertuning)
    - [Random Search](#subsection-randomsearch)
    - [Hyperband](#subsection-hyperband)
* [GloVe Embeddings](#section-gloveembeddings)
    - [Extreme Gradient Boosting](#subsection-glovexgb)
    - [Stacked LSTM](#subsection-gloveLSTM)
* [BERT Hugging Face Transformer](#section-bert)
* [RoBERTa Hugging Face Transformer](#section-robertabase)
* [Submission](#section-submission)

<a id="section-library-importations"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Library Importations </p>

In [None]:
import numpy as np
import pandas as pd
import time
import string
import re
import math

from collections import Counter

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, TweedieRegressor,HuberRegressor
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import mean_squared_error as mse


import xgboost as xgb

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler, ReduceLROnPlateau

import kerastuner as kt

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.io import curdoc, show, output_notebook
output_notebook()

import nltk
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk import pos_tag
stop_words = stopwords.words('english')

import spacy
nlp = spacy.load('en_core_web_lg')
from spacy import displacy

import transformers
from transformers import BertTokenizer, TFBertModel, RobertaTokenizer, TFRobertaModel

<a id="section-loading-datasets"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Loading Datasets </p>

Use pandas's read_csv function to read each dataset into a dataframe and print its shape.

In [None]:
df_train = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
df_test = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')
df_submission = pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')

print(" Training dataset shape : " + str(df_train.shape))
print(" Testing dataset shape : " + str(df_test.shape))


In [None]:
df_train.head()

In [None]:
df_train['excerpt'][0]

In [None]:
df_test.head()

In [None]:
df_submission.head()

[Back to Top](#section-zero)

<a id="section-EDA"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Exploratory Data Analysis </p>

Only the url_legal and license columns appear to have missing values.

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

Examples with the 5 lowest target values

In [None]:
display(df_train.sort_values(by=['target']).head())

Examples with the 5 highest target values

In [None]:
display(df_train.sort_values(by=['target'], ascending=False).head())

View target distribution

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.histplot(df_train['target'], kde=True, ax=ax, color='green')  # distplot is deprecated in recent seaborn releases
plt.show()

View standard_error distribution

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.histplot(df_train['standard_error'], kde=True, ax=ax, color='green')
plt.show()

In [None]:
def msv_1(data, color = 'yellow', edgecolor = 'black', height = 3, width = 15):
    
    plt.figure(figsize = (width, height))
    percentage = (data.isnull().mean()) * 100
    percentage.sort_values(ascending = False).plot.bar(color = color, edgecolor = edgecolor)

    plt.title('Missing values percentage per column', fontsize=20, weight='bold' )
    plt.xlabel('Columns', size=15, weight='bold')
    plt.ylabel('Missing values percentage')
    plt.yticks(weight ='bold')
    
    return plt.show()
msv_1(df_train, color=sns.color_palette('flare',15))

Count number of words in excerpts and maximum count

In [None]:
count = df_train['excerpt'].str.split().str.len()
print("Number of words in excerpts:\n",count)
print("Max word count from excerpt: ", max(count))

Adding two columns to the train dataset: 
* *excerpt_len* 
> Length of the excerpt 
* *excerpt_word_count* 
> Count of number of words in the excerpt

In [None]:
df_train['excerpt_len'] = df_train['excerpt'].apply(
    lambda x : len(x)
)
df_train['excerpt_word_count'] = df_train['excerpt'].apply(
    lambda x : len(x.split(' '))
)
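
Note that the two word counts above are computed slightly differently: `str.split()` with no argument splits on any run of whitespace, while `split(' ')` splits on single spaces only. If an excerpt contains newlines, the two counts can disagree. A minimal illustration on a made-up string (the sample text is an assumption, not taken from the dataset):

```python
# Hypothetical sample text containing a newline between two sentences.
sample = "The cat sat.\nIt purred."

# split() with no argument treats any whitespace (spaces, newlines, tabs) as a separator.
words_any_ws = sample.split()

# split(' ') only splits on single spaces, so "sat.\nIt" stays together as one token.
words_single_space = sample.split(' ')

print(len(words_any_ws), words_any_ws)
print(len(words_single_space), words_single_space)
```

If the counts should match the ones printed earlier, `len(x.split())` is probably the safer choice for `excerpt_word_count`.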

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.kdeplot(df_train['excerpt_len'], ax=ax, color = 'green').set_title('Excerpt Len')
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))
sns.kdeplot(df_train['excerpt_word_count'], ax=ax, color = 'green').set_title('Excerpt Word Count')
plt.show()

Generate a word cloud

In [None]:
plt.figure(figsize=(10,10))
wordcloud1 = WordCloud( background_color='white',
                        width=600,
                        height=500).generate(" ".join(df_train['excerpt']))
plt.imshow(wordcloud1)
plt.axis('off')
plt.title('Excerpts',fontsize=40);

License Distribution

In [None]:
plt.figure(figsize=(16, 8))
sns.countplot(y="license",data=df_train,palette="crest",linewidth=3)
plt.title("License Distribution",font="Serif")
plt.show()

Next we find the vocabulary size, i.e. the number of distinct words used, with the Counter class from collections.

In [None]:

results = Counter()
df_train['excerpt'].str.lower().str.split().apply(results.update)
print(len(results.keys()))

Find the longest word and its length

In [None]:
longest = max(results.keys(), key=len)  # iterate over the words themselves, not over a string of the keys view
print(longest)
print(len(longest))

In [None]:
df_train.head()

We will see the excerpt with minimum target value

In [None]:
df_train['target'].min()

In [None]:
df_train.loc[df_train['target'] == df_train['target'].min()].excerpt

In [None]:
for word in df_train.loc[df_train['target'] == df_train['target'].min()].excerpt:
    print(word)

We will see the excerpt with maximum target value

In [None]:
df_train['target'].max()

In [None]:
df_train.loc[df_train['target'] == df_train['target'].max()].excerpt

In [None]:
for word in df_train.loc[df_train['target'] == df_train['target'].max()].excerpt:
    print(word)

Functions to get top unigrams and bigrams

In [None]:
def get_top_n_words(corpus, n = None):
    """
    A function that returns the top 'n' unigrams used in the corpus
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus) 
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()] 
    freq_sorted = sorted(words_freq, key = lambda x: x[1], reverse = True)
    return freq_sorted[:n]

def get_top_n_bigram(corpus, n = None):
    """
    A function that returns the top 'n' bigrams used in the corpus
    """
    vec = CountVectorizer(ngram_range = (2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis = 0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    freq_sorted = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return freq_sorted[:n]

[Back to Top](#section-zero)

<a id="section-preprocessing"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Data Preprocessing </p>

Data preprocessing is the process of converting raw data into a clean, consistent format that a machine learning model can use.

<a id="subsection-datacleaning"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Data Cleaning </p>

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

We will create a 'clean' function that combines several cleaning steps, such as removing punctuation.

In [None]:
def removeStopwords(text):
    doc = nlp(text)
    clean_text = ' '
    for txt in doc:
        if (txt.is_stop == False):
            clean_text = clean_text + " " + str(txt)        
    
    return clean_text

print("\033[1mText before removeStopwords function: \033[0m" + df_train['excerpt'][1])
print("\033[1mText after removeStopwords function: \033[0m" + removeStopwords(df_train['excerpt'][1]))

In [None]:
def removePunctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))

print("\033[1mText before removePunctuations function: \033[0m" + df_train['excerpt'][1])
print("\n")
print("\033[1mText after removePunctuations function: \033[0m" + removePunctuations(df_train['excerpt'][1]))

In [None]:
def removeLinks(text):
    clean_text = re.sub(r'https?://\S+|www\.\S+', '', text)  # raw string so \S and \. reach the regex engine unchanged
    #https? will match both http and https
    #A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
    #\S Matches any character which is not a whitespace character.
    #+ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
    return clean_text

test_string = "http://www.youtube.com/ and https://www.youtube.com/ should be removed "
(test_string,removeLinks(test_string))

In [None]:
def removeNumbers(text):
    clean_text = re.sub(r'\d+', '', text)
    return clean_text

test_string = "Hi 🙈 99 girls are running"
(test_string,removeNumbers(test_string))

In [None]:
def clean(text):
    text = text.lower() # Let's make it lowercase
    text = removeLinks(text) # must run before punctuation removal, which would break the URL pattern
    text = removeStopwords(text)
    text = removePunctuations(text)
    text = removeNumbers(text)
    return text

In [None]:
df_train['excerpt_clean'] = df_train['excerpt'].apply(clean)
df_test['excerpt_clean'] = df_test['excerpt'].apply(clean)
df_train.head()

After cleaning, see size of vocabulary:

In [None]:
results = Counter()
df_train['excerpt_clean'].str.lower().str.split().apply(results.update)
print(len(results.keys()))

In [None]:
df_train.excerpt_clean

In [None]:
top_unigram = get_top_n_words(df_train['excerpt_clean'], 20)
words = [i[0] for i in top_unigram]
count = [i[1] for i in top_unigram]
source = ColumnDataSource(data = dict(Word = words, counts = count, color = ['#6baed6'] * 20))

p = figure(x_range = words, plot_height = 400, plot_width = 800, title = "Top Unigrams", tools = "hover", tooltips = "@Word: @counts")
p.vbar(x = 'Word', top = 'counts', width = 0.8, source = source, color = 'color')
curdoc().theme = 'dark_minimal'
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.title.align = 'center'
p.xaxis.major_label_orientation = "vertical"
show(p)

In [None]:
top_bigram = get_top_n_bigram(df_train['excerpt_clean'], 20)
words = [i[0] for i in top_bigram]
count = [i[1] for i in top_bigram]
source = ColumnDataSource(data = dict(Word = words, counts = count, color = ['#a1dab4'] * 20))

p1 = figure(x_range = words, plot_height = 400, plot_width = 800, title = "Top Bigrams", tools = "hover", tooltips = "@Word: @counts")
p1.vbar(x = 'Word', top = 'counts', width = 0.8, source = source, color = 'color')
# curdoc().theme = 'dark_minimal'
p1.xgrid.grid_line_color = None
p1.title.align = 'center'
p1.y_range.start = 0
p1.xaxis.major_label_orientation = "vertical"
show(p1)


[Back to Top](#section-zero)

<a id="subsection-stemming"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Stemming </p>

We will use NLTK for stemming, since spaCy does not provide a stemmer and relies on lemmatization only. Two commonly used stemmers in NLTK are the Porter stemmer and the Snowball stemmer. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it, so we will use that.

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Stemming applies a fixed sequence of rule-based steps to each word, which makes it faster.

In [None]:
stemmer = SnowballStemmer(language='english')

tokens = df_train['excerpt'][1].split()
clean_text = ' '

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

In [None]:
def stemWord(text):
    stemmer = SnowballStemmer(language='english')
    tokens = text.split()
    clean_text = ' '
    for token in tokens:
        clean_text = clean_text + " " + stemmer.stem(token)      
    
    return clean_text

print("\033[1mText before stemWord function: \033[0m" + df_train['excerpt'][1])
print("\033[1mText after stemWord function: \033[0m" + stemWord(df_train['excerpt'][1]))

In [None]:
df_train['excerpt_clean'] = df_train['excerpt_clean'].apply(stemWord)
df_test['excerpt_clean'] = df_test['excerpt_clean'].apply(stemWord)

See vocabulary size now

In [None]:
results = Counter()
df_train['excerpt_clean'].str.lower().str.split().apply(results.update)
print(len(results.keys()))

<a id="subsection-lemmatization"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Lemmatization </p>

Although spaCy has no stemmer, it can perform lemmatization. This is a time-consuming process.

Unlike stemming, the output of lemmatization is an actual English word. (In spaCy, `token.lemma_` gives a token's lemma.)

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
#for token in doc:
   # print(token.lemma_)
for noun in doc.noun_chunks:
    print(noun.text)

In [None]:
for word in doc:
    print(word.text,  word.lemma_)

In [None]:
def lemmatizeWord(text):
    tokens=nlp(text)
    clean_text = ' '
    for token in tokens:
        clean_text = clean_text + " " + token.lemma_      
    
    return clean_text

print("Text before lemmatizeWord function: " + df_train['excerpt'][1])
print("Text after lemmatizeWord function: " + lemmatizeWord(df_train['excerpt'][1]))

doc = "Apple is looking at buying U.K. startup for $1 billion"
lemmatizeWord(doc)


[Back to Top](#section-zero)

Let's define the Root Mean Squared Error (RMSE)

In [None]:
rmse = lambda y_true, y_pred: np.sqrt(mse(y_true, y_pred))
rmse_loss = lambda Estimator, X, y: rmse(y, Estimator.predict(X))
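
RMSE is the square root of the mean of the squared differences between targets and predictions. As a sanity check on the definition, here is a small dependency-free sketch; the name `rmse_plain` and the numbers are made up for illustration and are not used elsewhere in the notebook:

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error computed directly from its definition."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up targets and predictions: only the middle prediction is off, by 2.
y_true_demo = [0.0, -1.0, 2.0]
y_pred_demo = [0.0, 1.0, 2.0]

# Squared errors are 0, 4, 0, so the result is sqrt(4 / 3).
print(rmse_plain(y_true_demo, y_pred_demo))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.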

In [None]:
# Split into train and test sets


x = df_train['excerpt_clean']
y = df_train['target']

print(len(x), len(y))

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
print(len(x_train), len(y_train))
print(len(x_test), len(y_test))


[Back to Top](#section-zero)

<a id="section-pos"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Part-of-Speech Tagging </p>

> In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

[Wiki link](https://en.wikipedia.org/wiki/Part-of-speech_tagging)

In [None]:
df_train['pos_tags'] = df_train['excerpt_clean'].str.split().map(pos_tag)

In [None]:
df_train['pos_tags']

Write a function count_tags to count the number of pos_tags and add it as a column to the dataframe.

In [None]:
def count_tags(pos_tags):
    tag_count = {}
    for word,tag in pos_tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
    return tag_count

df_train['tag_counts'] = df_train['pos_tags'].map(count_tags)
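
The same counting can also be written with `collections.Counter`. The sketch below uses a hypothetical helper name, `count_tags_counter`, and hand-written (word, tag) pairs standing in for the output of `pos_tag`:

```python
from collections import Counter

def count_tags_counter(pos_tags):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return dict(Counter(tag for _, tag in pos_tags))

# Hand-written pairs, for illustration only.
demo_tags = [("dog", "NN"), ("runs", "VBZ"), ("cat", "NN")]
print(count_tags_counter(demo_tags))
```

Both versions return a plain dictionary mapping each tag to its count, so either can be used with `.map` on the `pos_tags` column.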

In [None]:
df_train['tag_counts']

Add columns for the different tags

Here are some of the different tags:

* CC coordinating conjunction
* CD cardinal digit
* DT determiner
* EX existential there (like: “there is” … think of it like “there exists”)
* FW foreign word
* IN preposition/subordinating conjunction
* JJ adjective ‘big’
* JJR adjective, comparative ‘bigger’
* JJS adjective, superlative ‘biggest’
* LS list marker 1)
* MD modal could, will
* NN noun, singular ‘desk’
* NNS noun plural ‘desks’
* NNP proper noun, singular ‘Harrison’
* NNPS proper noun, plural ‘Americans’
* PDT predeterminer ‘all the kids’
* POS possessive ending parent‘s 
* PRP personal pronoun I, he, she
* RB adverb very, silently,
* RBR adverb, comparative better
* RBS adverb, superlative best
* RP particle give up
* TO to go ‘to‘ the store.
* UH interjection errrrrrrrm
* VB verb, base form take
* VBD verb, past tense took
* VBG verb, gerund/present participle taking
* VBN verb, past participle taken
* VBP verb, sing. present, non-3rd person take
* VBZ verb, 3rd person sing. present takes
* WDT wh-determiner which
* WP wh-pronoun who, what
* WP$ possessive wh-pronoun whose
* WRB wh-adverb where, when

In [None]:
set_pos = set([tag for tags in df_train['tag_counts'] for tag in tags])
tag_cols = list(set_pos)

for tag in tag_cols:
    df_train[tag] = df_train['tag_counts'].map(lambda x: x.get(tag, 0))

View df_train now

In [None]:
df_train.head()

Refer to https://seaborn.pydata.org/tutorial/color_palettes.html for seaborn palettes.

Plot the POS tag frequencies for the tag_cols columns of df_train. The y-axis uses a log scale (set_yscale) so that smaller values are also visible.

In [None]:
pos = df_train[tag_cols].sum().sort_values(ascending = False)
plt.figure(figsize=(16,10))
ax = sns.barplot(x=pos.index, y=pos.values,palette="flare")
plt.xticks(rotation = 50)
ax.set_yscale('log')
plt.title('Part-Of-Speech tags frequency')
plt.show()

Use displacy to render the excerpt with the largest target value

In [None]:
df_train.loc[df_train['target'] == df_train['target'].max()].excerpt.to_string()

In [None]:
sent = str()
for word in df_train.loc[df_train['target'] == df_train['target'].max()].excerpt:
    sent = sent  + word
sent

In [None]:
doc1 = nlp(sent)

In [None]:
displacy.render(doc1, style="dep")

[Back to Top](#section-zero)

<a id="section-ner"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Named Entity Recognition </p>

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc

In [None]:
doc1 = nlp(df_train['excerpt'][22])

Use displacy to render with style ent

In [None]:
displacy.render(doc1, style="ent")

In [None]:
# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc1.ents]
print(ents)

On preprocessing, some NER information is lost.

[Back to Top](#section-zero)

<a id="section-bow"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Bag of Words (BoW) </p>

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

* A vocabulary of known words.
* A measure of the presence of known words.

It is called a “*bag*” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
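
To make the two ingredients concrete, here is a minimal pure-Python sketch of a bag-of-words on a made-up two-document corpus. It only illustrates the idea: `CountVectorizer` does the same job with extra steps (for example, lowercasing and its own token pattern), and the names `docs`, `vocab` and `bow_vector` are assumptions for this example.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the dog sat on the mat"]

# 1. Vocabulary of known words, sorted so that column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

# 2. One count vector per document; word order inside the document is ignored.
def bow_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow_matrix = [bow_vector(doc, vocab) for doc in docs]

print(vocab)
for row in bow_matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is why the feature matrix becomes very wide (and sparse) on a real corpus.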

An n-gram is a contiguous sequence of n items from a given sample of text or speech.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram".
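
A short sketch of how n-grams can be extracted from a token list. The helper name `ngrams` and the sample sentence are made up for illustration:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

With `ngram_range=(1, 2)`, `CountVectorizer` uses unigrams and bigrams together as features, which is what the "Unigrams + Bi-grams" experiments rely on.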

Click [here](https://en.wikipedia.org/wiki/Bag-of-words_model) for more information on Bag-of-Words model

<a id="subsection-bow-lr"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Linear Regression </p>

# Unigram Only

In [None]:
model = make_pipeline(
    CountVectorizer(ngram_range=(1,1)),
    LinearRegression(),
)

val_score = cross_val_score(
    model, 
    df_train['excerpt_clean'], 
    df_train['target'], 
    scoring=rmse_loss
).mean()

print(f'Mean CV RMSE for CountVectorizer(1,1): {val_score}')


[Back to Top](#section-zero)

# Bi-grams only

In [None]:
model = make_pipeline(
    CountVectorizer(ngram_range=(2,2)),
    LinearRegression(),
)

val_score = cross_val_score(
    model, 
    df_train['excerpt_clean'], 
    df_train['target'], 
    scoring=rmse_loss
).mean()

print(f'Mean CV RMSE for CountVectorizer(2,2): {val_score}')


[Back to Top](#section-zero)

# Unigrams + Bi-grams

In [None]:
model = make_pipeline(
    CountVectorizer(ngram_range=(1,2)),
    LinearRegression(),
)

val_score = cross_val_score(
    model, 
    df_train['excerpt_clean'], 
    df_train['target'], 
    scoring=rmse_loss
).mean()

print(f'Mean CV RMSE for CountVectorizer(1,2): {val_score}')


[Back to Top](#section-zero)

# Unigrams + Bi-grams + Tri-grams

In [None]:
model = make_pipeline(
    CountVectorizer(ngram_range=(1,3)),
    LinearRegression(),
)

val_score = cross_val_score(
    model, 
    df_train['excerpt_clean'], 
    df_train['target'], 
    scoring=rmse_loss
).mean()

print(f'Mean CV RMSE for CountVectorizer(1,3): {val_score}')

[Back to Top](#section-zero)

<a id="subsection-bow-ridge"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Ridge Regression</p>

In [None]:
model = make_pipeline(
    CountVectorizer(ngram_range=(1,1)),
    Ridge(),
)

val_score = cross_val_score(
    model, 
    df_train['excerpt_clean'], 
    df_train['target'], 
    scoring=rmse_loss
).mean()

print(f'Mean CV RMSE for Ridge Regression: {val_score}')

[Back to Top](#section-zero)

<a id="subsection-bow-xgb"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Extreme Gradient Boosting</p>

Change ngram_range and experiment

In [None]:
model = make_pipeline(
    CountVectorizer(ngram_range=(1,1)),
    xgb.XGBRegressor() ,
)

val_score = cross_val_score(
    model, 
    df_train['excerpt_clean'], 
    df_train['target'], 
    scoring=rmse_loss
).mean()

print(f'Mean CV RMSE for Extreme Gradient Boosting: {val_score}')

[Back to Top](#section-zero)

<a id="section-tdidf"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> TF-IDF </p>

TF-IDF (Term Frequency-Inverse Document Frequency) down-weights common words that occur in almost all documents and gives more importance to words that appear in only a subset of documents. It does this by assigning lower weights to these common words and higher weights to words that are rare across the corpus but present in a particular document.
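
The weighting can be worked through by hand on a toy corpus. The sketch below uses one common smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, which to my understanding is the default smoothing in scikit-learn. It skips the row normalization that `TfidfVectorizer` applies by default, so the numbers will not match the library output exactly. The corpus and helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the dog sat", "the bird flew"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    # Term frequency times inverse document frequency, without normalization.
    tf = Counter(tokens)
    return {word: count * idf(word) for word, count in tf.items()}

weights = tfidf(tokenized[0])
for word, weight in weights.items():
    print(word, round(weight, 3))
```

Here "the" occurs in every document and gets the lowest weight, "sat" occurs in two of three, and "cat" occurs in only one and gets the highest weight.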

Use sklearn.feature_extraction.text's TfidfVectorizer and make a pipeline comprising TfidfVectorizer and our models

In [None]:
def training(model, X_train, y_train, X_test, y_test, model_name, ngram_range):
    t1 = time.time()
    
    model = make_pipeline(
        TfidfVectorizer(binary=True, ngram_range=ngram_range),
        model,
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    MSE = mse(y_test, y_pred)
    
    t2 = time.time()
    training_time = t2-t1 
    
    print("--- Model:", model_name,"---")
    print("MSE: ",MSE)
    print("Training time:",training_time)
    print("\n")

We will run several models in turn on the same split:
1. Ridge Regression
2. Linear Regression
3. Extreme Gradient Boosting
4. Lasso Regression
5. Tweedie Regression
6. Huber Regression

For `Ridge`, two constructor parameters are worth noting (descriptions from the scikit-learn documentation):

**fit_intercept** bool, default=True
    Whether to fit the intercept for this model. If set to false, no intercept will be used in calculations (i.e. X and y are expected to be centered).
    
**normalize** bool, default=False
    This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False.
    Note that `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions it must be omitted.

In [None]:
ridge = Ridge(fit_intercept = True)  # normalize=False was the default; the parameter was removed in scikit-learn 1.2
lr = LinearRegression()
xgbr = xgb.XGBRegressor()
lasso = Lasso(alpha=0.1)
tr = TweedieRegressor()
hr = HuberRegressor(max_iter = 300)
models = [ridge,lr,xgbr,lasso,tr,hr]

modelnames = ["Ridge Regression","Linear Regression","Extreme Gradient Boosting", "Lasso Regression","Tweedie Regressor","Huber Regressor"]

Use train_test_split to split data into training and validation

In [None]:
X = df_train["excerpt_clean"]
y = df_train['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

n_gram_dict = { "Unigram" : (1,1), "Unigrams + Bigrams": (1,2), "Bigrams alone": (2,2), "Unigrams + Bigrams + Trigrams": (1,3)}

for n_gram in n_gram_dict.keys():
    print("\033[1m " + n_gram + " \n \033[0m")
    for i in range(0,len(models)):
        training(model=models[i], X_train=X_train, y_train=y_train, X_test=X_test,y_test=y_test, model_name=modelnames[i],ngram_range=n_gram_dict[n_gram])
    print("*" * 40)
    


[Back to Top](#section-zero)

<a id="section-wordembeddings"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Embeddings </p>

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. 

An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. 

A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
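
Conceptually, an embedding layer is a lookup table with one trainable vector per word index. The sketch below builds such a table with random values (standing in for learned weights) and looks up a short sequence. It also shows one simple way to turn the per-word vectors into a single fixed-length vector, by averaging them. The sizes and names are made up for illustration.

```python
import random

random.seed(0)

toy_vocab_size = 6      # hypothetical, far smaller than a real vocabulary
toy_embedding_dim = 4   # hypothetical embedding length

# One vector per word index; in a real model these values are learned.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]

def embed(sequence):
    """Look up the vector for each word index in the sequence."""
    return [embedding_table[idx] for idx in sequence]

def average_pool(vectors):
    """Average the word vectors position-wise into one vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

sequence = [2, 5, 1]          # three word indices
vectors = embed(sequence)     # three vectors of length toy_embedding_dim
pooled = average_pool(vectors)

print(len(vectors), len(vectors[0]))
print(len(pooled))
```

The table has `toy_vocab_size * toy_embedding_dim` parameters, which is why both the vocabulary size and the embedding dimension affect how much data is needed to train it.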

Before we start, let us define some callbacks, variables and other helper functions

# Callbacks

We will define some callbacks to be used with the model's fit function:
* **Learning rate reduction** - 

        tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.1,
        patience=10,
        verbose=0,
        mode="auto",
        min_delta=0.0001,
        cooldown=0,
        min_lr=0,
        **kwargs
        )
Reduce learning rate when a metric has stopped improving.

Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.

We will use that to prevent/reduce overfitting.

* **Early Stopping**

       tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        min_delta=0,
        patience=0,
        verbose=0,
        mode="auto",
        baseline=None,
        restore_best_weights=False,
        )
Stop training when a monitored metric has stopped improving.

Assuming the goal of a training is to minimize the loss. With this, the metric to be monitored would be 'loss', and mode would be 'min'. A model.fit() training loop will check at end of every epoch whether the loss is no longer decreasing, considering the min_delta and patience if applicable. Once it's found no longer decreasing, model.stop_training is marked True and the training terminates.

The quantity to be monitored needs to be available in logs dict. To make it so, pass the loss or metrics at model.compile().

We will use that to prevent/reduce overfitting.

In [None]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_root_mean_squared_error', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)


early_stopping = EarlyStopping(
    min_delta=0.001, # minimium amount of change to count as an improvement
    patience=5, # how many epochs to wait before stopping
    restore_best_weights=True,
)
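
To see what `patience`, `min_delta` and `factor` do, the sketch below replays the two rules on a made-up list of validation losses. It is a simplified stand-in, not the Keras implementation: it ignores `cooldown`, `baseline` and weight restoration, and the function names are assumptions for this example.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

def reduce_lr_schedule(val_losses, lr, factor, patience, min_lr=0.0):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [1.00, 0.80, 0.79, 0.81, 0.795, 0.80, 0.82, 0.83]

print(early_stopping_epoch(val_losses, patience=5, min_delta=0.001))
print(reduce_lr_schedule(val_losses, lr=0.001, factor=0.5, patience=3))
```

With the settings used in the cell above (LR patience 3, early-stopping patience 5), the learning rate is reduced before early stopping triggers, which gives the model a chance to improve at a smaller step size first.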

# Plotting and Predicting Helper Functions

In [None]:


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

In [None]:
def predict_complexity(model, excerpt):
  # Create the sequences
  padding_type='post'
  sample_sequences = tokenizer.texts_to_sequences(excerpt)
  excerpt_padded = pad_sequences(sample_sequences, padding=padding_type, 
                                 maxlen=max_length) 
  classes = model.predict(excerpt_padded)
  for x in range(len(excerpt_padded)):
    print(excerpt[x])
    print(classes[x])
    print('\n')


In [None]:
#text = df_train.excerpt
text = df_train.excerpt_clean

An obvious question is what values to choose for vocab_size, max_length, the embedding dimension and so on.
Here vocab_size is set to the total number of distinct words counted in [Exploratory Data Analysis](#section-EDA).

In [None]:
vocab_size = 51038
embedding_dim = 64
max_length = 50
trunc_type='post'
pad_type='post'
oov_tok = "<OOV>"

The preprocessing library in TensorFlow Keras provides a number of extremely useful tools to prepare data for machine learning. One of these is a
Tokenizer that will allow you to take words and turn them into tokens.


In [None]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(text)
word_index = tokenizer.word_index

texts_to_sequences converts our excerpts to the corresponding integer sequences. We pad (and truncate) so that every sequence has the same length (max_length). The validation part is commented out because I am using validation_split in the model's fit function instead.

In [None]:
training_sequences = tokenizer.texts_to_sequences(text)
training_padded = pad_sequences(training_sequences,maxlen=max_length, 
                                truncating=trunc_type, padding=pad_type)

#validation_sequences = tokenizer.texts_to_sequences(text[800:])
#validation_padded = pad_sequences(validation_sequences,maxlen=max_length)

training_labels_final = np.array(df_train.target)
#validation_labels_final = np.array(df_train[800:].target)
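
The tokenizing and padding steps can be mimicked in a few lines of plain Python. This is a simplified sketch under stated assumptions: words are split on whitespace only, index 1 is reserved for the out-of-vocabulary token, and 0 is used for padding. The helper names are hypothetical and the Keras `Tokenizer` applies additional filtering that is not reproduced here.

```python
def fit_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first; index 1 is the OOV token."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    ordered = sorted(counts, key=lambda w: -counts[w])
    index = {oov_token: 1}
    for i, word in enumerate(ordered, start=2):
        index[word] = i
    return index

def to_sequence(text, word_index, oov_token="<OOV>"):
    """Replace each word by its index, using the OOV index for unseen words."""
    return [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]

def pad_post(seq, maxlen):
    """Truncate at the end, then pad with zeros at the end ('post' in both cases)."""
    seq = seq[:maxlen]
    return seq + [0] * (maxlen - len(seq))

# Made-up texts for illustration.
demo_index = fit_word_index(["the cat sat", "the dog"])
demo_seq = to_sequence("the bird sat", demo_index)   # "bird" was never seen

print(demo_index)
print(demo_seq)
print(pad_post(demo_seq, 5))
print(pad_post([2, 3, 4, 5, 2, 3], 4))
```

With `max_length = 50` and post-truncation, only the first 50 tokens of each cleaned excerpt reach the model, so anything after that is dropped.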

In [None]:
training_padded

In [None]:
print(training_padded.shape)
print(training_labels_final.shape)
#print(validation_labels_final.shape)

# Simple Model

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  
    tf.keras.layers.Dense(1)
])
#model.compile(loss='binary_crossentropy',optimizer='adam',)#metrics=[rmse])
model.compile(loss='mean_squared_error', metrics=[RootMeanSquaredError()])
model.summary()

In [None]:
num_epochs = 35
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.1,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])

Our model is over-fitting now.


[Back to Top](#section-zero)

# Optimizing the Model

Let's tune a few of these settings first and then rebuild the model in the next section.

*Earlier values:*

* `vocab_size = 51308`
* `embedding_dim = 64`
* `max_length = 50`
* `trunc_type='post'`
* `pad_type='post'`
* `oov_tok = "<OOV>"`

**Exploring embedding dimensions**

A common rule of thumb is to set the embedding size to roughly the fourth root of the vocabulary size.



In [None]:

def f(num):
    return math.sqrt(math.sqrt(num))

f(51308)

In [None]:
f(16662)

In [None]:
embedding_dim = 12

**Using regularization**

Regularization is a technique that helps prevent overfitting by reducing the polarization of weights. If the weights on some of the neurons are too heavy, regularization effectively punishes them. Broadly speaking, there are two types of regularization: L1 and L2.

* L1 regularization is often called lasso (least absolute shrinkage and selection operator) regularization. It penalizes the sum of the absolute values of the weights, which tends to drive small weights to exactly zero so they are effectively ignored when calculating a result in a layer.
* L2 regularization is often called ridge regularization, after ridge regression. It penalizes the sum of the squared weights, so large weights are punished much more heavily than small ones, and all weights are shrunk toward zero without usually reaching it.

For NLP problems like the one we’re considering, L2 is most commonly used. It can be added to a Dense layer through the `kernel_regularizer` argument, which takes a regularizer such as `tf.keras.regularizers.l2(0.01)`; the floating-point value is the regularization factor. This is another hyperparameter that you can experiment with to improve your model.
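To see how the two penalties differ, here is a small pure-Python sketch that computes the extra loss term each one would add for a toy weight vector. The helper names and the weights are hypothetical, and the factor 0.01 is only an example value.

```python
def l1_penalty(weights, factor):
    """L1: factor times the sum of absolute weight values."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """L2: factor times the sum of squared weight values."""
    return factor * sum(w * w for w in weights)

toy_weights = [0.0, 0.1, -2.0]
toy_l1 = l1_penalty(toy_weights, 0.01)  # 0.01 * (0 + 0.1 + 2.0)
toy_l2 = l2_penalty(toy_weights, 0.01)  # 0.01 * (0 + 0.01 + 4.0)
print(toy_l1, toy_l2)
```

Under L2 the large weight (-2.0) accounts for almost the entire penalty, which is why L2 mainly reins in heavy weights.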

**Max length optimization** We arbitrarily set `max_length` to 50 earlier. Let's look at the excerpt lengths to choose a better value.

In [None]:
# len(item) counts characters; use len(item.split()) for a word count instead
ys = sorted(len(item) for item in text)
xs = range(1, len(ys) + 1)
plt.xlabel('Excerpt (sorted by length)')
plt.ylabel('Length (characters)')
plt.title('Excerpt Lengths')
plt.plot(xs, ys)

plt.show()

Most excerpts are 800 characters or fewer. Because `max_length` counts tokens rather than characters, 800 is a generous upper bound on the sequence length, so we use that value instead.

In [None]:
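max_length = 800
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type, padding=pad_type)  # re-pad to the new length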

We will create a simple model using an embedding after applying all the optimizations we learned.


[Back to Top](#section-zero)

<a id="subsection-embedding"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Create the model using an Embedding </p>

Let's define a simple model with an embedding layer as the first layer.
For a regression problem with unbounded target values, don't give the last layer an activation.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu', kernel_regularizer = tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1)
])
model.compile(loss='mean_squared_error',optimizer='adam', metrics=[RootMeanSquaredError()])
model.summary()

The number of parameters in the embedding layer is `vocab_size * embedding_dim`.
The average pooling layer has 0 trainable parameters, as it’s just averaging the embedding vectors produced by the layer before it across the sequence.
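As a quick sanity check on the numbers `model.summary()` prints, the parameter counts can be worked out by hand. The sketch below assumes a vocabulary of 51308 words and an embedding dimension of 12; both values are assumptions for illustration, and the helper functions are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim

def dense_params(inputs, units):
    """A weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

toy_vocab_size, toy_embedding_dim = 51308, 12   # assumed values
toy_total = (
    embedding_params(toy_vocab_size, toy_embedding_dim)  # Embedding layer
    + 0                                                  # GlobalAveragePooling1D has no weights
    + dense_params(toy_embedding_dim, 24)                # Dense(24)
    + dense_params(24, 1)                                # Dense(1)
)
print(toy_total)
```

Almost all of the parameters sit in the embedding layer, which is one reason a smaller embedding dimension helps against overfitting.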

We use `validation_split=0.1` so that 10% of the training data is held out for validation. Keras takes this slice from the end of the arrays, before any shuffling.

In [None]:
num_epochs = 100
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.1,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])

Let us see the predictions for df_test

In [None]:

predict_complexity(model, df_test['excerpt'])
plot_graphs(history, "root_mean_squared_error")
plot_graphs(history, "loss")


[Back to Top](#section-zero)


<a id="subsection-CNN"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Convolutional Neural Network (CNN/ConvNet) </p>

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(embedding_dim, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(), 
    tf.keras.layers.Dense(24, activation='relu', kernel_regularizer = tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1)
])

# Default learning rate for the Adam optimizer is 0.001
# Let's slow down the learning rate by 10.
learning_rate = 0.0001
model.compile(loss='mean_squared_error',optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=[RootMeanSquaredError()])
model.summary()

In [None]:
num_epochs = 100
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.1,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])

In [None]:
model.save("commonlitmodel.h5")


[Back to Top](#section-zero)

<a id="subsection-GRU"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Gated Recurrent Units  RNN(GRU) </p>

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(1)
])

# Default learning rate for the Adam optimizer is 0.001
# Let's use a much smaller rate here (about 33x lower).
learning_rate = 0.00003
model.compile(loss='mean_squared_error',optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=[RootMeanSquaredError()])
model.summary()

In [None]:
num_epochs = 35
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.1,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])


[Back to Top](#section-zero)

<a id="subsection-LSTM"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Bidirectional Long Short Term Memory (LSTM) </p>

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)), 
    tf.keras.layers.Dense(1)
])

# Default learning rate for the Adam optimizer is 0.001
# Let's use a much smaller rate here (about 33x lower).
learning_rate = 0.00003
model.compile(loss='mean_squared_error',optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=[RootMeanSquaredError()])
model.summary()

In [None]:
num_epochs = 35
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.1,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])


[Back to Top](#section-zero)

<a id="subsection-multiple-LSTM"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Multiple Bidirectional Long Short Term Memory (LSTM) </p>

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, 
                                                       return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    tf.keras.layers.Dense(1)
])

# Default learning rate for the Adam optimizer is 0.001
# Let's use a much smaller rate here (about 33x lower).
learning_rate = 0.00003
model.compile(loss='mean_squared_error',optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=[RootMeanSquaredError()])
model.summary()

In [None]:
num_epochs = 35
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.1,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])


[Back to Top](#section-zero)

<a id="section-hyperparametertuning"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Exploring Hyper Parameter Tuning with Keras </p>

Let's define a function to build our model. Try optimizing units and learning rate.

In [None]:
def model_builder(hp):
  model = keras.Sequential()
  model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length))
  # Collapse the sequence dimension so the model emits one value per excerpt
  model.add(tf.keras.layers.GlobalAveragePooling1D())

  # Tune the number of units in the first Dense layer
  # Choose an optimal value between 16 and 256 (in steps of 8)
  hp_units = hp.Int('units', min_value=16, max_value=256, step=8)
  model.add(keras.layers.Dense(units=hp_units, activation='relu'))
  model.add(keras.layers.Dense(1))

  # Tune the learning rate for the optimizer
  # Choose an optimal value from 0.001 or 0.0001
  hp_learning_rate = hp.Choice('learning_rate', values=[1e-3, 1e-4])

  model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss=keras.losses.MeanSquaredError(),
                metrics=[RootMeanSquaredError()])

  return model

<a id="subsection-randomsearch"></a>

# Random Search

`kerastuner.tuners.randomsearch.RandomSearch(hypermodel, objective, max_trials, seed=None, hyperparameters=None, tune_new_entries=True, allow_new_entries=True, **kwargs)`

In [None]:
tuner_search=kt.RandomSearch(model_builder,
                       objective = kt.Objective("val_root_mean_squared_error", direction="min"),
                       max_trials=5,directory='output',project_name="nlp")
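Conceptually, random search just samples hyperparameter combinations at random, scores each one, and keeps the best. The sketch below illustrates that loop in pure Python; it does not use Keras Tuner, and `toy_objective` is a made-up stand-in for the validation RMSE that a real trial would obtain by training a model.

```python
import random

def toy_objective(units, learning_rate):
    """Hypothetical stand-in for validation RMSE (lower is better)."""
    return abs(units - 64) / 64 + abs(learning_rate - 1e-3) * 100

def random_search(max_trials, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(max_trials):
        trial = {
            "units": rng.randrange(16, 257, 8),         # 16..256 in steps of 8
            "learning_rate": rng.choice([1e-3, 1e-4]),
        }
        score = toy_objective(**trial)
        # Keep the trial with the lowest score seen so far.
        if best is None or score < best[0]:
            best = (score, trial)
    return best

best_score, best_trial = random_search(max_trials=5)
print(best_trial, best_score)
```

With only a handful of trials the result depends heavily on the seed, which is why `max_trials` is worth increasing when time allows.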

Uncomment this to run

In [None]:
#tuner_search.search(training_padded,training_labels_final,epochs=10,validation_split=0.1)

<a id="subsection-hyperband"></a>

# Hyperband

`kerastuner.tuners.hyperband.Hyperband(hypermodel, objective, max_epochs, factor=3, hyperband_iterations=1, seed=None, hyperparameters=None, tune_new_entries=True, allow_new_entries=True, **kwargs)`

In [None]:
tuner = kt.Hyperband(model_builder,
                     max_epochs=10,
                     objective = kt.Objective("val_root_mean_squared_error", direction="min"),
                     factor=3,
                     directory='my_dir',
                     project_name='intro_to_kt')

In [None]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

Uncomment this to run. 

In [None]:

#tuner.search(training_padded, training_labels_final, epochs=5, validation_split=0.1, callbacks=[stop_early])

# Get the optimal hyperparameters
#best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

#print(f"""The hyperparameter search is complete. The optimal number of units in the first densely-connected layer is {best_hps.get('units')} and the optimal learning rate for the optimizer is {best_hps.get('learning_rate')}.""")

[Back to Top](#section-zero)

<a id="section-gloveembeddings"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Glove Embeddings </p>

What if, instead of learning the embeddings yourself, you could use pre-learned embeddings, where researchers have already done the hard work of turning words into vectors and those vectors are well tested? One example of this is the GloVe (Global Vectors for Word Representation) model developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford.

Pretrained Word Embeddings are the embeddings learned in one task that are used for solving another similar task. These embeddings are trained on large datasets, saved, and then used for solving other tasks. That’s why pretrained word embeddings are a form of Transfer Learning.

In [None]:
glove_embeddings = dict()
# each line holds a word followed by its 50 vector components
with open('/kaggle/input/glove6b/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        glove_embeddings[word] = coefs

In [None]:
glove_embeddings['frog']


In [None]:
# this function creates a normalized vector for the whole sentence
def sent2vec(s):
    words = word_tokenize(str(s).lower())
    # keep alphabetic, non-stop-word tokens only
    words = [w for w in words if w not in stop_words and w.isalpha()]
    # look up each word, skipping words that GloVe does not cover
    M = [glove_embeddings[w] for w in words if w in glove_embeddings]
    if not M:
        return np.zeros(50)
    v = np.sum(M, axis=0)
    norm = np.sqrt((v ** 2).sum())
    # scale to unit length; leave an all-zero sum untouched
    return v / norm if norm > 0 else v
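The sentence-vector idea is easier to follow on a tiny example. The sketch below uses made-up two-dimensional "embeddings" and plain Python lists instead of GloVe and NumPy; `toy_embeddings` and `toy_sent2vec` are hypothetical names used only here.

```python
import math

toy_embeddings = {"frog": [1.0, 0.0], "pond": [0.0, 2.0]}  # made-up 2-d vectors

def toy_sent2vec(sentence, embeddings, dim=2):
    """Sum the vectors of the known words, then scale the sum to unit length."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim                      # no known word: fall back to zeros
    summed = [sum(component) for component in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in summed))
    if norm == 0:
        return summed
    return [x / norm for x in summed]

toy_vector = toy_sent2vec("The frog in the pond", toy_embeddings)
print(toy_vector)
```

Because the result is scaled to unit length, excerpts of different sizes end up on the same scale, and only the direction of the summed vector carries information.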

[Back to Top](#section-zero)

<a id="subsection-glovexgb"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> Extreme Gradient Boosting Regressor </p>

In [None]:
xtrain, xvalid, ytrain, yvalid = train_test_split(df_train.excerpt_clean, df_train.target, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)

In [None]:
# create sentence vectors using the above function for training and validation set
xtrain_glove = [sent2vec(x) for x in xtrain]
xvalid_glove = [sent2vec(x) for x in xvalid]

In [None]:
xtrain_glove = np.array(xtrain_glove)
xvalid_glove = np.array(xvalid_glove)

In [None]:
# Fitting a simple xgboost on glove features
# n_jobs and verbosity replace the deprecated nthread and silent arguments
clf = xgb.XGBRegressor(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, n_jobs=10, learning_rate=0.1, verbosity=1)
clf.fit(xtrain_glove, ytrain)
predictions = clf.predict(xvalid_glove)

print("MSE: %f" % mse(yvalid, predictions))

[Back to Top](#section-zero)

<a id="subsection-gloveLSTM"></a>

# <p style="font-family: Arial; font-size:1.2em;color:tomato;"> GloVe  Stacked LSTM </p>

In [None]:
embedding_dim = 50
vocab_size = 51308

In [None]:
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, index in tokenizer.word_index.items():
    # word_index is ordered by rank, so every later index is out of range too
    if index > vocab_size - 1:
        break
    embedding_vector = glove_embeddings.get(word)
    # words missing from GloVe keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [None]:
model = tf.keras.Sequential([
 tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False),
 tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim,
 return_sequences=True)),
 tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
 tf.keras.layers.Dense(24, activation='relu'),
 tf.keras.layers.Dense(1)
])

In [None]:
learning_rate = 0.00003
model.compile(loss='mean_squared_error',optimizer=tf.keras.optimizers.Adam(learning_rate), metrics=[RootMeanSquaredError()])
model.summary()

In [None]:
num_epochs = 100
history = model.fit(training_padded, training_labels_final, epochs=num_epochs, 
                    validation_split=0.3,
                    #validation_data=(validation_padded, validation_labels_final),
                   callbacks=[early_stopping,learning_rate_reduction])

In [None]:
xs=[]
ys=[]
cumulative_x=[]
cumulative_y=[]
total_y=0
for word, index in tokenizer.word_index.items():
 xs.append(index)
 cumulative_x.append(index)
 if glove_embeddings.get(word) is not None:
     total_y = total_y + 1
     ys.append(1)
 else:
     ys.append(0)
 cumulative_y.append(total_y / index)

In [None]:
fig, ax = plt.subplots(figsize=(12,2))
ax.spines['top'].set_visible(False)
plt.margins(x=0, y=None, tight=True)
#plt.axis([13000, 14000, 0, 1])
plt.fill(ys)

In [None]:
plt.plot(cumulative_x, cumulative_y)
plt.axis([0, 25000, .915, .985])



[Back to Top](#section-zero)


<a id="section-bert"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> HuggingFace TFBertModel </p>

The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It’s a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

The abstract from the paper is the following:

> We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
> 
> BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
> 
Tips:

* BERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

* BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.

In [None]:

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased', do_lower_case=True)

**BERT Encoding**


The data is encoded according to BERT's requirements. The tokenizer class provides a very helpful function called `encode_plus`, which can seamlessly perform the following operations:

* Tokenize the text

* Add special tokens - [CLS] and [SEP]

* Create token IDs

* Pad the sentences to a common length

* Create attention masks for the above PAD tokens
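The operations listed above can be sketched in plain Python once the text has already been turned into token IDs. This is an illustration, not the Hugging Face implementation: `toy_encode` is a hypothetical helper, the two input IDs are made up, and the special-token IDs 101, 102, and 0 are assumed to be the `[CLS]`, `[SEP]`, and `[PAD]` entries of the uncased BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token IDs

def toy_encode(token_ids, max_length):
    """Add [CLS]/[SEP], truncate, pad to max_length, and build the attention mask."""
    body = token_ids[: max_length - 2]       # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks padding, which attention should ignore
    return ids + [PAD_ID] * padding, mask + [0] * padding

toy_ids, toy_mask = toy_encode([7592, 2088], max_length=6)
print(toy_ids)
print(toy_mask)
```

Both lists always have exactly `max_length` entries, which is what lets the encoded excerpts be stacked into a single array.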

The code below encodes the raw `excerpt` column; whether `excerpt_clean` would work better is an open question.

In [None]:
def bert_encode(data, maximum_length):
  input_ids = []
  attention_masks = []

  for excerpt in data.excerpt:
      encoded = tokenizer.encode_plus(
        excerpt,
        add_special_tokens=True,
        max_length=maximum_length,
        padding='max_length',   # replaces the deprecated pad_to_max_length=True
        truncation=True,        # cut excerpts longer than maximum_length
        return_attention_mask=True,
      )

      input_ids.append(encoded['input_ids'])
      attention_masks.append(encoded['attention_mask'])
  return np.array(input_ids), np.array(attention_masks)

Encode both train and test dataset

In [None]:
train_input_ids,train_attention_masks = bert_encode(df_train,60)
test_input_ids,test_attention_masks = bert_encode(df_test,60)

Write a function to create our model. We will add Dense layer(s) as the output layer.

Add a Dropout layer if the model overfits.

In [None]:
def create_model(bert_model):
  input_ids = tf.keras.Input(shape=(60,),dtype='int32')
  attention_masks = tf.keras.Input(shape=(60,),dtype='int32')
  
  output = bert_model([input_ids,attention_masks])
  output = output[1]
  #output = tf.keras.layers.Dense(32,activation='relu')(output)
  #output = tf.keras.layers.Dropout(0.2)(output)

  output = tf.keras.layers.Dense(1)(output)
  model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = output)
  model.compile(tf.keras.optimizers.Adam(learning_rate=6e-6), loss='mean_squared_error', metrics=[RootMeanSquaredError()])
  return model

 Click [TFBertModel](https://huggingface.co/transformers/model_doc/bert.html#tfbertmodel) for more information

In [None]:

bert_model = TFBertModel.from_pretrained('bert-large-uncased')

In [None]:
model = create_model(bert_model)
model.summary()

Adjust the number of epochs as needed.

In [None]:
history = model.fit([train_input_ids,train_attention_masks],df_train.target,validation_split=0.3, epochs=2,batch_size=10)

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()


[Back to Top](#section-zero)

<a id="section-robertabase"></a>
# <p style="font-family: Arial; font-size:1.4em;color:gold;"> roberta-base Hugging Face Transformer </p>

Follow the same steps as in the section above, this time with the RoBERTa tokenizer and model.

https://huggingface.co/roberta-base

In [None]:

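tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
BASE_MODEL = TFRobertaModel.from_pretrained('roberta-base')
train_input_ids, train_attention_masks = bert_encode(df_train, 60)  # re-encode with the RoBERTa vocabulary
test_input_ids, test_attention_masks = bert_encode(df_test, 60)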

In [None]:


def create_model(bert_model):
  input_ids = tf.keras.Input(shape=(60,),dtype='int32')
  attention_masks = tf.keras.Input(shape=(60,),dtype='int32')
  
  output = bert_model([input_ids,attention_masks])
  output = output[1]
  output = tf.keras.layers.Dense(32,activation='relu')(output)
  output = tf.keras.layers.Dropout(0.2)(output)

  output = tf.keras.layers.Dense(1)(output)
  model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = output)
  
  return model



model = create_model(BASE_MODEL)
model.compile(tf.keras.optimizers.Adam(learning_rate=6e-6), loss='mean_squared_error', metrics=[RootMeanSquaredError()])
    
model.summary()

In [None]:
history = model.fit([train_input_ids,train_attention_masks],df_train.target,validation_split=0.3, epochs=2,batch_size=10)


[Back to Top](#section-zero)

<a id="section-submission"></a>

# <p style="font-family: Arial; font-size:1.4em;color:gold;"> Making the Submission </p>

In [None]:
def submission(submission_file_path,model,excerpt):
    # Note: `tokenizer` must be the Keras Tokenizer fitted on the training text,
    # not the Hugging Face tokenizer that later cells assign to the same name.
    padding_type='post'
    sample_sequences = tokenizer.texts_to_sequences(excerpt)
    excerpt_padded = pad_sequences(sample_sequences, padding=padding_type, 
                                 maxlen=max_length) 
    predictions = model.predict(excerpt_padded)
    sample_submission = pd.read_csv(submission_file_path)
    sample_submission["target"] = predictions.ravel()  # flatten (n, 1) to (n,)
    sample_submission.to_csv("submission.csv", index=False)
    

In [None]:
#submission_file_path = "/kaggle/input/commonlitreadabilityprize/sample_submission.csv"

#submission(submission_file_path,model,df_test['excerpt'])


[Back to Top](#section-zero)
