<h1 align="center">📚CommonLit Readability Prize📚 </h1>
<hr>

# Introduction
In this notebook I will try to cover most of the NLP techniques that you can apply to almost any NLP problem. Before that, I will show some exploration techniques that are useful for analyzing the data, and then we will proceed further. <br>

<h3>So, Let's Get Started!!!</h3><br>

🎯 <b>Objective:</b> The objective of this competition is to rate the complexity level of literary passages for grades 3-12 use.

📈 <b>Dataset:</b> The dataset contains excerpts from several time periods and a wide range of reading ease scores.

<b>Columns of the train/test data-</b> 

* ```id``` - unique ID for excerpt
* ```url_legal``` - URL of source (blank in the test set)
* ```license``` - license of source material (blank in the test set)
* ```excerpt``` - text to predict reading ease of
* ```target``` - reading ease
* ```standard_error``` - measure of spread of scores among multiple raters for each excerpt (not included for test data)

<div class="alert alert-block alert-info">
    <h2>🙂 Please do an upvote if you find it useful ! 🙂 </h2>
</div>

# Table of Contents
<ul style="list-style-type:square">
    <li><a href="#1">Importing Libraries</a></li>
    <li><a href="#2">Reading the data</a></li>
    <li><a href="#3">Exploratory Data Analysis</a></li>
    <li><a href="#4">Data Preprocessing</a></li>
    <li><a href="#5">ML Models (Baseline)</a></li>
    <ul>
        <li><a href="#5.1">Linear Regression</a></li>
        <li><a href="#5.2">Ridge Regression</a></li>
        <li><a href="#5.3">Support Vector Regression</a></li>
        <li><a href="#5.4">Random Forest Regressor</a></li>
        <li><a href="#5.5">Gradient Boosting Regressor</a></li>
        <li><a href="#5.6">AdaBoost Regressor</a></li>
        <li><a href="#5.7">XGBoost Regressor</a></li>
    </ul>
    <li><a href="#6">DL Models (Baseline)</a></li>
    <ul>
        <li><a href="#6.1">Simple RNN</a></li>
        <li><a href="#6.2">LSTM</a></li>
        <li><a href="#6.3">Bidirectional RNN</a></li>
        <li><a href="#6.4">BERT</a></li>
    </ul>
    <li><a href="#7">Ending Notes</a></li>
</ul>

<a id='1'></a>
# Importing Libraries

In [None]:
import re
import numpy as np
import pandas as pd
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as px
plt.style.use('seaborn-darkgrid')
from textblob import TextBlob
from PIL import Image
import requests
from wordcloud import WordCloud, STOPWORDS

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout, SimpleRNN, Bidirectional
from keras.optimizers import Adam

import warnings
warnings.simplefilter('ignore')

<a id='2'></a>
# Reading the data

In [None]:
df = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

In [None]:
df.head()

In [None]:
df.info()

<a id='3'></a>
# Exploratory Data Analysis

### **First let us look at the distribution of target.**

In [None]:
plt.figure(figsize=(10, 7))
sns.distplot(df['target'])
plt.title('Target Distribution')
plt.show()
df['target'].describe()

This shows that the target is roughly bell-shaped (approximately normal), with mean=-0.959319 and standard deviation=1.033579.
The target ranges from -3.676268 to 1.711390, where target=-3.67 is the most difficult text and target=1.71 is the easiest.<br> **Let us also look at the distribution of standard error.**

In [None]:
plt.figure(figsize=(10, 7))
sns.distplot(df['standard_error'])
plt.title('Standard Error Distribution')
plt.show()
df['standard_error'].describe()

The standard_error measures the spread of scores among the raters for each excerpt: each excerpt was read by many different people who each gave a score, and standard_error captures how much those scores differ. The lower the standard_error, the more precise the target value. From the plot we can observe that it is very skewed to the left. We also have one outlier where standard_error=0. This excerpt is considered the reference excerpt, and all other excerpts are compared with it.

In [None]:
df[df['standard_error']== 0]

### **Let us also look at the relationship between target and standard_error.**

In [None]:
plt.figure(figsize=(10, 5))
sns.scatterplot(x='target', y='standard_error', data=df)
plt.title('Standard Error vs Target')
plt.show()

We can observe that there is no clear linear relationship between target and standard_error. Still, when the target value is very high or very low (i.e. the excerpt is either very easy or very difficult), the standard_error tends to be high, which means the raters disagreed more.

### **Now let us get some insights from the "excerpt".**

First of all we will look at the most common words in the excerpts.
For that we will first clean the data and remove the stopwords. We will store the result in a new column and then count the words using Counter. In the end we will plot the 25 most common words.

In [None]:
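# Build the stopword set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = re.sub("[^a-zA-Z]", " ", text)  # keep letters only
    text = text.lower().split()
    return [word for word in text if word not in stop_words]

df['temp'] = df['excerpt'].apply(clean_text)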

top = Counter([word for words in df['temp'] for word in words])
df_temp = pd.DataFrame(top.most_common(25))
df_temp.columns = ['Common_words','count']

fig = px.bar(df_temp, x='count', y='Common_words', title='Most Common Words in excerpt', orientation='h', width=700,height=700, color='Common_words')
fig.show()

fig = px.treemap(df_temp, path=['Common_words'], values='count',title='Tree of Most Common Words')
fig.show()

### WordCloud

In [None]:
plt.figure(figsize=(10, 10))
text = df['excerpt'].values
url = 'https://static.vecteezy.com/system/resources/previews/000/263/280/non_2x/vector-open-book.jpg'
im = np.array(Image.open(requests.get(url, stream=True).raw))
cloud = WordCloud(stopwords = STOPWORDS,
                  background_color='white',
                  mask = im,
                  max_words = 200,
                  ).generate(" ".join(text))
plt.imshow(cloud)
plt.axis('off')
plt.show()

We can also plot the distribution of the top part-of-speech tags in the excerpt corpus. Part-of-speech (POS) tagging is the process of assigning a part of speech, such as noun, verb or adjective, to each word.
For this we will use TextBlob to dive into the POS tags of our "excerpt" data.

In [None]:
text = ' '.join(df['excerpt'])
blob = TextBlob(text)
top = Counter([pos[1] for pos in blob.tags])
df_temp = pd.DataFrame(top.most_common(15))
df_temp.columns = ['Part_of_Speech','count']
fig = px.bar(df_temp, x='Part_of_Speech', y='count', title='Top 15 Part-Of-Speech tagging', width=700,height=700, color='Part_of_Speech')
fig.show()

### **Next, we will explore the data on the basis of complexity of the text.**

### **Number of words in each passage**

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

text_len = df[df['target'] <= 0]['excerpt'].str.split().map(lambda x: len(x))
sns.distplot(text_len, ax=ax[0], color='red')
ax[0].set_title('High Complexity')

text_len = df[df['target'] > 0]['excerpt'].str.split().map(lambda x: len(x))
sns.distplot(text_len, ax=ax[1], color='blue')
ax[1].set_title('Low Complexity')

fig.suptitle('Number of Words in text')
plt.show()

### **Average word length in each passage**

In [None]:
def avg_word_len(text):
    avg_len = text.str.split().apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x))
    return avg_len

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

avg_len = avg_word_len(df[df['target'] <= 0]['excerpt'])
sns.distplot(avg_len, ax=ax[0], color='red')
ax[0].set_title('High Complexity')

avg_len = avg_word_len(df[df['target'] > 0]['excerpt'])
sns.distplot(avg_len, ax=ax[1], color='blue')
ax[1].set_title('Low Complexity')

fig.suptitle('Average word length in a text')
plt.show()

### **Number of Sentences in text**

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

no_sents = df[df['target'] <= 0]['excerpt'].apply(lambda x : len(x.split('\n')))
sns.distplot(no_sents, ax=ax[0], color='red')
ax[0].set_title('High Complexity')

no_sents = df[df['target'] > 0]['excerpt'].apply(lambda x : len(x.split('\n')))
sns.distplot(no_sents, ax=ax[1], color='blue')
ax[1].set_title('Low Complexity')

fig.suptitle('Number of Sentences in text')
plt.show()
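
Note that `x.split('\n')` counts line breaks, so the numbers above are closer to a paragraph count than a sentence count. A rough alternative is sketched below, assuming that a sentence ends with `.`, `!` or `?` followed by whitespace or the end of the text. The helper name `count_sentences` is hypothetical and not part of the original notebook, and abbreviations such as "Mr." will be over-counted.

```python
import re

def count_sentences(text):
    """Rough sentence count: split on runs of . ! ? followed by whitespace or end of text."""
    parts = re.split(r'[.!?]+(?:\s+|$)', text.strip())
    # Drop the empty piece left after the final punctuation mark
    return len([part for part in parts if part])

sample = "The sun rose. Birds sang!\nWas it morning? Yes."
print(count_sentences(sample))    # punctuation-based count
print(len(sample.split('\n')))    # line-break count used in the cells above
```

If this estimate suits your data, it could be applied with `df['excerpt'].apply(count_sentences)` in place of the lambda.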

### **Now let us also compare these with the 'target'.**

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(10,15))

df['text_len'] = df['excerpt'].str.split().map(lambda x: len(x))
sns.scatterplot(x='text_len', y='target', data=df, ax=ax[0])
ax[0].set_title("Word Count vs Target", fontweight ="bold")

avg_len = avg_word_len(df['excerpt'])
df['avg_word_len'] = avg_len
sns.scatterplot(x='avg_word_len', y='target', data=df, color='red', ax=ax[1])
ax[1].set_title("Average Word Length vs Target", fontweight ="bold")

df['no_sents'] = df['excerpt'].apply(lambda x : len(x.split('\n')))
sns.scatterplot(x='no_sents', y='target', data=df, color='orange', ax=ax[2])
ax[2].set_title("Sentence Count vs Target", fontweight ="bold")

plt.subplots_adjust(hspace=0.35)

plt.show()

### **Correlation Matrix**

In [None]:
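# Keep only numeric columns; recent pandas versions raise an error on text columns
corr = df.select_dtypes(include='number').corr()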
fig = plt.figure(figsize=(10,10))
sns.heatmap(corr, cmap="YlGnBu", center=0, square=True, linewidths=.5, annot=True)
plt.show()
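
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a minimal sketch of what a single cell means, here is the same quantity computed by hand. The function name `pearson` and the toy numbers are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: longer texts paired with lower (harder) targets
word_counts = [150, 160, 170, 180, 190]
targets = [0.5, 0.1, -0.4, -1.0, -1.3]
r = pearson(word_counts, targets)
print(round(r, 3))
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 only rules out a linear one.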

<a id='4'></a>
# Data Preprocessing

In [None]:
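wnl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once for fast lookups

def clean_text(text):
    text = re.sub("[^a-zA-Z]", " ", text)  # keep letters only
    text = text.lower().split()
    # Drop stopwords and reduce each remaining word to its lemma
    text = [wnl.lemmatize(word) for word in text if word not in stop_words]
    return " ".join(text)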

In [None]:
df['clean_text'] = df['excerpt'].apply(lambda x : clean_text(x))

In [None]:
X = df['clean_text'].values
y = df['target'].values

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=5)

After all the exploration and preprocessing, it's time to create some models. First, we will build several baseline machine learning models. I will keep them basic so that you can see how machine learning can be used for NLP tasks. With more features and some hyperparameter tuning, you can achieve better results.

<a id='5'></a>
# ML Models (Baseline)

Before applying different kinds of ML algorithms, we first have to convert our string data into a numerical (vector) form. There are many ways to convert text into vectors; here I will use the TF-IDF vectorizer.

In [None]:
tfidf = TfidfVectorizer(binary=True)
vect = tfidf.fit(X_train)
X_train = vect.transform(X_train)
X_val = vect.transform(X_val)
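
To make the weighting less of a black box, here is a minimal pure-Python sketch of the idea. It assumes binary term presence (as with `binary=True`), the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, and L2 normalisation. The function `tfidf_binary` and the toy documents are hypothetical, and the whitespace tokenisation is a simplification of what `TfidfVectorizer` actually does.

```python
import math

def tfidf_binary(docs):
    """Binary-TF, smoothed-IDF, L2-normalised vectors for whitespace-tokenised docs."""
    tokenised = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenised))
    n_docs = len(docs)

    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {}
    for term in vocab:
        doc_freq = sum(term in tokens for tokens in tokenised)
        idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

    vectors = []
    for tokens in tokenised:
        raw = [idf[term] if term in tokens else 0.0 for term in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty docs
        vectors.append([w / norm for w in raw])
    return vocab, idf, vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vocab, idf, vectors = tfidf_binary(docs)
for term in vocab:
    print(term, round(idf[term], 3))
```

A word that appears in every document ("the") gets the smallest weight, while a word seen in only one document ("dog") gets the largest.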

In [None]:
mse_plot = {} # For plotting purpose

<a id='5.1'></a>
## Linear Regression

In [None]:
model_lr = LinearRegression().fit(X_train, y_train)
y_pred = model_lr.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['LinearRegression'] = mse
print(f"Model Name: Linear Regression ====>>> MSE:{mse}")

<a id='5.2'></a>
## Ridge Regression

In [None]:
model_rr = Ridge().fit(X_train, y_train)
y_pred = model_rr.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['RidgeRegression'] = mse
print(f"Model Name: Ridge Regression ====>>> MSE:{mse}")

<a id='5.3'></a>
## Support Vector Regression

In [None]:
model_svr = SVR().fit(X_train, y_train)
y_pred = model_svr.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['SVR'] = mse
print(f"Model Name: Support Vector Regression ====>>> MSE:{mse}")

<a id='5.4'></a>
## Random Forest Regressor

In [None]:
model_rf = RandomForestRegressor().fit(X_train, y_train)
y_pred = model_rf.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['RandomForest'] = mse
print(f"Model Name: Random Forest Regressor ====>>> MSE:{mse}")

<a id='5.5'></a>
## Gradient Boosting Regressor

In [None]:
model_gbr = GradientBoostingRegressor().fit(X_train, y_train)
y_pred = model_gbr.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['GradientBoosting'] = mse
print(f"Model Name: Gradient Boosting Regressor ====>>> MSE:{mse}")

<a id='5.6'></a>
## AdaBoost Regressor

In [None]:
model_abr = AdaBoostRegressor().fit(X_train, y_train)
y_pred = model_abr.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['AdaBoost'] = mse
print(f"Model Name: AdaBoost Regressor ====>>> MSE:{mse}")

<a id='5.7'></a>
## XGBoost Regressor

In [None]:
model_xgb = XGBRegressor().fit(X_train, y_train)
y_pred = model_xgb.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
mse_plot['XGBoost'] = mse
print(f"Model Name: XGBoost Regressor ====>>> MSE:{mse}")

In [None]:
plt.figure(figsize=(12, 6))
# Pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.barplot(x=list(mse_plot.keys()), y=list(mse_plot.values()))
plt.xlabel("ML Models")
plt.ylabel("Mean_Squared_Error")
plt.title("Comparison Graph", fontsize=20)
plt.show()
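
The cells above report MSE, while the BERT model later in this notebook tracks `RootMeanSquaredError`. Taking the square root puts the numbers on the same scale as the target itself, which makes them easier to read. A small sketch follows, with a hypothetical `rmse` helper and placeholder scores that are not real results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Placeholder MSE values standing in for a dictionary such as mse_plot
mse_scores = {'ModelA': 0.64, 'ModelB': 0.49}
rmse_scores = {name: math.sqrt(mse) for name, mse in mse_scores.items()}
print(rmse_scores)
```

The same dictionary comprehension could be applied to `mse_plot` before plotting if you prefer to compare models by RMSE.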

<a id='6'></a>
# DL Models (Baseline)
Next, let us also talk about deep learning models. <br>
We will use the same data that we got after applying the data preprocessing steps, but before creating the models we first have to process it differently.<br>
Just as we converted our text data into vector form previously, here we will also convert it, but using a different technique.
First, we will map each word to an integer index with the Keras Tokenizer, so that every text becomes a sequence of integers (the Embedding layer then learns a dense vector for each index). Then we will do padding, i.e., we will make all the sequences the same length.

In [None]:
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
word_index = tokenizer.word_index

In [None]:
max_len = 256
padded = pad_sequences(sequences, maxlen=max_len, padding='post')
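
To make these two steps concrete, here is a simplified pure-Python sketch of the same idea. It assumes words are ranked by frequency starting at index 1 (0 is kept for padding), that over-long sequences lose tokens from the front (the Keras default `truncating='pre'`), and that zeros are appended at the end (`padding='post'`). The helper names are hypothetical, and the real `Tokenizer` also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, max_len):
    """Turn texts into equal-length integer sequences padded with trailing zeros."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-max_len:]                             # keep the last max_len tokens
        padded.append(seq + [0] * (max_len - len(seq)))  # pad with zeros at the end
    return padded

texts = ["the cat sat", "the dog chased the cat"]
word_index = build_word_index(texts)
print(word_index)
print(texts_to_padded(texts, word_index, max_len=4))
```

Note that with `max_len = 256` any excerpt longer than 256 tokens is truncated, so it is worth checking the word-count distribution from the EDA section before choosing this value.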

In [None]:
X_train, X_val, y_train, y_val = train_test_split(padded, y, test_size=0.2, random_state=5)

<a id='6.1'></a>
## Simple RNN

In [None]:
model1 = Sequential()
model1.add(Embedding(len(word_index)+1, 250, input_length=max_len))
model1.add(SimpleRNN(100, return_sequences=True))
model1.add(SimpleRNN(100))
model1.add(Dense(100, activation='linear'))
model1.add(Dense(1, activation='linear'))

In [None]:
model1.summary()

In [None]:
model1.compile(optimizer='Adam', loss='mean_squared_error', metrics=['mse'])

In [None]:
model1.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32, epochs=5)

<a id='6.2'></a>
## LSTM

In [None]:
model2 = Sequential()
model2.add(Embedding(len(word_index)+1, 250, input_length=max_len))
model2.add(LSTM(100, return_sequences = True))
model2.add(LSTM(100))
model2.add(Dense(100, activation='linear'))
model2.add(Dense(1, activation='linear'))

In [None]:
model2.summary()

In [None]:
model2.compile(optimizer='Adam', loss='mean_squared_error', metrics=['mse'])

model2.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32, epochs=5)

<a id='6.3'></a>
## Bidirectional RNN

In [None]:
model3 = Sequential()
model3.add(Embedding(len(word_index)+1, 250, input_length = max_len))
model3.add(Bidirectional(LSTM(100, return_sequences = True)))
model3.add(Bidirectional(LSTM(100, dropout=0.3, recurrent_dropout=0.3)))
model3.add(Dense(100, activation='linear'))
model3.add(Dense(1, activation='linear'))

In [None]:
model3.summary()

In [None]:
model3.compile(optimizer='Adam', loss='mean_squared_error', metrics=['mse'])

model3.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=64, epochs=5)

<a id='6.4'></a>
## BERT
Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based language model developed by Google. If you are a beginner and have never used BERT, I recommend going through the resources listed below before looking at the implementation, so that you know the basics and can implement BERT with a better understanding.

### Sequence To Sequence Models
* https://www.youtube.com/watch?v=jCrgzJlxTKg&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=26&ab_channel=KrishNaik

### Attention Models
* https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a
* https://www.youtube.com/watch?v=fdhojC37_Co&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=28&ab_channel=KrishNaik

### Transformers
* http://jalammar.github.io/illustrated-transformer/
* https://www.youtube.com/watch?v=SMZQrJ_L1vo&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=29&ab_channel=KrishNaik

### BERT
* http://jalammar.github.io/illustrated-bert/
* https://www.youtube.com/watch?v=xI0HHN5XKDo&ab_channel=CodeEmporium

After going through all the resources, you should have a sound understanding of these topics. So now it's time to implement BERT. We will fine-tune the model for our task using TF/Keras.

**Reference :- [TF/Keras BERT Baseline (Training/Inference)](https://www.kaggle.com/jeongyoonlee/tf-keras-bert-baseline-training-inference/notebook)**

In [None]:
import tensorflow as tf
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras import Model, Input, backend as K
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

from transformers import TFBertModel, BertConfig, BertTokenizerFast
from tensorflow.keras.callbacks import LearningRateScheduler

### Tokenization using transformers

In [None]:
pretrained_dir = '../input/tfbert-base-uncased'

In [None]:
tokenizer = BertTokenizerFast.from_pretrained(pretrained_dir)

model_config = BertConfig.from_pretrained(pretrained_dir)
model_config.output_hidden_states = True

bert_model = TFBertModel.from_pretrained(pretrained_dir, config=model_config)

In [None]:
def bert_encode(texts, tokenizer, max_len=max_len):
    input_ids = []
    token_type_ids = []
    attention_mask = []
    
    for text in texts:
        token = tokenizer(text, max_length=max_len, truncation=True, padding='max_length',
                         add_special_tokens=True)
        input_ids.append(token['input_ids'])
        token_type_ids.append(token['token_type_ids'])
        attention_mask.append(token['attention_mask'])
    
    return np.array(input_ids), np.array(token_type_ids), np.array(attention_mask)

In [None]:
X = bert_encode(X, tokenizer, max_len=max_len)

### Model Training 

In [None]:
def build_model(bert_model, max_len):
    input_ids = Input(shape=(max_len, ), dtype=tf.int32, name='input_ids')
    attention_mask = Input(shape=(max_len, ), dtype=tf.int32, name='attention_masks')
    token_type_ids = Input(shape=(max_len, ), dtype=tf.int32, name='token_type_ids')
    
    sequence_output = bert_model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0]
    output = sequence_output[:, 0, :]
    output = Dropout(0.2)(output)
    output = Dense(1, activation='linear')(output)
    
    model = Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=output)
    # `lr` is a deprecated alias; `learning_rate` is the supported argument name
    model.compile(Adam(learning_rate=1e-5), loss='mean_squared_error', metrics=[RootMeanSquaredError()])
    
    return model

In [None]:
model4 = build_model(bert_model, max_len=max_len)
model4.summary()

In [None]:
def scheduler(epoch, lr, warmup=5, decay_start=10):
    if epoch <= warmup:
        return lr / (warmup - epoch + 1)
    elif warmup < epoch <= decay_start:
        return lr
    else:
        return lr * tf.math.exp(-.1)

ls = LearningRateScheduler(scheduler, verbose=1)
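
One caveat about the scheduler above: Keras passes the optimizer's current learning rate as `lr`, so each epoch's output becomes the next epoch's input. The divisions therefore compound, and the rate keeps shrinking during the "warmup" epochs instead of ramping up. Below is a sketch of one possible alternative that computes the rate from a fixed base value, so it depends only on the epoch number. The factory `make_schedule` and its `decay_rate` parameter are assumptions of this sketch, not part of the referenced notebook.

```python
import math

def make_schedule(base_lr, warmup=5, decay_start=10, decay_rate=0.1):
    """Linear warmup to base_lr, a flat plateau, then exponential decay."""
    def schedule(epoch, lr=None):
        # The optimizer's current rate (lr) is deliberately ignored
        if epoch < warmup:
            return base_lr * (epoch + 1) / warmup
        if epoch <= decay_start:
            return base_lr
        return base_lr * math.exp(-decay_rate * (epoch - decay_start))
    return schedule

schedule = make_schedule(base_lr=1e-5)
for epoch in (0, 2, 4, 7, 12):
    print(epoch, schedule(epoch))
```

The returned function accepts `(epoch, lr)`, so it can be passed as `LearningRateScheduler(make_schedule(1e-5), verbose=1)`.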

In [None]:
model4.fit( X, y, validation_split=0.2, epochs=5, batch_size=8, callbacks=[ls])

In [None]:
metrics1 = pd.DataFrame(model1.history.history)
metrics2 = pd.DataFrame(model2.history.history)
metrics3 = pd.DataFrame(model3.history.history)
metrics4 = pd.DataFrame(model4.history.history)

In [None]:
plt.figure(figsize=(9, 6))
metrics1['val_loss'].plot(label='SimpleRNN', marker='o')
metrics2['val_loss'].plot(label='LSTM', marker='o')
metrics3['val_loss'].plot(label='Bidirectional RNN', marker='o')
metrics4['val_loss'].plot(label='BERT', marker='o')
plt.xlabel("Epochs", fontsize=12)
plt.ylabel("Model Val_Loss", fontsize=12)
plt.title("Model Val_Loss vs Epochs", fontsize=16)
plt.legend()
plt.show()

<a id='7'></a>
# Ending Notes
I have tried to cover most of the techniques, but there are other methods that I have not covered in this notebook. If you want to understand Word2Vec and GloVe, which are not covered here: [Click Here](https://github.com/Printutcarsh/Complete-NLP-notebook-Part-2-Word2Vec-and-Glove) <br>
So, I hope this notebook will help you in this competition and in other NLP tasks as well. <br>
<div class="alert alert-block alert-info">
    <h2 align="center">Please do an upvote if you find it useful !</h2>
</div>