<br>
<h1 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">CommonLit Readability</h1>
<br>
<img src="https://i.imgur.com/mi9U6o5.png">

### <h3 style="color:#fe346e">About the Problem</h3>

CommonLit, Inc., is a nonprofit education technology organization serving over 20 million teachers and students with free digital reading and writing lessons for grades 3-12. Together with Georgia State University, an R1 public research university in Atlanta, they are challenging Kagglers to improve readability rating methods.
In this competition, you’ll build algorithms to rate the complexity of reading passages for grade 3-12 classroom use. To accomplish this, you'll pair your machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.

### <h3 style="color:#fe346e">Evaluation</h3>

Submissions are scored on the root mean squared error. RMSE is defined as:

<img src="https://miro.medium.com/max/966/1*lqDsPkfXPGen32Uem1PTNg.png">
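
As a rough illustration of the metric above, RMSE can be computed with NumPy as sketched below. The helper name `rmse` and the toy values are assumptions made for this example and are not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only (not competition data)
print(rmse([0.0, -1.0, 2.0], [0.0, 1.0, 2.0]))
```

Lower is better: a perfect prediction gives 0, and large individual errors are penalised more heavily than small ones because the residuals are squared.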

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Import Libraries&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
# Import everything from tensorflow.keras so standalone keras and tf.keras objects are not mixed
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (LSTM, GRU, SimpleRNN, Dense, Activation, Dropout,
                                     Embedding, BatchNormalization, GlobalMaxPooling1D,
                                     Conv1D, MaxPooling1D, Flatten, Bidirectional,
                                     SpatialDropout1D)
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as ex
import plotly.graph_objs as go

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Read dataset&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
train=pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test=pd.read_csv("../input/commonlitreadabilityprize/test.csv")
submission=pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")

In [None]:
print(train.head())
print(test.head())
print(submission.head())

In [None]:
# DataFrame.info() prints its summary and returns None, so it is called directly
train.info()
print("\n\n")
test.info()
print("\n\n")
submission.info()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Check nulls and unique values&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
train_stats = (pd.concat([train.apply(lambda x: x.nunique(), axis = 0)
                          .rename("distinct_values").to_frame(),
                          train.apply(lambda x: x.notna().sum(), axis = 0)
                          .rename("not_nan_values").to_frame()], axis = 1)
              .reset_index().rename({'index': 'variable'}, axis = 1))

In [None]:
train_stats

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Plot distribution of values&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
fig=go.Figure()
fig.add_trace(go.Bar(x=train_stats['variable'],
                     y=train_stats['distinct_values'],
                     name="distinct val",
                     text=train_stats['distinct_values'],
                     textposition="outside"
))
fig.add_trace(go.Bar(x=train_stats['variable'],
                     y=train_stats['not_nan_values'],
                     name="not na",
                     text=train_stats['not_nan_values'],
                     textposition="outside"
))
fig.update_layout(title={'text':"Train set Data Comparison Unique vs Not NA's",
                         'xanchor': 'center',
                         'yanchor': 'top',
                         'x':0.5,'y':0.97},
                  font=dict(size=10,family='Verdana'),
                  template='plotly_dark',
                  legend=dict(
                        orientation='h',
                        yanchor="bottom",
                        y=1.01,
                        xanchor="center",
                        x=0.5,
                        bgcolor="black",
                        bordercolor="white",
                        borderwidth=2,
                        font=dict(
                                family="Courier",
                                size=10,
                                color="white"
                            )))
fig.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Plot distribution of target values and standard error&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
fig, ax = plt.subplots(2, 1, figsize = (20, 12))

fig.suptitle("Distribution of Target and Standard Error", fontsize = 20)

sns.histplot(data = train, x = 'target', 
             ax = ax[0], kde=True, bins = 50,
             stat = 'density', color = 'maroon',
             alpha = 0.3, label = 'Target',
             linewidth = 3, line_kws= {'linewidth': 3})

ax[0].legend(fontsize=18)
ax[0].set_xlabel('target', fontsize = 18)
ax[0].set_title('target distribution', fontsize = 15)
ax[0].tick_params(axis='both', which='major', labelsize=14)
ax[0].tick_params(axis='both', which='minor', labelsize=14)

sns.histplot(data = train, x = 'standard_error', 
             ax = ax[1], kde=True, bins = 50,
             stat = 'density', color = 'blue',
             alpha = 0.3, label = 'Standard Error',
             linewidth = 3, line_kws= {'linewidth': 3})

ax[1].legend(fontsize=18)
ax[1].set_xlabel('standard_error', fontsize = 18)
ax[1].set_xlim(0.4, 0.7)
ax[1].set_title('standard_error distribution', fontsize = 15)
ax[1].tick_params(axis='both', which='major', labelsize=14)
ax[1].tick_params(axis='both', which='minor', labelsize=14)

plt.subplots_adjust(hspace = 0.3)

<div class="alert alert-success">The distribution of <b>Target</b> above looks approximately normal, whereas the standard error distribution appears somewhat right-skewed.</div>
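
The visual impression of skewness can be backed by a number. The sketch below uses `DataFrame.skew()` on synthetic data; the toy frame and its distribution parameters are assumptions for illustration only, and in the notebook the same call could be applied to `train[['target', 'standard_error']]`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two columns (hypothetical parameters, not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "target": rng.normal(loc=-1.0, scale=1.0, size=2000),
    "standard_error": rng.gamma(shape=2.0, scale=0.05, size=2000) + 0.4,
})

# Skewness near 0 suggests symmetry; a clearly positive value suggests a right tail
skew_by_column = toy.skew()
print(skew_by_column)
```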

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Check for Outliers in Target&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
fig = go.Figure()
fig.add_trace(go.Violin(y=train['target'],
                            name='KDE with Boxplot',
                            box_visible=True,
                            meanline_visible=True,
                            text=train['target'],
                            fillcolor="lightgrey",
                            line=dict(color='darkred')))
fig.update_traces(points='all',jitter=0.05)
fig.update_layout(template='plotly_dark',
                  title={'text':"Distribution of Target",
                         'xanchor': 'center',
                         'yanchor': 'top',
                         'x':0.5,'y':0.9},
                  width=1000,
                  height=800,
                  yaxis_title="Target",
                  legend=dict(
                        orientation='h',
                        yanchor="bottom",
                        y=0.5,
                        xanchor="center",
                        x=0.5,
                        bgcolor="black",
                        bordercolor="white",
                        borderwidth=2,
                        font=dict(
                                family="Courier",
                                size=10,
                                color="white"
                            )))
fig.show()

<div class="alert alert-success">The combined box and violin plot for <b>Target</b> above shows the mean lying very close to the median.</div>
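
Outliers can also be counted rather than read off the plot. The sketch below applies the common 1.5 × IQR rule; the helper name `iqr_outlier_mask` and the toy array are assumptions for this example, and in the notebook the function could be called on `train['target']` or `train['standard_error']`.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy values for illustration only: one point sits far from the rest
toy = np.array([-1.2, -0.9, -1.0, -0.8, -1.1, -0.95, 3.5])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```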

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Check for Outliers in Standard Error&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
fig = go.Figure()
fig.add_trace(go.Violin(y=train['standard_error'],
                            name='KDE with Boxplot',
                            box_visible=True,
                            meanline_visible=True,
                            text=train['standard_error'],
                            fillcolor="lightgrey",
                            line=dict(color='darkred')))
fig.update_traces(points='all',jitter=0.05)
fig.update_layout(template='plotly_dark',
                  title={'text':"Distribution of standard error",
                         'xanchor': 'center',
                         'yanchor': 'top',
                         'x':0.5,'y':0.9},
                  width=1000,
                  height=800,
                  yaxis_title="standard error",
                  font=dict(
                            size=10,
                            color="white"),
                  legend=dict(
                            yanchor="bottom",
                            y=-0.3,
                            xanchor="center",
                            x=0.5))
fig.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Cleaning the text data&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
import re,string

def strip_links(text):
    # Raw string avoids invalid-escape warnings; the pattern itself is unchanged
    link_regex    = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links         = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')    
    return text

In [None]:
train['excerpt']=train['excerpt'].apply(lambda x:strip_links(x))
test['excerpt']=test['excerpt'].apply(lambda x:strip_links(x))

In [None]:
### replace :\n 
train['excerpt']=train['excerpt'].str.replace("\n",' ')
test['excerpt']=test['excerpt'].str.replace("\n",' ')

In [None]:
# Define the function to remove the punctuation
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# Apply to the DF series
train['excerpt'] = train['excerpt'].apply(remove_punctuations) 
test['excerpt'] = test['excerpt'].apply(remove_punctuations) 

In [None]:
### lower case 
train['excerpt']=train['excerpt'].apply(lambda x:x.lower())
test['excerpt']=test['excerpt'].apply(lambda x:x.lower())
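
The four cleaning steps above could also be bundled into a single function so that train and test are guaranteed to receive identical treatment. This is a minimal sketch: `clean_excerpt` is a hypothetical name, and the simplified URL pattern is an assumption that differs from the regex used in `strip_links`.

```python
import re
import string

# Simplified URL pattern (assumption): "http(s)://" followed by non-space characters
URL_PATTERN = re.compile(r"https?://\S+")
# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_excerpt(text):
    text = URL_PATTERN.sub(", ", text)   # replace links, as strip_links does
    text = text.replace("\n", " ")       # newlines become spaces
    text = text.translate(PUNCT_TABLE)   # drop punctuation in a single pass
    return text.lower()                  # lower-case last

print(clean_excerpt("See https://example.com now!\nIt's GREAT."))
```

It could then be applied with `train['excerpt'].apply(clean_excerpt)`. Note that removing punctuation and case discards sentence boundaries and proper-noun cues, which may themselves carry readability information.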

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Word Clouds&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
# Define a function to plot word cloud
def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(40, 30))
    # Display image
    plt.imshow(wordcloud) 
    # No axis details
    plt.axis("off");

In [None]:
# Import package
from wordcloud import WordCloud, STOPWORDS
# Generate word cloud
wordcloud = WordCloud(width = 3000, 
                      height = 2000, 
                      random_state=1, 
                      background_color='salmon', 
                      colormap='Pastel1', 
                      collocations=False, 
                      stopwords = STOPWORDS).generate(train['excerpt'].values[0])
# Plot
plot_cloud(wordcloud)

In [None]:
# Generate wordcloud
wordcloud = WordCloud(width = 3000, 
                      height = 2000, 
                      random_state=1, 
                      background_color='black', 
                      colormap='Set2', 
                      collocations=False, 
                      stopwords = STOPWORDS).generate(train['excerpt'].values[1])
# Plot
plot_cloud(wordcloud)

In [None]:
from PIL import Image
# Import image to np.array
mask = np.array(Image.open('../input/maskcloud/upvote.png'))
# Generate wordcloud
wordcloud = WordCloud(width = 3000, 
                      height = 2000, 
                      random_state=1, 
                      background_color='white', 
                      colormap='rainbow', 
                      collocations=False, 
                      stopwords = STOPWORDS, mask=mask).generate(train['excerpt'].values[2])
# Plot
plot_cloud(wordcloud)

In [None]:
from PIL import Image
# Import image to np.array
mask = np.array(Image.open('../input/maskcloud1/comment.png'))
# Generate wordcloud
wordcloud = WordCloud(width = 3000, 
                      height = 2000, 
                      random_state=1, 
                      background_color='white', 
                      colormap='rainbow', 
                      collocations=False, 
                      stopwords = STOPWORDS, mask=mask).generate(test['excerpt'].values[0])
# Plot
plot_cloud(wordcloud)

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Drop columns not required&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
train.head()
X=train[['id','excerpt','target']]

In [None]:
## Check the maximum length (in words) of the excerpts
train['excerpt'].apply(lambda x:len(str(x).split())).max()

### Split the training data into train and validation sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X.excerpt, X.target, 
                                                    random_state=42, 
                                                    test_size=0.2)

### Define max features and max len

In [None]:
max_features = 5000
maxlen = 200
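
One way to sanity-check a choice such as `maxlen = 200` is to measure what share of excerpts fit within it without truncation. The sketch below is illustrative only: `length_coverage` is a hypothetical helper and the toy texts stand in for `train['excerpt']`.

```python
import numpy as np

def length_coverage(texts, maxlen):
    """Fraction of texts whose whitespace-token count is at most maxlen."""
    lengths = np.array([len(str(t).split()) for t in texts])
    return float(np.mean(lengths <= maxlen))

# Toy texts for illustration only
toy_texts = ["a b c", "a b c d e", "a", "a b c d e f g h"]
print(length_coverage(toy_texts, maxlen=5))
```

A coverage close to 1.0 would suggest that little text is lost to truncation, while a low value would argue for a larger `maxlen`.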

### Tokenization and Indexing

In [None]:
# using keras tokenizer here
token = tf.keras.preprocessing.text.Tokenizer(num_words=max_features)

token.fit_on_texts(list(X_train) + list(X_test))
X_train_seq = token.texts_to_sequences(X_train)
X_test_seq = token.texts_to_sequences(X_test)

#zero pad the sequences
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=maxlen,padding='post',truncating='post')
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=maxlen,padding='post',truncating='post')

word_index = token.word_index
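
To make the effect of `padding='post'` and `truncating='post'` concrete, the sketch below mimics that behaviour for a single sequence in plain NumPy. `pad_post` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
import numpy as np

def pad_post(seq, maxlen, value=0):
    """Cut a sequence at the end, then fill the remainder with `value`."""
    trimmed = list(seq)[:maxlen]                      # truncating='post'
    padding = [value] * (maxlen - len(trimmed))       # padding='post'
    return np.array(trimmed + padding)

print(pad_post([4, 9, 2], maxlen=5))           # shorter sequence: zeros appended
print(pad_post([4, 9, 2, 7, 1, 8], maxlen=5))  # longer sequence: tail dropped
```

With these settings the start of each excerpt is always kept, and index 0 is reserved for padding, which is why the embedding matrix later needs a row for it.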

In [None]:
# Positional lookup: after the shuffled split, label 0 may not be in X_train
X_train.iloc[0]

In [None]:
X_train_pad[0]

In [None]:
len(token.word_index)##30112

### Load embedding file

In [None]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
#unzip the file, we get multiple embedding files. We can use either one of them
# !unzip glove.6B.zip

from gensim.scripts.glove2word2vec import glove2word2vec

#Glove file - we are using model with 50 embedding size
glove_input_file = "../input/gloveembeddings/glove.6B.50d.txt"

#Name for word2vec file
word2vec_output_file = 'glove.6B.50d.txt.word2vec'

#Convert Glove embeddings to Word2Vec embeddings
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
### We will extract word embedding for which we are interested in; the pre trained has 400k words each with 50 embedding vector size.
from gensim.models import Word2Vec, KeyedVectors

# Load pretrained embedding model (in word2vec form)
emd_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

#Embedding length based on selected model - we are using 50d here.
embedding_vector_length = 50

In [None]:
#Initialize embedding matrix
embedding_matrix = np.zeros((max_features + 1, embedding_vector_length))
print(embedding_matrix.shape)

In [None]:
for word, i in sorted(token.word_index.items(),key=lambda x:x[1]):
    if i > max_features: # the matrix only has rows 0..max_features
        break
    try:
        embedding_vector = emd_model[word] #Reading word's embedding from Glove model for a given word
        embedding_matrix[i] = embedding_vector
    except KeyError:
        pass # word not found in GloVe: its row stays all zeros

In [None]:
embedding_matrix.shape
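
Because words missing from GloVe are silently left as zero rows, it can be useful to check how much of the vocabulary actually received a vector. The sketch below is a minimal illustration: `embedding_coverage` is a hypothetical helper and the toy matrix stands in for `embedding_matrix`.

```python
import numpy as np

def embedding_coverage(matrix):
    """Share of vocabulary rows (excluding padding row 0) with a non-zero vector."""
    rows = matrix[1:]
    return float(np.mean(np.any(rows != 0, axis=1)))

# Toy matrix for illustration only: 4 vocabulary rows, 2 of them filled
toy_matrix = np.zeros((5, 3))
toy_matrix[1] = [0.1, 0.2, 0.3]
toy_matrix[3] = [0.5, 0.0, -0.4]
print(embedding_coverage(toy_matrix))
```

A low coverage would indicate that many tokens carry no pretrained signal, which is more of a concern here because the embedding layer is frozen with `trainable=False`.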

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Model Building : LSTM&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
### callbacks
learning_rate_reduction = ReduceLROnPlateau(monitor='val_root_mean_squared_error', patience=3, verbose=1, factor=0.5, min_lr=0.00001)
early_stopping = EarlyStopping(min_delta=0.001,patience=5,restore_best_weights=True,verbose=1)

In [None]:
# A simple bidirectional LSTM with glove embeddings and one dense layer
model = Sequential()
model.add(Embedding(max_features+1,
                    embedding_vector_length,
                    weights=[embedding_matrix],
                    input_length=maxlen, 
                    trainable=False))
model.add(Bidirectional(LSTM(100, dropout=0.3, recurrent_dropout=0.3)))
model.add(Dense(1,activation='linear'))
model.compile(loss=tf.keras.losses.MeanSquaredError(), optimizer='adam', metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.summary()

In [None]:
history=model.fit(X_train_pad,y_train,
          epochs=50,
          batch_size=32,          
          validation_data=(X_test_pad, y_test),
          callbacks=[early_stopping,learning_rate_reduction])

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Plots for RMSE and Loss&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
get_acc = history.history['root_mean_squared_error']
value_acc = history.history['val_root_mean_squared_error']
get_loss = history.history['loss']
validation_loss = history.history['val_loss']

In [None]:
# summarize history for metric
fig=go.Figure()
# Early stopping may end training before epoch 50, so size the x-axis from the history
fig.add_trace(go.Scatter(x=list(range(1, len(get_acc) + 1)),
                         y=get_acc,
                         name="Training RMSE",
                         mode="markers+lines",
                         marker=dict(color='green',size=4)))
fig.add_trace(go.Scatter(x=list(range(1, len(value_acc) + 1)),
                         y=value_acc,
                         name="Validation RMSE",
                         mode="markers+lines",
                         marker=dict(color='red',size=4)))
fig.update_layout(title="Model Metric - Training and Validation",
                  xaxis_title="Epochs",
                  yaxis_title="RMSE value",
                  template="plotly_dark"
                 )
fig.show()

In [None]:

# summarize history for loss
fig=go.Figure()
# Early stopping may end training before epoch 50, so size the x-axis from the history
fig.add_trace(go.Scatter(x=list(range(1, len(get_loss) + 1)),
                         y=get_loss,
                         name="Training Loss",
                         mode="markers+lines",
                         marker=dict(color='green',size=4)))
fig.add_trace(go.Scatter(x=list(range(1, len(validation_loss) + 1)),
                         y=validation_loss,
                         name="Validation Loss",
                         mode="markers+lines",
                         marker=dict(color='red',size=4)))
fig.update_layout(title="Model Loss - Training and Validation",
                  xaxis_title="Epochs",
                  yaxis_title="Loss",
                  template="plotly_dark"
                 )
fig.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Prediction on test&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
test.head()

In [None]:
test.shape

In [None]:
test_seq = token.texts_to_sequences(test['excerpt'])
test_pad = sequence.pad_sequences(test_seq, maxlen=maxlen,padding='post',truncating='post')

In [None]:
prediction = model.predict(test_pad)

In [None]:
prediction

In [None]:
submission.head()

In [None]:
submission["target"] = prediction.ravel()  # flatten the (n, 1) model output to 1-D
submission.to_csv("submission_v1.csv", index=False)

In [None]:
submission.head()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Model Building : GRU&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
### callbacks
learning_rate_reduction = ReduceLROnPlateau(monitor='val_root_mean_squared_error', patience=3, verbose=1, factor=0.5, min_lr=0.00001)
early_stopping = EarlyStopping(min_delta=0.001,patience=5,restore_best_weights=True,verbose=1)

In [None]:
model1 = Sequential()
model1.add(Embedding(max_features+1,
                    embedding_vector_length, ### 50 here
                    weights=[embedding_matrix],
                    input_length=maxlen, ### 200 here
                    trainable=False))
model1.add(SpatialDropout1D(0.3))
model1.add(GRU(300))
model1.add(Dense(1, activation='linear'))
model1.compile(loss=tf.keras.losses.MeanSquaredError(), optimizer='adam', metrics=[tf.keras.metrics.RootMeanSquaredError()])

model1.summary()

In [None]:
history1=model1.fit(X_train_pad,y_train,
          epochs=50,
          batch_size=32,          
          validation_data=(X_test_pad, y_test),
          callbacks=[early_stopping,learning_rate_reduction])

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Plots for RMSE and Loss&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
get_acc1 = history1.history['root_mean_squared_error']
value_acc1 = history1.history['val_root_mean_squared_error']
get_loss1 = history1.history['loss']
validation_loss1 = history1.history['val_loss']

In [None]:
# summarize history for metric
fig=go.Figure()
# Early stopping may end training before epoch 50, so size the x-axis from the history
fig.add_trace(go.Scatter(x=list(range(1, len(get_acc1) + 1)),
                         y=get_acc1,
                         name="Training RMSE",
                         mode="markers+lines",
                         marker=dict(color='green',size=4)))
fig.add_trace(go.Scatter(x=list(range(1, len(value_acc1) + 1)),
                         y=value_acc1,
                         name="Validation RMSE",
                         mode="markers+lines",
                         marker=dict(color='red',size=4)))
fig.update_layout(title="Model Metric - Training and Validation",
                  xaxis_title="Epochs",
                  yaxis_title="RMSE value",
                  template="plotly_dark"
                 )
fig.show()

In [None]:
# summarize history for loss
fig=go.Figure()
# Early stopping may end training before epoch 50, so size the x-axis from the history
fig.add_trace(go.Scatter(x=list(range(1, len(get_loss1) + 1)),
                         y=get_loss1,
                         name="Training Loss",
                         mode="markers+lines",
                         marker=dict(color='green',size=4)))
fig.add_trace(go.Scatter(x=list(range(1, len(validation_loss1) + 1)),
                         y=validation_loss1,
                         name="Validation Loss",
                         mode="markers+lines",
                         marker=dict(color='red',size=4)))
fig.update_layout(title="Model Loss - Training and Validation",
                  xaxis_title="Epochs",
                  yaxis_title="Loss",
                  template="plotly_dark"
                 )
fig.show()

<h1 style="font-family: Verdana; font-size: 24px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 3px; background-color: #ffffff; color: navy;">Prediction on test&nbsp;&nbsp;&nbsp;&nbsp;</h1>

In [None]:
prediction1 = model1.predict(test_pad)

In [None]:
prediction1

In [None]:
submission1=pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")
submission1["target"] = prediction1.ravel()  # flatten the (n, 1) model output to 1-D
submission1.to_csv("submission.csv", index=False)