<img src="https://images.unsplash.com/photo-1491841550275-ad7854e35ca6?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=967&q=80" width="500">
Photo by <a href="https://unsplash.com/@aaronburden?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Aaron Burden</a> on <a href="https://unsplash.com/s/photos/reading-children?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

# CommonLit Readability Prize Competition

CommonLit, Inc., is a nonprofit education technology organization serving over 20 million teachers and students with free digital reading and writing lessons for grades 3-12. Together with Georgia State University, an R1 public research university in Atlanta, they are challenging Kagglers to improve readability rating methods.

## <center style="background-color:Gainsboro; width:40%;">Contents</center>
1. [Overview](#1.-Overview)<br>
    1.1 [Acknowledgements](#1.1-Acknowledgements)<br>
2. [The Data](#2.-The-Data)<br>
    2.1 [Creating Additional Features](#2.1-Creating-Additional-Features)<br>
    2.2 [KDE Plots](#2.2-KDE-Plots)<br>
    2.3 [Target Feature and Feature Relationships](#2.3-Target-Feature-and-Feature-Relationships)<br>
    2.4 [Target Quartiles](#2.4-Target-Quartiles)<br>
3. [Model](#3.-Model)<br>
4. [Results and Conclusion](#4.-Results-and-Conclusion)<br>

***Please remember to upvote if you find this Notebook helpful!***

# 1. Overview

In this challenge we are given text passages used as classroom reading material for grades 3-12. The goal is to train a regression model that predicts the readability score of each passage:

> "*In this competition, you’ll build algorithms to rate the complexity of reading passages for grade 3-12 classroom use. To accomplish this, you'll pair your machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.<br>
>  If successful, you'll aid administrators, teachers, and students. Literacy curriculum developers and teachers who choose passages will be able to quickly and accurately evaluate works for their classrooms. Plus, these formulas will become more accessible for all. Perhaps most importantly, students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.*"

## 1.1 Acknowledgements

On the Kaggle competition page, CommonLit extends a special thanks to Professor Scott Crossley's research team at the Georgia State University Departments of Applied Linguistics and Learning Sciences for their partnership on this project. In addition, Schmidt Futures is highlighted for their advice and support in making this challenge possible.

# 2. The Data

In this section we explore the dataset, check whether any data cleansing is required and perform the Data Analysis. The descriptions of the features, as provided by the Challenge Organisers, are shown below:

* **id** - unique ID for excerpt
* **url_legal** - URL of source - this is blank in the test set.
* **license** - license of source material - this is blank in the test set.
* **excerpt** - text to predict reading ease of
* **target** - reading ease
* **standard_error** - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

In [None]:
#The Basics
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
import os, string

#Text Processing
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

#Optimisation
import pickle
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING) #Disable Warnings

#ML Model
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import decomposition
from sklearn.preprocessing import MinMaxScaler


#Reproducible results
from numpy.random import seed
seed(0)
import tensorflow
tensorflow.random.set_seed(0)

#ANN
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from tensorflow.keras.callbacks import EarlyStopping,ReduceLROnPlateau
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import RootMeanSquaredError

> My Functions

In [None]:
def basic_EDA(df):
    """Print basic size, duplicate and null statistics for a DataFrame."""
    n_rows, n_cols = df.shape
    sum_duplicates = df.duplicated().sum()
    sum_null = df.isnull().sum().sum()
    count_NaN_rows = df.isnull().any(axis=1).sum() #rows with at least one null entry
    print("Number of Samples: %d,\nNumber of Features: %d,\nDuplicated Entries: %d,\nNull Entries: %d,\nNumber of Rows with Null Entries: %d %.1f%%"
          % (n_rows, n_cols, sum_duplicates, sum_null, count_NaN_rows, (count_NaN_rows / n_rows)*100))

def summary_table(df):
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    return summary

def count_words(df,words_list):
    count_list = []
    for text in df:
        count=0
        text = preprocess_sentence(text)
        for word in text.split():
            if word in words_list: count+=1
        count_list.append(count)
    return count_list

#Compute Percentiles
def percentile_drop(df,column):
    q25, q50, q75 = np.percentile(df[column], 25), np.percentile(df[column], 50), np.percentile(df[column], 75)
    print ('Lower 25th Quartile: %.2f \nMedian: %.2f \nUpper 75th Quartile: %.2f' % (q25,q50,q75))
    #row indices per quartile range (computed for reference only, not returned)
    q25_index = df[(df[column] <=q25)].index
    q25_q50_index = df[(df[column] > q25)&(df[column] <= q50)].index #both bounds must hold, hence & rather than |
    q50_q75_index = df[(df[column] > q50)&(df[column] <= q75)].index
    q75_index = df[(df[column] >=q75)].index
    return q25, q50, q75

def process_list(text): #returns a list of preprocessed words
        word_list = []
        #for t in text:            
        text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text) #remove punctuations
        text = text.lower() #lower case
        tokenized_word=word_tokenize(text) #separate into words
        for word in tokenized_word:
            if word not in stop_words: #filter stop-words
                word = stem.stem(word) #stemming
                word_list.append(word) #append to general list
        return word_list
    
def build_freqs(texts, category): #computes the frequency of words 
    categorylist = np.squeeze(category).tolist()
    freqs = {}
    words_sample = []
    for text, cat in zip(texts,categorylist):
        for word in process_list(text):
            words_sample.append(word)
            pair = (word, cat)
            freqs[pair] = freqs.get(pair, 0) + 1  
    return freqs,words_sample


def preprocess_sentence(df): #returns the whole sentence, with preprocessed text
    word_list = []
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', df) #remove punctuations
    text = text.lower() #lower case
    tokenized_word=word_tokenize(text) #separate into words
    for word in tokenized_word:
        if word not in stop_words: #filter stop-words
            word = stem.stem(word) #stemming
            word_list.append(word) #append to general list
    return ' '.join(word_list) #rejoins the sentence without the stopwords

def Count_Repeated(df):
    output = []
    for text in df:
        count_repeated = 0
        freqs = {}
        words_sample = []
        for word in process_list(text):
            words_sample.append(word)
            freqs[word] = freqs.get(word, 0) + 1  
        words_repeated = [word for word, occurrences in freqs.items() if occurrences >= 3] #keep the words that occur at least 3 times
        #print(words_repeated)
        count_repeated = len(words_repeated)
        output.append(count_repeated)
        #print(count_repeated)
    return output

stop_words=set(stopwords.words("english"))
stem = PorterStemmer()

Importing the Training and Test set:

In [None]:
df_train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
df_test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

print("**Training set**")
basic_EDA(df_train)
print("**Test set**")
basic_EDA(df_test)

* The dataset is rather small, with fewer than 3,000 samples. Interestingly, the test set available here contains only seven examples
* Several Null entries have been found. It is necessary to understand if these empty values are from relevant features or can be ignored
* No duplicated entries, no need to remove samples. It seems like no data cleansing is going to be required

Now, let's understand where the missing values are coming from:

In [None]:
summary_table(df_train)

The good news is that there are no null entries in the **excerpt** feature, so no strategy to replace or drop rows is required. In short, the only columns relevant to us are **excerpt** and **target**, which we later separate into features and target to continue our analysis.

Below is a sample of the training set:

In [None]:
df_train.head(5)

# 2.1 Creating Additional Features

Similar to what I have done when performing Text Analysis in other Notebooks, the creation of additional features can help us understand what drives the "Target" Readability Score.

* **Word Count** - Total number of words in the text
* **Character Count** - Total number of characters in the text excluding spaces
* **Word Density** - Number of words divided by the number of characters (plus one), i.e. roughly the inverse of the average word length
* **Punctuation Count** - Total number of punctuation characters used in the text
* **Upper-Case to Lower-Case ratio** - Ratio of upper-case letters to lower-case letters used in the text
* **Number of Paragraphs** - While looking at the passages, the \n symbol was noted. The feature counts the number of \n in the text
* **Number of Formal Words** - Number of formal (complex) words found in the passage, based on the *Formal Words* list below
* **Number of Common Words** - Number of common words found in the passage, based on the *Common Words* list below
* **Number of Repeated Words** - Counts the distinct non-stopwords (after stemming) that appear at least three times in the text (*easier to read samples seem to repeat keywords more often*)

> *Formal Words:* Fragment of Manchester PhraseBank https://www.phrasebank.manchester.ac.uk/ <br>
> *Common Words:* List of ~3,000 of the most commonly used English words extracted from Vocabulary.com

In [None]:
#Importing Formal Words - Already Stemmed Words
formalWords = pd.read_csv('../input/formalwords/formal_list.csv', delimiter = ',', header=0, names = ['words'])
formalWords = formalWords['words'].tolist()
#Importing Common Words
common_words = pd.read_csv('../input/commonwords3000/CommonWords.csv', delimiter = ',', header=0, names = ['words'])
common_words = common_words['words'].tolist()
#Stemming 
common_words_stem = [stem.stem(word) for word in common_words]

In [None]:
punctuations = string.punctuation
#Adding new features
df_train['word_count'] = df_train['excerpt'].apply(lambda x : len(x.split()))
df_train['char_count'] = df_train['excerpt'].apply(lambda x : len(x.replace(" ","")))
df_train['word_density'] = df_train['word_count'] / (df_train['char_count'] + 1)

#Adding +1 to allow ratio calculation
df_train['uppercase'] = df_train['excerpt'].str.findall(r'[A-Z]').str.len()+1
df_train['lowercase'] = df_train['excerpt'].str.findall(r'[a-z]').str.len()+1
df_train['upplowRatio'] = df_train['uppercase'] / (df_train['lowercase'] + 1)
df_train['punc_count'] = df_train['excerpt'].apply(lambda x : len([a for a in x if a in punctuations]))
df_train['paragraphs'] = df_train['excerpt'].apply(lambda x : (x.count("\n")))
df_train['sentLength'] = df_train['word_count'] / (df_train['punc_count'] + 1)
df_train['commonWords'] = count_words(df_train['excerpt'], common_words_stem)
df_train['formalWords'] = count_words(df_train['excerpt'], formalWords)
df_train['repeatedWords'] = Count_Repeated(df_train['excerpt'])
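
To sanity-check the simpler feature expressions used in the cell above, the sketch below applies the same formulas to a single toy passage using only the standard library. The helper name `surface_features` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import string

def surface_features(text):
    """Compute a few of the surface features for one passage (hypothetical helper)."""
    word_count = len(text.split())
    char_count = len(text.replace(" ", ""))  # spaces removed, newlines still counted
    punc_count = sum(ch in string.punctuation for ch in text)
    return {
        "word_count": word_count,
        "char_count": char_count,
        "word_density": word_count / (char_count + 1),  # words per character
        "punc_count": punc_count,
        "paragraphs": text.count("\n"),  # number of newline characters
    }

sample = "The cat sat.\nIt was happy!"  # toy passage, not taken from the dataset
features = surface_features(sample)
print(features)
```

Note that `word_density` grows when words get shorter, so a higher value points to simpler vocabulary rather than to denser text.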

In [None]:
df_train.drop(['url_legal','license'],axis = 1,inplace = True)
df_test.drop(['url_legal','license'],axis = 1,inplace = True)

The columns not used for the Data Analysis (**url_legal and license**) were removed. Below is the result of the DataFrame with ALL the new features created:

In [None]:
df_train.head(3)

# 2.2 KDE Plots

It is usually helpful to start the Data Analysis with KDE plots to understand the value range and the shape of the distribution of each feature. The KDE of every feature is displayed next:

In [None]:
feat_list = list(df_train.columns[2:])

for i in feat_list:
    plt.figure(figsize=(20,5))
    ax = sns.kdeplot(data = df_train, x = i, linewidth=1,alpha=.3, fill = True,palette = 'husl') 
    ax.set_xlabel(i)
    plt.title('KDE Plot - ' + i, fontsize = 16,weight = 'bold',pad=20);  
    sns.despine(top=True, right=True, left=False, bottom=False)

* The **target** and most of the features present a Normal-like distribution. It is important to understand what the MAX and MIN values of the target mean
* Since most features present a Normal-like distribution, no log or power transformation is required
* The **paragraphs** feature displays an unusual shape with several peaks, because it only takes a few discrete values: most samples have one, two or three paragraphs
* It will be interesting to analyse whether text samples with more paragraphs have a higher readability score
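
As a rough illustration of why a feature with only a few discrete values produces a multi-peaked KDE, here is a minimal Gaussian kernel density estimate written with the standard library. The fixed bandwidth and the toy values are assumptions made for this sketch; seaborn chooses its bandwidth automatically, so its curves will look smoother.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

toy_values = [1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0]  # toy discrete values, e.g. paragraph counts
density = gaussian_kde(toy_values, bandwidth=0.2)

# The estimate peaks at the observed values and dips in between
print(density(2.0) > density(2.5), density(1.0) > density(1.5))
```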

# 2.3 Target Feature and Feature Relationships

In this section we further analyse the Feature Relationships. First let's try to understand what the **Target** value ranges might mean.

Below is a sample of the text with **MAX** target value:

In [None]:
pd.set_option('display.max_colwidth', None)
print('Max Target Value:') 
print(df_train.loc[df_train['target'].idxmax(),['excerpt','target']])

And here is a sample of the text with **MIN** target value:

In [None]:
print('Min Target Value:') 
print(df_train.loc[df_train['target'].idxmin(),['excerpt','target']])

By reading the passages we can understand that ***lower target values have more complex text*** when compared to higher target values. With that in mind, let's analyse the Relationship between the Features using Scatter Plots:

In [None]:
plt.figure(figsize=(20,8));
g = sns.pairplot(df_train,vars=feat_list,corner=True,
                 plot_kws=dict(s = 8),
                 diag_kws=dict(linewidth=0,alpha=.5));
g.fig.suptitle('Relationship between Features',fontsize=16, weight = 'bold');

* **Target and Common Words:** there is no clear pattern or relationship between the two features. The number of common words does not seem to indicate specific target value ranges
* **Target and Paragraphs:** as the number of paragraphs increases, the samples have larger target values. This makes sense, as paragraph breaks make the reading easier on the eyes
* **Target and Letter Case:** as the target values increase, the number of lowercase letters decreases. The same pattern is not seen with the Uppercase feature
* **Target and Word Density:** the Word Density (number of words / (number of characters + 1)) seems to increase as the target increases. Some of the lower density samples display the lower target values, possibly because higher complexity samples use longer words (more characters)
* **Target and Character Count:** shows a relationship similar to the one seen with the Lowercase feature. Higher target values come with smaller character counts


# 2.4 Target Quartiles

As we could not find any patterns between the Target values and Common or Complex words, let's see whether dividing the target feature into Quartiles teaches us something new about the data provided. This is a reasonable split since the Target feature has a distribution very similar to a Normal distribution.

The image below is just a visualisation of how we will divide the target values. The first quartile (Q1) is the 25th percentile, which cuts off the lowest 25% of the data. The second quartile (Q2) is the 50th percentile, the center of the data distribution. The third quartile (Q3) is the 75th percentile, which cuts off the lowest 75% (or highest 25%) of the data ([source](https://www.sciencedirect.com/science/article/pii/B9780123814791000022)):

<img src="https://ars.els-cdn.com/content/image/3-s2.0-B9780123814791000022-f02-02-9780123814791.jpg" width="500">

First, we obtain the Q1, Q2 and Q3 values for the Target Feature:

In [None]:
q25, q50, q75 = percentile_drop(df_train,'target')

df_train = df_train.assign(targetQuartile = '') #string column; every row receives its label below
df_train.loc[(df_train['target']) <= q25, 'targetQuartile'] = '<=Q25'
df_train.loc[(df_train['target'] > q25)&(df_train['target'] <= q50), 'targetQuartile'] = 'Q25-Q50'
df_train.loc[(df_train['target'] > q50)&(df_train['target'] <= q75), 'targetQuartile'] = 'Q50-Q75'
df_train.loc[df_train['target'] >= q75, 'targetQuartile'] = '>=Q75'
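
The thresholds and labels computed above can be reproduced on a toy list with the standard library. In this sketch, `statistics.quantiles` with `method="inclusive"` is used as a stand-in that should agree with the default linear interpolation of `np.percentile`; the helper `quartile_label` and the toy targets are assumptions for illustration only. As in the cell above, a value exactly equal to Q3 ends up in the `>=Q75` group.

```python
from statistics import quantiles

def quartile_label(value, q25, q50, q75):
    """Assign a quartile label, mirroring the order of the assignments above (hypothetical helper)."""
    if value >= q75:  # checked first because the last assignment above wins on ties
        return ">=Q75"
    if value <= q25:
        return "<=Q25"
    if value <= q50:
        return "Q25-Q50"
    return "Q50-Q75"

toy_targets = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]  # toy values, not real scores
q25, q50, q75 = quantiles(toy_targets, n=4, method="inclusive")
labels = [quartile_label(t, q25, q50, q75) for t in toy_targets]

print(q25, q50, q75)
print(labels)
```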

Now we build a word frequency dictionary keyed by (word, quartile), so that each word has up to **four** counts, one for each region of the normal curve displayed above. The results are stored in a DataFrame and displayed below:

In [None]:
#builds dictionary for each quartile '(word,quartile): word frequency', e.g. ('young', 'Q50-Q75'): 93
freqs_quartile, words = build_freqs(df_train['excerpt'], df_train['targetQuartile'])

freq_words = []

for word in dict.fromkeys(words): #visit each distinct word once, in first-seen order
    Q25 = freqs_quartile.get((word, '<=Q25'), 0)
    Q25Q50 = freqs_quartile.get((word, 'Q25-Q50'), 0)
    Q50Q75 = freqs_quartile.get((word, 'Q50-Q75'), 0)
    Q75 = freqs_quartile.get((word, '>=Q75'), 0)
    freq_words.append([word, Q25,Q25Q50,Q50Q75,Q75])

freq_wordsDF = pd.DataFrame(freq_words, columns = ['word', 'Q25','Q25-Q50','Q50-Q75','Q75'])    
freq_wordsDF['sum'] =  freq_wordsDF.loc[:, ['Q25','Q25-Q50','Q50-Q75','Q75']].sum(axis=1)
freq_wordsDF.sort_values('sum', ascending=False,inplace=True)
freq_wordsDF.drop_duplicates(inplace=True)
freq_wordsDF.head(3)

Now everything is ready for us to analyse the most common words according to the Quartile Ranges:

In [None]:
fig, axarr = plt.subplots(1,4, figsize=(20,4))

quartiles_abbr = ['Q25','Q25-Q50','Q50-Q75','Q75']

for z, i in enumerate(quartiles_abbr):
    df = freq_wordsDF[['word',i]].sort_values(i, ascending=False) #sort a copy instead of sorting a slice in place
    ax = sns.lineplot(data=df[0:20],x="word", y=i, marker='o',ax=axarr[z])
    axarr[z].tick_params(axis='x', rotation=70)
    axarr[z].set_xlabel('The 20 Most Common Words for ' + i,fontsize = 12,weight = 'bold')
    axarr[z].set_ylabel('Count',fontsize = 13,weight = 'bold')
    axarr[z].set_title(i, fontsize = 14,weight = 'bold');
    sns.despine(top=True, right=True, left=False, bottom=False)
    
fig.tight_layout(pad=3.0)
plt.suptitle('Word Frequency per Target Quartile - Top 20 Words',fontsize=16, weight = 'bold');
plt.show()

In [None]:
cm = sns.diverging_palette(220, 20, sep=5, as_cmap=True)
df_quartile = df_train.iloc[:,2:]

Qcount_mean = df_quartile.groupby(['targetQuartile']).mean().round(3)
Qcount_min = df_quartile.groupby(['targetQuartile']).min().round(3)
Qcount_max = df_quartile.groupby(['targetQuartile']).max().round(3)

Qcount_mean= Qcount_mean.reindex(['<=Q25', 'Q25-Q50', 'Q50-Q75','>=Q75'])
Qcount_min= Qcount_min.reindex(['<=Q25', 'Q25-Q50', 'Q50-Q75','>=Q75'])
Qcount_max = Qcount_max.reindex(['<=Q25', 'Q25-Q50', 'Q50-Q75','>=Q75'])

MeanV = Qcount_mean.round(3).style.background_gradient(cmap=cm)
MinV = Qcount_min.round(3).style.background_gradient(cmap=cm)
MaxV = Qcount_max.round(3).style.background_gradient(cmap=cm)

Now, let's analyse the Mean, Minimum and Max values for each feature by quartiles:

In [None]:
MeanV

* The **char_count**, **lowercase** and **punc_count** features show interesting ranges by quartile
* The lowest quartile (<=Q25) has a higher number of characters (**char_count**) even though the **word_count** does not differ that much between quartiles (from 176 to 170)
* On average, the upper quartile (less complex texts) has a higher number of uppercase letters, probably due to shorter sentences. This trend probably also influences the higher number of paragraphs seen in the upper quartiles
* Note how the lowest and highest quartiles (below the 25th and above the 75th percentile) present a higher mean standard_error. It could indicate that raters found it harder to agree on a score for such samples, as the scores from multiple raters have a higher spread

In [None]:
MinV

In [None]:
MaxV

* Looking at the max table, char_count, uppercase and punc_count show a large difference between the lower and upper quartiles.
* The minimum values table is quite homogeneous among the quartiles; char_count and word_density display the largest differences.

As a final analysis, we apply dimensionality reduction using truncated singular value decomposition (SVD). In the context of text analysis, when applied to TF.IDF or Word Count matrices, it is also referred to as Latent Semantic Analysis.

To apply SVD:
* Preprocess the text (lower case, remove symbols, remove stop words, stem)
* Define the TF.IDF vectoriser and transform our text input into a TF.IDF matrix
* Apply the scikit-learn decomposition TruncatedSVD, transforming our TF.IDF matrix into 2 components, allowing data visualisation
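
To make the TF.IDF step more concrete, the sketch below computes the weights by hand for two toy documents. It assumes the formulas scikit-learn documents for `sublinear_tf=True` and `smooth_idf=True` (term frequency `1 + ln(tf)`, inverse document frequency `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The function name `tfidf_rows` and the toy tokens are illustrative, and the SVD step itself is not covered here.

```python
import math
from collections import Counter

def tfidf_rows(tokenised_docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(tokenised_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    rows = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        weights = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = [["cat", "sat", "mat"], ["cat", "ran"]]  # already preprocessed toy tokens
rows = tfidf_rows(toy_docs)
print(rows[0])
```

A term shared by every document ("cat") receives a lower weight than a term unique to one document ("sat"), which is the property that lets the reduced components separate passages by vocabulary.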

In [None]:
text_data = df_train['excerpt'].apply(lambda x : preprocess_sentence(x)) #preprocess text data

tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True) #define tfidf matrix parameters
text_data_tfidf = tfidf_vec.fit_transform(text_data)#transform our text data into a TF.IDF matrix

svd = decomposition.TruncatedSVD(n_components=2)#Define the SVD parameters
text_data_svd = svd.fit_transform(text_data_tfidf)#Reduce dimensionality of our TF.IDF matrix

In [None]:
plt.figure(figsize=(20,5))
sns.set(style="ticks", font_scale = 1)
ax = sns.scatterplot(data=text_data_svd, x=text_data_svd[:,0], y=text_data_svd[:,1], 
                     hue = np.ravel(df_quartile['targetQuartile']), hue_order = ['<=Q25', 'Q25-Q50', 'Q50-Q75','>=Q75']);

leg = ax.axes.get_legend()#add legend title
leg.set_title('Target Value Quartile')#add legend title
sns.despine(top=True, right=True, left=False, bottom=False)
plt.xticks(rotation=0,fontsize = 12)
ax.set_xlabel('x1',fontsize = 14,weight = 'bold')
ax.set_ylabel('x2',fontsize = 14,weight = 'bold')
plt.title('Scatter Plot by Quartile Target Value after Dimensionality Reduction', fontsize = 16,weight = 'bold');

* The values seem to be spread according to the Target value ranges, even though there is a lot of overlap between the four quartiles
* Most of the blue samples seem to be towards the left of the graph, while the red samples (with higher target values) are more concentrated towards the right of the graph
* Red and green samples, containing values higher than the median (the Q50-Q75 and >=Q75 samples), seem to be more prone to outliers than blue and orange samples

A nice 3D plot is shown next:

In [None]:
fig = px.scatter_3d(x=text_data_svd[:,0], y=text_data_svd[:,1], z=np.ravel(df_train['target']),
                    opacity=0.9)
fig.update_traces(marker=dict(size=3))#, selector=dict(type='scatter3d'))

fig.update_layout(scene = dict(xaxis_title='x1',yaxis_title='x2',zaxis_title='Target Value',
                              xaxis = dict(backgroundcolor="rgb(200, 200, 230)"),
                              yaxis = dict(backgroundcolor="rgb(230, 200,230)"),
                              zaxis = dict(backgroundcolor="rgb(230, 230,200)")))

fig.update_traces(hovertemplate='x1: %{x} <br>x2: %{y} <br>Target Value: %{z}') #
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))

# 3. Model

As this is my first attempt at this competition, the goal is to keep it simple, understand the results and find where the model can be improved. After a few initial tests, the following strategy was found satisfactory:

* Preprocess the text data and prepare the additional Features to be used by ML model
* Separate into 80-20 training and test ratio
* Use K-Fold for training and validation of model parameters. Five fold keeps a good ratio of training and validation sets for each fold
* Create a TF.IDF matrix. It has shown better results when compared to Word Count 
* Applying SVD has improved the Test set results. An analysis is shown on how the number of SVD components was selected
* Use two of my favourite algorithms, LGBM and ANN, to predict the Target
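
As a small illustration of the five-fold point in the list above, the sketch below generates contiguous fold indices with the standard library and shows that each fold trains on 80% of the rows and validates on the remaining 20%. The function `kfold_indices` is a hypothetical stand-in; the notebook itself relies on scikit-learn's `KFold` with shuffling, which this sketch does not reproduce.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread any remainder over the first folds
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```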



>Data Preprocessing, Training and Test sets definition

In [None]:
df_train_ML = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
df_test_ML = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

df_train_ML['text'] = df_train['excerpt'].apply(lambda x : preprocess_sentence(x))
df_test_ML['text'] = df_test['excerpt'].apply(lambda x : preprocess_sentence(x))

features_train = df_train_ML.text.values
target_train = df_train_ML['target']

features_test = df_test_ML.text.values

# Create First Train and Test sets
x_train_text, x_test_text, y_train, y_test = train_test_split(features_train, target_train, test_size=0.20,random_state=123)

print ("Training set size", x_train_text.shape[0])
print ("Test set size",x_test_text.shape[0])

#Define K-Fold Validation
kf = KFold(n_splits=5, shuffle = True, random_state = 123)

Here the new features we created at the beginning of the Notebook are being extracted and divided into Training and Test sets:

In [None]:
features_train_params = df_train.iloc[:, 4:-1]

# Create First Train and Test sets
x_train_params, x_test_params, y_train, y_test = train_test_split(features_train_params, target_train, test_size=0.20,random_state=123)

sc_X = MinMaxScaler()

x_train_transform = sc_X.fit_transform(x_train_params)
x_test_transform = sc_X.transform(x_test_params)
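
The scaling pattern in the cell above (fit on the training rows, then only transform the test rows) can be sketched without any library. The helpers `fit_min_max` and `apply_min_max` and the toy rows are assumptions for illustration; they are not scikit-learn's implementation.

```python
def fit_min_max(rows):
    """Learn per-column minimum and maximum from the training rows only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned bounds; constant columns map to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

train_rows = [[100, 2], [200, 4], [300, 6]]  # toy feature rows
test_rows = [[250, 8]]

mins, maxs = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)
print(train_scaled, test_scaled)
```

Because the bounds come from the training rows, a test value can fall outside the [0, 1] range (1.5 in the second column here). That is expected and keeps the test features on the same scale the model saw during training.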

>Defining the TF.IDF Matrix

In [None]:
# Fitting TFIDF to training and test sets
tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True) #define tfidf matrix parameters
x_train_tfidf = tfidf_vec.fit_transform(x_train_text)
x_test_tfidf = tfidf_vec.transform(x_test_text)

final_test_tfidf = tfidf_vec.transform(features_test)

> Code to run K-Fold and measure model performance

In [None]:
def FitANN(X_train, X_val,Y_train, Y_val):
    keras.backend.clear_session()
    history = ANNmodel.fit(X_train,Y_train, validation_data=(X_val,Y_val),epochs=epochs,callbacks=[callbacks_list],verbose=0)
    return ANNmodel

def FitLGBM(X_train, X_val,Y_train, Y_val):
    LGBMModel.fit(X_train, Y_train, eval_metric='rmse', eval_set=[(X_val, Y_val)],verbose=0)
    return LGBMModel

def KFoldSets(x_train,x_test,MLmodel):
    RMSE = []
    for train_index, test_index in kf.split(x_train):
            X_train, X_val = x_train[train_index], x_train[test_index]
            Y_train, Y_val = y_train.iloc[train_index], y_train.iloc[test_index]
            if MLmodel == 'ANN':
                model = FitANN(X_train, X_val,Y_train, Y_val)
            if MLmodel =='LGBM':
                model = FitLGBM(X_train, X_val,Y_train, Y_val)
            predictions = model.predict(X_val)
            RMSE.append(np.sqrt(mean_squared_error(Y_val, predictions)))
    test_score = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
    mean_res = np.mean(RMSE)
    std_dev = np.std(RMSE)
    print(MLmodel)
    print("K-Fold Validation RMSE: %.3f +/- %.4f \n Test Set RMSE: %.3f" % (mean_res,std_dev,test_score))
    return model 

To obtain the optimal number of SVD components, a simple loop was written to iterate over the number of components. The optimal value is the one that achieves the best RMSE on the test set while displaying a similar RMSE on the K-Fold validation sets. Similar RMSE on test and validation sets gives some assurance that the model is not overfitting and that the hyperparameters are properly defined.
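
One way to express this selection rule in code is sketched below: compute the RMSE, discard candidates whose validation and test RMSE differ by more than a tolerance, and keep the candidate with the lowest test RMSE. The function `pick_n_components`, the `max_gap` tolerance and the numbers in `toy_results` are all hypothetical, made up to show the mechanics; they are not results from this notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pick_n_components(results, max_gap):
    """results: list of (n_components, validation_rmse, test_rmse) tuples."""
    eligible = [r for r in results if abs(r[1] - r[2]) <= max_gap]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r[2])[0]  # lowest test RMSE among eligible candidates

# Made-up illustrative numbers only
toy_results = [(5, 0.90, 0.92), (20, 0.80, 0.82), (60, 0.60, 0.81)]
best = pick_n_components(toy_results, max_gap=0.05)
print(best)
```

In this toy case the 60-component candidate has the lowest test RMSE but is rejected because its two scores are far apart, which is the overfitting signal described above.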

>LGBM definition

In [None]:
LGBMModel = lgbm.LGBMRegressor(objective='regression',
                               metric = 'rmse',
                               verbose=-1, 
                               learning_rate=0.019, 
                               max_depth=360, 
                               num_leaves=680, 
                               early_stopping_round = 50, 
                               n_estimators = 10000,
                               reg_alpha = 0.021314,
                               reg_lambda = 0.00856,
                                force_col_wise=True)

 >code to define the number of SVD components

In [None]:
test = []
train = []
i_list = list(range(5,100,5))
for i in i_list:# Apply SVD and lgbm
    svd = decomposition.TruncatedSVD(n_components=i)
    xtrain_svd = svd.fit_transform(x_train_tfidf)
    xtest_svd = svd.transform(x_test_tfidf)
    LGBMModel.fit(xtrain_svd, y_train, eval_metric='rmse', eval_set=[(xtest_svd, y_test)],verbose=0)    
    predictions_train = LGBMModel.predict(xtrain_svd)
    predictions_test = LGBMModel.predict(xtest_svd)
    train_rmse = np.sqrt(mean_squared_error(y_train, predictions_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, predictions_test))
    
    train.append(train_rmse)
    test.append(test_rmse)

The graph below shows how the RMSE on the training split and on the held-out test split changes with the number of SVD components.

In [None]:
plt.figure(figsize=(20,5))
sns.despine(top=True, right=True, left=False, bottom=False)

ax = sns.lineplot(x=i_list, y=train,markers=True, dashes=False,color='red',marker="o",label="Train") #RMSE on the data the model was fitted on
ax = sns.lineplot(x=i_list, y=test,markers=True, dashes=False,color='green',marker="o", label="Test")
ax.axhline(min(test), ls='--', c = 'green')
ax.text(1,min(test),"Min.RMSE Test Result",
            bbox=dict(facecolor='white', edgecolor='none'))

ax.set_ylabel('RMSE')
ax.set_xlabel('Number of SVD Components')

plt.title('Train and Test Set RMSE by Number of SVD Components',fontsize=16, weight = 'bold');
plt.xticks(np.arange(min(i_list), max(i_list)+1, 5.0))  
plt.show()

>Defining the final SVD parameter

In [None]:
# Apply SVD
svd = decomposition.TruncatedSVD(n_components=20)
xtrain_svd = svd.fit_transform(x_train_tfidf)
xtest_svd = svd.transform(x_test_tfidf)

final_test_svd = svd.transform(final_test_tfidf)

Merge the SVD features and the New features we created into a single training set and test set:

In [None]:
x_train_final = np.concatenate([xtrain_svd, x_train_transform], axis=1)
x_test_final = np.concatenate([xtest_svd, x_test_transform], axis=1)

Building the ANN model:

In [None]:
ANNmodel = Sequential()
ANNmodel.add(Dense(20, input_dim=x_train_final.shape[1],activation = 'tanh'))
#ANNmodel.add(Dense(50, activation='tanh'))
ANNmodel.add(Dense(1, activation='linear'))
optimiser = Adam(learning_rate=0.02, beta_1=0.9, beta_2=0.999, epsilon=1e-08) #lr is a deprecated alias of learning_rate; decay=0.0 was the default
ANNmodel.compile(loss='mean_squared_error', optimizer=optimiser, metrics = [RootMeanSquaredError()])

early_stopping_monitor = EarlyStopping(patience=50,monitor='val_loss', mode = 'min',verbose=0)
learning_rate_reduction = ReduceLROnPlateau(monitor='val_loss', patience=20, verbose=0, factor=0.5, min_lr=0.00001)
callbacks_list = [learning_rate_reduction,early_stopping_monitor]
       
#Model Parameters
epochs = 100
batch_size = 16

# 4. Results and Conclusion 

For the final results and submission, we retrain the final model on the whole training set and submit. Our partial results for each algorithm are the following:

In [None]:
LGBM_Model = KFoldSets(x_train_final,x_test_final,'LGBM')
ANN_Model = KFoldSets(x_train_final,x_test_final,'ANN')

Looking at the current Leaderboard we are far from the competition Top Results. For now, the ANN model is our best attempt. A Scatter Plot where the Predicted and Actual Target values are plotted can help us understand where our current model is struggling:

In [None]:
y_pred = ANN_Model.predict(x_test_final)

In [None]:
#plot settings
fig, ax = plt.subplots(figsize=(10, 10))
#ax.set_yscale('log')
#ax.set_xscale('log')
ax.scatter(y_test, y_pred, color='red')
ax.plot(y_test, y_test, color='blue')  
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('Actual Target Values')
plt.ylabel('Predicted Target Values')
plt.title('Target - Predicted and Actual Results for ANN model');

Ideally, our scatter plot should not be as spread out as it currently is. The closer the points are to the blue line, the closer the model's predictions are to the actual target values.

For example, looking at the extreme left of the x-axis, smaller target values are being predicted as higher values, while positive values, e.g. towards 0 and 1 on the x-axis, are being predicted as smaller values. We can also see several outliers at the boundary positions of the scatter plot. It will be interesting to see how this scatter improves as new ideas and improved models are used.
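
The pattern described above can also be quantified by averaging the residual (predicted minus actual) within ranges of the actual target. The sketch below does this with the standard library; the helper `mean_residual_by_bin`, the bin edges and the toy values are assumptions for illustration and are not predictions from the model.

```python
def mean_residual_by_bin(actual, predicted, edges):
    """Average (predicted - actual) within consecutive ranges of the actual value."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for a, p in zip(actual, predicted):
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open [lo, hi), except the last one which includes its upper edge
            if edges[i] <= a < edges[i + 1] or (is_last and a == edges[i + 1]):
                sums[i] += p - a
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy values only, chosen to mimic the shape described in the text
toy_actual = [-3.0, -2.5, -1.0, -0.5, 0.5, 1.0]
toy_predicted = [-2.0, -2.0, -1.0, -0.5, 0.0, 0.5]
bias = mean_residual_by_bin(toy_actual, toy_predicted, edges=[-4, -2, 0, 2])
print(bias)
```

A positive mean residual in the lowest bin and a negative one in the highest bin would confirm that predictions are being pulled towards the centre of the target range.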

Now let's train the ANN on the whole training set and submit to the competition:

In [None]:
X_train_text = tfidf_vec.fit_transform(features_train)
X_train_text = svd.fit_transform(X_train_text)

X_train_Feats = sc_X.fit_transform(features_train_params)
X_train = np.concatenate([X_train_text, X_train_Feats], axis=1)

ANNmodel.fit(X_train,target_train, validation_split=.2,epochs=epochs,callbacks=[callbacks_list],verbose=0)

>Preparing the Test Set as we did with the Training Set

In [None]:
df_test['word_count'] = df_test['excerpt'].apply(lambda x : len(x.split()))
df_test['char_count'] = df_test['excerpt'].apply(lambda x : len(x.replace(" ","")))
df_test['word_density'] = df_test['word_count'] / (df_test['char_count'] + 1)
df_test['uppercase'] = df_test['excerpt'].str.findall(r'[A-Z]').str.len()+1
df_test['lowercase'] = df_test['excerpt'].str.findall(r'[a-z]').str.len()+1
df_test['upplowRatio'] = df_test['uppercase'] / (df_test['lowercase'] + 1)
df_test['punc_count'] = df_test['excerpt'].apply(lambda x : len([a for a in x if a in punctuations])) #count punctuation in the test excerpts, not the training ones
df_test['paragraphs'] = df_test['excerpt'].apply(lambda x : (x.count("\n")))
df_test['sentLength'] = df_test['word_count'] / (df_test['punc_count'] + 1)
df_test['commonWords'] = count_words(df_test['excerpt'], common_words_stem)
df_test['formalWords'] = count_words(df_test['excerpt'], formalWords)
df_test['repeatedWords'] = Count_Repeated(df_test['excerpt'])

X_test_text = tfidf_vec.transform(features_test)
X_test_text = svd.transform(X_test_text)

features_test_params = df_test.iloc[:, 2:]
X_test_Feats = sc_X.transform(features_test_params) #reuse the scaler fitted on the training features
X_test = np.concatenate([X_test_text, X_test_Feats], axis=1)


In [None]:
df_test['target'] = ANNmodel.predict(X_test)
df_test[['id','target']].to_csv('submission.csv', index=False)

## Thanks for reading it this far. Make sure to upvote if you found it helpful