# CommonLit Readability Prize 📚

![Green Blue Paint Beauty Skincare Facebook Cover](https://user-images.githubusercontent.com/66208179/118814977-69e4e080-b8b9-11eb-9d35-745b530d2390.png)

# 1. Introduction

Identify the appropriate reading level of a text based on one feature: text 👓

Data includes readers from a wide variety of age groups and a large collection of texts taken from various domains. 

🌱```id:``` unique ID for excerpt

🌱```url_legal:``` URL of source - this is blank in the test set.

🌱```license:``` license of source material - this is blank in the test set.

🌱```excerpt:``` text to predict reading ease of

🌱```target:``` reading ease

🌱```standard_error:``` measure of spread of scores among multiple raters for each excerpt. Not included for test data.

## Evaluation:

The models will be evaluated by **RMSE (Root Mean Square Error)**. RMSE measures the standard deviation of the residuals (the prediction errors).

**The lower the RMSE, the more accurate the model**. RMSE is a useful metric for comparing the accuracy of different regression models.

<img width="233" alt="Screen Shot 2021-05-15 at 10 10 43 AM" src="https://user-images.githubusercontent.com/66208179/118364269-7c50d880-b5a0-11eb-864c-036b5ad9c65c.png">
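
As a minimal sketch of the metric (the function name `rmse` and the sample numbers are illustrative assumptions, not competition data), RMSE can be computed with the standard library alone:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up targets and predictions, for illustration only
y_true = [-1.0, 0.0, 1.0, 2.0]
y_pred = [-1.0, 0.0, 1.0, 0.0]
print(rmse(y_true, y_pred))  # errors are 0, 0, 0, 2 -> sqrt(4 / 4) = 1.0
```

Because the errors are squared before averaging, one large miss costs more than several small ones.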

```Winning models will be sure to incorporate text cohesion and semantics.```

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install langdetect

In [None]:
!pip install pandas-profiling 

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
import math
import seaborn as sns
import nltk
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from langdetect import detect_langs

# 1. Understanding the Data

We will first try to understand the training set. At this point, to avoid data snooping, just load the test dataframe and never look at it 🤓

In [None]:
df_train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
df_test = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
df_train.head()

As we can see, there are some missing values!

I ❤️ pandas profiling! It is a very easy library to use (hence, there is just one line over there!) and it returns a detailed analysis of your current data.

In [None]:
from pandas_profiling import ProfileReport
ProfileReport(df_train)

Based on these results, we see that there is **no correlation** between target and standard error.

<img width="520" alt="Screen Shot 2021-05-15 at 7 55 02 PM" src="https://user-images.githubusercontent.com/66208179/118381301-4fc5ac80-b5f2-11eb-8cfa-8410d63bcda9.png">

In [None]:
df_train.info()

There are missing values and categorical values (it is text data after all 🧚).

<img width="726" alt="Screen Shot 2021-05-15 at 7 55 16 PM" src="https://user-images.githubusercontent.com/66208179/118381313-666c0380-b5f2-11eb-88ae-5d4cb3f3c5ff.png">

In [None]:
df_train.describe()

# 2. What is the Standard Error?

✅ The ```mean``` of target difficulty is -0.9.

✅ ```Min``` target difficulty is -3.676268.

✅ ```Max``` target difficulty is 1.711390.

I wasn't sure what exactly the standard error can show us, so I used the following resource to understand it: https://s4be.cochrane.org/blog/2018/09/26/a-beginners-guide-to-standard-deviation-and-standard-error/


⭐️ The ```standard error``` tells you *how accurate the mean of any given sample from that population is likely to be compared to the true population mean*. When the standard error increases, i.e. the means are more spread out, it becomes more likely that any given mean is an inaccurate representation of the true population mean.
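
To make the definition concrete, here is a small sketch using the textbook formula (sample standard deviation divided by the square root of the sample size). The ratings are hypothetical, and this is not necessarily how the competition computed its `standard_error` column:

```python
import math
import statistics

def standard_error(ratings):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(ratings) / math.sqrt(len(ratings))

# hypothetical ratings given by several raters to the same excerpt;
# both groups have the same mean (-1.0) but a different spread
agreeing_raters = [-1.0, -0.9, -1.1, -1.0]
disagreeing_raters = [-2.0, 0.0, -1.5, -0.5]

print(standard_error(agreeing_raters))
print(standard_error(disagreeing_raters))
```

The second group prints the larger value: the more the raters disagree, the less we can trust their mean.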

Our target is approximately **normally distributed**, and most of the standard error values are around 0.5.

The standard error shows that the raters did not fully agree on their ratings for the excerpts. Its distribution is ```skewed left```, meaning the longer tail is on the left and most values are clustered at the higher end (around 0.5 and above).

In [None]:
# create a copy of the train df to work with, so we will have the original data
df_tmp = df_train.copy()

One of the things I wasn't aware of before (yes, I still make novice mistakes and it is okay, we are all here to learn ⭐️) is that I thought saying ```df_tmp = df_train``` is fine. But NO! It doesn't create a copy at all: both names point to the same DataFrame, so any change you make to ```df_train``` will be reflected in ```df_tmp``` in the future. However, ```.copy()``` creates a *deep copy* by default, and that's what we want!
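
A tiny sketch of the difference, using a made-up two-row DataFrame (the names `df_original`, `alias`, and `independent` are illustrative):

```python
import pandas as pd

df_original = pd.DataFrame({"target": [0.5, -1.2]})

alias = df_original               # plain assignment: a second name for the same object
independent = df_original.copy()  # .copy() is deep by default: separate data

# modify the original after both variables were created
df_original.loc[0, "target"] = 99.0

print(alias is df_original)          # True
print(alias.loc[0, "target"])        # 99.0
print(independent.loc[0, "target"])  # 0.5
```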

In [None]:
# let's see the whole excerpt
pd.set_option('display.max_colwidth', None)
# to take it back: pd.reset_option("display.max_colwidth")

In [None]:
# the excerpt with the greatest target value (easiest to read)
df_tmp[df_tmp.target == df_tmp.target.max()].excerpt

In [None]:
# the excerpt with the lowest target value (hardest to read)
df_tmp[df_tmp.target == df_tmp.target.min()].excerpt

In [None]:
pd.reset_option("display.max_colwidth")

Let's dive deeper into the relation between target and standard error. So far, we found that:
- there is no correlation between target and standard error
- our target has a normal distribution 
<img width="300" alt="Screen Shot 2021-05-17 at 8 00 29 AM" src="https://user-images.githubusercontent.com/66208179/118485190-c82a8b80-b720-11eb-868b-6636cff0d4f3.png">
- the standard error is skewed left
<img width="300" alt="Screen Shot 2021-05-17 at 8 00 15 AM" src="https://user-images.githubusercontent.com/66208179/118485198-c95bb880-b720-11eb-9398-8bed525880e2.png">
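
The direction of a skew can also be checked numerically instead of by eye. Below is a rough sketch with hypothetical values (most near 0.5, one very small); `skewness` is an illustrative helper that uses the plain third standardized moment. On the real data, `df_tmp["standard_error"].skew()` should give the same sign, though pandas applies a bias correction, so the number would differ slightly.

```python
import statistics

def skewness(values):
    """Third standardized moment: mean((x - mean)^3) / std^3."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (len(values) * std ** 3)

# hypothetical standard errors, for illustration only
errors = [0.0, 0.45, 0.48, 0.50, 0.50, 0.52, 0.55]
print(skewness(errors) < 0)  # True: negative skew means the long tail is on the left
```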




# 3. Features

Let's see if the length of excerpts has any correlation with difficulty.

In [None]:
from nltk import word_tokenize

word_token = [word_tokenize(excerpt) for excerpt in df_tmp.excerpt]

# number of tokens in each excerpt
df_tmp["n_tokens"] = [len(tokens) for tokens in word_token]
df_tmp.head(2)

In [None]:
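# correlate only the numeric columns (recent pandas versions raise an error on text columns)
df_tmp.select_dtypes("number").corr()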

There is no correlation between ```n_tokens``` and other numerical fields.

# Word Cloud

I will separate excerpts based on three difficulties and show the most common words in the wordcloud to understand if there are any patterns.

In [None]:
# remove punctuation and lowercase everything
df_tmp['excerpt'] = df_tmp['excerpt'].str.replace(r'[^\w\s]', '', regex=True)
df_tmp['excerpt'] = df_tmp['excerpt'].str.lower()
df_tmp.head(1)

In [None]:
# sort the values based on target 
df_tmp.sort_values("target", ascending=True, inplace = True)
df_tmp = df_tmp.reset_index(drop=True)

# and divide it into three different groups (basically creating three different difficulty groups)
each_group = len(df_tmp) // 3
df_tmp["range"] = "easy"

# rows are sorted from the lowest target (hardest) to the highest (easiest);
# assigning through .loc avoids chained assignment on a temporary object
df_tmp.loc[df_tmp.index[:each_group], "range"] = "hard"
df_tmp.loc[df_tmp.index[each_group:2 * each_group], "range"] = "medium"

In [None]:
df_tmp.range.value_counts()
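
As an alternative sketch, `pd.qcut` can build equal-sized bins straight from the values without sorting first. The scores below are made up, and this is not what the cell above does: `qcut` places its edges at quantiles, so with ties or a row count that is not divisible by three the group sizes can differ from the positional split.

```python
import pandas as pd

# made-up target scores, for illustration only
targets = pd.Series([-3.0, -2.5, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])

# lowest third -> "hard", middle third -> "medium", highest third -> "easy"
difficulty = pd.qcut(targets, q=3, labels=["hard", "medium", "easy"])
print(difficulty.value_counts().to_dict())
```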

In [None]:
# create dataframes
hard = df_tmp[0:each_group]
medium = df_tmp[each_group: 2 * each_group]
easy = df_tmp[2 * each_group:]  # includes the leftover rows when len(df_tmp) is not divisible by 3

In [None]:
# create texts
text_hard = " ".join(excerpt for excerpt in hard.excerpt)
text_medium = " ".join(excerpt for excerpt in medium.excerpt)
text_easy = " ".join(excerpt for excerpt in easy.excerpt)

In [None]:
# set stopwords

stopwords = set(STOPWORDS)
#stopwords.update([])

cloud_1 = WordCloud(stopwords=stopwords, background_color="white").generate(text_hard)
cloud_2 = WordCloud(stopwords=stopwords, background_color="white").generate(text_medium)
cloud_3 = WordCloud(stopwords=stopwords, background_color="white").generate(text_easy)

# plot the three clouds side by side, labelled by difficulty group
clouds = [("hard", cloud_1), ("medium", cloud_2), ("easy", cloud_3)]
fig, axes = plt.subplots(1, 3, figsize=(10, 10))

for ax, (label, cloud) in zip(axes, clouds):
    ax.imshow(cloud)
    ax.set_title("Word Cloud: " + label)
fig.tight_layout()
plt.show()

It seems that all three groups share the same frequent words: one, little, two, see, people, etc., so the word clouds alone do not separate the difficulty levels.

In [None]:
sns.catplot(x="target", y="range", data=df_tmp, kind="bar");

# 4. Sentiment Analysis

At this point, we are aware that there might be some bias from people who rate the excerpts since it is all subjective. I will implement a sentiment analysis to understand if that bias might be based on emotions.

In [None]:
from textblob import TextBlob 

In [None]:
sentiment = []

for i in df_tmp.excerpt:
    text = TextBlob(i)
    sentiment.append(text.sentiment)

In [None]:
sentiment[0]

In [None]:
# text.sentiment is a (polarity, subjectivity) named tuple; scale polarity to [-100, 100]
sent = [s.polarity * 100 for s in sentiment]
df_tmp["sentiment"] = sent

In [None]:
df_tmp["sentiment"].hist();

In [None]:
sns.set_palette("RdBu")
sns.relplot(x="target", y="sentiment", kind="scatter", hue="standard_error", data=df_tmp, height=8.27, aspect=11.7/8.27);

⭐️**Please comment and let me know if you think there can be further EDA on the relation of emotions and target.**⭐️

## Paragraph Length

In [None]:
# split() with no argument splits on any whitespace, so line breaks do not glue words together
excerpt_length = df_tmp["excerpt"].apply(lambda x: len(x.split()))
df_tmp["excerpt_len"] = excerpt_length
df_tmp["excerpt_len"].max(), df_tmp["excerpt_len"].min()

In [None]:
df_tmp.select_dtypes("number").corr()

No correlation between ```excerpt_len``` and any attribute.

In [None]:
df_tmp.excerpt_len.hist(bins=25);

## Syllable Count

In [None]:
# from https://www.kaggle.com/jitshil143/submission-score-0-62
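def syllable_count(word):
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    # count every vowel that starts a new vowel group
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    # a trailing "e" is usually silent, so subtract it once (not once per vowel group)
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count += 1
    return count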

In [None]:
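# nof syllables: syllable_count expects a single word, so sum the estimates over the words of each excerpt
df_tmp['nof_syllables'] = df_tmp['excerpt'].apply(lambda s: sum(syllable_count(w) for w in s.split()))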

In [None]:
sns.relplot(x="nof_syllables", y="target", hue="standard_error", data=df_tmp, kind="scatter");

In [None]:
df_tmp.select_dtypes("number").corr()

# 5. Word Embeddings

For word embeddings, I've used [this article](https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/).

In [None]:
# tokenize each word in an excerpt

tokenized_sent = []
for s in df_tmp.excerpt:
    tokenized_sent.append(word_tokenize(s.lower()))
tokenized_sent[0]

In [None]:
# define the cosine similarity between two vectors
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
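
A quick sanity check of what the function returns, using hand-picked 2-D vectors instead of real embeddings (the vectors are made up; the function is repeated so the snippet runs on its own):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# same direction, perpendicular, and opposite direction
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)  # approximately 1, 0 and -1
```

Only the direction matters, not the length: `[1, 2]` and `[2, 4]` count as identical. That is why cosine similarity is a common choice for comparing embeddings.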

There are many exciting word embedding styles in this article, but I've always wanted to learn BERT so I will go ahead and try it.

## BERT

This is my first time learning/using BERT. I've used different notebooks from Kaggle to understand how BERT works and how it can be applied to our data in this competition.

<img width="700" alt="Screen Shot 2021-05-18 at 10 30 18 PM" src="https://user-images.githubusercontent.com/66208179/118747873-80604d00-b863-11eb-8499-7ef4bd470a00.png">

In [None]:
! pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

# the model must be loaded, because it is used for the query below
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')

# encoding every excerpt takes very long to run, so instead I will examine this on a couple of excerpts
#excerpt_embeddings = bert_model.encode(df_tmp.excerpt.tolist())

# test query
query = "I had pizza and pasta"
query_vec = bert_model.encode([query])[0]

#for excerpt in df_tmp["excerpt"]:
#   similarity = cosine(query_vec, bert_model.encode([excerpt])[0])

In [None]:
# look at only the first two excerpts (.loc[0:1] includes both ends)

for excerpt in df_tmp.loc[0:1]["excerpt"].values:
    similarity = cosine(query_vec, bert_model.encode([excerpt])[0])
    print(excerpt, similarity)

# Modeling

In conclusion, this is a regression problem: we need to predict the target (reading ease) from a given text.

We have text data, so it is important to preprocess it, since many built-in functions expect their input in a consistent format.

We've already lower-cased each excerpt and removed punctuation to produce the word clouds.

Some of my code is from here and it is a great notebook for text preprocessing: https://www.kaggle.com/manishkc06/text-pre-processing-data-wrangling

### Remove Accented Characters

In [None]:
# remove accented characters
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [None]:
# apply the function to every excerpt at once (avoids chained assignment through .loc)
df_tmp["excerpt"] = df_tmp["excerpt"].apply(remove_accented_chars)

### Remove Special Characters

In [None]:
import re

def remove_special_characters(text, remove_digits=True):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

# apply the function to every excerpt at once (avoids chained assignment through .loc)
df_tmp["excerpt"] = df_tmp["excerpt"].apply(remove_special_characters)
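
To see what these two steps do to a concrete string, here is a small worked example. The helpers `strip_accents` and `keep_letters_and_whitespace` are compact stand-ins for the functions above (with `remove_digits=True`), and the sample sentence is made up:

```python
import re
import unicodedata

def strip_accents(text):
    # decompose accented letters, then drop the non-ASCII combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def keep_letters_and_whitespace(text):
    # remove everything except ASCII letters and whitespace (digits are removed too)
    return re.sub(r"[^a-zA-Z\s]", "", text)

sample = "Café déjà vu, 100%!"
step_1 = strip_accents(sample)
step_2 = keep_letters_and_whitespace(step_1)

print(repr(step_1))  # 'Cafe deja vu, 100%!'
print(repr(step_2))  # 'Cafe deja vu '
```

Note that the number disappears entirely and a trailing space is left behind, which may matter for a readability task, since numbers can be part of what makes a text easy or hard.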

**Removing StopWords**: I was about to remove the stopwords as well, until I saw [this post](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240064).

# To Be Continued :)