# CommonLit Readability Prize

![](https://i1.wp.com/www.josephineelia.com/wp-content/uploads/2016/01/the_art_of_reading.jpg?resize=1080%2C675&ssl=1)

# Overview

This notebook is an attempt at data exploration, aimed at understanding what makes a document difficult to read. Apart from data exploration, I also tried topic modelling to see whether any specific topics are difficult to read.

Several aspects make a document difficult to read. The most important of them are:
1. Parts of speech
2. Rarity of words

## Parts of speech
- Data analysis of the POS tags of the excerpts revealed how significant POS tags are for readability
- For instance, consider these sentences
    - "I went to shop to buy soap"
    - "I went to shop to buy nivea soap"
    - "I quickly went to shop to buy white nivea soap"
- Although the three sentences above say the same thing, adding more POS tags packs more detail into a single sentence and makes it more difficult to read
- In the same spirit, we are going to look at various patterns in POS tags and analyse what makes a sentence less readable.

## Rarity of word
- Less frequently used words are more difficult to read, so they play an important role in determining the readability of excerpts
- Consider the words __"puny"__ and __"tiny"__. Although these two words have the same meaning, using __tiny__ makes a sentence easier to read than using __puny__

## Other aspects of text
- Apart from these, several other attributes of the text also contribute to readability
- Features like
    - Average syllable count per word
    - Word count
    - Sentence count
    - Unique word count
    - Punctuation count
- Understanding how these attributes correlate with readability will help us understand the given data better
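
As a rough illustration of the surface features listed above, here is a minimal sketch that computes them with regular expressions only. The function name `basic_text_features` and the regex-based word and sentence splitting are assumptions made for this example; the notebook itself relies on spaCy tokens later on, so the exact counts will differ.

```python
import re
import string

def basic_text_features(text):
    """Return simple surface features of a text (regex-based approximation)."""
    # Words: runs of letters, optionally with an internal apostrophe
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Sentences: split on ., ! or ? followed by whitespace or the end of the text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Punctuation: every character that appears in string.punctuation
    punctuation = [ch for ch in text if ch in string.punctuation]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "unique_word_count": len({w.lower() for w in words}),
        "punctuation_count": len(punctuation),
    }

features = basic_text_features("I went to the shop. I went to buy soap!")
print(features)
```

Abbreviations such as "Mr." would be counted as sentence ends by this simple splitter, which is one reason a proper tokenizer is preferable on real data.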



__So without further ado, let's jump into data exploration__




In [None]:
!pip install textstat

# Import Libraries 📕

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import spacy 
from spacy.tokenizer import Tokenizer
from wordcloud import WordCloud
import textstat
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

class color:
    BOLD = '\033[1m' + '\033[93m'
    END = '\033[0m'

## Read and have a quick look at the data 👀

In [None]:
train_data = pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")
train_data.head()

In [None]:
train_data.isnull().sum()

In [None]:
train_data.describe()

In [None]:
train_data.excerpt[0]

### Convert the readability score (target) column into bins
- Let's break the readability scores into groups for simplicity
    1. very_difficult
    2. difficult
    3. medium
    4. easy
    5. very_easy
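
To make the idea of equal-sized groups concrete, here is a small pure-Python sketch of rank-based binning into five groups. The helper `quantile_bins` and the toy scores are hypothetical; `pd.qcut` computes quantile edges and may treat ties and boundary values differently.

```python
def quantile_bins(values, labels):
    """Assign each value to one of len(labels) equal-frequency bins by rank."""
    n = len(values)
    # Indices sorted by value, ascending: the lowest scores get the first label
    order = sorted(range(n), key=lambda i: values[i])
    bins = [None] * n
    for rank, idx in enumerate(order):
        bins[idx] = labels[rank * len(labels) // n]
    return bins

labels = ["very_difficult", "difficult", "medium", "easy", "very_easy"]
# Toy readability scores, for illustration only
scores = [-3.1, 0.4, -1.0, -0.2, 1.2, -2.0, -0.6, 0.9, -1.5, 0.1]
binned = quantile_bins(scores, labels)
for score, label in zip(scores, binned):
    print(score, label)
```

With ten scores and five labels, each bin receives exactly two excerpts, and the lowest targets land in `very_difficult`.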

In [None]:
train_data['target_bins'] = pd.qcut(train_data['target'],
                              q=[0, .2, .4, .6, .8, 1],
                              labels=["very_difficult","difficult","medium","easy","very_easy"])
g = sns.displot(train_data, x="target", hue="target_bins", kind="kde", fill=True)
plt.title("Target bins Distribution", fontsize = 15)
g.fig.set_size_inches(15,7)

## Preprocess and store an enriched form of the text using spaCy


In [None]:
nlp = spacy.load("en_core_web_sm")

excerpt_tokens = []
for doc in nlp.pipe(train_data["excerpt"][:], disable = ["ner"]):
    excerpt_tokens.append(doc)

train_data["parsed_excerpt"] = excerpt_tokens  

## Do difficult words make it difficult to read?

The data frame below holds the frequency of different words
> __Assumption__ : Difficult words make it difficult to read the excerpts

In [None]:
most_freq_words = pd.read_csv("/kaggle/input/english-word-frequency/unigram_freq.csv")
most_freq_words.head()

In [None]:
rank_tuples = list(most_freq_words.apply(lambda x : (x.word,x.name), axis = 1))
frequent_words_rank_dict = dict(rank_tuples)

In [None]:
def is_useful_word(word):
    if word.is_punct == False and word.is_stop != True and word.text !="\n":
        return True
    return False


 

# keep only the useful (non-punctuation, non-stop) words of each excerpt
excerpt_processed_words = []
for i,sentence in enumerate(excerpt_tokens):
    excerpt_processed_words.append([])
    for word in sentence:        
        if is_useful_word(word):
            excerpt_processed_words[i].append(word.text)


In [None]:
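sentence_vocab_score = []
for sentence in excerpt_processed_words:
    temp_sum = 0
    for word in sentence:
        # rank = row index in the frequency table; words missing from the table fall back to 0
        temp_sum+=frequent_words_rank_dict.get(word.lower(), 0)
    # guard against excerpts that contain no useful words
    sentence_vocab_score.append(temp_sum/len(sentence) if sentence else 0)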
        
train_data["vocab_score"] = sentence_vocab_score

Here we can observe a correlation of __-0.29__ between the vocab score and the target value.

<div class="alert success-alert">
    
    This suggests that rarer words make an excerpt tougher to read
    
</div>
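
To illustrate what the vocab score measures, here is a minimal sketch on toy data. The ranks in `toy_ranks` are invented for illustration, and the `unknown_rank` parameter is an assumption of this example: it lets you choose how words missing from the frequency table are handled, which the cell above does by falling back to 0.

```python
def vocab_score(words, rank_by_word, unknown_rank=None):
    """Mean frequency rank of the words (a higher rank means a rarer word)."""
    ranks = []
    for word in words:
        rank = rank_by_word.get(word.lower(), unknown_rank)
        if rank is not None:  # skip unknown words unless a fallback rank is given
            ranks.append(rank)
    return sum(ranks) / len(ranks) if ranks else 0.0

# Hypothetical ranks, for illustration only (0 = most frequent word)
toy_ranks = {"the": 0, "dog": 1500, "is": 7, "tiny": 4200, "puny": 61000}

easy_score = vocab_score(["The", "dog", "is", "tiny"], toy_ranks)
hard_score = vocab_score(["The", "dog", "is", "puny"], toy_ranks)
print(easy_score, hard_score)
```

Swapping "tiny" for "puny" raises the mean rank, so a higher score goes with a lower (harder) target, which is consistent with a negative correlation.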

In [None]:
sorted_train_data = train_data.sort_values("target", axis = 0)
sorted_train_data.plot.scatter(x = "target", y = "vocab_score")
plt.title("Vocab score vs Target", fontsize = 15)

print("correlation : ",sorted_train_data["target"].corr(sorted_train_data["vocab_score"]))


# Do longer sentences make it difficult to read?

- The plots below show the average length of the n longest sentences in each excerpt
- We can observe a negative correlation because longer sentences are tougher to read
- However, the average length of the 3 longest sentences has the strongest negative correlation. Maybe the average length of the 3 longest sentences is a good approximation for measuring how tough an excerpt is to read
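
The feature described above can be sketched with the standard library alone. The regex-based sentence splitter and the helper names `sentence_lengths` and `avg_of_n_longest` are assumptions made for this example; the notebook itself uses spaCy's sentence segmentation.

```python
import heapq
import re

def sentence_lengths(text):
    """Word count of each sentence, using a simple regex sentence splitter."""
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def avg_of_n_longest(lengths, n):
    """Average of the n largest values; uses all values if there are fewer than n."""
    top = heapq.nlargest(n, lengths)
    return sum(top) / len(top) if top else 0.0

text = "It rained. The old dog slept by the warm fire all day. We stayed inside and read."
lengths = sentence_lengths(text)
for n in (1, 2, 3):
    print(n, avg_of_n_longest(lengths, n))
```

Using `heapq.nlargest` avoids mutating the input list and handles excerpts with fewer than `n` sentences without a special case.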


In [None]:
def n_max_elements_avg(list1, N):
    # average of the N largest values; uses every value when there are N or fewer
    if not list1:
        return 0
    # sorting a copy leaves the caller's list untouched
    final_list = sorted(list1, reverse=True)[:N]
    return sum(final_list)/len(final_list)

def avg_len_n_sents(x, n = 1,punct = False):
    sents_len = []
    for sent in x.sents:
        sent_len = 0
        for word in sent:
            if word.is_punct == False:
                sent_len += 1
        
        sents_len.append(sent_len)  
    return n_max_elements_avg(sents_len, n)



In [None]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(7,11), constrained_layout = True)
# plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None)

for n in range(6):
    col = str(n+1)+"_sent_avg"
    sorted_train_data[col] = sorted_train_data.parsed_excerpt.apply(lambda x : avg_len_n_sents(x, n = n+1))

    sorted_train_data.plot.scatter(x = "target", y = col, title = \
                                   "top {} len, corr = {}"\
                                   .format(n+1, sorted_train_data["target"].corr(sorted_train_data[col])), ax = axes[int(n/2),n%2])


# Topics of the excerpts

- The word clouds below show the most common words for various POS tags
- Observing carefully, we can see that the text does not cover diverse topics: almost all the word clouds are about mundane things that are present in everyday life, with no special topics like science, healthcare etc.
- As far as I understand, these excerpts are short stories for kids

In [None]:
# noun verb sentences
def get_filtered_text(word, tags):
#     print(word, tags, word.pos_)
    if tags != []:
        if word.pos_ in tags:
             return word.text
        else:
            return "" 
    else:
        return word.text
    
def generate_word_cloud_input(spacy_excerpt,tags = []):
    word_cloud_input = ""
    for sentence in spacy_excerpt:
        for word in sentence:
            filtered_text = get_filtered_text(word,tags)
#             print(filtered_text)
            if is_useful_word(word) and filtered_text != "":
                word_cloud_input += filtered_text + " "
    return word_cloud_input
                
                

In [None]:
fig = plt.figure(figsize = (15, 15), facecolor = None)
wordcloud_obj = WordCloud(width = 800, height = 800,
                    background_color ='white',
                    min_font_size = 10)

POS = ["NOUN", "VERB", "ADJ", "ADV"]
for i,pos in enumerate(POS):
    word_cloud = wordcloud_obj.generate(generate_word_cloud_input(train_data["parsed_excerpt"], [pos]))
    ax = fig.add_subplot(2,2,i+1)  
    ax.imshow(word_cloud)
    plt.title("Word Cloud for {}".format(pos),fontsize=20)

    plt.axis("off")
    plt.tight_layout(pad = 10)
plt.show()


# Topic modelling using LDA
- The topic modelling below shows the top words of each topic found by LDA

In [None]:
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet
import gensim

def spacy_POS_to_wordnet_POS(nltk_tag):
    if nltk_tag == 'ADJ':
        return wordnet.ADJ
    elif nltk_tag == 'VERB':
        return wordnet.VERB
    elif nltk_tag == 'NOUN':
        return wordnet.NOUN
    elif nltk_tag == 'ADV':
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
lemmatizer = WordNetLemmatizer()

# lemmatize the useful words of each excerpt, passing the POS tag mapped to WordNet
topic_modelling_docs = []
for i, sentence in enumerate(train_data["parsed_excerpt"]):
    topic_modelling_docs.append([])
    for word in sentence:
        if is_useful_word(word):
            topic_modelling_docs[i].append(
                        lemmatizer.lemmatize(word.text, pos = spacy_POS_to_wordnet_POS(word.pos_)))

In [None]:
from gensim.models import TfidfModel,LdaMulticore
from gensim.corpora import Dictionary

num_topics = 5
dct = Dictionary(topic_modelling_docs)
word_freq_corpus = [dct.doc2bow(doc) for doc in topic_modelling_docs]
model = TfidfModel(word_freq_corpus)
tfidf_corpus = [model[word_freq_sentence] for word_freq_sentence in word_freq_corpus]
lda_model = LdaMulticore(tfidf_corpus, num_topics=num_topics, id2word=dct, passes=2, workers=2)

- I was unable to observe any specific topics based on the topic words from LDA

> 📌 I would be glad if you could __point out in the comments any hidden topics__ in these topic words. You can also tweak some LDA parameters and see if you can find any topics

- Find the documentation on LDA parameters [here](https://radimrehurek.com/gensim/models/ldamulticore.html)


In [None]:
for topic_num in range(num_topics):
    topic_words = [dct[id_] for id_, weight in lda_model.get_topic_terms(topic_num, topn=15)]
    print(color.BOLD, "Topic {} Top Words : ".format(topic_num), color.END, ", ".join(topic_words))

In [None]:
# train_data["punctuation_count"] = train_data["parsed_excerpt"].apply(lambda x : sum([word.is_punct for word in x]))
# train_data["starting_pos"] = train_data["parsed_excerpt"].apply(lambda x : x[0].pos_)

## Visualising POS tags and their relation to readability
- In the plots below we can observe that tougher excerpts contain more nouns
- Tougher excerpts also contain fewer verbs
- Hence we can infer that more nouns and fewer verbs in a sentence could lead to difficult-to-read excerpts
    - For instance, we can write a sentence like this
         - "_jack went to shopping. jill went to fishing_". Observe that there are two verbs.
         - We can write the same thing in a more compact way: "jack and jill went to shopping and fishing respectively".
    - Hence more compact sentences possibly have fewer verbs and more nouns, which makes them difficult to read
- In the SCONJ plot we can observe that very simple excerpts have fewer conjunctions. Indeed, separate sentences are comparatively easier to read than long sentences joined by conjunctions.

> 📌 There is a lot going on in the plots below and I may not be able to point out everything. You can always explore more and post your observations in the comments

In [None]:
pos_list = ['ADJ','ADP','ADV','AUX','CCONJ','DET','NOUN','NUM','PART','PRON','PROPN','PUNCT','SCONJ','VERB']

for pos in pos_list:
    train_data["sents_starting_with_{}".format(pos)] = train_data["parsed_excerpt"].apply(lambda x : x[0].pos_ == pos if len(x) > 0 else "null")
    train_data["{}_count".format(pos)] = train_data["parsed_excerpt"].apply(lambda x : sum([word.pos_ == pos for word in x]))



In [None]:
pos_list = ["NOUN", "VERB",'ADJ','ADV',"PRON","SCONJ","PROPN","PUNCT",'AUX','CCONJ','DET','NUM']
pos_count_cols = []
for pos in pos_list:
    pos_count_cols.append("{}_count".format(pos))

fig, ax =plt.subplots(4,3, figsize=(15,20))

for i,col in enumerate(pos_count_cols):
    s = sns.boxplot(x="target_bins", y = col, data=train_data, ax = ax[int(i/3),i%3], order = ["very_difficult", "difficult","medium", "easy" ,"very_easy"])
    s.set_title("{} vs difficulty".format(col))
    s.set_xticklabels(rotation=30, labels=["very_difficult", "difficult","medium", "easy" ,"very_easy"])
    


plt.tight_layout(pad = 2)
fig.show()



### Box plots of word and sentence counts in an excerpt

In [None]:
train_data["word_count"] = train_data["parsed_excerpt"].apply(lambda x : len(x))
train_data["sent_count"] = train_data["parsed_excerpt"].apply(lambda x : len((list(x.sents))))

In [None]:
fig, ax =plt.subplots(1,2, figsize=(10,5))
s = sns.boxplot(x="target_bins", y="word_count", data=train_data, ax = ax[0])
s.set_title("Word Count vs Difficulty")
s.set_xticklabels(rotation=30, labels=["very_difficult", "difficult","medium", "easy" ,"very_easy"])

s = sns.boxplot(x="target_bins", y="sent_count", data=train_data, ax = ax[1])
s.set_title("Sentence Count vs Difficulty")
s.set_xticklabels(rotation=30, labels=["very_difficult", "difficult","medium", "easy" ,"very_easy"])


plt.tight_layout(pad = 4)
fig.show()


## Average syllable count
- We can observe that the average syllable count is higher for tougher excerpts


In [None]:
def syllable_count(word):
    word = word.lower()
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count += 1
    return count

def get_avg_syllabel(doc):
    syllabel_sum = 0 
    for word in doc:
        syllabel_sum += syllable_count(word.text)
    return syllabel_sum/len(doc)

train_data["avg_syllabel_len"] = train_data["parsed_excerpt"].apply(get_avg_syllabel)
s = sns.boxplot(x="target_bins", y="avg_syllabel_len", data=train_data)
s.set_title("Average Syllable Count vs Difficulty")
plt.show()  # `fig` here would still point at the figure from an earlier cell



## Plotting punctuation counts

In [None]:
train_data["question_mark_count"] = train_data["parsed_excerpt"].apply(lambda x : sum([word.text == "?" for word in x]))
train_data["excalmation_count"] = train_data["parsed_excerpt"].apply(lambda x : sum([word.text == "!" for word in x]))
train_data["comma_count"] = train_data["parsed_excerpt"].apply(lambda x : sum([word.text == "," for word in x]))

In [None]:
fig, ax =plt.subplots(1,3, figsize=(20,7))
s = sns.violinplot(x="target_bins", y="question_mark_count", data=train_data, ax = ax[0])
s.set_title("? mark vs Difficulty")
s.set_xticklabels(rotation=30, labels=["very_difficult", "difficult","medium", "easy" ,"very_easy"])

s = sns.violinplot(x="target_bins", y="excalmation_count", data=train_data, ax = ax[1])
s.set_title("! mark vs Difficulty")
s.set_xticklabels(rotation=30, labels=["very_difficult", "difficult","medium", "easy" ,"very_easy"])

s = sns.violinplot(x="target_bins", y="comma_count", data=train_data, ax = ax[2])
s.set_title("Comma vs Difficulty")
s.set_xticklabels(rotation=30, labels=["very_difficult", "difficult","medium", "easy" ,"very_easy"])

plt.tight_layout(pad = 2)
fig.show()


# Unique words plot

In [None]:
train_data["unique_words"] = train_data["parsed_excerpt"].apply(lambda x : len(set([word.text for word in x])))

In [None]:
plt.figure(figsize=(8,4)) # this creates a figure 8 inch wide, 4 inch high
s = sns.violinplot(x="target_bins", y="unique_words", data=train_data)
s.set_title("Unique Words vs Difficulty")
plt.show()


# Standard reading tests

In [None]:
reading_tests = {
    "flesch_reading_ease" : textstat.flesch_reading_ease,
    "smog_index" : textstat.smog_index,
    "flesch_kincaid_grade" : textstat.flesch_kincaid_grade,
    "coleman_liau_index" : textstat.coleman_liau_index,
    "automated_readability_index" : textstat.automated_readability_index,
#     "dale_chall_readability_score" : textstat.dale_chall_readability_score # this test takes a lot of time to compute
}

for test, test_func in reading_tests.items():
    print("running {} test".format(test))
    train_data[test] = train_data["parsed_excerpt"].apply(lambda x : test_func(x.text))


### Sort dataframe based on target score

In [None]:
sorted_train_data = train_data[["parsed_excerpt"]+list(reading_tests)+["target"]].sort_values("target")
sorted_train_data

### Randomly check scores for different excerpts


In [None]:
def print_scores(data_tuple):
    print(color.BOLD,"Given Score : ", color.END)
    print("\tTarget : ", data_tuple.target)
    print(color.BOLD, "Test Scores : ", color.END)
    print("\tflesch_reading_ease : \t\t", data_tuple.flesch_reading_ease)
    print("\tsmog_index : \t\t\t", data_tuple.smog_index)
    print("\tflesch_kincaid_grade : \t\t", data_tuple.flesch_kincaid_grade)
    print("\tcoleman_liau_index : \t\t", data_tuple.coleman_liau_index)
    print("\tautomated_readability_index : \t", data_tuple.automated_readability_index)
    print(color.BOLD, "Original Text : ", color.END)
    print("\tText", data_tuple.parsed_excerpt.text)
    


# Meaning of the Flesch reading ease score
![](https://www.researchgate.net/profile/Anealka-Hussin/publication/329921610/figure/tbl1/AS:708098725531652@1545835300816/Descriptive-Categories-used-in-the-Flesch-Reading-Ease-Formula.png)

> 📌 For _all indices except flesch_reading_ease_, a smaller score means the text is easier to read
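
To show why the two Flesch scores point in opposite directions, here is a minimal sketch of the published Flesch Reading Ease and Flesch-Kincaid Grade formulas. The vowel-group syllable counter and the regex tokenisation are simplifying assumptions of this example, so the values will not match `textstat` exactly; the input is also assumed to contain at least one word.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text):
    """Return (reading_ease, kincaid_grade) for a non-empty text."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Higher reading ease means easier text
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Higher grade means harder text
    kincaid_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, kincaid_grade

easy_ease, easy_grade = flesch_scores("The cat sat on the mat. The dog ran.")
hard_ease, hard_grade = flesch_scores(
    "Photosynthesis converts electromagnetic radiation into chemical energy."
)
print(easy_ease, easy_grade)
print(hard_ease, hard_grade)
```

Both formulas use the same two ratios, words per sentence and syllables per word, but with opposite signs, which is why reading ease falls while the grade level rises for harder text.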

In [None]:
print_scores(sorted_train_data.iloc[20])


In [None]:
print_scores(sorted_train_data.iloc[500])


In [None]:
print_scores(sorted_train_data.iloc[1002])


In [None]:
print_scores(sorted_train_data.iloc[1700])


In [None]:
print_scores(sorted_train_data.iloc[2534])


### Correlation plot for standard tests & target

In [None]:
corr = train_data[list(reading_tests.keys())+["target"]].corr()
fig = plt.figure(figsize=(12,12),dpi=80)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap='PuBuGn', robust=True, center=0,
            square=True, linewidths=.5,annot=True)
plt.title('Correlation of readability tests', fontsize=15,font="Serif")
plt.show()