<center><h1> CommonLit Readability Prize </h1>
    <h2>📖 EDA + Naïve Submission 📖</h2>

<img src="https://www.commonsense.org/education/sites/default/files/tlr-blog/commonlit-logo-1.png" width="500"/>
    <p style="text-align:center;">Image <a href="https://www.commonsense.org/education/website/commonlit">source</a>.</p>
</center>

# Overview

Citing the competition's hosts:
> In this competition, you’ll build algorithms to rate the complexity of reading passages for grade 3-12 classroom use.

Thus, given a set of text excerpts, we have to predict their relative *textual complexity*. Such work could be very beneficial for knowledge sharing and accessibility: a reader could search for excerpts of interest and get matches suited to their reading ability. In turn, matching texts to readers could be automated, which may speed up learning.

In this notebook, we'll run an Exploratory Data Analysis of the training data provided for this competition, then build and run a naïve solution that counts word matches against the training excerpts to predict whether a given text excerpt has a higher or lower textual complexity.


### Outline:

1. [Setup and Basic EDA](#head-1)
2. [Understanding Excerpts and their Associated Targets](#head-2) 
3. [A Naïve String Matching Submission](#head-3)

# 1. Setup and Basic EDA <a class="anchor" id="head-1"></a>

In [None]:
import os
from collections import defaultdict

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from wordcloud import WordCloud, STOPWORDS

%matplotlib inline

os.listdir('/kaggle/input/commonlitreadabilityprize')

We are provided with 3 main pieces of data:

* `train.csv`: The CSV file containing all the training reading passages as well as their corresponding metadata, such as their ID and their target complexities (ground truths).
* `test.csv`: The CSV file containing (a small subset of) the actual reading passages that will be used for testing purposes (thus, with no ground truth column available).
* `sample_submission.csv`: The CSV file containing all the excerpt IDs in the test set, for which we'll have to populate the prediction column.

In [None]:
train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
train

The training data contains 2,834 rows, with 6 columns describing each row.

In [None]:
train.info()

That's great! There are no missing values (except for legal and license information), and the dataset looks complete.

In [None]:
for col in train.columns:
    print(f"{col}: {len(train[col].unique())}")

It looks like **all targets and standard errors** are unique in the dataset. This comes as no surprise, as the problem at hand is a regression problem and not a classification problem.

Also, all 830 excerpts having a license share a pool of only 16 unique licenses.
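
As a side note on the counts above, here is a minimal sketch of how missing values, distinct values and license frequencies can be checked with pandas. The `toy` DataFrame and its values are made up for illustration; only the column names follow the competition data. Note that `len(Series.unique())` counts `NaN` as a value, whereas `Series.nunique()` ignores it by default, so the two can differ by one on columns with missing entries.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real data)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "license": ["CC BY 4.0", None, "CC BY 4.0", None],
    "target": [-0.5, -1.2, 0.3, -2.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of distinct non-missing values in each column
distinct_per_column = toy.nunique()

# Number of rows per license (rows without a license are dropped by default)
license_counts = toy["license"].value_counts()

print(missing_per_column)
print(distinct_per_column)
print(license_counts)
```

On the real data, the same three calls could be run on `train` directly.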

# 2. Understanding Excerpts and their Associated Targets <a class="anchor" id="head-2"></a>

Let's first take a look at the distribution of the targets in the training set.

In [None]:
fig = ff.create_distplot([train['target']], ['target'])
fig.show()

Targets follow a roughly normal distribution centered near **-1**. Negative targets are clearly more common than positive ones, with the training range going **from -3.67 up to 1.71**.
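
The shape read off the plot can also be checked numerically. The sketch below uses a synthetic series as a stand-in for `train['target']` (an assumption made so the example is self-contained); the same summary could be computed on the real column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train['target'] (an assumption, not the real data)
rng = np.random.default_rng(0)
toy_target = pd.Series(rng.normal(loc=-1.0, scale=1.0, size=2000), name="target")

summary = {
    "mean": toy_target.mean(),
    "median": toy_target.median(),
    "min": toy_target.min(),
    "max": toy_target.max(),
    "skew": toy_target.skew(),                  # close to 0 for a symmetric distribution
    "share_negative": (toy_target < 0).mean(),  # fraction of targets below zero
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean close to the median and a skew near zero would support the "roughly normal" reading of the plot.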

But what do those numbers actually mean? Which direction corresponds to "easier" text excerpts? To answer this, let's take a look at the 5 excerpts with the highest and lowest target scores.

In [None]:
# The 5 excerpts with the lowest scores

for _, row in train.sort_values('target').head(5).iterrows():
    print("Target:", row['target'])
    print(row['excerpt'])
    print("*" * 50)

In [None]:
# The 5 excerpts with the highest scores

for _, row in train.sort_values('target').tail(5).iterrows():
    print("Target:", row['target'])
    print(row['excerpt'])
    print("*" * 50)

It looks like higher scoring excerpts tend to have a lower reading complexity than excerpts with lower scores.

Sentences are simpler and context is easier to follow in higher scoring excerpts, whereas the opposite can be observed in lower scoring excerpts.

For a deeper grasp of the differences in text excerpts, let's take a look at the word clouds of the 100 excerpts with the highest and lowest target scores.

In [None]:
# Defining our word cloud drawing function
def wordcloud_draw(data, color = 'white'):
    wordcloud = WordCloud(stopwords = STOPWORDS,
                          background_color = color,
                          width = 3000,
                          height = 2000
                         ).generate(' '.join(data))
    plt.figure(1, figsize = (12, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

In [None]:
words_in_lower_scoring_excerpts = []

for _, row in train.sort_values('target').head(100).iterrows():
    words_in_lower_scoring_excerpts.extend(row['excerpt'].split())

print("Wordcloud for excerpts with lowest targets:")
wordcloud_draw(words_in_lower_scoring_excerpts, color='black')

Words present in lower scoring excerpts seem to be more precise or *scientific*. Words such as **system**, **light**, **matter** and **surface** stand out the most.

In [None]:
words_in_higher_scoring_excerpts = []

for _, row in train.sort_values('target').tail(100).iterrows():
    words_in_higher_scoring_excerpts.extend(row['excerpt'].split())

print("Wordcloud for excerpts with highest targets:")
wordcloud_draw(words_in_higher_scoring_excerpts)

Words present in higher scoring excerpts seem to be more relaxed and geared towards *story-telling*. Words such as **said**, **went**, **little** and **mother** stand out the most.
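
Word clouds are hard to compare by eye, so a plain frequency count can complement them. The following is a minimal sketch using `collections.Counter`; the two toy excerpts, the tiny stop-word set and the `top_words` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the highest scoring excerpts
toy_excerpts = [
    "The little girl said she went to see her mother.",
    "Her mother said the little dog went home.",
]

# Tiny hand-picked stop-word set, for illustration only
stop_words = {"the", "she", "to", "her", "see"}

def top_words(excerpts, stop_words, n=3):
    """Return the n most frequent non-stop-words across all excerpts."""
    counts = Counter()
    for excerpt in excerpts:
        for token in excerpt.lower().split():
            # Strip surrounding punctuation so "mother." and "mother" match
            word = token.strip(".,;:!?\"'()")
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)

print(top_words(toy_excerpts, stop_words, n=4))
```

On the real data, `toy_excerpts` could be replaced by the 100 lowest or highest scoring excerpts, and `stop_words` by the `STOPWORDS` set imported from `wordcloud`.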

# 3. A Naïve String Matching Submission <a class="anchor" id="head-3"></a>

Obviously, the end goal of such a competition is not simply to do string matching against known text excerpts in order to predict textual complexity. Rather, it is to build an NLP model strong enough to infer from context whether a piece of text has the cohesion and semantics of high or low textual complexity.

That being said, below we will implement a very simple string matching technique as a proof of concept (POC) and a template for building a submission.

In [None]:
test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
test

In [None]:
submission_df = pd.read_csv('../input/commonlitreadabilityprize/sample_submission.csv', index_col=0)
submission_df

In [None]:
# Split the sorted targets into consecutive ranges of `range_len` excerpts each
range_len = 100
range_counter = 0
lower_bound = 0
upper_bound = 0
target_ranges = []
for _, row in train.sort_values('target').iterrows():
    if range_counter >= range_len - 1:
        # Last excerpt of the current range: close the range
        range_counter = 0
        upper_bound = row['target']
        target_ranges.append((lower_bound, upper_bound))
    elif range_counter == 0:
        # First excerpt of a new range
        lower_bound = row['target']
        range_counter += 1
    else:
        range_counter += 1

# Close the last, partially filled range with the maximum target
if range_counter > 0:
    target_ranges.append((lower_bound, train['target'].max()))

target_ranges
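
The binning loop above can also be written as a slice over the sorted targets. This is a minimal sketch of an equivalent formulation, not a replacement for the cell above; `make_target_ranges` is a hypothetical helper name and the toy targets are made up.

```python
import numpy as np

def make_target_ranges(targets, range_len=100):
    """Split the sorted targets into consecutive chunks of `range_len`
    values and return the (lowest, highest) target of each chunk."""
    sorted_targets = np.sort(np.asarray(targets, dtype=float))
    ranges = []
    for start in range(0, len(sorted_targets), range_len):
        chunk = sorted_targets[start:start + range_len]
        ranges.append((float(chunk[0]), float(chunk[-1])))
    return ranges

# Toy targets with a chunk size of 3, so the last range holds a single value
toy_targets = [0.5, -1.0, -3.0, 1.5, -2.0, 0.0, -0.5]
toy_ranges = make_target_ranges(toy_targets, range_len=3)
print(toy_ranges)
```

The last chunk may hold fewer than `range_len` values, which mirrors the partially filled final range handled after the loop.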

In [None]:
target_ranges_prediction = {}
margin = 0.1
for target_range in target_ranges:
    # Predict the midpoint of each range
    prediction = sum(target_range)/2
    # Pull low (< -2) and high (> 0) predictions toward the middle by `margin`
    if prediction < -2: prediction += margin
    if prediction > 0: prediction -= margin
    target_ranges_prediction[target_range] = prediction

target_ranges_prediction

In [None]:
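# Collect the set of (lowercased) words seen in each target range
words_in_target_ranges = defaultdict(set)

for target_range in target_ranges:
    lower_bound, upper_bound = target_range
    # Inclusive bounds, so the excerpts sitting exactly on a range boundary are not dropped
    for _, row in train[(train['target'] >= lower_bound) & (train['target'] <= upper_bound)].iterrows():
        words_in_target_ranges[target_range] |= set(row['excerpt'].lower().split())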

In [None]:
predictions = []
for index in submission_df.index:
    excerpt_words = test.loc[test['id'] == index, 'excerpt'].iloc[0].lower().split()
    max_intersection = sum([1 if word in words_in_target_ranges[target_ranges[0]] else 0 for word in excerpt_words])
    max_target_range = target_ranges[0]
    for target_range in target_ranges[1:]:
        intersection = sum([1 if word in words_in_target_ranges[target_range] else 0 for word in excerpt_words])
        if intersection > max_intersection:
            max_intersection = intersection
            max_target_range = target_range
    predictions.append(target_ranges_prediction[max_target_range])

submission_df['target'] = predictions

submission_df.to_csv('submission.csv')

submission_df
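
Before submitting, it can help to score this kind of predictor on held-out training excerpts. The sketch below is an illustration under stated assumptions: the competition metric is root mean squared error (RMSE), and `rmse`, `predict_by_overlap`, the two toy bins and the held-out pairs are all hypothetical names and values, not part of the notebook above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def predict_by_overlap(excerpt, vocab_by_bin, prediction_by_bin):
    """Return the prediction of the bin whose vocabulary covers the most
    tokens of the excerpt (the first bin wins on ties)."""
    tokens = excerpt.lower().split()
    best_bin = max(
        vocab_by_bin,
        key=lambda b: sum(token in vocab_by_bin[b] for token in tokens),
    )
    return prediction_by_bin[best_bin]

# Toy vocabularies and per-bin predictions (made-up values)
vocab_by_bin = {
    "hard": {"system", "matter", "surface", "light"},
    "easy": {"said", "went", "little", "mother"},
}
prediction_by_bin = {"hard": -2.5, "easy": 0.5}

# Toy held-out excerpts with made-up ground-truth targets
held_out = [
    ("the little boy went home", 0.8),
    ("light reflects off the surface", -2.0),
]

preds = [predict_by_overlap(text, vocab_by_bin, prediction_by_bin) for text, _ in held_out]
score = rmse([target for _, target in held_out], preds)
print(preds, round(score, 4))
```

With the real data, the vocabularies would be built from one part of `train` and the score computed on the remaining part.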

# This notebook is under development 🚧

---

## Please upvote if you found it useful 😊