<center> <h1>Exploratory Data Analysis for CommonLit Reading Prize</h1> </center>

| ![](https://cdn.pixabay.com/photo/2019/03/31/07/25/kid-4092599_960_720.jpg) |
|:--:|
| Photo by [Vlad Vasnetsov](https://pixabay.com/users/vladvictoria-9785604/) on [Pixabay](https://pixabay.com/)|

## *What is Exploratory Data Analysis (EDA)*?
EDA, according to [Wickham & Grolemund](https://r4ds.had.co.nz/exploratory-data-analysis.html), is the process of getting to know our data primarily through simple visualizations before fitting a model.

## _Why is it done_?
First, if we attempt to fit models without inspecting our data, our code will often throw an error (unless we're using a toy dataset, but what fun is that?).

Second, in the unlikely event the code did run, we would fail to identify: 
* unbalanced data sets
* missing values
* [collinearity](https://medium.com/future-vision/collinearity-what-it-means-why-its-bad-and-how-does-it-affect-other-models-94e1db984168)

and more, any of which could cause our model to produce inferior results.
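As a rough illustration of those three checks, here is a minimal sketch in pandas. The `toy` frame and its column names are made up for this example and have nothing to do with the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical, for illustration only)
toy = pd.DataFrame({
    "label": ["a"] * 9 + ["b"],                                       # heavily unbalanced classes
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.2, 17.9, 20.0],    # roughly 2 * x1
    "x3": [0.5, np.nan, 0.1, 0.9, np.nan, 0.3, 0.7, 0.2, 0.8, 0.4],   # two missing values
})

class_share = toy["label"].value_counts(normalize=True)   # share of rows per class
missing_per_column = toy.isna().sum()                      # count of missing values per column
feature_corr = toy[["x1", "x2", "x3"]].corr()             # pairwise Pearson correlations

print(class_share, missing_per_column, feature_corr.round(2), sep="\n\n")
```

A near-perfect correlation between two features (here `x1` and `x2`) is the kind of collinearity warning sign worth spotting before modeling.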

## _How is it done in Python?_
While EDA [is more of an attitude than a definitive list of steps](https://r4ds.had.co.nz/exploratory-data-analysis.html), if you're looking for a list to serve as a guide until you develop your own intuition, a great place to look is Aurélien Géron's fantastic checklists for all stages of a machine learning project, found in [Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow](https://github.com/ageron/handson-ml/blob/master/ml-project-checklist.md).


### Step 1: Frame the Problem
According to the [overview](https://www.kaggle.com/c/commonlitreadabilityprize/overview/description), the task is to answer the following, 

> _Can machine learning identify the appropriate reading level of a passage of text, and help inspire learning?_

If you want to be pedantic, this question has two parts: 
1. Assess the complexity of a passage of text
2. Measure the impact machine learning has on motivating learners

However, the rest of the overview makes clear this competition is concerned with the former rather than the latter.

### Step 2: Get the Data
Luckily for us, the organizers have provided the data we need to get started, so we can move on to the next step.

### Step 3: Explore the Data to Gain Insights
Let's start by loading some useful libraries as well as the data. 

In [None]:
pip install textstat

In [None]:
#For data manipulation
import numpy as np
import pandas as pd
import textstat

#For visualization
import matplotlib.pyplot as plt
import missingno as msno 
import seaborn as sns 

RANDOM_STATE = 42

df = pd.read_csv('../input/commonlitreadabilityprize/train.csv')

Again, a definitive set of steps does not exist for this stage, but I like to start by asking and answering the [Five W's and 1 H](https://www.workfront.com/blog/project-management-101-the-5-ws-and-1-h-that-should-be-asked-of-every-project):

* Who
* What
* When
* Where 
* Why 
* How

So, in no particular order, let's start with: 
#### _Where did the data come from?_
From the [overview](https://www.kaggle.com/c/commonlitreadabilityprize/overview/description) we can infer that the data was produced by [CommonLit](https://www.commonlit.org/en), a non-profit EdTech company focused on promoting literacy, in conjunction with the [Applied Linguistics and ESL Department at Georgia State University](https://alsl.gsu.edu/).

#### _How much data do we have?_

In [None]:
df.shape

OK, so we have six features and 2,834 observations in the training set. 

>_What about the test set?_

The public test set has seven observations while the [hidden test set has ~2,000](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236335#1292356).

#### _What type of data do we have?_

In [None]:
df.info()

As expected, given the scope of the challenge, we have a mix of strings and floats. 

Let's see some examples: 

In [None]:
df.sample(10, random_state=RANDOM_STATE)

Excellent! The next question that comes to mind is:

#### _What are the key features?_ 

We're fortunate that this dataset has clear variable names so we can surmise: 

* ```id```: unique identifier for each text sample
* ```excerpt```: text passage to be used for modeling
* ```target```: difficulty rating of the text
* ```standard_error```: measure of the spread of scores amongst the raters

> _What about_ ```url_legal``` _and_ ```license```_?_ 

Since those variables [won't be available at test time](https://www.kaggle.com/c/commonlitreadabilityprize/data), they shouldn't be used for training and will subsequently be dropped. 

#### _Who are the raters?_
According to Dr. Crossley, [teachers from the target grade ranges rated the samples, with a majority of teachers drawn from sixth through 10th grade](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423).

#### _How were the teachers selected?_
Unfortunately, that information hasn't been provided yet but [watch this space](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423#1347995). 

#### _How was the target computed?_
Teachers were given a pair of passages and asked, [of these two, which is easier for a child to understand?](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423#1322910) Target scores were then computed based on the responses of the teachers. 

> _OK, but how were the scores computed?_

This one's complicated so, in the spirit of T.S. Eliot, who wrote, ["Immature poets imitate; mature poets steal"](https://www.uvu.edu/arts/applause/posts/stealing.html#:~:text=%E2%80%9CGood%20artists%20borrow%2C%20great%20artists%20steal.%E2%80%9D&text=Eliot's%20dictum%3A%20%E2%80%9CImmature%20poets%20imitate,or%20at%20least%20something%20different.), I'm going to direct you to [Shahebaz's fabulous notebook](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240886), which provides an in-depth explanation, and to this [paper](https://www.jstatsoft.org/article/view/v012i01/v12i01.pdf), which was recommended by the competition host in this [discussion](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236671#1296867).
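To give a feel for how pairwise "which is easier?" answers can turn into a single score per passage, here is a minimal sketch. It assumes the scoring follows a Bradley-Terry style paired-comparison model, which is my reading of the linked paper and not something I can confirm about the hosts' exact procedure. The function name `fit_bradley_terry` and the comparison data are hypothetical.

```python
import numpy as np

def fit_bradley_terry(n_items, comparisons, n_iter=200):
    """Estimate Bradley-Terry log-strengths from (winner, loser) index pairs
    using the classic minorization-maximization updates.

    Assumes every item wins at least once and loses at least once."""
    wins = np.zeros(n_items)
    n_pair = np.zeros((n_items, n_items))   # n_pair[i, j] = times i and j were compared
    for winner, loser in comparisons:
        wins[winner] += 1
        n_pair[winner, loser] += 1
        n_pair[loser, winner] += 1

    strength = np.ones(n_items)
    for _ in range(n_iter):
        # denom[i] = sum over j of n_pair[i, j] / (strength[i] + strength[j])
        denom = (n_pair / (strength[:, None] + strength[None, :])).sum(axis=1)
        strength = wins / denom
        strength /= strength.sum()           # the overall scale is arbitrary, so fix it
    return np.log(strength)

# Hypothetical judgments: (easier passage, harder passage)
comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2), (0, 2), (2, 0),
               (1, 2), (1, 2), (2, 1)]
scores = fit_bradley_terry(3, comparisons)
print(scores)   # higher score = judged easier more often
```

Under this assumption, a passage's score depends on which passages it happened to be paired with, which is worth keeping in mind for the rest of this notebook.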

#### _How many missing values are there?_

In [None]:
df = df.drop(columns=['url_legal', 'license'])
msno.bar(df);

EXCELLENT!!! We are not missing any values.

#### _What is the distribution of our target?_

In [None]:
df.target.describe()

Let's visualize that to make it easier to comprehend.

In [None]:
df.head()

In [None]:
sns.set(rc={'figure.figsize':(20, 10)})
sns.set_theme()
g = sns.ecdfplot(data=df, x="target")
g.set_xticks(range(-4, 3));

OK, our target appears to be approximately normally distributed.
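If you'd rather not trust your eyes alone, one rough check is to measure the largest gap between the empirical CDF and the CDF of a normal distribution with the same mean and standard deviation. The sketch below is an informal distance, not a formal hypothesis test, and it runs on a synthetic stand-in rather than on `df.target`; the helper name `max_gap_to_normal` is my own.

```python
import math
import numpy as np

def max_gap_to_normal(values):
    """Largest vertical gap, evaluated at the data points, between the
    empirical CDF of `values` and a normal CDF with matching mean and std."""
    x = np.sort(np.asarray(values, dtype=float))
    ecdf = np.arange(1, len(x) + 1) / len(x)
    z = (x - x.mean()) / x.std(ddof=1)
    normal_cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return float(np.max(np.abs(ecdf - normal_cdf)))

rng = np.random.default_rng(42)
stand_in = rng.normal(loc=-1.0, scale=1.0, size=2834)   # synthetic, NOT df.target
print(round(max_gap_to_normal(stand_in), 3))
```

In the notebook you could call `max_gap_to_normal(df.target)`; a small value (a few hundredths or less) would be consistent with the "approximately normal" reading of the plot.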

> _Wait! Why didn't you use a histogram?_

A histogram's shape depends on the bin width and bin edges we choose, which introduces additional bias into our analysis, so histograms are best avoided when possible. Click [here](https://towardsdatascience.com/6-reasons-why-you-should-stop-using-histograms-and-which-plot-you-should-use-instead-31f937a0a81c) to learn more.
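Here is a small sketch of that point using synthetic numbers (not the competition data): the same sample gives two quite different pictures depending on the bin count, while the ECDF needs no such choice.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(size=500)            # synthetic stand-in for a numeric column

# The same data summarized with two different bin counts
counts_coarse, _ = np.histogram(sample, bins=5)
counts_fine, _ = np.histogram(sample, bins=40)

# The ECDF has no binning parameter: sorted values against cumulative share
x = np.sort(sample)
ecdf = np.arange(1, len(x) + 1) / len(x)

print(counts_coarse)
print(counts_fine)
print(x[:3], ecdf[:3])
```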

#### _What does our target mean?_

That's a really good question. 

As previously mentioned, the raters were simply asked to [identify which of the two text passages was easier for students to understand](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423#1322910) and then the scores were tabulated using the method outlined [here](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240886). 

> _But what do those scores actually mean?_ 

🤔 Again, that's a really good question. 

Luckily for us, [some kind souls have already answered this question](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236402#1295630) so we know that the lower the score, the more difficult the text. Conversely, the higher the score, the easier the text.

> _So that means the higher the target, the lower the grade?_

~~You would think so but [we're still waiting for confirmation](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423#1360081).~~  

Yes it does! So let's rescale our target using the method outlined [here](https://stackoverflow.com/a/929107/4691538) to make it a little easier to read.

All we have to do is the following: 

`OriginalRange = (OriginalMax - OriginalMin)`  
`ScaledRange = (ScaledMax - ScaledMin)`  
`ScaledValue = (((OriginalScore - OriginalMin) * ScaledRange) / OriginalRange) + ScaledMin`  

So what's going on here?

First we define the original highest and lowest scores in `df.target`.

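**Key Point:** the higher the value in `df.target` the easier the text, which means the lowest target has to play the role of `OriginalMax` and the highest target the role of `OriginalMin`:

In [None]:
# Deliberately swapped: the lowest target (hardest text) should map to the highest grade
OriginalMax = df.target.min()
OriginalMin = df.target.max()

OriginalRange = (OriginalMax - OriginalMin)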

Now for the easy part: defining our scaled values.

In [None]:
ScaledMax = 12
ScaledMin = 3
ScaledRange = (ScaledMax - ScaledMin)

Now for the magic: we can pass the highest score from `df.target` (i.e., the easiest passage) and we _should_ get `3` back.

In [None]:
OriginalScore = df.target.max()

ScaledValue = (((OriginalScore - OriginalMin) * ScaledRange) / OriginalRange) + ScaledMin

print(f'The highest target score scaled is a {ScaledValue}.')

Excellent! Now let's try the lowest score in `df.target` (i.e. the hardest passage) and we _should_ get `12` back.

In [None]:
OriginalScore = df.target.min()

ScaledValue = (((OriginalScore - OriginalMin) * ScaledRange) / OriginalRange) + ScaledMin

print(f'The lowest target score scaled is a {ScaledValue}.')

Excellent! Now for the **REALLY** fun part: scaling the `target` for all observations.

How do we do that?
1. create a function to convert the target to a grade
2. use `apply()` to apply the function to every observation

In [None]:
OriginalMax = df.target.min()
OriginalMin = df.target.max()
OriginalRange = (OriginalMax - OriginalMin)


ScaledMax = 12
ScaledMin = 3
ScaledRange = (ScaledMax - ScaledMin)


def rescale_target(OriginalScore):
    """Converts original target to range 3 - 12"""
    return (((OriginalScore - OriginalMin) * ScaledRange) / OriginalRange) + ScaledMin

It should work but let's test it first: 

In [None]:
print(f"The most difficult text's scaled score is {rescale_target(OriginalMax)} while the easiest's is {rescale_target(OriginalMin)}.")

Perfect, now all we have to do is apply it to `target` to create a `scaled_target` feature.

In [None]:
df['scaled_target'] = df.target.apply(rescale_target)

df.head()

Excellent!
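As an aside, the same linear mapping can be written in one vectorized call with `np.interp`. This is just an alternative sketch, not a replacement for the function above; the helper name `rescale_with_interp` and the toy scores are made up.

```python
import numpy as np

def rescale_with_interp(scores, lo, hi, easy_grade=3, hard_grade=12):
    """Map `hi` (easiest text) to `easy_grade` and `lo` (hardest text) to `hard_grade`."""
    # np.interp needs increasing x-coordinates, so the output endpoints are reversed instead
    return np.interp(scores, [lo, hi], [hard_grade, easy_grade])

toy_scores = np.array([-3.0, -1.0, 0.0, 1.0])   # made-up targets, not the real data
result = rescale_with_interp(toy_scores, toy_scores.min(), toy_scores.max())
print(result)
```

On the real data this would look like `rescale_with_interp(df.target, df.target.min(), df.target.max())`.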

What does the distribution of our scaled target look like? 

In [None]:
sns.set(rc={'figure.figsize':(20, 10)})
sns.set_theme()
g = sns.ecdfplot(data=df, x="scaled_target")
g.set_xticks(range(3, 13));

Fascinating: roughly 20% of our observations are texts rated at the elementary level (i.e., up to 6th grade) and roughly another 20% at the high school level (i.e., 9th - 12th grade), meaning about 60% of texts were rated at the middle school level.

I'd be fascinated to know if the experts who selected the passages for training purposes expected this type of distribution; it's tempting to think that books for young children and high school students are easy to identify whereas books for middle school students have more gradations, but that may not be the case.

Let's come back to this later.

#### _How many duplicates do we have?_

In [None]:
df.duplicated().sum()

Ok, so we do not have any duplicate rows, but do we have any duplicates in any of the columns?

In [None]:
df.apply(lambda x : x.duplicated()).sum()

OK, so that means none of the columns contain duplicates.  

Let's look at the `standard_error`.

#### _What's the standard error?_

The hosts of the competition stated,

> ["[...] individual raters saw only a fraction of the excerpts, while every excerpt was seen by numerous raters."](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423) 

What does that mean? Basically, if the standard error for a passage is greater than 0, then there was disagreement on the difficulty of that passage amongst the raters. 

And do we have any passages with a standard error of zero? 

In [None]:
sns.ecdfplot(data=df, x='standard_error');

According to the chart above, we certainly do.

But how many do we have? 

In [None]:
pd.set_option('display.max_colwidth',1000)

df[df.standard_error==0]

Curious: the excerpt with a ```standard_error``` of zero also has a ```target``` of zero.

> _Surely this is an outlier, right?_

Nope. According to the [host](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236403#1293029) this observation is valid. 

> _So what to make of an observation with a_ ```standard_error``` _of 0.0?_

Essentially, the raters were in complete agreement when rating it compared to its competition. 

> _How can that happen?_

1. The excerpt is **REALLY** easy/hard making the choice obvious for the raters
2. The excerpt was paired with excerpts which were **REALLY** easy/hard, again, making the choice obvious for the raters
3. Luck 😆

Given that the ```target``` for this excerpt is 0.0, I was thinking it was either option 1 or 2 rather than dumb luck but now I'm not so sure. 

Looking at the distribution of the ```target``` below, we see that ~80% of all observations are **more difficult** than this passage; remember, [the higher the score, the easier the text](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/236402#1295630). 

In [None]:
observation = (df['target'] < 0).mean()  # share of excerpts with a target below zero
chart = sns.ecdfplot(data=df, x='target')
chart.annotate('Excerpt of Interest', 
               xy=(0, observation),
               xytext=(-1.5, 0.8), 
               fontsize=20, 
               arrowprops=dict(arrowstyle="->", color='b'));

Interesting... I would have thought for sure that this passage would have been either significantly more or less difficult than the other passages but I guess not. 

Again, the competition host says this observation is correct (i.e., not a typo) so it just goes to show that things happen when there are human raters involved. 

Let's move on. 

#### _What is the relationship between_ ```target``` _and_ ```standard_error```_?_

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
sns.scatterplot(data=df, x='target', y='standard_error', ax=ax)
# Overlay density contours to show where most observations fall
sns.kdeplot(data=df, x='target', y='standard_error', color='w', linewidths=2, ax=ax)
ax.set_title('Standard Error vs Target', size=20)
ax.set_xlabel('Target', size=15)
ax.set_ylabel('Standard Error', size=15)
ax.annotate('Excerpt of Interest', 
               xy=(0, 0),
               xytext=(-1, 0.1), 
               fontsize=12, 
               arrowprops=dict(arrowstyle="->", color='b'));
plt.show()


So what is going on here? 

Essentially, passages in the Goldilocks' Zone (i.e., neither too easy nor too difficult) have less error than those on the edges. 
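One way to put numbers on that pattern is to average `standard_error` within quantile bins of `target`. The sketch below uses a synthetic frame with a U shape built in on purpose, so it only shows the mechanics; the helper name `mean_error_by_target_bin` is hypothetical.

```python
import numpy as np
import pandas as pd

def mean_error_by_target_bin(frame, n_bins=5):
    """Average standard_error within equal-sized quantile bins of target."""
    bins = pd.qcut(frame["target"], q=n_bins)
    return frame.groupby(bins, observed=True)["standard_error"].mean()

# Synthetic stand-in with a built-in U shape (NOT the competition data)
rng = np.random.default_rng(42)
target = rng.normal(loc=-1.0, scale=1.0, size=1000)
toy = pd.DataFrame({
    "target": target,
    "standard_error": 0.45 + 0.03 * (target + 1.0) ** 2 + rng.normal(0, 0.01, 1000),
})
binned = mean_error_by_target_bin(toy)
print(binned)
```

Running `mean_error_by_target_bin(df)` on the real data should show whether the outer bins really do carry more error than the middle ones.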

> _"Why are the easiest/most difficult passages harder to rate?"_

I honestly do not know. Based on my experience as a [language teacher, teacher trainer, and examiner](https://www.linkedin.com/in/evansimpson1/), I would think the passages on the ends would be the easiest to rate. 

Since they're not, maybe something else is going on. 

To that end, let's look more closely at the target and at some excerpts. 

#### _What's the most difficult passage?_


In [None]:
def get_values(quart, var):
    """Wrapper for the pandas quantile function"""
    return df.loc[df.target==df.target.quantile(q=float(quart), interpolation='nearest'), var].values[0]

def get_scores_and_examples(percentile):
    """Wrapper for the wrapper :D"""
    print(f'ID: {get_values(percentile,"id")}')
    print(" ")
    print(f'Target score: {get_values(percentile,"target")}')
    print(" ")
    print(f'Scaled target: {get_values(percentile, "scaled_target")}')
    print(" ")
    print(f'Standard Error: {get_values(percentile,"standard_error")}')
    print(" ")
    print('Excerpt:')
    print(get_values(percentile, 'excerpt'))
    
get_scores_and_examples(0)
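A quick note on `interpolation='nearest'` in `get_values`: the lookup filters rows by equality with the quantile, so the quantile has to be a value that actually occurs in the column. The default linear interpolation can return a value that matches no row, as this toy example (made-up scores) suggests.

```python
import pandas as pd

toy_target = pd.Series([-2.0, -1.0, 0.5, 1.5])   # made-up scores

linear_q = toy_target.quantile(q=0.5)                           # default: linear interpolation
nearest_q = toy_target.quantile(q=0.5, interpolation="nearest")

print(linear_q, (toy_target == linear_q).any())     # interpolated value, matches no row here
print(nearest_q, (toy_target == nearest_q).any())   # an observed value, matches a row
```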

#### _75th percentile (by difficulty)?_

In [None]:
get_scores_and_examples(.25)

#### _Median?_

In [None]:
get_scores_and_examples(.5)

#### _25th percentile (by difficulty)?_

In [None]:
get_scores_and_examples(.75)

Really? 75% of passages are **more difficult** than the one above? I would think the [schemata](https://evolllution.com/programming/teaching-and-learning/schemata-and-instructional-strategies/) (i.e., Nixon, FBI, CIA, IRS) necessary to understand the passage would make it harder to read but, as always, it depends on what it was paired with.

#### _What's the easiest passage?_

In [None]:
get_scores_and_examples(1)

OK, so a passage discussing dinosaur fossils found by paleontologists on different continents was tabulated as the ***easiest*** passage in the set?

Let's look at all five of these passages together.

In [None]:
examples = ('4626100d8', '519ca97e9', '25ca8f498', 'bac396931', '9f8e4b6a8')

(df[df.id.isin(examples)].sort_values(by='target')
                         .style.set_properties(subset=['excerpt'],**{'text-align': 'left'}))

Fascinating - of the five passages, three are non-fiction while two are fiction. Additionally, of the non-fiction, the most difficult is on engineering while the topics for the second easiest and easiest are history (politics?) and paleontology. 

It would be a lot of fun to run a topic model [like this](https://www.kaggle.com/maartengr/topic-modeling-arxiv-abstract-with-bertopic/notebook) on this dataset to identify what, if any, impact the topic of the excerpt has on the difficulty of the text, but I'll save that for a different notebook.

Instead, I want to know more about the excerpts like:

* How long are the excerpts?
* What are some summary stats for the excerpts like:
    * average sentence length  
    * ~~percentage of each excerpt which is named entities~~ [sigh, another time]
* How do traditional readability algorithms correlate with our target?

~~To that end, watch this space.~~  
Without further ado....

# How do traditional readability algorithms correlate with our target?


Huge hat tip to [Shoku-pan](https://www.kaggle.com/yhirakawa) and [Ruchi Bhatia](https://www.kaggle.com/ruchi798) for sharing their notebooks ([here](https://www.kaggle.com/yhirakawa/textstat-how-to-evaluate-readability) and [here](https://www.kaggle.com/ruchi798/commonlit-readability-prize-eda-baseline)) which outline how to use the [textstat](https://pypi.org/project/textstat/) package to compute some standard readability statistics. 

In [None]:
def textstat_stats(text):
    """Return readability metrics for a passage"""
    n_syllable = textstat.syllable_count(text)
    n_words = textstat.lexicon_count(text, removepunct=True)
    n_sentences = textstat.sentence_count(text)
    avg_words = n_words/n_sentences
    avg_syllables = n_syllable/n_sentences
    flesch_diff = textstat.flesch_reading_ease(text)
    fleschgrade_diff = textstat.flesch_kincaid_grade(text)
    gfog = textstat.gunning_fog(text)
    ari = textstat.automated_readability_index(text)
    cli = textstat.coleman_liau_index(text)
    lwf = textstat.linsear_write_formula(text)
    dcrs = textstat.dale_chall_readability_score(text)
    
    return n_syllable, n_words, n_sentences, avg_words, avg_syllables, flesch_diff, fleschgrade_diff, gfog, ari, cli, lwf, dcrs

In [None]:
def get_stats(df):
    df_stats = df.apply(lambda x: textstat_stats(x.excerpt), axis='columns', result_type='expand')
    columns = ['n_syllable', 'n_words', 'n_sentences', 'avg_words', 'avg_syllables','flesch_diff', 
               'fleschgrade_diff', 'gfog', 'ari', 'cli', 'lwf', 'dcrs']
    df_stats.columns = columns
    return pd.merge(df, df_stats, left_index=True, right_index=True)

df_full = get_stats(df)

If you're an eagle-eyed observer, you'll notice I omitted the [SMOG](https://en.wikipedia.org/wiki/SMOG) index from the function above. Why? Because it works best on passages of 30 sentences or longer and [not at all](http://www.aspiruslibrary.org/literacy/SMOG%20Readability%20Formula.pdf) for texts shorter than ten sentences.

And what proportion of our excerpts are shorter than ten sentences in length? 

In [None]:
sns.ecdfplot(data=df_full, x="n_sentences");
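To get the exact proportion rather than reading it off the chart, the mean of a boolean mask does the job. A minimal sketch with made-up sentence counts (the helper `share_below` is hypothetical):

```python
import pandas as pd

def share_below(values, threshold):
    """Fraction of entries strictly below `threshold`."""
    return float((pd.Series(values) < threshold).mean())

toy_sentence_counts = [4, 7, 9, 10, 12, 15, 8, 11]   # made-up counts, not the real data
print(share_below(toy_sentence_counts, 10))
```

In the notebook this would be `share_below(df_full.n_sentences, 10)`.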

Now on to the question of whether our target correlates with traditional measures of readability.

In [None]:
columns = ['target', 'scaled_target','n_syllable', 'n_words', 
           'n_sentences','avg_words', 'avg_syllables', 'flesch_diff',
           'fleschgrade_diff', 'gfog', 'ari', 'cli', 'lwf', 'dcrs']
    
corr = df_full.loc[:, columns].corr()

#Generate a mask to cover the upper-right side of the matrix
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

#Plot the heatmap with correlations
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(28, 12))
    ax = sns.heatmap(corr, mask=mask, annot=True, square=True)

First, I'm glad to see ```target``` and ```scaled_target``` are perfectly correlated, meaning I did the linear transformation correctly 😀

Next, it's illuminating to see [Flesch-Kincaid Grade Level](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level) (```fleschgrade_diff```) correlate so strongly with [Gunning FOG](https://en.wikipedia.org/wiki/Gunning_fog_index) (```gfog```), [Automated Readability Index](https://en.wikipedia.org/wiki/Automated_readability_index) (```ari```), and, to a lesser extent, [Linsear Write Formula](https://en.wikipedia.org/wiki/Linsear_Write) (```lwf```).

However, what's most interesting, and what I'm sure the hosts of the competition are thrilled to see, is that the strongest correlation for the `target` is with [Dale-Chall Readability Score](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula) at -0.55, while the correlations between the scaled target and the average number of syllables and words per sentence (i.e., `avg_syllables`, `avg_words`) are .38 and .27 respectively.

Why do I bring those numbers up? 

If the relationships between the target/scaled target and the established readability scores were strong (i.e., approx. $\pm$.8), it would indicate the ratings created by the [expert reviewers](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423#1360081) were superfluous: why bother creating a new rating system if it provides the same results as previous ones?

Besides, the competition hosts stated the ["[w]inning models will be sure to incorporate text cohesion and semantics."](https://www.kaggle.com/c/commonlitreadabilityprize/overview)
Needless to say, if the target correlated too highly with features which ignore text cohesion and semantics, well, let's just say it would be a bad look 😃
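If you'd like the correlations discussed above as a ranked list rather than squinting at a heatmap, something like the following sketch would work. The data here is synthetic and the column names (`strong`, `weak`, `noise`) are invented for the demo; `rank_correlations` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, target_col="target"):
    """Pearson correlation of every numeric column with `target_col`,
    ordered from strongest to weakest by absolute value."""
    corr = frame.select_dtypes("number").corr()[target_col].drop(target_col)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in (NOT the competition data)
rng = np.random.default_rng(42)
target = rng.normal(size=300)
toy = pd.DataFrame({
    "target": target,
    "strong": -target + rng.normal(0, 0.5, 300),   # built to correlate strongly and negatively
    "weak": 0.2 * target + rng.normal(0, 1, 300),
    "noise": rng.normal(size=300),
})
ranked = rank_correlations(toy)
print(ranked.round(2))
```

On the notebook's data, `rank_correlations(df_full.drop(columns='scaled_target'))` would list the readability metrics by how strongly they track `target`.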

# Conclusion

I could keep inspecting this dataset for weeks. In fact, I haven't even touched on the [discussion](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423#1316843) covering the validity of the ratings themselves. 

For example, [Daedalus](https://www.kaggle.com/daedalusai) has highlighted multiple excerpts which have nearly identical target scores, but don't seem to pass the eye test like this: 


In [None]:
samples = ["a666c1db9", 'b55026bd9']
df_full.loc[df_full.id.isin(samples), ['id', 'excerpt', 'target', 'scaled_target']]

What to do with this information? 

Well, I'm really not sure if issues like the one above can be solved algorithmically, which further reinforces my belief in the topic Andrew Ng recently presented on: [From Model-centric to Data-centric AI](https://www.youtube.com/watch?v=06-AZXmwHjo).

In other words, it doesn't matter how fancy the model is if the data it's training on is shaky (aka, garbage in, garbage out). 

Happy coding everyone! 