## CommonLit Readability Prize EDA + Feature Engineering

In this competition we predict the reading ease, or readability score, of excerpts from literature. **Text readability** is best defined as the ease with which a text can be read and understood, judged by the linguistic features found within it. The task is therefore to build algorithms that **rate the complexity of reading passages for grade 3-12 classroom use**.

In this post we first focus on visualizing the data at our disposal to uncover the patterns hiding in it. As we explore and analyse the data, we'll gradually incorporate feature engineering and introduce what could be some of the most important predictors of the readability score. So without further ado, let's get started!



## Table of Contents

* [Importing Libraries](#lib)
* [Reading Data](#data)
* [Exploratory Data Analysis](#eda)
    1. [Understanding Significance of URL legal and License](#eda1)
    2. [Understanding Readability Score Distribution](#eda2)
    3. [Understanding Readability Score and Standard Error Relationship](#eda3)
    4. [Understanding Linguistic Features](#eda4)
* [Feature Engineering](#fe)
    1. [Traditional Features](#fe1)
    2. [Syntactic Parse Based Features](#fe2)
    3. [POS Tag Based Features](#fe3)
    4. [Text Based Features](#tf1)

## Importing Libraries <a class="anchor" id="lib"></a>

In [14]:
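# Quiet installs: stdout is discarded, so only errors are shown
! pip install textstat > /dev/null
! pip install gensim==3.8.3 > /dev/null
! pip install pyLDAvis==2.1.2 > /dev/null
! pip install spacy==2.3.0 > /dev/null
# textacy 0.11.0 requires spacy>=3.0.0, so this install overrides the spaCy pin above
! pip install textacy > /dev/null
! python -m spacy download en_core_web_lg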


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textacy 0.11.0 requires spacy>=3.0.0, but you have spacy 2.3.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 2.3.0 requires spacy<3, but you have spacy 3.0.6 which is incompatible.
en-core-web-sm 2.3.1 requires spacy<2.4.0,>=2.3.0, but you have spacy 3.0.6 which is incompatible.
en-core-web-lg 2.3.1 requires spacy<2.4.0,>=2.3.0, but you have spacy 3.0.6 which is incompatible.
Collecting en-core-web-lg==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.0.0/en_core_web_lg-3.0.0-py3-none-any.whl (778.8 MB)

In [15]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import spacy
import textstat

from spacy import displacy
from collections import Counter

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

from sklearn import linear_model
from sklearn import model_selection
from sklearn import metrics
from sklearn.feature_extraction import text

from scipy.sparse import csc_matrix

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim

import textacy
from textacy import text_stats

pyLDAvis.enable_notebook()
nlp = spacy.load('en_core_web_lg')



## Reading Data <a class="anchor" id="data"></a>

### Column Meaning and Interpretations
**id** - unique ID for excerpt

**url_legal** - URL of source - this is blank in the test set.

**license** - license of source material - this is blank in the test set.

**excerpt** - text to predict reading ease of

**target** - reading ease

**standard_error** - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

In [3]:
df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
df.tail()

| | id | url_legal | license | excerpt | target | standard_error |
|---|---|---|---|---|---|---|
| 2829 | 25ca8f498 | https://sites.ehe.osu.edu/beyondpenguins/files... | CC BY-SA 3.0 | When you think of dinosaurs and where they liv... | 1.71139 | 0.6469 |
| 2830 | 2c26db523 | https://en.wikibooks.org/wiki/Wikijunior:The_E... | CC BY-SA 3.0 | So what is a solid? Solids are usually hard be... | 0.189476 | 0.535648 |
| 2831 | cd19e2350 | https://en.wikibooks.org/wiki/Wikijunior:The_E... | CC BY-SA 3.0 | The second state of matter we will discuss is ... | 0.255209 | 0.483866 |
| 2832 | 15e2e9e7a | https://en.wikibooks.org/wiki/Geometry_for_Ele... | CC BY-SA 3.0 | Solids are shapes that you can actually touch.... | -0.215279 | 0.514128 |
| 2833 | 5b990ba77 | https://en.wikibooks.org/wiki/Wikijunior:Biolo... | CC BY-SA 3.0 | Animals are made of many cells. They eat thing... | 0.300779 | 0.512379 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2834 entries, 0 to 2833
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              2834 non-null   object 
 1   url_legal       830 non-null    object 
 2   license         830 non-null    object 
 3   excerpt         2834 non-null   object 
 4   target          2834 non-null   float64
 5   standard_error  2834 non-null   float64
dtypes: float64(2), object(4)
memory usage: 133.0+ KB


## Exploratory Data Analysis <a class="anchor" id="eda"></a>

### Understanding Significance of URL legal and License <a class="anchor" id="eda1"></a>
**url_legal** and **license** are null for most rows (only 830 of the 2,834 training excerpts have them), and the competition description states that both fields are blank in the test set, so we can drop them from further analysis. Just to be sure that their presence has no bearing on the target variable, the readability score, let's draw violin plots and check.

In [None]:
fig = go.Figure()

fig.add_trace(go.Violin(y=df.target[(df.url_legal.notnull()) & (df.license.notnull())],
                        side='negative',
                        name='URL & License present',
                        marker_color='#00CC96'))

fig.add_trace(go.Violin(y=df.target[(df.url_legal.isnull()) & (df.license.isnull())],
                        side='positive',
                        name='URL & License not present',
                        marker_color='#FF6692'))

fig.update_traces(box_visible=True,
                  meanline_visible=True,
                  points='all')

fig.update_layout(violinmode='overlay',
                  violingap=0,
                  title={
                      'text': 'Impact of Presence of URL and license on readability score',
                      'x': 0.5,
                      'y': 0.9
                  },
                  yaxis_title='Readability Score')

fig.show()

Going by the plot, our first guess is that excerpts with a valid license and legal URL have a higher mean and median readability score (a difference of roughly 0.4). The distributions are nonetheless close to normal in both cases.
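To put numbers on the gap seen in the violin plot, the two groups can be summarised directly. This is only a minimal sketch: the helper name `summarize_by_source_info` and the toy frame are my own and hypothetical, and on the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

def summarize_by_source_info(frame: pd.DataFrame) -> pd.DataFrame:
    """Count, mean and median of target for rows with and without url_legal + license."""
    has_info = frame["url_legal"].notnull() & frame["license"].notnull()
    labels = has_info.map({True: "present", False: "absent"})
    return frame.groupby(labels)["target"].agg(["count", "mean", "median"])

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "url_legal": ["https://example.org/a", None, "https://example.org/b", None],
    "license": ["CC BY 4.0", None, "CC BY-SA 3.0", None],
    "target": [0.2, -1.0, -0.4, -1.6],
})
summary = summarize_by_source_info(toy)
print(summary)
```

The difference between the two `mean` (or `median`) entries is the figure that the plot only lets us estimate by eye.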

### Understanding Readability Score Distribution <a class="anchor" id="eda2"></a>

In [5]:
fig = ff.create_distplot([df.target.values], group_labels=['Readability Score'], bin_size=.2, colors=['lightseagreen'])
fig.update_layout(title={
                     'text': 'Distribution of Readability Score',
                     'x': 0.5,
                     'y': 0.9
                 })
fig.show()

The distribution looks fairly normal, with a **minimum of -3.67 and a maximum of 1.71**. The scores appear to be baselined against the excerpt with id **436ce79fe**, which has a readability score of 0 and a standard error of 0.

In [6]:
print(df.id[df.target == 0])
df.excerpt[df.target == 0].values

106    436ce79fe
Name: id, dtype: object




array(['The sun was shining in a cloudless sky, and no shadows lay on the mountain, and all day long they watched and waited, and at last, when the birds were singing their farewell song to the evening star, the children saw the shadows marching from the glen, trooping up the mountain side and dimming the purple of the heather.\nAnd when the mountain top gleamed like a golden spear, they fixed their eyes on the line between the shadow and the sunshine.\n"Now," said Connla, "the time has come."\n"Oh, look! look!" said Nora, and as she spoke, just above the line of shadow a door opened out, and through its portals came a little piper dressed in green and gold. He stepped down, followed by another and another, until they were nine in all, and then the door slung back again.'],
      dtype=object)

In [7]:
fig = go.Figure()
fig.add_trace(go.Violin(y=df.target,
                     box_visible=True,
                     points='all',
                     meanline_visible=True,
                     marker_color='#E377C2',
                     name='Target'))
fig.update_layout(yaxis_title='Readability Score',
                 title={
                     'text': 'Distribution of Readability Score',
                     'x': 0.5,
                     'y': 0.9
                 })
fig.show()



The mean and median are very close to each other, at around **-0.96 and -0.91** respectively. There are no obvious outliers in the reported scores, which suggests that the entries were scrutinized beforehand and that those beyond a certain standard deviation threshold were discarded.
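The "no outliers" reading comes from eyeballing the violin plot. One common way to check it numerically is Tukey's 1.5 x IQR rule, sketched below on made-up scores. The helper name `iqr_outliers` is hypothetical, and the rule is my choice of check rather than anything the competition prescribes.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up scores: five similar values and one obvious stray
toy_scores = pd.Series([-1.2, -0.9, -1.0, -0.8, -1.1, 4.0])
flagged = iqr_outliers(toy_scores)
print(flagged)
```

On the notebook's data this would be `iqr_outliers(df.target)`. With a few thousand roughly normal values, a small number of flagged points is expected by chance alone, so a short list is not a cause for concern.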

### Understanding Readability Score and Standard Error Relationship <a class="anchor" id="eda3"></a>

In [8]:
fig = px.scatter(df[df.target != 0], x='target', y='standard_error', color='standard_error')

fig.update_layout(yaxis_title='Standard Error',
                  xaxis_title='Readability Score',
                 title={
                     'text': 'Readability Score Vs Standard Error',
                     'x': 0.5,
                     'y': 0.95
                 },
                 legend_title='Standard Error')
fig.show()



The standard error tends to increase as we move toward the extremes, both for highly readable texts and for extraordinarily difficult ones. People tend to hold strong opinions on extreme cases in any field, and I guess that is what we're seeing here.
That raises an important question: should such data points be handled differently?
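If we do decide to treat noisy excerpts differently, one option is to down-weight them with inverse-variance sample weights. This is an assumption on my part, not something the competition prescribes. The baseline excerpt has a standard error of 0, so the sketch floors every error at the smallest positive value to avoid dividing by zero. The helper name `inverse_variance_weights` is hypothetical.

```python
import numpy as np

def inverse_variance_weights(standard_error):
    """Weights proportional to 1 / se^2, normalised so that their mean is 1."""
    se = np.asarray(standard_error, dtype=float)
    floor = se[se > 0].min()          # smallest positive error in the data
    se = np.clip(se, floor, None)     # zero errors are raised to that floor
    weights = 1.0 / se ** 2
    return weights / weights.mean()

# Illustrative errors; the last entry mimics the baseline excerpt
weights = inverse_variance_weights([0.45, 0.50, 0.65, 0.0])
print(weights.round(3))
```

Many scikit-learn estimators accept such an array through the `sample_weight` argument of `fit`, so the weights could be passed as `inverse_variance_weights(df.standard_error)`.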


### Understanding Linguistic Features <a class="anchor" id="eda4"></a>
Let's start by extracting and visualizing some basic linguistic features, then move toward more statistical ones.

#### Count of words and sentences

In [9]:
def get_word_count(text):
    return textstat.lexicon_count(text, removepunct=True)

df['word_count'] = df.excerpt.apply(get_word_count)
df['sentence_count'] = df.excerpt.apply(textstat.sentence_count)



In [10]:
fig = make_subplots(rows=1, cols=2)

fig.add_trace(go.Histogram(x=df['word_count'],
                           name='Word Count',
                           marker_color='#AB63FA'), row=1, col=1)

fig.add_trace(go.Histogram(x=df['sentence_count'],
                           name='Sentence Count',
                           marker_color='#FF6692'), row=1, col=2)

fig.update_layout(title={
                    'text': 'Count of Words and Sentences across Corpus',
                    'x': 0.5,
                    'y': 0.9    
                  },
                  xaxis_title='Count',
                  yaxis_title='Frequency')
fig.show()



#### Average Words per Sentence in all Excerpts

In [11]:
# Average number of words per sentence in each of the excerpts
def avg_word_count(text):
    return textstat.lexicon_count(text, removepunct=True) / textstat.sentence_count(text)

df['avg_word_count'] = df.excerpt.apply(avg_word_count)



In [12]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=df['avg_word_count'][df.target >= df.target.median()],
                           name='Readable Texts'))

fig.add_trace(go.Histogram(x=df['avg_word_count'][df.target < df.target.median()],
                           name='Difficult Texts'))

fig.update_layout(title={
                    'text': 'Average Number of Words per Sentence',
                    'x': 0.5,
                    'y': 0.9    
                  },
                  xaxis_title='Average number of words',
                  yaxis_title='Frequency')
fig.show()



There's a **visible shift** in the average number of words per sentence when we compare excerpts scored above and below the median readability score. The blue bars denote excerpts scored at or above the median, while the red ones scored below the median value of **-0.91**. The two distributions overlap considerably, so this feature alone won't separate them, but it does make sense that longer sentences are harder to process while reading than shorter ones.
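An overlay of histograms makes it hard to judge how large the shift really is. One way to quantify it is a standardised mean difference (Cohen's d) between the two halves. The sketch below uses made-up numbers, and `cohens_d` is a hypothetical helper rather than part of the notebook.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up words-per-sentence values, for illustration only
difficult = [24.0, 27.0, 21.0, 30.0]
readable = [15.0, 18.0, 12.0, 21.0]
effect = cohens_d(difficult, readable)
print(round(effect, 2))
```

On the notebook's data the two groups would be `df.avg_word_count[df.target < df.target.median()]` and `df.avg_word_count[df.target >= df.target.median()]`.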

#### Lexical Sophistication
Many traditional studies have shown a correlation between readability and the extent to which difficult or sophisticated words are used in a text. Let's see how that unfolds in our dataset.

In [16]:
# Count of Rare/Difficult words in excerpts
df['difficult_words'] = df.excerpt.apply(textstat.difficult_words)
df['difficult_word_ratio'] = df['difficult_words'] / df['word_count']



In [17]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=df['difficult_word_ratio'][df.target >= df.target.median()],
                           name='Readable Texts',
                           marker_color='#54A24B'))

fig.add_trace(go.Histogram(x=df['difficult_word_ratio'][df.target < df.target.median()],
                           name='Difficult Texts',
                           marker_color='#F58518'))

fig.update_traces(opacity=0.75)
fig.update_layout(title={
                    'text': 'Impact of Presence of Sophisticated Words',
                    'x': 0.5,
                    'y': 0.9    
                  },
                  xaxis_title='Sophisticated Word Ratio',
                  yaxis_title='Frequency',
                  barmode='overlay')
fig.show()



The distinction here is more evident than in our previous observation about sentence length. There is a clear pattern: **readable texts tend to have fewer sophisticated words than difficult texts**.
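The idea behind a difficult-word ratio can be illustrated without textstat: a word counts as difficult when it is missing from a list of familiar words. The sketch below uses a tiny hypothetical `EASY_WORDS` set. textstat relies on its own, much larger list and, as far as I know, also applies a syllable threshold, so its numbers will differ from this toy version.

```python
import re

# Hypothetical stand-in for a familiar-word list; real lists hold thousands of entries
EASY_WORDS = {"the", "cat", "sat", "on", "a", "mat", "and", "was", "it", "in"}

def difficult_word_ratio(text, easy_words=EASY_WORDS):
    """Share of word tokens that do not appear in the easy-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hard = [w for w in words if w not in easy_words]
    return len(hard) / len(words)

print(difficult_word_ratio("The cat sat on the mat."))
print(difficult_word_ratio("The feline reclined on the ottoman."))
```

The second sentence has the same structure as the first but swaps three familiar words for rarer ones, which is the signal the feature is meant to capture.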

## Feature Engineering <a class="anchor" id="fe"></a>

### Traditional Features <a class="anchor" id="fe1"></a>
Let's add more classical readability formulas as features to our dataset and visualize what value they bring to the table.

In [28]:
# Classical readability formulas from textstat
df['flesch_reading_ease'] = df.excerpt.apply(textstat.flesch_reading_ease)
df['smog_index'] = df.excerpt.apply(textstat.smog_index)
df['flesch_kincaid_grade'] = df.excerpt.apply(textstat.flesch_kincaid_grade)
df['coleman_liau_index'] = df.excerpt.apply(textstat.coleman_liau_index)
df['automated_readability_index'] = df.excerpt.apply(textstat.automated_readability_index)
df['dale_chall_readability_score'] = df.excerpt.apply(textstat.dale_chall_readability_score)
df['linsear_write_formula'] = df.excerpt.apply(textstat.linsear_write_formula)
df['gunning_fog'] = df.excerpt.apply(textstat.gunning_fog)
# float_output=True returns a numeric grade instead of a string such as "7th and 8th grade"
df['text_standard'] = df.excerpt.apply(lambda t: textstat.text_standard(t, float_output=True))
# The next four formulas were designed for Spanish text; on English excerpts treat them as heuristic signals
df['szigriszt_pazos'] = df.excerpt.apply(textstat.szigriszt_pazos)
df['fernandez_huerta'] = df.excerpt.apply(textstat.fernandez_huerta)
df['gutierrez_polini'] = df.excerpt.apply(textstat.gutierrez_polini)
df['crawford'] = df.excerpt.apply(textstat.crawford)
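
Most of these scores are closed-form functions of a few counts. As an illustration, the standard Flesch Reading Ease and Flesch-Kincaid Grade formulas are computed below from raw counts. The function names are my own, and textstat's results can differ slightly because of how it counts syllables and sentences.

```python
def flesch_reading_ease_from_counts(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher values mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade_from_counts(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Counts are illustrative, not taken from the dataset
ease = flesch_reading_ease_from_counts(n_words=100, n_sentences=5, n_syllables=130)
grade = flesch_kincaid_grade_from_counts(n_words=100, n_sentences=5, n_syllables=130)
print(round(ease, 3), round(grade, 2))
```

Both formulas use only words per sentence and syllables per word, with opposite signs, which is why they tend to be strongly correlated with each other.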



Let's visualize how the features added so far correlate with our target variable, the readability score, and with each other.

In [29]:
fig = px.scatter_matrix(df,
                        dimensions=['target', 'difficult_word_ratio', 'avg_word_count', 'dale_chall_readability_score'],
                        color='target')

fig.update_layout(title={
                    'text': 'Prominent Traditional Features Vs Readability Score',
                    'x': 0.5
                  })
fig.show()



In [None]:
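# Restrict to numeric columns: id, excerpt and other text fields cannot be correlated
px.imshow(df.select_dtypes('number').corr(), width=800, height=800, color_continuous_scale='Aggrnyl')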

There's a **strong negative correlation** between target (readability score) and ***difficult_word_ratio, dale_chall_readability_score, crawford***, among others. ***fernandez_huerta*** also seems to have a **sound positive relation** with the reported readability scores. These features are worth considering in our future analysis and models.
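Reading individual values off a large heatmap is error-prone, so it can help to rank the features by their correlation with the target. The sketch below does this on a hypothetical toy frame; `rank_by_target_correlation` is my own helper name, and Pearson correlation is simply the pandas default.

```python
import pandas as pd

def rank_by_target_correlation(frame: pd.DataFrame, target: str = "target") -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "target": [1.0, 0.5, 0.0, -0.5, -1.0],
    "difficult_word_ratio": [0.05, 0.10, 0.15, 0.20, 0.25],  # perfectly anti-correlated
    "noise": [0.3, -0.2, 0.4, -0.1, 0.3],
    "id": ["a", "b", "c", "d", "e"],                         # non-numeric, ignored
})
ranked = rank_by_target_correlation(toy)
print(ranked)
```

Sorting by absolute value keeps strongly negative features such as the difficult-word ratio at the top alongside strongly positive ones.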

Let's add a few more experimental miscellaneous linguistic features...

In [27]:
def add_misc_ling_features(df):
    for idx, row in df.iterrows():
        doc = textacy.make_spacy_doc(row['excerpt'], lang='en_core_web_lg')
        ts = text_stats.TextStats(doc)
        df.loc[idx, 'n_unique_words'] = ts.n_unique_words
        df.loc[idx, 'n_unique_words_per_sent'] = ts.n_unique_words / ts.n_sents
        df.loc[idx, 'n_chars_per_word'] = ts.n_chars / ts.n_words
        df.loc[idx, 'n_syllables'] = ts.n_syllables
        df.loc[idx, 'n_syllables_per_word'] = ts.n_syllables / ts.n_words
        df.loc[idx, 'n_syllables_per_sent'] = ts.n_syllables / ts.n_sents
        df.loc[idx, 'n_monosyllable_words'] = ts.n_monosyllable_words
        df.loc[idx, 'n_polysyllable_words'] = ts.n_polysyllable_words
        df.loc[idx, 'n_long_words'] = ts.n_long_words
        df.loc[idx, 'entropy'] = ts.entropy
        
    return df

df = add_misc_ling_features(df)
# NLP aug - augmentation
# Topics 
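
Of these columns, `entropy` is probably the least self-explanatory. My understanding is that it is the Shannon entropy, in bits, of the excerpt's word-frequency distribution. Treat that as an assumption and check textacy's documentation for the exact definition. The sketch computes the quantity on whitespace-split tokens, with `word_entropy` as a hypothetical helper.

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy in bits of the empirical word distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

varied = word_entropy("a b c d".split())        # four equally likely words
repetitive = word_entropy("a a a b".split())    # one word dominates
print(varied, round(repetitive, 3))
```

Under this definition an excerpt with 105 distinct words can reach at most log2(105), about 6.71 bits, and repetition pulls the value down, so the feature rewards varied vocabulary.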



In [25]:
df.head()



| | id | url_legal | license | excerpt | target | standard_error | word_count | sentence_count | avg_word_count | difficult_words | ... | n_unique_words | n_unique_words_per_sent | n_chars_per_word | n_syllables | n_syllables_per_word | n_syllables_per_sent | n_monosyllable_words | n_polysyllable_words | n_long_words | entropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c12129c31 | | | When the young people returned to the ballroom... | -0.340259 | 0.464009 | 179 | 11 | 16.272727 | 25 | ... | 105.0 | 7.0 | 4.0 | 215.0 | 1.0 | 14.333333 | 154.0 | 7.0 | 30.0 | 6.344274 |
| 1 | 85aa80a4c | | | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | 0.480805 | 169 | 12 | 14.083333 | 17 | ... | 111.0 | 6.9375 | 3.0 | 213.0 | 1.0 | 13.3125 | 139.0 | 7.0 | 26.0 | 6.545788 |
| 2 | b69ac6792 | | | As Roger had predicted, the snow departed as q... | -0.580118 | 0.476676 | 166 | 8 | 20.75 | 17 | ... | 117.0 | 10.636364 | 2.0 | 200.0 | 1.0 | 18.181818 | 147.0 | 3.0 | 25.0 | 6.686955 |
| 3 | dd1000b26 | | | And outside before the palace a great garden w... | -1.054013 | 0.450007 | 164 | 5 | 32.8 | 14 | ... | 112.0 | 18.666667 | 3.0 | 186.0 | 1.0 | 31.0 | 150.0 | 2.0 | 22.0 | 6.406273 |
| 4 | 37c1b32fb | | | Once upon a time there were Three Bears who li... | 0.247197 | 0.510845 | 147 | 5 | 29.4 | 1 | ... | 42.0 | 8.4 | 4.0 | 155.0 | 1.0 | 31.0 | 148.0 | 1.0 | 2.0 | 5.019556 |


### Syntactic Parse Based Features <a class="anchor" id="fe2"></a>

Now let's dive into engineering features based on syntactic dependency parsing.

In [30]:
def _as_doc(paragraph):
    # Accept either raw text or an already-parsed spaCy Doc
    return nlp(paragraph) if isinstance(paragraph, str) else paragraph


def tree_height(root):
    # Number of tokens on the longest path from this token down to a leaf
    children = list(root.children)
    if not children:
        return 1
    return 1 + max(tree_height(child) for child in children)


def get_average_height(paragraph):
    # Mean dependency-tree height over the sentences of an excerpt
    doc = _as_doc(paragraph)
    return np.mean([tree_height(sent.root) for sent in doc.sents])


def count_subtrees(root):
    # Number of tokens in this subtree that have at least one dependent
    children = list(root.children)
    if not children:
        return 0
    return 1 + sum(count_subtrees(child) for child in children)


def get_mean_subtrees(paragraph):
    # Mean number of non-leaf tokens per sentence
    doc = _as_doc(paragraph)
    return np.mean([count_subtrees(sent.root) for sent in doc.sents])


def get_averge_noun_chunks(paragraph):
    # Despite the name, this returns the total number of noun chunks;
    # the per-sentence average is derived in the next cell
    doc = _as_doc(paragraph)
    return len(list(doc.noun_chunks))


def get_noun_chunks_size(paragraph):
    # Mean noun-chunk length in tokens
    doc = _as_doc(paragraph)
    return np.mean([len(chunk) for chunk in doc.noun_chunks])
    



In [31]:
df['avg_parse_tree_height'] = df.excerpt.apply(get_average_height)
df['mean_parse_subtrees'] = df.excerpt.apply(get_mean_subtrees)
df['noun_chunks'] = df.excerpt.apply(get_averge_noun_chunks)
df['avg_noun_chunks'] = df['noun_chunks'] / df['sentence_count']
df['noun_chunk_size'] = df.excerpt.apply(get_noun_chunks_size)
df['mean_noun_chunk_size'] = df['noun_chunk_size'] / df['avg_noun_chunks']



In [32]:
fig = px.scatter_matrix(df,
                        dimensions=['target', 'avg_parse_tree_height', 'mean_parse_subtrees', 'noun_chunk_size'],
                        color='target',
                        color_continuous_scale='Plotly3')

fig.update_layout(title={
                    'text': 'Prominent Syntactically Parsed Features Vs Readability Score',
                    'x': 0.5
                  })
fig.show()



In [33]:
px.imshow(df[['target', 'avg_parse_tree_height', 'mean_parse_subtrees', 'noun_chunks', 'avg_noun_chunks', 'noun_chunk_size', 'mean_noun_chunk_size']].corr(),
          color_continuous_scale='Aggrnyl')



The following dependency-parse features show a considerable negative correlation with the readability score: ***avg_parse_tree_height, mean_parse_subtrees, noun_chunk_size***. Intuitively this makes sense: a deeper tree means a more complex sentence structure, which in turn leads to a lower readability score.
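To make the two tree features concrete, here is a sketch that applies the same recursion to a hand-built tree instead of a spaCy parse. The `Node` class is a hypothetical stand-in for a spaCy token, and the tree for "The cat sat on the mat" is drawn by hand for illustration, not produced by the parser.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a spaCy token: a word plus its dependents."""
    text: str
    children: list = field(default_factory=list)

def tree_height(root):
    """Number of tokens on the longest root-to-leaf path."""
    if not root.children:
        return 1
    return 1 + max(tree_height(child) for child in root.children)

def count_subtrees(root):
    """Number of tokens that have at least one dependent."""
    if not root.children:
        return 0
    return 1 + sum(count_subtrees(child) for child in root.children)

# Hand-built parse of "The cat sat on the mat" (illustrative, not spaCy output)
sentence = Node("sat", [
    Node("cat", [Node("The")]),
    Node("on", [Node("mat", [Node("the")])]),
])
print(tree_height(sentence), count_subtrees(sentence))  # 4 4
```

The longest chain is sat, on, mat, the, which gives a height of 4. The four tokens with dependents are sat, cat, on and mat. Adding a relative clause or a nested prepositional phrase would raise both numbers.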

### POS Tag Based Features <a class="anchor" id="fe3"></a>

Now that we have extracted several essential features from classical formulas and tree parsing, let's take a step forward and derive features from part-of-speech tagging. These features should tell us whether the proportion of nouns, proper nouns, pronouns, etc. affects the readability score at all.

Just a fun visualization of the dependency tree for one of the sentences in an excerpt!

In [34]:
doc = nlp(df.excerpt.values[1])
displacy.render(list(doc.sents)[0], style="dep")



In [35]:
def get_pos_freq_per_word(paragraph, tag):
    # Fraction of tokens (punctuation included) whose coarse POS tag equals `tag`
    doc = nlp(paragraph) if isinstance(paragraph, str) else paragraph
    pos_counter = Counter(token.pos_ for token in doc)
    total_pos_counts = sum(pos_counter.values())
    return pos_counter[tag] / total_pos_counts



In [36]:
df['nouns_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'NOUN'))
df['proper_nouns_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'PROPN'))
df['pronouns_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'PRON'))
df['adj_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'ADJ'))
df['adv_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'ADV'))
df['verbs_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'VERB'))
df['cconj_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'CCONJ'))
df['sconj_per_word'] = df.excerpt.apply(lambda x: get_pos_freq_per_word(x, 'SCONJ'))



To put all these new features into perspective, let's visualize some of the prominent ones and get a sense of how strongly they relate to the readability score.

In [37]:
fig = px.scatter_matrix(df,
                        dimensions=['target', 'nouns_per_word', 'verbs_per_word', 'pronouns_per_word'],
                        color='target',
                        color_continuous_scale='matter')

fig.update_layout(title={
                    'text': 'Prominent POS Based Features Vs Readability Score',
                    'x': 0.5
                  })
fig.show()



In [38]:
px.imshow(df[['target', 'nouns_per_word', 'proper_nouns_per_word', 'pronouns_per_word', 'adj_per_word', 'adv_per_word', 'verbs_per_word', 'cconj_per_word', 'sconj_per_word']].corr(),
          color_continuous_scale='Aggrnyl')



***pronouns_per_word*** and ***verbs_per_word*** tend to have a decent positive association with the readability score; our initial guess is that texts with a larger proportion of verbs and pronouns are easier to read than the rest. Conversely, the higher the share of adjectives and nouns (***adj_per_word*** and ***nouns_per_word***), the less readable the text becomes.
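All eight ratios come from the same tag sequence, so they can be computed in one pass instead of re-parsing each excerpt once per tag as the feature cell above does. The sketch works on a plain list of tags so that it runs without a spaCy model. `pos_ratios` is a hypothetical helper, and the example tags are written by hand rather than produced by the tagger.

```python
from collections import Counter

TAGS = ("NOUN", "PROPN", "PRON", "ADJ", "ADV", "VERB", "CCONJ", "SCONJ")

def pos_ratios(pos_tags, tags_of_interest=TAGS):
    """Share of tokens carrying each tag, computed from a single tag sequence."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    return {tag: (counts[tag] / total if total else 0.0) for tag in tags_of_interest}

# Tags as a tagger might assign them to "She quickly read the old book" (hand-written here)
tags = ["PRON", "ADV", "VERB", "DET", "ADJ", "NOUN"]
ratios = pos_ratios(tags)
print(ratios)
```

With spaCy the tag list would be `[token.pos_ for token in doc]`, and `nlp.pipe(df.excerpt)` can be used to parse the excerpts in batches so that each one is processed only once.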

In [39]:
df.to_csv('interim_df.csv')



### Text Based Features - To Be Continued <a class="anchor" id="tf1"></a>
