## Data Cleaning & Data Stats

The goal of this notebook is to clean the CommonLit readability data and compute some statistics on the excerpt text, using NLTK, regular expressions, and pandas.

Here are the various processes in this notebook:

1. Read the data
2. Clean the data
3. Get important stats about the excerpt

**Process 1: Reading the data**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/commonlitreadabilityprize/sample_submission.csv
/kaggle/input/commonlitreadabilityprize/train.csv
/kaggle/input/commonlitreadabilityprize/test.csv


Verify the data import and explore the data by printing the first 3 rows of the dataframe.

In [2]:
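# Load train.csv by its explicit path rather than relying on leftover loop variables and os.walk ordering
train_df=pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')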
train_df.head(3)

| | id | url_legal | license | excerpt | target | standard_error |
|---|---|---|---|---|---|---|
| 0 | c12129c31 | | | When the young people returned to the ballroom... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | | | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | | | As Roger had predicted, the snow departed as q... | -0.580118 | 0.476676 |


**Process 2: Clean the Data**

Convert the excerpt to lower case (so that words are matched regardless of capitalization).
Then remove the most frequently occurring words (stop words) such as 'an', 'the', and 'on'. The list of stop words can be obtained from the NLTK library. Finally, store the character length of each cleaned excerpt.

In [3]:
from nltk.corpus import stopwords
stop=set(stopwords.words('english'))

train_df['excerpt_preprocess']=train_df['excerpt'].str.lower()
train_df['excerpt_preprocess']=train_df['excerpt_preprocess'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))
train_df['excerpt_length']=train_df['excerpt_preprocess'].str.len()
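
One thing to note about the whitespace split used above: tokens keep their attached punctuation, so a token such as `the,` does not match the stop word `the` and survives the filter. The following is a minimal sketch of one way to handle this, not the notebook's method. `SAMPLE_STOP` is a small hand-written stand-in for NLTK's list, and `remove_stopwords` is a hypothetical helper name.

```python
import string

# Assumption: a tiny hand-written stop list standing in for
# stopwords.words('english') so the sketch runs without NLTK data.
SAMPLE_STOP = {"a", "an", "the", "on", "to", "was", "when"}


def remove_stopwords(text, stop=SAMPLE_STOP):
    """Lowercase text and drop tokens whose punctuation-stripped form is a stop word."""
    kept = []
    for token in text.lower().split():
        # Strip leading/trailing punctuation only for the comparison,
        # so "the," is recognised as "the" while kept tokens stay unchanged.
        core = token.strip(string.punctuation)
        if core not in stop:
            kept.append(token)
    return " ".join(kept)


print(remove_stopwords("When the cat sat on the mat, the dog left."))
```

Passing `stop=set(stopwords.words('english'))` instead of the sample set would apply the same comparison with the NLTK list.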

**Process 3: Get excerpt stats**

Inspect the preprocessed excerpts; the output footer also shows the number of rows in the dataframe (2834).

In [4]:
train_df['excerpt_preprocess']

0       young people returned ballroom, presented deci...
1       dinner time, mrs. fayre somewhat silent, eyes ...
2       roger predicted, snow departed quickly came, t...
3       outside palace great garden walled round, fill...
4       upon time three bears lived together house woo...
                              ...                        
2829    think dinosaurs lived, picture? see hot, steam...
2830    solid? solids usually hard molecules packed to...
2831    second state matter discuss liquid. solids har...
2832    solids shapes actually touch. three dimensions...
2833    animals made many cells. eat things digest ins...
Name: excerpt_preprocess, Length: 2834, dtype: object

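Get the average length of the preprocessed excerpts. Note that `excerpt_length` was computed with `str.len()`, so this is a character count, not a word count.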

In [5]:
round(train_df['excerpt_length'].mean(),0) #Avg. excerpt length

656.0
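
Because `str.len()` counts characters, the 656.0 above is a character count. If a word count is wanted instead, one option is to count whitespace-separated tokens. Below is a minimal sketch in plain Python; `char_length` and `word_count` are hypothetical helper names, and the sample string is only illustrative.

```python
def char_length(text):
    """Number of characters, which is what str.len() reports per row."""
    return len(text)


def word_count(text):
    """Number of whitespace-separated tokens."""
    return len(text.split())


sample = "young people returned ballroom"  # illustrative string, not a full excerpt
print(char_length(sample), word_count(sample))
```

In the notebook, the same idea could presumably be applied to the whole column with `train_df['excerpt_preprocess'].apply(word_count)`.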

Extract the first excerpt and compute the average sentence length, the maximum sentence length, and the maximum number of punctuation marks in a sentence. Sentence lengths are measured in characters, since `len(sent)` counts characters rather than words.

The idea is to correlate word count, sentence length, and punctuation with text complexity. The stats are computed first on the raw text and then on the preprocessed text (lowercased, with stop words removed).

In [6]:
from nltk import sent_tokenize
import re

excerpt_1 = train_df.loc[0,'excerpt']
sent_1 = sent_tokenize(excerpt_1)

max_len=0
count=0
total_length=0
punct=r"[;!:,\-']"  # punctuation marks to count; a character class replaces the alternation that listed ';' twice
max_punct_len=0

for sent in sent_1:
    punct_len=len(re.findall(punct, sent))
    if punct_len>max_punct_len:
        max_punct_len=punct_len
    if len(sent)>max_len:
        max_len=len(sent)
    total_length+=len(sent)
    count+=1
print("Average sent Length: ",round(total_length/count,1))
print("Max sent Length: ",max_len)
print("Max punct Length: ",max_punct_len)

Average sent Length:  89.3
Max sent Length:  142
Max punct Length:  4


The same stats are shown below for the preprocessed data.

In [7]:
from nltk import sent_tokenize
import re

excerpt_1 = train_df.loc[0,'excerpt_preprocess']
sent_1 = sent_tokenize(excerpt_1)

max_len=0
count=0
total_length=0
punct=r"[;!:,\-']"  # punctuation marks to count; a character class replaces the alternation that listed ';' twice
max_punct_len=0

for sent in sent_1:
    punct_len=len(re.findall(punct, sent))
    if punct_len>max_punct_len:
        max_punct_len=punct_len
    if len(sent)>max_len:
        max_len=len(sent)
    total_length+=len(sent)
    count+=1
print("Average sent Length: ",round(total_length/count,1))
print("Max sent Length: ",max_len)
print("Max punct Length: ",max_punct_len)

Average sent Length:  57.5
Max sent Length:  93
Max punct Length:  4
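
The two cells above differ only in their input column, so the loop could be wrapped in a function before running it over every excerpt. The sketch below is one possible shape, under some assumptions: `sentence_stats` and `naive_split` are hypothetical names, and `naive_split` is a crude regex splitter used only to keep the example self-contained. In the notebook, `nltk.sent_tokenize` would be passed as `splitter` instead.

```python
import re

PUNCT = re.compile(r"[;!:,\-']")  # same marks counted in the cells above


def naive_split(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_stats(text, splitter=naive_split):
    """Return average/max sentence length (in characters) and max punctuation count."""
    sentences = splitter(text)
    if not sentences:
        # Empty input: return zeros instead of dividing by zero.
        return {"avg_len": 0.0, "max_len": 0, "max_punct": 0}
    lengths = [len(s) for s in sentences]
    punct_counts = [len(PUNCT.findall(s)) for s in sentences]
    return {
        "avg_len": round(sum(lengths) / len(lengths), 1),
        "max_len": max(lengths),
        "max_punct": max(punct_counts),
    }


print(sentence_stats("Hi, there! It's a fine day."))
```

With NLTK available, `sentence_stats(excerpt, splitter=sent_tokenize)` should follow the same logic as the cells above, and applying it with `train_df['excerpt'].apply(...)` would be one way to extend the stats to all rows.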


The next steps are to compute these counts for all the excerpts and then encode the text. This data will then be fed to an ML model to rank/assess the excerpt complexity.