# Common Lit Kaggle Competition

## Data Dictionary

- id: unique ID for each excerpt
- url_legal: URL of the source; blank in the test set
- license: license of the source material; blank in the test set
- excerpt: text whose reading ease is to be predicted
- target: reading ease
- standard_error: measure of the spread of scores among multiple raters for each excerpt; not included in the test data

## Imports

In [11]:
import pandas as pd
import seaborn as sns
import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords


## Acquire

In [2]:
# Convert data from csv into pandas dataframe
df = pd.read_csv('train.csv')
df.head()

,id,url_legal,license,excerpt,target,standard_error
0,c12129c31,,,When the young people returned to the ballroom...,-0.340259,0.464009
1,85aa80a4c,,,"All through dinner time, Mrs. Fayre was somewh...",-0.315372,0.480805
2,b69ac6792,,,"As Roger had predicted, the snow departed as q...",-0.580118,0.476676
3,dd1000b26,,,And outside before the palace a great garden w...,-1.054013,0.450007
4,37c1b32fb,,,Once upon a time there were Three Bears who li...,0.247197,0.510845


In [3]:
# Quick summary on data types of columns and nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2834 entries, 0 to 2833
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              2834 non-null   object 
 1   url_legal       830 non-null    object 
 2   license         830 non-null    object 
 3   excerpt         2834 non-null   object 
 4   target          2834 non-null   float64
 5   standard_error  2834 non-null   float64
dtypes: float64(2), object(4)
memory usage: 133.0+ KB


In [4]:
# Shape of the entire dataframe
df.shape

(2834, 6)

In [5]:
# Total nulls per column
df.isna().sum()

id                   0
url_legal         2004
license           2004
excerpt              0
target               0
standard_error       0
dtype: int64

#### Acquire summary:
- url_legal and license are null in 2,004 of 2,834 rows (about 71%) and are blank in the test set, so we can drop both columns; they will not be used in modeling

## Prepare

#### Prepare tasks:
- set id as the index
- drop the url_legal and license columns
- clean up excerpt column by:
    - removing accented characters
    - removing special characters
    - tokenization
    - lemmatization
    - remove stopwords
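
The tokenization and stopword-removal steps listed above can be sketched without any downloads. This is a minimal illustration, not the notebook's implementation: `tokenize`, `remove_stopwords`, and the tiny `SAMPLE_STOPWORDS` set are hypothetical stand-ins for the `ToktokTokenizer` and `nltk.corpus.stopwords` objects imported earlier. Lemmatization is left out because it needs a dictionary-backed lemmatizer.

```python
# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook imports nltk.corpus.stopwords, which would normally supply it.
SAMPLE_STOPWORDS = {"a", "an", "and", "the", "to", "of", "was", "were", "there", "who"}


def tokenize(text):
    """Split already-cleaned text into tokens on whitespace."""
    return text.split()


def remove_stopwords(tokens, stopword_set=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopword_set]


cleaned = "once upon a time there were three bears"
tokens = tokenize(cleaned)
kept = remove_stopwords(tokens)
print(kept)
```

Passing the stopword set as a parameter makes it easy to swap in the full NLTK list later without changing the function body.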

In [6]:
# Set id as index
df.set_index('id', inplace = True)

In [7]:
# Drop url_legal and license columns
df.drop(columns=['url_legal', 'license'], inplace=True)

In [8]:
# Ensure drop took place
df.head()

id,excerpt,target,standard_error
c12129c31,When the young people returned to the ballroom...,-0.340259,0.464009
85aa80a4c,"All through dinner time, Mrs. Fayre was somewh...",-0.315372,0.480805
b69ac6792,"As Roger had predicted, the snow departed as q...",-0.580118,0.476676
dd1000b26,And outside before the palace a great garden w...,-1.054013,0.450007
37c1b32fb,Once upon a time there were Three Bears who li...,0.247197,0.510845


In [12]:
# Normalize
def basic_clean(string):
    """
    Take a string and apply some basic text cleaning to it:
    1. lowercase everything
    2. normalize unicode characters (strip accents, drop non-ASCII)
    3. remove anything that is not a letter, number, whitespace,
       or a single quote
    """
    lowercase = string.lower()
    # Decompose accented characters, then drop the non-ASCII remainder
    normalized = (
        unicodedata.normalize('NFKD', lowercase)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
    )
    # Keep only lowercase letters, digits, single quotes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", '', normalized)

In [13]:
df.excerpt = df.excerpt.apply(basic_clean)

In [14]:
df.head()

id,excerpt,target,standard_error
c12129c31,when the young people returned to the ballroom...,-0.340259,0.464009
85aa80a4c,all through dinner time mrs fayre was somewhat...,-0.315372,0.480805
b69ac6792,as roger had predicted the snow departed as qu...,-0.580118,0.476676
dd1000b26,and outside before the palace a great garden w...,-1.054013,0.450007
37c1b32fb,once upon a time there were three bears who li...,0.247197,0.510845
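
The effect of the cleaning on accents and punctuation is easier to see on a single string than in the truncated table. The sketch below restates `basic_clean` in compact form so it runs on its own; the sample sentence is made up for illustration and does not come from the dataset.

```python
import re
import unicodedata


def basic_clean(string):
    """Lowercase, strip accents, and remove special characters."""
    lowercase = string.lower()
    # NFKD splits "é" into "e" plus a combining accent; the ASCII
    # round trip then drops the accent and any other non-ASCII character
    normalized = (
        unicodedata.normalize("NFKD", lowercase)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # Keep only lowercase letters, digits, single quotes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", "", normalized)


sample = "Mrs. Fayre's café: 3 o'clock!"
cleaned_sample = basic_clean(sample)
print(cleaned_sample)
```

Single quotes are kept so that contractions and possessives such as `fayre's` and `o'clock` stay intact as one token.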
