This notebook calculates some basic text statistics and adds them to the CSV file.

In [None]:
import re
import textstat
import string
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from itertools import chain
import ssl

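# Work around SSL certificate errors during the NLTK downloads by falling back to an
# unverified HTTPS context (note that this disables certificate verification)
try: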
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')    
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')


In [None]:
### Get the data

df = pd.read_csv("../../data/interim/blogs_with_analytics.csv", sep="\t")
df.dropna(subset=["text"], inplace=True)
df.head(5)

# Average sentence length
This part calculates the average sentence length (in words) of each blog, to be stored as a new column.

The process to calculate the average length is as follows:

-   The texts are split into sentences using `nltk`'s `sent_tokenize` method
-   Tokens that consist of a number followed by a dot, for example 1., 2., ..., are removed from the sentences
-   The sentences are cleaned further by removing every character that is not a letter, a digit, or whitespace. For example, '//u003e' is converted into 'u003e'
-   Each sentence is split further on line breaks, and empty fragments are dropped
-   Finally, the average number of words per sentence is calculated
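
The steps above can be sketched on a single toy text. This is a simplified illustration rather than the notebook's exact pipeline: it assumes the text has already been split into sentences (the `example` list and the function name `average_sentence_length` are hypothetical), and it skips the extra split on line breaks.

```python
import re
from statistics import mean


def average_sentence_length(sentences):
    """Mean number of words per sentence after the cleaning steps."""
    lengths = []
    for sent in sentences:
        # Drop tokens made of a number followed by a dot, such as "1."
        sent = re.sub(r"\d+[.]", "", sent.strip())
        # Drop every character that is not a letter, a digit, or whitespace
        sent = re.sub(r"[^a-zA-Z0-9\s]", "", sent)
        words = sent.split()
        # Ignore sentences that are empty after cleaning
        if words:
            lengths.append(len(words))
    return mean(lengths) if lengths else 0.0


# Hypothetical sentences, shaped like the output of a sentence tokenizer
example = ["1. Install the package.", "Then run //u003e twice!", "..."]
avg = average_sentence_length(example)
```

One side effect worth keeping in mind: the pattern `\d+[.]` also matches the integer part of a decimal such as `3.14`, so only `14` survives the first substitution.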

In [None]:
#### AVERAGE SENTENCE LENGTH IN THE TEXT
texts = df["text"].astype(str)
sents_df = [sent_tokenize(sent) for sent in texts]

# Raw strings keep "\d" and "\s" from being parsed as (invalid) string escapes
sents_df = [ [re.sub(pattern=r"\d+[.]",repl="", string=sent.strip()) for sent in sent_df] for sent_df in sents_df ]
sents_df = [ [re.sub(pattern=r"[^a-zA-Z0-9\s]",repl="", string=sent) for sent in sent_df] for sent_df in sents_df ]

## Split each sentence further on line breaks
res_df = [ [ sent.strip().replace('\r', '.').replace('\n', '.').split('.') for sent in sent_df if sent != "" ] for sent_df in sents_df ]
## Flatten the fragments and filter out the strings that are empty or contain only whitespace
res_df = [ [sentence.strip() for sentences in bunch for sentence in sentences if sentence.strip() != ''] for bunch in res_df ]

## Split each sentence into words, then average the word counts per text
splitted_df = [ [ [word for word in sent.split(" ") if word != ""] for sent in res] for res in res_df ]
avg_df = [ np.mean([len(chunk) for chunk in spliting]) for spliting in splitted_df ]


# df["average_sentence_length"] = np.array(avg_df)

**_NOTE:_** There are two problematic results:

-   The blog with url 'blog/hacker-news-favorites' has some table code mixed in with the text, so its average sentence length is over 300
-   The blog with url 'blog/cycleconf-2017-attracted-some-very-different-cyclists-to-stockholm-this-spring' has `."` as its only text

# Total words in a text
This calculation reuses the per-sentence word counts from the average sentence length, but the last step uses `sum` instead of `mean`.
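
As a small illustration of how the two statistics relate, the sketch below computes both from the same nested structure. The toy data is hypothetical; it only mirrors the shape of `splitted_df` (one entry per text, each a list of sentences, each sentence a list of words).

```python
# Hypothetical data with the same shape as `splitted_df`
toy_split = [
    [["install", "the", "package"], ["then", "run", "it", "twice"]],
    [["done"]],
]

# Average sentence length: mean of the per-sentence word counts
avg_lengths = [sum(len(sent) for sent in text) / len(text) for text in toy_split]

# Total words: sum of the same per-sentence word counts
total_words = [sum(len(sent) for sent in text) for text in toy_split]
```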

In [None]:
sum_df = [ np.sum([len(chunk) for chunk in spliting]) for spliting in splitted_df ]
df['text_length'] = np.array(sum_df)

# Text readability

This section computes two readability statistics:

-   Dale-Chall readability formula (using the new Dale-Chall formula)
-   Flesch reading ease, one of the Flesch-Kincaid readability tests
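
For reference, both scores are closed-form functions of a few counts. The sketch below writes out the standard published formulas; the function names and the example counts are hypothetical, and `textstat` does its own counting of words, sentences, syllables, and difficult words, so its scores on real text may differ from a hand calculation.

```python
def flesch_reading_ease_from_counts(n_words, n_sentences, n_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def new_dale_chall_from_counts(n_words, n_sentences, n_difficult_words):
    """New Dale-Chall score: higher scores mean harder text."""
    pct_difficult = 100.0 * n_difficult_words / n_words
    score = 0.1579 * pct_difficult + 0.0496 * (n_words / n_sentences)
    # The formula adds a constant when more than 5% of the words are difficult
    if pct_difficult > 5.0:
        score += 3.6365
    return score


# Hypothetical counts for a text with 100 words and 5 sentences
flesch_score = flesch_reading_ease_from_counts(100, 5, 150)
dale_chall_score = new_dale_chall_from_counts(100, 5, 10)
```

The two scales run in opposite directions, which matters when the columns are compared later: a high Flesch score indicates easy text, while a high Dale-Chall score indicates difficult text.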

In [None]:
# Pre-allocate the score arrays (-1.0 is a placeholder that the loop overwrites)
dale_chall = np.full(df.shape[0], -1, float)
flesch = np.full(df.shape[0], -1, float)

for i, text in enumerate(df.text):
    dale_chall[i] = textstat.dale_chall_readability_score(text)
    flesch[i] = textstat.flesch_reading_ease(text)


df["dale_chall"] = dale_chall
df["flesch"] = flesch

# Average stopwords per sentence
The code block below builds two arrays: the number of sentences in each text and the number of English stopwords in each text.

In [None]:
texts = df["text"].astype(str)
sents_length_df = np.array([len(sent_tokenize(sent)) for sent in texts])

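# Build the stopword set once; calling stopwords.words() for every token is very slow
english_stopwords = set(stopwords.words('english'))
# Keep the tokens as plain lists: the rows are ragged, which np.array rejects in recent NumPy versions
tokens_df = [word_tokenize(text) for text in texts]
stopwords_df = np.array([len([w for w in tokens if w in english_stopwords]) for tokens in tokens_df])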

After getting the arrays, we use the `divide` function of `numpy` to get the desired column. The reason I split this into two blocks is to avoid rerunning the array-generating code if something needs to be changed.
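
The ratio itself can be illustrated on toy data. This is a minimal pure-Python sketch, not the notebook's code: `TOY_STOPWORDS` is a hypothetical stand-in for `nltk`'s English stopword list, whitespace splitting stands in for `word_tokenize`, and returning `0.0` for a text without sentences is an assumed design choice (an element-wise division of 0 by 0 would produce `nan` instead).

```python
# Hypothetical stand-in for nltk's English stopword list
TOY_STOPWORDS = {"the", "is", "a", "on", "it"}


def stopwords_per_sentence(sentences):
    """Number of stopword tokens divided by the number of sentences."""
    # Assumed guard: a text without sentences gets 0.0 instead of a division error
    if not sentences:
        return 0.0
    n_stopwords = sum(
        1 for sent in sentences for token in sent.split() if token in TOY_STOPWORDS
    )
    return n_stopwords / len(sentences)


ratio = stopwords_per_sentence(["the cat is on a mat", "it sleeps"])
empty_ratio = stopwords_per_sentence([])
```

The membership test is case-sensitive, as it is in the notebook's code, so a capitalised token such as `The` is not counted unless the tokens are lowercased first.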

In [None]:
df["average_stopword"] = np.divide(stopwords_df, sents_length_df)
df.head()

Now, let's add the generated statistics to the CSV file

In [None]:
# df.to_csv("../data/blogs_with_analytics.csv", sep="\t", index=False)