This notebook calculates some basic text statistics and adds them to the csv file

In [41]:
import re
import textstat
import string
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from itertools import chain

## First-time users should uncomment the lines below
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('omw-1.4')


In [42]:
### Get the data

df = pd.read_csv("../data/blogs_with_analytics.csv", sep="\t")
df.dropna(subset=["text"], inplace=True)
df.head(5)

Unnamed: 0,index,url,title,time,category,description,text,introduction,author,author_job_title,...,exit%,semantic neg score,semantic neu score,semantic pos score,semantic compound score,average_sentence_length,dale_chall,flesch,average_stopword,text_length
0,0,blog/futustories-six-reasons-pasi-left-and-cam...,FutuStories - Six reasons Pasi left – and came...,2022-09-16,Culture,"For Senior Cloud Consultant Pasi, a change can...",1. I need awesome people around me…\r\nI’d say...,"For Cloud Archtitect Pasi, a change can be as ...",Pia Hämäri,"Marketing Lead, Finland",...,0.527473,0.053,0.76,0.187,0.999,17.560976,6.88,75.84,8.707317,720
1,1,blog/foresight-methods-and-strategic-planning,Foresight methods and strategic planning in bu...,2022-09-13,Strategy,Foresight methods and strategic planning lead ...,This is where foresight methods and strategic ...,"If the past few years have taught us anything,...",Annina Antinranta,Principal Designer - Emerging Business,...,0.272727,0.02,0.849,0.131,0.9985,15.95082,7.99,45.05,7.166667,973
2,2,blog/uncertainty-in-business-volatile-market,Uncertainty in business and how to deal with it,2022-09-12,Opinion,"Future uncertainty, how to deal with uncertain...",The silver lining to all this doom and gloom i...,"Looming global threats like war, recession and...",Andreas Lindqvist,"Business Director, Futurice",...,0.571429,0.193,0.704,0.103,-0.7525,35.5,11.4,43.9,17.0,71
3,3,blog/futustories-emma-leena-heikkinens-story,FutuStories – Emma-Leena Heikkinen’s story,2022-09-01,Culture,To be leader is not naturally given. Emma-Leen...,What does your role involve?\r\nI’m a client l...,"Human connections, honesty and trust are impor...",Pia Hämäri,"Marketing Lead, Finland",...,0.672222,0.031,0.789,0.18,0.9993,19.195652,7.52,68.5,9.847826,883
4,4,blog/safe-route-uncertain-times,The Safe Route project and how it relates to d...,2022-08-26,Opinion,Good quality data used in the right way is at ...,Safe Route uses data from STRADA - a database ...,Safe Route was conceived as a new way to think...,Sonja Lakner,"Managing Director, Sweden",...,0.609524,0.065,0.711,0.224,0.9995,27.709677,9.07,39.2,14.259259,859


# Average sentence length
In this part, the average sentence length of each blog is calculated to form a new column.

The average length is calculated as follows:

-   The texts are tokenized using `nltk`'s `sent_tokenize` method
-   Tokens that consist of a number followed by a dot (for example `1.`, `2.`) are removed from the sentences
-   The sentences are cleaned further by removing every character that is not a letter, a digit or whitespace. For example, '//u003e' is converted into 'u003e'
-   The cleaned sentences are split on line breaks, and empty strings are dropped
-   Finally, the average sentence length will be calculated
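
The cleaning and averaging steps above can be traced on a toy example. This is a minimal sketch rather than the notebook's exact pipeline: the sentences are assumed to be already split (in the notebook that is done by `sent_tokenize`), the extra split on line breaks is left out, and `average_sentence_length` is a hypothetical helper name.

```python
import re

def average_sentence_length(sentences):
    """Mean number of words per sentence after cleaning.

    `sentences` is assumed to be already split into sentences.
    """
    lengths = []
    for sent in sentences:
        # Drop list markers such as "1." or "2."
        sent = re.sub(r"\d+[.]", "", sent.strip())
        # Drop every character that is not a letter, a digit or whitespace
        sent = re.sub(r"[^a-zA-Z0-9\s]", "", sent)
        words = sent.split()
        if words:  # skip sentences that are empty after cleaning
            lengths.append(len(words))
    # Average the word counts (NaN if nothing is left)
    return sum(lengths) / len(lengths) if lengths else float("nan")

toy = ["1. I need awesome people around me.", "//u003e is residue, not a word!"]
print(average_sentence_length(toy))
```

Both toy sentences contain six words after cleaning. A text that is empty after cleaning yields NaN, which makes it easy to spot later.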

In [43]:
#### AVERAGE SENTENCE LENGTH IN THE TEXT
texts = df["text"].astype(str)
sents_df = [sent_tokenize(sent) for sent in texts]

sents_df = [ [re.sub(pattern=r"\d+[.]",repl="", string=sent.strip()) for sent in sent_df] for sent_df in sents_df ]
sents_df = [ [re.sub(pattern=r"[^a-zA-Z0-9\s]",repl="", string=sent) for sent in sent_df] for sent_df in sents_df ]

## Split the sentences on line breaks and filter out the empty strings
res_df = [ [ sent.strip().replace('\r', '.').replace('\n', '.').split('.') for sent in sent_df if sent != "" ] for sent_df in sents_df ]
res_df = [ [sentence.strip() for sentences in bunch for sentence in sentences if sentence != ''] for bunch in res_df ]

splitted_df = [ [ [word for word in sent.split(" ") if word != ""] for sent in res] for res in res_df ]
avg_df = [ np.mean([len(chunk) for chunk in sentences]) for sentences in splitted_df ]


# df["average_sentence_length"] = np.array(avg_df)

**_NOTE:_** There are two problematic results:

-   The blog with url 'blog/hacker-news-favorites' has some code for a table mixed in with the text, thus the average sentence length is over 300
-   The blog with url 'blog/cycleconf-2017-attracted-some-very-different-cyclists-to-stockholm-this-spring' has only `."` as its text
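
Results like these could be flagged automatically before the column is written. The sketch below is only an illustration: the threshold of 100 words is an arbitrary assumption, the sample values are made up, and `flag_suspicious` is a hypothetical helper that is not part of the notebook.

```python
import math

def flag_suspicious(urls, averages, max_words=100):
    """Return (url, reason) pairs for averages that look wrong.

    `max_words` is an arbitrary cut-off chosen for illustration.
    """
    flagged = []
    for url, avg in zip(urls, averages):
        if math.isnan(avg):
            flagged.append((url, "no usable sentences"))
        elif avg > max_words:
            flagged.append((url, "average sentence length above threshold"))
    return flagged

# Hypothetical values, not the real statistics of any blog
urls = ["blog/a", "blog/b", "blog/c"]
averages = [17.5, 312.0, float("nan")]
print(flag_suspicious(urls, averages))
```

In the notebook, the same check could be run on `df["url"]` and `avg_df`.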

# Total words in a text
This calculation reuses the per-sentence word counts from the average sentence length, but the last step uses `sum` instead of `mean`

In [44]:
sum_df = [ np.sum([len(chunk) for chunk in spliting]) for spliting in splitted_df ]
df['text_length'] = np.array(sum_df)

0       720
1       973
2        71
3       883
4       859
       ... 
780     315
781      59
782     196
783     268
784    1108
Name: text_length, Length: 785, dtype: int32

# Text readability

This will consider some readability statistics:

-   Dale-Chall readability formula (Using the new Dale-Chall formula)
-   Flesch reading ease (one of the Flesch-Kincaid readability tests)
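
For reference, both scores can be written as simple functions of a few counts. The sketch below uses the commonly published formulas and takes the counts as inputs. It is not how `textstat` is implemented: `textstat` derives the sentence, syllable and difficult-word counts from the raw text itself, so its results may differ slightly. The function names and the sample counts are assumptions made for illustration.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def new_dale_chall(words, sentences, difficult_words):
    """New Dale-Chall score: higher scores mean harder text."""
    pct_difficult = 100 * difficult_words / words
    score = 0.1579 * pct_difficult + 0.0496 * (words / sentences)
    if pct_difficult > 5:
        # Adjustment applied when more than 5% of the words are difficult
        score += 3.6365
    return score

# Hypothetical counts for a 100-word text
print(flesch_reading_ease(words=100, sentences=5, syllables=150))
print(new_dale_chall(words=100, sentences=5, difficult_words=10))
```

Note that the two scales run in opposite directions: a high Flesch reading ease means an easy text, while a high Dale-Chall score means a hard one.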

In [45]:
# Pre-allocate the score arrays; -1 marks scores that have not been computed yet
dale_chall = np.full(df.shape[0], -1, float)
flesch = np.full(df.shape[0], -1, float)

for i, text in enumerate(df.text):
    dale_chall[i] = textstat.dale_chall_readability_score(text)
    flesch[i] = textstat.flesch_reading_ease(text)


df["dale_chall"] = dale_chall
df["flesch"] = flesch

# Average stopwords per sentence
The code block below builds two arrays: the number of sentences in each text and the number of stopwords in each text

In [46]:
texts = df["text"].astype(str)
sents_length_df = np.array([len(sent_tokenize(sent)) for sent in texts])

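english_stopwords = set(stopwords.words('english'))  # build the set once instead of once per token

# Keep the tokens in a plain list; the texts differ in length, so np.array would be ragged
tokens_df = [word_tokenize(text) for text in texts]
stopwords_df = np.array([len([w for w in tokens if w in english_stopwords]) for tokens in tokens_df])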



After building the arrays, we use `numpy`'s `divide` function to compute the desired column. The computation is split into two blocks so that the array-generating code does not have to be rerun if something needs to be changed.

In [None]:
df["average_stopword"] = np.divide(stopwords_df, sents_length_df)
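
The ratio can also be traced on a toy input. The sketch below is illustrative only: the small `STOPWORDS` set stands in for `stopwords.words('english')`, the token list is assumed to be already produced by a tokenizer, and `average_stopwords` is a hypothetical helper. Unlike a plain division, it returns NaN for a text without sentences.

```python
# Tiny stand-in for NLTK's English stopword list (assumption for this sketch)
STOPWORDS = {"the", "is", "a", "to", "and", "of", "in"}

def average_stopwords(tokens, n_sentences):
    """Stopwords per sentence for one text; NaN if the text has no sentences."""
    if n_sentences == 0:
        return float("nan")
    n_stopwords = sum(1 for token in tokens if token in STOPWORDS)
    return n_stopwords / n_sentences

tokens = ["The", "route", "is", "safe", ".", "Data", "is", "the", "key", "."]
print(average_stopwords(tokens, n_sentences=2))
```

The match is case-sensitive, as in the notebook's code, so the capitalised "The" is not counted. Lowercasing the tokens first would change the counts.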

Now, let's write the generated statistics to the csv file

In [None]:
df.to_csv("../data/blogs_with_analytics.csv", sep="\t", index=False)