#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/2_2.png)

# 2.2 Text Features Construction
* [2.2.1. Sentiment scores](#2.2.1)
* [2.2.2. Analysis of Readability](#2.2.2)

---

![](images/sraf.png)

[SRAF](https://sraf.nd.edu) is a website designed to provide a central repository for programs and data used in accounting and finance research.

Background reading: [article](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2493941).
More sources for positive and negative scoring are available [here](https://www.aclweb.org/anthology/W18-3306.pdf).

In [1]:
import pandas as pd

In [2]:
data = pd.read_pickle('pickle/AnnualReports_corpus.pkl')

In [3]:
data.head()

| | report | company_name |
|---|---|---|
| ABN_AMRO_Group_(2018).pdf | babn amro bank nvabn amro group nv annual repo... | ABN_AMRO_Group |
| AGNC_Investment_(2018).pdf | bproviding private capital to the us housing m... | AGNC_Investment |
| A_G_Barr_(2018).pdf | bag barr plc nannual report nand accounts njan... | A_G_Barr |
| Aboitiz_Power_(2018).pdf | bscanned with sheet nn n n n nn n n n nn ... | Aboitiz_Power |
| Acer_(2018).pdf | bacer annual reportnnpublication date april ... | Acer |


In [4]:
report = data.report[0]

---
### 2.2.1. Sentiment scores
<a id="2.2.1"></a>

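A helper function reads a dictionary file into a list of lowercase words, one word per line.

In [5]:
def read_dictionary(path):
    # The context manager closes the file once it has been read
    with open(path, 'r') as file:
        return file.read().lower().split('\n')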
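A slightly more defensive loader could look like the sketch below. It drops blank lines (such as the one produced by a trailing newline) and returns a set, so that membership tests are fast. The name `read_dictionary_as_set` and the throwaway word list are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile


def read_dictionary_as_set(path):
    """Hypothetical variant of read_dictionary: returns a set of lowercase words."""
    with open(path, "r", encoding="utf-8") as handle:
        # Strip whitespace and skip blank lines (e.g. a trailing newline).
        return {line.strip().lower() for line in handle if line.strip()}


# Demo with a throwaway word list (illustrative data, not the real dictionary).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("ABLE\nAchieve\n\nbenefit\n")
    demo_path = tmp.name

demo_words = read_dictionary_as_set(demo_path)
os.remove(demo_path)
print(sorted(demo_words))
```

A set also removes duplicate entries, which does not change the scores because each token is only tested for membership.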

In [6]:
# Loading positive words
positive_words = read_dictionary('dictionaries/positive_words.txt')

In [7]:
#positive_words

In [8]:
# Loading negative words
negative_words = read_dictionary('dictionaries/negative_words.txt')

In [9]:
#negative_words

We create a function that counts how many times words from a given list (for example, the positive or negative words) appear in a text.

#### Score generator function

In [10]:
# Function to calculate scores
from nltk.tokenize import word_tokenize

def generate_score(text, list_to_compare):
    # Count the tokens in `text` that appear in `list_to_compare`
    lookup = set(list_to_compare)  # set membership is much faster than scanning a list
    tokens = word_tokenize(text)
    return sum(1 for word in tokens if word in lookup)
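The same count can be sketched with `collections.Counter`, which tallies every token once and then sums the tallies of the dictionary words. To keep the example self-contained it uses a simple regex tokenizer instead of `word_tokenize`; the helper names and the toy word lists are assumptions made for illustration.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def score_with_counter(text, word_list):
    """Count how many tokens of `text` appear in `word_list`."""
    counts = Counter(simple_tokenize(text))
    # set() avoids double counting if the word list contains duplicates.
    return sum(counts[word] for word in set(word_list))


toy_text = "Profit improved despite losses. Losses were limited and profit grew."
toy_positive = ["profit", "improved", "grew"]
toy_negative = ["losses", "limited"]

toy_pos_score = score_with_counter(toy_text, toy_positive)
toy_neg_score = score_with_counter(toy_text, toy_negative)
print(toy_pos_score, toy_neg_score)  # 4 3
```

Building the `Counter` once per text would also allow several dictionaries to be scored without tokenizing again.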

In [11]:
# Calculating positive score 
positive_score = generate_score(report, positive_words)
positive_score

834

In [12]:
# Calculating negative score 
negative_score = generate_score(report, negative_words)
negative_score

1401

In [13]:
# Calculating polarity score
def polarity_score(positive_score, negative_score):
    pol_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.00001)
    return pol_score

In [14]:
polarity_score(positive_score, negative_score)

-0.2536912740327012
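As a worked check of the formula, the sketch below plugs in the counts reported above (834 positive, 1401 negative) and two edge cases. The function name `polarity` and the `eps` parameter are illustrative; the constant matches the one used in `polarity_score`.

```python
def polarity(positive, negative, eps=0.00001):
    """Polarity between -1 and 1; eps guards against division by zero."""
    return (positive - negative) / ((positive + negative) + eps)


# Counts reported above for the first report.
value = polarity(834, 1401)
print(round(value, 4))  # -0.2537

# Edge cases: no sentiment words at all, and only positive words.
no_sentiment = polarity(0, 0)
only_positive = polarity(10, 0)
print(no_sentiment)             # 0.0
print(round(only_positive, 4))  # 1.0
```

A negative value means negative words outnumber positive ones; the score approaches -1 or 1 when only one kind of word occurs.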

#### Corpus length

In [15]:
# Counting total words
def total_word_count(text):
    tokens = word_tokenize(text)
    return len(tokens)

In [16]:
count_tokens = total_word_count(report)
count_tokens

106119

In [17]:
# Loading uncertainty dictionary
uncertainty_dict = read_dictionary('dictionaries/uncertainty_dictionary.txt')

In [18]:
# calculating uncertainty_score
uncertainty_score = generate_score(report, uncertainty_dict)
uncertainty_score

1728

In [19]:
# Loading constraining words
constraining_dict = read_dictionary('dictionaries/constraining_dictionary.txt')

In [20]:
# calculating constraining score
constraining_score = generate_score(report, constraining_dict)
constraining_score

686

---
### 2.2.2. Analysis of Readability
<a id="2.2.2"></a>

In [21]:
# Calculating average sentence length
# Average Sentence Length = the number of words / the number of sentences
from nltk.tokenize import word_tokenize, sent_tokenize

def average_sentence_length(text):
    sentence_list = sent_tokenize(text)
    tokens = word_tokenize(text)
    # The small constant in the denominator guards against division by zero
    average_sent_length = len(tokens) / (len(sentence_list) + 0.000001)
    # If no sentence boundaries are detected, this equals the token count
    return round(average_sent_length)

In [22]:
avg_sentence_length = average_sentence_length(report)
avg_sentence_length

106119
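The average above equals the token count (106119), which suggests that `sent_tokenize` found a single sentence in the report, probably because the cleaned text no longer contains sentence-ending punctuation. A minimal sketch of that effect follows, using a naive regex splitter as a stand-in for `sent_tokenize`; the helper names and sample strings are illustrative assumptions.

```python
import re


def naive_sentences(text):
    """Stand-in sentence splitter (assumption): split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def naive_avg_sentence_length(text):
    """Words per sentence, using whitespace-separated words."""
    words = text.split()
    sentences = naive_sentences(text)
    return len(words) / max(len(sentences), 1)


with_punct = "Revenue grew this year. Costs fell. The outlook is stable."
no_punct = "revenue grew this year costs fell the outlook is stable"

avg_with_punct = naive_avg_sentence_length(with_punct)
avg_no_punct = naive_avg_sentence_length(no_punct)
print(len(naive_sentences(with_punct)), round(avg_with_punct, 2))  # 3 3.33
print(len(naive_sentences(no_punct)), avg_no_punct)                # 1 10.0
```

Computing the sentence length on text that still has its punctuation would avoid this collapse.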

In [23]:
# Calculating the share of complex words
# Share of complex words = the number of complex words / the number of words
# (returned as a fraction between 0 and 1, not multiplied by 100)
from nltk.tokenize import word_tokenize

def percentage_complex_word(text):
    tokens = word_tokenize(text)
    complex_words = 0

    for word in tokens:
        # Words ending in 'es' or 'ed' are never counted as complex
        if word.endswith(('es', 'ed')):
            continue
        # Proxy for syllables: more than two vowels marks a complex word
        vowels = sum(1 for ch in word if ch in 'aeiou')
        if vowels > 2:
            complex_words += 1

    if not tokens:
        return 0
    return complex_words / len(tokens)

In [24]:
perc_complex_word = percentage_complex_word(report)
perc_complex_word

0.3049312564196798

In [25]:
# Calculating the Fog Index
# Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex Words)
# Note: the standard Gunning fog formula expects the percentage on a 0-100 scale,
# while percentage_complex_word returns a fraction between 0 and 1
def calculate_fog_index(averageSentenceLength, percentageComplexWord):
    fogIndex = 0.4 * (averageSentenceLength + percentageComplexWord)
    return round(fogIndex, 3)

In [26]:
fog_index = calculate_fog_index(avg_sentence_length, perc_complex_word)
fog_index

42447.722
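The fog index above is dominated by the average sentence length, which for this report equals the token count. For comparison, here is a sketch of the standard Gunning fog formula with the complex-word share expressed as a percentage (0 to 100). The function name `gunning_fog` and the input values are illustrative assumptions, not results from the corpus.

```python
def gunning_fog(avg_sentence_length, complex_word_fraction):
    """Gunning fog with the complex-word share converted from a 0-1 fraction to a percentage."""
    return 0.4 * (avg_sentence_length + 100.0 * complex_word_fraction)


# Illustrative inputs (assumed): 20 words per sentence, 15% complex words.
toy_fog = gunning_fog(20, 0.15)
print(round(toy_fog, 3))  # 14.0
```

With realistic sentence lengths the index is usually read as the approximate years of schooling needed to follow the text.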

---
#### *There are many sources on readability analysis; some examples:*
- Readability Index at [geeksforgeeks.org](https://www.geeksforgeeks.org/readability-index-pythonnlp/) and [tutorialspoint.com](https://www.tutorialspoint.com/readability-index-in-python-nlp)
- Automated readability assessment [Wikipedia](https://en.wikipedia.org/wiki/Readability#The_Golub_Syntactic_Density_Score) and [Article](https://www.aclweb.org/anthology/R15-1014.pdf)

The remaining step is to apply the functions created above to every report in the corpus and add the results to the dataset as new columns.

In [27]:
import time
start_time = time.time()
# A previous run of this cell took about 351 seconds

data['count_tokens'] = data.report.apply(total_word_count)
data['average_sentence_length'] = data.report.apply(average_sentence_length)
data['percentage_complex_word'] = data.report.apply(percentage_complex_word)
data['positive_score'] = data.report.apply(lambda x: generate_score(x, positive_words))
data['negative_score'] = data.report.apply(lambda x: generate_score(x, negative_words))
data['uncertainty_score'] = data.report.apply(lambda x: generate_score(x, uncertainty_dict))
data['constraining_score'] = data.report.apply(lambda x: generate_score(x, constraining_dict))

print("--- %s seconds ---" % (time.time() - start_time))

--- 351.0015389919281 seconds ---
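Part of the running time is likely spent on repeated work: the cell above tokenizes each report once per feature. A possible alternative, sketched below, tokenizes each text once and derives the token count and all dictionary scores from the same token list. The regex tokenizer, the function names, and the toy data are assumptions for illustration; in the notebook the tokens would come from `word_tokenize`.

```python
import re


def tokenize_once(text):
    """Stand-in tokenizer (assumption); the notebook uses nltk's word_tokenize."""
    return re.findall(r"[a-z]+", text.lower())


def text_features(text, dictionaries):
    """Compute the token count and one score per dictionary from a single tokenization."""
    tokens = tokenize_once(text)
    features = {"count_tokens": len(tokens)}
    for name, words in dictionaries.items():
        lookup = set(words)  # fast membership tests
        features[name + "_score"] = sum(1 for tok in tokens if tok in lookup)
    return features


# Toy dictionaries and reports (illustrative data only).
toy_dictionaries = {
    "positive": ["gain", "strong"],
    "negative": ["loss", "weak"],
}
toy_reports = {
    "A": "strong gain offsets a weak quarter",
    "B": "loss after loss and a weak outlook",
}

toy_table = {name: text_features(text, toy_dictionaries) for name, text in toy_reports.items()}
for name, row in toy_table.items():
    print(name, row)
```

Each resulting dictionary holds one row of features, so the rows could be collected into new columns of `data` in a single pass over `data.report`.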


In [28]:
data.shape

(92, 9)

---
### Dataset with text features

In [29]:
data.head()

| | report | company_name | count_tokens | average_sentence_length | percentage_complex_word | positive_score | negative_score | uncertainty_score | constraining_score |
|---|---|---|---|---|---|---|---|---|---|
| ABN_AMRO_Group_(2018).pdf | babn amro bank nvabn amro group nv annual repo... | ABN_AMRO_Group | 106119 | 106119 | 0.304931 | 834 | 1401 | 1728 | 686 |
| AGNC_Investment_(2018).pdf | bproviding private capital to the us housing m... | AGNC_Investment | 47122 | 47122 | 0.301091 | 297 | 858 | 963 | 412 |
| A_G_Barr_(2018).pdf | bag barr plc nannual report nand accounts njan... | A_G_Barr | 54311 | 54311 | 0.272947 | 662 | 461 | 457 | 230 |
| Aboitiz_Power_(2018).pdf | bscanned with sheet nn n n n nn n n n nn ... | Aboitiz_Power | 68940 | 68940 | 0.261053 | 262 | 529 | 425 | 224 |
| Acer_(2018).pdf | bacer annual reportnnpublication date april ... | Acer | 89820 | 89820 | 0.327533 | 684 | 1134 | 572 | 443 |


In [30]:
# Drop the raw text; uncomment the last line to write the features to a CSV file
output = data.drop(columns=['report'])
#output.to_csv('datasets/table_text_features.csv', sep=',', encoding='utf-8')