# Identifying the most commonly used adjectives in our articles
Here we use NLTK to identify commonly used adjectives in our articles. This will be helpful in future when looking to increase the coverage of our lexicon. We can use this to find commonly used adjectives that are not contained in Loughran & McDonald's lexicon.

First we read our articles from CSV, and then drop all columns but 'text', as all we're concerned with here is the article text.

In [2]:
import pandas as pd
from IPython.display import display

articles = pd.read_csv("csv/trial2.csv")
display(articles.head())
# Drop all but 'date' column
articles = articles['text']

Unnamed: 0,Title,date,text
0,"Decathlon’s Dublin opening, tobacco battles ov...",2020-06-09,The HSE is investigating if some tobacco compa...
1,Doubling down on a good hand,2020-06-09,"From an Irish perspective, one of the big diff..."
2,Dublin most expensive place to live in euro zo...,2020-06-09,Spiralling accommodation costs have made Dubli...
3,Businesses to seek instant end to lockdown and...,2020-06-08,Ibec chief executive Danny McCoy will call for...
4,Call for zero rate VAT as stores reopen; CRH U...,2020-06-08,The Government should consider introducing a z...


We use dictionaries to count frequency of adjectives of various type. We use NLTK's pos tagger to identify the adjectives amongst the words

In [42]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# defaultdict allows us to initialise dict to 0 values
from collections import defaultdict

import re

# These disctionaries will store a count for each adjective
# adjectives
jj = defaultdict(int)
# comparative adjectives
jjr = defaultdict(int)
# superlative adjectives
jjs = defaultdict(int) 

# use regex to remove punctuation
removePunctuation = lambda s: re.sub(r'[^\w\s]','',s)

for i in range(len(articles)-1):
    
    articleText = articles[i]
    # We ensure that articleText is a string before proceeding,
    # this prevents errors with removePunctuation()
    if isinstance(articleText, str):
        articleText = removePunctuation(articleText)
        tokenizedText = word_tokenize(articleText)
        
        taggedText = nltk.pos_tag(tokenizedText)
        
        # taggetText in the form (word, tag)
        for pair in taggedText:
            if(pair[1] == 'JJ'):
                jj[pair[0]] += 1

            if(pair[1] == 'JJR'):
                jjr[pair[0]] += 1

            if(pair[1] == 'JJS'):
                jjs[pair[0]] += 1

print("Adjectives")
i = 1
for word in sorted(jj, key=jj.get, reverse=True)[:10]:
    print(i, " - ", word, jj[word])
    i += 1
    
print("\nComparative adjectives")
i = 1
for word in sorted(jjr, key=jjr.get, reverse=True)[:10]:
    print(i, " - ", word, jjr[word])
    i += 1
    
print("\nSuperlative adjectives")
i = 1
for word in sorted(jjs, key=jjs.get, reverse=True)[:10]:
    print(i, " - ", word, jjs[word])
    i += 1
        


Adjectives
1  -  economic 15
2  -  last 9
3  -  first 8
4  -  Irish 8
5  -  pandemic 8
6  -  much 8
7  -  good 8
8  -  new 7
9  -  financial 7
10  -  many 7

Comparative adjectives
1  -  more 10
2  -  costlier 2
3  -  less 2
4  -  lower 2
5  -  greener 1
6  -  easier 1
7  -  weaker 1
8  -  sharper 1
9  -  clearer 1
10  -  better 1

Superlative adjectives
1  -  latest 7
2  -  biggest 5
3  -  least 5
4  -  most 3
5  -  largest 2
6  -  worst 2
7  -  hardest 1
