# Text Mining on Newspaper Articles Using NLTK  
### EE439 Discussion Session 27 Feb 2021  
#### Author: Kantapong Visantavarakul (https://github.com/GoodDee?tab=repositories)

## Objectives
1. Understand how text mining and data scraping work.  
2. Understand the logic behind constructing the Economic Policy Uncertainty (EPU) Index.  
3. References:  
https://www.youtube.com/watch?v=YzMA2O_v5co&ab_channel=ComputerScience  
http://www.nltk.org/book/ch01.html  
https://www.kaggle.com/stieranka/text-analysis-operations-using-nltk

In [6]:
%%HTML
<img src="./Activity_Page.png" width="800" />

In [None]:
# Install the newspaper3k and nltk packages if you have not done so.
#pip install newspaper3k
#pip install nltk

### STEP 1: Newspaper Scraping from the Web  
The objective is to fetch newspaper articles directly from the web, without copying and pasting the text into the console.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from newspaper import Article

In [None]:
# Create two article instances using Article from newspaper package
URL1 = ""
URL2 = ""
article1 = Article(URL1)
article2 = Article(URL2)

In [None]:
# Download two articles
article1.download()
article2.download()

In [None]:
# Parse these two articles
article1.parse()
article2.parse()

In [None]:
# Download punkt if you have not done so. This is used for tokenizing.
#nltk.download('punkt')

In [None]:
# Run newspaper's built-in NLP step, which fills in article.keywords and article.summary
article1.nlp()
article2.nlp()

In [None]:
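# authors is a list and may be empty if no byline was detected
print("The author of the first article is {}".format(article1.authors[0] if article1.authors else "unknown"))
print("The author of the second article is {}".format(article2.authors[0] if article2.authors else "unknown"))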

In [None]:
# Let's see what is inside this article
print(article1.text)
print(article2.text)

In [None]:
# You can output the summary if you want to
print(article1.summary)
print(article2.summary)

In [None]:
# Guess: what is the type of article2.text? (A. string  B. list)
type(article2.text)

### STEP 2: Tokenize the text that we scraped into words that can be used later  
The objective is to split each article into words so that we can process them further and compute the Economic Policy Uncertainty (EPU) Index in the final step.

In [None]:
# Tokenize the text into sentences
tokenized_text1=sent_tokenize(article1.text)
tokenized_text2=sent_tokenize(article2.text)
print(tokenized_text2[0])

In [None]:
# Tokenize the text into words (that is what we want!), drop punctuation and other non-alphabetic tokens, and lowercase everything
tokenized_word1=word_tokenize(article1.text)
words_1 =[word.lower() for word in tokenized_word1 if word.isalpha()]

tokenized_word2=word_tokenize(article2.text)
words_2 =[word.lower() for word in tokenized_word2 if word.isalpha()]
print(words_2)
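
To see what the `isalpha()` filter in the cell above keeps and drops, here is a small sketch on a hand-written token list. The tokens are made up for illustration; they are what a tokenizer might plausibly return, not actual output from `word_tokenize`.

```python
# Hypothetical tokens, written by hand for illustration only.
sample_tokens = ["The", "Fed", "raised", "rates", "by", "0.25", "%", ",",
                 "COVID-19", "did", "n't", "help", "."]

# Same filter as in the cell above: keep purely alphabetic tokens, lowercased.
kept = [tok.lower() for tok in sample_tokens if tok.isalpha()]
dropped = [tok for tok in sample_tokens if not tok.isalpha()]

print(kept)     # alphabetic words only
print(dropped)  # numbers, punctuation, hyphenated terms, contraction pieces
```

Note that the filter is fairly aggressive: any token containing a digit, hyphen or apostrophe is removed, which may matter if the key terms you search for later include such tokens.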

In [None]:
# Download stopwords if you have not done so
#nltk.download('stopwords')

In [None]:
# Take a look at stopwords
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

In [None]:
# Define a helper function to remove all stop words
def RemoveStopWords(tokenized_words):
    NoStopList = []
    for word in tokenized_words:
        if word not in stop_words:
            NoStopList.append(word)
    return NoStopList

In [None]:
words_1 = RemoveStopWords(words_1)
words_2 = RemoveStopWords(words_2)

print(words_2)
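
The helper above can also be written as a single list comprehension. This is a minimal sketch that uses a tiny hand-picked stop-word set (an assumption for illustration; the notebook itself uses NLTK's full English list) and a hypothetical function name `remove_stop_words`.

```python
# Tiny illustrative stop-word set; NLTK's English list is much longer.
toy_stop_words = {"the", "of", "to", "is", "a", "and"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "cost", "of", "borrowing", "is", "rising"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)  # ['cost', 'borrowing', 'rising']
```

Passing the stop-word set as an argument, rather than reading a global variable, makes the function easier to reuse with a different list.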

In [None]:
# Stemming reduces the various forms of a word to a single root form by stripping suffixes.
# For example, inflation, inflated, inflating -> inflat

def Stemmer(tokenized_words):
    ps = PorterStemmer()
    StemmedList = []
    for word in tokenized_words:
        StemmedList.append(ps.stem(word))
    return StemmedList

In [None]:
words_1_stemmed = Stemmer(words_1)
words_2_stemmed = Stemmer(words_2)

In [None]:
print(words_2_stemmed)

**Other approaches can do better than stemming. Lemmatization, for example, maps "better" to "good", which a stemmer cannot do. However, it requires a dictionary look-up and part-of-speech tagging before it can proceed. This is the classic trade-off between accuracy and performance (as I discussed last week).**
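
To make the role of the dictionary look-up and the part-of-speech tag concrete, here is a toy sketch. The lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins written for this example; they are not how NLTK's `WordNetLemmatizer` is implemented.

```python
# Toy lemma dictionary keyed by (word, part_of_speech). Hypothetical entries,
# standing in for a real lexical database such as WordNet.
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("worse", "adj"): "bad",
    ("rose", "verb"): "rise",
    ("rose", "noun"): "rose",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("better", "adj"))  # good
print(toy_lemmatize("rose", "verb"))   # rise
print(toy_lemmatize("rose", "noun"))   # rose
```

The same surface form ("rose") maps to different lemmas depending on its part of speech, which is why tagging has to happen first and why lemmatization costs more than rule-based stemming.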

### STEP 3: Generate a toy example on Economic Policy Uncertainty Index (EPU)  
The purpose of this final step is to assign a True or False label to each news article. A True label indicates that the article contains policy, economic, and uncertainty terms. In the index calculation, a higher value (after normalization) indicates a higher level of uncertainty as perceived by journalists (and, by extension, the public).

In [None]:
# P-tags are words related to policy. This directly captures policy-related variables. 
# E-tags are words about economics.
# U-tags are words about uncertainties.
P_Tags = []
E_Tags = []
U_Tags = []

In [None]:
def CheckEPU(word_list, P_tags, E_tags, U_tags):
    
    P_tags = Stemmer(P_tags)
    E_tags = Stemmer(E_tags)
    U_tags = Stemmer(U_tags)
    
    P_checked, E_checked, U_checked = False, False, False
    
    # Each of the P, E and U lists must contribute at least one term found in the article
    for word in word_list:
        if word in P_tags:
            P_checked = True
            break
            
    for word in word_list:
        if word in E_tags:
            E_checked = True
            break
            
    for word in word_list:
        if word in U_tags:
            U_checked = True
            break
    
    # If the article contains words in P, E and U tags, then return True
    if P_checked and E_checked and U_checked:
        return True
    
    return False
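
The same check can be expressed with set intersections. This is a sketch under stated assumptions: the function name `check_epu_sets` is hypothetical, and the term lists and article words below are written by hand to look like stems; they are neither the notebook's actual tags nor verified stemmer output.

```python
def check_epu_sets(stemmed_words, p_terms, e_terms, u_terms):
    """True if the article has at least one term from each of P, E and U."""
    words = set(stemmed_words)
    # A non-empty intersection is truthy; all three must be non-empty.
    return all(words & set(terms) for terms in (p_terms, e_terms, u_terms))

# Hypothetical, hand-written stem-like terms for illustration only.
p_terms = ["regul", "legisl"]
e_terms = ["econom", "economi"]
u_terms = ["uncertain", "uncertainti"]

article_a = ["new", "regul", "rais", "uncertainti", "economi"]
article_b = ["economi", "grow", "steadili"]

print(check_epu_sets(article_a, p_terms, e_terms, u_terms))  # True
print(check_epu_sets(article_b, p_terms, e_terms, u_terms))  # False
```

As with `CheckEPU`, an empty term list can never match, so the result is always False until all three lists are filled in.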

In [None]:
# CheckEPU needs the three tag lists in addition to the stemmed words
Article1_EPU = CheckEPU(words_1_stemmed, P_Tags, E_Tags, U_Tags)
Article2_EPU = CheckEPU(words_2_stemmed, P_Tags, E_Tags, U_Tags)

print("Economic Policy Uncertainty label for Article 1: {}".format(Article1_EPU))
print("Economic Policy Uncertainty label for Article 2: {}".format(Article2_EPU))
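
Once each article has a True/False label, a toy index can be formed by taking the share of flagged articles in each period and rescaling the series so that it averages 100. The sketch below is a simplification under stated assumptions: the monthly labels are invented, and the published index involves additional steps, such as combining several newspapers.

```python
# Hypothetical EPU labels per month (True = article flagged by CheckEPU).
labels_by_month = {
    "2020-01": [True, False, False, False],
    "2020-02": [True, True, False, False],
    "2020-03": [True, True, True, False],
}

# Share of flagged articles in each month.
shares = {m: sum(labels) / len(labels) for m, labels in labels_by_month.items()}

# Rescale so that the average over the sample equals 100.
mean_share = sum(shares.values()) / len(shares)
toy_index = {m: 100 * s / mean_share for m, s in shares.items()}

for month, value in toy_index.items():
    print(month, round(value, 1))
```

Dividing by the number of articles in each month keeps the index from rising simply because more articles were published.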

**This notebook illustrates, in a toy setting, how economic policy uncertainty is detected in a news article. In practice, building an accurate index takes much more effort, including audit studies, pilot studies, and careful selection of the P terms.**    
  
**Inspiration from**: Baker, S.R., Bloom, N. and Davis, S.J., 2016. Measuring economic policy uncertainty. *The Quarterly Journal of Economics, 131*(4), pp.1593-1636.