
# Text Mining on Newspaper Articles Using NLTK  
### EE439 Discussion Session 27 Feb 2021  
#### Author: Kantapong Visantavarakul (https://github.com/GoodDee?tab=repositories)

## Objectives
1. Understand how text mining and data scraping work.  
2. Understand the logic behind constructing the Economic Policy Uncertainty (EPU) Index.  

**References:**  
https://www.youtube.com/watch?v=YzMA2O_v5co&ab_channel=ComputerScience  
http://www.nltk.org/book/ch01.html  
https://www.kaggle.com/stieranka/text-analysis-operations-using-nltk

In [None]:
%%HTML
<img src="./Activity_Page.png" width="800" />

In [None]:
# Install the newspaper3k and nltk packages if you have not done so.
%pip install newspaper3k
%pip install nltk


### STEP 1: Newspaper Scraping from the Web  
The objective is to fetch newspaper articles directly from the website, without copying and pasting the text into the console.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from newspaper import Article

In [None]:
# Create two article instances using Article from newspaper package
URL1 = "https://www.bangkokpost.com/business/2075355/excise-mulls-delaying-drink-tax-hike"
URL2 = "https://www.bangkokpost.com/business/2075363/tour-operators-flail-amid-industry-disruption"
article1 = Article(URL1)
article2 = Article(URL2)

In [None]:
# Download two articles
article1.download()
article2.download()

In [None]:
# Parse these two articles
article1.parse()
article2.parse()

In [None]:
# Download punkt if you have not done so. This is used for tokenizing.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Run newspaper's built-in NLP step, which extracts keywords and a summary (this needs punkt)
article1.nlp()
article2.nlp()

In [None]:
print("The author of the first article is {}".format(article1.authors[0]))
print("The author of the second article is {}".format(article2.authors[0]))

The author of the first article is Bangkok Post Public Company Limited
The author of the second article is Bangkok Post Public Company Limited


In [None]:
# Let's see what is inside this article
print(article1.text)
print(article2.text)

Beverages on display at a convenience store. The first phase of Thailand's excise tax on sugary drinks came into force on Sept 16, 2017.

The Excise Department is considering freezing the step-up hike of the excise levy placed on drinks with a sugar-based sweetener to reduce business operators' financial burdens.

The amended excise tax structure on beverages came into force in 2017, adding a levy to sugar-based sweeteners on top of the excise duty charged on beverages.

The added excise tax on sugar-based sweetener gradually increases over four phases, intended to help entrepreneurs adjust to the higher tax burden, said Lavaron Sangsnit, director-general of the Excise Department.

Thailand has applied the new excise tax on sugary drinks, cigarettes, alcoholic beverages and imported wine since Sept 16, 2017. The levy on sugary drinks is capped at 20%, with beverages containing more sugar carrying a larger tax burden than less sweetened beverages.

The new rates rise on a gradual basis 

In [None]:
# You can output the summary if you want to
print(article1.summary)
print(article2.summary)

The first phase of Thailand's excise tax on sugary drinks came into force on Sept 16, 2017.
The added excise tax on sugar-based sweetener gradually increases over four phases, intended to help entrepreneurs adjust to the higher tax burden, said Lavaron Sangsnit, director-general of the Excise Department.
Thailand has applied the new excise tax on sugary drinks, cigarettes, alcoholic beverages and imported wine since Sept 16, 2017.
The Excise Department could postpone the third-phase implementation, but the length of this deferment has not been finalised, he said.
The department opted instead to increase tax collection efficiency to meet its excise tax collection target of 550 billion baht for fiscal 2021, he said.
With tight border controls, the outbound market also experienced a nosedive, from 11 million trips in 2019 to only 1 million last year.
Apart from retail tour packages, Mr Thanapol also set up Go Holiday Tour in March 2006 as a one-stop wholesale travel agent to offer tourism

In [None]:
# Guess: what is the type of article2.text? (A. str  B. list)
type(article2.text)

str

### STEP 2: Tokenize the text that we scraped into words that can be used later  
The objective is to split each article into words so that we can process them further and compute the Economic Policy Uncertainty (EPU) Index in the final step.

In [None]:
# Tokenize the text into sentences
tokenized_text1=sent_tokenize(article1.text)
tokenized_text2=sent_tokenize(article2.text)
print(tokenized_text2[0])

Thanapol Cheewarattanaporn, president of the Association of Domestic Travel.


In [None]:
# Tokenize the text into words (that is what we want!), keep only alphabetic tokens
# (this drops punctuation and numbers), and convert them to lowercase
tokenized_word1 = word_tokenize(article1.text)
words_1 = [word.lower() for word in tokenized_word1 if word.isalpha()]

tokenized_word2 = word_tokenize(article2.text)
words_2 = [word.lower() for word in tokenized_word2 if word.isalpha()]
print(words_2)

['thanapol', 'cheewarattanaporn', 'president', 'of', 'the', 'association', 'of', 'domestic', 'travel', 'the', 'global', 'pandemic', 'marks', 'the', 'most', 'severe', 'crisis', 'in', 'years', 'for', 'thai', 'tourism', 'as', 'the', 'number', 'of', 'international', 'arrivals', 'bottomed', 'out', 'at', 'million', 'last', 'year', 'compared', 'with', 'nearly', 'million', 'in', 'with', 'tight', 'border', 'controls', 'the', 'outbound', 'market', 'also', 'experienced', 'a', 'nosedive', 'from', 'million', 'trips', 'in', 'to', 'only', 'million', 'last', 'year', 'tour', 'operators', 'a', 'major', 'driver', 'in', 'the', 'local', 'economy', 'inevitably', 'struggled', 'to', 'maintain', 'business', 'after', 'the', 'virus', 'spread', 'last', 'year', 'with', 'vaccination', 'unlikely', 'to', 'save', 'tourism', 'until', 'late', 'this', 'year', 'at', 'the', 'earliest', 'thanapol', 'cheewarattanaporn', 'a', 'tourism', 'veteran', 'who', 'worked', 'in', 'the', 'industry', 'for', 'more', 'than', 'years', 'supe
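
To make the filter above concrete, here is a minimal sketch that applies the same `isalpha()` and `lower()` logic to a small hand-written token list (the tokens are hypothetical, not taken from the article). It suggests why figures vanish from the output above: `isalpha()` rejects numbers and hyphenated tokens as well as punctuation.

```python
# Hypothetical tokens, similar in shape to what a word tokenizer might return
sample_tokens = ["From", "11", "million", "trips", "in", "2019", ",", "Covid-19", "hit", "."]

# Keep purely alphabetic tokens and convert them to lowercase
clean_tokens = [tok.lower() for tok in sample_tokens if tok.isalpha()]
print(clean_tokens)
```

If numbers or hyphenated terms matter for your keyword list, this filter would need to be relaxed.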

In [None]:
# Download stopwords if you have not done so
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Take a look at stopwords
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{"it's", 'further', 'itself', 'to', 'm', "mightn't", "aren't", 'themselves', 'most', 'has', 'so', "weren't", "isn't", "couldn't", "you'd", 'then', 'those', 'y', "hadn't", 'over', 'until', 'with', 'o', 'having', 'haven', 'why', 're', 'shan', 'too', 'hadn', 'isn', 'i', 'there', 'same', 'down', 'hasn', 'each', 'this', 'hers', 'himself', 'and', 'after', 'now', 'be', "you're", 'during', 'will', 'you', 'some', "doesn't", 'mightn', 'was', 'll', 'your', 'which', 'all', 'his', 'is', 'up', 'against', 'being', 'myself', 'aren', 'by', 'about', 'mustn', 'before', 'both', 'through', 'didn', "you've", 'not', 'have', 'these', 'my', 'how', 'as', 'off', 'more', 'he', 'of', 'that', 'were', "needn't", 'our', 'did', 'don', 'but', 'or', "shan't", 'had', 'theirs', 'once', 'what', 'few', 'just', 'if', 'them', 'do', 'does', 'the', 'below', "don't", 'wasn', 'than', 'd', 'her', 'me', "wouldn't", "mustn't", 'very', 'under', 'been', "didn't", 'own', 'couldn', 'in', 'am', 'here', "haven't", 'an', "should've", 'any'

In [None]:
# Define a helper function to remove all stop words
def RemoveStopWords(tokenized_words):
    NoStopList = []
    for word in tokenized_words:
        if word not in stop_words:
            NoStopList.append(word)
    return NoStopList

In [None]:
words_1 = RemoveStopWords(words_1)
words_2 = RemoveStopWords(words_2)

print(words_2)

['thanapol', 'cheewarattanaporn', 'president', 'association', 'domestic', 'travel', 'global', 'pandemic', 'marks', 'severe', 'crisis', 'years', 'thai', 'tourism', 'number', 'international', 'arrivals', 'bottomed', 'million', 'last', 'year', 'compared', 'nearly', 'million', 'tight', 'border', 'controls', 'outbound', 'market', 'also', 'experienced', 'nosedive', 'million', 'trips', 'million', 'last', 'year', 'tour', 'operators', 'major', 'driver', 'local', 'economy', 'inevitably', 'struggled', 'maintain', 'business', 'virus', 'spread', 'last', 'year', 'vaccination', 'unlikely', 'save', 'tourism', 'late', 'year', 'earliest', 'thanapol', 'cheewarattanaporn', 'tourism', 'veteran', 'worked', 'industry', 'years', 'supervising', 'number', 'tour', 'companies', 'said', 'everything', 'razed', 'ground', 'confronting', 'unprecedented', 'global', 'crisis', 'working', 'tour', 'guide', 'visitors', 'mainland', 'several', 'years', 'took', 'general', 'manager', 'post', 'quality', 'express', 'family', 'bus
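
The helper above can also be written as a list comprehension. The sketch below is one possible equivalent; `remove_stop_words` and `toy_stop_words` are hypothetical names, and the tiny stop-word set stands in for NLTK's full English list so that the snippet runs on its own.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's full English list
toy_stop_words = {"of", "the", "in", "for", "as"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]

sample = ["president", "of", "the", "association", "of", "domestic", "travel"]
print(remove_stop_words(sample, toy_stop_words))
```

Passing the stop-word collection as an argument, rather than reading a global, makes the function easier to reuse with a custom list.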

In [None]:
# Stemming reduces the various forms of a word to a single one by stripping it down to its root.
# For example, 'inflation' -> 'inflat' (a stem need not be a real word).

def Stemmer(tokenized_words):
    ps = PorterStemmer()
    StemmedList = []
    for word in tokenized_words:
        StemmedList.append(ps.stem(word))
    return StemmedList

In [None]:
words_1_stemmed = Stemmer(words_1)
words_2_stemmed = Stemmer(words_2)

In [None]:
print(words_2_stemmed)

['thanapol', 'cheewarattanaporn', 'presid', 'associ', 'domest', 'travel', 'global', 'pandem', 'mark', 'sever', 'crisi', 'year', 'thai', 'tourism', 'number', 'intern', 'arriv', 'bottom', 'million', 'last', 'year', 'compar', 'nearli', 'million', 'tight', 'border', 'control', 'outbound', 'market', 'also', 'experienc', 'nosed', 'million', 'trip', 'million', 'last', 'year', 'tour', 'oper', 'major', 'driver', 'local', 'economi', 'inevit', 'struggl', 'maintain', 'busi', 'viru', 'spread', 'last', 'year', 'vaccin', 'unlik', 'save', 'tourism', 'late', 'year', 'earliest', 'thanapol', 'cheewarattanaporn', 'tourism', 'veteran', 'work', 'industri', 'year', 'supervis', 'number', 'tour', 'compani', 'said', 'everyth', 'raze', 'ground', 'confront', 'unpreced', 'global', 'crisi', 'work', 'tour', 'guid', 'visitor', 'mainland', 'sever', 'year', 'took', 'gener', 'manag', 'post', 'qualiti', 'express', 'famili', 'busi', 'establish', 'novemb', 'conduct', 'inbound', 'outbound', 'tour', 'qualiti', 'express', 'wi

**Other techniques can do better than stemming. Lemmatization, for example, maps "better" to "good", but it requires a dictionary look-up and part-of-speech tagging first. This leads to the classic trade-off between accuracy and speed (as discussed last week).**
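
The difference can be illustrated with a toy sketch. This is neither the Porter algorithm nor a WordNet lemmatizer: `toy_stem`, `toy_lemmatize`, the suffix rules and the small dictionary are all hypothetical, chosen only to show rule-based stripping next to a dictionary look-up.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet
TOY_SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix rules
TOY_LEMMAS = {"better": "good", "went": "go", "mice": "mouse"}  # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix; cheap, but blind to irregular forms."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["better", "went", "taxes", "spending"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function needs no data but cannot relate "better" to "good"; the look-up can, but only for words its dictionary covers.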

### STEP 3: Generate a toy example on Economic Policy Uncertainty Index (EPU)  
The purpose of this final step is to generate a True or False label for each news article. A True label indicates that the article is flagged as related to economic policy uncertainty. When the index is computed from these labels, a higher value (after normalization) indicates a higher level of uncertainty as perceived by journalists (and, by proxy, the public).
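
As a rough sketch of how such labels could feed an index, assume a hypothetical table of monthly counts: for each month, take the share of articles labeled True, then rescale the series so that its mean is 100. This is a simplification of the procedure in Baker, Bloom and Davis (2016), which also standardizes each newspaper's series before averaging. All numbers below are made up for illustration.

```python
# Hypothetical monthly data: (month, EPU-flagged articles, total articles)
monthly_counts = [
    ("2021-01", 12, 400),
    ("2021-02", 30, 500),
    ("2021-03", 18, 450),
]

# Share of flagged articles in each month
shares = {month: flagged / total for month, flagged, total in monthly_counts}

# Rescale so that the series averages 100 over the sample
mean_share = sum(shares.values()) / len(shares)
epu_index = {month: 100 * share / mean_share for month, share in shares.items()}

for month, value in epu_index.items():
    print(month, round(value, 1))
```

Dividing by the total number of articles keeps a month with heavier news coverage from looking more uncertain simply because more articles were published.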

In [None]:
# P-tags are words related to policy. This directly captures policy-related variables. 
# E-tags are words about economics.
# U-tags are words about uncertainties.
P_Tags = ['bank', 'credit', 'covid', 'government', 'tax', 'spend']
E_Tags = ['economy']
U_Tags = ['uncertain']

In [None]:
def CheckEPU(word_list, P_tags, E_tags, U_tags):
    
    P_tags = Stemmer(P_tags)
    E_tags = Stemmer(E_tags)
    U_tags = Stemmer(U_tags)
    
    P_checked, E_checked, U_checked = False, False, False
    
    # For each of the P, E and U tag lists, check whether the article contains at least one of its terms
    for word in word_list:
        if word in P_tags:
            P_checked = True
            break
            
    for word in word_list:
        if word in E_tags:
            E_checked = True
            break
            
    for word in word_list:
        if word in U_tags:
            U_checked = True
            break
    
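    # As written, a match in ANY one category (P, E or U) is enough to return True.
    # Note: Baker et al. (2016) require terms from all three categories; replace `or` with `and` for that stricter rule.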
    if P_checked or E_checked or U_checked:
        return True
    
    return False
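
The function above returns True as soon as any one category matches. Baker, Bloom and Davis (2016) instead count an article only when it contains terms from all three categories. A minimal sketch of that stricter rule follows; `check_epu_strict` is a hypothetical helper, the word lists are made up, and the tags are assumed to be stemmed in the same way as the article words.

```python
def check_epu_strict(word_list, p_tags, e_tags, u_tags):
    """Return True only if the article matches at least one tag in every category.

    Hypothetical helper; tags are assumed to be stemmed the same way as word_list.
    """
    words = set(word_list)
    # A non-empty intersection means the category has at least one match
    return all(words & set(tags) for tags in (p_tags, e_tags, u_tags))

# Made-up stemmed word lists for illustration
doc_all_three = ["govern", "tax", "economi", "uncertain", "year"]
doc_policy_only = ["tax", "drink", "year"]

p, e, u = ["tax", "govern"], ["economi"], ["uncertain"]
print(check_epu_strict(doc_all_three, p, e, u))
print(check_epu_strict(doc_policy_only, p, e, u))
```

Using set intersections avoids the three explicit loops, at the cost of building one set per call.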

In [None]:
Article1_EPU = CheckEPU(words_1_stemmed, P_Tags, E_Tags, U_Tags)
Article2_EPU = CheckEPU(words_2_stemmed, P_Tags, E_Tags, U_Tags)

print("Economic Policy Uncertainty buckets on Article 1: {}".format(str(Article1_EPU)))
print("Economic Policy Uncertainty buckets on Article 2: {}".format(str(Article2_EPU)))

Economic Policy Uncertainty buckets on Article 1: True
Economic Policy Uncertainty buckets on Article 2: True


**This notebook illustrates, in a toy setting, how Economic Policy Uncertainty can be detected in news articles. In practice, generating an accurate index requires much more effort: an audit study, a pilot study and careful selection of the P terms.**    
  
**Inspiration from**: Baker, S.R., Bloom, N. and Davis, S.J., 2016. Measuring economic policy uncertainty. *The Quarterly Journal of Economics, 131*(4), pp.1593-1636.