# Content Extraction

This notebook highlights key terms in articles that have been determined to be "impactful". It runs as a follow-up step, after that determination has been made.

Resources:
http://vipulsharma20.blogspot.com/2017/03/sharingan-newspaper-text-and-context.html
https://github.com/vipul-sharma20/sharingan/blob/master/sharingan/summrizer/context.py
http://nltk.sourceforge.net/doc/en/ch03.html

In [6]:
import os
import sys
from pathlib import Path

# Data packages
import math
import pandas as pd
import numpy as np

#Progress bar
from tqdm import tqdm

#Counter
from collections import Counter

#Operation
import operator

#Natural Language Processing Packages
import re
import nltk

## Download Resources
nltk.download("vader_lexicon")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
from nltk.tag import PerceptronTagger
from nltk.data import find

## Machine Learning
import sklearn
import sklearn.metrics as metrics
from sklearn.feature_selection import *
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import datasets

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jadekhiev/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jadekhiev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jadekhiev/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jadekhiev/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
def importData():
    """Loads the labelled retail articles spreadsheet into a DataFrame."""
    DATA_DIR = "Data"
    #dtype = {"index": str, "title": str, "description": str, "url": str, "date": str, "Retail Relevance": str, "Economy Relevant": str, "Market moving": str}
    # Path is relative to the notebook's working directory
    RET_ARTICLES = os.path.join(DATA_DIR, "retailarticles-18-11-06.xlsx")

    df = pd.read_excel(RET_ARTICLES)
    return df

In [304]:
#def SelectFeaturesNP():
articleDf = importData()
print(articleDf.shape)

(2421, 9)


In [None]:
# Part of Speech Tagging
# Reference: https://en.wikipedia.org/wiki/Part-of-speech_tagging
tagger = PerceptronTagger()
pos_tag = tagger.tag

In [None]:
# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""

In [None]:
# Create phrase tree
chunker = nltk.RegexpParser(grammar)
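
As a rough illustration of what the chunker does with this grammar, the sketch below parses a short sentence whose part-of-speech tags are assigned by hand. The sample sentence and its tags are assumptions made for the example (a real run would take them from `pos_tag`), and the grammar is repeated so the block runs on its own without downloading any NLTK data.

```python
import nltk

# Same grammar as above, repeated so this sketch is self-contained.
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)

# Hand-assigned Penn Treebank tags (hypothetical sample, not notebook data).
tagged = [
    ("quarterly", "JJ"), ("profit", "NN"), ("fell", "VBD"),
    ("in", "IN"), ("financial", "JJ"), ("markets", "NNS"),
]

# parse() returns an nltk.Tree whose chunks are labelled NBAR / NP.
tree = chunker.parse(tagged)

# Collect the words under every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Each adjective/noun run that ends in a noun becomes one candidate phrase; the verb and the preposition fall outside every chunk.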

In [None]:
# Noun Phrase Extraction Support Functions
#from nltk.corpus import stopwords
#stopwords = stopwords.words('english')
stopwords = ["myself", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "him", "his", "himself", "she", "her", "hers", "herself", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "are", "was", "were", "been", "being", "have", "has", "had", "having", "does", "did", "doing", "the", "and", "but", "because", "until", "while", "for", "with", "about", "into", "through", "during", "before", "after", "from", "down", "out", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "nor", "not", "only", "own", "same", "than", "too", "very", "can", "will", "just", "don", "should", "now"]
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

# Generator: yields the leaves of each matching subtree, one subtree at a time
def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.label()=='NP' or t.label()=='JJ' or t.label()=='RB'):
        yield subtree.leaves()

# Lowercase, stem, then lemmatize
def normalise(word):
    """Lowercases a word, then stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem(word)
    word = lemmatizer.lemmatize(word)
    return word

# stop-words and length control
def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted

# Generator: yields one multi-word term (a list of normalised words) at a time
def get_terms(tree):
    for leaf in leaves(tree):
        term = [normalise(w) for w,t in leaf if acceptable_word(w) ]
        # Phrase only
        if len(term)>1:
            yield term
            
# Flatten phrase lists to get tokens for analysis
def flatten(npTokenList):
    """Joins each phrase (a list of words) into one space-separated string."""
    return [' '.join(phrase).rstrip() for phrase in npTokenList]

In [276]:
"""
Utility functions for filtering content
originally written by: vipul-sharma20
modifications made by: jadekhiev
"""
from nltk import tokenize
#nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

stopwords = ['$','“','”','’',"read", "Read", "Share","File", "file", "FILE","'s","i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "into", "through", "during", "before", "after", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]


def getWords(sentence):
    """
    Extracts words/tokens from a sentence
    :param sentence: (str) sentence
    :returns: list of tokens
    """
    words = word_tokenize(sentence)
    words = ([word for word in words if word.lower() not in stopwords])
    #print(words)
    return words


def getParagraphs(content):
    """
    Extracts paragraphs from the text content
    :param content: (str) text content
    :returns: list of paragraphs
    """
    paraList = content.split('\n\n')
    return paraList


def getSentences(paragraph):
    """
    Extracts sentences from a paragraph
    :param paragraph: (str) paragraph text
    :returns: tuple of (list of sentences, dict mapping index to sentence)
    """
    sentenceList = tokenize.sent_tokenize(paragraph)
    indexed = dict(enumerate(sentenceList))
    return sentenceList, indexed

In [291]:
# -*- coding: utf-8 -*-

"""
Script to extract important topics from content
originally written by: vipul-sharma20
modifications made by: jadekhiev
"""

import nltk
#nltk.download('brown')
from nltk.corpus import brown

train = brown.tagged_sents(categories='news')

# backoff regex tagging
regex_tag = nltk.RegexpTagger([
     #(r'[$][0-9]+\s[MmBbTt]\S+','DV'), #dollar value 
     (r'^[-\:]?[0-9]+(.[0-9]+)?$', 'CD'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.[\/\/]\S+', 'URL'), #URL / useless
     (r'.*', 'NN')
])

unigram_tag = nltk.UnigramTagger(train, backoff=regex_tag)
bigram_tag = nltk.BigramTagger(train, backoff=unigram_tag)
trigram_tag = nltk.TrigramTagger(train, backoff=bigram_tag)

# custom defined CFG by vipul
cfg = dict()
cfg['NNP+NNP'] = 'NNP'
cfg['NN+NN'] = 'NNI'
cfg['NNI+NN'] = 'NNI'
cfg['NNI+NNI'] = 'NNI'
cfg['JJ+JJ'] = 'JJ'
cfg['JJ+NN'] = 'NNI'
cfg['CD+CD'] = 'CD'
# combination for monetary movement e.g. quarterly profit fell [VBD]
cfg['RB+NN'] = 'NNP'
cfg['NNP+VBD'] = 'NNP'

def get_info(content):
    words = getWords(content)
    temp_tags = trigram_tag.tag(words)
    tags = re_tag(temp_tags)
    normalized = True
    while normalized:
        normalized = False
        #print("len tag: ", len(tags))
        print([tag for tag in tags])
        for i in range(0, len(tags) - 1):
            #print("i: ", i)
            tagged1 = tags[i]
            if i+1 >= len(tags) - 1:
                break
            tagged2 = tags[i+1]
            key = tagged1[1] + '+' + tagged2[1]
            pos = cfg.get(key)
            if pos:
                tags.pop(i)
                tags.pop(i)
                re_tagged = tagged1[0] + ' ' + tagged2[0]
                tags.insert(i, (re_tagged, pos))
                normalized = True

    final_context = []
    for tag in tags:
        if tag[1] == 'NNP' or tag[1] == 'NNI':
            final_context.append(tag[0])
    return final_context


def re_tag(tagged):
    new_tagged = []
    for tag in tagged:
        if tag[1] == 'NP' or tag[1] == 'NP-TL':
            new_tagged.append((tag[0], 'NNP'))
        elif tag[1][-3:] == '-TL':
            new_tagged.append((tag[0], tag[1][:-3]))
        elif tag[1][-1:] == 'S':
            new_tagged.append((tag[0], tag[1][:-1]))
        else:
            new_tagged.append((tag[0], tag[1]))
    return new_tagged
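
The backoff chain above ends in `regex_tag`, which is consulted for words the Brown-trained n-gram taggers cannot tag. The patterns are tried in order and the first match wins, so their ordering matters. The sketch below imitates that first-match behaviour with plain `re`; it is an approximation for illustration, not NLTK's implementation, and `first_match_tag` is a hypothetical helper name.

```python
import re

# Patterns copied from regex_tag above, in the same order.
PATTERNS = [
    (r'^[-\:]?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*able$', 'JJ'),
    (r'^[A-Z].*$', 'NNP'),
    (r'.*ly$', 'RB'),
    (r'.*s$', 'NNS'),
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.[\/\/]\S+', 'URL'),
    (r'.*', 'NN'),
]

def first_match_tag(token):
    """Return the tag of the first pattern that matches the token."""
    for pattern, tag in PATTERNS:
        if re.match(pattern, token):
            return tag
    return None

for token in ["Stocks", "deficits", "closely", "swelling", "1,000", "budget"]:
    print(token, first_match_tag(token))
```

Note that "Stocks" is tagged `NNP` rather than `NNS`, because the capital-letter rule is listed before the plural rule. Any capitalised unknown word, including a sentence-initial common noun, is therefore treated as a proper noun.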

In [305]:
content = articleDf['content'].iloc[2400]
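
Before running `get_info`, it may help to see the merge step in isolation. The sketch below applies a subset of the `cfg` rules to a hand-tagged token list and collapses adjacent pairs until no rule applies. It is a simplified variant rather than the notebook's loop: `merge_adjacent` is a hypothetical name, the sample tags are assumptions, and unlike `get_info` (whose `break` condition stops before the final pair of tokens) this version also examines the last pair.

```python
# Subset of the merge rules from the cfg dict defined earlier.
RULES = {
    "NNP+NNP": "NNP",
    "NN+NN": "NNI",
    "NNI+NN": "NNI",
    "JJ+NN": "NNI",
}

def merge_adjacent(tags, rules):
    """Repeatedly merge adjacent (word, tag) pairs until no rule applies."""
    tags = list(tags)  # work on a copy
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(tags) - 1:
            (w1, t1), (w2, t2) = tags[i], tags[i + 1]
            merged = rules.get(t1 + "+" + t2)
            if merged:
                # Replace the pair with one combined token; stay at i so the
                # new token can merge again with its right-hand neighbour.
                tags[i:i + 2] = [(w1 + " " + w2, merged)]
                changed = True
            else:
                i += 1
    return tags

# Hand-tagged sample (hypothetical tags, not tagger output).
sample = [
    ("Steven", "NNP"), ("Mnuchin", "NNP"), ("increases", "VB"),
    ("interest", "NN"), ("rates", "NN"), (".", "."),
]
result = merge_adjacent(sample, RULES)

# As in get_info, keep only the merged proper-noun and noun-compound tokens.
phrases = [word for word, tag in result if tag in ("NNP", "NNI")]
print(phrases)
```

Because merging depends entirely on the tags, a mis-tagged verb such as `increases` tagged `NN` would be absorbed into a long noun run, which is one plausible reason for the over-long phrases in the output below.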

In [306]:
context = get_info(content)

[('U.S.', 'NNP'), ('debt-management', 'NN'), ('policy', 'NN'), ('creating', 'VBG'), ('ripple', 'NN'), ('effects', 'NNS-HL'), ('financial', 'JJ'), ('markets', 'NN'), ('complicating', 'VBG'), ('Federal', 'JJ'), ('Reserve', 'NN'), ('efforts', 'NN'), ('set', 'VB'), ('interest', 'NN'), ('rates', 'NN'), ('.', '.'), ('Treasury', 'NN'), ('Secretary', 'NN'), ('Steven', 'NNP'), ('Mnuchin', 'NNP'), ('increases', 'NN'), ('issuance', 'NN'), ('plug', 'NN'), ('swelling', 'NN'), ('budget', 'NN'), ('deficits', 'NN'), (',', ','), ('department', 'NN'), ('choice', 'NN'), ('maturities', 'NN'), ('unintended', 'VBD'), ('consequences', 'NN'), ('.', '.'), ('America', 'NNP'), ('fiscal', 'JJ'), ('stewards', 'NN'), ('chosen', 'VBN'), ('lean', 'JJ'), ('short', 'JJ'), (',', ','), ('ramping', 'VBG'), ('sales', 'NN'), ('bills', 'NN'), ('short-dated', 'VBD'), ('notes', 'VBZ'), ('.', '.'), ('approach', 'NN'), ('created', 'VBN'), ('distortions', 'NN'), ('funding', 'VBG'), ('markets', 'NN'), ('curbed', 'VBD'), ('Fed', 'N

In [307]:
# Keep multi-word phrases only
context = [term for term in context if len(term.split()) > 1]

print(context)

['debt-management policy', 'financial markets', 'Federal Reserve efforts', 'interest rates', 'Treasury Secretary', 'Steven Mnuchin', 'increases issuance plug swelling budget deficits', 'department choice maturities', 'fiscal stewards', 'sales bills', 'ability control key rate', 'debate size balance sheet', 'confirmation spillover effect', 'central bank', 'Treasury dependence shorter maturities', 'yield curve march', 'economic downturns', 'slow pace rate hikes', 'even inflation', 'shows signs', 'Treasury issuance', 'closely ve', 'Jeff Caughron', 'chief executive officer', 'Oklahoma City-based Baker', 'advises community banks', 'additional complications issuance', 'deluge bills', 'central banks', 'bolster currencies', 'year Treasury', 'bill sales', 'Federal deficits', 'part tax cuts', 'Donald Trump', 'issuance needs', 'Treasury officials', 'stabilize maturity nation debt load', 'coupon auction sizes maturities', 'JPMorgan Chase', 'net sales bills amount', 'Morgan Stanley', 'thing need Tr

In [308]:
content

'U.S. debt-management policy is creating ripple effects in financial markets that are complicating the Federal Reserve’s efforts to set interest rates. As Treasury Secretary Steven Mnuchin increases issuance to plug swelling budget deficits, the department’s choice of maturities has had some unintended consequences. America’s fiscal stewards have chosen to lean short, ramping up sales of bills and short-dated notes. The approach has created distortions in funding markets that have curbed the Fed’s ability to control its key rate and influenced the debate over the size of its balance sheet. Bond traders got confirmation of the spillover effect Wednesday, when the central bank tweaked how it engineers rate changes. But the impact goes further: Treasury’s dependence on shorter maturities is speeding the yield curve’s march toward inversion, a phenomenon that has signaled economic downturns. Some investors have speculated the Fed may need to slow the pace of rate hikes to keep that from ha

In [127]:
def countWords(wordList):
    from collections import Counter
    return dict(Counter(wordList))
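
Since `countWords` returns raw counts keyed on the exact string, phrases that differ only in capitalisation are counted separately. A possible refinement, sketched below under that assumption, lowercases phrases before counting and returns the most frequent ones. `top_phrases` and the sample list are hypothetical and not part of the original notebook.

```python
from collections import Counter

def top_phrases(phrases, k=3):
    """Return the k most frequent phrases, counting case-insensitively."""
    counts = Counter(p.lower() for p in phrases)
    return counts.most_common(k)

# Hypothetical sample phrases for illustration.
phrases = [
    "central banks", "Treasury issuance", "Central banks",
    "interest rates", "treasury issuance", "central banks",
]
top = top_phrases(phrases, k=2)
print(top)
```

This does not merge singular and plural forms such as "central bank" and "central banks"; that would need lemmatization of the phrase's head word.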

In [309]:
wordCount = countWords(context)

In [310]:
wordCount

{'debt-management policy': 1,
 'financial markets': 1,
 'Federal Reserve efforts': 1,
 'interest rates': 1,
 'Treasury Secretary': 1,
 'Steven Mnuchin': 1,
 'increases issuance plug swelling budget deficits': 1,
 'department choice maturities': 1,
 'fiscal stewards': 1,
 'sales bills': 1,
 'ability control key rate': 1,
 'debate size balance sheet': 1,
 'confirmation spillover effect': 1,
 'central bank': 1,
 'Treasury dependence shorter maturities': 1,
 'yield curve march': 1,
 'economic downturns': 1,
 'slow pace rate hikes': 1,
 'even inflation': 1,
 'shows signs': 1,
 'Treasury issuance': 1,
 'closely ve': 1,
 'Jeff Caughron': 1,
 'chief executive officer': 1,
 'Oklahoma City-based Baker': 1,
 'advises community banks': 1,
 'additional complications issuance': 1,
 'deluge bills': 1,
 'central banks': 2,
 'bolster currencies': 1,
 'year Treasury': 1,
 'bill sales': 1,
 'Federal deficits': 1,
 'part tax cuts': 1,
 'Donald Trump': 1,
 'issuance needs': 1,
 'Treasury officials': 1,
 's