<h1 align=center>Media Topic Tracking</h1>
<h2 align=center>Using Natural Language Processing</h2>
<h2 align=center>and</h2>
<h2 align=center>Machine Learning</h2>

### Text Prep

Relevant news articles are identified and scraped using the news_scrape.py script.  Once scraped, they are stored in their raw form in a MongoDB collection.  This notebook does the initial cleaning of the data and stores the cleaned text back in the same MongoDB collection, alongside the other document identifiers and the raw text.

In [10]:
# Load the modules we will need
import sys
import re
import os.path
import requests
import string
import time
import pandas as pd

from os import path

from pymongo import MongoClient
from textblob import TextBlob

In [15]:
import spacy
from spacy import displacy
from pprint import pprint
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()


Generate some quick summary statistics


In [4]:
# Open MongoDB
db_client = MongoClient()

# Set up access to the collections we will need here
db_news = db_client['news_search']
db_news_content = db_news['news_content']

In [9]:
# Count all the documents in the collection and print the count
doc_count = db_news_content.count_documents({})
print(doc_count)

3795


In [11]:
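#
# Define a couple of useful functions for cleaning text
#
# alphanumeric: replace any token that contains a digit with a space
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
# punc_lower: lower-case the text and replace punctuation with spaces
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())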



In [12]:
# Count the total number of words in the corpus.
# Note: the count is taken over the raw article text, not the cleaned text.

cursor = db_news_content.find({}, {'_id':0, 'name': 1, 'text': 1, 'url' : 1, 'base_url' : 1, 'pub_date': 1})
article_count = 0
total_words = 0

for article in cursor:
    article_count += 1
    # Split the raw text on whitespace to get an approximate word count
    word_list = article['text'].split()
    # print(article['name'], len(word_list))
    total_words += len(word_list)

print(article_count, total_words)

3795 2844063


## Clean the text

The first steps in cleaning are removing punctuation, stripping numbers, and making everything lower case.
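
The steps above can also be sketched as one small, self-contained function. This is only an illustration: the helper name `clean_article_text`, its `min_len` parameter, and the sample sentence are assumptions, not part of the notebook. Unlike the notebook cell, it also collapses repeated whitespace in the returned clean text.

```python
import re
import string


# Hypothetical helper; the name and the min_len parameter are illustrative.
def clean_article_text(text, min_len=2):
    """Return (clean_text, word_list) for one raw article string."""
    # Lower-case and normalise non-breaking spaces
    text = text.lower().replace('\xa0', ' ')
    # Replace any token that contains a digit, e.g. "2020" or "covid19"
    text = re.sub(r'\w*\d\w*', ' ', text)
    # Replace ASCII punctuation with spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    words = text.split()
    # The clean text keeps every remaining word, with whitespace collapsed
    clean_text = ' '.join(words)
    # The word list additionally drops words shorter than min_len
    word_list = [w for w in words if len(w) >= min_len]
    return clean_text, word_list


clean_text, word_list = clean_article_text("In 2020, Iowa's caucus drew 170,000 voters!")
print(clean_text)  # in iowa s caucus drew voters
print(word_list)   # ['in', 'iowa', 'caucus', 'drew', 'voters']
```

Note that stripping punctuation splits contractions and possessives, which is why a stray `s` survives in the clean text and is only removed by the length filter.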

In [9]:
# First let's clean the text
cursor = db_news_content.find({}, {'_id':1, 'name': 1, 'text': 1, 'url' : 1, 'base_url' : 1, 'pub_date': 1})
db_news_sentences = db_news['sentences']
for article in list(cursor) :
    # strip tokens containing digits, remove punctuation, and lower-case
    clean_text = punc_lower(alphanumeric(article['text']))
    
    # remove any remaining tokens that contain digits
    clean_text = re.sub(r'\w*\d\w*', ' ', clean_text)
    
    # remove any remaining punctuation and non-breaking spaces
    clean_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', clean_text)
    clean_text = re.sub(r'\xa0', ' ', clean_text)
    
    # split into words and drop single-character tokens
    word_list = clean_text.split()
    word_list = [ x for x in word_list if (len(x) >= 2) ]
    db_news_content.update_one({ '_id': article['_id']}, { '$set' : { 'clean_text' : clean_text, 'word_list' : word_list }})

## Part Of Speech Filtering
Once the basic cleaning is done we do some more advanced filtering based on parts of speech and Named Entity Recognition.
Basic stop words are removed and words are converted to their lemma form.  After that we apply some simple word filters to remove
words that are effectively stop words for the domain in which we are working.
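
As a rough illustration of the filtering described above, the sketch below applies the same kind of rules to plain `(lemma, part_of_speech)` pairs instead of real spaCy tokens, so it runs without a language model. The function name `filter_tokens`, the sample pairs, and the short stop-word set are assumptions made for the example; in the notebook the lemma and tag come from `token.lemma_` and `token.pos_`, and the entity-type check and general stop-word removal are left out here.

```python
# Hypothetical stand-ins for spaCy tokens: (lemma, part_of_speech) pairs.
POS_KEEP = {'ADJ', 'ADV', 'NOUN', 'VERB'}
# Illustrative subset of domain-specific stop words
DOMAIN_STOP_WORDS = {'president', 'candidate', 'campaign'}


def filter_tokens(tagged_tokens, pos_keep=POS_KEEP,
                  stop_words=DOMAIN_STOP_WORDS, min_len=4):
    """Keep lemmas with a wanted part of speech that are not domain stop words."""
    kept = []
    for lemma, pos in tagged_tokens:
        lemma = lemma.lower()
        if pos not in pos_keep:
            continue  # drop determiners, adpositions, proper nouns, etc.
        if not lemma.isalpha() or len(lemma) < min_len:
            continue  # drop non-alphabetic and very short words
        if lemma in stop_words:
            continue  # drop words that carry no signal in this domain
        kept.append(lemma)
    return kept


tagged = [('the', 'DET'), ('candidate', 'NOUN'), ('promise', 'VERB'),
          ('affordable', 'ADJ'), ('housing', 'NOUN'), ('in', 'ADP'),
          ('Iowa', 'PROPN')]
short_text = ' '.join(filter_tokens(tagged))
print(short_text)  # promise affordable housing
```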

In [13]:
from spacy.lang.en import English
nlp = spacy.load("en_core_web_sm")

# Create a sentencizer (rule-based sentence splitter)
# with the default settings for English
sentencizer = nlp.create_pipe("sentencizer")


In [14]:
cursor = db_news_content.find({}, {'_id':1, 'text': 1, 'clean_text' : 1 })
articles = list(cursor)

'''
ADJ: adjective, e.g. big, old, green, incomprehensible, first
ADP: adposition, e.g. in, to, during
ADV: adverb, e.g. very, tomorrow, down, where, there
AUX: auxiliary, e.g. is, has (done), will (do), should (do)
CONJ: conjunction, e.g. and, or, but
CCONJ: coordinating conjunction, e.g. and, or, but
DET: determiner, e.g. a, an, the
INTJ: interjection, e.g. psst, ouch, bravo, hello
NOUN: noun, e.g. girl, cat, tree, air, beauty
NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV
PART: particle, e.g. ’s, not
PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
PROPN: proper noun, e.g. Mary, John, London, NATO, HBO
PUNCT: punctuation, e.g. ., (, ), ?
SCONJ: subordinating conjunction, e.g. if, while, that
SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB: verb, e.g. run, runs, running, eat, ate, eating
X: other, e.g. sfpksdpsxmsa
SPACE: space, e.g.
'''

'''
PERSON	People, including fictional.
NORP	Nationalities or religious or political groups.
FAC	Buildings, airports, highways, bridges, etc.
ORG	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOC	Non-GPE locations, mountain ranges, bodies of water.
PRODUCT	Objects, vehicles, foods, etc. (Not services.)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language.
DATE	Absolute or relative dates or periods.
TIME	Times smaller than a day.
PERCENT	Percentage, including “%”.
MONEY	Monetary values, including unit.
QUANTITY	Measurements, as of weight or distance.
ORDINAL	“first”, “second”, etc.
CARDINAL	Numerals that do not fall under another type.
'''

#pos_keep_list = ['ADJ', 'ADV', 'NOUN', 'PROPN', 'VERB']
pos_keep_list = ['ADJ','ADV', 'NOUN', 'VERB']
# ent_keep_list = ['PERSON', 'ORG', 'NORP', 'GPE', '']
ent_keep_list = ['PERSON', 'NORP', 'EVENT', 'PRODUCT', 'GPE', 'MONEY', 'QUANTITY', '']
'''
people_to_drop = ['tom', 'steyer', 'bernie','sanders', 'sander', 'donald', 'trump', 'elizabeth', 'warren', \
                  'joe','biden', 'mike', 'bloomberg',  \
                   'pete','buttigieg', 'amy',  'klobuchar']
                   
                   
'''

people_to_drop = ['tom', 'bernie', 'donald',  'elizabeth', 'joe',  'michael', 'pete', 'amy']

words_to_drop = ['president', 'hall', 'senator', 'vice', 'mayor', 'moines', 'democrat', 'democratic',
                 'presidential', 'south', 'bend', 'candidate', 'state', 'this', 'look', 'make',
                 'think', 'that', 'what', 'like', 'campaign', 'know' ]

for article in articles :
    word_list = []
    # Tag the parts of speech
    doc = nlp(article['clean_text'])

    # tokenize the text
    '''
    word_list = [token.lemma_  \
                 if ((token.pos_ in pos_keep_list) and \
                    (token.ent_type_ in ent_keep_list) and (token.text in nlp.vocab) \
                     and not (token.text in people_to_drop) and not (token.text in words_to_drop)) \
                 else 'location' if (token.ent_type_ == 'GPE') else '' for token in doc ]
    '''
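    # collect proper nouns separately
    prop_noun_list = [token.text for token in doc if (token.pos_ == 'PROPN') ]
    # keep the lemma of tokens that pass the POS or entity-type test and are in the vocabulary
    # Note: ent_keep_list contains '' (no entity label), so with `or` any token
    # outside a named entity passes this test regardless of its part of speech.
    word_list = [token.lemma_  \
                 if ((token.pos_ in pos_keep_list) or \
                    (token.ent_type_ in ent_keep_list)) and (token.text in nlp.vocab) else ' ' for token in doc ]
    
    # keep alphabetic words longer than three characters, then drop first names and domain stop words
    word_list = [word for word in word_list if word.isalpha() and (len(word) > 3)]
    word_list = [word for word in word_list if word not in people_to_drop]
    word_list = [word for word in word_list if word not in words_to_drop]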
    
    # make word substitutions for names
    '''
    for i, word in enumerate(word_list) :
        new_word = word
        if word == 'donald' :
            new_word = 'trump'
        elif word == 'michael' :
            new_word = 'bloomberg'
        elif word == 'elizabeth' :
            new_word = 'warren'
        elif word == 'pete' :
            new_word = 'buttigieg'
        elif word == 'joe':
            new_word = 'biden'
        elif word == 'bernie' :
            new_word = 'sanders'
        elif word == 'amy' :
            new_word = 'klobuchar'
        elif word == 'tom' :
            new_word = 'steyer'
        word_list[i] = new_word
    '''
        
    short_text = ' '.join(word_list)
    prop_noun_text = ' '.join(prop_noun_list)
    
    # store both derived fields in a single update
    db_news_content.update_one({'_id': article['_id']},
                               { '$set' : {'short_text' : short_text, 'prop_nouns' : prop_noun_text}})

## Publication Date Conversion
Finally, we convert the publication date to a datetime object for later filtering.
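
The sketch below shows one way the conversion and the later filtering could look. It assumes `pub_date` is an ISO-style string such as `2020-02-03T14:30:00Z` (the sample value is made up), and the helper names `parse_pub_date` and `date_range_query` are hypothetical. The second helper only builds a MongoDB filter document with the standard `$gte` and `$lt` operators; it does not contact a database.

```python
import datetime as dt


# Hypothetical helper: returns None instead of raising on bad input.
def parse_pub_date(pub_date):
    """Parse the date part of an ISO-style string into a datetime."""
    if not isinstance(pub_date, str):
        return None
    try:
        # Keep only the part before 'T', i.e. the calendar date
        return dt.datetime.strptime(pub_date.split('T')[0], '%Y-%m-%d')
    except ValueError:
        return None


# Hypothetical helper: builds a filter for a half-open date range [start, end).
def date_range_query(start, end, field='dt_pub_date'):
    return {field: {'$gte': start, '$lt': end}}


parsed = parse_pub_date('2020-02-03T14:30:00Z')
print(parsed)  # 2020-02-03 00:00:00

february = date_range_query(dt.datetime(2020, 2, 1), dt.datetime(2020, 3, 1))
```

A filter built this way could be passed to `find()` on the collection once `dt_pub_date` has been stored. Returning `None` for missing or malformed dates is a design choice for the sketch; the notebook cell raises on such input.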

In [12]:
# Convert the pub_date string to a datetime object
import datetime as dt
cursor = db_news_content.find({}, {'_id':1, 'text': 1, 'pub_date' : 1 })
articles = list(cursor)

for article in articles :
    dt_pub_date = dt.datetime.strptime(article['pub_date'].split('T')[0], '%Y-%m-%d')
    db_news_content.update_one({'_id': article['_id']}, { '$set' : {'dt_pub_date' : dt_pub_date}})
    