# Pipeline to run all files

* Our plan is to store nothing between runs (the data is refreshed daily), but we still need to decide how to handle incoming user feedback.


### 1. Pull Articles

This script uses NewsAPI to pull a set of articles.
An article is kept only if it comes from a reputable news source (top 19) and mentions at least one of the top 20 retail companies, so that the results are both reliable and retail-related.

By default the script pulls articles over the last week. However, if you'd like to specify a date range, the script takes in:
* pull_from = date in "YYYY-MM-DD" format
* pull_to   = date in "YYYY-MM-DD" format
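
As a rough sketch of how a custom range could be built, the helper below produces the two "YYYY-MM-DD" strings for a window ending on a chosen day. `date_window` is a hypothetical helper and is not part of `NewsAPI.py`; the keyword-argument call in the final comment is also an assumption based on the optional inputs listed above.

```python
from datetime import date, timedelta


def date_window(end, days=7):
    """Return (pull_from, pull_to) as "YYYY-MM-DD" strings.

    The window covers `days` calendar days and ends on `end` (inclusive).
    Hypothetical helper, not part of NewsAPI.py.
    """
    start = end - timedelta(days=days - 1)
    return start.isoformat(), end.isoformat()


pull_from, pull_to = date_window(date(2019, 4, 16))

# Assumed call shape, based on the optional inputs described above:
# articleDB = news.main(pull_from=pull_from, pull_to=pull_to)
```

With `days=7` and an end date of 2019-04-16, this gives the same 2019-04-10 to 2019-04-16 window that appears in the cell output below.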


In [1]:
# Pull news articles with NewsAPI
import NewsAPI as news

# optional inputs: pull_from, pull_to
# format "YYYY-MM-DD"
# where pull_to > pull_from
articleDB = news.main() #output is called 'NewsAPIOutput.xlsx' in Python Scripts > Data folder

Gathering articles on (Gap Inc) OR (Foot Locker) OR (L Brands) OR Macerich OR Kimco OR TJX OR CVS OR (Home Depot) OR (Best Buy) OR (Lowe's) OR Walmart OR (Target's) OR TGT OR Amazon OR Kroger OR Walgreens OR Kohl's OR (Dollar General) OR (Bed Bath and Beyond) OR Safeway from: 2019-04-10 to 2019-04-16
885


### 2. Clean Articles

This step cleans the articles and prepares them for the rest of the pipeline.

Among other cleaning steps, this script removes the following articles entirely:
* duplicate articles
* invalid 'articles' (e.g. "Your usage has been flagged", Chinese characters, etc.)
* articles with fewer than 300 words

*From the remaining articles, it strips:*
* HTML tags (e.g. 'div', etc.)
* stop words (e.g. 'and', 'the', dates, prepositions, etc.)
* standard phrases (e.g. 'click here', 'read more')

It stores three copies of the data:
1. For the interface: stripped only of tags, links, and standard phrases, keeping punctuation and capitalization for readability (used for the final output)
2. For keyphrase tagging: everything in (1), plus markup, times, URLs, and any punctuation that does not mark a stop (e.g. quotation marks) removed
3. For classification: everything in (1) and (2), plus all remaining punctuation, capitalization, stop words, etc. removed
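
To make the difference between the lightest and heaviest cleaning levels concrete, here is a minimal sketch using only the standard library. It is not the code in `dataClean.py`: the function names, the regular expressions, and the tiny phrase and stop-word lists are all illustrative assumptions.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <div>
URL_RE = re.compile(r"https?://\S+")   # bare links
STANDARD_PHRASES = ("click here", "read more")      # illustrative subset
STOP_WORDS = {"and", "the", "of", "to", "a", "in"}  # illustrative subset


def clean_for_interface(text):
    """Level 1: drop tags, links, and standard phrases; keep case and punctuation."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    for phrase in STANDARD_PHRASES:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


def clean_for_classification(text):
    """Level 3: lowercase, drop punctuation and stop words."""
    tokens = re.findall(r"[a-z]+", clean_for_interface(text).lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def long_enough(text, min_words=300):
    """Word-count filter: keep articles with at least `min_words` words."""
    return len(text.split()) >= min_words
```

For example, `clean_for_interface("<div>Walmart beats estimates. Click here http://x.co/a</div>")` keeps the readable sentence `"Walmart beats estimates."`, while `clean_for_classification` reduces text to lowercase content words only.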

In [2]:
# Article cleaning (requires tqdm; run `pip install tqdm` once beforehand)
import dataClean as dc

articleDB = dc.DataClean(articleDB)

### 3. Encode Features for Classifier

This script reads the list of selected features, stored as a .csv file in the Data subfolder, and uses it to encode the articles for the classifier.
Each article is encoded as a vector that records which of these features it contains.

In other words, this script creates a matrix where each row represents an article (i) and each column is a selected feature (j). A cell receives a 1 if the feature (i.e. word) j appears in article i, otherwise it receives a 0. 
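
The matrix described above can be sketched in a few lines. This is only an illustration of the binary encoding idea, not the code in `FeatureEncoding.py`: `binary_encode` is a hypothetical name, the feature list is made up, and the sketch splits on whitespace instead of applying the lemmatization implied by `norm='wnLemm'`.

```python
def binary_encode(articles, features):
    """Return a matrix where cell [i][j] is 1 if feature j occurs in article i, else 0."""
    matrix = []
    for text in articles:
        words = set(text.lower().split())  # simple whitespace tokenization (assumption)
        matrix.append([1 if feature in words else 0 for feature in features])
    return matrix


articles = ["walmart raises wages", "target store sales rise"]
features = ["walmart", "sales", "merger"]  # illustrative feature list
matrix = binary_encode(articles, features)
```

Here the first article yields the row `[1, 0, 0]` and the second yields `[0, 1, 0]`.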



In [3]:
#Feature Selection and Binary Article Encoding
import FeatureEncoding as fe
contentBinaryMatrix = fe.encoding(0, df=articleDB, text_col='content', norm='wnLemm')
titleBinaryMatrix = fe.encoding(0, df=articleDB, text_col='title', norm='wnLemm')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
content
Binary Encoding
title
Binary Encoding


### 4. Classify articles as market moving or not (ML Algo 1)

Finally Classifying!
This script predicts whether each new article is market moving or not, using the pre-trained logistic regression classifier. It takes in the encoded matrices and appends its predictions as a column to the articleDB dataframe.

It also ranks articles based on the logistic regression prediction from most likely to be market moving to least likely.
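
The scoring and ranking step can be illustrated with a small sketch. It is not the code in `logReg.py`: the weights and bias are made-up numbers standing in for the pre-trained model, and how the title and content matrices are combined is not shown here.

```python
import math


def predict_proba(row, weights, bias):
    """Logistic regression score: sigmoid of the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def rank_articles(matrix, weights, bias):
    """Return article indices ordered from most to least likely market moving."""
    scores = [predict_proba(row, weights, bias) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical encoded articles and model parameters, for illustration only.
matrix = [[1, 0], [0, 1], [0, 0]]
order, scores = rank_articles(matrix, weights=[2.0, -1.0], bias=0.0)
```

With these made-up parameters the first article ranks highest and the second lowest, so `order` is `[0, 2, 1]`.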

In [4]:
# Logistic regression classifier + article ranking; the complete output file is called 'results_encoding.xlsx'
import logReg as lr
articleDB = lr.runLogReg(titleBinaryMatrix, contentBinaryMatrix, articleDB)

### 5. Extract Article Tags and Trending Terms (ML Algo 2)

Each article is tagged with 5 key phrases that help to identify its context.  
This script extracts the key phrases from each article with a context extraction algorithm and ranks them by pointwise mutual information (PMI), used here as a measure of how useful each phrase is.  
These tags are then displayed alongside the article headlines in the interface and highlighted within the article text.  
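
For readers unfamiliar with PMI, the sketch below scores the bigrams in a token list with the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It is an illustration only and makes no claim about how `ContextExtraction.py` estimates the probabilities or selects candidate phrases; `bigram_pmi` is a hypothetical name.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information (base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "home depot sales rise as home depot expands".split()
scores = bigram_pmi(tokens)
# Sort phrases from highest to lowest PMI.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Plain PMI tends to favour pairs of rare words, so practical rankers usually add a minimum-count threshold or similar adjustment.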

Inputs:
1. `articleDB`: uses column `content`
2. (optional) tag type: `'ngrams'` (unlimited length), `'bigrams'` (terms of up to 2 words), or `'unigrams'` (single terms)
    * default is `'bigrams'`

Outputs:
1. `articleDB`: the input `articleDB` with appended columns `tags` and `tags_top_5`
2. `trendingTermsDB`: key terms ranked by number of article mentions

* Note: `tags` currently stores more tags than is probably helpful. Quick fix: adjust ranking code in ContextExtraction.py to output top 15-20 tags. 
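
As a minimal sketch of the trending-terms output, the function below counts how many articles mention each tag. It assumes each article's tags are available as a list of strings; `trending_terms` is a hypothetical name, not a function in `ContextExtraction.py`.

```python
from collections import Counter


def trending_terms(article_tags):
    """Count the number of articles mentioning each tag, most frequent first."""
    counts = Counter()
    for tags in article_tags:
        counts.update(set(tags))  # set(): count each tag at most once per article
    return counts.most_common()


article_tags = [
    ["tariff", "earnings"],
    ["tariff"],
    ["earnings", "tariff", "tariff"],
]
trending = trending_terms(article_tags)
```

Using a set per article means a tag repeated within one article still counts as a single mention.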

In [6]:
# This code extracts and ranks "tags" + counts frequency of tag mentions in articles 
import ContextExtraction as ce
articleDB, trendingTermsDB = ce.retrieveContext(articleDB)

100%|████████████████████████████████████████████████████████████████████████████████| 555/555 [00:17<00:00, 31.31it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 555/555 [00:53<00:00, 12.46it/s]


### 6. Recommend Related Articles (ML Algo 3)

To let FAs explore a topic of interest further, this algorithm returns the three most similar articles for any given article. These articles are ranked by similarity to that article, regardless of whether they are market moving or not.

This script also appends a column of related article IDs to articleDB.
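
One common way to find the most similar articles is cosine similarity between encoded article vectors. The sketch below assumes that measure; the source does not state which similarity `Recommender.py` uses, and `cosine` and `top_related` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_related(vectors, index, k=3):
    """Return the indices of the k articles most similar to article `index`."""
    sims = [(cosine(vectors[index], v), j)
            for j, v in enumerate(vectors) if j != index]
    sims.sort(key=lambda pair: (-pair[0], pair[1]))  # highest similarity first
    return [j for _, j in sims[:k]]


# Illustrative binary article vectors.
vectors = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
related = top_related(vectors, index=0)
```

For article 0 this returns `[1, 2, 4]`: the identical article first, then the two that share one feature with it.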

In [6]:
import Recommender as rec
articleDB = rec.recommender(articleDB)

tifidf Encoding
bin Encoding


100%|███████████████████████████████████████████████████████████████████████████████| 556/556 [00:01<00:00, 492.65it/s]


tf Encoding


100%|███████████████████████████████████████████████████████████████████████████████| 556/556 [00:01<00:00, 472.95it/s]


### 7. Output the Interface Data

This final script consolidates all of the information gathered above into a single JSON file that is displayed through the RBC interface. The final output is called `data.json`, which can be dragged into the Git repository for the interface built by Hayden and the kind co-op devs at RBC!
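
The consolidation step amounts to serializing the article records and trending terms into one JSON document. The sketch below shows the general idea only: the key names (`articles`, `trendingTerms`, and the per-article fields) are assumptions for illustration and do not describe the actual `data.json` schema produced by `frontPage.py`.

```python
import json


def build_payload(articles, trending_terms):
    """Serialize article records and trending terms into a single JSON string."""
    payload = {
        "articles": articles,             # hypothetical key name
        "trendingTerms": trending_terms,  # hypothetical key name
    }
    return json.dumps(payload, indent=2)


# Made-up records with hypothetical field names.
articles = [{"id": 0, "title": "Example headline", "tags": ["tariff"], "related": [1, 2, 4]}]
trending_terms = [{"term": "tariff", "mentions": 3}]
data_json = build_payload(articles, trending_terms)
```

Writing `data_json` to a file named `data.json` would then give a single file for the interface to load.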

In [8]:
import frontPage as fp
frontpage = fp.FrontPage(articleDB, trendingTermsDB)