# Building A Custom News Feed

The aim is to build an email service that delivers a daily email with 5 recommended articles, based on the articles saved in a person's Pocket account (the labelled dataset). Initially this means creating a model from data pulled from Pocket using its API. Later it means creating a service that pulls news from RSS feeds and tests the articles against the model to find the top recommendations, followed by a second service that sends out automatic emails with the recommended articles.

In [10]:
import requests
import pandas as pd
import json
import config
pd.set_option('display.max_colwidth', 200)

## Creating supervised dataset

The Pocket API is used to create the initial dataset. I curated the dataset by adding ~200 articles to Pocket and tagging each one with 'y' or 'n' depending on whether the user is interested in the article. The result is a dataframe of article URLs, each with a label saying whether the user is interested or not.

In [18]:
auth_params = {
    'consumer_key' : config.consumer_key, 
    'redirect_uri' : 'https://twitter.com/leojpedwards'
}
# Ask Pocket for a request token; the response body has the form 'code=<token>'
tkn = requests.post('https://getpocket.com/v3/oauth/request', data=auth_params)
# Take the value after 'code=' from the decoded response body
split_tkn = tkn.text.split('=')[1]
print(tkn.content)
print(split_tkn)

b'code=1ffb9844-dd4d-17e3-00db-ae30f5'
1ffb9844-dd4d-17e3-00db-ae30f5


In [19]:
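usr_params = {
    'consumer_key' : config.consumer_key,
    'code' : split_tkn  # request token obtained in the previous cell
}
# Exchange the request token for an access token
# (the user must first approve the token in a browser)
usr = requests.post('https://getpocket.com/v3/oauth/authorize', data=usr_params)
usr.content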

b'403 Forbidden'
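
The `403 Forbidden` above is what I would expect when the request token has not yet been approved by the user, a step that sits between the two POST calls. A minimal sketch of that step follows. It assumes the approval page lives at `https://getpocket.com/auth/authorize` (my reading of the Pocket documentation, worth checking), and `parse_request_token` and `build_authorize_url` are hypothetical helper names, not part of any library.

```python
from urllib.parse import parse_qs, urlencode

# Assumed location of Pocket's user-facing approval page (verify against the docs).
POCKET_AUTHORIZE_PAGE = 'https://getpocket.com/auth/authorize'


def parse_request_token(body):
    """Extract the token from a form-encoded body such as b'code=abc'."""
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    return parse_qs(body)['code'][0]


def build_authorize_url(request_token, redirect_uri):
    """Build the URL the user opens in a browser to approve the request token."""
    query = urlencode({'request_token': request_token,
                       'redirect_uri': redirect_uri})
    return POCKET_AUTHORIZE_PAGE + '?' + query


token = parse_request_token(b'code=1ffb9844-dd4d-17e3-00db-ae30f5')
authorize_url = build_authorize_url(token, 'https://twitter.com/leojpedwards')
print(authorize_url)
```

Once the user has approved the request in the browser, repeating the POST to `/v3/oauth/authorize` should return the access token that the later cells read from `config.access_token`.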

In [None]:
no_params = {
    'consumer_key' : config.consumer_key,
    'access_token' : config.access_token,
    'tag' : 'n'
}
no_result = requests.post('https://getpocket.com/v3/get', data=no_params)
no_result.json()

In [None]:
no_jf = no_result.json()
no_jd = no_jf['list']
no_urls = []
for i in no_jd.values():
    no_urls.append(i.get('resolved_url'))
no_urls

In [None]:
no_uf = pd.DataFrame(no_urls, columns=['urls'])
no_uf = no_uf.assign(wanted = lambda x: 'n')
no_uf

In [None]:
yes_params = {
    'consumer_key' : config.consumer_key,
    'access_token' : config.access_token,
    'tag' : 'y'
}
yes_result = requests.post('https://getpocket.com/v3/get', data=yes_params)

In [None]:
yes_jf = yes_result.json()
yes_jd = yes_jf['list']  # read from the 'y' response, not the 'n' one
yes_urls = []
for i in yes_jd.values():
    yes_urls.append(i.get('resolved_url'))
yes_urls

In [None]:
yes_uf = pd.DataFrame(yes_urls, columns=['urls'])
yes_uf = yes_uf.assign(wanted = lambda x: 'y')
yes_uf

In [None]:
df = pd.concat([yes_uf, no_uf])
df.dropna(inplace=True)
df
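
The 'y' and 'n' cells above repeat the same steps, so they could be folded into one helper. The sketch below is pure Python so it can be tried without network access. `labelled_rows` is a hypothetical name, the sample payload is made up, and the handling of an empty result as a list rather than a dict is an assumption about the Pocket response.

```python
def labelled_rows(response_json, label):
    """Return (url, label) pairs for every item in a Pocket 'get' response."""
    items = response_json.get('list') or {}
    # Assumption: an empty result may arrive as [] instead of {},
    # so anything that is not a dict is treated as "no items".
    if not isinstance(items, dict):
        return []
    return [(item.get('resolved_url'), label)
            for item in items.values()
            if item.get('resolved_url')]  # skip items without a resolved URL


# Made-up payload in the same shape as the responses used above.
sample = {'list': {'1': {'resolved_url': 'http://example.com/a'},
                   '2': {'resolved_url': None}}}
rows = labelled_rows(sample, 'y') + labelled_rows({'list': []}, 'n')
print(rows)
```

The resulting rows can be passed straight to `pd.DataFrame(rows, columns=['urls', 'wanted'])`, which also removes the need for the later `dropna` on missing URLs.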

## Scraping article content

Once the URLs have been recovered, the text of each article has to be scraped before the NLP steps can be carried out. The articles come from many different websites, and writing a bespoke scraper for each one would be time consuming. I therefore decided to use a link embedding service that can be queried through an API. Initially I tried embed.ly, but this is now a very expensive paid service, so I have used embed.rocks instead. It is tested below on a single URL and then applied to the whole list of article URLs from above.

Once the raw HTML has been returned by the API, the plain text is extracted with BeautifulSoup and added as a new column.

### Test embed API

In [None]:
test = requests.get('https://api.embed.rocks/api/?url=http://www.randalolson.com/2014/10/27/the-reddit-world-map/&key=' + config.embed_rocks_key)
test_2 = json.loads(test.text)
test_3 = test_2.get('article')
test_3

In [None]:
import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded
def get_html(x):
    qurl = urllib.parse.quote(x)
    rhtml = requests.get('https://api.embed.rocks/api/?url=' + qurl + '&key=' + config.embed_rocks_key)
    try:
        ctnt = json.loads(rhtml.text).get('article')
    except ValueError:
        ctnt = None
    return ctnt
df.loc[:, 'html'] = df['urls'].map(get_html)
df.dropna(inplace=True)
df
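
An alternative to concatenating the query string by hand is to let `urllib.parse.urlencode` build it, so that both parameters are escaped in one place. This is only a sketch: the endpoint and the `url` and `key` parameter names are taken from the cells above, while `build_embed_url` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

EMBED_ENDPOINT = 'https://api.embed.rocks/api/'


def build_embed_url(article_url, api_key):
    """Return the embed.rocks request URL with a fully escaped query string."""
    return EMBED_ENDPOINT + '?' + urlencode({'url': article_url, 'key': api_key})


# An article URL with its own query string, to show that '&' and '=' are escaped.
example = build_embed_url('http://example.com/a?b=1&c=2', 'PLACEHOLDER_KEY')
print(example)
```

Passing `params={'url': ..., 'key': ...}` to `requests.get` has the same effect and avoids building the string at all.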

In [None]:
from bs4 import BeautifulSoup
def get_text(x):
    soup = BeautifulSoup(x, 'lxml')
    text = soup.get_text()
    return text
df.loc[:, 'text'] = df['html'].map(get_text)
df

In [None]:
df.shape

## Natural Language Processing

A TF-IDF vectorizer is fitted on the text column to turn the text data into a matrix format usable for machine learning. It uses 1- to 3-word n-grams, removes English stop words and ignores terms that appear in fewer than 3 articles.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=3)
tv = vect.fit_transform(df['text'])

In [None]:
tv
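
To make `ngram_range=(1,3)` concrete, the sketch below lists the terms that a 1- to 3-word range produces from an already tokenised sentence. It is a simplified illustration written for this notebook, not scikit-learn's own code, and `word_ngrams` is a hypothetical name.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return every run of min_n..max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


tokens = ['custom', 'news', 'feed']
grams = word_ngrams(tokens)
print(grams)
# ['custom', 'news', 'feed', 'custom news', 'news feed', 'custom news feed']
```

Each of these terms becomes a candidate column of the matrix; `min_df=3` then drops any term that occurs in fewer than three articles.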

## Support Vector Machines

A Support Vector Machine model is built from the vectorised data. The quality of the model still needs to be evaluated. This has not yet been done, because it was initially assumed that the iterative correction process would be enough to make the model effective.

In [None]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
model = clf.fit(tv, df['wanted'])

In [None]:
# Evaluate the model and test
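
One possible starting point for the missing evaluation is to hold back some labelled articles, predict their labels, and compare the two lists. The sketch below computes precision and recall for the 'y' class in plain Python. The label lists are made up for illustration and are not results from this model, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, positive='y'):
    """Return (precision, recall) for the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only: true tags versus hypothetical predictions.
y_true = ['y', 'y', 'n', 'n', 'y']
y_pred = ['y', 'n', 'n', 'y', 'y']
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

For a daily email of recommendations, precision on 'y' is probably the number to watch, since it measures how many of the suggested articles were actually wanted. scikit-learn's `train_test_split` and `classification_report` cover the same ground with less code.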

### Test Gspread credentials

In [None]:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

json_file = 'Custom News Feed-d7876b12a476.json'

scope = ['https://spreadsheets.google.com/feeds']

credentials = ServiceAccountCredentials.from_json_keyfile_name(json_file, scope)

gc = gspread.authorize(credentials)

In [None]:
ws = gc.open('NewsFeed')
sh = ws.sheet1
zd = list(zip(sh.col_values(2), sh.col_values(3), sh.col_values(4)))
zf = pd.DataFrame(zd, columns=['title', 'urls', 'html'])
zf.replace('', float('nan'), inplace=True)  # pd.np is no longer available in current pandas
zf.dropna(inplace=True)
zf.head()

In [None]:
zf.loc[:, 'text'] = zf['html'].map(get_text) 
zf.reset_index(drop=True, inplace=True)
test_matrix = vect.transform(zf['text'])
test_matrix

In [None]:
results = pd.DataFrame(model.predict(test_matrix), columns = ['wanted'])

In [None]:
results

In [None]:
rez = pd.merge(results, zf, left_index=True, right_index=True)
rez

### Tune the Model

In [None]:
# Hypothetical correction method
change_to_no = [1, 7, 16]

change_to_yes = [0, 9, 27]

In [None]:
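# Assign through a single .iloc call; chained indexing such as
# rez.iloc[i]['wanted'] = 'y' is not guaranteed to modify rez itself
wanted_col = rez.columns.get_loc('wanted')
rez.iloc[change_to_yes, wanted_col] = 'y'
rez.iloc[change_to_no, wanted_col] = 'n'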
rez

In [None]:
combined = pd.concat([df[['wanted', 'text']], rez[['wanted', 'text']]])
combined

In [None]:
# Rebuild model with new data
tvcomb = vect.fit_transform(combined['text'])  # the vectorizer ignores labels, so only the text is passed
model = clf.fit(tvcomb, combined['wanted'])
# Iterate this process

### Output the Model to Pickle

In [None]:
import pickle
# Use context managers so the files are closed once written
with open('news_model_pickle.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('news_vect_pickle.pkl', 'wb') as f:
    pickle.dump(vect, f)
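
The email service will need to read these two files back. A minimal round-trip sketch, using a stand-in dictionary and a temporary directory instead of the real model so that it runs on its own; `save_pickle` and `load_pickle` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path, closing the file afterwards."""
    with open(path, 'wb') as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)


# Stand-in object; in the notebook this would be `model` or `vect`.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_pickle.pkl')
save_pickle({'label': 'y'}, demo_path)
restored = load_pickle(demo_path)
print(restored)
```

The vectorizer and the model should always be saved and loaded as a pair, because the model's feature columns correspond to that vectorizer's vocabulary. Pickle files should only be loaded from a trusted source.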

## TODO

- Add news article to supervised dataset
- More articles in supervised dataset
- Flask web app
- Add config variables to config file
- Score top articles rather than simple yes/no
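
For the last item, one option is to rank articles by a numeric score and keep the top five for the daily email. The sketch below shows only the ranking step, with made-up scores; `top_articles` is a hypothetical helper.

```python
def top_articles(urls, scores, n=5):
    """Return the n URLs with the highest scores, best first."""
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:n]]


# Made-up scores for illustration; higher means "more likely wanted".
urls = ['a', 'b', 'c', 'd']
scores = [0.2, -1.3, 1.7, 0.9]
best = top_articles(urls, scores, n=2)
print(best)
# ['c', 'd']
```

With the model above, the scores could come from `model.decision_function(test_matrix)`, which for `LinearSVC` returns a signed distance from the decision boundary. With the labels 'n' and 'y', larger values should correspond to 'y'.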