# Building A Custom News Feed

Principle is to build an email service app that will deliver a daily email with 5 recommended articles based on what articles are in a persons pocket account (labelled dataset). This will initally mean creating a model based on data pulled from pocket using the API. Laterly this will mean creating a service that will pull news from RSS feeds and testing them against the model to find top recommendations. Then creating a second service that will send out automatic emails with recommended articles

In [2]:
import requests
import pandas as pd
import json
import config
pd.set_option('display.max_colwidth', 200)

## Creating supervised dataset

Using the Pocket API to create the initial dataset. Having curated a dataset by adding ~200 articles to pocket and tagged them with 'y' or 'n' depending on whether the user is interested in the article. Ultimately this will give a dataframe of article urls and a label of whether the user is interested or not.

In [None]:
# API authorisation. Once Access Key is known no need to re-run.
auth_params = {
    'consumer_key' : config.consumer_key, 
    'redirect_uri' : 'https://twitter.com/leojpedwards'
}
tkn = requests.post('https://getpocket.com/v3/oauth/request', data=auth_params)
string_tkn = str(tkn.content)
split_tkn = string_tkn.split('=')[1].replace('\'', '')

In [None]:
usr_params = {
    'consumer_key' : config.consumer_key, 
    'code' : split_tkn
}
usr = requests.post('https://getpocket.com/v3/oauth/authorize', data=usr_params)
usr.content

In [16]:
no_params = {
    'consumer_key' : config.consumer_key,
    'access_token' : config.access_token,
    'tag' : 'n'
}
no_result = requests.post('https://getpocket.com/v3/get', data=no_params)

In [17]:
no_json = no_result.json()
no_list = no_json['list']
no_urls = []
for i in no_list.values():
    no_urls.append(i.get('resolved_url'))

In [18]:
no_df = pd.DataFrame(no_urls, columns=['urls'])
no_df = no_df.assign(wanted = lambda x: 'n')
no_df.head()

Unnamed: 0,urls,wanted
0,http://www.bbc.co.uk/news/av/world-us-canada-42698308/firefighter-catches-child-from-burning-building,n
1,https://www.reddit.com/r/hacking/comments/73gtvh/darren_kitchenhacking_as_a_way_of_thinking/,n
2,https://xkcd.com/974/,n
3,http://www.bbc.co.uk/news/world-europe-42701702,n
4,http://www.bbc.co.uk/news/uk-42704691,n


In [19]:
yes_params = {
    'consumer_key' : config.consumer_key,
    'access_token' : config.access_token,
    'tag' : 'y'
}
yes_result = requests.post('https://getpocket.com/v3/get', data=yes_params)

In [20]:
yes_json = yes_result.json()
yes_list = no_json['list']
yes_urls = []
for i in yes_list.values():
    yes_urls.append(i.get('resolved_url'))

In [21]:
yes_df = pd.DataFrame(yes_urls, columns=['urls'])
yes_df = yes_df.assign(wanted = lambda x: 'y')
yes_df.head()

Unnamed: 0,urls,wanted
0,http://www.bbc.co.uk/news/av/world-us-canada-42698308/firefighter-catches-child-from-burning-building,y
1,https://www.reddit.com/r/hacking/comments/73gtvh/darren_kitchenhacking_as_a_way_of_thinking/,y
2,https://xkcd.com/974/,y
3,http://www.bbc.co.uk/news/world-europe-42701702,y
4,http://www.bbc.co.uk/news/uk-42704691,y


In [22]:
df = pd.concat([yes_df, no_df])
df.dropna(inplace=True)
df.head()

Unnamed: 0,urls,wanted
0,http://www.bbc.co.uk/news/av/world-us-canada-42698308/firefighter-catches-child-from-burning-building,y
1,https://www.reddit.com/r/hacking/comments/73gtvh/darren_kitchenhacking_as_a_way_of_thinking/,y
2,https://xkcd.com/974/,y
3,http://www.bbc.co.uk/news/world-europe-42701702,y
4,http://www.bbc.co.uk/news/uk-42704691,y


## Scraping article content

Once the URLs have been recovered it is necessary to scrape the text from these articles in order to carry out the NLP steps. This requires the use of scraping on a number of different web sources. This would be a time consuming process if I were to write a bespoke web scraper for each website. Therefore I decided to use an link embedding service with an api to query. Initially I tried to use embed.ly however this is now a very expensive paid service. I instead have used embed.rocks. This is tested below and applied to the whole list of article URLs from above. 

Once the raw HTML has been extracted from the API, it was necessary to add just the text as a new column which is done using BeautifulSoup.

### Test embed API

In [24]:
test = requests.get('https://api.embed.rocks/api/?url=http://www.randalolson.com/2014/10/27/the-reddit-world-map/&key=' + config.embed_rocks_key)
test_2 = json.loads(test.text)
test_3 = test_2.get('article')

In [26]:
import urllib
def get_html(x):
    qurl = urllib.parse.quote(x)
    rhtml = requests.get('https://api.embed.rocks/api/?url=' + qurl + '&key=' + config.embed_rocks_key)
    try:
        ctnt = json.loads(rhtml.text).get('article')
    except ValueError:
        ctnt = None
    return ctnt
df.loc[:, 'html'] = df['urls'].map(get_html)
df.dropna(inplace=True)
df.head()

Unnamed: 0,urls,wanted,html
0,http://www.bbc.co.uk/news/av/world-us-canada-42698308/firefighter-catches-child-from-burning-building,y,<div></div>
1,https://www.reddit.com/r/hacking/comments/73gtvh/darren_kitchenhacking_as_a_way_of_thinking/,y,<div></div>
2,https://xkcd.com/974/,y,"<div><p>\nThis means you&apos;re free to copy and share these comics (but not to sell them). <a href=""/license.html"">More details</a>.</p></div>"
3,http://www.bbc.co.uk/news/world-europe-42701702,y,"<div><p>President Emmanuel Macron says France will not allow a new migrant camp to be set up in Calais, on a visit to the port where many gather, hoping to get to the UK. </p><p>Up to 700 migrants..."
4,http://www.bbc.co.uk/news/uk-42704691,y,"<div><p>Big Ben and Heathrow Airport were among landmarks targeted by a British man who was &quot;fascinated&quot; by the so-called Islamic State, a court has heard.</p><p>Umar Ahmed Haque, 25, fr..."


In [28]:
from bs4 import BeautifulSoup
def get_text(x):
    soup = BeautifulSoup(x, 'lxml')
    text = soup.get_text()
    return text
df.loc[:, 'text'] = df['html'].map(get_text)
df.dropna(inplace=True)
df.head()

Unnamed: 0,urls,wanted,html,text
0,http://www.bbc.co.uk/news/av/world-us-canada-42698308/firefighter-catches-child-from-burning-building,y,<div></div>,
1,https://www.reddit.com/r/hacking/comments/73gtvh/darren_kitchenhacking_as_a_way_of_thinking/,y,<div></div>,
2,https://xkcd.com/974/,y,"<div><p>\nThis means you&apos;re free to copy and share these comics (but not to sell them). <a href=""/license.html"">More details</a>.</p></div>",\nThis means you're free to copy and share these comics (but not to sell them). More details.
3,http://www.bbc.co.uk/news/world-europe-42701702,y,"<div><p>President Emmanuel Macron says France will not allow a new migrant camp to be set up in Calais, on a visit to the port where many gather, hoping to get to the UK. </p><p>Up to 700 migrants...","President Emmanuel Macron says France will not allow a new migrant camp to be set up in Calais, on a visit to the port where many gather, hoping to get to the UK. Up to 700 migrants are congregati..."
4,http://www.bbc.co.uk/news/uk-42704691,y,"<div><p>Big Ben and Heathrow Airport were among landmarks targeted by a British man who was &quot;fascinated&quot; by the so-called Islamic State, a court has heard.</p><p>Umar Ahmed Haque, 25, fr...","Big Ben and Heathrow Airport were among landmarks targeted by a British man who was ""fascinated"" by the so-called Islamic State, a court has heard.Umar Ahmed Haque, 25, from east London, denies pr..."


In [29]:
df.shape

(154, 4)

## Natural Language Processing

From the text column it is possible to call a vectorizer in order to turn the text data into a usable matrix format for Machine Learning. 


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=3)
vector_matrix = vect.fit_transform(df['text'])

In [31]:
vector_matrix

<154x2906 sparse matrix of type '<class 'numpy.float64'>'
	with 19228 stored elements in Compressed Sparse Row format>

## Support Vector Machines

Building a Support Vector Machine model from the vectorised data. This step will require some evaluation of the quality of the model which has not yet been done as initially it was thought that the iterative process would ensure that the model was effective.

In [37]:
from sklearn.svm import LinearSVC, SVC
clf = SVC()
model = clf.fit(vector_matrix, df['wanted'])

In [38]:
X = vector_matrix
y = df['wanted']

In [39]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    "C" : [1],
    "kernel" : ['rbf'],
    "gamma" : np.linspace(0, 100, 10)
}

gridSearch = GridSearchCV(clf, params, cv=5, n_jobs=1, verbose=1)
model = gridSearch.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.8s finished


In [43]:
print (model.best_params_)
best_model = model.best_estimator_
best = best_model.fit(X_train, y_train)
score = best.score(X_test, y_test)
print ('Score:\t', score)

{'kernel': 'rbf', 'C': 1, 'gamma': 0.0}
Score:	 0.358974358974359


In [44]:
params_2 = {
    "C" : np.logspace(-3, 2, 10),
    "kernel" : ['linear', 'poly', 'rbf'],
    "gamma" : np.logspace(-5, 2, 10)
}

gridSearch = GridSearchCV(clf, params_2, cv=5, n_jobs=-1, verbose=1)
model_2 = gridSearch.fit(X_train, y_train)

Fitting 5 folds for each of 300 candidates, totalling 1500 fits


[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed:    4.0s finished


In [51]:
print (model_2.best_params_)
best_model_2 = model_2.best_estimator_
best_2 = best_model_2.fit(X_train, y_train)
score_2 = best_2.score(X_test, y_test)
print ('Score:\t', score_2)

{'kernel': 'linear', 'C': 0.001, 'gamma': 1e-05}
Score:	 0.358974358974359


In [49]:
clf_2 = LinearSVC()

params = {
    "penalty" : ['l2'],
    "loss" : ['hinge', 'squared_hinge'],
    "C" : np.logspace(-3, 2, 10)
}

gridSearch = GridSearchCV(clf_2, params, cv=5, n_jobs=1, verbose=1)
model_3 = gridSearch.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.6s finished


In [50]:
print (model_3.best_params_)
best_model_3 = model_3.best_estimator_
best_3 = best_model_3.fit(X_train, y_train)
score_3 = best_3.score(X_test, y_test)
print ('Score:\t', score_3)

{'C': 0.001, 'penalty': 'l2', 'loss': 'hinge'}
Score:	 0.358974358974359


### Test Gspread credentials

In [29]:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

json_file = 'Custom News Feed-d7876b12a476.json'

scope = ['https://spreadsheets.google.com/feeds']

credentials = ServiceAccountCredentials.from_json_keyfile_name(json_file, scope)

connection = gspread.authorize(credentials)

In [30]:
NF_file = connection.open('NewsFeed')
NF_sheet = NF_file.sheet1
NF_list = list(zip(NF_sheet.col_values(2), NF_sheet.col_values(3), NF_sheet.col_values(4)))
articles_df = pd.DataFrame(NF_list, columns=['title', 'urls', 'html'])
articles_df.replace('', pd.np.nan, inplace=True)
articles_df.dropna(inplace=True)
articles_df.head()

Unnamed: 0,title,urls,html
0,The Pogues frontman's star-studded 60th birthday,http://www.bbc.co.uk/news/world-europe-42701430,"The man who brought punk and Irish folk together, turned 60 with a star-studded concert."
1,John Worboys: Rapist's jail move bid refused in 2015,http://www.bbc.co.uk/news/uk-42700475,Parole Board confirms it denied open jail transfer two years before ruling it was safe to free rapist.
2,Syrian opposition calls on Trump and EU to put pressure on Russia and Iran,http://www.reuters.com/article/us-mideast-crisis-syria-opposition/syrian-opposition-calls-on-trump-and-eu-to-put-pressure-on-russia-and-iran-idUSKBN1F51LO?feedType=RSS&feedName=worldNews,LONDON (Reuters) - U.S. President Donald Trump and European Union leaders should increase the pressure on President Bashar al-Assad and his allies Russia and Iran to return to talks to end Syria's...
3,Does MP Sir Desmond Swayne nod off in Ken Clarke's speech?,http://www.bbc.co.uk/news/uk-politics-42708212,Does MP Sir Desmond Swayne nod off during Ken Clarke's speech on the EU Withdrawal Bill?
4,England's first 'prisoner of war' discovered,http://www.bbc.co.uk/news/education-42690437,"A French aristocrat, captured in 1357, was England's earliest official ""prisoner of war"", say historians."


In [32]:
articles_df.loc[:, 'text'] = articles_df['html'].map(get_text) 
articles_df.reset_index(drop=True, inplace=True)
test_matrix = vect.transform(articles_df['text'])
test_matrix

<5x2873 sparse matrix of type '<class 'numpy.float64'>'
	with 42 stored elements in Compressed Sparse Row format>

In [33]:
results = pd.DataFrame(model.predict(test_matrix), columns = ['wanted'])

In [34]:
results

Unnamed: 0,wanted
0,n
1,y
2,y
3,y
4,y


In [36]:
rez = pd.merge(results, articles_df, left_index=True, right_index=True)
rez

Unnamed: 0,wanted,title,urls,html,text
0,n,The Pogues frontman's star-studded 60th birthday,http://www.bbc.co.uk/news/world-europe-42701430,"The man who brought punk and Irish folk together, turned 60 with a star-studded concert.","The man who brought punk and Irish folk together, turned 60 with a star-studded concert."
1,y,John Worboys: Rapist's jail move bid refused in 2015,http://www.bbc.co.uk/news/uk-42700475,Parole Board confirms it denied open jail transfer two years before ruling it was safe to free rapist.,Parole Board confirms it denied open jail transfer two years before ruling it was safe to free rapist.
2,y,Syrian opposition calls on Trump and EU to put pressure on Russia and Iran,http://www.reuters.com/article/us-mideast-crisis-syria-opposition/syrian-opposition-calls-on-trump-and-eu-to-put-pressure-on-russia-and-iran-idUSKBN1F51LO?feedType=RSS&feedName=worldNews,LONDON (Reuters) - U.S. President Donald Trump and European Union leaders should increase the pressure on President Bashar al-Assad and his allies Russia and Iran to return to talks to end Syria's...,LONDON (Reuters) - U.S. President Donald Trump and European Union leaders should increase the pressure on President Bashar al-Assad and his allies Russia and Iran to return to talks to end Syria's...
3,y,Does MP Sir Desmond Swayne nod off in Ken Clarke's speech?,http://www.bbc.co.uk/news/uk-politics-42708212,Does MP Sir Desmond Swayne nod off during Ken Clarke's speech on the EU Withdrawal Bill?,Does MP Sir Desmond Swayne nod off during Ken Clarke's speech on the EU Withdrawal Bill?
4,y,England's first 'prisoner of war' discovered,http://www.bbc.co.uk/news/education-42690437,"A French aristocrat, captured in 1357, was England's earliest official ""prisoner of war"", say historians.","A French aristocrat, captured in 1357, was England's earliest official ""prisoner of war"", say historians."


### Tune the Model

In [37]:
# Hypothetical correction method
change_to_no = [1, 3, 4]

change_to_yes = []

In [38]:
for i in rez.iloc[change_to_yes].index:
    rez.iloc[i]['wanted'] = 'y'
for i in rez.iloc[change_to_no].index:
    rez.iloc[i]['wanted'] = 'n'
rez

Unnamed: 0,wanted,title,urls,html,text
0,n,The Pogues frontman's star-studded 60th birthday,http://www.bbc.co.uk/news/world-europe-42701430,"The man who brought punk and Irish folk together, turned 60 with a star-studded concert.","The man who brought punk and Irish folk together, turned 60 with a star-studded concert."
1,n,John Worboys: Rapist's jail move bid refused in 2015,http://www.bbc.co.uk/news/uk-42700475,Parole Board confirms it denied open jail transfer two years before ruling it was safe to free rapist.,Parole Board confirms it denied open jail transfer two years before ruling it was safe to free rapist.
2,y,Syrian opposition calls on Trump and EU to put pressure on Russia and Iran,http://www.reuters.com/article/us-mideast-crisis-syria-opposition/syrian-opposition-calls-on-trump-and-eu-to-put-pressure-on-russia-and-iran-idUSKBN1F51LO?feedType=RSS&feedName=worldNews,LONDON (Reuters) - U.S. President Donald Trump and European Union leaders should increase the pressure on President Bashar al-Assad and his allies Russia and Iran to return to talks to end Syria's...,LONDON (Reuters) - U.S. President Donald Trump and European Union leaders should increase the pressure on President Bashar al-Assad and his allies Russia and Iran to return to talks to end Syria's...
3,n,Does MP Sir Desmond Swayne nod off in Ken Clarke's speech?,http://www.bbc.co.uk/news/uk-politics-42708212,Does MP Sir Desmond Swayne nod off during Ken Clarke's speech on the EU Withdrawal Bill?,Does MP Sir Desmond Swayne nod off during Ken Clarke's speech on the EU Withdrawal Bill?
4,n,England's first 'prisoner of war' discovered,http://www.bbc.co.uk/news/education-42690437,"A French aristocrat, captured in 1357, was England's earliest official ""prisoner of war"", say historians.","A French aristocrat, captured in 1357, was England's earliest official ""prisoner of war"", say historians."


In [39]:
combined = pd.concat([df[['wanted', 'text']], rez[['wanted', 'text']]])
combined

Unnamed: 0,wanted,text
0,y,"Employers need to do more to ""normalise conversations"" about the menopause in the workplace, say experts.The comments came after a BBC survey found 70% of respondents did not tell their bosses the..."
1,y,"SpyPi is the result of my high school graduation work that I've created with the intention to provide a different way to make data security/protection a subject of discussion, especially in class...."
2,y,"From a culinary standpoint, Christmas means two things: cookies, and CANDY. While a lot of us think “fudge” when we think Christmas candy, for us today, it means caramels! (Anybody else grow up wi..."
3,y,
4,y,Supermarket chain Iceland has said it will eliminate or drastically reduce plastic packaging of all its own-label products by the end of 2023. Iceland says the move will affect more than a thousan...
5,y,"""This song's our cry against man's inhumanity to man; and man's inhumanity to child."" - Dolores O'Riordan.After The Cranberries' debut album, people thought they had the band sussed out. The Limer..."
6,y,
7,y,"Fast food giant McDonald's has said all its packaging worldwide will come from sustainable sources by 2025. The restaurant chain will aim to get all items like bags, straws, wrappers and cups from..."
8,y,"A former friend of Newsnight presenter Emily Maitlis, who harassed her for two decades, has been jailed for contacting her from prison.Edward Vines, 47, was behind bars and later out on licence, w..."
9,y,"The controversial claim that the UK sends £350m a week to the EU was a ""gross underestimate"", Foreign Secretary Boris Johnson has said.Vote Leave's claim that £350m could go to the NHS instead was..."


In [40]:
# Rebuild model with new data
tvcomb = vect.fit_transform(combined['text'], combined['wanted'])
model = clf.fit(tvcomb, combined['wanted'])
# Iterate this process

### Output the Model to Pickle

In [41]:
import pickle
pickle.dump(model, open(r'news_model_pickle.pkl', 'wb'))
pickle.dump(vect, open(r'news_vect_pickle.pkl', 'wb'))

## TODO

- ~~Add news article to supervised dataset~~
- ~~More articles in supervised dataset~~
- Flask web app
- ~~Add config variables to config file~~
- Score top articles rather than simple yes/no
- Setup script for email service