# Building A Custom News Feed

Principle is to build an email service app that will deliver a daily email with 5 recommended articles based on what articles are in a persons pocket account (labelled dataset). This will initally mean creating a model based on data pulled from pocket using the API. Laterly this will mean creating a service that will pull news from RSS feeds and testing them against the model to find top recommendations. Then creating a second service that will send out automatic emails with recommended articles

In [1]:
import requests
import pandas as pd
import json
pd.set_option('display.max_colwidth', 200)

## Creating supervised dataset

Using the Pocket API to create the initial dataset. Having curated a dataset by adding ~200 articles to pocket and tagged them with 'y' or 'n' depending on whether the user is interested in the article. Ultimately this will give a dataframe of article urls and a label of whether the user is interested or not.

In [2]:
auth_params = {
    'consumer_key' : '74078-5623004303ffc998c3631a66', 
    'redirect_uri' : 'https://twitter.com/leojpedwards'
}
tkn = requests.post('https://getpocket.com/v3/oauth/request', data=auth_params)
tkn.content

b'code=203f8b15-baa1-3f74-f8dd-21e3c8'

In [3]:
usr_params = {
    'consumer_key' : '74078-5623004303ffc998c3631a66', 
    'code' : 'd31d69bd-8d31-e02a-6310-de8d66'
}
usr = requests.post('https://getpocket.com/v3/oauth/authorize', data=usr_params)
usr.content

b'403 Forbidden'

In [4]:
no_params = {
    'consumer_key' : '74078-5623004303ffc998c3631a66',
    'access_token' : 'c6fba7a4-54c8-1509-f778-dec422',
    'tag' : 'n'
}
no_result = requests.post('https://getpocket.com/v3/get', data=no_params)
no_result.json()

{'complete': 1,
 'error': None,
 'list': {'11400737': {'excerpt': 'Gephi is the leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free.  Runs on Windows, Mac OS X and Linux.',
   'favorite': '0',
   'given_title': '',
   'given_url': 'https://gephi.org/',
   'has_image': '1',
   'has_video': '0',
   'is_article': '0',
   'is_index': '1',
   'item_id': '11400737',
   'resolved_id': '11400737',
   'resolved_title': 'The Open Graph Viz Platform',
   'resolved_url': 'http://gephi.org/',
   'sort_id': 3,
   'status': '0',
   'time_added': '1512761894',
   'time_favorited': '0',
   'time_read': '0',
   'time_updated': '1516035573',
   'word_count': '250'},
  '114663999': {'excerpt': '',
   'favorite': '0',
   'given_title': '',
   'given_url': 'https://xkcd.com/974/',
   'has_image': '2',
   'has_video': '0',
   'is_article': '0',
   'is_index': '0',
   'item_id': '114663999',
   'resolved_id': '114663999',
   'resolved_title': 'The

In [5]:
no_jf = no_result.json()
no_jd = no_jf['list']
no_urls = []
for i in no_jd.values():
    no_urls.append(i.get('resolved_url'))
no_urls

['https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee',
 'https://d0hnuts.com/2016/12/21/basics-of-making-a-rootkit-from-syscall-to-hook/',
 'http://www.englandrugby.com/governance/discipline/judgments-document-search/',
 'https://xkcd.com/974/',
 'https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/',
 'https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html',
 'https://redditblog.com/2017/04/13/how-we-built-rplace/',
 'https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/',
 'http://spypi.ch/#',
 'http://fsecurify.com/how-to-learn-hacking/',
 'https://github.com/WuTheFWasThat/vimflowy/blob/master/README.md',
 'http://www.randalolson.com/2014/10/27/the-reddit-world-map/',
 'https://obannoncustomdesigns.com/our-work/',
 'https://medium.com/@algore/be-the-voice-of-reality-10a2495761db',
 'http://spritesmods.com/?art=hddhack&page

In [6]:
no_uf = pd.DataFrame(no_urls, columns=['urls'])
no_uf = no_uf.assign(wanted = lambda x: 'n')
no_uf

Unnamed: 0,urls,wanted
0,https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee,n
1,https://d0hnuts.com/2016/12/21/basics-of-making-a-rootkit-from-syscall-to-hook/,n
2,http://www.englandrugby.com/governance/discipline/judgments-document-search/,n
3,https://xkcd.com/974/,n
4,https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/,n
5,https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html,n
6,https://redditblog.com/2017/04/13/how-we-built-rplace/,n
7,https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/,n
8,http://spypi.ch/#,n
9,http://fsecurify.com/how-to-learn-hacking/,n


In [7]:
yes_params = {
    'consumer_key' : '74078-5623004303ffc998c3631a66',
    'access_token' : 'c6fba7a4-54c8-1509-f778-dec422',
    'tag' : 'y'
}
yes_result = requests.post('https://getpocket.com/v3/get', data=yes_params)

In [8]:
yes_jf = yes_result.json()
yes_jd = no_jf['list']
yes_urls = []
for i in yes_jd.values():
    yes_urls.append(i.get('resolved_url'))
yes_urls

['https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee',
 'https://d0hnuts.com/2016/12/21/basics-of-making-a-rootkit-from-syscall-to-hook/',
 'http://www.englandrugby.com/governance/discipline/judgments-document-search/',
 'https://xkcd.com/974/',
 'https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/',
 'https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html',
 'https://redditblog.com/2017/04/13/how-we-built-rplace/',
 'https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/',
 'http://spypi.ch/#',
 'http://fsecurify.com/how-to-learn-hacking/',
 'https://github.com/WuTheFWasThat/vimflowy/blob/master/README.md',
 'http://www.randalolson.com/2014/10/27/the-reddit-world-map/',
 'https://obannoncustomdesigns.com/our-work/',
 'https://medium.com/@algore/be-the-voice-of-reality-10a2495761db',
 'http://spritesmods.com/?art=hddhack&page

In [9]:
yes_uf = pd.DataFrame(yes_urls, columns=['urls'])
yes_uf = yes_uf.assign(wanted = lambda x: 'y')
yes_uf

Unnamed: 0,urls,wanted
0,https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee,y
1,https://d0hnuts.com/2016/12/21/basics-of-making-a-rootkit-from-syscall-to-hook/,y
2,http://www.englandrugby.com/governance/discipline/judgments-document-search/,y
3,https://xkcd.com/974/,y
4,https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/,y
5,https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html,y
6,https://redditblog.com/2017/04/13/how-we-built-rplace/,y
7,https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/,y
8,http://spypi.ch/#,y
9,http://fsecurify.com/how-to-learn-hacking/,y


In [10]:
df = pd.concat([yes_uf, no_uf])
df.dropna(inplace=True)
df

Unnamed: 0,urls,wanted
0,https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee,y
1,https://d0hnuts.com/2016/12/21/basics-of-making-a-rootkit-from-syscall-to-hook/,y
2,http://www.englandrugby.com/governance/discipline/judgments-document-search/,y
3,https://xkcd.com/974/,y
4,https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/,y
5,https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html,y
6,https://redditblog.com/2017/04/13/how-we-built-rplace/,y
7,https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/,y
8,http://spypi.ch/#,y
9,http://fsecurify.com/how-to-learn-hacking/,y


## Scraping article content

Once the URLs have been recovered it is necessary to scrape the text from these articles in order to carry out the NLP steps. This requires the use of scraping on a number of different web sources. This would be a time consuming process if I were to write a bespoke web scraper for each website. Therefore I decided to use an link embedding service with an api to query. Initially I tried to use embed.ly however this is now a very expensive paid service. I instead have used embed.rocks. This is tested below and applied to the whole list of article URLs from above. 

Once the raw HTML has been extracted from the API, it was necessary to add just the text as a new column which is done using BeautifulSoup.

### Test embed API

In [11]:
test = requests.get('https://api.embed.rocks/api/?url=http://www.randalolson.com/2014/10/27/the-reddit-world-map/&key=c299d38f-0506-49e5-90ad-07324677f6a3')
test_2 = json.loads(test.text)
test_3 = test_2.get('article')
test_3

'<div><p>We can all agree that online social networks dominate most people&#x2019;s day-to-day Internet lives. <a href="http://www.pewinternet.org/fact-sheets/social-networking-fact-sheet/">90%</a> of all U.S. adults aged 18-29 have a Facebook account, and a large portion of those people check their Facebook at least once a day.</p><p>What&#x2019;s strange is that most people regard social networks as nothing more than a blob of status updates and links that occasionally has something interesting on it. These people rely on word of mouth or rudimentary search features to find interesting content on social networks, despite the fact that there&#x2019;s often small communities focused on exactly what they want to talk about.</p><p>Last year, I set out to change all that. I wanted to connect people to these smaller communities.</p><p>The toughest part about online social networks is navigating them. With millions of users and hundreds of thousands of communities, how can we possibly hope 

In [12]:
import urllib
def get_html(x):
    qurl = urllib.parse.quote(x)
    rhtml = requests.get('https://api.embed.rocks/api/?url=' + qurl + '&key=c299d38f-0506-49e5-90ad-07324677f6a3')
    try:
        ctnt = json.loads(rhtml.text).get('article')
    except ValueError:
        ctnt = None
    return ctnt
df.loc[:, 'html'] = df['urls'].map(get_html)
df.dropna(inplace=True)
df

Unnamed: 0,urls,wanted,html
0,https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee,y,<div></div>
2,http://www.englandrugby.com/governance/discipline/judgments-document-search/,y,<div><p>\n This website uses cookies. By continuing to browse EnglandRugby.com you are agreeing to our use of cookies. Find out more by viewing our <span>privacy and cookie policy</...
3,https://xkcd.com/974/,y,"<div><p>\nThis means you&apos;re free to copy and share these comics (but not to sell them). <a href=""/license.html"">More details</a>.</p></div>"
4,https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/,y,<div></div>
5,https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html,y,"<div><p>Reverse loops are not faster, using the byte code as an indication of performance is a really bad idea. Benchmark!</p><p>On the 13th of June <a href=""https://medium.com/@TravCav"">@TravCav<..."
6,https://redditblog.com/2017/04/13/how-we-built-rplace/,y,"<div><p><span><strong>Each year for April Fools&#x2019;</strong>, rather than a prank, we like to create a project that explores the way that humans interact at large scales. This year we came up ..."
7,https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/,y,<div></div>
8,http://spypi.ch/#,y,"<div><p>SpyPi is the result of my high school graduation work that I&apos;ve created with the intention to provide a different way to make data security/protection a subject of discussion, especia..."
9,http://fsecurify.com/how-to-learn-hacking/,y,
10,https://github.com/WuTheFWasThat/vimflowy/blob/master/README.md,y,"<div><p>Vimflowy was designed to work with multiple storage backends.</p><p>By default, you own your own data, as it is stored locally on your computer.\nHowever, you can let Google host it for yo..."


In [13]:
from bs4 import BeautifulSoup
def get_text(x):
    soup = BeautifulSoup(x, 'lxml')
    text = soup.get_text()
    return text
df.loc[:, 'text'] = df['html'].map(get_text)
df

Unnamed: 0,urls,wanted,html,text
0,https://www.reddit.com/r/HowToHack/comments/5fjnxj/an_introduction_to_the_tcp_protocol_with/?st=IW4NI828&sh=92e531ee,y,<div></div>,
2,http://www.englandrugby.com/governance/discipline/judgments-document-search/,y,<div><p>\n This website uses cookies. By continuing to browse EnglandRugby.com you are agreeing to our use of cookies. Find out more by viewing our <span>privacy and cookie policy</...,\n This website uses cookies. By continuing to browse EnglandRugby.com you are agreeing to our use of cookies. Find out more by viewing our privacy and cookie policy.\n
3,https://xkcd.com/974/,y,"<div><p>\nThis means you&apos;re free to copy and share these comics (but not to sell them). <a href=""/license.html"">More details</a>.</p></div>",\nThis means you're free to copy and share these comics (but not to sell them). More details.
4,https://www.reddit.com/r/GifRecipes/comments/7dbu0h/stuffed_chicken_parmesan/,y,<div></div>,
5,https://arnaudroger.github.io/blog/2017/06/15/forward-vs-backward-loop.html,y,"<div><p>Reverse loops are not faster, using the byte code as an indication of performance is a really bad idea. Benchmark!</p><p>On the 13th of June <a href=""https://medium.com/@TravCav"">@TravCav<...","Reverse loops are not faster, using the byte code as an indication of performance is a really bad idea. Benchmark!On the 13th of June @TravCav published on medium an article arguing that reverse l..."
6,https://redditblog.com/2017/04/13/how-we-built-rplace/,y,"<div><p><span><strong>Each year for April Fools&#x2019;</strong>, rather than a prank, we like to create a project that explores the way that humans interact at large scales. This year we came up ...","Each year for April Fools’, rather than a prank, we like to create a project that explores the way that humans interact at large scales. This year we came up with Place, a collaborative canvas on ..."
7,https://www.reddit.com/r/datascience/comments/656bx1/recommended_workflow_with_azure_sql_r_and_power_bi/,y,<div></div>,
8,http://spypi.ch/#,y,"<div><p>SpyPi is the result of my high school graduation work that I&apos;ve created with the intention to provide a different way to make data security/protection a subject of discussion, especia...","SpyPi is the result of my high school graduation work that I've created with the intention to provide a different way to make data security/protection a subject of discussion, especially in class...."
9,http://fsecurify.com/how-to-learn-hacking/,y,,
10,https://github.com/WuTheFWasThat/vimflowy/blob/master/README.md,y,"<div><p>Vimflowy was designed to work with multiple storage backends.</p><p>By default, you own your own data, as it is stored locally on your computer.\nHowever, you can let Google host it for yo...","Vimflowy was designed to work with multiple storage backends.By default, you own your own data, as it is stored locally on your computer.\nHowever, you can let Google host it for you, or host it y..."


## Natural Language Processing

From the text column it is possible to call a vectorizer in order to turn the text data into a usable matrix format for Machine Learning. 


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=3)
tv = vect.fit_transform(df['text'])

In [17]:
tv

<56x662 sparse matrix of type '<class 'numpy.float64'>'
	with 3482 stored elements in Compressed Sparse Row format>

## Support Vector Machines

Building a Support Vector Machine model from the vectorised data. This step will require some evaluation of the quality of the model which has not yet been done as initially it was thought that the iterative process would ensure that the model was effective.

In [18]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
model = clf.fit(tv, df['wanted'])

In [19]:
# Evaluate the model and test

### Test Gspread credentials

In [52]:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

json_file = '/Users/Leo/Documents/Programming/CONFIG_NOT_SHARED/Custom News Feed-d7876b12a476.json'

scope = ['https://spreadsheets.google.com/feeds']

credentials = ServiceAccountCredentials.from_json_keyfile_name(json_file, scope)

gc = gspread.authorize(credentials)

In [40]:
ws = gc.open('NewsFeed')
sh = ws.sheet1
zd = list(zip(sh.col_values(2), sh.col_values(3), sh.col_values(4)))
zf = pd.DataFrame(zd, columns=['title', 'urls', 'html'])
zf.replace('', pd.np.nan, inplace=True)
zf.dropna(inplace=True)
zf.head()

Unnamed: 0,title,urls,html
0,Firefighter catches child from burning building,http://www.bbc.co.uk/news/world-us-canada-42698308,Footage captures the moment a child was dropped from a burning building in Georgia.
1,"Collapse of Colombian bridge kills nine workers, injures five",http://www.reuters.com/article/us-colombia-accident/collapse-of-colombian-bridge-kills-nine-workers-injures-five-idUSKBN1F42JS?feedType=RSS&feedName=worldNews,"BOGOTA (Reuters) - At least nine construction workers were killed and five injured when a partially-constructed bridge collapsed in central Colombia on Monday, an official from the disaster respon..."
2,Donald Trump escalates feud over 'racial slur',http://www.bbc.co.uk/news/world-us-canada-42696389,"The US president says he has been ""totally misrepresented"" by others saying he used a racial slur."


In [42]:
zf.loc[:, 'text'] = zf['html'].map(get_text) 
zf.reset_index(drop=True, inplace=True)
test_matrix = vect.transform(zf['text'])
test_matrix

<3x662 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [43]:
results = pd.DataFrame(model.predict(test_matrix), columns = ['wanted'])

In [44]:
results

Unnamed: 0,wanted
0,y
1,y
2,y


In [45]:
rez = pd.merge(results, zf, left_index=True, right_index=True)
rez

Unnamed: 0,wanted,title,urls,html,text
0,y,Firefighter catches child from burning building,http://www.bbc.co.uk/news/world-us-canada-42698308,Footage captures the moment a child was dropped from a burning building in Georgia.,Footage captures the moment a child was dropped from a burning building in Georgia.
1,y,"Collapse of Colombian bridge kills nine workers, injures five",http://www.reuters.com/article/us-colombia-accident/collapse-of-colombian-bridge-kills-nine-workers-injures-five-idUSKBN1F42JS?feedType=RSS&feedName=worldNews,"BOGOTA (Reuters) - At least nine construction workers were killed and five injured when a partially-constructed bridge collapsed in central Colombia on Monday, an official from the disaster respon...","BOGOTA (Reuters) - At least nine construction workers were killed and five injured when a partially-constructed bridge collapsed in central Colombia on Monday, an official from the disaster respon..."
2,y,Donald Trump escalates feud over 'racial slur',http://www.bbc.co.uk/news/world-us-canada-42696389,"The US president says he has been ""totally misrepresented"" by others saying he used a racial slur.","The US president says he has been ""totally misrepresented"" by others saying he used a racial slur."


### Tune the Model

In [46]:
# Hypothetical correction method
change_to_no = [1, 7, 16]

change_to_yes = [0, 9, 27]

In [47]:
for i in rez.iloc[change_to_yes].index:
    rez.iloc[i]['wanted'] = 'y'
for i in rez.iloc[change_to_no].index:
    rez.iloc[i]['wanted'] = 'n'
rez

IndexError: positional indexers are out-of-bounds

In [48]:
combined = pd.concat([df[['wanted', 'text']], rez[['wanted', 'text']]])
combined

Unnamed: 0,wanted,text
0,y,
2,y,\n This website uses cookies. By continuing to browse EnglandRugby.com you are agreeing to our use of cookies. Find out more by viewing our privacy and cookie policy.\n
3,y,\nThis means you're free to copy and share these comics (but not to sell them). More details.
4,y,
5,y,"Reverse loops are not faster, using the byte code as an indication of performance is a really bad idea. Benchmark!On the 13th of June @TravCav published on medium an article arguing that reverse l..."
6,y,"Each year for April Fools’, rather than a prank, we like to create a project that explores the way that humans interact at large scales. This year we came up with Place, a collaborative canvas on ..."
7,y,
8,y,"SpyPi is the result of my high school graduation work that I've created with the intention to provide a different way to make data security/protection a subject of discussion, especially in class...."
9,y,
10,y,"Vimflowy was designed to work with multiple storage backends.By default, you own your own data, as it is stored locally on your computer.\nHowever, you can let Google host it for you, or host it y..."


In [50]:
# Rebuild model with new data
tvcomb = vect.fit_transform(combined['text'], combined['wanted'])
model = clf.fit(tvcomb, combined['wanted'])
# Iterate this process

### Output the Model to Pickle

In [51]:
import pickle
pickle.dump(model, open(r'news_model_pickle.pkl', 'wb'))
pickle.dump(vect, open(r'news_vect_pickle.pkl', 'wb'))

## TODO

- Add news article to supervised dataset
- More articles in supervised dataset
- Flask web app
- Add config variables to config file
- Score top articles rather than simple yes/no