<a href="https://colab.research.google.com/github/nleconte/MachineLearning/blob/master/Run_NLP_Experiments_using_the_Feedly_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run NLP Experiments using the Feedly API

## Introduction

We live in a world where the volume of information is increasing exponentially. We believe that advances in Machine Learning and Natural Language Processing can help process and prioritize information more efficiently.

The goal of this notebook is to help you create a simple KNN classifier that takes as input a Feedly board with 50 articles about AI and learns to predict which articles from The Verge and Engadget are about AI. All this in less than 20 minutes.

You will learn:
1. How to connect to a Feedly account using the Feedly API and download articles from feeds and boards
2. How to train a classifier to reliably classify AI and non-AI articles
3. How to apply the classifier to new articles
4. How to save articles to a Feedly board

[See blog post for more information](https://blog.feedly.com)

Terminology Mapping: There is sometimes a gap between the terms used in the Feedly UI and the concepts used in the Feedly API. What is called a board in the UI is labeled tags in the API. What is called feed in the UI is labeled category in the API.


## Setup

If you do not have a Feedly account, you can create one for free at https://feedly.com.

Let's install the feedly client and the NLP library gensim first.

In [1]:
!pip install feedly-client gensim --quiet



In [2]:
from feedly.session import FeedlySession
from feedly.data import StreamOptions, Streamable
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import urllib3
urllib3.disable_warnings()

## Connect to your Feedly

To log in to your account, you need your Feedly token. The recommended way is to open the `console` page of your account [here](https://feedly.com/i/console) and click "Copy Token to clipboard".

For more advanced use, you can create a [developer token](https://developer.feedly.com/v3/developer/) instead.

To download articles, first pick one of your feeds with `category = sess.user.get_category(category_uuid)`, then call `category.stream_contents()`. Let's see how it works.

In [4]:
# TODO: Enter your feedlyToken here (keep it private; never publish a real token)
token = None

if not token:
  raise ValueError("Please enter a token")

In [5]:
# initialize session
sess = FeedlySession(auth=token)

In [6]:
# print the names of all your feeds (categories) and boards (tags)
print("Personal feeds:")
for category_uuid, category_data in sess.user.get_categories().items():
  print("\t", f'"{category_uuid}"', f"({category_data['label']})")
print("Personal boards:")
for tag_name in sess.user.get_tags():
  print("\t",f'"{tag_name}"')

Personal feeds:
	 "Glande" (Glande)
	 "Recherche" (Recherche)
	 "Tools" (Tools)
Personal boards:
	 "global.saved"


In [7]:
# print the titles of some articles

# get your first category
personal_category_uuid = list(sess.user.get_categories().keys())[0]

# download articles
entries = sess.user.get_category(personal_category_uuid).stream_contents()
entries = list(entries)

# get the titles in the json property
print(f"Articles from your personal feed '{personal_category_uuid}':")
for entry in entries[:5]:
  print("\t" + entry.json["title"])


Articles from your personal feed 'Glande':
	Saturday Morning Breakfast Cereal - Clothes
	Comic for 2019.01.23
	Saturday Morning Breakfast Cereal - Eve
	Comic for 2019.01.22
	Saturday Morning Breakfast Cereal - Swords


Now that you know how to get articles from your Feedly, let's use them to build a classifier!

## Train your Machine Learning algorithm

Feedly makes it easy to aggregate articles from multiple sources in one place. The Feedly API allows you to get a normalized JSON representation of all the articles aggregated in your Feedly.

Let’s try to create a KNN classifier that determines which articles from your Tech feed (The Verge and Engadget) are related to the AI topic.

If you do not have a Tech feed, you can use Add Content to quickly create one and add The Verge and Engadget to it.




To build a classifier, we need to train a model with positive and negative examples, so that it will be able to recognize new positive articles. We therefore need a positive board with examples of 'on-topic' articles. You can spend 10-15 minutes in Feedly saving interesting articles into a positive board (more than 20 articles; the more the better), or you can use a board we already built for you that contains articles about Artificial Intelligence.

We will call this board your `positive_board`

![machine learning filter](https://blog.feedly.com/wp-content/uploads/2019/01/Screenshot-2019-01-24-16.22.27-1024x548.png)

### Gather the data
Let's see how you can build a model that picks 'on-topic' articles out of the noise of a source feed.
We assume that the bulk of your source feed is off-topic, so it can serve as the negative class for the classifier (no need for a negative board!).

In [8]:
# print all your personal feeds
print("Here are all your personal feeds:")
for category_uuid, category_data in sess.user.get_categories().items():
  print("\t", f'"{category_uuid}"', f"({category_data['label']})")



Here are all your personal feeds:
	 "Glande" (Glande)
	 "Recherche" (Recherche)
	 "Tools" (Tools)


In [9]:
# TODO Leave 'None' to use the default source feed with content from The Verge and Engadget, or select your category (uuid) here.
source_feed = None

if source_feed is not None and source_feed not in sess.user.get_categories():
  raise ValueError('Please select an existing feed')

In [10]:
print("Here are all your boards:")
for tag_name in sess.user.get_tags():
  print("\t",f'"{tag_name}"')

Here are all your boards:
	 "global.saved"


In [11]:
# TODO: Leave 'None' to use the default board with articles about AI, or enter your positive articles board.
positive_board = None

if positive_board is not None and positive_board not in sess.user.get_tags():
  raise ValueError('Please select an existing board')


In [12]:
# download the articles

# positive articles from a board (if None, download articles about AI)
if positive_board is not None:
  positive_entries = list(sess.user.get_tag(positive_board).stream_contents())
else:
  print("downloading articles about AI...")
  import urllib.request, pickle
  url = "https://www.dropbox.com/s/p14pm6qlocz7z0g/50_ai_articles.pickle?dl=1"
  response = urllib.request.urlopen(url)
  positive_entries = list(pickle.load(response))
n_positive = len(positive_entries)

# negative articles from the source feed
if source_feed is not None:
  negative_entries = list(sess.user.get_category(source_feed).stream_contents(options=StreamOptions(max_count=4*n_positive)))
else:
  print("downloading content from The Verge and Engadget...")
  default_sources_ids = "feed/http://www.theverge.com/rss/full.xml,feed/http://www.engadget.com/rss-full.xml"
  negative_entries = list(Streamable({'id': default_sources_ids}, sess).stream_contents(options=StreamOptions(max_count=4*n_positive)))
n_negative = len(negative_entries)

# print results
if n_positive<15: raise Exception(f"You should have at least 15 positive articles (you only have {n_positive} articles at the moment)")
print(f"loaded {n_positive} positive articles and {n_negative} negative articles")
entries = positive_entries + negative_entries
labels = np.concatenate((np.ones(len(positive_entries)), np.zeros(len(negative_entries))), axis=0)

downloading articles about AI...



In [None]:
# build the dataframe

def get_text(entry):
  # collect every text field the entry may carry; missing fields become ""
  full_content = entry.json.get("fullContent", "")
  content = entry.json.get("content", {}).get("content", "")
  summary = entry.json.get("summary", {}).get("content", "")
  title = entry.json.get("title", "")
  # keep the longest field, then strip the HTML markup
  best = max(full_content, content, summary, title, key=len)
  return BeautifulSoup(best.replace("\n", ""), 'html.parser').text

def build_dataframe(entries):
  df = pd.DataFrame()
  df["eid"] = [e.json.get("id") for e in entries]         # id of the entry
  df["title"] = [e.json.get("title", "") for e in entries] # title of the entry
  df["content"] = [get_text(e) for e in entries]           # text content of the entry
  return df

df = build_dataframe(entries)
df["label"] = labels
df.head()

| | eid | title | content | label |
|---|---|---|---|---|
| 0 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | Facebook sets a new task for AI: guide a virtu... | How do you teach computers to understand langu... | 1.0 |
| 1 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | Apple’s new AI chief might actually be the rig... | What is John Giannandrea, Google’s former hea... | 1.0 |
| 2 | t9NJb6rMt3WVivgF5JJtqOwhhWhdxPuJjlu7LCw1vdk=_1... | Banks are already bumping up against the limit... | While big tech companies might not face regu... | 1.0 |
| 3 | t9NJb6rMt3WVivgF5JJtqOwhhWhdxPuJjlu7LCw1vdk=_1... | It’s Google’s turn to ask the questions | At Google’s annual developer conference this ... | 1.0 |
| 4 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | The White House has set up a task force to hel... | The White House has set up a new task force d... | 1.0 |


Now that we have our dataset, let's preprocess the text content and build a model!

### Preprocess the articles content with TF-IDF

We first lemmatize and stem each word before building a TF-IDF vector for each article.

Lemmatization maps a word to its dictionary form (here, treating it as a verb), and stemming then strips the remaining suffixes, so that variants of the same word end up as a single token.

Example:
```
technology, technologies -> technolog
learnt, learns, learning -> learn
```
We also remove stop words and tokenize the texts (split each text into a list of words).
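
To make the tokenization and stop word removal step concrete, here is a minimal pure-Python sketch. It is not what gensim does internally: the function name `toy_tokenize` and the tiny stop word set are assumptions made up for this illustration (gensim's `STOPWORDS` list is far larger), and no lemmatization or stemming is applied here.

```python
import re

# Tiny illustrative stop word list (assumption); gensim's STOPWORDS is far larger.
TOY_STOPWORDS = {"the", "a", "is", "to", "of", "and", "in"}

def toy_tokenize(text):
    """Lowercase, split on non-letters, drop stop words and one-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 1]

print(toy_tokenize("The robot is learning to read the news."))
# ['robot', 'learning', 'read', 'news']
```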

Then for each word of each article, we compute its TF-IDF value.

For a term $i$ in a document $j$:
>$W_{i,j} = tf_{i,j} * \log\frac{N}{df_i}$

where:

>$tf_{i,j} =$ term frequency of $i$ in $j$

>$df_i = $ number of documents containing $i$

>$N = $ total number of documents
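
As a worked illustration of the formula above, the following sketch computes the weights by hand for a toy corpus of three tokenized documents. The corpus and the function name `tfidf_weights` are made up for this example, and the natural logarithm is an assumption: gensim's `TfidfModel` uses a base-2 logarithm and L2-normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized.
docs = [
    ["ai", "learn", "model"],
    ["phone", "camera", "model"],
    ["ai", "ai", "robot"],
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using W = tf * log(N / df)."""
    n_docs = len(docs)
    # df: number of documents that contain each term at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tfidf_weights(docs)
for w in weights:
    print({t: round(v, 3) for t, v in w.items()})
```

In the first document, "learn" appears in only one of the three documents, so it outweighs "ai" and "model", which each appear in two. A term present in every document would get a weight of 0, since $\log\frac{N}{N} = 0$.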


In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
nltk.download('wordnet')
stemmer = SnowballStemmer('english')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
# preprocess text
lemmatizer = WordNetLemmatizer()  # build once instead of once per token

def lemmatize_stemming(token):
    # lemmatize as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # drop stop words and one-letter tokens
        if token not in STOPWORDS and len(token) > 1:
            result.append(lemmatize_stemming(token))
    return result
  
df["tokenized_content"] = df["content"].map(preprocess)
df.head()

| | eid | title | content | label | tokenized_content |
|---|---|---|---|---|---|
| 0 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | Facebook sets a new task for AI: guide a virtu... | How do you teach computers to understand langu... | 1.0 | [teach, comput, understand, languag, transcrib... |
| 1 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | Apple’s new AI chief might actually be the rig... | What is John Giannandrea, Google’s former hea... | 1.0 | [john, giannandrea, googl, head, search, ai, g... |
| 2 | t9NJb6rMt3WVivgF5JJtqOwhhWhdxPuJjlu7LCw1vdk=_1... | Banks are already bumping up against the limit... | While big tech companies might not face regu... | 1.0 | [big, tech, compani, face, regul, artifici, in... |
| 3 | t9NJb6rMt3WVivgF5JJtqOwhhWhdxPuJjlu7LCw1vdk=_1... | It’s Google’s turn to ask the questions | At Google’s annual developer conference this ... | 1.0 | [googl, annual, develop, confer, past, week, c... |
| 4 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | The White House has set up a task force to hel... | The White House has set up a new task force d... | 1.0 | [white, hous, set, new, task, forc, dedic, art... |


In [None]:
vocab = corpora.Dictionary(df.tokenized_content)

In [None]:
# apply tfidf model
def to_vector(key_value_tuples, vector_dim, default_value=0):
    rv = np.ones(vector_dim) * default_value
    for key, val in key_value_tuples:
        rv[key] = val
    return rv
  
df["bow_corpus"] = df.tokenized_content.map(vocab.doc2bow)
tfidf = models.TfidfModel(df.bow_corpus)
df["tfidf"] = df.bow_corpus.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df.head()

| | eid | title | content | label | tokenized_content | bow_corpus | tfidf |
|---|---|---|---|---|---|---|---|
| 0 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | Facebook sets a new task for AI: guide a virtu... | How do you teach computers to understand langu... | 1.0 | [teach, comput, understand, languag, transcrib... | [(0, 2), (1, 1), (2, 1), (3, 5), (4, 16), (5, ... | [0.03314139276585894, 0.01696869807979925, 0.0... |
| 1 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | Apple’s new AI chief might actually be the rig... | What is John Giannandrea, Google’s former hea... | 1.0 | [john, giannandrea, googl, head, search, ai, g... | [(4, 11), (6, 7), (11, 2), (15, 2), (16, 1), (... | [0.0, 0.0, 0.0, 0.0, 0.1501294832085813, 0.0, ... |
| 2 | t9NJb6rMt3WVivgF5JJtqOwhhWhdxPuJjlu7LCw1vdk=_1... | Banks are already bumping up against the limit... | While big tech companies might not face regu... | 1.0 | [big, tech, compani, face, regul, artifici, in... | [(4, 5), (5, 1), (21, 2), (33, 2), (34, 1), (5... | [0.0, 0.0, 0.0, 0.0, 0.1298592676820325, 0.049... |
| 3 | t9NJb6rMt3WVivgF5JJtqOwhhWhdxPuJjlu7LCw1vdk=_1... | It’s Google’s turn to ask the questions | At Google’s annual developer conference this ... | 1.0 | [googl, annual, develop, confer, past, week, c... | [(1, 1), (4, 1), (6, 6), (9, 1), (11, 2), (13,... | [0.0, 0.019245058305825548, 0.0, 0.0, 0.018573... |
| 4 | PSNTZO8gXFUe+cpCZyApw0vEKWPT4b14D6teBEocIAE=_1... | The White House has set up a task force to hel... | The White House has set up a new task force d... | 1.0 | [white, hous, set, new, task, forc, dedic, art... | [(4, 14), (20, 1), (53, 1), (69, 1), (76, 1), ... | [0.0, 0.0, 0.0, 0.0, 0.3362381485574112, 0.0, ... |
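
The `bow_corpus` and `tfidf[bow]` values are sparse: each is a list of `(token_id, value)` pairs that only mentions the words present in the article, whereas the classifier needs one fixed-length row per article. The sketch below shows the same expansion that `to_vector` performs, written with plain Python lists; the name `to_dense` and the sample pairs are hypothetical.

```python
def to_dense(sparse_pairs, dim, default=0.0):
    """Expand [(index, value), ...] into a dense list of length dim."""
    dense = [default] * dim
    for index, value in sparse_pairs:
        dense[index] = value
    return dense

# Two of the five vocabulary slots carry a weight; the rest keep the default.
print(to_dense([(0, 0.5), (3, 0.25)], dim=5))  # [0.5, 0.0, 0.0, 0.25, 0.0]
```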


The vectorization part is done! We can now use these vectors to train a classifier.

### Train the model

We will use a [K Nearest Neighbors classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) applied after a [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), which projects the TF-IDF vectors onto their principal components.
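
To make the idea behind KNN concrete, here is a minimal pure-Python sketch of majority voting among the nearest neighbors. It is an illustration only, not scikit-learn's implementation: the function `knn_predict` and the toy 2-D points are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data (hypothetical): label 1 clusters near (1, 1), label 0 near (5, 5).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(points, labels, (1.1, 1.0)))  # 1
print(knn_predict(points, labels, (4.9, 5.0)))  # 0
```

In the notebook, the points are the PCA-projected TF-IDF vectors and the labels are 1 for AI articles and 0 for the rest.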

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

In [None]:
# split into a train set and a test set, so that we can evaluate the model on unseen articles
X = np.array(df.tfidf.tolist())
y = np.array(df.label.tolist())
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
# apply pca
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
# print the word with the highest absolute weight in each of the first 30 principal components
words = set()
for component in pca.components_[:30]:
  # each component holds one weight per vocabulary word, so search over all of them
  words.add(vocab[int(np.argmax(np.abs(component)))])
print(words)

In [None]:
# train knn model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_pca, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

You can now compute the accuracy of your classifier. You should see something between 0.80 and 1. Higher is better.

In [None]:
# score on train set (accuracy)
model.score(X_train_pca, y_train)

In [None]:
# score on test set (accuracy)
model.score(X_test_pca, y_test)
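
Accuracy alone can be flattering here: the dataset holds up to four negative articles per positive one, so a model that always answers "not AI" would already score about 0.80. Precision and recall on the positive class are a useful complement. The sketch below computes them in plain Python; the function `precision_recall` and the toy labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (hypothetical): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                                   # 0.8
print(precision_recall(y_true, always_negative))  # (0.0, 0.0)
```

With the notebook's variables, the equivalent call would be `precision_recall(y_test, model.predict(X_test_pca))`.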

Your smart assistant is ready to classify new articles! Let's save the interesting new articles into one of your boards.

## Apply the model on new articles and save them

To do so, we will download many articles from your source feed, apply the classifier and then use the `feedly-client` to save them into a board.

To save an article to a board named `board_name` you can do as follows:

```
sess.user.get_tag(board_name).tag_entry(entry_id)
```

In [None]:
# download new entries
max_count = 300
if source_feed is not None:
  new_articles = list(sess.user.get_category(source_feed).stream_contents(options=StreamOptions(max_count=max_count)))
else:
  new_articles = list(Streamable({'id': default_sources_ids}, sess).stream_contents(options=StreamOptions(max_count=max_count)))

In [None]:
def preprocess_articles(articles):
  df = build_dataframe(articles)
  df["tokenized_content"] = df["content"].map(preprocess)
  return df
def compute_tfidf(df):
  df["bow_corpus"] = df.tokenized_content.map(vocab.doc2bow)
  df["tfidf"] = df.bow_corpus.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
  df["X"] = list(pca.transform(df.tfidf.tolist()))
  return df.X.tolist()

# process new entries data
preprocessed_articles = preprocess_articles(new_articles)
tfidf_vectors = compute_tfidf(preprocessed_articles)

# apply knn model
predictions = model.predict(tfidf_vectors)


In [None]:
# store predictions
new_df = preprocessed_articles
new_df["pred"] = predictions

In [None]:
# show results
print(f"Predicted {len(new_df.loc[new_df.pred==1])} positive articles. A few examples:\n{new_df.loc[new_df.pred==1].head().title}\n...")
print(f"Predicted {len(new_df.loc[new_df.pred==0])} negative articles. A few examples:\n{new_df.loc[new_df.pred==0].head().title}\n...")

Predicted 14 positive articles. A few examples:
17     Google is using AI to help The New York Times ...
23     Amazon told employees it would continue to sel...
125    China implements tech that can detect people b...
160    Microsoft Word is getting a to-do feature to h...
182     The way we text says a lot about our personality
Name: title, dtype: object
...
Predicted 286 negative articles. A few examples:
0    Google Walkout organizers: changes are a start...
1    The fake video era of US politics has arrived ...
2    Tesla's Model 3 gets quicker cornering with 'T...
3    New Apple patent hints that rumored over-ear h...
4    BMW’s i8 Roadster is a daily driver in superca...
Name: title, dtype: object
...


So, what do you think of the results? They depend heavily on the data the classifier is trained on, but in general you should see that the predicted positive articles are close in topic to your positive board.

**Feel free to try with other source feeds and boards!**

You can do that by changing the `source_feed` and `positive_board` values and rerunning the notebook.

If you are happy with the results, you can enter the board name in which you want the new articles to be saved:

In [None]:
print("Here are all your boards:")
for tag_name in sess.user.get_tags():
  print("\t",f'"{tag_name}"')

In [None]:
# TODO: Enter your output board, where the new positive articles will be saved.
# If the board doesn't exist, a new board will be created for you.
output_board = None

if output_board is None:
  raise ValueError("Please enter a board name")

In [None]:
# tag the articles

def get_positive_predictions(df):
  # filter the dataframe passed in, not the global new_df
  return df.loc[df.pred == 1]

tag = sess.user.get_tag(output_board)
output_articles = get_positive_predictions(new_df)

# save each predicted-positive article into the output board
for eid, title in zip(output_articles.eid, output_articles.title):
  tag.tag_entry(eid)
  print(f"Saved '{title}' into the board '{output_board}'")

You can now check your Feedly to see the new articles saved into your board! Feel free to build different classifiers for different topics and positive boards.

Note for developers: the Feedly API is rate-limited to 250 requests per day (500 requests per day for Feedly Pro and Team accounts).

## Next

You can now go further: build richer datasets in Feedly around topics or trends, and use a better model:
- **Extract other features** from the articles, depending on your task (keyword detection, entity recognition, word/sentence embeddings, article size...)
- **Use other classifiers** instead of KNN: naive Bayes, SVM, decision trees, neural networks...

## Thank you!

Congratulations, you just built yourself a smart assistant for your Feedly account!
Feel free to give feedback in the comments section of the Feedly blog post, or email me at quentin@feedly.com (or at lhoest.q@gmail.com).

Passionate about the Web, NLP and Machine Learning? [Join the Feedly Lab on Slack](https://join.slack.com/t/feedlylab/shared_invite/enQtNDEyMzQ2Nzk4OTQ1LWQ3ZmExOWYwNDUwN2U4Yzg2MDZjNDJmZDQ4YTFiY2RmYjIyMjBmOGZiZDQwODQxZjRiZDY2Mzc1NTc1YjNjMmQ) and connect with the Feedly machine learning team!

Keep reading!

- [Getting started with Machine Learning](https://blog.feedly.com/getting-started-with-machine-learning/) <- a list of many good references

- [Transfer Learning in NLP](https://blog.feedly.com/transfer-learning-in-nlp/) <- advanced transfer learning techniques to improve your NLP model