


![](http://aerofarms.com/wp-content/uploads/2015/04/NYTimes-banner.jpg)



# Content-Based Recommender for NYT Articles
## Introduction 


In this notebook, I create a content based recommender for New York Times articles. This recommender is an example of a very simple data product. 

We'll be recommending new articles that a user should read based on the article that they are currently reading.

In [3]:
# import relavent packages 
import numpy as np
import pandas as pd
import pickle
from sklearn.utils import shuffle
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

In [4]:
# import data
data_path = "NYT-articles.pkl"
df = pd.read_pickle(data_path)

----

### Inspect the data

In [5]:
len(df.body.unique()),  len(df.body)

(18356, 24005)

In [6]:
df.head(2)

Unnamed: 0,_id,abstract,blog,body,byline,document_type,headline,keywords,lead_paragraph,multimedia,...,print_page,pub_date,section_name,slideshow_credits,snippet,source,subsection_name,type_of_material,web_url,word_count
0,580ae247253f0a1d0316f71e,,[],TOKYO — State-backed Japan Bank for Internati...,"{'person': [], 'original': 'By REUTERS', 'orga...",article,{'main': 'Japan to Lend to Sanctioned Russian ...,[],State-backed Japan Bank for International Coop...,[],...,,2016-10-21T23:51:28Z,Business Day,,State-backed Japan Bank for International Coop...,Reuters,,News,http://www.nytimes.com/reuters/2016/10/21/busi...,
1,580adf45253f0a1d0316f71d,,[],"INTERNATIONAL\nBecause of an editing error, an...",[],article,"{'main': 'Corrections: October 22, 2016', 'pri...",[],"Corrections appearing in print on Saturday, Oc...",[],...,,2016-10-21T23:38:36Z,Corrections,,"Corrections appearing in print on Saturday, Oc...",The New York Times,,News,http://www.nytimes.com/2016/10/22/pageoneplus/...,


In [7]:
df.body[0]

'TOKYO —  State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia\'s Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday.\nSberbank, Russia\'s biggest bank, will use the yen-denominated loan to help a company operating the port of Vostochny in the Russian Far East to buy coal-handling equipment.\nJBIC will issue the loan by the end of the year in a bid to encourage progress on a dispute over a string of Russia-controlled Pacific islands, called the Northern Territories in Japan and Southern Kuriles in Russia, at a December summit.\n"JBIC\'s move to provide financing to Russia comes because  the Japanese government aims to make progress in the negotiations," the Nikkei said.\nJBIC was not available for comment.\nJapanese foreign ministry and the prime minister\'s office were not available for comment.\nThe United States and the Eu

In [8]:
df.web_url[0]

'http://www.nytimes.com/reuters/2016/10/21/business/21reuters-japan-russia-loans.html'

----

### Identify the time range in which these articles were published

In [9]:
min(df.pub_date).split("T")[0], max(df.pub_date).split("T")[0]

('2016-10-05', '2016-11-27')



The US 2016 presidential election took place in this time range. 

---
### The number of articles in each section



In [10]:
Counter(df.section_name)

Counter({'Business Day': 3615,
         'Corrections': 35,
         'U.S.': 6039,
         'World': 5947,
         'Opinion': 727,
         'Sports': 3719,
         'Crosswords & Games': 54,
         'N.Y. / Region': 354,
         'Today’s Paper': 34,
         'Arts': 1037,
         'Technology': 416,
         'Briefing': 94,
         'Real Estate': 90,
         'Theater': 134,
         'Science': 129,
         'Your Money': 44,
         'Fashion & Style': 348,
         'Times Insider': 45,
         'Food': 102,
         'Movies': 157,
         'Magazine': 76,
         'T Magazine': 82,
         'Style': 35,
         'Books': 180,
         'The Upshot': 66,
         'Health': 56,
         'Well': 92,
         'Travel': 85,
         'The Learning Network': 116,
         'Automobiles': 13,
         'Podcasts': 17,
         'Job Market': 14,
         'NYT Now': 9,
         'Obituaries': 1,
         'Public Editor': 8,
         'Sunday Review': 3,
         'Giving': 13,
         'Education

----
##  Feature Engineering 

Split the data into a "train" and "test". As well as select a metric in which to measure the similarity between articles. Think of the "train" set as the corpus. Think of the "test" set as the NYT articles that users are currently reading. 

### Split the data

In [11]:
# move articles to an array
articles = df.body.values

# move article section names to an array
sections = df.section_name.values

# move article web_urls to an array
web_url = df.web_url.values

# shuffle these three arrays 
articles, sections, web_ur = shuffle(articles, sections, web_url, random_state=4)

In [12]:
# split the shuffled articles into two arrays
n = 10

# one will have all but the last 10 articles -- think of this as your training set/corpus 
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]

# the other will have those last 10 articles -- think of this as your test set/corpus 
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

------

### Choose a Text Vectorizer 

We have options of choosing several different Text Vectorizers such as Bag-of-Words, Tf-Idf, Word2Vec, and so on. 
Here is one reason why one might choose Tf-Idf: 

Unlike BoW, Tf-Idf identifies the importance of words not merely by text frequency, but also by the inverse document frequency. So, for example, if a word like "Obama" only appears a few times in an article (like the word "he" or "a" or "the" which don't convey much information) but appears in several different articles then it would be given a higher weight. Which makes senses because "Obama" isn't a stop word nor is it mention without good reason (i.e. it's highly relevant to the article's topic). 

In [13]:
# instantiate your vectorizor 
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [14]:
# fit the vectorizer 
tfidf_vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
# transform both article splits 
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

----
### Similarity Metric

We have several different options when selecting a similarity metric such as Jacard and Cosine to name a couple. 

Jacard works by comparing two different sets and selecting the overlapping elements. Jacard similarity doesn't make sense as an option considering that we've choosen to use Tf-Idf as a vectorizer; it might make more sense to use Jacard had we selected the BoWs vectorization. 

The reason why we should choose Cosine as our similarity metric is because it make sense as an option having choosen Tf-Idf as our vectorizer. Since Tf-Idf provides weights to each token in each article, we can then take the dot product between the weights from tokens of different articles. If article A has a high weight for tokens like "Obama" and "White House" and so does article B, then their product will result in a larger similarity score had it been the case that article B had low weights for those same tokens (for simplicity assume that all other token wegihts are helf consent). 

-----
## Builidng a Content Based Recommender

This section is where the magic happends. Here you will build a function that outputs the top n articles to recommend to your user based on the similarity scores between the article they're currently reading and all other articles in the corpus (i.e. "train" data).

In [16]:
def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n = 5):
    '''This function calculates similarity scores bewteen a document and a corpus
    
       INPUT: vectorized document corpus, 2D array
              text document corpus, 1D array
              user article, 1D array
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int
              
       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores bewteen user article and entire corpus, 1D array
              '''
    # calculate similarity between the corpus (i.e. the "test" data) and the user's article
    similarity_scores = X_train_tfidf.dot(test_article.toarray().T)

    # get sorted similairty score indicies 
    sorted_indicies = np.argsort(similarity_scores, axis = 0)[::-1]

    # get sorted similarity socres
    sorted_sim_scores = similarity_scores[sorted_indicies]

    # get top n most similar documents
    top_n_recs = X_train[sorted_indicies[:n]]

    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indicies[:n]]

    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indicies[:n]]
    
    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

In [17]:
# pick an article from the "test" set
# treat this as the article that the user is currently reading
k = 0
test_article = X_test_tfidf[k]

In [18]:
# return the top n most similar articles as recommendations 
top_n_recs, rec_sections, rec_urls, sorted_sim_scores = \
get_top_n_rec_articles(X_train_tfidf, X_train,  test_article,X_train_sections, X_train_urls, n = 5 )

----
## Validate the results 

Now that you have recommended articles for the user to read (based on what they are currently reading) check to see if the results make sense. 

Compare the user's article and corresponding section name with the recommended articles and corresponding section names. 

Also take a look at the similarity scores. 

In [19]:
# similarity scores
sorted_sim_scores[:5]

array([[[0.56601716]],

       [[0.49837752]],

       [[0.4792004 ]],

       [[0.46857784]],

       [[0.46037552]]])

In [22]:
# user's article
X_test[k]

'LOS ANGELES —  The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard.  If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we\'re not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'

In [20]:
# user's article's section name
X_test_sections[k]

'U.S.'

In [21]:
# corresponding section names for top n recs 
rec_sections

array([['World'],
       ['U.S.'],
       ['U.S.'],
       ['World'],
       ['U.S.']], dtype=object)

In [22]:
# top n article recs
top_n_recs

array([['WASHINGTON —  House Speaker Paul Ryan on Tuesday called for the Pentagon to immediately suspend efforts to recover enlistment bonuses paid to thousands of soldiers in California, even as the Pentagon said late Tuesday the number of soldiers affected was smaller than first believed.\n"When those Californians answered the call to duty" to serve in Iraq and Afghanistan, "they earned more from us than bureaucratic bungling and false promises," Ryan said. He urged the Pentagon to suspend collection efforts until "Congress has time ... to protect service members from lifelong liability for DOD\'s mistakes."\nRyan\'s comments came as the White House said President Barack Obama has warned the Defense Department not to "nickel and dime" service members who were victims of fraud by overzealous recruiters.\nWhite House spokesman Josh Earnest said Tuesday he did not believe Obama would support a blanket waiver of repayments, but said California National Guard members should not be held re

In [23]:
# corresonding URLs for top n recs 
rec_urls

array([['http://www.nytimes.com/2016/10/19/dining/wine-school-assignment-montsant.html'],
       ['http://www.nytimes.com/aponline/2016/10/26/us/ap-us-oil-pipeline-news-guide.html'],
       ['http://www.nytimes.com/reuters/2016/11/01/business/01reuters-britain-boe-may-welcome.html'],
       ['http://www.nytimes.com/reuters/2016/11/01/world/europe/01reuters-hongkong-china.html'],
       ['http://www.nytimes.com/reuters/2016/11/02/us/02reuters-colorado-bomb.html']],
      dtype=object)

----
### Additoinal Resources 

http://infolab.stanford.edu/~ullman/mmds/ch9.pdf

http://benanne.github.io/2014/08/05/spotify-cnns.html

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.25.5743&rep=rep1&type=pdf