Build Your Own News Search Engine

Objective: Use text feature engineering (TF-IDF) and a few simple rules to build our first search engine for news articles. For any input query, we’ll present the five most relevant news articles.

Problem Statement: 
Reuters Ltd. is an international news agency headquartered in London and is a division of Thomson Reuters. The data was originally collected and labeled by Carnegie Group Inc. and Reuters Ltd. in the course of developing the CONSTRUE text categorization system.

An important step before assessing similarity between documents, or between documents and a search query, is choosing the right representation, i.e., correct feature engineering. We’ll build a process that returns the news articles most similar to a given text string (the search query).

Domain: News

Analysis to be done: Assess document similarity to a search query using TF-IDF

Content: 
Dataset: ‘r8-all-terms.txt’
The dataset has no header. Each row has a label and the article text.




In [1]:
import io
import os
import pickle
import re
import sys
from string import punctuation, digits
from sys import path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display, HTML
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer

Using pandas, read in the text file.

    Use the right delimiter (the file is tab-separated).

    The file has no header, so give the column names Label and Text while loading.

In [2]:
doc = 'r8-all-terms.txt'
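# The file is tab-separated with no header row, so the column names are supplied here.
# pd.read_csv(doc, sep='\t', names=['Label', 'Text']) would work as well.
data = pd.read_table(doc, sep='\t', names=['Label', 'Text'])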

In [3]:
data.Label.value_counts()

earn        2840
acq         1596
crude        253
trade        251
money-fx     206
interest     190
ship         108
grain         41
Name: Label, dtype: int64

In [4]:
data

      Label     Text
0     earn      champion products ch approves stock split cham...
1     acq       computer terminal systems cpml completes sale ...
2     earn      cobanco inc cbco year net shr cts vs dlrs net ...
3     earn      am international inc am nd qtr jan oper shr lo...
4     earn      brown forman inc bfd th qtr net shr one dlr vs...
...   ...       ...
5480  earn      kelly oil and gas partners kly year dec shr ct...
5481  money-fx  japan seeks to strengthen paris currency accor...
5482  earn      tcw convertible securities cvt sets dividend t...
5483  money-fx  south korean won fixed at month high the bank ...


In [5]:
data.Text[0]

'champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter '

Get the text data into a list for easy manipulation

In [6]:
article= data.Text.values

In [7]:
len(article)

5485

In [8]:
article[:4]

array(['champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter ',
       'computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warr

Preprocess method 1

In [9]:
import nltk

In [10]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

In [11]:
def preprocess(text):
    """Tokenize and lowercase text, dropping stop words and tokens of 3 characters or fewer."""
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(token)
    return result

In [12]:
# Preprocess every news article by iterating over the full list of documents
processed_docs = []

for sentence in article:
    processed_docs.append(preprocess(sentence))

print(processed_docs[:2])


[['champion', 'products', 'approves', 'stock', 'split', 'champion', 'products', 'said', 'board', 'directors', 'approved', 'stock', 'split', 'common', 'shares', 'shareholders', 'record', 'april', 'company', 'said', 'board', 'voted', 'recommend', 'shareholders', 'annual', 'meeting', 'april', 'increase', 'authorized', 'capital', 'stock', 'shares', 'reuter'], ['terminal', 'systems', 'cpml', 'completes', 'sale', 'terminal', 'systems', 'said', 'completed', 'sale', 'shares', 'common', 'stock', 'warrants', 'acquire', 'additional', 'shares', 'sedio', 'lugano', 'switzerland', 'dlrs', 'company', 'said', 'warrants', 'exercisable', 'years', 'purchase', 'price', 'dlrs', 'share', 'terminal', 'said', 'sedio', 'right', 'additional', 'shares', 'increase', 'total', 'holdings', 'terminal', 'outstanding', 'common', 'stock', 'certain', 'circumstances', 'involving', 'change', 'control', 'company', 'company', 'said', 'conditions', 'occur', 'warrants', 'exercisable', 'price', 'equal', 'common', 'stock', 'marke

In [13]:
#BOW in the dataset

dictionary = gensim.corpora.Dictionary(processed_docs)


In [14]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break


0 annual
1 approved
2 approves
3 april
4 authorized
5 board
6 capital
7 champion
8 common
9 company
10 directors


Gensim's filter_extremes filters out tokens that appear in:

(1) fewer than no_below documents (an absolute number), or
(2) more than no_above documents (a fraction of the total corpus size, not an absolute number).
After (1) and (2), it keeps only the first keep_n most frequent tokens (or keeps all of them if keep_n is None).
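
To make the three rules concrete, here is a minimal pure-Python sketch of the same filtering logic. It is not Gensim's implementation: the function name filter_extremes_sketch and the toy documents are hypothetical, and ranking the survivors by document frequency is an assumption about what "most frequent" means here.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering steps."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Steps (1) and (2): drop tokens that are too rare or too common.
    max_docs = no_above * num_docs
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Step (3): keep only the keep_n tokens found in the most documents
    # (assumption: ties are broken alphabetically).
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

# Hypothetical toy corpus: "reuter" is in every document, several words in only one.
toy_docs = [
    ["oil", "price", "reuter"],
    ["oil", "barrel", "reuter"],
    ["stock", "split", "reuter"],
    ["stock", "oil", "reuter"],
]
surviving = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=None)
print(sorted(surviving))  # ['oil', 'stock']
```

In the toy corpus, "reuter" is dropped for being too common and the single-document words for being too rare, which is the same effect the real call has on boilerplate and one-off tokens.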

In [15]:
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)


Gensim doc2bow


Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples.
Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). 
No further preprocessing is done on the words in the document; apply tokenization, stemming, etc. 
before calling this method.

● Create the bag-of-words model for each document, i.e., for each document build a list recording which words 
appear and how many times each one appears. Save this to 'bow_corpus'.
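
The bag-of-words format can be illustrated without Gensim. The sketch below is a simplified stand-in, not the library's code: build_token_ids and doc2bow_sketch are hypothetical helper names, and assigning ids in order of first appearance is an assumption made for readability.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    """Convert a token list into sorted (token_id, token_count) pairs.

    Tokens missing from the vocabulary are silently skipped.
    """
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Hypothetical toy corpus used only to build a small vocabulary.
toy_docs = [["stock", "split", "stock"], ["oil", "price", "oil", "oil"]]
token2id = build_token_ids(toy_docs)
bow = doc2bow_sketch(["oil", "stock", "oil", "unknown"], token2id)
print(bow)  # [(0, 1), (2, 2)]
```

Each pair reads as "token id, number of occurrences", so word order is discarded and only the counts remain.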

In [16]:
bow_corpus = [dictionary.doc2bow(i) for i in processed_docs]

document_num = 20
bow_doc_x = bow_corpus[document_num]

for token_id, token_count in bow_doc_x:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 27 ("delivery") appears 1 time.
Word 100 ("chicago") appears 4 time.
Word 154 ("lower") appears 2 time.
Word 157 ("outlook") appears 1 time.
Word 164 ("rates") appears 2 time.
Word 170 ("today") appears 1 time.
Word 234 ("offered") appears 10 time.
Word 306 ("ships") appears 1 time.
Word 307 ("south") appears 2 time.
Word 340 ("dealers") appears 1 time.
Word 341 ("demand") appears 1 time.
Word 342 ("exchange") appears 1 time.
Word 343 ("firmed") appears 1 time.
Word 344 ("freight") appears 2 time.
Word 345 ("gulf") appears 3 time.
Word 346 ("illinois") appears 4 time.
Word 347 ("included") appears 1 time.
Word 348 ("increasing") appears 1 time.
Word 349 ("level") appears 1 time.
Word 350 ("ohio") appears 2 time.
Word 351 ("percentage") appears 1 time.
Word 352 ("points") appears 4 time.
Word 353 ("quoted") appears 1 time.
Word 354 ("river") appears 9 time.
Word 355 ("section") appears 1 time.
Word 356 ("sept") appears 2 time.
Word 357 ("session") appears 1 time.
Word 358 ("station

Preprocess method 2

Case normalization

In [17]:
article_lower = [art.lower() for art in article]
article_lower[:2]

['champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter ',
 'computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would b

Tokenize the articles
    
    Use NLTK's word_tokenize for this

In [18]:
article_tokens = [word_tokenize(art) for art in article_lower]
print(article_tokens[:3])

[['champion', 'products', 'ch', 'approves', 'stock', 'split', 'champion', 'products', 'inc', 'said', 'its', 'board', 'of', 'directors', 'approved', 'a', 'two', 'for', 'one', 'stock', 'split', 'of', 'its', 'common', 'shares', 'for', 'shareholders', 'of', 'record', 'as', 'of', 'april', 'the', 'company', 'also', 'said', 'its', 'board', 'voted', 'to', 'recommend', 'to', 'shareholders', 'at', 'the', 'annual', 'meeting', 'april', 'an', 'increase', 'in', 'the', 'authorized', 'capital', 'stock', 'from', 'five', 'mln', 'to', 'mln', 'shares', 'reuter'], ['computer', 'terminal', 'systems', 'cpml', 'completes', 'sale', 'computer', 'terminal', 'systems', 'inc', 'said', 'it', 'has', 'completed', 'the', 'sale', 'of', 'shares', 'of', 'its', 'common', 'stock', 'and', 'warrants', 'to', 'acquire', 'an', 'additional', 'one', 'mln', 'shares', 'to', 'sedio', 'n', 'v', 'of', 'lugano', 'switzerland', 'for', 'dlrs', 'the', 'company', 'said', 'the', 'warrants', 'are', 'exercisable', 'for', 'five', 'years', 'at'

Remove stop words

In [19]:
from nltk.corpus import stopwords

stop_nltk = stopwords.words("english")

In [20]:
stop_nltk[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [21]:
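stop_set = set(stop_nltk)  # set membership tests are much faster than list lookups

def del_stop(tokens):
    """Return the tokens that are not NLTK English stop words."""
    return [term for term in tokens if term not in stop_set]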

In [22]:
article_delstop = [del_stop(art) for art in article_tokens] 

In [23]:
article_delstop[:2]

[['champion',
  'products',
  'ch',
  'approves',
  'stock',
  'split',
  'champion',
  'products',
  'inc',
  'said',
  'board',
  'directors',
  'approved',
  'two',
  'one',
  'stock',
  'split',
  'common',
  'shares',
  'shareholders',
  'record',
  'april',
  'company',
  'also',
  'said',
  'board',
  'voted',
  'recommend',
  'shareholders',
  'annual',
  'meeting',
  'april',
  'increase',
  'authorized',
  'capital',
  'stock',
  'five',
  'mln',
  'mln',
  'shares',
  'reuter'],
 ['computer',
  'terminal',
  'systems',
  'cpml',
  'completes',
  'sale',
  'computer',
  'terminal',
  'systems',
  'inc',
  'said',
  'completed',
  'sale',
  'shares',
  'common',
  'stock',
  'warrants',
  'acquire',
  'additional',
  'one',
  'mln',
  'shares',
  'sedio',
  'n',
  'v',
  'lugano',
  'switzerland',
  'dlrs',
  'company',
  'said',
  'warrants',
  'exercisable',
  'five',
  'years',
  'purchase',
  'price',
  'dlrs',
  'per',
  'share',
  'computer',
  'terminal',
  'said',
  's

Feature engineering: Use TF-IDF to represent each document.
Instantiate TF-IDF vectorizer with 3000 vocabulary size.
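
For intuition about what the vectorizer computes, here is a small pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which to my understanding matches scikit-learn's defaults. The function name tfidf_sketch and the toy documents are hypothetical, and the sketch ignores max_features and tokenization details.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        # Scale each row to unit length so long and short articles are comparable.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows

# Hypothetical toy corpus: "reuter" appears in both documents, the other words in one.
toy_docs = [["oil", "price", "reuter"], ["stock", "split", "reuter"]]
vocabulary, rows = tfidf_sketch(toy_docs)
for tok, weight in zip(vocabulary, rows[0]):
    print(f"{tok:8s}{weight:.3f}")
```

In the first row, "oil" and "price" receive a higher weight than "reuter" because "reuter" occurs in every document and therefore discriminates less.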


In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(max_features=3000)


In [25]:
vectorizer

TfidfVectorizer(max_features=3000)

The vectorizer expects strings, not lists of tokens. Join the tokens back into a single string for each article.


In [26]:
articles_string = [" ".join(art) for art in article_delstop]
articles_string[:3]



['champion products ch approves stock split champion products inc said board directors approved two one stock split common shares shareholders record april company also said board voted recommend shareholders annual meeting april increase authorized capital stock five mln mln shares reuter',
 'computer terminal systems cpml completes sale computer terminal systems inc said completed sale shares common stock warrants acquire additional one mln shares sedio n v lugano switzerland dlrs company said warrants exercisable five years purchase price dlrs per share computer terminal said sedio also right buy additional shares increase total holdings pct computer terminal outstanding common stock certain circumstances involving change control company company said conditions occur warrants would exercisable price equal pct common stock market price time exceed dlrs per share computer terminal also said sold technolgy rights dot matrix impact technology including future improvements woodco inc hou

Apply TF-IDF vectorization to the articles and determine the shape of the matrix.


In [27]:
articles_tfidf =vectorizer.fit_transform(articles_string)
articles_tfidf.shape


(5485, 3000)

Convert it to a dense matrix: the output is currently a sparse matrix; 
convert it to a dense matrix for ease of access.

In [30]:
tfidf_dense = articles_tfidf.todense()
type(tfidf_dense)


numpy.matrix

In this representation, every article is a vector, so we can compute the similarity between any two articles.
Calculate the cosine similarity between the fourth and fifth articles (indices three and four).
What score do you get? Does it make sense? (You’ll need to look at the strings to assess.)
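
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python. The function name cosine_sketch and the example vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_sketch(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an empty vector matches nothing
    return dot / (norm_u * norm_v)

same_direction = cosine_sketch([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # close to 1
no_overlap = cosine_sketch([1.0, 0.0], [0.0, 3.0])                # exactly 0
print(same_direction, no_overlap)
```

Because TF-IDF weights are never negative, scores for articles fall between 0 (no shared terms) and 1 (identical direction), which helps when judging whether a value such as 0.52 is high.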


In [32]:
from sklearn.metrics.pairwise import cosine_similarity
#Checking the similarity between the fourth and fifth articles (rows 3 and 4)
cosine_similarity(tfidf_dense[3,:], tfidf_dense[4,:])


array([[0.51969816]])

In [33]:
#This is good similarity. Let’s have a look at the strings to see if it makes sense.
articles_string[3:5]


['international inc nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated payment preferred dividends results exclude credits four cts nine cts qtr six mths vs six cts cts prior periods operating loss carryforwards reuter',
 'brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter']

Search engine: Find the five most relevant articles for any given query and fetch their text.
We’ll define a function ‘get_top5_query’ that takes the input string ‘qry’,
creates the TF-IDF vector for that string, and returns the five closest articles.
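
A query is vectorized with the vocabulary and IDF weights that were learned from the articles, so words the vectorizer never saw contribute nothing. The sketch below illustrates that idea in plain Python. It is not the vectorizer's code: query_vector_sketch is a hypothetical name, whitespace splitting stands in for real tokenization, and the vocabulary and IDF values are made up.

```python
import math
from collections import Counter

def query_vector_sketch(query, vocabulary, idf):
    """Map a query string onto a fixed vocabulary; unknown words are dropped."""
    counts = Counter(query.lower().split())
    raw = [counts[tok] * idf[tok] for tok in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    # An all-zero vector is returned unchanged to avoid dividing by zero.
    return [w / norm for w in raw] if norm else raw

# Hypothetical vocabulary and IDF weights standing in for a fitted vectorizer.
vocabulary = ["crude", "oil", "price", "stock"]
idf = {"crude": 2.0, "oil": 1.5, "price": 1.2, "stock": 1.0}

known = query_vector_sketch("crude oil futures", vocabulary, idf)
unknown = query_vector_sketch("weather forecast", vocabulary, idf)
print(known)
print(unknown)
```

A query made up entirely of out-of-vocabulary words yields an all-zero vector, so every article scores 0 and the "top five" are arbitrary. A real search function may want to detect that case.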


As a first step, define a function that does the following for an article already in the corpus:

a. For any given row number, extract the TF-IDF vector

b. Compute similarity of this vector with all the others

c. Get indices of the top 5 matches

d. Return the text for the top 5 matches, and the text of the target row
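
Step c can be sketched on its own. The example below picks the best-scoring rows while skipping the target row by index. The function name top_matches_sketch and the toy scores are hypothetical, and heapq.nlargest is one of several reasonable ways to select the top entries.

```python
import heapq

def top_matches_sketch(scores, target_row, k=5):
    """Return the k (row, score) pairs with the highest score, skipping target_row."""
    candidates = ((row, s) for row, s in enumerate(scores) if row != target_row)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Made-up similarity scores; row 4 is the target and trivially scores 1.0 against itself.
toy_scores = [0.08, 0.04, 0.62, 0.52, 1.00, 0.90]
best = top_matches_sketch(toy_scores, target_row=4, k=3)
print(best)  # [(5, 0.9), (2, 0.62), (3, 0.52)]
```

Excluding the target by its index, rather than dropping the first sorted result, stays correct even if another article is an exact duplicate and also scores 1.0.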



In [35]:
target_row = 4
target_vector = tfidf_dense[target_row,:]

print(articles_string[target_row])

brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter


Compute the similarity between this vector and all the others.
Iterate over the vectors in the TF-IDF matrix and store each similarity score in a list.


In [36]:
sim_scores = []

for vector in tfidf_dense:
    sim = cosine_similarity(target_vector, vector)[0][0]
    sim_scores.append(sim)

Making a pandas series of similarity scores for easy manipulation



In [37]:
len(sim_scores)

5485

In [38]:
tfidf_dense.shape[0]

5485

In [39]:
similarity = pd.Series(sim_scores)
similarity.head()

0    0.076767
1    0.037874
2    0.619848
3    0.519698
4    1.000000
dtype: float64

In [40]:
# Take the six best scores and drop the first, which is the target article itself (score 1.0)
top5_scores = similarity.sort_values(ascending=False).head(6)[1:]

In [41]:
top5_scores 

3633    0.895294
1526    0.884519
3939    0.873976
3686    0.871784
427     0.871580
dtype: float64

In [42]:
top5_index = top5_scores.index.values
top5_index

array([3633, 1526, 3939, 3686,  427])

In [43]:
for ind in top5_index:
    print("Similarity score:" + str(round(top5_scores[ind],2)) + "\n" + "Article text: " + articles_string[ind] + "\n")

Similarity score:0.9
Article text: technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.88
Article text: vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter

Similarity score:0.87
Article text: nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: quick reilly group bqr th qtr feb shr cts vs cts net mln vs mln revs mln vs mln year shr dlrs vs dlr net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: kay jewelers inc kji th qtr net shr dlrs vs dlrs net mln vs revs mln vs mln year shr dlrs vs dlrs net vs revs mln vs mln reuter



In [44]:
def get_top5(target_row):
    """Print the five articles most similar to the article at target_row."""
    target_vector = tfidf_dense[target_row,:]
    
    sim_scores = []
    for vector in tfidf_dense:
        sim = cosine_similarity(target_vector, vector)[0][0]
        sim_scores.append(sim)
    
    similarity = pd.Series(sim_scores)
    # Drop the best match, which is the target article itself
    top5_scores = similarity.sort_values(ascending=False).head(6)[1:]
    top5_index = top5_scores.index.values
    
    for ind in top5_index:
        print("Similarity score:" + str(round(top5_scores[ind],2)) + "\n" + "Article text: " + articles_string[ind] + "\n")

In [45]:
get_top5(4)


Similarity score:0.9
Article text: technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.88
Article text: vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter

Similarity score:0.87
Article text: nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: quick reilly group bqr th qtr feb shr cts vs cts net mln vs mln revs mln vs mln year shr dlrs vs dlr net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: kay jewelers inc kji th qtr net shr dlrs vs dlrs net mln vs revs mln vs mln year shr dlrs vs dlrs net vs revs mln vs mln reuter



In [48]:
def get_top5_query(qry):
    """Print the five articles most similar to the search query string."""
    # Project the query into the TF-IDF space learned from the articles
    target_vector = vectorizer.transform([qry])
    
    sim_scores = []
    for vector in tfidf_dense:
        sim = cosine_similarity(target_vector, vector)[0][0]
        sim_scores.append(sim)
    
    similarity = pd.Series(sim_scores)
    # No row to skip here: the query is not part of the corpus
    top5_scores = similarity.sort_values(ascending=False).head(5)
    top5_index = top5_scores.index.values
    
    print("Search query: " + qry + "\n")
    
    for ind in top5_index:
        print("Similarity score:" + str(round(top5_scores[ind],2)) + "\n" + "Article text: " + articles_string[ind] + "\n")

In [49]:
get_top5_query("crude oil price")

Search query: crude oil price

Similarity score:0.49
Article text: phillips p raises crude postings cts phillips petroleum said raised contract price grades crude oil cts barrel effective today increase brings phillip posted price west texas intermediate west texas sour grades dlrs bbl phillips last changed crude oil postings march price increase follows similar moves usx x subsidiary marathon oil sun co sun earlier today reuter

Similarity score:0.44
Article text: marathon petroleum reduces crude postings marathon petroleum co said reduced contract price pay grades crude oil one dlr barrel effective today decrease brings marathon posted price west texas intermediate west texas sour dlrs bbl south louisiana sweet grade crude reduced dlrs bbl company last changed crude postings jan reuter

Similarity score:0.43
Article text: diamond shamrock dia cuts crude prices diamond shamrock corp said effective today cut contract prices crude oil dlrs barrel reduction brings posted price west texas

In [50]:
get_top5_query("computer systems")

Search query: computer systems

Similarity score:0.55
Article text: vertex vetx buy computer transceiver stake vertex industries inc computer transceiver systems inc jointly announced agreement vertex acquire pct interest computer completes proposed reorganization computer reorganization proceedings chapter since september companies said agreement would allow computer unsecured creditors debenture holders receive new stock exchange exsiting debt shareholders receive one new share computer stock four shares previously held companies said united states bankruptcy court southern district new york given preliminary approval proposal subject formal approval computer creditors court agreement vertex also said would supply computer dlrs operating funds arrange renegotiation secured bank debt among things reuter

Similarity score:0.53
Article text: aw computer systems inc awcsa year end dec shr cts vs cts net vs revs vs reuter

Similarity score:0.48
Article text: hogan systems hogn acquisition