*Searching BBC News Articles similar to our search article/query using* 
* Tfidf Vectorizer
* Cosine Similarity

In [1]:
import pandas as pd #Dataframe Manipulation library
import numpy as np #Data Manipulation library
from pathlib import Path

#sklearn modules for Feature Extraction & Modelling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

#Libraries for Plotting 
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

#Read files Iteratively
import glob
import os

#### Loading the dataset
*The dataset contains news articles published by the BBC, stored as .txt files in one folder per category (business, entertainment, politics, sport, tech).*

##### Checking the working directories

In [2]:
print(f"Current working directory is: {os.getcwd()}")
os.chdir("/kaggle")  # os.chdir returns None, so there is nothing to assign (and `dir` would shadow the built-in)
print(f'Changing directory to {os.getcwd()}')

Current working directory is: /kaggle/working
Changing directory to /kaggle


In [3]:
# User-defined function to read the BBC articles from one folder per category
def load_data(folder_names, root_path):
    # Build one glob pattern per category folder
    patterns = [root_path + "/input/bbc-news-summary/BBC News Summary/News Articles/" + folder + "/*.txt"
                for folder in folder_names]
    doc_list = []
    for pattern, topic in zip(patterns, folder_names):
        # glob.glob only returns the matching file paths; each file is opened and read below
        for file_path in glob.glob(pattern):
            with open(file_path, encoding="latin-1") as f:
                lines = f.readlines()
            heading = lines[0].strip()  # the first line of each file is the headline
            body = ' '.join([l.strip() for l in lines[1:]])  # the remaining lines form the article text
            doc_list.append([topic, heading, body])
        print(f"Loading data from \033[1m{topic}\033[0m directory")
    print("\nEntire Data is loaded successfully")

    return doc_list
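
The notebook imports `Path` from `pathlib` but does not use it. As a minimal sketch (not the loader used above), the same folder-per-category layout could be read with `pathlib`. The function name `load_articles` and the temporary demo folder are hypothetical and exist only for illustration.

```python
from pathlib import Path
import tempfile

def load_articles(articles_root, categories):
    """Return [category, heading, body] rows, one per .txt file (hypothetical helper)."""
    rows = []
    for category in categories:
        # sorted() makes the file order deterministic across platforms
        for path in sorted((Path(articles_root) / category).glob("*.txt")):
            lines = path.read_text(encoding="latin-1").splitlines()
            if not lines:
                continue  # skip empty files instead of failing on lines[0]
            heading = lines[0].strip()
            # Unlike the loader above, this also trims the joined body
            body = " ".join(line.strip() for line in lines[1:]).strip()
            rows.append([category, heading, body])
    return rows

# Demo on a throwaway folder so the sketch runs without the Kaggle dataset
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tech").mkdir()
    (Path(tmp) / "tech" / "001.txt").write_text(
        "A heading\n\nFirst line.\nSecond line.\n", encoding="latin-1"
    )
    demo_rows = load_articles(tmp, ["tech"])
```

Skipping empty files and sorting the paths are design choices of this sketch, not behaviour of the original `load_data`.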

In [4]:
folder_names = ['business','entertainment','politics','sport','tech']
docs = load_data(folder_names=folder_names,root_path=os.getcwd())

Loading data from business directory
Loading data from entertainment directory
Loading data from politics directory
Loading data from sport directory
Loading data from tech directory

Entire Data is loaded successfully


##### Converting the loaded text files into a DataFrame

In [5]:
docs = pd.DataFrame(docs, columns = ['Category','Heading','Article'])
docs.head()

| | Category | Heading | Article |
|---|---|---|---|
| 0 | business | US consumer confidence up | Consumers' confidence in the state of the US ... |
| 1 | business | The 'ticking budget' facing the US | The budget proposals laid out by the administ... |
| 2 | business | Mitsubishi in Peugeot link talks | Trouble-hit Mitsubishi Motors is in talks wit... |
| 3 | business | BMW reveals new models pipeline | BMW is preparing to enter the market for car-... |
| 4 | business | World leaders gather to face uncertainty | More than 2,000 business and political leader... |


* Build a **TF-IDF matrix** (Term Frequency-Inverse Document Frequency) from the headings or the article text
* **Transform** the **query** with the same fitted vectorizer, so that it is expressed in the vocabulary learned from the raw text
* Use **Cosine Similarity** to score the query against each news article
* **Sort** the documents by their cosine similarity to the query **in descending order**
* The **Top N** documents can then serve **as recommendations**
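
The steps above can be sketched end to end on a tiny corpus. The three toy documents and the `toy_` names are assumptions made for illustration; they are not part of the BBC dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the BBC headings
toy_docs = [
    "stock market rises",
    "football club wins trophy",
    "bank rates climb",
]

# Step 1: learn the vocabulary and build the TF-IDF matrix
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# Step 2: transform the query with the SAME fitted vectorizer
toy_query_vector = toy_vectorizer.transform(["stock market rates"])

# Step 3: cosine similarity of every document with the query, flattened to 1-D
toy_scores = cosine_similarity(toy_vectors, toy_query_vector).ravel()

# Step 4: document indices sorted by similarity, best first
toy_ranking = np.argsort(toy_scores)[::-1]

# Step 5: keep the top N (here N = 2) as recommendations
top_2 = [toy_docs[i] for i in toy_ranking[:2]]
```

Here the first document shares two of the three query terms and the third shares one, so they rank first and second, while the football headline scores zero.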

In [6]:
vectorizer = TfidfVectorizer(stop_words = 'english')

In [7]:
vectors = vectorizer.fit_transform(docs["Heading"].values) # .values returns the column as a NumPy array of strings; fit_transform learns the vocabulary and returns the sparse TF-IDF matrix
print(f"The shape of the tfidf matrix : {vectors.shape}")
print(f"There are {vectors.shape[0]} number of News Articles having {vectors.shape[1]} unique words in tfidf vectors")

The shape of the tfidf matrix : (2225, 3623)
There are 2225 number of News Articles having 3623 unique words in tfidf vectors


In [8]:
#new_query = ["Get ready for Machine Learning Engineer Interviews"]
new_query = ["Stock Market Rates are rising"]
new_query_vector = vectorizer.transform(new_query)
new_query_vector

<1x3623 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

To compute the cosine similarity between new_query and the historical data, transform new_query with the same fitted TF-IDF model (**vectorizer**) that was created for the historical data.
* This gives the new_query vector the same number of columns as the historical TF-IDF matrix

If we instead fit a new TF-IDF vectorizer on the query, its columns will not correspond to the vocabulary learned from the historical data and the number of columns will differ, so the similarity computation fails with an error.
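
A small sketch of this point, using a hypothetical three-document corpus rather than the BBC data: transforming with the fitted vectorizer keeps the column count aligned, while refitting on the query alone produces a matrix of a different width, which `cosine_similarity` rejects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus and query for illustration only
corpus = ["stock market rises", "football club wins trophy", "bank rates climb"]
query = ["stock market rates are rising"]

fitted = TfidfVectorizer(stop_words="english")
corpus_vectors = fitted.fit_transform(corpus)

# Correct: reuse the fitted vectorizer, so the query has one column per corpus term.
# Query words outside the learned vocabulary (here "rising") are silently dropped.
query_vector = fitted.transform(query)
same_width = query_vector.shape[1] == corpus_vectors.shape[1]

# Incorrect: a fresh vectorizer learns a vocabulary from the query alone
refit_vector = TfidfVectorizer(stop_words="english").fit_transform(query)

try:
    cosine_similarity(corpus_vectors, refit_vector)
    mismatch_raised = False
except ValueError:
    # The two matrices have different numbers of columns
    mismatch_raised = True
```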

In [9]:
sim = cosine_similarity(X = vectors, Y = new_query_vector)

In [10]:
sim
# cosine similarity of each document with new_query: one row per document, shape (n_docs, 1)

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

In [11]:
# Index of the document with the highest similarity to the query
argmax = np.argmax(sim)
print(f"Index of the maximum valued similar document is : \033[1m{argmax}\033[0m")
print(f"Retrieved Document Header is : \033[1m{docs.Heading[argmax]}\033[0m")

Index of the maximum valued similar document is : 504
Retrieved Document Header is : Stock market eyes Japan recovery


##### Top 10 News Articles similar to new_query

In [12]:
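# argsort sorts ascending along axis 0; [::-1] reverses to descending and [:10] keeps the 10 best matches
ind = np.argsort(sim,axis = 0)[::-1][:10]
print("Top 10 News Articles similar to new_query are : \n")
for i in ind:
    print(docs.Heading.values[i]) # i is a one-element array because sim has shape (n_docs, 1), so each heading prints inside brackets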

Top 10 News Articles similar to new_query are : 

['Stock market eyes Japan recovery']
['US interest rates increased to 2%']
['French consumer spending rising']
["Labour's core support takes stock"]
['Home loan approvals rising again']
["Small firms 'hit by rising costs'"]
['UK interest rates held at 4.75%']
['Australia rates at four year high']
['Bank set to leave rates on hold']
['Bank opts to leave rates on hold']
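
Each heading above is printed inside brackets because the similarity array is two-dimensional. One possible way to get plain indices together with their scores is to flatten the array first. The helper name `top_n` and the example scores are hypothetical.

```python
import numpy as np

def top_n(sim, n=10):
    """Return (index, score) pairs for the n highest scores, best first (hypothetical helper)."""
    scores = np.asarray(sim).ravel()          # (n_docs, 1) -> (n_docs,)
    order = np.argsort(scores)[::-1][:n]      # descending order, first n indices
    return [(int(i), float(scores[i])) for i in order]

# Made-up similarity column in the same (n_docs, 1) layout that cosine_similarity returns
example_sim = np.array([[0.1], [0.0], [0.7], [0.3]])
ranked = top_n(example_sim, n=3)
```

Returning the scores alongside the indices also makes it easy to drop matches whose similarity is zero, which would otherwise fill the list when fewer than `n` documents share a term with the query.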


#### User Defined Method **retrieve_doc** 
##### for searching News Articles similar to our search article/query using **Tfidf Vectorizer** and **Cosine Similarity**

In [13]:
def retrieve_doc(new_query, raw_docs, colname="Article"): # new_query: search text, raw_docs: DataFrame holding the corpus, colname: column with the text to index
    vectorizer = TfidfVectorizer(stop_words = 'english') # TF-IDF model that ignores English stop words
    vectors = vectorizer.fit_transform(raw_docs[colname]) # learn the vocabulary and build the document-term TF-IDF matrix
    print(f"The shape of the tfidf matrix : {vectors.shape}")
    print(f"There are {vectors.shape[0]} number of News Articles having {vectors.shape[1]} unique words in tfidf vectors")
    new_query = [new_query] # transform expects an iterable of documents, not a bare string
    new_query_vector = vectorizer.transform(new_query) # TF-IDF weights of the query over the vocabulary that was already learned
    sim = cosine_similarity(X = vectors, Y = new_query_vector) # similarity of every document with the query, shape (n_docs, 1)
    argmax = np.argmax(sim)
    print(f"\nIndex of the maximum valued similar document is : \033[1m{argmax}\033[0m")
    # Use raw_docs (the argument), not the global docs, so the function works on any DataFrame passed in
    print(f"Retrieved Document Header is : \033[1m{raw_docs['Heading'].values[argmax]}\033[0m")
    ind = np.argsort(sim,axis = 0)[::-1][:10] # sort ascending, reverse to descending with [::-1], keep the 10 most similar with [:10]
    print("\nTop 10 News Articles similar to new_query are : \n")
    for i in ind:
        print(raw_docs['Heading'].values[i]) # prints the heading of each of the top 10 similar articles

In [14]:
new_query = "COVID19 hits global economy"
retrieve_doc(new_query,raw_docs=docs,colname =  "Article")

The shape of the tfidf matrix : (2225, 28980)
There are 2225 number of News Articles having 28980 unique words in tfidf vectors

Index of the maximum valued similar document is : 35
Retrieved Document Header is : Japan economy slides to recession

Top 10 News Articles similar to new_query are : 

['Japan economy slides to recession']
['Band Aid retains number one spot']
['Israeli economy picking up pace']
['Singapore growth at 8.1% in 2004']
['Singapore growth at 8.1% in 2004']
['US box office set for record high']
['Australia rates at four year high']
["'Golden economic period' to end"]
['BBC poll indicates economic gloom']
['Business confidence dips in Japan']
