# Create A Simple Search Engine Using Python 
## Use TF-IDF and cosine similarity to retrieve articles that match a query

Information retrieval is an important task today. You may be wondering how a system retrieves the articles we want from a query. Here are the steps:
1. Extract documents from the Internet (by web scraping or by collecting them manually)
2. Clean the documents to make retrieval easier
3. Create a Term-Document Matrix with TF-IDF weighting
4. Write your query and convert it to a vector (based on TF-IDF)
5. Calculate the cosine similarity between the query and each document
6. Finally, show the documents, ranked by similarity
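
Steps 3 to 6 can be sketched on a toy corpus without any scraping. The block below is a minimal illustration using only the standard library, not a replica of `TfidfVectorizer`: the sentences and the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `rank`) are made up for this example, tokenization is a plain whitespace split, and the IDF uses the smoothed form `ln((1 + N) / (1 + df)) + 1`.

```python
import math
from collections import Counter

# Toy corpus (made-up sentences standing in for scraped articles).
toy_docs = [
    "france beat poland in the round of sixteen",
    "spain lost to morocco on penalties",
    "croatia beat japan on penalties",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for d in docs for term in set(tokenize(d)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Map each known term to tf * idf; terms outside the vocabulary are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    idf = build_idf(docs)
    q_vec = tfidf_vector(query, idf)
    scores = [(cosine(q_vec, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)

ranking = rank("poland", toy_docs)
print(ranking[0][1])  # index of the best-matching document
```

Only the first toy sentence mentions "poland", so it is the only document with a non-zero score. The rest of this notebook does the same thing with scikit-learn on real articles.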


In [1]:
import re
import string
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
def retrieve_docs_and_clean():

  r = requests.get('https://sports.ndtv.com/fifa-world-cup-2022/news')
  soup = BeautifulSoup(r.content, 'html.parser')

  # NOTE: the selectors below may need to be adjusted to match the markup of the URL above
  link = []
  for a in soup.find('div', {'class':'lst-pg_hd'}).find_all('a',{'class':'lst-pg_ttl'}):
      # Build the absolute article URL with the '?page=all' query string
      link.append('https://sports.ndtv.com/' + a['href'] + '?page=all')
  

  # Retrieve Paragraphs
  documents = []
  for url in link:
      r = requests.get(url)
      soup = BeautifulSoup(r.content, 'html.parser')

      # Collect the text of every <p> inside the article body
      sen = []
      for p in soup.find('div', {'class':'sp-cn pg-str-com js-ad-section'}).find_all('p'):
          sen.append(p.text)
      documents.append(' '.join(sen))

  # Clean Paragraphs
  documents_clean = []
  for d in documents:
      document_test = re.sub(r'[^\x00-\x7F]+', ' ', d)  # replace non-ASCII runs with a space
      document_test = re.sub(r'@\w+', '', document_test)  # drop @mentions
      document_test = document_test.lower()
      document_test = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', document_test)  # punctuation -> space
      document_test = re.sub(r'[0-9]', '', document_test)  # remove digits
      document_test = re.sub(r'\s{2,}', ' ', document_test)  # collapse repeated whitespace
      documents_clean.append(document_test)

  return documents_clean
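
To see what the cleaning step does to a piece of text, the same regular expressions can be applied to a few sample strings. This is a small sketch: `clean_text` is a hypothetical helper that repeats the substitutions from the function above, and the sample strings are invented.

```python
import re
import string

def clean_text(text):
    """Apply the same substitutions as the cleaning loop, in the same order."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)   # non-ASCII runs -> space
    text = re.sub(r'@\w+', '', text)              # drop @mentions
    text = text.lower()
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'[0-9]', '', text)             # remove digits
    text = re.sub(r'\s{2,}', ' ', text)           # collapse repeated whitespace
    return text

print(repr(clean_text("Round of 16 on Sunday")))
print(repr(clean_text("Poland's")))
print(repr(clean_text("Mbappé scored 2 goals, says @FIFAWorldCup!")))
```

Two side effects are worth knowing about. Removing digits turns "Round of 16" into "round of", and replacing non-ASCII characters truncates accented names ("Mbappé" becomes "mbapp"). Leading or trailing spaces are not stripped, so a cleaned string can end with a single space.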

In [3]:
docs = retrieve_docs_and_clean()
# Create Term-Document Matrix with TF-IDF weighting
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Create a DataFrame
df = pd.DataFrame(X.T.toarray(), index=vectorizer.get_feature_names_out())
df.head()

term,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
abemahttps,0.0,0.018556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ability,0.0,0.0,0.0,0.077075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062382,0.0,0.0,0.0,0.0,0.0
about,0.026779,0.0,0.0,0.043727,0.017919,0.0,0.0,0.0,0.0,0.0,0.0,0.04995,0.0,0.047618,0.0,0.0,0.0,0.0
above,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024643,0.0,0.0,0.0
absent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
df.shape

(1993, 18)

In [5]:
def get_similar_articles(q, df):
  print("query:", q)
  print("The following are articles with the highest cosine similarity values: ")
  print()
  q = [q]
  q_vec = vectorizer.transform(q).toarray().reshape(df.shape[0],)
  sim = {}
  for i in range(df.shape[1]):  # score every document (one column per document)
    # Cosine similarity: the dot product divided by the product of both norms
    sim[i] = np.dot(df.loc[:, i].values, q_vec) / (np.linalg.norm(df.loc[:, i]) * np.linalg.norm(q_vec))
  
  sim_sorted = sorted(sim.items(), key=lambda x: x[1], reverse=True)
  
  for k, v in sim_sorted:
    if v != 0.0:
      print("Similarity Values:", v)
      print(docs[k])
      print()


q1 = 'poland'
q2 = 'spain'
q3 = 'croatia'

get_similar_articles(q1, df)
print('-'*100)
get_similar_articles(q2, df)
print('-'*100)
get_similar_articles(q3, df)

query: poland
The following are articles with the highest cosine similarity values: 

Similarity Values: 0.17231125520917862
poland s fifa world cup came to an end in the round of on sunday with a loss to france kylian mbappe set up a history making goal for olivier giroud and then scored two himself as holders france eased into the world cup quarter finals with a win over poland giroud s opening goal a minute before half time was his nd for his country allowing him to pass thierry henry and become france s all time record marksman mbappe s lethal strike in the th minute killed off any prospect of a poland comeback and he netted again at the death to move to nine goals in just world cup appearances window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js at the other end the threat of robert lewandowski was snuffed out by the 
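
The per-document loop can also be written as a single matrix operation. The sketch below is one possible vectorized form, assuming a term-document layout like `df` above (rows are terms, columns are documents); `cosine_scores` and the toy matrix are hypothetical names and values chosen for illustration.

```python
import numpy as np

def cosine_scores(term_doc, q_vec):
    """Cosine similarity between q_vec and every column of term_doc."""
    term_doc = np.asarray(term_doc, dtype=float)
    q_vec = np.asarray(q_vec, dtype=float)
    dots = term_doc.T @ q_vec  # one dot product per document
    norms = np.linalg.norm(term_doc, axis=0) * np.linalg.norm(q_vec)
    scores = np.zeros_like(dots)
    # Divide only where the norm product is non-zero; other entries stay 0
    np.divide(dots, norms, out=scores, where=norms > 0)
    return scores

# Hypothetical 3-term x 3-document matrix (columns are documents).
toy = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0])

scores = cosine_scores(toy, query)
order = np.argsort(-scores)  # document indices, best match first
print(scores, order)
```

With the notebook's variables, this would be called as `cosine_scores(df.values, q_vec)`. The `where` guard also covers a query with no known terms, whose vector is all zeros: every score is then 0 instead of `nan`.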

In [7]:
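# Note: gensim.summarization (including bm25) was removed in gensim 4.0,
# so this import requires gensim < 4.0.
from gensim.summarization.bm25 import BM25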

def simple_tok(sent:str):
    return sent.split()

def bm25_similar_articles(query):
  print("query:", query)
  print("The following are articles with the highest BM25 scores: ")
  print()
  tok_corpus = [simple_tok(s) for s in docs]
  query = simple_tok(query)
  bm25 = BM25(tok_corpus)
  scores = bm25.get_scores(query, average_idf = 100)
  best_docs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
  for i, b in enumerate(best_docs):
      print(f"rank {i+1}: {docs[b]}")
      print()


q1 = 'poland'
q2 = 'spain'
q3 = 'croatia'


bm25_similar_articles(q1)
print('-'*100)
bm25_similar_articles(q2)
print('-'*100)
bm25_similar_articles(q3)
print('-'*100)

query: poland
The following are articles with the highest BM25 scores: 

rank 1: as brazil continue their rampant run in the fifa world cup arguably their greatest player ever pele has been admitted in the hospital the legendary football player who has been undergoing chemotherapy for colon cancer has been rooting for his country in the world cup despite being bedridden at present ahead of brazil s round of match against south korea pele took to twitter and posted a tweet for the selecao saying he will be cheering for them from the hospital bed window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js in i walked the streets thinking about fulfilling the promise i made to my father i know that today many have made similar promises and are also going in search of their first world cup i will watch the game from the hospital and 
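
The cell above relies on `gensim.summarization`, which was removed in gensim 4.0, so BM25 scoring can instead be sketched with the standard library alone. The block below implements the textbook Okapi BM25 formula under stated assumptions: `k1=1.5` and `b=0.75` are commonly used defaults, the IDF is the non-negative variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, and `bm25_scores` and the toy sentences are hypothetical. It is not gensim's implementation, so its scores will not match the output above.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(t for d in tokenized_docs for t in set(d))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy corpus (made-up sentences standing in for the cleaned articles).
toy_docs = [
    "france beat poland in the round of sixteen",
    "spain lost to morocco on penalties",
    "croatia beat japan on penalties",
]
tok_corpus = [d.split() for d in toy_docs]

scores = bm25_scores("poland".split(), tok_corpus)
best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(best[0])  # index of the highest-scoring document
```

To try it on the scraped articles, `tok_corpus` would be built from `docs` with the same whitespace split that `simple_tok` uses.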