
# Create a Simple Search Engine Using Python
## Use TF-IDF and Cosine Similarity to Retrieve Articles Matching a Query

Information retrieval is an important task today. You may be wondering how a system retrieves the articles we want from a query. Here are the steps:
1. Collect documents from the Internet (by web scraping or by manual extraction).
2. Clean the documents to make retrieval easier.
3. Create a term-document matrix with TF-IDF weighting.
4. Write a query and convert it to a vector using the same TF-IDF weighting.
5. Compute the cosine similarity between the query vector and each document vector.
6. Finally, show the matching documents, ranked by similarity.
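
Steps 3 to 5 can be sketched on a toy corpus without any third-party library. The three documents, the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `rank`), and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are illustrative assumptions; the notebook itself relies on `TfidfVectorizer` for this step.

```python
import math
from collections import Counter

# Toy corpus standing in for the scraped articles (hypothetical data).
docs = [
    "france beat poland to reach the quarter finals",
    "germany were knocked out in the group stage",
    "messi scored as argentina reached the quarter finals",
]

def tokenize(text):
    return text.lower().split()

def build_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Map term -> tf * idf, ignoring terms not seen in the corpus."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (document index, similarity) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vecs = [tfidf_vector(toks, idf) for toks in tokenized]
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [(i, cosine(q_vec, dv)) for i, dv in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank("messi", docs)
print(ranking)
```

With the query `messi`, only the third document shares a term with the query, so it is the only one with a nonzero score.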


In [1]:
import re
import string
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
def retrieve_docs_and_clean():
  base_url = 'https://sports.ndtv.com/'

  # Collect article links from the news listing page
  r = requests.get(base_url + 'fifa-world-cup-2022/news')
  soup = BeautifulSoup(r.content, 'html.parser')

  links = []
  for a in soup.find('div', {'class':'lst-pg_hd'}).find_all('a', {'class':'lst-pg_ttl'}):
      links.append(base_url + a['href'] + '?page=all')

  # Retrieve the paragraphs of each article
  documents = []
  for url in links:
      r = requests.get(url)
      soup = BeautifulSoup(r.content, 'html.parser')

      sen = []
      for p in soup.find('div', {'class':'sp_txt'}).find_all('p'):
          sen.append(p.text)
      documents.append(' '.join(sen))

  # Clean Paragraphs
  documents_clean = []
  for d in documents:
      # Replace runs of non-ASCII characters with a space
      document_test = re.sub(r'[^\x00-\x7F]+', ' ', d)
      # Remove @mentions
      document_test = re.sub(r'@\w+', '', document_test)
      # Lowercase the text
      document_test = document_test.lower()
      # Replace punctuation with spaces
      document_test = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', document_test)
      # Remove digits
      document_test = re.sub(r'[0-9]', '', document_test)
      # Collapse repeated whitespace into a single space
      document_test = re.sub(r'\s{2,}', ' ', document_test)
      documents_clean.append(document_test)

  return documents_clean
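
To see what each cleaning step does, the same regular expressions can be applied to a single string. The helper name `clean_text` and the sample sentence are made up for illustration; the substitutions are the ones used in `retrieve_docs_and_clean`.

```python
import re
import string

def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)   # non-ASCII runs -> space
    text = re.sub(r'@\w+', '', text)              # drop @mentions
    text = text.lower()                           # case-fold
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'[0-9]', '', text)             # drop digits
    text = re.sub(r'\s{2,}', ' ', text)           # collapse repeated whitespace
    return text

sample = "Mbappé scored 2 goals, said @FIFAWorldCup!"
cleaned = clean_text(sample)
print(repr(cleaned))  # 'mbapp scored goals said '
```

Two side effects are visible here: the non-ASCII filter turns `Mbappé` into `mbapp`, and a trailing space can remain, which a final `.strip()` would remove if that matters.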

In [3]:
docs = retrieve_docs_and_clean()
# Create Term-Document Matrix with TF-IDF weighting
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Create a DataFrame
df = pd.DataFrame(X.T.toarray(), index=vectorizer.get_feature_names_out())
df.head()

term,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
abandoned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.096431,0.0
ability,0.0,0.076983,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064472,0.0,0.0,0.0,0.0,0.0,0.0,0.0
about,0.0,0.047373,0.019363,0.0,0.0,0.0,0.0,0.0,0.0,0.053679,0.0,0.051801,0.0,0.0,0.0,0.0,0.0,0.0
above,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024571,0.0,0.0,0.0,0.0,0.0
absent,0.0,0.0,0.0,0.0,0.0,0.002383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
df.shape

(1762, 18)

In [5]:
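def get_similar_articles(q, df):
  print("query:", q, end="\n\n")
  print("The following are articles with the highest cosine similarity values: ")
  print()
  # Convert the query to a TF-IDF vector over the same vocabulary as df
  q_vec = vectorizer.transform([q]).toarray().reshape(df.shape[0],)
  q_norm = np.linalg.norm(q_vec)
  sim = {}
  for i in range(df.shape[1]):  # score every document, not only the first 10
    doc_vec = df.loc[:, i].values
    denom = np.linalg.norm(doc_vec) * q_norm
    # Cosine similarity: dot product divided by the product of both norms
    sim[i] = np.dot(doc_vec, q_vec) / denom if denom != 0 else 0.0

  sim_sorted = sorted(sim.items(), key=lambda x: x[1], reverse=True)

  for k, v in sim_sorted:
    if v != 0.0:
      print("Similarity Values:", v)
      print(docs[k])
      print()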


q1 = 'quater final'
q2 = 'germany'
q3 = 'lionel messi'

get_similar_articles(q1, df)
print('-'*100)
get_similar_articles(q2, df)
print('-'*100)
get_similar_articles(q3, df)

query: quater final

The following are articles with the highest cosine similarity values: 

Similarity Values: 0.020620140308364135
kylian mbappe said sunday he dreamed of winning the world cup for a second time after his brace helped france to a win over poland which advanced the holders to the quarter finals in qatar of course this world cup is an obsession for me it s the competition of my dreams said the year old who burst onto the global stage by starring when france won the title in russia four years ago i have built my season around this competition and around being ready both physically and mentally for it window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js i wanted to come here ready and so far things are going well but we are still a long way from the objective we set and that i set the paris saint germain supe
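
The per-document cosine similarity can also be computed as a single matrix-vector product. The sketch below assumes a small hand-written term-document array in place of `df.values`; `cosine_scores` is a hypothetical helper name, and the zero-norm guard is an assumption about how undefined similarities should be handled.

```python
import numpy as np

def cosine_scores(term_doc, q_vec):
    """Cosine similarity of q_vec against every column of term_doc."""
    term_doc = np.asarray(term_doc, dtype=float)
    q_vec = np.asarray(q_vec, dtype=float)
    dots = term_doc.T @ q_vec                          # one dot product per document
    norms = np.linalg.norm(term_doc, axis=0) * np.linalg.norm(q_vec)
    scores = np.zeros_like(dots)
    np.divide(dots, norms, out=scores, where=norms > 0)  # keep 0 where undefined
    return scores

# 3 terms x 3 documents (made-up weights).
term_doc = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 2.0],
    [0.0, 0.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0])

scores = cosine_scores(term_doc, query)
ranking = np.argsort(scores)[::-1]   # document indices, best match first
print(scores, ranking)
```

Since `TfidfVectorizer` L2-normalizes its rows by default, both norms are 1 for any nonzero vector it produces, so on the notebook's data the similarity reduces to the plain dot product.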

In [6]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi



In [7]:
def get_top_n(query, documents, n):
  print("query:", query, end="\n\n")
  print("The following are articles with the highest BM25 scores: ")
  print()

  tokenized_corpus = [tok.split() for tok in documents]
  bm25 = BM25Okapi(tokenized_corpus)
  tokenized_query = query.split()
  doc_scores = bm25.get_scores(tokenized_query)
  top_n = np.argsort(doc_scores)[::-1][:n]
  for i, b in enumerate(top_n):
      print(f"rank {i+1}: {doc_scores[b]}")
      print(documents[b])
      print()


q1 = 'quater final'
q2 = 'germany'
q3 = 'lionel messi'

get_top_n(q1, docs, n=3)
print('-'*100)
get_top_n(q2, docs, n=3)
print('-'*100)
get_top_n(q3, docs, n=3)


query: quater final

The following are articles with the highest BM25 scores: 

rank 1: 0.9860008046336753
the battle between football and soccer concluded in the fifa world cup on saturday as netherlands defeated usa in the round of clash to book a spot in the quarter finals while fans on social media termed the result a victory for football over soccer the top political leaders of the netherlands and usa also got involved into the debate the subject prompted twitter wisecracks between united states of america president joe biden and netherlands prime minister mark rutte window rrcode window rrcode rrcode push function function v d o ai ai d createelement script ai defer true ai async true ai src v location protocol o d head appendchild ai window document a vdo ai core v ndtv vdo ai js it s called soccer biden said in a video posted on twitter wishing the us team luck ahead of their last clash against the dutch side saturday at the khalifa international stadium in doha but the clinica
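
For reference, an Okapi BM25 score sums, over the query terms, an IDF weight multiplied by a saturating term-frequency factor that is normalized by document length. The sketch below is a from-scratch illustration, not the `rank_bm25` implementation: the function name `bm25_scores`, the toy corpus, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))` are all assumptions, so its numbers will not necessarily match `BM25Okapi`.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        counts = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = counts.get(term, 0)
            if tf == 0:
                continue  # term absent from this document
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical data), tokenized by whitespace as in get_top_n.
corpus = [
    "netherlands beat usa to reach the quarter finals",
    "germany went out in the group stage",
    "messi led argentina into the quarter finals",
]
tokenized = [doc.split() for doc in corpus]

exact = bm25_scores("quarter finals".split(), tokenized)
typo = bm25_scores("quater final".split(), tokenized)
print(exact)
print(typo)
```

Because tokens are matched exactly, the misspelled query `quater final` cannot match the tokens `quarter` or `finals`, so every document in this toy corpus scores zero for it. Between the two matching documents, the shorter one scores higher because of the length normalization.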