In [1]:
!pip install numpy scikit-learn sentence-transformers -q

In [2]:
documents = [
    "The T20 World Cup 2024 is in full swing, bringing excitement and drama to cricket fans worldwide.India's team, captained by Rohit Sharma, is preparing for a crucial match against Ireland, with standout player Jasprit Bumrah expected to play a pivotal role in their campaign.The tournament has already seen controversy, particularly concerning the pitch conditions at Nassau County International Cricket Stadium in New York, which came under fire after a low-scoring game between Sri Lanka and South Africa.",
    "The world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globally.In the UEFA Champions League, the semi-final matchups have been set, with defending champions Real Madrid set to face Manchester City, while Bayern Munich will take on Paris Saint-Germain.Both ties promise thrilling encounters, featuring some of the best talents in world football.",
    "As election season heats up, the latest developments reveal a highly competitive atmosphere across several key races.The presidential election has seen intense campaigning from all major candidates, with recent polls indicating a tight race.Incumbent President Jane Doe is seeking re-election on a platform of economic stability and healthcare reform, while her main rival, Senator John Smith, focuses on education and climate change initiatives.",
    "The AI revolution continues to transform industries and reshape the global economy.Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety."
]

In [3]:
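import re

def preprocessing(text):
    """Lowercase the text and delete every character that is not a word character or whitespace."""
    text = text.lower()
    # Punctuation is deleted outright rather than replaced by a space, so words
    # separated only by punctuation are fused ("worldwide.India's" -> "worldwideindias").
    text = re.sub(r'[^\w\s]', '', text)
    return text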

preprocessed_documents = [preprocessing(doc) for doc in documents]

for doc in preprocessed_documents:
    print(doc)

the t20 world cup 2024 is in full swing bringing excitement and drama to cricket fans worldwideindias team captained by rohit sharma is preparing for a crucial match against ireland with standout player jasprit bumrah expected to play a pivotal role in their campaignthe tournament has already seen controversy particularly concerning the pitch conditions at nassau county international cricket stadium in new york which came under fire after a lowscoring game between sri lanka and south africa
the world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globallyin the uefa champions league the semifinal matchups have been set with defending champions real madrid set to face manchester city while bayern munich will take on paris saintgermainboth ties promise thrilling encounters featuring some of the best talents in world football
as election season heats up the latest developments reveal a highly competitive atmosphere across several 
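
The output above contains fused tokens such as `worldwideindias` and `campaignthe`, because the source strings have no space after full stops and the regex deletes punctuation instead of replacing it. A minimal sketch of one alternative follows, assuming word boundaries should be preserved. The name `preprocess_keep_boundaries` is hypothetical and not part of the notebook, and swapping it in would change the vocabulary size and similarity scores reported in later cells.

```python
import re

def preprocess_keep_boundaries(text):
    """Lowercase, drop apostrophes, and turn other punctuation into spaces."""
    text = text.lower()
    # Remove apostrophes first so "india's" becomes "indias" rather than "india s".
    text = re.sub(r"['\u2019]", "", text)
    # Replace every remaining non-word, non-space character with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace created by the previous step.
    return re.sub(r"\s+", " ", text).strip()

sample = "fans worldwide.India's team plays a low-scoring game."
cleaned = preprocess_keep_boundaries(sample)
print(cleaned)
```

One trade-off of this variant is that hyphenated words are split into two tokens ("low scoring"), whereas the notebook's version keeps them as one ("lowscoring").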

In [4]:
test_query = "machine learning is a subset of artificial intelligence"

## Keyword Search

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [30]:
vectorizer = TfidfVectorizer()

In [31]:
sparse_vectors = vectorizer.fit_transform(preprocessed_documents)

In [36]:
len(vectorizer.get_feature_names_out())

183

In [37]:
len(sparse_vectors.toarray()[0])

183

In [38]:
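# Use transform (not fit_transform) so the query is mapped onto the vocabulary learned from the documents
test_query_sparse_vector = vectorizer.transform([test_query])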

In [39]:
len(test_query_sparse_vector.toarray()[0])

183
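
The query vector has 183 dimensions because `transform` projects the query onto the vocabulary learned from the documents; query words that never occur in the documents (here `machine`, `learning` and `subset`) contribute nothing. The sketch below, in plain Python, shows which query tokens overlap with two short document fragments. It assumes the default `TfidfVectorizer` token pattern (two or more word characters); the helper `tokens` and the fragment variables are illustrative names, not part of the notebook.

```python
import re

# Same rule as TfidfVectorizer's default token_pattern: runs of two or more
# word characters, so one-letter words such as "a" are dropped.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Return the set of lowercase tokens in text."""
    return set(TOKEN_RE.findall(text.lower()))

query = "machine learning is a subset of artificial intelligence"

# Short fragments of two documents, used only to keep the sketch small.
football_fragment = "the world of football is buzzing with excitement"
ai_fragment = "significant advancements in artificial intelligence have led to breakthroughs"

football_overlap = sorted(tokens(query) & tokens(football_fragment))
ai_overlap = sorted(tokens(query) & tokens(ai_fragment))

print(football_overlap)  # ['is', 'of']
print(ai_overlap)        # ['artificial', 'intelligence']
```

The football fragment matches only on function words, which is why unrelated documents can still receive a non-zero keyword score. Passing `stop_words='english'` to `TfidfVectorizer` is one way to suppress such matches, at the cost of changing the vocabulary size and the scores shown in this notebook.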

In [40]:
keyword_similarities = cosine_similarity(sparse_vectors, test_query_sparse_vector)

keyword_similarities

array([[0.05537393],
       [0.11902777],
       [0.07839555],
       [0.17677653]])
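
Each value above is the cosine of the angle between one document vector and the query vector: the dot product divided by the product of the two vector lengths. As a rough illustration of that formula, here is a pure-Python sketch on toy vectors. The function name `cosine` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat its similarity as 0.0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # no shared terms
partial = cosine([1.0, 1.0], [1.0, 0.0])          # one shared term out of two

print(same_direction, orthogonal, partial)
```

Because TF-IDF weights are non-negative, scores from the keyword search fall between 0 (no shared terms) and 1 (identical direction).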

In [41]:
# argsort is ascending, so reverse it to rank documents from most to least similar
ranked_indexes = np.argsort(keyword_similarities, axis=0)[::-1].flatten()

ranked_indexes

array([3, 1, 2, 0])
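
The ranking step can also carry the scores along, which makes the output easier to inspect. The sketch below reuses the keyword similarity values printed earlier; `rank_with_scores` and its `top_k` parameter are hypothetical names introduced only for illustration.

```python
import numpy as np

# Scores copied from the keyword-search output, shape (4, 1).
scores = np.array([[0.05537393], [0.11902777], [0.07839555], [0.17677653]])

def rank_with_scores(similarities, top_k=None):
    """Return (document_index, score) pairs sorted from highest to lowest score."""
    flat = np.asarray(similarities).ravel()
    order = np.argsort(flat)[::-1]  # argsort is ascending, so reverse it
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(flat[i])) for i in order]

ranking = rank_with_scores(scores)
for index, score in ranking:
    print(index, round(score, 4))
```

Flattening before sorting avoids the `axis=0` argument and yields the same order as the cell above for a single query column.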

In [42]:
ranked_documents = [documents[i] for i in ranked_indexes]

for doc in ranked_documents:
    print(doc)

The AI revolution continues to transform industries and reshape the global economy.Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety.
The world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globally.In the UEFA Champions League, the semi-final matchups have been set, with defending champions Real Madrid set to face Manchester City, while Bayern Munich will take on Paris Saint-Germain.Both ties promise thrilling encounters, featuring some of the best talents in world football.
As election season heats up, the latest developments reveal a highly competitive atmosphere across several key races.The presidential election has seen intense campaigning from all major candidates, with recent polls indica

## Semantic Search

In [44]:
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

In [45]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

In [46]:
dense_vectors = embedding_model.encode(preprocessed_documents)

In [47]:
len(dense_vectors[0])

384

In [48]:
test_query_dense_vector = embedding_model.encode([test_query])

In [49]:
len(test_query_dense_vector[0])

384

In [50]:
semantic_similarities = cosine_similarity(dense_vectors, test_query_dense_vector)

semantic_similarities

array([[0.01992225],
       [0.09100752],
       [0.04911963],
       [0.3795429 ]], dtype=float32)
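
Cosine similarity depends only on direction, so if the vectors are first scaled to unit length it reduces to a plain dot product. The sketch below checks this on random stand-in vectors of the same width as the embeddings (384); the data is synthetic and `l2_normalize` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(4, 384))    # synthetic stand-ins for document embeddings
query_vec = rng.normal(size=(1, 384))   # synthetic stand-in for the query embedding

def l2_normalize(x):
    """Scale each row to unit Euclidean length (zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

# Dot product of unit vectors, shape (4, 1).
dot_scores = l2_normalize(doc_vecs) @ l2_normalize(query_vec).T

# Reference: cosine similarity computed directly from the definition.
cos_scores = (doc_vecs @ query_vec.T) / (
    np.linalg.norm(doc_vecs, axis=1, keepdims=True) * np.linalg.norm(query_vec)
)

print(np.allclose(dot_scores, cos_scores))
```

In sentence-transformers, `encode` accepts `normalize_embeddings=True`, which should make the dot-product shortcut applicable to the real embeddings as well; that option is not used in this notebook.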

In [51]:
# argsort is ascending, so reverse it to rank documents from most to least similar
ranked_indexes = np.argsort(semantic_similarities, axis=0)[::-1].flatten()

ranked_indexes

array([3, 1, 2, 0])

In [52]:
ranked_documents = [documents[i] for i in ranked_indexes]

for doc in ranked_documents:
    print(doc)

The AI revolution continues to transform industries and reshape the global economy.Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety.
The world of football is buzzing with excitement as major tournaments and league matches continue to captivate fans globally.In the UEFA Champions League, the semi-final matchups have been set, with defending champions Real Madrid set to face Manchester City, while Bayern Munich will take on Paris Saint-Germain.Both ties promise thrilling encounters, featuring some of the best talents in world football.
As election season heats up, the latest developments reveal a highly competitive atmosphere across several key races.The presidential election has seen intense campaigning from all major candidates, with recent polls indica
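
Both methods produce the same order for this query, but they separate the top document from the rest by different amounts. The sketch below compares the two score lists copied from the outputs above; `ranking` and `top_margin` are hypothetical helper names used only for this comparison.

```python
# Scores copied from the keyword and semantic similarity outputs.
keyword_scores = [0.05537393, 0.11902777, 0.07839555, 0.17677653]
semantic_scores = [0.01992225, 0.09100752, 0.04911963, 0.3795429]

def ranking(scores):
    """Document indexes ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def top_margin(scores):
    """Gap between the best and the second-best score."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]

same_order = ranking(keyword_scores) == ranking(semantic_scores)
keyword_margin = top_margin(keyword_scores)
semantic_margin = top_margin(semantic_scores)

print(same_order)
print(round(keyword_margin, 4), round(semantic_margin, 4))
```

On these four documents the semantic scores put a noticeably larger gap between the AI document and the runner-up than the keyword scores do. A single query on four documents is too small a sample to generalize from.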