<a href="https://colab.research.google.com/github/Alyxx-The-Sniper/NLP/blob/main/NLP_Semantic_Search_with%C2%A0FAISS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search with FAISS
Hugging Face `get_nearest_examples` and Cosine Similarity Search

# About This Project
In this project, our objective is to run a similarity search over game descriptions using Hugging Face libraries (Transformers and Datasets) together with a FAISS index. To validate the results, we cross-check them with cosine similarity. Our dataset comes from Steam, a popular gaming hub, whose API we use to retrieve the game app data for the search.

Install Libraries

In [1]:
!pip install tqdm
import requests
from bs4 import BeautifulSoup
import random
import pandas as pd
from tqdm import tqdm



# Download data from Steam (Gaming Hub) using API

> When utilizing an API, it is essential to familiarize yourself with the documentation and adhere to the specified terms of use. This ensures proper integration and compliance with the API provider's guidelines and restrictions.

In [2]:
### Create Global Variable
# Define the API endpoint and parameters
endpoint = "https://api.steampowered.com/ISteamApps/GetAppList/v2/"
# Send the API request
response = requests.get(endpoint)
response.status_code

200
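
In practice, respecting an API's terms of use often means spacing requests out rather than sending them back to back. The following is a minimal sketch of that idea, not Steam's documented policy: `polite_map`, `fake_fetch`, and the interval value are hypothetical names and numbers chosen for illustration, and the real limits should be taken from the API documentation.

```python
import time

def polite_map(fetch, items, min_interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Yield fetch(item) for each item, starting calls at least
    min_interval seconds apart (the interval is an assumed value)."""
    last_start = None
    for item in items:
        if last_start is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fetch(item)

# Stand-in for a real HTTP call, so the sketch runs without network access.
def fake_fetch(app_id):
    return {"appid": app_id, "ok": True}

results = list(polite_map(fake_fetch, [10, 20, 30], min_interval=0.01))
print(results)
```

In a real fetch loop, `fake_fetch` would be replaced by a function that calls `requests.get(url, timeout=...)` so that a stalled request cannot hang the whole run.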

### Function for Fetching Game Apps Data

In [3]:
import requests
import json

def fetch_steam_game_info(num_games: int, random_state: int = 42, save_filename: str = "steam_games.json"):
    """
    This function takes a number of game apps from the Steam API and returns a list of dictionaries
    with the game app names and descriptions. The descriptions are cleaned using the Beautiful Soup library.
    :param num_games: The number of game apps to fetch.
    :param random_state: Random seed for reproducibility. If None, randomness is not controlled.
    :param save_filename: The filename to save the fetched game information as a JSON file.
    :return: List of dictionaries with game app names and descriptions.
    """
    # Set random seed if provided
    if random_state is not None:
        random.seed(random_state)

    # Initialize the list to store game information as dictionaries
    games_list = []

    # Make a request to the Steam API
    response = requests.get("https://api.steampowered.com/ISteamApps/GetAppList/v2/")

    if response.status_code == 200:
        data = response.json()

        # Extract the list of apps from the response
        if "applist" in data and "apps" in data["applist"]:
            apps = data["applist"]["apps"]

            # Shuffle the apps list for random sampling
            random.shuffle(apps)

            # Iterate over the apps and retrieve game information
            with tqdm(total=num_games, desc="Fetching game details", unit="app") as pbar:
                for app in apps:
                    app_id = app["appid"]
                    game_name = app["name"]

                    # Retrieve game details using the Steam Web API
                    game_details_url = f"https://store.steampowered.com/api/appdetails?appids={app_id}"
                    game_details_response = requests.get(game_details_url)

                    try:
                        game_details_data = game_details_response.json()
                    except ValueError:
                        continue

                    # Check if the game details request was successful
                    if game_details_response.status_code == 200 and str(app_id) in game_details_data:
                        if "data" in game_details_data[str(app_id)]:
                            game_data = game_details_data[str(app_id)]["data"]
                            game_description = game_data.get("detailed_description", "")

                            # Remove all HTML tags from the description
                            soup = BeautifulSoup(game_description, "html.parser")
                            game_description = soup.get_text()

                            # Create a dictionary with game name and description
                            game_dict = {
                                "game_name": game_name,
                                "game_description": game_description
                            }

                            # Add the game dictionary to the list
                            games_list.append(game_dict)
                            pbar.update(1)  # Update the progress bar for every successful fetch

                            # Check if the desired number of games has been fetched
                            if len(games_list) >= num_games:
                                break

        else:
            print("No 'applist' or 'apps' found in the API response.")

    else:
        print("Error:", response.status_code)

    # Print the fetch results
    print("\n#### Fetch Complete ####")
    print("\nNumber of games fetched:", len(games_list))

    if len(games_list) < num_games:
        print("\nNo more game apps available.")

    # Save the fetched game information as a JSON file
    with open(save_filename, "w") as file:
        json.dump(games_list, file)

    # Print a message to indicate the successful download
    print(f"\nFetched Steam games downloaded and saved as {save_filename} in the local folder.")

    return games_list
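
The function above removes HTML tags by concatenating the text nodes with no separator, which is how BeautifulSoup's `get_text()` behaves by default. Sentences from adjacent tags can therefore run together in the cleaned description. The sketch below illustrates the effect and one possible remedy using only the standard library's `HTMLParser` as a stand-in for BeautifulSoup; `TextExtractor`, `strip_html`, and the sample HTML are made up for this example. With BeautifulSoup itself, passing a separator to `get_text()` (for example `soup.get_text(separator=" ", strip=True)`) should have a similar effect.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def strip_html(html, separator=" "):
    """Return the visible text, joining text nodes with `separator`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)

# Made-up sample, loosely shaped like a store description.
sample = ("<h2>Advanced Tactics</h2><p>Launch surprise assaults.</p>"
          "<ul><li>Kush Archer</li><li>Berber Javelineer</li></ul>")

print(strip_html(sample, separator=""))  # text nodes glued together
print(strip_html(sample))                # text nodes separated by spaces
```

Keeping a space between text nodes gives the tokenizer cleaner word boundaries, which may matter for the embeddings computed later.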


In [4]:
# Function Call
steam_games = fetch_steam_game_info(num_games = 100,
                                    save_filename="steam_games.json")


Fetching game details: 100%|██████████| 100/100 [00:23<00:00,  4.17app/s]


#### Fetch Complete ####

Number of games fetched: 100

Fetched Steam games downloaded and saved as steam_games.json in the local folder.





Load the downloaded Steam game apps into a pandas DataFrame.

In [40]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

# Load the JSON file into a DataFrame
filename = "steam_games.json"  # Specify the filename and path if necessary
with open(filename, "r") as file:
    steam_games_data = json.load(file)

# Create a DataFrame from the loaded data
df = pd.DataFrame(steam_games_data)

# Print the DataFrame
df.sample(3)

Unnamed: 0,game_name,game_description
8,Call of Duty®: Black Ops II - Benjamins MP Personalization Pack,"The Call of Duty®: Black Ops 2 Benjamins MP Personalization Pack boasts a cash-themed weapon skin, three uniquely-shaped reticles, and an all-new themed Calling Card by which your enemies can remember you. (NOTICE: Reticles will only be available for ACOG, EOTECH and Red Dot scopes.)"
92,SteamDolls - Order Of Chaos : Graphic Novels,"Join the Whisper and get the two issues of the SteamDolls graphic novel !\n\r\nIn the mist of time, society is eaten by industry, and its citizens oppressed by an authoritarian and omniscient cult. The Whisper, a violent-tempered anarchist, dreams about a revolution that will destroy the order of things. He devotes himself to expose the dark side of those in power and bring awareness to the people. His obsessive quest for the ultimate truth will throw him into a world of betrayal and sorcery where the only certainty is the imminent end of the world."
23,Hegemony Rome: The Rise of Caesar - Advanced Tactics Pack,"Enhance your combat skills and master the battlefield with six all-new advanced tactical units. Call forth the Parthian Horse Archers and Cataphracts to gain the upper hand. Or take advantage of the Gallic Ambushers’ knowledge of the terrain to launch deadly surprise assaults.The ‘Advanced Tactics’ DLC makes use of the newly added ambush/recon game mechanic that lets units hide outside of the fog of war to get ambush bonuses against approaching enemies.Hire 6 new and highly skilled units using the mercenary system: Kush Archer, Parthian Cataphrats and Horse Archers, Berber Javelineer, Naked Skirmishers and Gallic Ambushers"




---




# Load Pre-trained Model And Tokenizer
We will be using the pre-trained model 'sentence-transformers/multi-qa-mpnet-base-dot-v1' as our model.
This model is specifically designed for semantic search tasks, mapping sentences and paragraphs to a 768-dimensional dense vector space. It has been trained on a large dataset consisting of 215 million (question, answer) pairs from various sources.

link: [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1)

In [6]:
# install first necessary libraries
!pip install faiss-cpu faiss-gpu datasets transformers torch

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [7]:
import torch
# Setup device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [8]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

# put model to device
model = model.to(device)

Downloading (…)okenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [9]:
# Convert DataFrame into arrow format (HuggingFace dataset format)
from datasets import Dataset

# Convert the DataFrame to the Hugging Face dataset format
dataset = Dataset.from_pandas(df)
print(dataset.shape)
dataset

(100, 2)


Dataset({
    features: ['game_name', 'game_description'],
    num_rows: 100
})

# Pooling And Embeddings

In [10]:
def cls_pooling(model_output):
    """
    Perform pooling to obtain the representation of the first token (often [CLS]) from the last_hidden_state.

    Args:
        model_output: The output of a language model or transformer model.

    Returns:
        A tensor representing the pooled representation of the first token across the batch.
    """
    # Select the first token's representation from the last_hidden_state tensor
    return model_output.last_hidden_state[:, 0]



def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
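
To make the pooling step concrete, the sketch below reproduces CLS pooling on a tiny hand-written "hidden state" and contrasts it with mean pooling, which the validation section later in this notebook uses. It is pure Python with nested lists instead of tensors; `cls_pool`, `mean_pool`, and the numbers are illustrative assumptions, not part of the model's API.

```python
def cls_pool(hidden):
    """Keep only the first token's vector of every sequence."""
    return [seq[0] for seq in hidden]

def mean_pool(hidden, mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    pooled = []
    for seq, m in zip(hidden, mask):
        count = max(sum(m), 1e-9)  # guard against an all-padding sequence
        dim = len(seq[0])
        pooled.append([
            sum(tok[d] * w for tok, w in zip(seq, m)) / count
            for d in range(dim)
        ])
    return pooled

# Shape: (batch=2, tokens=3, hidden=2). Values are made up.
hidden = [
    [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]],  # third token is padding
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]

print(cls_pool(hidden))         # [[1.0, 0.0], [0.0, 1.0]]
print(mean_pool(hidden, mask))  # [[2.0, 1.0], [2.0, 3.0]]
```

Both strategies turn one vector per token into one vector per text, but they generally produce different embeddings for the same input.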

Apply Embeddings To All Rows (Using `map()` method)

In [11]:
# get_embeddings returns a torch tensor; detach it, move it to the CPU, and convert it to NumPy
embeddings_dataset = dataset.map(
    lambda x: {"embeddings": get_embeddings(x["game_description"]).detach().cpu().numpy()[0]}
    )

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

# Facebook AI Similarity Search (FAISS)
Add FAISS Index

In [12]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['game_name', 'game_description', 'embeddings'],
    num_rows: 100
})



---




# Sample Game Apps Search

In [31]:
def search_engine(search, k):
    """
    Perform a search in a game embeddings dataset and return the top k nearest examples.

    Args:
        search (str): The search query or input.
        k (int): The number of nearest examples to retrieve.

    Returns:
        A dictionary containing the game names, descriptions, and scores of the top k nearest examples.
    """
    search_embedding = get_embeddings([search]).detach().cpu().numpy()
    scores, samples = embeddings_dataset.get_nearest_examples("embeddings", search_embedding, k=k)
    game_name = samples['game_name']
    games_description = samples['game_description']

    result = {'game_name': game_name,
              'game_description': games_description,
              'scores': scores}

    return result


- Sample search 1

In [41]:
search1 = '''Join the Whisper and get the two issues of the SteamDolls graphic novel '''

In [42]:
search_result = search_engine(search1, 3)
# convert to DataFrame
df = pd.DataFrame(search_result)
df

Unnamed: 0,game_name,game_description,scores
0,SteamDolls - Order Of Chaos : Graphic Novels,"Join the Whisper and get the two issues of the SteamDolls graphic novel !\n\r\nIn the mist of time, society is eaten by industry, and its citizens oppressed by an authoritarian and omniscient cult. The Whisper, a violent-tempered anarchist, dreams about a revolution that will destroy the order of things. He devotes himself to expose the dark side of those in power and bring awareness to the people. His obsessive quest for the ultimate truth will throw him into a world of betrayal and sorcery where the only certainty is the imminent end of the world.",19.898991
1,Kin's Chronicle,"Development of this game has been halted. Sales will be suspended and it will be given away for free as-is.Delve into winding caverns, climb creepy towers, and find shortcuts in devious labyrinths. Sneak past traps and wandering enemies if you can, or dash right through and heal up later! Keep your eyes peeled, and find the gear and relics that will help you to solve the mystery of the broken world.Pummel baddies in classic Role-Playing Game style with tons of attacks and spells. Use combat Finesse abilities to slow your enemies down and create an opening for a massive meteor spell! But if they can't take the heat...Jealous of that special ability the Giant Carnivorous Plant used that nearly knocked everyone out? Convince him to join you, and use it on your enemies!There will be 70+ beasts, spirits, and mythological creatures you can collect and add to your team! Gather as many wanderers as you can to help your cause, each with their own levels and abilities.Final version includes:150+ Spells and skills720+ Craftable weapons and armor80+ Recruitable monsters and charactersKeyboard & mouse controls (Customizable)Soundtrack includedX-Input/Xbox Gamepad support (Customizable)Lots of Graphics and Gameplay options!",40.793911
2,The Coma 2: Vicious Sisters,"Wishlist Scarlet Hood and the Wicked Wood!https://store.steampowered.com/app/1141120About the GameMina Park, a student of Sehwa High, awakens at night in her school. It isn’t long before she realizes that something is amiss. The once-familiar school where she spends her evenings studying looks twisted by something dark and sinister. She finds herself pursued by someone or something that looks eerily like her teacher. To survive, Mina must venture beyond the boundaries of her school and into the surrounding district. There, she will encounter strange creatures, mysterious strangers, and uneasy allies. The Coma 2: Vicious Sisters is an atmospheric, story-driven game. Immerse yourself in the warped Sehwa district as you encounter an engaging cast of characters, solve puzzles, discover revealing clues, and fight for survival against a relentless psycho.Heroines, forced to confront their worst fears, rarely emerge unscathed. Over the course of the story, Mina will encounter dangerous scenarios from which she could sustain permanent damage. Craft items to anticipate future dangers to avoid injury.Dark Song is now more fearsome and powerful than ever before. Falling into her grasp could spell instant death. Running and wielding your flashlight makes you an easy target. You must precariously balance the urgency of exploration with your absolute need to survive! Fear Dark Song’s relentless pursuit to kill you, now with an all-new AI.Craft items to prepare for critical life-or-death situations or risk permanent injury.Explore the nightmarish district of Sehwa and discover its dark secrets.Scavenge resources to survive deadly encounters and afflictions.Unlock tools and upgrades to reach previously inaccessible areas.Hide to avoid detection and certain death. Pass challenges to conceal your location. Featuring vibrant, hand-illustrated in-game visuals and comic strips.",41.182491


Based on sample search 1, the Hugging Face `get_nearest_examples()` function appears to score results with a Euclidean-style (L2) distance, which is what FAISS uses when `add_faiss_index()` is called without specifying another metric. Lower scores therefore indicate higher similarity: the closer two embedding vectors are, the more similar the game descriptions are considered to be.
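
The "lower score means closer" behaviour can be sketched with a brute-force nearest-neighbour search. This example assumes the scores are squared Euclidean distances, which is what FAISS flat L2 indexes report as far as we know; `squared_l2`, `nearest`, and the 2-D vectors are made up for illustration and stand in for the 768-dimensional embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, k):
    """Return the k (distance, index) pairs with the smallest distance."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]  # toy "index"
query = [1.0, 1.0]                              # identical to vectors[2]

print(nearest(query, vectors, k=3))
# [(0.0, 2), (2.0, 0), (13.0, 1)]
```

A query that coincides with a stored vector gets a distance of exactly 0 and is ranked first.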

- Sample search 2
<br/>
Copy the full game_description of the game app `SteamDolls - Order Of Chaos : Graphic Novels` and use it as the query.

> Please note that the set of game apps you receive may differ due to the randomness involved in fetching game apps from Steam.

In [58]:
search2 = '''Join the Whisper and get the two issues of the SteamDolls graphic novel !\n\r\nIn the mist of time,
             society is eaten by industry, and its citizens oppressed by an authoritarian and omniscient cult.
             The Whisper, a violent-tempered anarchist, dreams about a revolution that will destroy the order of things.
             He devotes himself to expose the dark side of those in power and bring awareness to the people.
             His obsessive quest for the ultimate truth will throw him into a world of betrayal and sorcery
             where the only certainty is the imminent end of the world.'''

In [59]:
search_result = search_engine(search2, 3)
# convert to DataFrame
df = pd.DataFrame(search_result)
df

Unnamed: 0,game_name,game_description,scores
0,SteamDolls - Order Of Chaos : Graphic Novels,"Join the Whisper and get the two issues of the SteamDolls graphic novel !\n\r\nIn the mist of time, society is eaten by industry, and its citizens oppressed by an authoritarian and omniscient cult. The Whisper, a violent-tempered anarchist, dreams about a revolution that will destroy the order of things. He devotes himself to expose the dark side of those in power and bring awareness to the people. His obsessive quest for the ultimate truth will throw him into a world of betrayal and sorcery where the only certainty is the imminent end of the world.",0.0
1,Kin's Chronicle,"Development of this game has been halted. Sales will be suspended and it will be given away for free as-is.Delve into winding caverns, climb creepy towers, and find shortcuts in devious labyrinths. Sneak past traps and wandering enemies if you can, or dash right through and heal up later! Keep your eyes peeled, and find the gear and relics that will help you to solve the mystery of the broken world.Pummel baddies in classic Role-Playing Game style with tons of attacks and spells. Use combat Finesse abilities to slow your enemies down and create an opening for a massive meteor spell! But if they can't take the heat...Jealous of that special ability the Giant Carnivorous Plant used that nearly knocked everyone out? Convince him to join you, and use it on your enemies!There will be 70+ beasts, spirits, and mythological creatures you can collect and add to your team! Gather as many wanderers as you can to help your cause, each with their own levels and abilities.Final version includes:150+ Spells and skills720+ Craftable weapons and armor80+ Recruitable monsters and charactersKeyboard & mouse controls (Customizable)Soundtrack includedX-Input/Xbox Gamepad support (Customizable)Lots of Graphics and Gameplay options!",46.530907
2,The Coma 2: Vicious Sisters,"Wishlist Scarlet Hood and the Wicked Wood!https://store.steampowered.com/app/1141120About the GameMina Park, a student of Sehwa High, awakens at night in her school. It isn’t long before she realizes that something is amiss. The once-familiar school where she spends her evenings studying looks twisted by something dark and sinister. She finds herself pursued by someone or something that looks eerily like her teacher. To survive, Mina must venture beyond the boundaries of her school and into the surrounding district. There, she will encounter strange creatures, mysterious strangers, and uneasy allies. The Coma 2: Vicious Sisters is an atmospheric, story-driven game. Immerse yourself in the warped Sehwa district as you encounter an engaging cast of characters, solve puzzles, discover revealing clues, and fight for survival against a relentless psycho.Heroines, forced to confront their worst fears, rarely emerge unscathed. Over the course of the story, Mina will encounter dangerous scenarios from which she could sustain permanent damage. Craft items to anticipate future dangers to avoid injury.Dark Song is now more fearsome and powerful than ever before. Falling into her grasp could spell instant death. Running and wielding your flashlight makes you an easy target. You must precariously balance the urgency of exploration with your absolute need to survive! Fear Dark Song’s relentless pursuit to kill you, now with an all-new AI.Craft items to prepare for critical life-or-death situations or risk permanent injury.Explore the nightmarish district of Sehwa and discover its dark secrets.Scavenge resources to survive deadly encounters and afflictions.Unlock tools and upgrades to reach previously inaccessible areas.Hide to avoid detection and certain death. Pass challenges to conceal your location. Featuring vibrant, hand-illustrated in-game visuals and comic strips.",47.039906


See! The top result for search2 has a score of exactly 0. Because the query is the full description of that game, its embedding coincides with the embedding already stored in the index, so the distance between the two vectors is zero.



---




# Validate with Cosine Similarity

Cosine similarity is a similarity metric used to assess the resemblance between vectors.

In NLP similarity search tasks, such as text similarity or document similarity, cosine similarity is commonly used. Mathematically it ranges from -1 to 1, although for text embeddings the values typically fall between 0 and 1.

A cosine similarity of 1 means the vectors point in exactly the same direction, indicating the highest degree of similarity. A cosine similarity of 0 means the vectors are orthogonal, indicating no similarity, and negative values mean the vectors point in opposing directions.

In practice, higher values indicate greater similarity between the vectors being compared.
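
As a small worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths, so its full mathematical range is -1 to 1. The sketch below computes it by hand on made-up 2-D vectors; `cosine` is a hypothetical helper written for this illustration, not the scikit-learn function used in the cells that follow.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction      -> 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal          -> 0.0
print(cosine([1.0, 0.0], [-4.0, 0.0]))  # opposite direction  -> -1.0
```

Note that the length of the vectors does not matter, only their direction: scaling either input leaves the result unchanged.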

In [60]:
# import cosine_similarity from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

# store the top-k game descriptions in a list (from the previous search2 results)
game_desc_Top_K_list = [df['game_description'][0],
                        df['game_description'][1],
                        df['game_description'][2]]

# tokenize the list
game_desc_Top_K_list_tok = tokenizer(game_desc_Top_K_list, padding=True, truncation=True, return_tensors='pt')

import torch
with torch.no_grad():
  # move the inputs to the device (for faster compute; the model is already on the device)
  game_desc_Top_K_list_tok = {k: v.to(device) for k, v in game_desc_Top_K_list_tok.items()}
  # run the pre-trained model on the tokenized descriptions (forward pass)
  model_output = model(**game_desc_Top_K_list_tok)

In [61]:
# Mean Pooling - Take attention mask into account for correct averaging (previously we used cls_pooling)
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# The model returns one vector per token, but we want
# one vector per sentence, so we pool the token vectors.
model_ouputs = mean_pooling(model_output, game_desc_Top_K_list_tok['attention_mask'])

# normalize
import torch.nn.functional as F
model_ouputs = F.normalize(model_ouputs, p=2, dim=1)
model_ouputs.shape

torch.Size([3, 768])

In [62]:
# detach the model output from (device) and move it into cpu() then convert to numpy.
model_ouputs = model_ouputs.detach().cpu().numpy()
model_ouputs.shape

(3, 768)

Check the shape of each game_description.

In [63]:
model_ouputs[0].shape

(768,)

Each game description embedding currently has shape (768,). `cosine_similarity` expects 2-D inputs, so we reshape each one to (1, 768), which represents a single row with 768 dimensions.

In [64]:
# reshaping the game description embeddings
game_desc_steamdolls = model_ouputs[0].reshape(1, -1)
game_desc_kinschronicles = model_ouputs[1].reshape(1, -1)
game_desc_thecoma = model_ouputs[2].reshape(1, -1)

# check the shape
game_desc_steamdolls.shape

(1, 768)

## Search Query
Our search query will be the game_description of the game app 'SteamDolls - Order Of Chaos : Graphic Novels' (stored in `search2`).

In [65]:
def search_games(game_desc):
  game_desc_tok = tokenizer(game_desc, padding=True, truncation=True, return_tensors='pt')

  with torch.no_grad():
    game_desc_tok = {k: v.to(device) for k, v in game_desc_tok.items()}
    model_output = model(**game_desc_tok)

    model_ouputs = mean_pooling(model_output, game_desc_tok['attention_mask'])

  return model_ouputs.detach().cpu().numpy()

In [66]:
search2 = search_games(search2)
search2.shape

(1, 768)

Great! Now we can evaluate the cosine similarity between 'search2' and each of the three top-k game descriptions.

In [68]:
print(cosine_similarity(search2 , game_desc_steamdolls))
print(cosine_similarity(search2 , game_desc_kinschronicles))
print(cosine_similarity(search2 , game_desc_thecoma))

[[1.]]
[[0.58424795]]
[[0.6446697]]


When `search2`, which holds an exact copy of the SteamDolls game description, is compared with `game_desc_steamdolls`, the cosine similarity is 1.0: a perfect match, as expected for identical text. "Kin's Chronicle" (about 0.58) and "The Coma 2: Vicious Sisters" (about 0.64) score noticeably lower, showing only moderate similarity to the query. Cosine similarity ranks "The Coma 2: Vicious Sisters" above "Kin's Chronicle", the reverse of the FAISS ordering. The two steps use different pooling (CLS versus mean) and different metrics, so their orderings need not agree.
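
One way to see how a distance score and a cosine score relate: for unit-length vectors, squared Euclidean distance and cosine similarity are tied by the identity `||a - b||^2 = 2 - 2 * cos(a, b)`, so both rank neighbours the same way. The sketch below checks this on made-up 2-D vectors; all helper names are hypothetical. It is an illustration only, since the FAISS index in this notebook stores un-normalized CLS embeddings and the identity therefore does not apply to the scores above directly.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cos_ab = dot(a, b)        # cosine similarity of unit vectors is their dot product
dist_ab = squared_l2(a, b)

# For unit vectors the two quantities carry the same information.
print(math.isclose(dist_ab, 2 - 2 * cos_ab))  # True
```

If the embeddings were normalized before being indexed, the FAISS distances and the cosine similarities would be expected to produce the same ranking.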