# Semantic Based Search Engine

# Enhancing Search Engine Relevance for Video Subtitles

## Background
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in connecting users with relevant information. For Google, providing a seamless and accurate search experience is paramount. This project focuses on improving the search relevance for video subtitles, enhancing the accessibility of video content.

## Objective
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.

## Keyword-based vs Semantic Search Engines
- **Keyword Based Search Engine:** These search engines rely heavily on exact keyword matches between the user query and the indexed documents.
- **Semantic Search Engines:** Semantic search engines go beyond simple keyword matching to understand the meaning and context of user queries and documents.
- **Comparison:** While keyword-based search engines focus primarily on matching exact keywords in documents, semantic-based search engines aim to understand the deeper meaning and context of user queries to deliver more relevant and meaningful search results.
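
To make the contrast concrete, here is a minimal sketch of pure keyword matching. The function name `keyword_overlap` and the example sentences are illustrative assumptions, not part of the project code. It scores a query against a document by the overlap of their word sets, so a paraphrase that shares no words with the query scores zero even though the meaning is close.

```python
def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap between the word sets of a query and a document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

# The query words all appear in the first document, none appear in the second.
exact = keyword_overlap("the car is fast", "the car is fast and red")
paraphrase = keyword_overlap("the car is fast", "that automobile moves quickly")

print(exact, paraphrase)
```

A semantic engine is meant to close exactly this gap by comparing embeddings instead of surface words.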

## Core Logic
To compare a user query against a video subtitle document, the core logic involves three key steps:
1. **Preprocessing of Data:** 
   - Read the given data.
   - Observe that the given data is a database file.
   - Go through the README.txt to understand what is there inside the database.
   - Take care of decoding the files inside the database.
   - If you have limited compute resources, you can take a random 30% of the data.
   - Apply appropriate cleaning steps on subtitle documents (whatever is required).

2. **Vectorization:**
   - Experiment with the following to generate text vectors of subtitle documents:
     - BOW / TFIDF to generate sparse vector representations. Note that this will only help you to build a Keyword Based Search Engine.
     - BERT based “SentenceTransformers” to generate embeddings which encode semantic information. This can help us build a Semantic Search Engine.
   - **Document Chunker:** Consider the challenge of embedding large documents: Information Loss. It is often not practical to embed an entire document as a single vector, particularly when dealing with long documents.
     - Divide a large document into smaller, more manageable chunks for embedding.
     - To mitigate accidentally cutting off important text between chunks, set overlapping windows with a specified amount of tokens to overlap so there are tokens shared between chunks.
   - Store embeddings in a ChromaDB database.

3. **Retrieving Documents:**
   - Take the user's search query.
   - Preprocess the query (if required).
   - Create query embedding.
   - Using cosine distance, calculate the similarity score between embeddings of documents and user search query embedding.
   - These cosine similarity scores will help in returning the most relevant candidate documents as per user’s search query.
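
The scoring step can be sketched in plain Python as follows. The helper names (`cosine_similarity`, `top_k`) and the three-dimensional toy vectors are assumptions made for illustration; real embeddings have hundreds of dimensions and would normally be compared inside the vector database. Cosine distance is simply one minus the cosine similarity computed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "document embeddings" (hypothetical values).
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
ranking = top_k([1.0, 0.1, 0.0], toy_docs, k=2)
print(ranking)
```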

## Step-by-Step Process

### Part 1: Ingesting Documents
1. **Read the given data.**
2. **Observe that the given data is a database file.**
3. **Go through the README.txt to understand what is there inside the database.**
4. **Take care of decoding the files inside the database.**
5. **If you have limited compute resources, you can take a random 30% of the data.**
6. **Apply appropriate cleaning steps on subtitle documents (whatever is required).**
7. **Experiment with:**
   - BOW / TFIDF to generate sparse vector representations.
   - BERT based “SentenceTransformers” to generate embeddings which encode semantic information.
8. **Document Chunker:**
   - Divide a large document into smaller, more manageable chunks for embedding.
   - Set overlapping windows with a specified amount of tokens to overlap so there are tokens shared between chunks.
9. **Store embeddings in a ChromaDB database.**
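
For the BOW / TFIDF option, the sketch below builds sparse-style TF-IDF vectors by hand so the mechanics are visible. The function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are choices made for this illustration; in practice a library vectorizer would likely be used instead.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    # Smoothed inverse document frequency (an assumption of this sketch).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["the cat sat", "the dog barked", "the cat purred"])
print(vocab)
print(vectors[0])
```

Terms that occur in every document (here "the") receive the lowest weight, while rarer terms dominate the vector. The vectors still only match on shared words, which is why this route yields a keyword-based engine.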

### Part 2: Retrieving Documents
1. **Take the user's search query.**
2. **Preprocess the query (if required).**
3. **Create query embedding.**
4. **Using cosine distance, calculate the similarity score between embeddings of documents and user search query embedding.**
5. **These cosine similarity scores will help in returning the most relevant candidate documents as per user’s search query.**



#### Importing all the libraries

In [1]:
import pandas as pd
import sqlite3
import zipfile
import io

### 1. connecting to database

#### The data is stored in a SQLite database file (eng_subtitles_database.db), so we first read it with sqlite3

In [2]:
conn = sqlite3.connect(r"E:\Github Files\Search Engine\eng_subtitles_database.db")
cursor = conn.cursor()
print(cursor)
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

<sqlite3.Cursor object at 0x000001FB6CE203C0>
[('zipfiles',)]
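
Before loading a whole table into pandas, it can help to inspect its schema and size. The sketch below does this against a small in-memory stand-in so it runs anywhere; the column types and the sample row are assumptions, only the table name `zipfiles` and the column names come from this notebook.

```python
import sqlite3

# In-memory stand-in for the subtitle database (hypothetical schema and row).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")
conn_demo.execute(
    "INSERT INTO zipfiles VALUES (?, ?, ?)", (1, "example.eng.1cd", b"PK...")
)

# List tables, then the columns and row count of the table of interest.
tables = [row[0] for row in conn_demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
columns = [row[1] for row in conn_demo.execute("PRAGMA table_info(zipfiles)")]
row_count = conn_demo.execute("SELECT COUNT(*) FROM zipfiles").fetchone()[0]
conn_demo.close()

print(tables, columns, row_count)
```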


In [3]:
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...



In [5]:
from tqdm import tqdm
tqdm.pandas()

In [6]:
df['content']

0        b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1        b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2        b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3        b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4        b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...
                               ...                        
82493    b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xb8\xa6\x...
82494    b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x13\x97\x...
82495    b'PK\x03\x04\x14\x00\x00\x00\x08\x00$\x97\x9aV...
82496    b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x97\x...
82497    b'PK\x03\x04\x14\x00\x00\x00\x08\x00,\x97\x9aV...
Name: content, Length: 82498, dtype: object

#### Each content value is a zipped subtitle file stored as bytes, so we unzip it and decode the text using latin-1

In [7]:
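def decode(data):
    # Each blob is a ZIP archive; read the first file inside it.
    with zipfile.ZipFile(io.BytesIO(data)) as zip_file:
        first_name = zip_file.namelist()[0]
        raw_bytes = zip_file.read(first_name)
    # latin-1 maps every byte to a character, so decoding never raises.
    return raw_bytes.decode('latin-1')

df['content'] = df['content'].progress_apply(decode)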

100%|███████████████████████████████████████████████████████████████████████████| 82498/82498 [01:58<00:00, 695.48it/s]
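
The `ï»¿` prefix visible in some decoded subtitles in this notebook is the UTF-8 byte-order mark (bytes EF BB BF) read as latin-1. One possible variant, sketched here as an alternative rather than as the notebook's method, tries UTF-8 first and falls back to latin-1 only when that fails. The name `decode_subtitle` and the sample archive are hypothetical.

```python
import io
import zipfile

def decode_subtitle(data: bytes) -> str:
    """Unzip the first file in the archive and decode it, preferring UTF-8."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        raw = zf.read(zf.namelist()[0])
    try:
        # utf-8-sig also strips a leading byte-order mark if present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 accepts any byte sequence, so this branch cannot fail.
        return raw.decode("latin-1")

# Build a tiny in-memory archive that starts with a UTF-8 BOM.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "sample.srt",
        "\ufeff1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n".encode("utf-8"),
    )
decoded = decode_subtitle(buffer.getvalue())
print(repr(decoded))
```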


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


In [9]:
df.shape

(82498, 3)

In [10]:
df.size

247494

In [11]:
df.isnull().sum()

num        0
name       0
content    0
dtype: int64

In [12]:
df.describe()

Unnamed: 0,num
count,82498.0
mean,9351228.0
std,98820.55
min,9180533.0
25%,9264094.0
50%,9349568.0
75%,9437720.0
max,9521941.0


In [13]:
df = df.sample(frac=0.3, random_state=42)

In [14]:
import re

In [15]:
df['content']

17262    ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...
7294     1\r\n00:00:09,275 --> 00:00:11,876\r\n¶ Oh, I ...
47707    1\r\n00:00:07,140 --> 00:00:14,220\r\n<i>Timin...
29914    1\r\n00:00:06,133 --> 00:00:08,900\r\n[etherea...
54266    ï»¿1\r\n00:00:01,480 --> 00:00:03,570\r\n[Chri...
                               ...                        
67460    ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
15296    ï»¿1\r\n00:00:03,440 --> 00:00:06,160\r\n-Wher...
40242    ï»¿1\r\n00:00:01,101 --> 00:00:02,865\r\n<i>Pr...
56391    ï»¿1\r\n00:00:01,768 --> 00:00:03,168\r\n<i>- ...
67859    ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\napi.O...
Name: content, Length: 24749, dtype: object

### 2. **Preprocessing of Data:** 

#### **Data cleaning** is a crucial step: the data contains HTML tags, timestamps, numbers, and special characters

In [18]:
import nltk
import string

In [19]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
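def clean_data(doc):
    # Remove Windows line breaks (adjacent lines are joined without a space).
    doc = re.sub(r"\r\n", '', doc)
    # Remove the arrow between cue start and end times.
    doc = re.sub(r"-->", "", doc)
    # Remove angle brackets only; tag names such as "i" stay in the text.
    doc = re.sub(r"[<>]", "", doc)
    # Remove text up to the first ¶, BOM debris (ï»¿) and music-note markers.
    doc = re.sub(r'^.*?¶\s*|ï»¿\s*|¶|âª', '', doc)

    # Drop ASCII punctuation and all digits, then normalise case and whitespace.
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()])
    doc = doc.lower()
    doc = doc.strip()

    return doc

df['content'] = df['content'].progress_apply(clean_data)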
    

100%|████████████████████████████████████████████████████████████████████████████| 24749/24749 [08:14<00:00, 50.04it/s]
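
The cleaned output in this notebook contains merged words such as "fromi" and "itiming", which come from deleting line breaks without a space and from stripping only the angle brackets of tags like `<i>`. The following is a sketch of one possible alternative, not the cleaning used for the results shown here. The name `clean_subtitle` and the regular expressions are assumptions, and the letter filter keeps ASCII letters only, so accented characters would be dropped.

```python
import re

# Cue timing line such as "00:00:03,440 --> 00:00:06,160".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# A whole markup tag such as <i> or </font>.
TAG = re.compile(r"<[^>]+>")

def clean_subtitle(doc: str) -> str:
    doc = doc.replace("\ufeff", "").replace("ï»¿", "")  # BOM in either decoding
    doc = TIMESTAMP.sub(" ", doc)            # drop cue timings
    doc = TAG.sub(" ", doc)                  # drop complete tags, not just brackets
    doc = re.sub(r"[\r\n]+", " ", doc)       # line breaks become spaces
    doc = re.sub(r"[^a-zA-Z\s]", "", doc)    # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", doc).strip().lower()

sample = "ï»¿1\r\n00:00:03,440 --> 00:00:06,160\r\n-Where did that come from?\r\n-I don't know.\r\n"
cleaned = clean_subtitle(sample)
print(cleaned)
```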


In [21]:
### After cleaning the content data

In [22]:
df

Unnamed: 0,num,name,content
17262,9251120,maybe.this.time.(2014).eng.1cd,watch any video online with opensubtitlesfree ...
7294,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,oh i know that its getting late but i dont ...
47707,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,itiming and subtitles by the uncontrollable lo...
29914,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiopensubtitlesorg is depreca...
54266,9408707,battlebots.(2015).eng.1cd,chris oh nonot the minibots yelling oh you le...
...,...,...,...
67460,9458807,kevin.can.wait.s01.e13.ring.worm.(2017).eng.1cd,script infotitle default filescripttype vwraps...
15296,9244890,bia.s01.e29.episode.1.29.(2019).eng.1cd,where did that come fromi dont know its a tap...
40242,9345965,heroes.s02.e11.chapter.eleven.powerless.(2007)...,ipreviously oni heroes tell me where i can fi...
56391,9417351,hot.in.cleveland.s05.e09.bad.george.clooney.(2...,i hot in clevelandi is recorded in frontof a ...


### 3. **Chunking**

##### Each large document is divided into smaller, more manageable chunks for embedding. Note that the code below slices the string, so a chunk is 500 characters (not tokens) long, and consecutive chunks overlap by 100 characters so that text cut at a boundary is shared between neighbours.

In [23]:
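chunk_size = 500  # characters per chunk
overlap = 100     # characters shared by consecutive chunks

def chunk_text(text, chunk_size, overlap):
    # Slide a window over the string; each step advances by chunk_size - overlap characters.
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + chunk_size]
        chunks.append(chunk.lower())
        start += chunk_size - overlap
    return chunks

df['chunks'] = df['content'].progress_apply(lambda x: chunk_text(x, chunk_size, overlap))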

100%|██████████████████████████████████████████████████████████████████████████| 24749/24749 [00:11<00:00, 2164.43it/s]
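
The chunking cell in this notebook slices by characters. If chunks measured in tokens are wanted, as the project description suggests, a whitespace-token variant could look like the sketch below. The name `chunk_tokens` is hypothetical, and whitespace tokens are only an approximation of the subword tokens an embedding model actually counts, so the chunk size is worth checking against the model's input limit.

```python
def chunk_tokens(text, chunk_size=500, overlap=100):
    """Split text into chunks of chunk_size whitespace tokens with overlapping windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

demo = chunk_tokens("a b c d e f g h i j", chunk_size=4, overlap=2)
print(demo)
```

Stopping once a window reaches the end avoids emitting a trailing chunk that consists only of already-covered overlap.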


In [24]:
df.head()

Unnamed: 0,num,name,content,chunks
17262,9251120,maybe.this.time.(2014).eng.1cd,watch any video online with opensubtitlesfree ...,[watch any video online with opensubtitlesfree...
7294,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,oh i know that its getting late but i dont ...,[oh i know that its getting late but i dont...
47707,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,itiming and subtitles by the uncontrollable lo...,[itiming and subtitles by the uncontrollable l...
29914,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiopensubtitlesorg is depreca...,[ethereal music apiopensubtitlesorg is deprec...
54266,9408707,battlebots.(2015).eng.1cd,chris oh nonot the minibots yelling oh you le...,[chris oh nonot the minibots yelling oh you l...


In [None]:
# pip install -U sentence-transformers

### 4. **Vectorization:**

#### BERT Sentence Transformer

##### BERT based “SentenceTransformers” generate embeddings which encode semantic information. This can help us build a Semantic Search Engine.

In [25]:
from sentence_transformers import SentenceTransformer, util
from chromadb.utils import embedding_functions
import chromadb

In [38]:
df1 = df.sample(frac=0.5, random_state=42)

In [43]:
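model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name, device='cpu')
# Encode each row's list of chunks; the result is one 2-D array (chunks x embedding dim) per row.
df1['encoding_data'] = df1['chunks'].progress_apply(model.encode)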

### 5. Store embeddings in a ChromaDB database.

In [44]:
client = chromadb.PersistentClient(path="searchengine_database")
collection = client.get_or_create_collection(name="search_engine", metadata={"hnsw:space": "cosine"})

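def store_encoder_db(df):
    # Each row holds one embedding per chunk, so every chunk is stored as its own entry.
    for i in range(df.shape[0]):
        row = df.iloc[i]
        embedding_list = row['encoding_data'].tolist()
        if not embedding_list:
            continue  # empty subtitle file, nothing to store
        n_chunks = len(embedding_list)
        collection.add(
            documents=[row['name']] * n_chunks,
            embeddings=embedding_list,
            # Chunk ids must be unique: "<num>_<chunk index>".
            ids=[f"{row['num']}_{j}" for j in range(n_chunks)]
        )

# Only df1 has the 'encoding_data' column.
store_encoder_db(df1)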
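
For Part 2, a query against the collection returns nested lists of ids, documents, and distances, one inner list per query. When several stored entries belong to the same subtitle file, the hits need to be collapsed into one score per file. The helper below is a sketch under these assumptions: the result is a dictionary shaped like the output of `collection.query(query_embeddings=[...], n_results=...)`, the `documents` field holds the subtitle file name as stored in this notebook, and distances are cosine distances (one minus similarity). The name `rank_documents` and the sample result are hypothetical.

```python
def rank_documents(query_result, top_n=5):
    """Collapse per-entry hits into one best similarity score per subtitle file."""
    names = query_result["documents"][0]      # results for the first (only) query
    distances = query_result["distances"][0]
    best = {}
    for name, distance in zip(names, distances):
        similarity = 1.0 - distance           # cosine distance -> cosine similarity
        if name not in best or similarity > best[name]:
            best[name] = similarity
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)[:top_n]

# Hand-written stand-in for a query result (hypothetical values).
fake_result = {
    "ids": [["1_0", "2_3", "1_4"]],
    "documents": [["broker.(2022).eng.1cd", "battlebots.(2015).eng.1cd", "broker.(2022).eng.1cd"]],
    "distances": [[0.2, 0.3, 0.1]],
}
ranked = rank_documents(fake_result, top_n=2)
print(ranked)
```

Taking the best chunk per file is one simple aggregation choice; averaging the top few chunk scores is another option worth comparing.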