# **Enhancing Search Engine Relevance for Video Subtitles**

### Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.

In [None]:
!pip install langchain_community
!pip install langchain-openai
!pip install datasets transformers sentence-transformers
!pip install langchain-huggingface
!pip install langchain ollama
!pip install chromadb
!pip install langchain-chroma

Collecting protobuf (from onnxruntime>=1.14.1->chromadb)
  Using cached protobuf-5.29.0-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Using cached protobuf-5.29.0-cp38-abi3-manylinux2014_x86_64.whl (319 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.25.5
    Uninstalling protobuf-4.25.5:
      Successfully uninstalled protobuf-4.25.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.17.1 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 5.29.0 which is incompatible.
tensorflow-metadata 1.13.1 requires protobuf<5,>=3.20.3, but you have protobuf 5.29.0 which is incompatible.
Successfully installed protobuf-5.29.0


In [None]:
import numpy as np
import pandas as pd
import sqlite3
import re
from langchain_community.document_loaders import DataFrameLoader
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

## **Loading the database using sqlite3**

In [None]:

conn = sqlite3.connect('/content/drive/MyDrive/eng_subtitles_database.db')
cursor = conn.cursor()
cursor.execute("PRAGMA integrity_check;")
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('zipfiles',)]


In [None]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

num
name
content


## **Loading the Database Table inside a Pandas DataFrame**

In [None]:
df = pd.read_sql_query("SELECT * FROM  zipfiles",conn)
df.head()

| | num | name | content |
|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...` |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...` |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...` |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...` |
| 4 | 9180600 | broker.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...` |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


In [None]:
df.shape

(82498, 3)

## **Unzipping the content**

In [None]:
# Example: the row at index 385
# The content column holds each subtitle file as a ZIP archive in binary form

import zipfile
import io

binary_data = df.iloc[385,2]

#Decompress the data
with io.BytesIO(binary_data) as f:
  with zipfile.ZipFile(f, 'r') as zip_file:
    subtitle_content=zip_file.read(zip_file.namelist()[0]) # Reading only one file in the ZIP archive

print(subtitle_content.decode('latin-1')) # Assuming the content is latin-1 encoded text

1
00:00:06,000 --> 00:00:12,074
Watch any video online with Open-SUBTITLES
Free Browser extension: osdb.link/ext

2
00:00:15,370 --> 00:00:16,506
You lose everything, my girl.

3
00:00:16,530 --> 00:00:19,360
So you've said - four times.

4
00:00:20,330 --> 00:00:22,120
I definitely had
it on yesterday.

5
00:00:22,465 --> 00:00:25,785
Your gloves, your keys, that
handkerchief I embroidered for you

6
00:00:25,809 --> 00:00:26,168
Everything!

7
00:00:26,192 --> 00:00:27,280
Five times.

8
00:00:31,610 --> 00:00:32,920
Miss Scarlet?
- Yes.

9
00:00:36,390 --> 00:00:37,390
I'm Miss Scarlet.

10
00:00:37,872 --> 00:00:40,880
May I inquire if
you've lost something?

11
00:00:41,350 --> 00:00:42,530
Some jewellery perhaps?

12
00:00:42,870 --> 00:00:45,130
Yes, my mother's wedding ring.

13
00:00:45,220 --> 00:00:45,840
Have you found it?

14
00:00:45,950 --> 00:00:47,656
Does your ring have
an inscription?

15
00:00:48,650 -->

In [None]:
# Unzipping the entire database

import zipfile
import io

def decode_method(binary_data):
    # Decompress the binary data using the zipfile module
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])

    # Now 'subtitle_content' should contain the extracted subtitle content
    return subtitle_content.decode('latin-1')
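
The `ï»¿` prefix that shows up in some decoded rows is what a UTF-8 byte order mark looks like when read as latin-1, which suggests that at least some files are UTF-8 rather than latin-1. A minimal sketch of a decoder that tries UTF-8 first and falls back to latin-1 is shown below. `decode_subtitle` is a hypothetical name and is not used elsewhere in this notebook; it also assumes one subtitle file per archive, as `decode_method` does.

```python
import io
import zipfile


def decode_subtitle(binary_data):
    """Unzip one subtitle archive and decode its first member to text.

    Hypothetical alternative to decode_method: UTF-8 (with optional BOM)
    is tried first, and latin-1 is used only if that fails.
    """
    with zipfile.ZipFile(io.BytesIO(binary_data)) as zip_file:
        raw = zip_file.read(zip_file.namelist()[0])
    try:
        # "utf-8-sig" silently drops a leading byte order mark if present
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises
        return raw.decode("latin-1")
```

Because latin-1 can decode any byte sequence, the fallback never fails, but it may still produce wrong characters for files in other legacy encodings.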


In [None]:
df['file_content'] = df['content'].apply(decode_method)

df.head()

| | num | name | content | file_content |
|---|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...` | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...` | `1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...` |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...` | `1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...` |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...` | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 4 | 9180600 | broker.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...` | `ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...` |


## **Cleaning the data**

In [None]:
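def txt_clean(text):
  # Remove known advertising and credit lines
  clntxt=text.replace('Watch any video online with Open-SUBTITLES\r\nFree Browser extension: osdb.link/ext\r\n\r\n2\r\n',"")
  clntxt=clntxt.replace('Please rate this subtitle at www.osdb.link/agwma\r\nHelp other users to choose the best subtitles',"")
  clntxt=clntxt.replace('Synced By JiSiN',"")
  # Note: pat1 is a character class, not a timestamp pattern. It deletes every digit,
  # '{', '}', ':', ',', '-', '>', '!', the BOM debris 'ï', '»', '¿' and all line breaks,
  # so numbers and commas in the dialogue are lost and wrapped lines are joined without a space.
  pat1=r'[\d{2}:\d{2}:\d{2},\d{3},-->:»¿!ï\n\r]'
  clntxt=re.sub(pat1,"",clntxt)
  # Delete leftover markup characters: '<', '*', '/', '[' and ']'
  pat2=r'[<*</*[\]]'
  clntxt=re.sub(pat2,"",clntxt)
  # Delete any 'i' at the end of a word: this drops the 'i' left over from '</i>',
  # but it also trims real words (for example 'Yumi' becomes 'Yum')
  pat3=r'i\b'
  clntxt=re.sub(pat3,"",clntxt)
  return clntxt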
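
Since `txt_clean` works character by character, it also removes digits, commas and trailing `i` letters from the dialogue itself. One possible alternative, sketched below under the assumption that the files follow the usual SRT layout (cue number, timestamp line, text lines, blank line), is to drop whole cue-number and timestamp lines and strip complete tags. `clean_srt` and the pattern names are hypothetical and are not part of the notebook's pipeline; the outputs shown in this notebook were produced with `txt_clean`.

```python
import re

# Hypothetical patterns for a line-based SRT cleaner
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")
CUE_INDEX = re.compile(r"^\d+$")
TAG = re.compile(r"<[^>]+>")


def clean_srt(text):
    """Return the dialogue of an SRT file as one space-separated string."""
    # Strip a byte order mark, whether decoded as UTF-8 or as latin-1
    for bom in ("\ufeff", "ï»¿"):
        if text.startswith(bom):
            text = text[len(bom):]
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timestamp lines entirely
        if not line or CUE_INDEX.match(line) or TIMESTAMP.match(line):
            continue
        # Remove complete tags such as <i> and </i>, keeping the text inside
        kept.append(TAG.sub("", line))
    # Join wrapped lines with a space and collapse repeated whitespace
    return re.sub(r"\s+", " ", " ".join(kept)).strip()
```

A dialogue line that consists only of digits would be mistaken for a cue number and dropped, so this sketch trades one kind of loss for a much smaller one. Advertising lines would still need to be removed separately.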

In [None]:
df['file_content'] = df['file_content'].apply(txt_clean)

In [None]:
df.iloc[4,3]

'    If you\'re going to throw it awaythen don\'t give birth.  Please take care of her.  Anything?  She probably ran away.  I guess so.  Give me the towel.  Woosung I\'m sorry.  I\'ll be sure to come pick you up.  Here we go again.I\'ll come pick him up.  No number...  They have no intention of doing that.  Hey hurry up and delete the video.  Your eyes are bright too huh?  Just not much hair on the eyebrows.  Even so you\'re such a cute kid.  How could they think ofgetting rid of you? Really huh?  So you\'re Woosung.  That\'s right Woosung.  You can be happy with us now. OK?  Right?  iFor those wanting to transferto a bus or an express bus  iplease exit now.  iThe exit for the BusanWest Intercity Bus Terminal  iis exit .  iGimhae International Airport is...   How\'s your knee? Huh?   Your knees hurt right? Yes I\'m okay.   You okay? Oh it\'s gotten a lot better.  Okay.  You should come more often.  It\'ll be good for youto exercise.  It\'s been a while so I just wantto rest whenever I 

In [None]:
df_exp_2=df.copy()

In [None]:
df_exp_2.drop(['content'],axis=1,inplace=True)

In [None]:
df_exp_2.head()

| | num | name | file_content |
|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | In the name of God the most gracious the m... |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | Ah There's PrincessDawn and Terry with the ... |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | iYum's Cells iEpisode Extremely Polite Yum... |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | iYum's Cells iEpisode Laptop First pla... |
| 4 | 9180600 | broker.(2022).eng.1cd | If you're going to throw it awaythen don't... |


## **Load the data using DataFrame Loader**

In [None]:
loader = DataFrameLoader(df_exp_2, page_content_column="file_content")
data = loader.load()

In [None]:
print(data[0].page_content)
print(data[0].metadata)

    In the name of God the most gracious the most Merciful.  From Muhammad the Messenger of God  to Heraclius the emperor of Byzantium.  greetings to him who is thefollower of righteous guidance.  I bid you to hear the divine call.  I am the messenger of God to the people;  accept Islam for your salvation.  He speaks of a new prophet in Arabia.  Was it like this when John the Baptistcame to king Herod  out of the desert crying about salvation?  To Muqawqis Patriarch of Alexandria.  Kisra emperor of Persia.  Muhammad calls you with the call of God.  Accept Islam for your salvation...  embrace Islam.  You come out of the desertsmelling of camel and goat.  To tell Persia where he should kneel?  Muhammad Messenger of God.  Who gave him this authority?  God sent Muhammadas a mercy to mankind.  The Scholars and Historians of Islam The University of AlAzhar in CairoThe High Islamic Congress of the Shiat in Lebanon  The makers of this film honour the Islamic traditionwhich holds that the Imper

In [None]:
print(type(data))

<class 'list'>


In [None]:
doc_contents = [doc.page_content for doc in data]  # page_content of each Document in data (a list)
meta_datas = [doc.metadata for doc in data]

## **Splitting and Chunking the data (using RecursiveCharacterTextSplitter)**

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000,
    chunk_overlap=50,
)
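
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified illustration in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on the listed separators); `sliding_chunks` is a hypothetical helper that only cuts fixed character windows, each sharing its first `chunk_overlap` characters with the end of the previous one.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=50):
    """Cut text into fixed-size character windows that overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    this ignores separators and may cut in the middle of a word.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that no window lies entirely inside the previous one
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]
```

With the settings used here, a 2000-character text would give windows starting at offsets 0, 950 and 1900. Since the cleaning step has already removed line breaks, the splitter in this notebook effectively falls back to splitting on spaces.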

In [None]:
chunks = text_splitter.create_documents(doc_contents, meta_datas)

In [None]:
chunks[0].page_content

'In the name of God the most gracious the most Merciful.  From Muhammad the Messenger of God  to Heraclius the emperor of Byzantium.  greetings to him who is thefollower of righteous guidance.  I bid you to hear the divine call.  I am the messenger of God to the people;  accept Islam for your salvation.  He speaks of a new prophet in Arabia.  Was it like this when John the Baptistcame to king Herod  out of the desert crying about salvation?  To Muqawqis Patriarch of Alexandria.  Kisra emperor of Persia.  Muhammad calls you with the call of God.  Accept Islam for your salvation...  embrace Islam.  You come out of the desertsmelling of camel and goat.  To tell Persia where he should kneel?  Muhammad Messenger of God.  Who gave him this authority?  God sent Muhammadas a mercy to mankind.  The Scholars and Historians of Islam The University of AlAzhar in CairoThe High Islamic Congress of the Shiat in Lebanon  The makers of this film honour the Islamic traditionwhich holds that the'

In [1]:
# from google.colab import userdata
# userdata.get('HF_TOKEN')

## **Define Embedding Model**

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"

embeddings_model = HuggingFaceEmbeddings(model_name=model_name)

In [None]:
def embed_data(docs):
  # embed_documents expects a list of strings, so pass all chunk texts in one call
  texts=[doc.page_content for doc in docs]
  return embeddings_model.embed_documents(texts)

## **Build ChromaDB collection and Embed Data in batches**

In [None]:
db = Chroma(collection_name="subtitle_project_vector_database",
            embedding_function=embeddings_model,
            persist_directory="subtitle_project_chroma_db_")

In [None]:
db.get()

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [None]:
def chunks_vdb_batch(db, documents, batch_size=41000):
    # Add documents in slices so that no single add_documents call receives the whole corpus
    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]
        db.add_documents(batch)

In [None]:
# Use the function to add documents in batches
chunks_vdb_batch(db, chunks)

In [None]:
db.get()

{'ids': ['78c1290e-70cb-404f-879a-7f1954e5f6a4',
  'fe7470c2-c622-4153-b748-9aec09184ddf',
  'd9096b50-a263-4961-b971-2e114b9655ef',
  'd7f3a292-2dd7-4bfc-a8fa-7bfe6df1bfed',
  'b85ac5c8-1f53-4ebe-af7e-327249addeeb',
  'fa624278-a2ea-4308-9a12-c32299add58e',
  '72f79b12-2ae4-48a9-aef0-b1e518a23b1f',
  '0064d2b9-3127-4e22-a833-13a8c19168e5',
  '3f46d122-ba2d-451e-a23c-57c21c8de5be',
  '17f9a1e2-e8fc-4075-bd45-6d31ba0e0f27',
  '07dd7690-b2d3-40c0-8c57-41cedd4ead1d',
  'dee91c95-3f84-4584-9405-063c0f39debc',
  '900731ee-1ab0-4725-8010-b527efb57718',
  '8e2f11aa-103d-4915-936b-b2a9a9c327a6',
  '2258021d-9eed-4d18-817e-d90878d23608',
  '831e6cbf-29b8-4fa9-bc8b-73ae40f98f4b',
  '37f496ef-7e3c-4777-b440-b5f8c7620c97',
  '7f349a2c-a1f4-4974-9669-cae6b73ab024',
  '9c6ccc9d-095f-44ac-ba35-58b091a0cfcb',
  '992ef9ff-b609-4189-9c62-fc885966d4b7',
  'afb580f6-0fed-4441-b93e-59a42ca2d9e6',
  'bb91c73c-64fa-48ab-ac4a-397e3f591dd9',
  'd035c0da-54d4-4c99-a790-d60966afedb6',
  'ffe3567f-b952-4316-bce1-

In [None]:
print(len(db.get()["ids"]))

2469209


## **Validate using a Query (Preprocess and use Query)**

In [None]:
# Take the raw (zipped) content of the row at index 1 as the test query
query = df.iloc[1,2]

In [None]:
print(query)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x99V\x12o\xb0\xc2g\x0f\x00\x00Z$\x00\x008\x00\x00\x00Here Comes the Grump - Ep. 9 - Joltin_ Jack-in boxia.srtuZ]s\x1b7\x96}g\x15\xff\x03\xf4\xe4\x97\x96\xa6\xbf\xd0\x1f\xaeT\\v\xb2\xb15\xc9lf\xc6N\xa9\xfc\xd8"A\xb1Gd7\xb7\xbb)\x86\xff~\xcf\xb9h@hm\xb6*e\xc5\xa4ppqq\xee\xb9\x1fp\xb2^\xc5\xf1{\xfc\x97\xd6Q\xa6ku{\xfb\xa3\xb2\x1fdi\x14\xe7\xd5z\xf5q\x7f\xa3\xbe\xed\xcd`\xde\x8d\xea\x9fC\xdbm\xcc8\xaeW?7\x97N5\xddV}3\xc3pU\x97v\xda\xabio\xd6\xab\xf5*u\x90@H\xf3*\x84\xcc\xa32\xaf\xd7\xabO\x87\xbe\xef\xccU\xfdf\x7f\x8c\xfda\xdb\x9a\x01\xa8\xa7\xa1\x9f\xccfj\xbb\'\x82\x1d\xef\x08\x97y\xb8:*\xe28\x80\xcb\xd3(O\xf2\xf5\xea\xf7}d\xcd\x88\x14\x80\x9fo\xd6\xab/0Wm\xfa\xa3\x19\x89\xa3>\x0f\xe7\xe3I\xc0r\x07\xc6\xb5i\x19\x82\xe9\xa8\xac\xf5z\xf5\xdd48\xf0G\x9c\x8cX\xa3:\xb4\xcfF\xedy\xf8\xa7^\xec\xea\xd7\xabq\xdf\xf7\x13\xcc>\x1a\xb5m\x86iT\xcd\xa4\xfa\xf3\xa0\x1e\x9b\x03O&;i\xbfS\x11\xe5Uh\xb6N\xa2:I\xd6\xab\x07s8D\xaa\xdd\xd1\xc4+\xfeh\xbbg\xf5\xcc-\x1e

In [None]:
# Preprocess the query: unzip and decode only (txt_clean is not applied, so timestamps and cue numbers remain)
query=decode_method(query)

In [None]:
query_vector = embeddings_model.embed_query(query)
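
The query embedded above is an entire decoded subtitle file, while the index holds cleaned chunks of about 1000 characters. Sentence-transformer models truncate inputs beyond their maximum sequence length, so only the start of such a long query is likely to influence the vector. A sketch of building a shorter query from already cleaned text follows; `make_query_excerpt` and its parameters are hypothetical and were not used to produce the results in this notebook.

```python
def make_query_excerpt(cleaned_text, start_word=0, n_words=40):
    """Return a short run of words from cleaned subtitle text.

    Hypothetical helper: the excerpt is meant to resemble what a user
    might type or what a single indexed chunk contains.
    """
    words = cleaned_text.split()
    return " ".join(words[start_word:start_word + n_words])
```

For example, `make_query_excerpt(txt_clean(decode_method(df.iloc[1,2])), start_word=100)` would take 40 words from inside the file after applying the same cleaning as the indexed text.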

### **Using Vector Similarity for Results**

In [None]:
relevant_chunks = db.similarity_search_by_vector(embedding=query_vector, k=5)

print(len(relevant_chunks))

5


In [None]:
[doc.metadata for doc in relevant_chunks]

[{'name': 'texas.rangers.(2001).eng.1cd', 'num': 9326255},
 {'name': 'edge.of.tomorrow.(2014).eng.1cd', 'num': 9390866},
 {'name': 'the.lair.(2022).eng.1cd', 'num': 9473497},
 {'name': 'anne.of.green.gables.the.continuing.story.s01.e02.episode.1.2.(2000).eng.1cd',
  'num': 9371656},
 {'name': 'bombs.away.(1985).eng.1cd', 'num': 9489164}]
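
The search returns chunks, but the user is looking for a subtitle file. One way to turn chunk hits into a ranking of files is to count how many of the top-k chunks come from each `name`. The sketch below assumes metadata dictionaries shaped like the ones printed above; `rank_titles` is a hypothetical helper and not part of the notebook.

```python
from collections import Counter


def rank_titles(hit_metadatas):
    """Rank subtitle files by how many retrieved chunks they contributed.

    hit_metadatas is a list of dicts with at least a "name" key, such as
    [doc.metadata for doc in relevant_chunks].
    """
    counts = Counter(meta["name"] for meta in hit_metadatas)
    # most_common() returns (name, hit_count) pairs, highest count first
    return counts.most_common()
```

With the five hits above, every file appears once, so the ranking would be flat. A larger `k` would be needed for the counts to separate the candidates.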

In [None]:
relevant_chunks

[Document(metadata={'name': 'texas.rangers.(2001).eng.1cd', 'num': 9326255}, page_content="Blam blam  George sell itin the dime stores.  She'd... makea good ranch horse.  I think you'd makea good ranch hand.  DunnisonMiss Dnkes  hope he hasn't tired youwith all his stories  all the bad men he's killedthe hangings the shootouts.  It's so hard to keep track.  Actually we were speakingof a horse.  He killed a horse?  ( horse burrs )  Miss Dukes.  McNELLYAmargosa's amile ride  but we'll make it by nightfall.  Send a man into Shepardton.  Get a telegram to Victor Logan.  Tell him to get hisfamily out of there.  And the Rangers are coming.  ( horse neighs )  I made the lady apromise Captain.  We're taking her to Carmargo.  Hyah.  ( cattle lowing )  ( lowing in distance )  Another hour  and it's sunup.  Any sign over there?  Dust Captain.  Dust and sky.  Night... whywouldn't they  raid at night?  She wasn't dead.  I don't follow.  The rest were dead.  And she wasn't.  So why'd they leave her 

In [None]:
! zip -r /content/subtitle_project_chroma_db_.zip /content/subtitle_project_chroma_db_

  adding: content/subtitle_project_chroma_db_/ (stored 0%)
  adding: content/subtitle_project_chroma_db_/chroma.sqlite3 (deflated 70%)
  adding: content/subtitle_project_chroma_db_/4b24e2bc-db56-4805-b610-59827ccc4411/ (stored 0%)
  adding: content/subtitle_project_chroma_db_/4b24e2bc-db56-4805-b610-59827ccc4411/link_lists.bin (deflated 64%)
  adding: content/subtitle_project_chroma_db_/4b24e2bc-db56-4805-b610-59827ccc4411/data_level0.bin (deflated 9%)
  adding: content/subtitle_project_chroma_db_/4b24e2bc-db56-4805-b610-59827ccc4411/length.bin (deflated 73%)
  adding: content/subtitle_project_chroma_db_/4b24e2bc-db56-4805-b610-59827ccc4411/index_metadata.pickle (deflated 50%)
  adding: content/subtitle_project_chroma_db_/4b24e2bc-db56-4805-b610-59827ccc4411/header.bin (deflated 52%)
