### Project Statement - Enhancing Search Engine Relevance for Video Subtitles
This project statement outlines an approach to improving search relevance for video subtitles. The key components and steps are:
#### Background and Objective:
##### Background:
Search engines are central to how users find digital content, including video subtitles. Google's emphasis on search experience illustrates how much relevance matters.
##### Objective: 
The project aims to develop a search engine algorithm over video subtitle content, using natural language processing and machine learning techniques to improve search relevance and accuracy.
#### Keyword-based vs. Semantic Search Engines:
##### Keyword-based Search Engines: 
Rely on exact keyword matches between user queries and indexed documents.
##### Semantic Search Engines: 
Go beyond keyword matching to understand the meaning and context of queries and documents, delivering more relevant results.
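
To make the limitation of exact matching concrete, here is a minimal keyword-matching sketch. The snippets and the helper names `tokenize` and `keyword_search` are invented for illustration and are not part of the project code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query, documents):
    """Return the documents that contain every query token verbatim."""
    query_tokens = tokenize(query)
    return [doc for doc in documents if query_tokens <= tokenize(doc)]

# Hypothetical subtitle snippets, invented for illustration
docs = [
    "He stole the car and drove off",
    "The automobile was found abandoned",
]

print(keyword_search("car", docs))      # matches only the first snippet
print(keyword_search("vehicle", docs))  # matches nothing, although both snippets are about vehicles
```

A semantic search engine would be expected to relate "car", "automobile", and "vehicle", which is the gap that embedding-based retrieval tries to close.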
#### Core Logic:
##### Preprocessing of Data: 
Cleaning and vectorization of subtitle documents and user queries.
##### Cosine Similarity Calculation: 
Comparing the vector representations of documents and user queries to determine relevance.
##### Return the Most Similar Documents: 
Rank the documents by their computed similarity scores and return the top matches.
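
The core logic can be sketched with a toy bag-of-words representation. This is only an illustration under simplifying assumptions: whitespace tokenization, raw term counts as the vector, and made-up example sentences. The helper names are hypothetical.

```python
import math
from collections import Counter

def bow_vector(text):
    """Represent text as a sparse bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = bow_vector("lost wedding ring")
doc_a = bow_vector("yes my mothers wedding ring")
doc_b = bow_vector("welcome to the repair shop")

print(cosine_similarity(query, doc_a))  # two shared terms, so the score is positive
print(cosine_similarity(query, doc_b))  # no shared terms, so the score is 0.0
```

The same formula applies unchanged to dense embeddings; only the way the vectors are produced differs.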
#### Step-by-Step Process:
#### Part 1: Ingesting Documents:
##### Read and Decode Data: 
Understand the database format and decode files, potentially working with a subset of data if resource limitations exist.
##### Cleaning and Vectorization: 
Apply appropriate cleaning steps and experiment with different methods for generating text vectors, including Bag of Words (BOW), TF-IDF, and BERT-based SentenceTransformers.
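
As a rough illustration of how TF-IDF differs from plain bag-of-words counts, the sketch below computes weights by hand. It assumes the unsmoothed definition tf × log(N / df); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function name and sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the ring was lost", "the ring was found", "the shop was closed"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in every document ("the", "was") receive weight zero, while a term unique to one document ("lost") outweighs one shared by two ("ring").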
##### Document Chunking: 
Divide large documents into smaller chunks to manage information loss and mitigate context splitting issues.
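
One common way to limit context splitting is to let consecutive chunks overlap. The sketch below assumes whitespace tokens as the unit; the `chunk_size` and `overlap` defaults are illustrative values, not values taken from the project.

```python
def chunk_tokens(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size tokens, sharing overlap tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the document
        if start + chunk_size >= len(tokens):
            break
    return chunks

sample = " ".join(str(i) for i in range(10))
print(chunk_tokens(sample, chunk_size=4, overlap=2))
```

With `chunk_size=4` and `overlap=2`, each chunk repeats the last two tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough.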
##### Store Embeddings: 
Save embeddings in a ChromaDB database for efficient retrieval.
#### Part 2: Retrieving Documents:
##### User Query Processing: 
Preprocess user search queries if necessary.
##### Query Embedding: 
Create an embedding for the user's search query.
##### Cosine Similarity Calculation: 
Compute cosine similarity between document embeddings and the query embedding.
##### Return Relevant Documents: 
Based on cosine similarity scores, return the most relevant candidate documents to the user.
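
The retrieval steps can be sketched in plain Python. The sketch assumes the embeddings already exist; the two-dimensional vectors and the chunk ids below are made up. Normalizing document vectors once at ingest time means cosine similarity reduces to a dot product at query time.

```python
import heapq
import math

def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def top_k(query_embedding, doc_embeddings, k=3):
    """Return the k (doc_id, score) pairs with the highest cosine similarity.

    doc_embeddings maps doc_id -> unit-length vector.
    """
    q = normalize(query_embedding)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in doc_embeddings.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

# Hypothetical 2-dimensional embeddings keyed by made-up chunk ids
index = {
    "chunk-a": normalize([1.0, 0.0]),
    "chunk-b": normalize([1.0, 1.0]),
    "chunk-c": normalize([0.0, 1.0]),
}

print(top_k([2.0, 0.1], index, k=2))
```

A vector database performs the same ranking with an approximate nearest-neighbour index instead of scanning every vector.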

### Part 1: Ingesting Documents
#### 1. Read the given data

In [77]:
# Importing the required libraries
import sqlite3
import pandas as pd
import zipfile
import io
import nltk
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')

In [78]:
# Step 1 - Reading the Tables from Database file
conn = sqlite3.connect('eng_subtitles_database.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('zipfiles',)]


In [79]:
# Step 2 - Reading the columns of Table
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

num
name
content


In [80]:
# Step 3 - Loading the Database Table inside a Pandas DataFrame
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

| | num | name | content |
|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...` |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...` |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...` |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...` |
| 4 | 9180600 | broker.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...` |


In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


In [82]:
# Step 4 - Printing the content of the first row
b_data = df.iloc[0, 2]   # row index 0 is the first row; column index 2 is the 'content' column
print(b_data)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x99V\x9fx\x96\xf0\x8c\x9e\x00\x00\x86\x9b\x01\x00;\x00\x00\x00The.Message.1976.REMASTERED.1080p.BluRay.x264-PiGNUS.EN.srt\xad\xbdm\x93\xdc\xc6\x91.\xfa\x9d\x11\xfc\x0f-}\xe1=\x11-\x9d\x06P\x85\x17\x9d\x8d\xd5%%[\xa4-Y>&u\x15>\xdf\xd0\xd3\x98\x19x\xfae\x0cts<\xfe\xf57\x9f\'\xb3\n\xd9\xa4\xbc\xbb\xf7\xc6Fl\xacELW\xa2\xaa\x90\x95\x95\xafO\x16/_l6\xdf\xe0\xff\xea\xf5f\xb3Y}\xf5\xd5\xbf\xaf\xf4AQ\xae7Mx\xf9\xe2\xd7\xfe|s\xbf\xea\x8f\xcf\xab\x8f\xe3n8\xadN\xc7\xfdx\x1cVO\xe3\xf9~\xf5\xf3\xe3p\xfc\xea\xfd/o>\xbc\xfb\xf0\xe3\xef\xde\xbf|\xf1\xfbi\x18Vo\xa6\xd3\xd3<L\xab\xe1\x1f\xe7\xe18\x8f\xa7\xe37\xab\xd3\xbc\xdb~-\xc3\x1e\xfe\xa7<|\xf9\xe2\xe5\x8bR_[~S\xd6\xeb\xa2k\xf3k\xe5A\xb7\xeeb\xf5\xf2\xc5\xbb\xe3\xea|?\xac\x8e\xfdaX\x9dnW?\x9cvk>8\x9c\xe6\xf3\xean\xeao\xc6\xd3ev\x8f~\x1a\xa6\x9b\xf1\xf6\xb2\xff\x1a\xe4\xabD\xbe*d\x11\xa5#_U\xeb\xaa\xd9`\xa6\xa7\xc3\xea\xa7\xcb}\x7f8\xf4F\xf9\xa7a\x9e\x87\xe3\x9d\xcc\\\xdf\x07B!\x13\xaa\xd61n<!\xd9\xaf\xd0\

In [83]:
# Step 5 - Unzipping the content of the row at index 385 and decoding it as latin-1
binary_data = df.iloc[385, 2]
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        subtitle_content = zip_file.read(zip_file.namelist()[0])
print(subtitle_content.decode('latin-1'))

1
00:00:06,000 --> 00:00:12,074
Watch any video online with Open-SUBTITLES
Free Browser extension: osdb.link/ext

2
00:00:15,370 --> 00:00:16,506
You lose everything, my girl.

3
00:00:16,530 --> 00:00:19,360
So you've said - four times.

4
00:00:20,330 --> 00:00:22,120
I definitely had
it on yesterday.

5
00:00:22,465 --> 00:00:25,785
Your gloves, your keys, that
handkerchief I embroidered for you

6
00:00:25,809 --> 00:00:26,168
Everything!

7
00:00:26,192 --> 00:00:27,280
Five times.

8
00:00:31,610 --> 00:00:32,920
Miss Scarlet?
- Yes.

9
00:00:36,390 --> 00:00:37,390
I'm Miss Scarlet.

10
00:00:37,872 --> 00:00:40,880
May I inquire if
you've lost something?

11
00:00:41,350 --> 00:00:42,530
Some jewellery perhaps?

12
00:00:42,870 --> 00:00:45,130
Yes, my mother's wedding ring.

13
00:00:45,220 --> 00:00:45,840
Have you found it?

14
00:00:45,950 --> 00:00:47,656
Does your ring have
an inscription?

15
00:00:48,650 --> 00:00:51,720
From my father to my mother 'For
my beloved, Livi

In [84]:
# Step 6 - Wrapping the steps above in a function to apply to the entire dataset
def decode_method(binary_data):
    # Open the ZIP archive held in memory and read its first member
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            subtitle_content = zip_file.read(zip_file.namelist()[0])
    # latin-1 maps every byte to a character, so this decode never raises
    return subtitle_content.decode('latin-1')

In [85]:
df['file_content'] = df['content'].apply(decode_method)
df.head()

| | num | name | content | file_content |
|---|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...` | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...` | `1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...` |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...` | `1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...` |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...` | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 4 | 9180600 | broker.(2022).eng.1cd | `b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...` | `ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...` |
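
The `ï»¿` prefix in the last row above is what a UTF-8 byte-order mark (bytes `EF BB BF`) looks like when decoded as latin-1, which suggests that at least some files are UTF-8. A possible alternative decoder, sketched here under the unverified assumption that files are either UTF-8 or a single-byte encoding, tries `utf-8-sig` first and falls back to latin-1. The name `decode_subtitle` is hypothetical.

```python
import io
import zipfile

def decode_subtitle(binary_data):
    """Unzip the first archive member and decode it, dropping a UTF-8 BOM if present."""
    with zipfile.ZipFile(io.BytesIO(binary_data)) as zip_file:
        raw = zip_file.read(zip_file.namelist()[0])
    try:
        # 'utf-8-sig' decodes UTF-8 and strips a leading byte-order mark
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Fallback assumption: a single-byte encoding; latin-1 never fails
        return raw.decode("latin-1")

def make_zip(name, payload):
    """Build a small in-memory ZIP archive for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zip_file:
        zip_file.writestr(name, payload)
    return buffer.getvalue()

with_bom = make_zip("a.srt", b"\xef\xbb\xbf1\r\nHello")
print(repr(decode_subtitle(with_bom)))
```

It could be applied the same way as `decode_method`, for example `df['content'].apply(decode_subtitle)`. The later cleaning step removes non-ASCII characters anyway, so the main benefit would be correct accented characters if that step is ever relaxed.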


In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   num           82498 non-null  int64 
 1   name          82498 non-null  object
 2   content       82498 non-null  object
 3   file_content  82498 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.5+ MB


#### 2. Preprocessing and Cleaning of Data

In [87]:
# Drop the raw binary 'content' column, which is no longer needed after decoding
df = df.drop(columns=["content"])
df.head()

| | num | name | file_content |
|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | `1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...` |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | `1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...` |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 4 | 9180600 | broker.(2022).eng.1cd | `ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...` |


In [88]:
# Rename the columns to descriptive names
df = df.rename(columns={"num": "Subtitle_Id", "name": "Subtitle_Name", "file_content": "Subtitle_Content"})
df.head()

| | Subtitle_Id | Subtitle_Name | Subtitle_Content |
|---|---|---|---|
| 0 | 9180533 | the.message.(1976).eng.1cd | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 1 | 9180583 | here.comes.the.grump.s01.e09.joltin.jack.in.bo... | `1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...` |
| 2 | 9180592 | yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd | `1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...` |
| 3 | 9180594 | yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd | `1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...` |
| 4 | 9180600 | broker.(2022).eng.1cd | `ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...` |


In [89]:
# Take a random 30% sample of the rows to stay within limited compute resources
sampled_df = df.sample(n=int(0.3 * len(df))).reset_index(drop=True)

# Display the first few rows of the sampled DataFrame
sampled_df.head()

| | Subtitle_Id | Subtitle_Name | Subtitle_Content |
|---|---|---|---|
| 0 | 9450386 | person.of.interest.s03.e19.most.likely.to.(201... | `ï»¿[Script Info]\r\nTitle: Default file\r\nScr...` |
| 1 | 9376836 | bang.s02.e05.episode.2.5.(2020).eng.1cd | `ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...` |
| 2 | 9511930 | criminal.minds.s03.e06.about.face.(2007).eng.1cd | `ï»¿1\r\n00:00:03,537 --> 00:00:06,138\r\n[DUCK...` |
| 3 | 9381135 | the.repair.shop.s07.e18.silver.salt.and.pepper... | `1\r\n00:00:02,240 --> 00:00:05,719\r\nWelcome ...` |
| 4 | 9450349 | person.of.interest.s03.e11.lethe.(2013).eng.1cd | `ï»¿[Script Info]\r\nTitle: Default file\r\nScr...` |


In [90]:
# Shape of the DataFrame
sampled_df.shape

(24749, 3)

#### a. Apply appropriate cleaning steps on subtitle Name

In [91]:
sampled_df["Subtitle_Name"][:10]

0    person.of.interest.s03.e19.most.likely.to.(201...
1              bang.s02.e05.episode.2.5.(2020).eng.1cd
2     criminal.minds.s03.e06.about.face.(2007).eng.1cd
3    the.repair.shop.s07.e18.silver.salt.and.pepper...
4      person.of.interest.s03.e11.lethe.(2013).eng.1cd
5     mapplethorpe.look.at.the.pictures.(2016).eng.1cd
6    earths.tropical.islands.s01.e03.hawaii.(2020)....
7    nashville.s02.e22.on.the.other.hand.(2014).eng...
8    the.mandalorian.s03.e01.chapter.17.the.apostat...
9    the.real.housewives.of.salt.lake.city.s03.e01....
Name: Subtitle_Name, dtype: object

In [92]:
import re

def clean_subtitle_name(text):
    # Remove the ".eng.1cd" suffix (dots escaped so they match literally)
    cleaned_text = re.sub(r'\.eng\.1cd', '', text)
    # Replace dots with spaces
    cleaned_text = cleaned_text.replace('.', ' ')
    # Convert text to lowercase
    cleaned_text = cleaned_text.lower()
    # Remove redundant spaces
    cleaned_text = ' '.join(cleaned_text.split())
    return cleaned_text

# Apply the function to the Subtitle_Name column
sampled_df["Subtitle_Name"] = sampled_df["Subtitle_Name"].apply(clean_subtitle_name)
sampled_df.head()

| | Subtitle_Id | Subtitle_Name | Subtitle_Content |
|---|---|---|---|
| 0 | 9450386 | person of interest s03 e19 most likely to (2014) | `ï»¿[Script Info]\r\nTitle: Default file\r\nScr...` |
| 1 | 9376836 | bang s02 e05 episode 2 5 (2020) | `ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...` |
| 2 | 9511930 | criminal minds s03 e06 about face (2007) | `ï»¿1\r\n00:00:03,537 --> 00:00:06,138\r\n[DUCK...` |
| 3 | 9381135 | the repair shop s07 e18 silver salt and pepper... | `1\r\n00:00:02,240 --> 00:00:05,719\r\nWelcome ...` |
| 4 | 9450349 | person of interest s03 e11 lethe (2013) | `ï»¿[Script Info]\r\nTitle: Default file\r\nScr...` |


#### b. Apply appropriate cleaning steps on subtitle Content

In [93]:
sampled_df["Subtitle_Content"][:10]

0    ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
1    ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...
2    ï»¿1\r\n00:00:03,537 --> 00:00:06,138\r\n[DUCK...
3    1\r\n00:00:02,240 --> 00:00:05,719\r\nWelcome ...
4    ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
5    ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nSuppo...
6    ï»¿1\r\n00:00:09,000 --> 00:00:13,080\r\nFar a...
7    ï»¿1\r\n00:00:02,203 --> 00:00:02,935\r\n<i>Pr...
8    ï»¿1\r\n00:00:02,041 --> 00:00:04,958\r\nIG-11...
9    1\r\n00:00:06,263 --> 00:00:07,572\r\nYou guys...
Name: Subtitle_Content, dtype: object

In [94]:
def clean_subtitle_content(text):
    # Remove SRT timestamps such as "00:00:06,000 --> 00:00:12,074"
    clean_text = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', text)
    # Remove HTML-style tags such as <i>...</i>
    clean_text = re.sub(r'<[^>]*>', '', clean_text)
    # Keep only ASCII letters, digits, and whitespace (this also strips punctuation and apostrophes)
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', clean_text)
    # Remove all remaining digits and any whitespace after them;
    # this drops cue numbers but also any numbers spoken in the dialogue
    clean_text = re.sub(r'\d+\s*', '', clean_text)
    # Collapse newlines and repeated whitespace into single spaces
    clean_text = re.sub(r'\s+', ' ', clean_text)
    # Convert text to lowercase
    clean_text = clean_text.lower()

    return clean_text.strip()

sampled_df["Subtitle_Content"] = sampled_df["Subtitle_Content"].apply(clean_subtitle_content)
sampled_df["Subtitle_Content"].iloc[0]

'script info title default file scripttype vwrapstyle playresx playresy scaledborderandshadow yes audio file video file video aspect ratio video zoom video position vstyles format name fontname fontsize primarycolour secondarycolour outlinecolour backcolour bold italic underline strikeout scalex scaley spacing angle borderstyle outline shadow alignment marginl marginr marginv encoding style dialoguetahomahbababahffhfhcstyle dialoguetahomahbababahffhfhcstyle dialoguetahomahbababahffhfhcevents format layer start end style name marginl marginr marginv effect text dialogue dialogueunknownanposiyou are being watchedidialogue dialogueunknownanposithe government has a secret systemidialogue dialogueunknownanposia machine that spies on youidialogue dialogueunknownanposievery hour of every dayidialogue dialogueunknownanposii designed the machineidialogue dialogueunknownanposito detect acts of terroridialogue dialogueunknownanposibut it sees everythingidialogue dialogueunknownanposiviolent crime

In [95]:
sampled_df.head()

| | Subtitle_Id | Subtitle_Name | Subtitle_Content |
|---|---|---|---|
| 0 | 9450386 | person of interest s03 e19 most likely to (2014) | script info title default file scripttype vwra... |
| 1 | 9376836 | bang s02 e05 episode 2 5 (2020) | watch any video online with opensubtitles free... |
| 2 | 9511930 | criminal minds s03 e06 about face (2007) | ducks quacking advertise your product or brand... |
| 3 | 9381135 | the repair shop s07 e18 silver salt and pepper... | welcome to the repair shop where precious but ... |
| 4 | 9450349 | person of interest s03 e11 lethe (2013) | script info title default file scripttype vwra... |
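
Rows 0 and 4 show that some files are not SRT but script-style subtitles beginning with `[Script Info]` (the Advanced SubStation Alpha layout), so the SRT-oriented cleaning leaves header and style metadata in the text. One possible remedy, sketched below, is to keep only the text field of each `Dialogue:` line before cleaning. It assumes the ten-field layout listed in the output above (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text); the function name and sample lines are hypothetical.

```python
import re

def extract_ass_dialogue(ass_text):
    """Return the spoken text from the Dialogue lines of an ASS/SSA subtitle file."""
    spoken = []
    for line in ass_text.splitlines():
        if not line.startswith("Dialogue:"):
            continue  # skip [Script Info], style definitions, and Format lines
        # The text is the 10th field; it may itself contain commas
        fields = line.split(",", 9)
        if len(fields) < 10:
            continue
        text = re.sub(r"\{[^}]*\}", "", fields[9])           # drop override tags like {\an8}
        text = text.replace(r"\N", " ").replace(r"\n", " ")  # ASS line-break markers
        spoken.append(text.strip())
    return " ".join(spoken)

# Hypothetical sample modelled on the field layout seen in the output
sample = (
    "[Events]\r\n"
    "Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text\r\n"
    "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,{\\an8}You are being watched.\r\n"
    "Dialogue: 0,0:00:03.50,0:00:06.00,Default,,0,0,0,,The government has\\Na secret system.\r\n"
)

print(extract_ass_dialogue(sample))
```

A dispatcher could check whether the decoded content contains `[Script Info]` and route such files through this function before `clean_subtitle_content`.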


In [96]:
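# Save the cleaned DataFrame as a CSV file without the index column.
# to_csv does not create missing directories, so make sure "Dataset" exists first.
import os
os.makedirs("Dataset", exist_ok=True)
sampled_df.to_csv("Dataset/cleaned_eng_subtitle.csv", index=False)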