                                           **`SEMANTIC SEARCH ENGINE PROJECT`**


In [1]:
import sqlite3
import pandas as pd

## **Step 1 - Reading the Tables from Database file**

In [2]:
# Read the code below and write your observation in the next cell

conn = sqlite3.connect('C:/Users/rjsek/OneDrive/Documents/Work and Professional documents/Innomatics Research Labs/Data Science Internship Jan 2024/Task - 9 Subtitle Search Engine/data/eng_subtitles_database.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('zipfiles',)]


**The cell above lists the tables inside the database. There is a single table, named `zipfiles`. We also know from README.txt that this table contains three columns: `num`, `name` and `content`.**
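The table lookup above can be wrapped in a small helper. This is a minimal sketch, not the notebook's own code: `list_tables` is a hypothetical name, and the in-memory database with its column types is an assumption standing in for `eng_subtitles_database.db`.

```python
import sqlite3
from contextlib import closing

def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [row[0] for row in cur.fetchall()]

# Hypothetical stand-in for the real database file (column types are assumed).
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")
print(list_tables(demo))
```

Closing the cursor with `contextlib.closing` keeps the helper safe to call repeatedly on the same connection.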

## **Step 2 - Reading the columns of Table**

In [3]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

num
name
content


**The above code helps in checking the column names in the database table.**
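For reference, each row returned by `PRAGMA table_info` has the layout `(cid, name, type, notnull, dflt_value, pk)`, which is why `col[1]` is the column name. A minimal sketch, where `column_names` is a hypothetical helper and the in-memory table (with assumed column types) stands in for the real database:

```python
import sqlite3

def column_names(conn, table):
    """Return the column names of `table`, in declaration order."""
    # Each row is (cid, name, type, notnull, dflt_value, pk); index 1 is the name.
    rows = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    return [row[1] for row in rows]

# Hypothetical stand-in for the real database file (column types are assumed).
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")
print(column_names(demo, 'zipfiles'))
```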

**Let's now use `SELECT * FROM zipfiles` to read all the data into a `df_30_percent_data` variable.**

## **Step 3 - Loading the Database Table inside a Pandas DataFrame**

In [4]:
df_30_percent_data = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df_30_percent_data = df_30_percent_data.sample(frac=0.3)
df_30_percent_data.head()

Unnamed: 0,num,name,content
68884,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xf5d\x9aV...
72341,9478254,red.eye.(2005).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00*q\x9aV\xb...
44918,9367553,10.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00|\x0e\x9aV...
5016,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00d\x8d\x99V...
21135,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,b'PK\x03\x04\x14\x00\x00\x00\x08\x00*\x9d\x99V...


In [5]:
df_30_percent_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24749 entries, 68884 to 23396
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      24749 non-null  int64 
 1   name     24749 non-null  object
 2   content  24749 non-null  object
dtypes: int64(1), object(2)
memory usage: 773.4+ KB


**It looks like the `content` column does not contain the subtitle text directly; it holds raw bytes. As mentioned in README.txt, the text inside might be latin-1 encoded.**

## **Step 4 - Printing `content` of 0th Row**

In [6]:
b_data = df_30_percent_data.iloc[0, 2]

# here 2 represent the index of content column
# 0 represents the row number

In [7]:
print(b_data)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xf5d\x9aV5\xda\xb9\xd5RL\x00\x009\x83\x01\x00,\x00\x00\x00CSI Miami S09E01 Fallen.DVD.NonHI.en.CBS.ass\xbd}[\x92\xdb\xd6\x96\xe5\xbf#<\x07\xd8\x1f\xed\xbeQ)\x06\xf1\x06\xfcS!Y~d\xb5_a\xc9W\xe5\xaa\xdb\x1f \t&\xe1$\x01\x16\x00*M\xdf\xf0@j\n5\x85\xfe\xe8\xb9\xd4\x04z\n\xbd\x9f S<Ls\x0b\xd4\xad\xba\xb6\xb6e\x19+\x0fp\xce>\xfb\xb9\xf6\xff\xfb?\xff\xf7\xdf_\xcd\xdbj\xdb{\xb7\xf5\xb2\xf9\xdf\x1f\x7f\xf4\xba\xea\xd7\xe5\xe7\xde\xcbrY\xec\xd6\xbd\xb7\xac\xd6\xe5\xc7\x1f\xf1\x1fy\xbd\xdf\xc2\xbfy\x1bM\xa6\xd3\x7f\xfa\xf8\xa37m\xb1}\xd5\xef\xf1\x0fO?\xfe\xe8\xc7u\xb1\xff\xa9\xec\xfe\xf5s/\r\x0e\xff\xf8\xcb\xe7^\x94M\xf1\xbf/\xd6\xe5\xe2E\xd3.\xca\xf6y\xbdx\xb5*\x16\xcd\xc3\xe7\xde\xbe\xec>\xfe\xe8\xf9nQ5\xdeW\x15>\xe8\xe3\x8f\xfeZ-\xcaw\xff\xe9y\xb7-\xe7\xbd\xf7S\xd1W\r\x81\xf1o\xff[\xd3l>\xf7\x12\xfd\xc7\x1f\x9b\xae\x82?P\xd3\x9f\xf8\xf8\xa3\x7f\xffk\xf4O\x1e\xfd|\x1d\xac\xea\xab\xa6\xdd\x14\xfd\xe7\xde\xf7\xc5\xa6\xbc\xf1\xbej\xea\xbe\x1e\xa4\xae\xfa\x1d\xa

**The content starts with the bytes `PK\x03\x04`, the signature that opens a ZIP local file header, which suggests that it is a ZIP archive. How do I know? Experience! I have worked with something similar earlier.**
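Rather than relying on experience alone, the guess can be checked in code. This is a minimal sketch using only the standard library; `looks_like_zip` is a hypothetical helper name, and the demo archive is built in memory for illustration.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b'PK\x03\x04'  # signature at the start of a ZIP local file header

def looks_like_zip(data: bytes) -> bool:
    """Check the leading signature, then let zipfile validate the archive structure."""
    return data[:4] == ZIP_LOCAL_HEADER and zipfile.is_zipfile(io.BytesIO(data))

# Build a tiny archive in memory to try the check on.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo.srt', 'hello')

print(looks_like_zip(buffer.getvalue()))
print(looks_like_zip(b'not a zip'))
```

In the notebook, the same check could be applied to `b_data` before trying to unzip it.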

## **Step 5 - Unzipping the content of 380th row and decoding using `latin-1`**

In [8]:
import zipfile
import io

# Assuming 'content' is the binary data from your database
binary_data = df_30_percent_data.iloc[380, 2]

# Decompress the binary data using the zipfile module
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        # Reading only one file in the ZIP archive
        subtitle_content = zip_file.read(zip_file.namelist()[0])

# Now 'subtitle_content' should contain the extracted subtitle content
print(subtitle_content.decode('latin-1'))  # Assuming the content is latin-1 encoded text

ï»¿[Script Info]
Title: Default file
ScriptType: v4.00+
WrapStyle: 0
PlayResX: 720
PlayResY: 480
ScaledBorderAndShadow: yes
Audio File: 
Video File: 
Video Aspect Ratio: 0
Video Zoom: 6
Video Position: 0

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Dialogue1,Tahoma,30,&H00FDFDFD,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,60,60,15,1
Style: Dialogue2,Tahoma,30,&H00FDFDFD,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,60,60,15,1
Style: Dialogue4,Tahoma,30,&H00FDFDFD,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,60,60,15,1
Style: Dialogue3,Tahoma,28,&H00FDFDFD,&H000000FF,&H1F000000,&HC7000000,0,0,0,0,100,100,0,0,1,1.4,1.7,2,60,60,15,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR,
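
The `ï»¿` at the start of the output above is what a UTF-8 byte-order mark (`EF BB BF`) looks like when it is decoded as latin-1. One possible alternative, sketched below under the assumption that some files are really UTF-8, is to try `utf-8-sig` first (which strips the mark) and fall back to latin-1. `decode_subtitle` is a hypothetical helper, not part of the original notebook.

```python
def decode_subtitle(raw: bytes) -> str:
    """Decode subtitle bytes, preferring UTF-8 (with BOM removal) over latin-1."""
    try:
        # 'utf-8-sig' drops a leading byte-order mark if one is present.
        return raw.decode('utf-8-sig')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this fallback never fails.
        return raw.decode('latin-1')

print(decode_subtitle(b'\xef\xbb\xbf[Script Info]'))
print(decode_subtitle('caf\u00e9'.encode('latin-1')))
```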

## **Step 6 - Applying the above Function on the Entire Data**

In [9]:
count = 0

def decode_method(binary_data):
    global count
    # Decompress the binary data using the zipfile module
    # print(count, end=" ")
    count += 1
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])
    
    # Now 'subtitle_content' should contain the extracted subtitle content
    return subtitle_content.decode('latin-1')  # Assuming the content is latin-1 encoded text
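
A single corrupt or empty archive would stop an `apply` over the whole column. The sketch below is a more defensive variant that returns `None` in those cases instead of raising. `extract_first_member` is a hypothetical name, and treating failures as `None` is an assumption about how bad rows should be handled.

```python
import io
import zipfile
from typing import Optional

def extract_first_member(binary_data: bytes, encoding: str = 'latin-1') -> Optional[str]:
    """Return the decoded text of the first file in a ZIP blob, or None on failure."""
    try:
        with zipfile.ZipFile(io.BytesIO(binary_data)) as archive:
            names = archive.namelist()
            if not names:
                return None  # archive is valid but contains no files
            return archive.read(names[0]).decode(encoding)
    except zipfile.BadZipFile:
        return None  # bytes are not a readable ZIP archive

# Build a tiny archive in memory to try the helper on.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo.srt', 'hello')

print(extract_first_member(buffer.getvalue()))
print(extract_first_member(b'garbage'))
```

Rows that come back as `None` can then be dropped or inspected separately.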

In [10]:
df_30_percent_data['file_content'] = df_30_percent_data['content'].apply(decode_method)

df_30_percent_data.head()

Unnamed: 0,num,name,content,file_content
68884,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xf5d\x9aV...,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
72341,9478254,red.eye.(2005).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00*q\x9aV\xb...,"ï»¿1\r\n00:00:33,909 --> 00:00:37,204\r\n[jet ..."
44918,9367553,10.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00|\x0e\x9aV...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
5016,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00d\x8d\x99V...,"ï»¿1\r\n00:00:02,102 --> 00:00:03,468\r\nFRANC..."
21135,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,b'PK\x03\x04\x14\x00\x00\x00\x08\x00*\x9d\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver..."


In [11]:
df_30_percent_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24749 entries, 68884 to 23396
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   num           24749 non-null  int64 
 1   name          24749 non-null  object
 2   content       24749 non-null  object
 3   file_content  24749 non-null  object
dtypes: int64(1), object(3)
memory usage: 966.8+ KB


In [12]:
df_30_percent_data.tail()

Unnamed: 0,num,name,content,file_content
20877,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xcd\x9c\x...,"ï»¿1\r\n00:00:02,194 --> 00:00:05,218\r\n<i>Pr..."
75381,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xf9\x82\x...,"1\r\n00:01:01,006 --> 00:01:02,423\r\nI think ..."
30008,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"b'PK\x03\x04\x14\x00\x00\x00\x08\x00,\xab\x99V...","ï»¿1\r\n00:00:00,801 --> 00:00:03,673\r\n[dram..."
8170,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00H\x90\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nUse t..."
23396,9274955,bad.sisters.s01.e10.saving.grace.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00G\x9f\x99V...,"1\r\n00:00:08,132 --> 00:00:11,136\r\n<i>""Any ..."


### DROPPING THE "CONTENT" COLUMN

In [13]:
df_30_percent_data =  df_30_percent_data.drop(columns='content')

In [14]:
df_30_percent_data.head()

Unnamed: 0,num,name,file_content
68884,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
72341,9478254,red.eye.(2005).eng.1cd,"ï»¿1\r\n00:00:33,909 --> 00:00:37,204\r\n[jet ..."
44918,9367553,10.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
5016,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"ï»¿1\r\n00:00:02,102 --> 00:00:03,468\r\nFRANC..."
21135,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver..."


### RENAMING THE COLUMN NAMES

In [15]:
df_30_percent_data.rename(columns={"num":"id",
             "name":"Movies/Series",
             "file_content":"subtitles"},inplace=True)

In [16]:
df_30_percent_data.head()

Unnamed: 0,id,Movies/Series,subtitles
68884,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
72341,9478254,red.eye.(2005).eng.1cd,"ï»¿1\r\n00:00:33,909 --> 00:00:37,204\r\n[jet ..."
44918,9367553,10.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
5016,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"ï»¿1\r\n00:00:02,102 --> 00:00:03,468\r\nFRANC..."
21135,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver..."


In [17]:
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles
68884,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
72341,9478254,red.eye.(2005).eng.1cd,"ï»¿1\r\n00:00:33,909 --> 00:00:37,204\r\n[jet ..."
44918,9367553,10.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
5016,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"ï»¿1\r\n00:00:02,102 --> 00:00:03,468\r\nFRANC..."
21135,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver..."
...,...,...,...
20877,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,"ï»¿1\r\n00:00:02,194 --> 00:00:05,218\r\n<i>Pr..."
75381,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,"1\r\n00:01:01,006 --> 00:01:02,423\r\nI think ..."
30008,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"ï»¿1\r\n00:00:00,801 --> 00:00:03,673\r\n[dram..."
8170,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nUse t..."


In [20]:
# Reset Index

In [21]:
# Reset index for the DataFrame
df_30_percent_data = df_30_percent_data.reset_index(drop=True)
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles
0,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
1,9478254,red.eye.(2005).eng.1cd,"ï»¿1\r\n00:00:33,909 --> 00:00:37,204\r\n[jet ..."
2,9367553,10.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
3,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"ï»¿1\r\n00:00:02,102 --> 00:00:03,468\r\nFRANC..."
4,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver..."
...,...,...,...
24744,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,"ï»¿1\r\n00:00:02,194 --> 00:00:05,218\r\n<i>Pr..."
24745,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,"1\r\n00:01:01,006 --> 00:01:02,423\r\nI think ..."
24746,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"ï»¿1\r\n00:00:00,801 --> 00:00:03,673\r\n[dram..."
24747,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nUse t..."


#  Preprocessing of data

In [22]:
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

**`removing special characters`**

In [23]:
df_30_percent_data['subtitles'] = df_30_percent_data['subtitles'].str.replace(r'[^a-zA-Z\s]', '', regex=True)


**`removing non ASCII characters`**

In [24]:
df_30_percent_data['subtitles'] = df_30_percent_data['subtitles'].str.replace(r'[^\x00-\x7F]+', '', regex=True)


**`removing time-stamp`**

In [25]:
# The start timestamp needs its comma (HH:MM:SS,mmm); this only matches if run before the special-character removal
df_30_percent_data['subtitles'] = df_30_percent_data['subtitles'].str.replace(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', regex=True)


**`converting to lowercase`**

In [26]:
df_30_percent_data['subtitles'] = df_30_percent_data['subtitles'].str.lower()

**`removing extra spaces`**

In [27]:
df_30_percent_data['subtitles'] = df_30_percent_data['subtitles'].str.replace(r'\s+', ' ', regex=True)


In [28]:
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles
0,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,script info title default file scripttype v wr...
1,9478254,red.eye.(2005).eng.1cd,jet engine revving anman lets go anwoman tayl...
2,9367553,10.(2022).eng.1cd,advertise your product or brand here contact ...
3,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,francis ipreviously oni reign her majesty mar...
4,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,advertise your product or brand here contact ...
...,...,...,...
24744,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,ipreviously oni font colorfffthe real housewi...
24745,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,i think thats enough for today good now pleas...
24746,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,dramatic orchestral music plays indistinct ch...
24747,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,use the free code joinnow at wwwplayshipseu s...
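
Tokens such as `ipreviously` and `oni` in the output above are leftovers of markup like `<i>Previously on</i>`: the special-character step removes the angle brackets but keeps the tag letters. It also runs before the timestamp step, so the timestamp pattern no longer has digits, colons or commas to match. The sketch below shows one possible ordering that removes timestamps and tags first. `clean_subtitle` and the tag pattern are assumptions, not part of the original notebook.

```python
import re

TIMESTAMP = re.compile(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}')
TAG = re.compile(r'<[^>]+>')            # simple HTML-style tags such as <i> or <font ...>
NON_LETTER = re.compile(r'[^a-zA-Z\s]')
WHITESPACE = re.compile(r'\s+')

def clean_subtitle(text: str) -> str:
    """Strip timestamps and tags first, then non-letters, then normalise spacing and case."""
    text = TIMESTAMP.sub(' ', text)
    text = TAG.sub(' ', text)
    text = NON_LETTER.sub('', text)     # also drops the numeric cue counters
    text = WHITESPACE.sub(' ', text)
    return text.lower().strip()

sample = "1\r\n00:00:02,194 --> 00:00:05,218\r\n<i>Previously on</i> Reign\r\n"
print(clean_subtitle(sample))
```

The function can be applied to the decoded text with `.apply(clean_subtitle)` in place of the separate `str.replace` calls.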


In [29]:
!pip install swifter



In [30]:
import swifter

**`Tokenization step`**

In [31]:
df_30_percent_data['subtitles'] = df_30_percent_data['subtitles'].swifter.apply(lambda x: word_tokenize(x))


In [32]:
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles
0,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,"[script, info, title, default, file, scripttyp..."
1,9478254,red.eye.(2005).eng.1cd,"[jet, engine, revving, anman, lets, go, anwoma..."
2,9367553,10.(2022).eng.1cd,"[advertise, your, product, or, brand, here, co..."
3,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"[francis, ipreviously, oni, reign, her, majest..."
4,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"[advertise, your, product, or, brand, here, co..."
...,...,...,...
24744,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,"[ipreviously, oni, font, colorfffthe, real, ho..."
24745,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,"[i, think, thats, enough, for, today, good, no..."
24746,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"[dramatic, orchestral, music, plays, indistinc..."
24747,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,"[use, the, free, code, joinnow, at, wwwplayshi..."



### Using BERT-based “SentenceTransformers” to generate embeddings which encode semantic information.
### This can help us build a Semantic Search Engine.

In [34]:
!pip install sentence-transformers



In [35]:
from sentence_transformers import SentenceTransformer
import pandas as pd

# Load a pre-trained BERT model (using CPU)
model = SentenceTransformer('paraphrase-MiniLM-L3-v2', device='cpu')

# Function to encode text using BERT
def encode_text(text):
    # Encode the text into an embedding
    try:
        embedding = model.encode(text)
        return embedding
    except Exception as e:
        print(f"Error encoding text: {e}")
        return None

# Convert list of words into a single string sentence
df_30_percent_data['subtitles_sentence'] = df_30_percent_data['subtitles'].apply(lambda x: ' '.join(x))

# Apply BERT vectorization to the 'subtitles_sentence' column
df_30_percent_data['bert_embedding'] = df_30_percent_data['subtitles_sentence'].apply(encode_text)

# Display the updated DataFrame
print(df_30_percent_data.head())

        id                                      Movies/Series  \
0  9464726            csi.miami.s09.e01.fallen.(2010).eng.1cd   
1  9478254                             red.eye.(2005).eng.1cd   
2  9367553                                  10.(2022).eng.1cd   
3  9203091             reign.s02.e15.forbidden.(2015).eng.1cd   
4  9265916  barbie.it.takes.two.s02.e01.weve.got.magic.to....   

                                           subtitles  \
0  [script, info, title, default, file, scripttyp...   
1  [jet, engine, revving, anman, lets, go, anwoma...   
2  [advertise, your, product, or, brand, here, co...   
3  [francis, ipreviously, oni, reign, her, majest...   
4  [advertise, your, product, or, brand, here, co...   

                                  subtitles_sentence  \
0  script info title default file scripttype v wr...   
1  jet engine revving anman lets go anwoman taylo...   
2  advertise your product or brand here contact w...   
3  francis ipreviously oni reign her majesty mar

In [36]:
df_30_percent_data['bert_embedding']

0        [0.019202188, -0.10135785, 0.05124608, 0.07146...
1        [0.06227134, -0.15009682, 0.23519003, -0.04188...
2        [-0.08674303, 0.113942094, 0.07312267, -0.0547...
3        [0.057274677, 0.18626337, -0.005120895, -0.154...
4        [-0.04211163, -0.040654175, -0.010191287, -0.1...
                               ...                        
24744    [0.14130643, -0.16698056, 0.33102593, 0.074704...
24745    [0.111511536, -0.026505634, 0.20732762, -0.159...
24746    [-0.11328379, -0.06661039, 0.11664734, 0.11008...
24747    [0.033705004, 0.0017321683, 0.049905933, -0.02...
24748    [-0.024892502, -0.14684628, 0.045202546, 0.053...
Name: bert_embedding, Length: 24749, dtype: object
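
Embeddings like these are typically compared with cosine similarity, so that a query vector can be ranked against every document vector. The notebook does not show that step here, so the following is only a minimal sketch in plain Python. `cosine_similarity`, `rank_by_similarity` and the toy two-dimensional vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (document index, score) pairs for the top_k most similar documents."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy vectors standing in for real embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rank_by_similarity([1.0, 0.0], docs, top_k=2))
```

With real data, the query would be encoded by the same model as the documents so that both live in the same vector space.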

In [37]:
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles,subtitles_sentence,bert_embedding
0,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,"[script, info, title, default, file, scripttyp...",script info title default file scripttype v wr...,"[0.019202188, -0.10135785, 0.05124608, 0.07146..."
1,9478254,red.eye.(2005).eng.1cd,"[jet, engine, revving, anman, lets, go, anwoma...",jet engine revving anman lets go anwoman taylo...,"[0.06227134, -0.15009682, 0.23519003, -0.04188..."
2,9367553,10.(2022).eng.1cd,"[advertise, your, product, or, brand, here, co...",advertise your product or brand here contact w...,"[-0.08674303, 0.113942094, 0.07312267, -0.0547..."
3,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"[francis, ipreviously, oni, reign, her, majest...",francis ipreviously oni reign her majesty mari...,"[0.057274677, 0.18626337, -0.005120895, -0.154..."
4,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"[advertise, your, product, or, brand, here, co...",advertise your product or brand here contact w...,"[-0.04211163, -0.040654175, -0.010191287, -0.1..."
...,...,...,...,...,...
24744,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,"[ipreviously, oni, font, colorfffthe, real, ho...",ipreviously oni font colorfffthe real housewiv...,"[0.14130643, -0.16698056, 0.33102593, 0.074704..."
24745,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,"[i, think, thats, enough, for, today, good, no...",i think thats enough for today good now please...,"[0.111511536, -0.026505634, 0.20732762, -0.159..."
24746,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"[dramatic, orchestral, music, plays, indistinc...",dramatic orchestral music plays indistinct cha...,"[-0.11328379, -0.06661039, 0.11664734, 0.11008..."
24747,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,"[use, the, free, code, joinnow, at, wwwplayshi...",use the free code joinnow at wwwplayshipseu si...,"[0.033705004, 0.0017321683, 0.049905933, -0.02..."


### DOCUMENT CHUNKER

In [None]:
# Creating a function that takes a large document as input and divides it into smaller chunks 
# with an overlap to maintain context continuity.

In [40]:
# Define semantic chunking function
# Note: as written, this slices the string form of the embedding vector into
# character windows (chunk_size long, sharing `overlap` characters), not the subtitle text.
def semantic_chunking(embeddings, chunk_size=1000, overlap=50):
    chunks = []
    if isinstance(embeddings, np.ndarray) and len(embeddings.shape) == 1:
        # Single 1-D embedding vector
        embeddings = [embeddings]  # Wrap in a list for consistency
    for embedding in embeddings:
        if isinstance(embedding, np.ndarray):
            embedding = ' '.join(map(str, embedding))  # Convert array to a space-separated string
        else:
            embedding = str(embedding)  # Convert any other value to a string
        # Step forward by chunk_size - overlap so consecutive windows share `overlap` characters
        for start in range(0, len(embedding), chunk_size - overlap):
            chunk = embedding[start:start + chunk_size]
            chunks.append(chunk)
    return chunks

# Apply semantic chunking to the 'bert_embedding' column
df_30_percent_data['bert_embedding_chunks'] = df_30_percent_data['bert_embedding'].apply(semantic_chunking)
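
Sentence-transformer models truncate input beyond a maximum sequence length, so overlapping chunks are usually cut from the document text itself and each chunk is embedded afterwards. The sketch below shows that variant on words rather than characters. `chunk_words` and its default sizes are assumptions, not the notebook's method.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the document
    return chunks

print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
```

Each returned chunk could then be passed to `encode_text`, giving one embedding per chunk instead of one per file.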

In [41]:
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles,subtitles_sentence,bert_embedding,bert_embedding_chunks
0,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,"[script, info, title, default, file, scripttyp...",script info title default file scripttype v wr...,"[0.019202188, -0.10135785, 0.05124608, 0.07146...",[0.019202188 -0.10135785 0.05124608 0.07146899...
1,9478254,red.eye.(2005).eng.1cd,"[jet, engine, revving, anman, lets, go, anwoma...",jet engine revving anman lets go anwoman taylo...,"[0.06227134, -0.15009682, 0.23519003, -0.04188...",[0.06227134 -0.15009682 0.23519003 -0.04188167...
2,9367553,10.(2022).eng.1cd,"[advertise, your, product, or, brand, here, co...",advertise your product or brand here contact w...,"[-0.08674303, 0.113942094, 0.07312267, -0.0547...",[-0.08674303 0.113942094 0.07312267 -0.0547033...
3,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"[francis, ipreviously, oni, reign, her, majest...",francis ipreviously oni reign her majesty mari...,"[0.057274677, 0.18626337, -0.005120895, -0.154...",[0.057274677 0.18626337 -0.005120895 -0.154800...
4,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"[advertise, your, product, or, brand, here, co...",advertise your product or brand here contact w...,"[-0.04211163, -0.040654175, -0.010191287, -0.1...",[-0.04211163 -0.040654175 -0.010191287 -0.1076...
...,...,...,...,...,...,...
24744,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,"[ipreviously, oni, font, colorfffthe, real, ho...",ipreviously oni font colorfffthe real housewiv...,"[0.14130643, -0.16698056, 0.33102593, 0.074704...",[0.14130643 -0.16698056 0.33102593 0.0747046 -...
24745,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,"[i, think, thats, enough, for, today, good, no...",i think thats enough for today good now please...,"[0.111511536, -0.026505634, 0.20732762, -0.159...",[0.111511536 -0.026505634 0.20732762 -0.159436...
24746,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"[dramatic, orchestral, music, plays, indistinc...",dramatic orchestral music plays indistinct cha...,"[-0.11328379, -0.06661039, 0.11664734, 0.11008...",[-0.11328379 -0.06661039 0.11664734 0.11008331...
24747,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,"[use, the, free, code, joinnow, at, wwwplayshi...",use the free code joinnow at wwwplayshipseu si...,"[0.033705004, 0.0017321683, 0.049905933, -0.02...",[0.033705004 0.0017321683 0.049905933 -0.02711...


In [None]:
# Saving the data

In [None]:
# Assuming df_30_percent_data is your DataFrame
df_30_percent_data.to_csv('df_30_percent_data.csv', index=False, escapechar='\\')

# Providing a download link
from IPython.display import FileLink
FileLink('df_30_percent_data.csv')

## SAVING THE df_30_percent_data

In [None]:
import joblib

In [None]:
## Specify the file path where you want to save the DataFrame
file_path='C:/Users/rjsek/OneDrive/Desktop/Search Engine Project'

In [None]:
#saving the dataframe using joblib
joblib.dump(df_30_percent_data,file_path)

In [None]:
import joblib

# Update the file path to a location with write permissions
file_path = 'C:/Users/rjsek/Desktop/search_engine_data.pkl'  # Replace with your desired path

# Save the DataFrame
joblib.dump(df_30_percent_data, file_path)

## Preparing indexes

In [42]:
def indexer(item):
    index = []
    temp_df_30_percent_data = df_30_percent_data[df_30_percent_data['id'] == item]
    if not temp_df_30_percent_data.empty:
        temp_index = temp_df_30_percent_data.index[0]
        print("Temp index:", temp_index)
        bert_chunks_length = len(df_30_percent_data["bert_embedding_chunks"])
        print("Shape of bert_embedding_chunks DataFrame:", df_30_percent_data["bert_embedding_chunks"].shape)
        if temp_index < bert_chunks_length:
            for j in range(len(df_30_percent_data["bert_embedding_chunks"].iloc[temp_index])):
                index.append(str(item) + "-" + str(j))
            return index
        else:
            print("Index out of bounds for item:", item)
    else:
        print("Item not found in DataFrame:", item)
    return []
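
The identifiers built by `indexer` have the form `<id>-<chunk number>`. Since each row already carries its own `id` and chunk list, the same labels can be produced without searching the DataFrame for every row. A minimal sketch, with `chunk_ids` as a hypothetical helper:

```python
def chunk_ids(item_id, n_chunks):
    """Return one label per chunk, formatted as '<item_id>-<chunk number>'."""
    return [f"{item_id}-{j}" for j in range(n_chunks)]

print(chunk_ids(9464726, 3))
```

On the DataFrame this could be used row-wise, for example `df_30_percent_data.apply(lambda row: chunk_ids(row['id'], len(row['bert_embedding_chunks'])), axis=1)`, which also avoids the per-row debug printing.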

In [43]:
df_30_percent_data['num_list'] = df_30_percent_data['id'].apply(indexer)

Temp index: 0
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 1
Shape of bert_embedding_chunks DataFrame: (24749,)
...
Temp index: 7769
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7770
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7771
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7772
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7773
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7774
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7775
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7776
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7777
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7778
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7779
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7780
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7781
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7782
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 7783
Shape of bert_embedding_chunks 

Temp index: 8739
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8740
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8741
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8742
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8743
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8744
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8745
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8746
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8747
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8748
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8749
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8750
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8751
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8752
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 8753
Shape of bert_embedding_chunks 

Temp index: 9700
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9701
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9702
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9703
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9704
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9705
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9706
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9707
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9708
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9709
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9710
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9711
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9712
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9713
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 9714
Shape of bert_embedding_chunks 

Temp index: 10680
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10681
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10682
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10683
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10684
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10685
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10686
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10687
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10688
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10689
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10690
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10691
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10692
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10693
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 10694
Shape of bert_em

Temp index: 11644
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11645
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11646
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11647
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11648
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11649
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11650
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11651
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11652
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11653
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11654
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11655
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11656
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11657
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 11658
Shape of bert_em

Temp index: 12542
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12543
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12544
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12545
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12546
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12547
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12548
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12549
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12550
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12551
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12552
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12553
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12554
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12555
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 12556
Shape of bert_em

Temp index: 13507
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13508
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13509
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13510
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13511
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13512
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13513
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13514
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13515
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13516
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13517
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13518
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13519
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13520
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 13521
Shape of bert_em

Temp index: 14499
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14500
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14501
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14502
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14503
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14504
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14505
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14506
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14507
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14508
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14509
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14510
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14511
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14512
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 14513
Shape of bert_em

Temp index: 15461
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15462
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15463
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15464
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15465
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15466
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15467
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15468
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15469
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15470
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15471
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15472
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15473
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15474
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 15475
Shape of bert_em

Temp index: 16404
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16405
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16406
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16407
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16408
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16409
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16410
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16411
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16412
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16413
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16414
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16415
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16416
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16417
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 16418
Shape of bert_em

Temp index: 17344
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17345
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17346
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17347
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17348
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17349
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17350
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17351
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17352
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17353
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17354
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17355
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17356
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17357
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 17358
Shape of bert_em

Temp index: 18318
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18319
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18320
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18321
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18322
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18323
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18324
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18325
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18326
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18327
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18328
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18329
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18330
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18331
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 18332
Shape of bert_em

Temp index: 19288
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19289
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19290
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19291
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19292
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19293
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19294
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19295
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19296
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19297
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19298
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19299
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19300
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19301
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 19302
Shape of bert_em

Temp index: 20254
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20255
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20256
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20257
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20258
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20259
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20260
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20261
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20262
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20263
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20264
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20265
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20266
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20267
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 20268
Shape of bert_em

Temp index: 21189
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21190
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21191
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21192
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21193
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21194
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21195
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21196
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21197
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21198
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21199
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21200
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21201
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21202
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 21203
Shape of bert_em

Temp index: 22170
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22171
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22172
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22173
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22174
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22175
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22176
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22177
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22178
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22179
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22180
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22181
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22182
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22183
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 22184
Shape of bert_em

Temp index: 23147
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23148
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23149
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23150
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23151
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23152
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23153
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23154
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23155
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23156
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23157
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23158
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23159
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23160
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 23161
Shape of bert_em

Temp index: 24127
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24128
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24129
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24130
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24131
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24132
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24133
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24134
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24135
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24136
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24137
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24138
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24139
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24140
Shape of bert_embedding_chunks DataFrame: (24749,)
Temp index: 24141
Shape of bert_em

In [44]:
df_30_percent_data["bert_embedding_chunks"].shape

(24749,)

In [45]:
df_30_percent_data

Unnamed: 0,id,Movies/Series,subtitles,subtitles_sentence,bert_embedding,bert_embedding_chunks,num_list
0,9464726,csi.miami.s09.e01.fallen.(2010).eng.1cd,"[script, info, title, default, file, scripttyp...",script info title default file scripttype v wr...,"[0.019202188, -0.10135785, 0.05124608, 0.07146...",[0.019202188 -0.10135785 0.05124608 0.07146899...,"[9464726-0, 9464726-1, 9464726-2, 9464726-3, 9..."
1,9478254,red.eye.(2005).eng.1cd,"[jet, engine, revving, anman, lets, go, anwoma...",jet engine revving anman lets go anwoman taylo...,"[0.06227134, -0.15009682, 0.23519003, -0.04188...",[0.06227134 -0.15009682 0.23519003 -0.04188167...,"[9478254-0, 9478254-1, 9478254-2, 9478254-3, 9..."
2,9367553,10.(2022).eng.1cd,"[advertise, your, product, or, brand, here, co...",advertise your product or brand here contact w...,"[-0.08674303, 0.113942094, 0.07312267, -0.0547...",[-0.08674303 0.113942094 0.07312267 -0.0547033...,"[9367553-0, 9367553-1, 9367553-2, 9367553-3, 9..."
3,9203091,reign.s02.e15.forbidden.(2015).eng.1cd,"[francis, ipreviously, oni, reign, her, majest...",francis ipreviously oni reign her majesty mari...,"[0.057274677, 0.18626337, -0.005120895, -0.154...",[0.057274677 0.18626337 -0.005120895 -0.154800...,"[9203091-0, 9203091-1, 9203091-2, 9203091-3, 9..."
4,9265916,barbie.it.takes.two.s02.e01.weve.got.magic.to....,"[advertise, your, product, or, brand, here, co...",advertise your product or brand here contact w...,"[-0.04211163, -0.040654175, -0.010191287, -0.1...",[-0.04211163 -0.040654175 -0.010191287 -0.1076...,"[9265916-0, 9265916-1, 9265916-2, 9265916-3, 9..."
...,...,...,...,...,...,...,...
24744,9265204,the.real.housewives.of.beverly.hills.s12.e19.w...,"[ipreviously, oni, font, colorfffthe, real, ho...",ipreviously oni font colorfffthe real housewiv...,"[0.14130643, -0.16698056, 0.33102593, 0.074704...",[0.14130643 -0.16698056 0.33102593 0.0747046 -...,"[9265204-0, 9265204-1, 9265204-2, 9265204-3, 9..."
24745,9492149,murdoch.mysteries.s07.e06.murdochophobia.(2013...,"[i, think, thats, enough, for, today, good, no...",i think thats enough for today good now please...,"[0.111511536, -0.026505634, 0.20732762, -0.159...",[0.111511536 -0.026505634 0.20732762 -0.159436...,"[9492149-0, 9492149-1, 9492149-2, 9492149-3, 9..."
24746,9301734,dangerous.liaisons.s01.e01.love.or.war.(2022)....,"[dramatic, orchestral, music, plays, indistinc...",dramatic orchestral music plays indistinct cha...,"[-0.11328379, -0.06661039, 0.11664734, 0.11008...",[-0.11328379 -0.06661039 0.11664734 0.11008331...,"[9301734-0, 9301734-1, 9301734-2, 9301734-3, 9..."
24747,9215721,preacher.s02.e13.the.end.of.the.road.(2017).en...,"[use, the, free, code, joinnow, at, wwwplayshi...",use the free code joinnow at wwwplayshipseu si...,"[0.033705004, 0.0017321683, 0.049905933, -0.02...",[0.033705004 0.0017321683 0.049905933 -0.02711...,"[9215721-0, 9215721-1, 9215721-2, 9215721-3, 9..."


In [46]:
df_30_percent_data['num_list']

0        [9464726-0, 9464726-1, 9464726-2, 9464726-3, 9...
1        [9478254-0, 9478254-1, 9478254-2, 9478254-3, 9...
2        [9367553-0, 9367553-1, 9367553-2, 9367553-3, 9...
3        [9203091-0, 9203091-1, 9203091-2, 9203091-3, 9...
4        [9265916-0, 9265916-1, 9265916-2, 9265916-3, 9...
                               ...                        
24744    [9265204-0, 9265204-1, 9265204-2, 9265204-3, 9...
24745    [9492149-0, 9492149-1, 9492149-2, 9492149-3, 9...
24746    [9301734-0, 9301734-1, 9301734-2, 9301734-3, 9...
24747    [9215721-0, 9215721-1, 9215721-2, 9215721-3, 9...
24748    [9274955-0, 9274955-1, 9274955-2, 9274955-3, 9...
Name: num_list, Length: 24749, dtype: object
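The `num_list` values above follow the pattern `<subtitle id>-<chunk index>`, which gives every chunk of a subtitle file its own unique id. A minimal sketch of how such ids could be built is shown below; the helper name `make_chunk_ids` and its parameters are hypothetical and not taken from the notebook.

```python
def make_chunk_ids(subtitle_id, n_chunks):
    """Return one id per chunk in the form '<subtitle_id>-<chunk_index>'.

    Hypothetical helper: the notebook's own code for building num_list
    is not shown in this section.
    """
    return [f"{subtitle_id}-{k}" for k in range(n_chunks)]


# Example: a subtitle file that was split into four chunks
chunk_ids = make_chunk_ids(9464726, 4)
print(chunk_ids)
```

Keeping the subtitle id as the prefix means the original file can be recovered later by splitting a chunk id on the hyphen.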

## **Storing Embeddings in a ChromaDB Database**

In [None]:
!pip install chromadb

In [None]:
import chromadb

In [None]:
client = chromadb.PersistentClient(path='C:/Users/rjsek/OneDrive/Desktop/Search Engine Project')

In [None]:
collection = client.get_or_create_collection(
        name="subtitles",
        metadata={"hnsw:space": "cosine"} 
    )

In [None]:
collection_2 = client.get_or_create_collection(
        name="titles",
        metadata={"hnsw:space": "cosine"} 
    )
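Setting `"hnsw:space": "cosine"` makes the collection compare vectors by cosine distance, that is, one minus the cosine similarity, so a smaller value means a closer match. The pure-Python sketch below only illustrates that quantity; `cosine_distance` is a hypothetical helper and not part of ChromaDB.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors: 1 - cos(angle)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way are at distance ~0 regardless of length,
# orthogonal vectors are at distance 1.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because the measure ignores vector length, it suits sentence embeddings, where direction carries the meaning.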

In [None]:
def add_func_v1():
    for i in range(df_30_percent_data.shape[0]):  # iterate over every row of the DataFrame
        collection_2.add(
            documents=[df_30_percent_data['Movies/Series'].iloc[i]],  # adding each file name
            embeddings=[[1, 2, 34, 45]],  # placeholder vector; embeddings are not needed when retrieving the file name
            ids=[str(df_30_percent_data['id'].iloc[i])]  # one unique string id per file (the 'num' value)
        )

In [None]:
def add_func_v2():
    for i in range(df_30_percent_data.shape[0]):  # setting the range as total no. of rows in dataframe
        collection.add(
            documents=df_30_percent_data['bert_embedding_chunks'].iloc[i],  # adding each chunk
            embeddings=df_30_percent_data['bert_embedding'].iloc[i],  # adding the corresponding chunk embedding
            ids=df_30_percent_data['num_list'].iloc[i]  # entering the unique 'num' id
        )
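`collection.add` expects `ids`, `embeddings` and `documents` to be parallel lists of the same length, one entry per chunk. When many chunks are inserted, it can help to check the lengths first and add them in fixed-size slices. The sketch below is an assumption about how that could be organised: `iter_batches` and the batch size of 1000 are hypothetical, and the `collection.add` call appears only as a comment.

```python
def iter_batches(ids, embeddings, documents, batch_size=1000):
    """Yield aligned slices of the three parallel lists.

    Hypothetical helper; raises ValueError if the lists are not the same length.
    """
    if not (len(ids) == len(embeddings) == len(documents)):
        raise ValueError("ids, embeddings and documents must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], embeddings[start:end], documents[start:end]


# Intended use (not executed here):
# for b_ids, b_emb, b_docs in iter_batches(ids, embeddings, documents):
#     collection.add(ids=b_ids, embeddings=b_emb, documents=b_docs)
demo = list(iter_batches(["a", "b", "c"], [[1], [2], [3]], ["x", "y", "z"], batch_size=2))
```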


In [None]:
# Building Search Engine App

In [None]:
# VS Code

In [None]:
# Libraries

In [None]:
import re
import chromadb
from sentence_transformers import SentenceTransformer
import streamlit as st

In [None]:
# Initializing chromaDB

In [None]:
# PersistentClient expects the directory that contains chroma.sqlite3, not the file itself
client = chromadb.PersistentClient(path="C:/Users/rjsek/OneDrive/Desktop")
collection = client.get_collection(name="subtitles")
collection_name = client.get_collection(name="titles")
model_name = "paraphrase-MiniLM-L3-v2"
model = SentenceTransformer(model_name, device="cpu")

In [None]:
# Cleaning steps for the user query

In [None]:
def clean_data(data):  # data is the query text

    # removing timestamps
    data = re.sub(r"\d{2}:\d{2}:\d{2},\d{3}\s-->\s\d{2}:\d{2}:\d{2},\d{3}", "", data)
    # removing index no. of dialogues
    data = re.sub(r"\n?\d+\r", "", data)
    # removing escape sequences like \n \r
    data = re.sub(r"\r|\n", "", data)
    # removing <i> and </i>
    data = re.sub(r"<i>|</i>", "", data)
    # removing links
    data = re.sub(r"(?:www\.)osdb\.link\/[\w\d]+|www\.OpenSubtitles\.org|osdb\.link\/ext|api\.OpenSubtitles\.org|OpenSubtitles\.com", "", data)
    # converting to lower case
    data = data.lower()

    return data

In [None]:
# Creating a function to extract the subtitle_id

In [None]:
def extract_id(id_list):
    new_id_list=[]
    for item in id_list:
        match = re.match(r'^(\d+)', item)
        if match:
            extracted_number = match.group(1)
            new_id_list.append(extracted_number)
    return new_id_list

In [None]:
# Final VS Code [App]

In [None]:
import streamlit as st
import re
import pandas as pd
import chromadb
from sentence_transformers import SentenceTransformer

# Load the pre-trained SentenceTransformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Function to clean user query
def clean_data(data):
    # Remove punctuation and convert to lowercase
    data = re.sub(r'[^\w\s]', '', data)
    data = data.lower()
    return data

# Function to extract subtitle IDs from the 'num_list' column
def extract_id(id_list):
    return [item.split('-')[0] for item in id_list]

# Streamlit UI
st.title("🎥Movies/Series Subtitle Search Engine🔎")
search_query = st.text_input("Search here🔬")

if st.button("Search"):
    st.subheader("Subtitle Files ⬇️")
    search_query = clean_data(search_query)
    query_embed = model.encode(search_query).tolist()
    
    # Simulated query against the saved DataFrame; no similarity ranking is applied here
    df_30_percent_data = pd.read_csv("C:/Users/rjsek/Downloads/df_30_percent_data.csv")

    # Replace this section with the actual database querying code.
    # For demonstration purposes, every row of the DataFrame is treated as a result.

    # Create a dictionary to map subtitle IDs to movie/series names
    id_to_name = df_30_percent_data.set_index('id')['Movies/Series'].to_dict()

    # Display each result as a link to its OpenSubtitles page
    for subtitle_id, file_name in id_to_name.items():
        st.markdown(f"[{file_name}](https://www.opensubtitles.org/en/subtitles/{subtitle_id})")
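The app above lists every file instead of querying the database. Once the ChromaDB query is wired in, its result has to be turned back into file links. The sketch below assumes the result is a dictionary whose `"ids"` entry holds one list of chunk ids per query, as returned by a call such as `collection.query(query_embeddings=[query_embed], n_results=10)`. The helper `results_to_links`, the sample result and the sample names are hypothetical.

```python
def results_to_links(result, id_to_name):
    """Turn ranked chunk ids into markdown links, one per subtitle file.

    `result` is assumed to look like {"ids": [["9464726-0", ...]]};
    `id_to_name` maps subtitle id strings to file names.
    """
    links = []
    seen = set()
    for chunk_id in result["ids"][0]:  # hits for the first (only) query
        subtitle_id = chunk_id.split("-")[0]
        if subtitle_id in seen:  # keep only the best-ranked chunk per file
            continue
        seen.add(subtitle_id)
        name = id_to_name.get(subtitle_id, subtitle_id)
        links.append(f"[{name}](https://www.opensubtitles.org/en/subtitles/{subtitle_id})")
    return links


# Hand-written sample standing in for a real query result
sample_result = {"ids": [["9464726-3", "9478254-0", "9464726-7"]]}
sample_names = {"9464726": "csi.miami.s09.e01.fallen.(2010).eng.1cd"}
links = results_to_links(sample_result, sample_names)
```

In the Streamlit app, each returned string could be passed to `st.markdown` in place of the loop over the whole DataFrame.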