### Project Statement - Enhancing Search Engine Relevance for Video Subtitles

### AIM Of The Project
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in connecting users with relevant information. For Google, providing a seamless and accurate search experience is paramount. This project focuses on improving the search relevance for video subtitles, enhancing the accessibility of video content.


### Objective:
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content.
The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.


## **Step 1 - Reading the Tables from Database file**

In [2]:
import sqlite3
import pandas as pd

In [3]:
# Read the code below and write your observation in the next cell

conn = sqlite3.connect(r'C:/Users/deshp/Downloads/data/eng_subtitles_database.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
#"C:\Users\deshp\Downloads\data\eng_subtitles_database.db"

[('zipfiles',)]


**In the above cell, I am able to read the table inside the database. As mentioned earlier, the table name is `zipfiles`. We also know from README.txt that this table contains three columns: `num`, `name` and `content`.**

## **Step 2 - Reading the columns of Table**

In [4]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

num
name
content


**The above code helps in checking the column names in the database table.**
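The same inspection can be wrapped in a small helper. The sketch below is illustrative only: it builds a throwaway in-memory database with a table shaped like `zipfiles`, so the helper name `list_columns` and the sample schema are assumptions rather than part of the original notebook.

```python
import sqlite3

def list_columns(conn, table):
    """Return the column names of `table` using PRAGMA table_info."""
    # PRAGMA statements cannot take bound parameters, so the table name is
    # quoted by hand; only pass trusted table names here.
    rows = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); index 1 is the name.
    return [row[1] for row in rows]

# Hypothetical stand-in for eng_subtitles_database.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")
columns = list_columns(demo_conn, "zipfiles")
print(columns)
demo_conn.close()
```

With the real database, the same helper could be called on the existing `conn` object.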

**Let's now use `SELECT * FROM zipfiles` to read all the data into a `df` variable.**

## **Step 3 - Loading the Database Table inside a Pandas DataFrame**

In [5]:
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...


In [6]:
len(df)

82498

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


**It looks like the `content` column does not contain the subtitle text directly. Instead, as mentioned in README.txt, it might be latin-1 encoded.**

## **Step 4 - Printing `content` of 0th Row**

In [8]:
b_data = df.iloc[0, 2]

# here 2 represent the index of content column
# 0 represents the row number

In [9]:
print(b_data)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x99V\x9fx\x96\xf0\x8c\x9e\x00\x00\x86\x9b\x01\x00;\x00\x00\x00The.Message.1976.REMASTERED.1080p.BluRay.x264-PiGNUS.EN.srt\xad\xbdm\x93\xdc\xc6\x91.\xfa\x9d\x11\xfc\x0f-}\xe1=\x11-\x9d\x06P\x85\x17\x9d\x8d\xd5%%[\xa4-Y>&u\x15>\xdf\xd0\xd3\x98\x19x\xfae\x0cts<\xfe\xf57\x9f\'\xb3\n\xd9\xa4\xbc\xbb\xf7\xc6Fl\xacELW\xa2\xaa\x90\x95\x95\xafO\x16/_l6\xdf\xe0\xff\xea\xf5f\xb3Y}\xf5\xd5\xbf\xaf\xf4AQ\xae7Mx\xf9\xe2\xd7\xfe|s\xbf\xea\x8f\xcf\xab\x8f\xe3n8\xadN\xc7\xfdx\x1cVO\xe3\xf9~\xf5\xf3\xe3p\xfc\xea\xfd/o>\xbc\xfb\xf0\xe3\xef\xde\xbf|\xf1\xfbi\x18Vo\xa6\xd3\xd3<L\xab\xe1\x1f\xe7\xe18\x8f\xa7\xe37\xab\xd3\xbc\xdb~-\xc3\x1e\xfe\xa7<|\xf9\xe2\xe5\x8bR_[~S\xd6\xeb\xa2k\xf3k\xe5A\xb7\xeeb\xf5\xf2\xc5\xbb\xe3\xea|?\xac\x8e\xfdaX\x9dnW?\x9cvk>8\x9c\xe6\xf3\xean\xeao\xc6\xd3ev\x8f~\x1a\xa6\x9b\xf1\xf6\xb2\xff\x1a\xe4\xabD\xbe*d\x11\xa5#_U\xeb\xaa\xd9`\xa6\xa7\xc3\xea\xa7\xcb}\x7f8\xf4F\xf9\xa7a\x9e\x87\xe3\x9d\xcc\\\xdf\x07B!\x13\xaa\xd61n<!\xd9\xaf\xd0\

**The content starts with the bytes `PK\x03\x04`, the signature of a ZIP local file header, which suggests that each entry is a ZIP archive. How do I know? Experience! I have worked with something similar before.**
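Rather than relying on experience alone, the guess can be checked in code. This is a minimal sketch, assuming the bytes come from one `content` value; the helper name `looks_like_zip` and the in-memory sample archive are made up for the example.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature at the start of a non-empty ZIP archive

def looks_like_zip(data: bytes) -> bool:
    """Cheap signature check followed by a structural check."""
    if not data.startswith(ZIP_LOCAL_HEADER):
        return False
    # is_zipfile accepts a file-like object and looks for the ZIP end record
    return zipfile.is_zipfile(io.BytesIO(data))

# Build a tiny archive in memory to stand in for one `content` value
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("example.srt", "1\n00:00:01,000 --> 00:00:02,000\nHello\n")
sample_bytes = buffer.getvalue()

print(looks_like_zip(sample_bytes))
print(looks_like_zip(b"plain text"))
```

In the notebook, something like `looks_like_zip(df.iloc[0, 2])` would apply the same check to a real row.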

## **Step 5 - Unzipping the content of 380th row and decoding using `latin-1`**

In [10]:
import zipfile
import io

# Assuming 'content' is the binary data from your database
binary_data = df.iloc[380, 2]

# Decompress the binary data using the zipfile module
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        # Reading only one file in the ZIP archive
        subtitle_content = zip_file.read(zip_file.namelist()[0])

# Now 'subtitle_content' should contain the extracted subtitle content
print(subtitle_content.decode('latin-1'))  # Assuming the content is latin-1 encoded text

ï»¿1
00:00:10,543 --> 00:00:13,083
Previously on
"The Great Food Truck Race"...

2
00:00:13,079 --> 00:00:15,779
Eight teams of aspiring
food truck owners

3
00:00:15,782 --> 00:00:17,982
launched a frantic
crossâcountry journey.

4
00:00:17,984 --> 00:00:19,694
The prize?

5
00:00:19,686 --> 00:00:22,186
A stateâofâtheâart
food truck and $50,000.

6
00:00:22,188 --> 00:00:24,388
Yeah, buddy!
For five
of those teams,

7
00:00:24,391 --> 00:00:25,891
what started as
a dream come true

8
00:00:25,892 --> 00:00:27,662
became a wake up call.

9
00:00:27,660 --> 00:00:28,900
I got beat by
a quesadilla.

10
00:00:28,895 --> 00:00:31,125
Now,
only three teams remain.

11
00:00:31,197 --> 00:00:33,567
Lone Star Chuck Wagon
started the race

12
00:00:33,566 --> 00:00:35,436
by taking first place
in Venice Beach.

13
00:00:35,435 --> 00:00:38,905
But in Oklahoma City,
they almost went home.

14
00:00:38,905 --> 00:00:40,665
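
The stray `ï»¿` at the start of the output and the `â` inside words such as "crossâcountry" look like UTF-8 bytes (a byte-order mark and a typographic hyphen) being read as latin-1. One possible way to handle both encodings is to try UTF-8 first and fall back to latin-1. The sketch below is an assumption about how that could be done, not what the notebook does; `decode_subtitle` is a hypothetical helper.

```python
def decode_subtitle(raw: bytes) -> str:
    """Try UTF-8 (dropping a leading BOM) first, then fall back to latin-1."""
    try:
        # 'utf-8-sig' behaves like UTF-8 but strips a leading byte-order mark
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails
        return raw.decode("latin-1")

utf8_with_bom = b"\xef\xbb\xbf1\r\ncross\xe2\x80\x90country"
latin1_only = b"caf\xe9"

decoded_utf8 = decode_subtitle(utf8_with_bom)
decoded_latin1 = decode_subtitle(latin1_only)
print(repr(decoded_utf8))
print(repr(decoded_latin1))
```

Files that are genuinely latin-1 still decode as before, while UTF-8 files no longer pick up the mojibake characters.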

## **Step 6 - Applying the above Function on the Entire Data**

In [11]:
count = 0

def decode_method(binary_data):
    global count
    # Decompress the binary data using the zipfile module
    # print(count, end=" ")
    count += 1
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])
    
    # Now 'subtitle_content' should contain the extracted subtitle content
    return subtitle_content.decode('latin-1')  # Assuming the content is latin-1 encoded text

In [12]:
df['file_content'] = df['content'].apply(decode_method)

df.head()

Unnamed: 0,num,name,content,file_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."


In [13]:
len(df)

82498

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   num           82498 non-null  int64 
 1   name          82498 non-null  object
 2   content       82498 non-null  object
 3   file_content  82498 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.5+ MB


In [15]:
df.tail()

Unnamed: 0,num,name,content,file_content
82493,9521935,the.prophets.game.(2000).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xb8\xa6\x...,"ï»¿1\r\n00:01:16,284 --> 00:01:19,537\r\nGod,\..."
82494,9521937,west.beirut.(1998).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x13\x97\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\napi.Open..."
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00$\x97\x9aV...,"1\r\n00:00:01,001 --> 00:00:04,630\r\n(Dramati..."
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x97\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
82497,9521941,zombie.island.massacre.(1984).eng.1cd,"b'PK\x03\x04\x14\x00\x00\x00\x08\x00,\x97\x9aV...","1\r\n00:00:01,919 --> 00:00:03,253\r\n(Sharp w..."


### DROPPING THE "CONTENT" COLUMN

In [17]:
df =  df.drop(columns='content')

In [18]:
df.tail()

Unnamed: 0,num,name,file_content
82493,9521935,the.prophets.game.(2000).eng.1cd,"ï»¿1\r\n00:01:16,284 --> 00:01:19,537\r\nGod,\..."
82494,9521937,west.beirut.(1998).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\napi.Open..."
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,"1\r\n00:00:01,001 --> 00:00:04,630\r\n(Dramati..."
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
82497,9521941,zombie.island.massacre.(1984).eng.1cd,"1\r\n00:00:01,919 --> 00:00:03,253\r\n(Sharp w..."


### RENAMING THE COLUMN NAMES

In [19]:
df.rename(columns={"num":"id",
             "name":"Movies/Series",
             "file_content":"subtitles"},inplace=True)

In [20]:
df.head()

Unnamed: 0,id,Movies/Series,subtitles
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."


### TAKING THE FIRST 10% OF THE DATA

In [21]:
total_rows=len(df)
percent_data=int(total_rows* 0.10)
df_10_percent_data=df.iloc[:percent_data]

In [22]:
len(df_10_percent_data)

8249
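
The slice above takes the first 10% of the rows in their stored order. Assigning new columns to such a slice is also what makes pandas emit `SettingWithCopyWarning` in the cells that follow. A minimal sketch of two alternatives, using a made-up `demo_df` in place of the real data (the `random_state` value is arbitrary):

```python
import pandas as pd

demo_df = pd.DataFrame({"id": range(100), "subtitles": [f"text {i}" for i in range(100)]})
n_rows = int(len(demo_df) * 0.10)

# First 10% of rows, as in the notebook, but as an independent copy
first_10 = demo_df.iloc[:n_rows].copy()
first_10["subtitles"] = first_10["subtitles"].str.upper()  # modifies the copy only

# A random 10% instead; random_state makes the draw repeatable
random_10 = demo_df.sample(frac=0.10, random_state=42).reset_index(drop=True)

print(len(first_10), len(random_10))
```

Working on an explicit copy keeps later column assignments from touching, or appearing to touch, the original DataFrame.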

#  Preprocessing of data

In [23]:
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords



**`removing special characters`**

In [24]:
df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

  df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.replace(r'[^a-zA-Z\s]', '')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.replace(r'[^a-zA-Z\s]', '')


**`removing non ASCII characters`**

In [25]:
df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.replace(r'[^\x00-\x7F]+', '', regex=True)



**`removing time-stamp`**

In [26]:
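# The pattern needs the comma before the milliseconds; it only matches if it runs before the special-character removal above
df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.replace(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', regex=True)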



**`converting to lowercase`**

In [27]:
df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.lower()



**`removing extra spaces`**

In [28]:
df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].str.replace(r'\s+', ' ', regex=True)



In [29]:
df_10_percent_data['subtitles']

0        watch any video online with opensubtitles fre...
1        ah theres princess dawn and terry with the bl...
2        iyumis cells i iepisode extremely polite yumi...
3        watch any video online with opensubtitles fre...
4        watch any video online with opensubtitles fre...
                              ...                        
8244     apiopensubtitlesorg is deprecated please impl...
8245     advertise your product or brand here contact ...
8246     watch any video online with opensubtitles fre...
8247    script info title english us original script k...
8248     support us and become vip member to remove al...
Name: subtitles, Length: 8249, dtype: object
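
The order of these steps matters. The special-character step removes digits, colons and commas, so a timestamp pattern applied afterwards has nothing left to match. Residue such as `iyumis` in the output also suggests that `<i>` tags lost their brackets but kept their letters. The sketch below shows one possible ordering on a single string using only the standard library; `clean_subtitle` is a hypothetical helper, and the tag-stripping step is an addition that the notebook does not have.

```python
import re

TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def clean_subtitle(text: str) -> str:
    text = TIMESTAMP.sub(" ", text)           # 1. drop SRT timing lines first
    text = re.sub(r"<[^>]+>", " ", text)      # 2. drop markup such as <i>...</i>
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # 3. keep letters and whitespace only
    text = text.lower()                       # 4. normalise case
    return re.sub(r"\s+", " ", text).strip()  # 5. collapse runs of whitespace

raw = '1\r\n00:00:10,543 --> 00:00:13,083\r\n<i>Previously</i> on\r\n"The Great Food Truck Race"...\r\n'
cleaned = clean_subtitle(raw)
print(cleaned)
```

Applied to a column, this could be used as `df_10_percent_data['subtitles'].apply(clean_subtitle)` on the undecorated text.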

In [30]:
!pip install swifter





In [32]:
import swifter

**`Tokenization step`**

In [33]:
df_10_percent_data['subtitles'] = df_10_percent_data['subtitles'].swifter.apply(lambda x: word_tokenize(x))



In [34]:
df_10_percent_data['subtitles']


0       [watch, any, video, online, with, opensubtitle...
1       [ah, theres, princess, dawn, and, terry, with,...
2       [iyumis, cells, i, iepisode, extremely, polite...
3       [watch, any, video, online, with, opensubtitle...
4       [watch, any, video, online, with, opensubtitle...
                              ...                        
8244    [apiopensubtitlesorg, is, deprecated, please, ...
8245    [advertise, your, product, or, brand, here, co...
8246    [watch, any, video, online, with, opensubtitle...
8247    [script, info, title, english, us, original, s...
8248    [support, us, and, become, vip, member, to, re...
Name: subtitles, Length: 8249, dtype: object

In [35]:
df_10_percent_data

Unnamed: 0,id,Movies/Series,subtitles
0,9180533,the.message.(1976).eng.1cd,"[watch, any, video, online, with, opensubtitle..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"[ah, theres, princess, dawn, and, terry, with,..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"[iyumis, cells, i, iepisode, extremely, polite..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"[watch, any, video, online, with, opensubtitle..."
4,9180600,broker.(2022).eng.1cd,"[watch, any, video, online, with, opensubtitle..."
...,...,...,...
8244,9215813,gossamer.folds.(2020).eng.1cd,"[apiopensubtitlesorg, is, deprecated, please, ..."
8245,9215815,gossamer.folds.(2020).eng.1cd,"[advertise, your, product, or, brand, here, co..."
8246,9215856,the.app.from.heaven.(2017).eng.1cd,"[watch, any, video, online, with, opensubtitle..."
8247,9215897,orient.s02.e07.a.matter.of.trust.(2022).eng.1cd,"[script, info, title, english, us, original, s..."


In [36]:
df_10_percent_data.head()

Unnamed: 0,id,Movies/Series,subtitles
0,9180533,the.message.(1976).eng.1cd,"[watch, any, video, online, with, opensubtitle..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"[ah, theres, princess, dawn, and, terry, with,..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"[iyumis, cells, i, iepisode, extremely, polite..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"[watch, any, video, online, with, opensubtitle..."
4,9180600,broker.(2022).eng.1cd,"[watch, any, video, online, with, opensubtitle..."


In [37]:
len(df_10_percent_data)

8249

## TFIDF-approach

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

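# Fit and transform the text data (TfidfVectorizer expects strings, so token lists are joined back first)
texts = df_10_percent_data['subtitles'].apply(lambda t: ' '.join(t) if isinstance(t, list) else t)
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)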

# Sparse matrix representation of TF-IDF
print(tfidf_matrix)


  (0, 226502)	0.001525948224892838
  (0, 245915)	0.0015495171655121648
  (0, 256404)	0.008243210796051071
  (0, 226062)	0.0015317260714399195
  (0, 197080)	0.001470572380728476
  (0, 128507)	0.008243210796051071
  (0, 229341)	0.006523011873745068
  (0, 42799)	0.0026871656045789373
  (0, 107898)	0.004918282155982931
  (0, 190873)	0.006588523867654451
  (0, 244444)	0.0025353788036615757
  (0, 103895)	0.005390885775324752
  (0, 253216)	0.005397700828561956
  (0, 187202)	0.005738629305752286
  (0, 157289)	0.006351376525133158
  (0, 121602)	0.004456488685117844
  (0, 167449)	0.008243210796051071
  (0, 72669)	0.004386199435110583
  (0, 111000)	0.0024583444650589027
  (0, 190889)	0.005738629305752286
  (0, 255791)	0.012329155455972197
  (0, 166129)	0.0078847766502921
  (0, 63949)	0.004167797366322736
  (0, 110377)	0.002979352468366305
  (0, 77027)	0.005179474699146711
  :	:
  (8248, 252788)	0.010523406498066037
  (8248, 234524)	0.04100554006128041
  (8248, 138785)	0.010380190702557966
  (8248
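
To make the numbers in the sparse matrix less opaque, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation. It is a toy for illustration, not a replacement for scikit-learn; the three-document corpus and the helper names are made up.

```python
import math
from collections import Counter

docs = [
    "food truck race",
    "food truck owners",
    "zombie island massacre",
]

def tfidf_vectors(documents):
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors, idf

def cosine(a, b):
    # Vectors are already L2-normalised, so the dot product is the cosine similarity
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

vectors, idf = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Documents that share terms get a similarity above zero, while documents with no terms in common score exactly zero. This is why a purely lexical index cannot match paraphrased queries.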


### Using BERT-based “SentenceTransformers” to generate embeddings that encode semantic information.
### This can help us build a Semantic Search Engine.

In [42]:
!pip install sentence-transformers



ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\deshp\\anaconda3\\Lib\\site-packages\\~2mpy.libs\\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll'
Consider using the `--user` option or check the permissions.




Collecting numpy
  Using cached numpy-1.22.4-cp39-cp39-win_amd64.whl (14.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4


In [43]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained BERT-based sentence embedding model
name_of_model = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(name_of_model)

# Function to encode text using the model
def encode_text(text):
    # Encode the text into a fixed-length embedding vector.
    # Input longer than the model's maximum sequence length is truncated by the model.
    embedding = model.encode(text)
    return embedding



In [44]:
# Apply BERT vectorization to the 'subtitles' column



df_10_percent_data['bert_embedding'] = df_10_percent_data['subtitles'].apply(encode_text)



In [45]:
df_10_percent_data['bert_embedding']

0       [0.0429545, 0.047213517, -0.048022557, -0.0589...
1       [-0.017032498, 0.0084469365, 0.07423645, -0.01...
2       [-0.04869099, -0.04203135, 0.06548537, 0.03954...
3       [-0.043026306, -0.03555924, 0.059036024, -0.04...
4       [-0.024402311, 0.013387982, 0.123676725, -0.02...
                              ...                        
8244    [0.077503555, -0.041504238, 0.03078112, -0.035...
8245    [0.033693567, -0.08421364, 0.07477469, -0.0781...
8246    [-0.06501349, 0.013136341, 0.05656268, -0.0415...
8247    [0.005435122, -0.025132079, 0.015010854, -0.03...
8248    [-0.020894008, 0.02490764, 0.028566448, -0.082...
Name: bert_embedding, Length: 8249, dtype: object
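
Each row now holds one fixed-length vector. Semantic search then amounts to embedding the query with the same model and ranking rows by cosine similarity. The sketch below shows that ranking step with NumPy on toy 4-dimensional vectors; the helper `top_k` and the sample numbers are illustrative assumptions, and real vectors would come from `model.encode(...)`.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return indices and scores of the k rows most similar to query_vec."""
    # Normalise rows and query so that a dot product equals cosine similarity
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

# Toy 4-dimensional "embeddings"
doc_matrix = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query_vec = np.array([1.0, 0.05, 0.0, 0.0])

indices, scores = top_k(query_vec, doc_matrix, k=2)
print(indices, scores)
```

A vector database such as ChromaDB performs the same kind of ranking, typically with an approximate index so that it scales beyond a brute-force scan.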

In [46]:


df_10_percent_data

Unnamed: 0,id,Movies/Series,subtitles,bert_embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with opensubtitles free...,"[0.0429545, 0.047213517, -0.048022557, -0.0589..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah theres princess dawn and terry with the blo...,"[-0.017032498, 0.0084469365, 0.07423645, -0.01..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,iyumis cells i iepisode extremely polite yumii...,"[-0.04869099, -0.04203135, 0.06548537, 0.03954..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with opensubtitles free...,"[-0.043026306, -0.03555924, 0.059036024, -0.04..."
4,9180600,broker.(2022).eng.1cd,watch any video online with opensubtitles free...,"[-0.024402311, 0.013387982, 0.123676725, -0.02..."
...,...,...,...,...
8244,9215813,gossamer.folds.(2020).eng.1cd,apiopensubtitlesorg is deprecated please imple...,"[0.077503555, -0.041504238, 0.03078112, -0.035..."
8245,9215815,gossamer.folds.(2020).eng.1cd,advertise your product or brand here contact w...,"[0.033693567, -0.08421364, 0.07477469, -0.0781..."
8246,9215856,the.app.from.heaven.(2017).eng.1cd,watch any video online with opensubtitles free...,"[-0.06501349, 0.013136341, 0.05656268, -0.0415..."
8247,9215897,orient.s02.e07.a.matter.of.trust.(2022).eng.1cd,script info title english us original script k...,"[0.005435122, -0.025132079, 0.015010854, -0.03..."


In [47]:
# Stack the per-row vectors into one 2-D float array of shape (n_rows, embedding_dim)
subtitle_embeddings = np.vstack(df_10_percent_data['bert_embedding'].to_numpy())
np.save('subtitle_embeddings.npy', subtitle_embeddings)

In [48]:

#import json

# Function to convert numpy array to string
#def array_to_string(arr):
   # return json.dumps(arr.tolist())

# Apply the function to convert numpy arrays to strings
#df_10_percent_data['bert_embedding_str'] = df_10_percent_data['bert_embedding'].apply(array_to_string)



### DOCUMENT CHUNKER

In [49]:
# Creating  a function that takes a large document as input and divides it into smaller chunks 
# with an overlap to maintain context continuity.

In [50]:
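def semantic_chunking(embeddings, chunk_size=1000, overlap=50):
    """Split the string form of each embedding into overlapping character chunks.

    Note: this slices the *text representation* of the embedding numbers,
    not the subtitle text itself. chunk_size and overlap are in characters,
    and overlap must be smaller than chunk_size.
    """
    chunks = []
    if isinstance(embeddings, np.ndarray) and len(embeddings.shape) == 1:
        # Single 1-D embedding vector: wrap it in a list for consistency
        embeddings = [embeddings]
    for embedding in embeddings:
        if isinstance(embedding, np.ndarray):
            embedding = ' '.join(map(str, embedding))  # Convert array to string
        else:
            embedding = str(embedding)  # Convert float to string
        # Step forward by chunk_size - overlap so consecutive chunks share `overlap` characters
        for start in range(0, len(embedding), chunk_size - overlap):
            chunk = embedding[start:start + chunk_size]
            chunks.append(chunk)
    return chunks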
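
The function above is applied to the embedding column, so what gets chunked is the printed form of the vector rather than the subtitle text. If the goal is to split long subtitle documents before embedding them, a sliding window over words is one option. The following is a minimal sketch under that assumption; `chunk_text` and its word-based sizes are hypothetical and not part of the notebook.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

demo_words = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_text(demo_words, chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk could then be passed to `encode_text` separately, which also keeps every input within the embedding model's sequence limit.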


In [51]:
df_10_percent_data['bert_embedding_chunks'] = df_10_percent_data['bert_embedding'].apply(semantic_chunking)



In [54]:
df_10_percent_data

Unnamed: 0,id,Movies/Series,subtitles,bert_embedding,bert_embedding_chunks
0,9180533,the.message.(1976).eng.1cd,watch any video online with opensubtitles free...,"[0.0429545, 0.047213517, -0.048022557, -0.0589...",[0.0429545 0.047213517 -0.048022557 -0.0589091...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah theres princess dawn and terry with the blo...,"[-0.017032498, 0.0084469365, 0.07423645, -0.01...",[-0.017032498 0.0084469365 0.07423645 -0.01032...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,iyumis cells i iepisode extremely polite yumii...,"[-0.04869099, -0.04203135, 0.06548537, 0.03954...",[-0.04869099 -0.04203135 0.06548537 0.03954125...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with opensubtitles free...,"[-0.043026306, -0.03555924, 0.059036024, -0.04...",[-0.043026306 -0.03555924 0.059036024 -0.04153...
4,9180600,broker.(2022).eng.1cd,watch any video online with opensubtitles free...,"[-0.024402311, 0.013387982, 0.123676725, -0.02...",[-0.024402311 0.013387982 0.123676725 -0.02826...
...,...,...,...,...,...
8244,9215813,gossamer.folds.(2020).eng.1cd,apiopensubtitlesorg is deprecated please imple...,"[0.077503555, -0.041504238, 0.03078112, -0.035...",[0.077503555 -0.041504238 0.03078112 -0.035498...
8245,9215815,gossamer.folds.(2020).eng.1cd,advertise your product or brand here contact w...,"[0.033693567, -0.08421364, 0.07477469, -0.0781...",[0.033693567 -0.08421364 0.07477469 -0.0781623...
8246,9215856,the.app.from.heaven.(2017).eng.1cd,watch any video online with opensubtitles free...,"[-0.06501349, 0.013136341, 0.05656268, -0.0415...",[-0.06501349 0.013136341 0.05656268 -0.0415723...
8247,9215897,orient.s02.e07.a.matter.of.trust.(2022).eng.1cd,script info title english us original script k...,"[0.005435122, -0.025132079, 0.015010854, -0.03...",[0.005435122 -0.025132079 0.015010854 -0.03842...


In [55]:
#Reset index for the DataFrame
df_10_percent_data = df_10_percent_data.reset_index(drop=True)
df_10_percent_data

Unnamed: 0,id,Movies/Series,subtitles,bert_embedding,bert_embedding_chunks
0,9180533,the.message.(1976).eng.1cd,watch any video online with opensubtitles free...,"[0.0429545, 0.047213517, -0.048022557, -0.0589...",[0.0429545 0.047213517 -0.048022557 -0.0589091...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah theres princess dawn and terry with the blo...,"[-0.017032498, 0.0084469365, 0.07423645, -0.01...",[-0.017032498 0.0084469365 0.07423645 -0.01032...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,iyumis cells i iepisode extremely polite yumii...,"[-0.04869099, -0.04203135, 0.06548537, 0.03954...",[-0.04869099 -0.04203135 0.06548537 0.03954125...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with opensubtitles free...,"[-0.043026306, -0.03555924, 0.059036024, -0.04...",[-0.043026306 -0.03555924 0.059036024 -0.04153...
4,9180600,broker.(2022).eng.1cd,watch any video online with opensubtitles free...,"[-0.024402311, 0.013387982, 0.123676725, -0.02...",[-0.024402311 0.013387982 0.123676725 -0.02826...
...,...,...,...,...,...
8244,9215813,gossamer.folds.(2020).eng.1cd,apiopensubtitlesorg is deprecated please imple...,"[0.077503555, -0.041504238, 0.03078112, -0.035...",[0.077503555 -0.041504238 0.03078112 -0.035498...
8245,9215815,gossamer.folds.(2020).eng.1cd,advertise your product or brand here contact w...,"[0.033693567, -0.08421364, 0.07477469, -0.0781...",[0.033693567 -0.08421364 0.07477469 -0.0781623...
8246,9215856,the.app.from.heaven.(2017).eng.1cd,watch any video online with opensubtitles free...,"[-0.06501349, 0.013136341, 0.05656268, -0.0415...",[-0.06501349 0.013136341 0.05656268 -0.0415723...
8247,9215897,orient.s02.e07.a.matter.of.trust.(2022).eng.1cd,script info title english us original script k...,"[0.005435122, -0.025132079, 0.015010854, -0.03...",[0.005435122 -0.025132079 0.015010854 -0.03842...


## Preparing indexes

In [56]:
len(df_10_percent_data['bert_embedding_chunks'])

8249

In [57]:
def indexer(item):
    index = []
    temp_df = df_10_percent_data[df_10_percent_data['id'] == item]
    if not temp_df.empty:
        temp_index = temp_df.index[0]
        print("Temp index:", temp_index)
        bert_chunks_length = len(df_10_percent_data["bert_embedding_chunks"])
        print("Shape of bert_embedding_chunks DataFrame:", df_10_percent_data["bert_embedding_chunks"].shape)
        if temp_index < bert_chunks_length:
            for j in range(len(df_10_percent_data["bert_embedding_chunks"].iloc[temp_index])):
                index.append(str(item) + "-" + str(j))
            return index
        else:
            print("Index out of bounds for item:", item)
    else:
        print("Item not found in DataFrame:", item)
    return []

df_10_percent_data['num_list'] = df_10_percent_data['id'].apply(indexer)

Temp index: 0
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 1
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 2
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 3
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 4
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 5
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 6
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 7
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 8
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 9
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 10
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 11
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 12
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 13
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 14
Shape of bert_embedding_chunks DataFrame: (8249,)
Temp index: 15
Shape of bert_embedd
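
The `indexer` function looks up every id by scanning the whole DataFrame and prints two lines per row, which explains the long log above. The ids themselves only depend on the document id and the number of chunks, so they can be built directly. A minimal sketch, with hypothetical helper names:

```python
def make_chunk_ids(doc_id, chunks):
    """Build ids of the form '<doc_id>-<chunk position>' for one document."""
    return [f"{doc_id}-{position}" for position in range(len(chunks))]

def parent_id(chunk_id):
    """Recover the document id from a chunk id."""
    return chunk_id.rsplit("-", 1)[0]

example_ids = make_chunk_ids(9180533, ["chunk a", "chunk b", "chunk c"])
print(example_ids)
print(parent_id(example_ids[-1]))
```

In the notebook this could be applied row by row, for example with `df_10_percent_data.apply(lambda row: make_chunk_ids(row['id'], row['bert_embedding_chunks']), axis=1)`, which avoids the repeated lookups.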

## Storing embeddings in a ChromaDB database

In [58]:
!pip install chromadb



ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\deshp\\anaconda3\\Lib\\site-packages\\~7mpy\\.libs\\libopenblas.EL2C6PLE4ZYW3ECEVIV3OXXGRN2NRFM2.gfortran-win_amd64.dll'
Consider using the `--user` option or check the permissions.




Collecting numpy>=1.22.5
  Using cached numpy-1.26.4-cp39-cp39-win_amd64.whl (15.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4


In [59]:
pip install protobuf==3.20.0

Collecting protobuf==3.20.0
  Using cached protobuf-3.20.0-cp39-cp39-win_amd64.whl (904 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.25.3
    Uninstalling protobuf-4.25.3:
      Successfully uninstalled protobuf-4.25.3
Note: you may need to restart the kernel to use updated packages.


ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\deshp\\anaconda3\\Lib\\site-packages\\google\\~-pb\\_message.cp39-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



In [60]:
pip install --user protobuf==3.20.0

Note: you may need to restart the kernel to use updated packages.




In [63]:
import chromadb
import os

In [66]:
# `path` is the directory where Chroma persists its data (it creates chroma.sqlite3 inside it)
client = chromadb.PersistentClient(path="C:/Users/deshp/Downloads/data/chromadb_saved_file")

In [67]:
collection = client.get_or_create_collection(
        name="subtitles",
        metadata={"hnsw:space": "cosine"} # l2 is the default
    )


In [68]:
def add_func(batch_size=500):
    data = df_10_percent_data  # the subset that actually holds the embeddings
    # One record per subtitle file, because there is one embedding per file
    for start in range(0, len(data), batch_size):
        batch = data.iloc[start:start + batch_size]
        collection.add(
            ids=batch['id'].astype(str).tolist(),  # unique 'num' id, as a string
            embeddings=[emb.tolist() for emb in batch['bert_embedding']],  # one vector per id
            documents=batch['subtitles'].astype(str).tolist(),  # cleaned subtitle text
            metadatas=[{"name": name} for name in batch['Movies/Series']],  # title for display
        )

add_func()

In [69]:
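# Embed the query with the same SentenceTransformer model that produced the stored embeddings
results = collection.query(query_embeddings=[model.encode("This is a query document").tolist()], n_results=10)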

In [70]:
# Save the processed 10% subset (not the full `df`) to CSV
df_10_percent_data.to_csv('df_10_percentage.csv', index=False, escapechar='\\')

# Providing a download link
from IPython.display import FileLink
FileLink('df_10_percentage.csv')