# **Project Hint - Reading the Data from Database**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# from google.colab import drive
# drive.flush_and_unmount()

In [None]:
import sqlite3
import pandas as pd

In [None]:
!ls -l /eng_subtitles_database.db

-rw-r--r-- 1 root root 59768832 Apr 10 05:22 /eng_subtitles_database.db


In [None]:
!wc -c /eng_subtitles_database.db

59768832 /eng_subtitles_database.db


## **Step 1 - Reading the Tables from Database file**

In [None]:
# Read the code below and write your observation in the next cell

# The file sits at the filesystem root (see the `ls` output above), so connect there.
conn = sqlite3.connect('/eng_subtitles_database.db')
cursor = conn.cursor()
# In sqlite_master, `type` is the object kind ('table', 'index', ...); `name` is the table name.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

**The query above lists the tables in the database. As mentioned earlier, the table name is `zipfiles`. We also know from README.txt that this table contains three columns: `num`, `name` and `content`.**
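
As a minimal sketch of how table discovery works, the snippet below builds a throwaway in-memory database (an assumption for illustration, not the project's file) and lists its tables. In `sqlite_master`, the `type` column holds the object kind (`'table'`, `'index'`, ...) and the `name` column holds the object's name, so the filter goes on `type`.

```python
import sqlite3

# In-memory stand-in for the real database file (illustrative only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")

# Filter on type='table'; the table names come back in the `name` column
rows = demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
table_names = [row[0] for row in rows]
print(table_names)
demo_conn.close()
```

An empty list from this query usually means the connection points at a path where no database existed, because `sqlite3.connect` silently creates a new empty file.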

## **Step 2 - Reading the columns of Table**

In [None]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

**The above code helps in checking the column names in the database table.**
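
Each row returned by `PRAGMA table_info` is a tuple of `(cid, name, type, notnull, dflt_value, pk)`, which is why `col[1]` is the column name. A small sketch on an in-memory table follows; the schema is assumed from the README description, not read from the real file.

```python
import sqlite3

# In-memory stand-in table; column types here are an assumption
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")

# Each row: (cid, name, type, notnull, dflt_value, pk)
info_rows = demo_conn.execute("PRAGMA table_info('zipfiles')").fetchall()
column_names = [row[1] for row in info_rows]
column_types = [row[2] for row in info_rows]
print(column_names)
print(column_types)
demo_conn.close()
```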

**Let's now use `SELECT * FROM zipfiles` to read all the data into a `df` variable.**

## **Step 3 - Loading the Database Table inside a Pandas DataFrame**

In [None]:
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


**It looks like the `content` column does not contain the subtitle text directly. Instead, as mentioned in README.txt, it might be latin-1 encoded.**

## **Step 4 - Printing `content` of 0th Row**

In [None]:
b_data = df.iloc[0, 2]

# 2 is the positional index of the `content` column
# 0 is the row position

In [None]:
print(b_data)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x99V\x9fx\x96\xf0\x8c\x9e\x00\x00\x86\x9b\x01\x00;\x00\x00\x00The.Message.1976.REMASTERED.1080p.BluRay.x264-PiGNUS.EN.srt\xad\xbdm\x93\xdc\xc6\x91.\xfa\x9d\x11\xfc\x0f-}\xe1=\x11-\x9d\x06P\x85\x17\x9d\x8d\xd5%%[\xa4-Y>&u\x15>\xdf\xd0\xd3\x98\x19x\xfae\x0cts<\xfe\xf57\x9f\'\xb3\n\xd9\xa4\xbc\xbb\xf7\xc6Fl\xacELW\xa2\xaa\x90\x95\x95\xafO\x16/_l6\xdf\xe0\xff\xea\xf5f\xb3Y}\xf5\xd5\xbf\xaf\xf4AQ\xae7Mx\xf9\xe2\xd7\xfe|s\xbf\xea\x8f\xcf\xab\x8f\xe3n8\xadN\xc7\xfdx\x1cVO\xe3\xf9~\xf5\xf3\xe3p\xfc\xea\xfd/o>\xbc\xfb\xf0\xe3\xef\xde\xbf|\xf1\xfbi\x18Vo\xa6\xd3\xd3<L\xab\xe1\x1f\xe7\xe18\x8f\xa7\xe37\xab\xd3\xbc\xdb~-\xc3\x1e\xfe\xa7<|\xf9\xe2\xe5\x8bR_[~S\xd6\xeb\xa2k\xf3k\xe5A\xb7\xeeb\xf5\xf2\xc5\xbb\xe3\xea|?\xac\x8e\xfdaX\x9dnW?\x9cvk>8\x9c\xe6\xf3\xean\xeao\xc6\xd3ev\x8f~\x1a\xa6\x9b\xf1\xf6\xb2\xff\x1a\xe4\xabD\xbe*d\x11\xa5#_U\xeb\xaa\xd9`\xa6\xa7\xc3\xea\xa7\xcb}\x7f8\xf4F\xf9\xa7a\x9e\x87\xe3\x9d\xcc\\\xdf\x07B!\x13\xaa\xd61n<!\xd9\xaf\xd0\

**The content starts with the bytes `PK\x03\x04`, which is the signature of a ZIP local file header, so each value is most likely a ZIP archive. How do I know it? Experience! I have worked with something similar earlier.**
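
The guess can also be checked programmatically instead of by eye. Below is a minimal sketch, using a small archive built in memory as a stand-in for one `content` value; the helper name `looks_like_zip` and the sample file are hypothetical.

```python
import io
import zipfile

def looks_like_zip(data: bytes) -> bool:
    """Return True if the bytes start with the ZIP local-file-header
    signature and the zipfile module can locate a valid archive."""
    return data[:4] == b"PK\x03\x04" and zipfile.is_zipfile(io.BytesIO(data))

# Build a tiny ZIP archive in memory to stand in for a database blob
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("sample.srt", "1\n00:00:01,000 --> 00:00:02,000\nHello\n")
sample_blob = buffer.getvalue()

print(looks_like_zip(sample_blob))
print(looks_like_zip(b"plain text, not an archive"))
```

Running such a check over the whole column before unzipping would flag any rows that are not archives.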

## **Step 5 - Unzipping the content of 385th row and decoding using `latin-1`**

In [None]:
import zipfile
import io

# Assuming 'content' is the binary data from your database
binary_data = df.iloc[385, 2]

# Decompress the binary data using the zipfile module
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        # Reading only one file in the ZIP archive
        subtitle_content = zip_file.read(zip_file.namelist()[0])

# Now 'subtitle_content' should contain the extracted subtitle content
print(subtitle_content.decode('latin-1'))  # Assuming the content is latin-1 encoded text

1
00:00:06,000 --> 00:00:12,074
Watch any video online with Open-SUBTITLES
Free Browser extension: osdb.link/ext

2
00:00:15,370 --> 00:00:16,506
You lose everything, my girl.

3
00:00:16,530 --> 00:00:19,360
So you've said - four times.

4
00:00:20,330 --> 00:00:22,120
I definitely had
it on yesterday.

5
00:00:22,465 --> 00:00:25,785
Your gloves, your keys, that
handkerchief I embroidered for you

6
00:00:25,809 --> 00:00:26,168
Everything!

7
00:00:26,192 --> 00:00:27,280
Five times.

8
00:00:31,610 --> 00:00:32,920
Miss Scarlet?
- Yes.

9
00:00:36,390 --> 00:00:37,390
I'm Miss Scarlet.

10
00:00:37,872 --> 00:00:40,880
May I inquire if
you've lost something?

11
00:00:41,350 --> 00:00:42,530
Some jewellery perhaps?

12
00:00:42,870 --> 00:00:45,130
Yes, my mother's wedding ring.

13
00:00:45,220 --> 00:00:45,840
Have you found it?

14
00:00:45,950 --> 00:00:47,656
Does your ring have
an inscription?

15
00:00:48,650 --> 00:00:51,720
From my father to my mother 'For
my beloved, Livi

**Looks like it worked.**

## **Step 6 - Applying the above Function on the Entire Data**

In [None]:
import zipfile
import io

count = 0

def decode_method(binary_data):
    global count
    # Decompress the binary data using the zipfile module
    # print(count, end=" ")
    count += 1
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])

    # Now 'subtitle_content' holds the extracted subtitle bytes
    return subtitle_content.decode('latin-1')  # Assuming the content is latin-1 encoded text

In [None]:
df['file_content'] = df['content'].apply(decode_method)

df.head()

Unnamed: 0,num,name,content,file_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."
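
The `ï»¿` at the start of row 4 is what a UTF-8 byte-order mark (bytes `EF BB BF`) looks like when decoded as latin-1, which suggests at least some files are UTF-8 rather than latin-1. One possible way to handle both, sketched under the assumption that every file is either UTF-8 or latin-1 (the helper name `decode_subtitle` is hypothetical), is to try UTF-8 first and fall back:

```python
def decode_subtitle(raw: bytes) -> str:
    """Decode subtitle bytes, preferring UTF-8 and falling back to latin-1."""
    try:
        # 'utf-8-sig' also strips a leading byte-order mark if present
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails
        return raw.decode("latin-1")

with_bom = b"\xef\xbb\xbf1\r\n00:00:06,000 --> 00:00:12,074\r\n"
latin1_only = "caf\u00e9".encode("latin-1")

print(repr(decode_subtitle(with_bom)))
print(decode_subtitle(latin1_only))
```

The fallback order matters: decoding as latin-1 never raises, so it has to come last.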


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   num           82498 non-null  int64 
 1   name          82498 non-null  object
 2   content       82498 non-null  object
 3   file_content  82498 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.5+ MB


In [None]:
df.tail()

Unnamed: 0,num,name,content,file_content
82493,9521935,the.prophets.game.(2000).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xb8\xa6\x...,"ï»¿1\r\n00:01:16,284 --> 00:01:19,537\r\nGod,\..."
82494,9521937,west.beirut.(1998).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x13\x97\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\napi.Open..."
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00$\x97\x9aV...,"1\r\n00:00:01,001 --> 00:00:04,630\r\n(Dramati..."
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x97\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
82497,9521941,zombie.island.massacre.(1984).eng.1cd,"b'PK\x03\x04\x14\x00\x00\x00\x08\x00,\x97\x9aV...","1\r\n00:00:01,919 --> 00:00:03,253\r\n(Sharp w..."


# Clean the "file_content" by removing HTML tags

In [None]:
from tqdm import tqdm
tqdm.pandas()

# Next: remove special characters and digits, and apply lemmatization.

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [None]:
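# WordNetLemmatizer needs the WordNet corpus; download it once if it is missing
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()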

In [None]:
def preprocess(text):
    # Removing special characters and digits
    sentence = re.sub("[^a-zA-Z]", " ", text)

    # change sentence to lower case
    sentence = sentence.lower()

    # split on whitespace to get word tokens
    tokens = sentence.split()

    # reduce each token to its dictionary form (lemma)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    preprocessed_text = ' '.join(tokens)

    return preprocessed_text
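
Because `preprocess` replaces every non-letter with a space, an HTML tag such as `<i>` survives as a stray token `i`. Below is a hedged sketch of an extra step that could run before `preprocess`, removing tags and SRT timing patterns first. The helper name `strip_srt_markup` and its regular expressions are illustrative assumptions, not part of the original pipeline.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # HTML-style tags such as <i> or </i>
TIMING_RE = re.compile(
    r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)

def strip_srt_markup(text: str) -> str:
    """Replace HTML-style tags and SRT timing patterns with spaces."""
    text = TAG_RE.sub(" ", text)
    text = TIMING_RE.sub(" ", text)
    return text

sample = "1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi's Cells</i>\r\n"
cleaned = strip_srt_markup(sample)
print(repr(cleaned))
```

The cue index (`1`) is left in place here because the digit filter inside `preprocess` already removes it.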


In [None]:
# `content` holds raw ZIP bytes, so it is not preprocessed as text
# df['clean_content'] = df['content'].progress_apply(preprocess)
df['clean_name'] = df['name'].progress_apply(preprocess)
df['clean_file_content'] = df['file_content'].progress_apply(preprocess)

100%|█████████████████████████████████████████████████████████████████████████| 82498/82498 [00:05<00:00, 13937.23it/s]
100%|████████████████████████████████████████████████████████████████████████████| 82498/82498 [50:59<00:00, 26.97it/s]


In [None]:
# df['clean_content'] = df['content'].progress_apply(preprocess)

# This raises a TypeError: `content` holds bytes, and `re.sub` with a str pattern needs a str

In [None]:
df.head()

Unnamed: 0,num,name,content,file_content,clean_name,clean_file_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",the message eng cd,watch any video online with open subtitle free...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",here come the grump s e joltin jack in boxia e...,ah there s princess dawn and terry with the bl...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",yumis cell s e episode eng cd,i yumi s cell i i episode extremely polite yum...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",yumis cell s e episode eng cd,watch any video online with open subtitle free...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",broker eng cd,watch any video online with open subtitle free...


In [None]:
# import pandas as pd
# from bs4 import BeautifulSoup

# def remove_html_tags(html_text):
#     # Parse the HTML text
#     soup = BeautifulSoup(html_text, 'html.parser')

#     # Extract the text content without HTML tags
#     text_content = soup.get_text(separator=' ', strip=True)

#     return text_content

# # df['Text_Content'] = df['File_Content'].progress_apply(remove_html_tags)



In [None]:
# s = str(df.content)

In [None]:
# htmltext = s.read().decode('utf-8')

In [None]:
# df['clean_content'] = df['content'].progress_apply(remove_html_tags)

In [None]:
# df['clean_content'] = df['content'].progress_apply(preprocess)


In [None]:
df.to_csv('df.csv')

In [46]:
import pandas as pd

In [47]:
df = pd.read_csv("df_30k.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,num,name,file_content
0,0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...
1,1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...
2,2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...
3,3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...
4,4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...


In [48]:
df.columns

Index(['Unnamed: 0', 'num', 'name', 'file_content'], dtype='object')

In [49]:
df.drop(columns='Unnamed: 0', inplace=True)

In [50]:
df.head()

Unnamed: 0,num,name,file_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...


In [None]:
# import pandas as pd

# # Assuming df is your DataFrame

# # Define the size of each chunk
# chunk_size = 4

# # Create an iterator to iterate through chunks
# chunk_iterator = pd.read_csv('/content/drive/MyDrive/df_30k.csv', chunksize=chunk_size)

# # Iterate over the chunks
# for i, chunk in enumerate(chunk_iterator):
#     # Process each chunk as needed
#     # For example, you can apply operations on each chunk or save them to separate files
#     # For demonstration, let's print the first few rows of each chunk
#     print(f"Chunk {i+1}:")
#     display(chunk.head())  # Print the first few rows of each chunk


In [None]:
# chunk.head()

In [None]:
# #non -overlapping chunk
# import pandas as pd

# # Assuming df contains a column named 'srt_content' containing the content of each .srt file
# # You may need to adjust column names accordingly

# # Define a function to chunk the .srt content
# def chunk_srt(srt_content, chunk_size=1000):
#     chunks = []
#     current_chunk = ''
#     for line in srt_content.split('\n'):
#         current_chunk += line + '\n'
#         if line.strip() == '':
#             if current_chunk.strip() != '':
#                 chunks.append(current_chunk)
#                 current_chunk = ''
#     if current_chunk.strip() != '':
#         chunks.append(current_chunk)
#     return chunks

# # Apply the chunking function to each .srt file content in the DataFrame
# df['chunks'] = df['srt_content'].apply(lambda x: chunk_srt(x))

# # Now df['chunks'] contains a list of chunks for each .srt file
# # Each chunk is a string containing a part of the .srt content


In [None]:
# #overlapping chunks
# import pandas as pd

# # Assuming df contains a column named 'file_content' containing the content of each .srt file

# # Define a function to create overlapping chunks for each .srt file content
# def overlapping_chunks(file_content, chunk_size=4, overlap_size=5):
#     chunks = []
#     lines = file_content.split('\n')
#     total_lines = len(lines)
#     for i in range(0, total_lines, chunk_size - overlap_size):
#         chunk = '\n'.join(lines[i:i + chunk_size])
#         chunks.append(chunk)
#     return chunks

# # Apply the overlapping chunks function to each .srt file content in the DataFrame
# df['overlapping_chunks_file_content'] = df['file_content'].apply(lambda x: overlapping_chunks(x))

# # Now df['overlapping_chunks'] contains a list of overlapping chunks for each .srt file
# # Each chunk is a string containing a part of the .srt content with overlap


In [None]:
df.head()

Unnamed: 0,num,name,file_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...


In [None]:
# import pandas as pd

# # Assuming df contains a column named 'srt_content' containing the content of each .srt file
# # You may need to adjust column names accordingly

# # Define a function to create overlapping chunks for each .srt file content
# def overlapping_chunks(file_content, chunk_size=4, overlap_size=50):
#     chunks = []
#     lines = file_content.split('\n')
#     total_lines = len(lines)
#     for i in range(0, total_lines - chunk_size + 1, chunk_size - overlap_size):
#         chunk = '\n'.join(lines[i:i + chunk_size])
#         chunks.append(chunk)
#     return chunks

# # Apply the overlapping chunks function to each .srt file content in the DataFrame
# df['overlapping_chunks_file_content'] = df['file_content'].apply(lambda x: overlapping_chunks(x))

# # Now df['overlapping_chunks'] contains a list of overlapping chunks for each .srt file
# # Each chunk is a string containing a part of the .srt content with overlap
# df.head()

Unnamed: 0,num,name,file_content,overlapping_chunks_file_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...


In [18]:
# df = df.loc[:200]
# df.reset_index(inplace=True)
df.head()

Unnamed: 0,num,name,file_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 10000 to 20000
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   num              10001 non-null  int64 
 1   name             10001 non-null  object
 2   file_content     10001 non-null  object
 3   chunked_content  10001 non-null  object
dtypes: int64(1), object(3)
memory usage: 312.7+ KB


In [51]:
def chunk_document(document, token_window=250, overlap=50):
    """
    Chunk a large document into smaller chunks with specified token window size and overlap.
    
    Parameters:
        document (str): The large document to be chunked.
        token_window (int): The desired token window size for each chunk.
        overlap (int): The amount of overlap (in tokens) between consecutive chunks.
    
    Returns:
        List of smaller document chunks.
    """
    if not 0 <= overlap < token_window:
        raise ValueError("overlap must be non-negative and smaller than token_window")

    # Split the document into whitespace-separated tokens
    tokens = document.split()

    chunks = []
    start_idx = 0

    # Slide a window of `token_window` tokens, advancing by (token_window - overlap)
    while start_idx < len(tokens):
        # Calculate the end index of the chunk
        end_idx = min(start_idx + token_window, len(tokens))

        # Append the chunk to the list
        chunks.append(' '.join(tokens[start_idx:end_idx]))

        # Stop once the end is reached, so no redundant tail chunk is emitted
        if end_idx == len(tokens):
            break

        # Move the start index forward, keeping `overlap` tokens from this chunk
        start_idx += token_window - overlap

    return chunks

# Apply document chunking to the 'file_content' column of the sampled DataFrame
df['chunked_content'] = df['file_content'].apply(chunk_document)
df.head()

Unnamed: 0,num,name,file_content,chunked_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
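
To see how the window and overlap interact, here is a small standalone sketch of the same sliding-window idea with tiny numbers: a window of 5 tokens and an overlap of 2, so each chunk starts 3 tokens after the previous one. The function name `sliding_chunks` is hypothetical, and this version stops once a chunk reaches the end of the token list.

```python
def sliding_chunks(tokens, window=5, overlap=2):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reached the end
        start += step
    return chunks

demo_tokens = [f"t{i}" for i in range(12)]
demo_chunks = sliding_chunks(demo_tokens)
for chunk in demo_chunks:
    print(chunk)
```

The last two tokens of each chunk reappear as the first two of the next, which is what keeps a sentence that straddles a boundary visible in at least one chunk.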


In [44]:
df.head()

Unnamed: 0,index,num,name,file_content,chunked_content
0,10000,9223792,recess.(1997).eng.1cd,bell ring child cheer ah api opensubtitles org...,[bell ring child cheer ah api opensubtitles or...
1,10001,9223793,recess.(1997).eng.1cd,bell ring door slammed and child screaming pop...,[bell ring door slammed and child screaming po...
2,10002,9223794,recess.(1997).eng.1cd,bell ring child cheering waa umph support u an...,[bell ring child cheering waa umph support u a...
3,10003,9223795,recess.(1997).eng.1cd,bell ringing kid screaming uh oh advertise you...,[bell ringing kid screaming uh oh advertise yo...
4,10004,9223796,recess.(1997).eng.1cd,bell ring child cheer waah watch any video onl...,[bell ring child cheer waah watch any video on...


In [20]:
df.chunked_content

0        [watch any video online with open subtitle fre...
1        [ah there s princess dawn and terry with the b...
2        [i yumi s cell i i episode extremely polite yu...
3        [watch any video online with open subtitle fre...
4        [watch any video online with open subtitle fre...
                               ...                        
29996    [no audio light music light music light music ...
29997    [api opensubtitles org is deprecated please im...
29998    [watch any video online with open subtitle fre...
29999    [ah thanks tracy taking an early lunch jeff ye...
30000    [suspenseful music playing advertise your prod...
Name: chunked_content, Length: 30001, dtype: object

In [11]:
# ! pip install -U sentence-transformers



In [52]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

In [53]:
def encode_text(text):
    """Encode a string, or a list of strings, into embedding vector(s)."""
    embedding = model.encode(text)
    return embedding

In [61]:
df1 = df.loc[:2000, :].copy()  # copy so that adding a column does not touch a slice of df
df1.head()

Unnamed: 0,num,name,file_content,chunked_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...


In [62]:
df1['embedding'] = df1['chunked_content'].apply(encode_text)
df1.head()



Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [63]:
df2 = df.loc[2001:4000, :].copy()  # copy so that adding a column does not touch a slice of df
df2.head()

Unnamed: 0,num,name,file_content,chunked_content
2001,9189883,betty.danger.vice.cop.(2020).eng.1cd,cricket chirping upbeat music api opensubtitle...,[cricket chirping upbeat music api opensubtitl...
2002,9189909,the.teenie.weenie.bikini.squad.(2012).eng.1cd,twangy rock music api opensubtitles org is dep...,[twangy rock music api opensubtitles org is de...
2003,9189910,bikini.jones.and.the.temple.of.eros.(2010).eng...,projector rattling dramatic music explosion wh...,[projector rattling dramatic music explosion w...
2004,9189912,the.seduction.of.rose.parrish.(2021).eng.1cd,bat squeaking wolf howling mysterious music su...,[bat squeaking wolf howling mysterious music s...
2005,9189919,off.the.lip.(2004).eng.1cd,upbeat music graphic whooshing wave crashing s...,[upbeat music graphic whooshing wave crashing ...


In [64]:
df2['embedding'] = df2['chunked_content'].apply(encode_text)
df2.head()



Unnamed: 0,num,name,file_content,chunked_content,embedding
2001,9189883,betty.danger.vice.cop.(2020).eng.1cd,cricket chirping upbeat music api opensubtitle...,[cricket chirping upbeat music api opensubtitl...,"[[-0.03349987, -0.061510466, 0.029355139, -0.0..."
2002,9189909,the.teenie.weenie.bikini.squad.(2012).eng.1cd,twangy rock music api opensubtitles org is dep...,[twangy rock music api opensubtitles org is de...,"[[-0.039968226, 0.035003703, -0.0064153518, -0..."
2003,9189910,bikini.jones.and.the.temple.of.eros.(2010).eng...,projector rattling dramatic music explosion wh...,[projector rattling dramatic music explosion w...,"[[-0.024186479, -0.022270437, -0.0075046164, -..."
2004,9189912,the.seduction.of.rose.parrish.(2021).eng.1cd,bat squeaking wolf howling mysterious music su...,[bat squeaking wolf howling mysterious music s...,"[[-0.040527362, -0.01084912, 0.06816627, -0.07..."
2005,9189919,off.the.lip.(2004).eng.1cd,upbeat music graphic whooshing wave crashing s...,[upbeat music graphic whooshing wave crashing ...,"[[-0.026077606, -0.058352977, 0.071203, 0.0218..."


In [65]:
df_new = pd.concat([df1,df2],ignore_index=True, axis=0)
df_new.head()

Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [66]:
df3 = df.loc[4001:6000, :].copy()  # copy so that adding a column does not touch a slice of df
df3['embedding'] = df3['chunked_content'].apply(encode_text)
df3.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
4001,9199218,empire.s03.e07.what.we.may.be.(2016).eng.1cd,i previously on i empire what is that i wouldn...,[i previously on i empire what is that i would...,"[[-0.041658957, -0.09661416, 0.0013693264, -0...."
4002,9199219,empire.s03.e08.the.unkindest.cut.(2016).eng.1cd,i previously on i empire i you got shot i your...,[i previously on i empire i you got shot i you...,"[[-0.057505704, -0.04033067, 0.01015596, -0.05..."
4003,9199220,empire.s03.e09.a.furnace.for.your.foe.(2016).e...,i previously on i empire i want you to write a...,[i previously on i empire i want you to write ...,"[[-0.042650774, -0.03094323, -0.0051559196, -0..."
4004,9199221,empire.s03.e10.sound.fury.(2017).eng.1cd,i previously on i empire we are empire s origi...,[i previously on i empire we are empire s orig...,"[[-0.052542172, -0.08518107, 0.010912725, -0.0..."
4005,9199222,empire.s03.e11.play.on.(2017).eng.1cd,i previously on i empire i inferno i is an alb...,[i previously on i empire i inferno i is an al...,"[[-0.0542408, -0.12611762, -0.048472945, -0.13..."


In [67]:
df_new = pd.concat([df_new,df3],ignore_index=True, axis=0)
df_new.head()

Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [68]:
df4 = df.loc[6001:8000, :].copy()  # copy the slice so the new column is not written to a view of df
df4['embedding'] = df4['chunked_content'].apply(encode_text)
df4.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
6001,9207415,the.jeffersons.s06.e01.the.announcement.(1979)...,i well we re movin on up movin on up i i to th...,[i well we re movin on up movin on up i i to t...,"[[-0.06978582, -0.029881814, 0.016883716, -0.0..."
6002,9207416,the.jeffersons.s06.e02.a.short.story.(1979).en...,i well we re movin on up movin on up i i to th...,[i well we re movin on up movin on up i i to t...,"[[-0.09514737, -0.04934125, -0.0064340895, -0...."
6003,9207417,the.jeffersons.s06.e03.louises.old.boyfriend.(...,i well we re movin on up movin on up i i to th...,[i well we re movin on up movin on up i i to t...,"[[-0.025038512, -0.04996716, -0.016983239, -0...."
6004,9207418,the.jeffersons.s06.e04.now.you.see.it.now.you....,i well we re movin on up movin on up i i to th...,[i well we re movin on up movin on up i i to t...,"[[-0.065551475, -0.0044518393, 0.00813088, -0...."
6005,9207419,the.jeffersons.s06.e05.now.you.see.it.now.you....,i well we re movin on up movin on up i i to th...,[i well we re movin on up movin on up i i to t...,"[[-0.05058959, -0.03531585, -0.0012658216, -0...."


In [69]:
df_new = pd.concat([df_new,df4],ignore_index=True, axis=0)
df_new.head()

Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [70]:
df5 = df.loc[8001:10000, :].copy()  # copy the slice so the new column is not written to a view of df
df5['embedding'] = df5['chunked_content'].apply(encode_text)
df5.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
8001,9214938,united.states.of.tara.s02.e10.open.house.(2010...,i previously on united state of tara i so on t...,[i previously on united state of tara i so on ...,"[[-0.07934143, -0.099133424, 0.017863601, -0.0..."
8002,9214939,united.states.of.tara.s02.e11.to.have.and.to.h...,i previously on united state of tara i let s g...,[i previously on united state of tara i let s ...,"[[-0.08671663, -0.08847381, 0.02160982, -0.005..."
8003,9214940,united.states.of.tara.s02.e12.from.this.day.fo...,i previously on united state of tara i max i r...,[i previously on united state of tara i max i ...,"[[-0.06490324, -0.06776245, 0.085882254, -0.01..."
8004,9214941,children.of.camp.blood.(2020).eng.1cd,suspenseful music advertise your product or br...,[suspenseful music advertise your product or b...,"[[0.011157505, -0.06438216, 0.024798673, -0.04..."
8005,9214943,united.states.of.tara.s03.e01.youwillnotwin.(2...,i previously on united state of tara i are you...,[i previously on united state of tara i are yo...,"[[-0.11709833, -0.12667039, 0.030737886, -0.00..."


In [71]:
df_new = pd.concat([df_new,df5],ignore_index=True, axis=0)
df_new.head()

Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [72]:
df6 = df.loc[10001:12000, :].copy()  # copy the slice so the new column is not written to a view of df
df6['embedding'] = df6['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df6],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [73]:
df7 = df.loc[12001:14000, :].copy()  # copy the slice so the new column is not written to a view of df
df7['embedding'] = df7['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df7],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [74]:
df8 = df.loc[14001:16000, :].copy()  # copy the slice so the new column is not written to a view of df
df8['embedding'] = df8['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df8],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [75]:
df9 = df.loc[16001:18000, :].copy()  # copy the slice so the new column is not written to a view of df
df9['embedding'] = df9['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df9],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [76]:
df10 = df.loc[18001:20000, :].copy()  # copy the slice so the new column is not written to a view of df
df10['embedding'] = df10['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df10],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [77]:
df11 = df.loc[20001:22000, :].copy()  # copy the slice so the new column is not written to a view of df
df11['embedding'] = df11['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df11],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [78]:
df12 = df.loc[22001:24000, :].copy()  # copy the slice so the new column is not written to a view of df
df12['embedding'] = df12['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df12],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [79]:
df13 = df.loc[24001:26000, :].copy()  # copy the slice so the new column is not written to a view of df
df13['embedding'] = df13['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df13],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [80]:
df14 = df.loc[26001:28000, :].copy()  # copy the slice so the new column is not written to a view of df
df14['embedding'] = df14['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df14],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [81]:
df15 = df.loc[28001:, :].copy()  # copy the slice so the new column is not written to a view of df
df15['embedding'] = df15['chunked_content'].apply(encode_text)
df_new = pd.concat([df_new,df15],ignore_index=True, axis=0)
df_new.head()


Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."
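
The cells above repeat the same three steps for every block of 2,000 rows. A loop could do the same work in one cell. The sketch below is only an illustration: `encode_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` stands in for the notebook's `encode_text` so the example runs without loading a model.

```python
import pandas as pd

def encode_in_batches(frame, encode, batch_size=2000, text_column="chunked_content"):
    """Apply `encode` to `text_column` one batch of rows at a time and return a new frame."""
    parts = []
    for start in range(0, len(frame), batch_size):
        # iloc slices by position and excludes the end, so batches never overlap
        batch = frame.iloc[start:start + batch_size].copy()
        batch["embedding"] = batch[text_column].apply(encode)
        parts.append(batch)
    return pd.concat(parts, ignore_index=True)

# Stand-in encoder for the demo: returns the number of chunks instead of real vectors
def fake_encode(chunks):
    return len(chunks)

demo = pd.DataFrame({"chunked_content": [["a", "b"], ["c"], ["d", "e", "f"], ["g"], ["h"]]})
result = encode_in_batches(demo, fake_encode, batch_size=2)
print(result["embedding"].tolist())
```

With the real data, the call would presumably look like `encode_in_batches(df, encode_text)`; each batch is copied before the new column is added, which also avoids the `SettingWithCopyWarning`.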


In [1]:
# df_new.to_csv('embedding.csv')

In [43]:
import pandas as pd

In [44]:
df_emb = pd.read_csv('embedding.csv')
df_emb.drop(columns = 'Unnamed: 0' , inplace = True)
df_emb.head()

Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,['watch any video online with open subtitle fr...,[[-0.04548948 0.10605931 -0.05575116 ... -0.0...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,['ah there s princess dawn and terry with the ...,[[-0.0816742 -0.00562783 0.04651853 ... -0.0...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,['i yumi s cell i i episode extremely polite y...,[[-1.4068541e-01 -1.5667278e-01 6.6977121e-02...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,['watch any video online with open subtitle fr...,[[-0.09335469 -0.10753913 0.04193733 ... 0.0...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,['watch any video online with open subtitle fr...,[[-0.08041111 -0.02626064 0.05431385 ... 0.1...


In [45]:
df_emb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   num              30001 non-null  int64 
 1   name             30001 non-null  object
 2   file_content     30001 non-null  object
 3   chunked_content  30001 non-null  object
 4   embedding        30001 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.1+ MB


In [53]:
df_emb.embedding[0]

'[[-0.04548948  0.10605931 -0.05575116 ... -0.03221984  0.03973673\n  -0.01685703]\n [ 0.01061917  0.11946609  0.0040393  ... -0.0344262  -0.01158409\n  -0.04801248]\n [-0.00196127  0.10313164 -0.0204013  ... -0.04394552  0.02881121\n  -0.03886633]\n ...\n [-0.02714827  0.13265875 -0.02498526 ... -0.06658785  0.00752406\n  -0.03631395]\n [-0.00024475  0.0624979  -0.02359401 ... -0.02522611  0.00748727\n  -0.03956302]\n [-0.03241341  0.00616958  0.00667954 ...  0.03616384  0.00052885\n   0.02536997]]'
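
The value above is a string, and the `...` inside it shows that most of the numbers were dropped when the array was written out as text. The following sketch, which uses random toy data rather than the real embeddings, shows the effect and one lossless alternative (NumPy's binary format). The shape `(5, 384)` assumes five chunks and the 384-value vectors that `all-MiniLM-L6-v2` produces.

```python
import io
import numpy as np

rng = np.random.default_rng(0)
chunk_vectors = rng.random((5, 384), dtype=np.float32)  # toy data: 5 chunks, 384 values each

# The printed form of a large array elides most values, and that text is what ends up in the CSV
as_text = str(chunk_vectors)
print("..." in as_text)

# Lossless alternative: write the array in NumPy's binary format and read it back
buffer = io.BytesIO()  # an in-memory file; a real file path works the same way
np.save(buffer, chunk_vectors)
buffer.seek(0)
restored = np.load(buffer)
print(np.array_equal(restored, chunk_vectors))
```

Because the elided values are not in the CSV at all, no amount of parsing can bring them back; the embeddings would need to be recomputed and saved in a binary format such as this one or a pickle.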

In [26]:
embeddings = df_emb.embedding

In [None]:

from sklearn.metrics.pairwise import cosine_similarity

In [46]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
def encode_text(text):
  embedding = model.encode(text)
  return embedding

In [47]:
dtm = df_emb.embedding

In [50]:

def similarity_finder(search_query, k, model, dtm):
    # dtm must be a 2D numeric array with one embedding per row, in the same row order as df_emb

    # Encode the search query into a sentence embedding
    search_query_vector = model.encode([search_query])

    # Calculate cosine similarity between the query and the documents
    similarity_scores = cosine_similarity(search_query_vector, dtm)

    # Sort the similarity scores and get the indices of most similar documents
    similar_doc_indices = similarity_scores.argsort()[0][::-1][:min(k, len(df_emb))]

    # Retrieve and return the names of the most similar documents
    similar_texts = [df_emb['name'].iloc[i] for i in similar_doc_indices]
    return similar_texts
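
`similarity_finder` expects one numeric vector per document, but each subtitle file here has several chunk vectors. One possible way to rank documents in that situation is sketched below with NumPy only. Scoring a document by its best-matching chunk is an assumption made for this example, not something the notebook prescribes, and `top_k_documents` is a hypothetical helper.

```python
import numpy as np

def top_k_documents(query_vec, doc_chunk_vecs, names, k=10):
    """Rank documents by the best cosine similarity among their chunk vectors."""
    query = np.asarray(query_vec, dtype=np.float32)
    query = query / np.linalg.norm(query)
    scores = []
    for chunks in doc_chunk_vecs:
        chunks = np.asarray(chunks, dtype=np.float32)
        # Normalise each chunk vector so the dot product equals cosine similarity
        normed = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
        scores.append(float((normed @ query).max()))
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], scores[i]) for i in order]

# Toy 2-dimensional vectors standing in for real embeddings
names = ["doc_a", "doc_b", "doc_c"]
doc_chunk_vecs = [
    [[1.0, 0.0], [0.0, 1.0]],  # two chunks
    [[-1.0, 0.0]],             # one chunk pointing away from the query
    [[0.6, 0.8]],              # one chunk partly aligned with the query
]
ranked = top_k_documents([1.0, 0.0], doc_chunk_vecs, names, k=2)
print(ranked)
```

In the notebook, the query vector would come from `model.encode(search_query)` and the chunk vectors from the numeric embeddings, once those are available as arrays rather than strings.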



In [51]:
search_query=input("Enter a Query: ")

similar_text = similarity_finder(search_query, 10, model, dtm)  # Get up to 10 most similar documents

for rank, name in enumerate(similar_text, start=1):
    print(rank, name)
    print()


Enter a Query:  peter


ValueError: could not convert string to float: '[[-0.04548948  0.10605931 -0.05575116 ... -0.03221984  0.03973673\n  -0.01685703]\n [ 0.01061917  0.11946609  0.0040393  ... -0.0344262  -0.01158409\n  -0.04801248]\n [-0.00196127  0.10313164 -0.0204013  ... -0.04394552  0.02881121\n  -0.03886633]\n ...\n [-0.02714827  0.13265875 -0.02498526 ... -0.06658785  0.00752406\n  -0.03631395]\n [-0.00024475  0.0624979  -0.02359401 ... -0.02522611  0.00748727\n  -0.03956302]\n [-0.03241341  0.00616958  0.00667954 ...  0.03616384  0.00052885\n   0.02536997]]'

In [31]:
# # Assuming you have already encoded your text using BERT and stored the embeddings in the 'embeddings' variable
# # Convert each embedding array to a byte string
# embedding_bytes = [embedding.tobytes() for embedding in embeddings]

# # Now, 'embedding_bytes' contains byte strings, which you can store in the SQLite database
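
The byte-string idea in the cell above can be made to round-trip if the array's shape and dtype are stored next to the bytes. The sketch below uses an in-memory SQLite database and a toy array. The table name `emb` and its columns are made up for this example.

```python
import sqlite3
import numpy as np

vectors = np.arange(12, dtype=np.float32).reshape(3, 4)  # toy stand-in: 3 chunks, 4 values each

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emb (num INTEGER PRIMARY KEY, n_chunks INTEGER, dim INTEGER, data BLOB)")
conn.execute(
    "INSERT INTO emb VALUES (?, ?, ?, ?)",
    (9180533, vectors.shape[0], vectors.shape[1], vectors.tobytes()),
)

# Read the row back and rebuild the array from the raw bytes
n_chunks, dim, blob = conn.execute(
    "SELECT n_chunks, dim, data FROM emb WHERE num = ?", (9180533,)
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32).reshape(n_chunks, dim)
conn.close()

print(np.array_equal(restored, vectors))
```

`tobytes()` drops the shape and dtype, so both have to be known when reading; here the shape is stored in the row and the dtype is fixed as `float32` by convention.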


In [30]:
# df_emb.embedding holds one string per row (the printed form of each array), not lists of floats
# Join all rows into a single comma-separated string
embedding_str = ','.join(map(str, df_emb.embedding))

# 'embedding_str' now contains every row's printed array, separated by commas
# Splitting it on ',' gives back bracketed multi-line text, not individual numbers


In [32]:
# Assuming 'embedding_str' is a string retrieved from the database
# Convert the string back to a list of floats
embedding = list(map(float, embedding_str.split(',')))


ValueError: could not convert string to float: '[[-0.04548948  0.10605931 -0.05575116 ... -0.03221984  0.03973673\n  -0.01685703]\n [ 0.01061917  0.11946609  0.0040393  ... -0.0344262  -0.01158409\n  -0.04801248]\n [-0.00196127  0.10313164 -0.0204013  ... -0.04394552  0.02881121\n  -0.03886633]\n ...\n [-0.02714827  0.13265875 -0.02498526 ... -0.06658785  0.00752406\n  -0.03631395]\n [-0.00024475  0.0624979  -0.02359401 ... -0.02522611  0.00748727\n  -0.03956302]\n [-0.03241341  0.00616958  0.00667954 ...  0.03616384  0.00052885\n   0.02536997]]'

In [39]:
import numpy as np
import re

# 'embedding_str' is the joined text of every printed array from the earlier cell
# Values hidden behind '...' in the printed arrays are not in the text and cannot be recovered
embedding_str = re.sub(r'[\[\]]', '', embedding_str)  # Remove square brackets
# Extract numbers as strings, including scientific notation such as -1.4068541e-01
numbers = re.findall(r'-?\d+\.\d+(?:[eE][-+]?\d+)?', embedding_str)

# Convert the list of number strings to floats
embedding = [float(num) for num in numbers]


In [40]:
len(embedding)

1305216

In [37]:
# from sklearn.metrics.pairwise import cosine_similarity

In [38]:
# # Assuming 'query_embedding' is the embedding of the user's query obtained from BERT
# # Assuming 'database_embeddings' is a list of embeddings retrieved from your database

# # Calculate cosine similarity between query embedding and database embeddings
# similarities = [cosine_similarity(query_embedding, embedding) for embedding in embedding]

# # Sort the results based on cosine similarity
# sorted_results = sorted(zip(similarities, database_records), key=lambda x: x[0], reverse=True)

# # Retrieve top results (e.g., top 10)
# top_results = sorted_results[:10]

# # Display the top results to the user
# for i, (similarity, record) in enumerate(top_results, start=1):
#     print(f"Top {i}: Similarity={similarity}, Record={record}")


NameError: name 'query_embedding' is not defined

In [20]:
import sqlite3

# Open (or create) a plain SQLite database file; despite its name, this is not a Chroma store
db_path = 'chromadb.db'
conn = sqlite3.connect(db_path)

# Save DataFrame to SQLite
df_emb.to_sql("df_n", conn, if_exists="replace", index=False)

print(f"DataFrame saved to table 'df_n' in SQLite database: {db_path}")

In [15]:
# import sqlite3

# # Specify the path to your desired location (e.g., /content/drive/MyDrive)
# db_path = 'chromadb.db'

# # Connect to the SQLite database (creates the file if it doesn't exist)
# conn = sqlite3.connect(db_path)
# conn.close()  # Close the connection

# print(f"SQLite database '{db_path}' created successfully.")

SQLite database 'chromadb.db' created successfully.


In [18]:
# import sqlite3

# try:
#     # Specify the path to your desired location (e.g., /content/drive/MyDrive)
#     db_path = r'E:\innomatics\AI ELITE 13\23.INTERNSHIP_INNO\Tasks\8_\chromadb.db'

#     # Connect to the SQLite database (creates the file if it doesn't exist)
#     conn = sqlite3.connect(db_path)
#     conn.close()  # Close the connection

#     print(f"SQLite database '{db_path}' created successfully.")
# except Exception as e:
#     print("An error occurred:", e)


SQLite database 'E:\innomatics\AI ELITE 13\23.INTERNSHIP_INNO\Tasks\8_\chromadb.db' created successfully.


In [1]:
# !pip install chromadb -U

In [5]:
!pip install sqlite3==3.35.0

ERROR: Could not find a version that satisfies the requirement sqlite3==3.35.0 (from versions: none)
ERROR: No matching distribution found for sqlite3==3.35.0
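
The install fails because `sqlite3` is part of Python's standard library and is not a package that pip can fetch. What matters is the version of the SQLite library that Python is linked against, which can be checked directly. In the sketch below, `REQUIRED` mirrors the version requested in the cell above, and `sqlite_is_new_enough` is a hypothetical helper.

```python
import sqlite3

REQUIRED = (3, 35, 0)  # the version requested in the failed pip command

def sqlite_is_new_enough(version_info=sqlite3.sqlite_version_info, required=REQUIRED):
    """Compare the SQLite library version Python is linked against with the required one."""
    return tuple(version_info) >= required

print("SQLite library version:", sqlite3.sqlite_version)
print("Meets requirement:", sqlite_is_new_enough())
```

If the check reports an older version, a workaround often used in hosted notebooks is to install the `pysqlite3-binary` package and swap it in for `sqlite3` before importing the library that needs it. Treat that as something to verify for the environment at hand rather than a guaranteed fix.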


In [6]:
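import csv
import re
import chromadb

# Open a local on-disk Chroma store (the directory and collection names are examples)
client = chromadb.PersistentClient(path='chroma_store')
collection = client.get_or_create_collection(name='subtitle_embeddings')

# Function to parse embeddings from CSV and store in ChromaDB
def store_embeddings_from_csv(csv_file, column_name, dim=384):
    number = re.compile(r'-?\d+\.\d*(?:[eE][-+]?\d+)?')
    with open(csv_file, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row_id, row in enumerate(reader):
            text = row[column_name]
            values = [float(v) for v in number.findall(text)]
            # The CSV holds the printed array; a '...' means values were dropped when it was saved
            if '...' in text or len(values) % dim:
                raise ValueError(f'row {row_id}: embedding text is incomplete, re-encode the chunks instead')
            # One vector of length dim per chunk (384 for all-MiniLM-L6-v2)
            vectors = [values[i:i + dim] for i in range(0, len(values), dim)]
            # Store embeddings in ChromaDB, one id per chunk
            collection.add(
                ids=[f'{row_id}-{j}' for j in range(len(vectors))],
                embeddings=vectors,
            )

# Specify your CSV file name and column name
csv_file = 'df_new_1.csv'
column_name = 'embedding'

# Call the function to store embeddings in ChromaDB
store_embeddings_from_csv(csv_file, column_name)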


RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.
Please visit https://docs.trychroma.com/troubleshooting#sqlite to learn how to upgrade.

In [8]:
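from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Embed the documents with the notebook's `model` (assumed to expose .encode and return an array)
        embeddings = model.encode(list(input)).tolist()
        return embeddings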

RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.
Please visit https://docs.trychroma.com/troubleshooting#sqlite to learn how to upgrade.

In [7]:
# !pip install --upgrade SomePackage

In [None]:
# import csv
# from chromadb import ChromaClient

# # Connect to ChromaDB
# client = ChromaClient('localhost', 6379)

# # Function to parse embeddings from CSV and store in ChromaDB
# def store_embeddings_from_csv(csv_file, column_name):
#     with open(csv_file, 'r') as file:
#         reader = csv.DictReader(file)
#         for row in reader:
#             embedding = row[column_name]
#             # Store embedding in ChromaDB
#             client.store_embedding(embedding)

# # Specify your CSV file name and column name
# csv_file = 'df_new_1.csv'
# column_name = 'embeddings'

# # Call the function to store embeddings in ChromaDB
# store_embeddings_from_csv(csv_file, column_name)


In [56]:
a = list(df1.embedding[0][0])
len(a)

384

In [57]:
a = list(df1.embedding[0][1])
len(a)

384

In [58]:
a = list(df1.embedding[1][0])
len(a)

384

In [60]:
a = list(df1.embedding[9][1])
len(a)

384
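
The spot checks above suggest that each row's `embedding` holds one 384-dimensional vector per chunk. A minimal sketch for checking every row at once instead of sampling by hand. The helper names and the toy data are illustrative stand-ins, not part of the notebook.

```python
def embedding_dims(chunk_vectors):
    """Return the set of vector lengths found among one row's chunk embeddings."""
    return {len(vector) for vector in chunk_vectors}


def all_rows_have_dim(rows, expected_dim):
    """True only if every row has at least one chunk and all chunks have expected_dim values."""
    return all(embedding_dims(chunks) == {expected_dim} for chunks in rows)


# Toy stand-in for df1.embedding: two rows with two chunks and one chunk respectively
toy_rows = [
    [[0.0] * 384, [0.1] * 384],
    [[0.2] * 384],
]
print(all_rows_have_dim(toy_rows, 384))
```

In the notebook the same check would be `all_rows_have_dim(df1.embedding, 384)`, assuming each cell is a sequence of vectors as the cells above indicate.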

In [44]:
df1.tail()

Unnamed: 0,num,name,file_content,chunked_content,embedding
996,9185839,the.rising.s01.e06.episode.1.6.(2022).eng.1cd,alex i he can t even see you i neve i he can n...,[alex i he can t even see you i neve i he can ...,"[[-0.08525071, -0.042879637, 0.07155471, -0.02..."
997,9185840,the.rising.s01.e07.episode.1.7.(2022).eng.1cd,we found trace of blood in room seven of keato...,[we found trace of blood in room seven of keat...,"[[-0.06101237, -0.0004974004, 0.004478765, -0...."
998,9185841,the.rising.s01.e08.episode.1.8.(2022).eng.1cd,alex neve neve stop it neve i victoria i i you...,[alex neve neve stop it neve i victoria i i yo...,"[[-0.05081622, -0.03507485, 0.02548889, -0.005..."
999,9185844,the.rising.s01.e01.episode.1.1.(2022).eng.1cd,use the free code joinnow at www playships eu ...,[use the free code joinnow at www playships eu...,"[[-0.05711211, -0.08905575, 0.029912557, -0.08..."
1000,9185845,the.rising.s01.e02.episode.1.2.(2022).eng.1cd,i my daughter hasn t come home i i her name s ...,[i my daughter hasn t come home i i her name s...,"[[-0.028831206, -0.11441327, 0.018926123, 0.03..."


In [45]:
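# .copy() gives df2 its own data, so adding a column later does not trigger SettingWithCopyWarning
df2 = df.loc[1001:2000, :].copy()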
df2.head()

Unnamed: 0,num,name,file_content,chunked_content
1001,9185846,the.rising.s01.e03.episode.1.3.(2022).eng.1cd,i my daughter hasn t come home i i her name s ...,[i my daughter hasn t come home i i her name s...
1002,9185847,the.rising.s01.e04.episode.1.4.(2022).eng.1cd,they say you were murdered there s so much i d...,[they say you were murdered there s so much i ...
1003,9185848,the.rising.s01.e05.episode.1.5.(2022).eng.1cd,i all i can think about is neve i i whatever s...,[i all i can think about is neve i i whatever ...
1004,9185849,the.rising.s01.e06.episode.1.6.(2022).eng.1cd,i he can t even see you i i he can now i i i d...,[i he can t even see you i i he can now i i i ...
1005,9185850,the.rising.s01.e07.episode.1.7.(2022).eng.1cd,we found trace of blood in room seven of keato...,[we found trace of blood in room seven of keat...


In [None]:
df2['embedding'] = df2['chunked_content'].apply(encode_text)
df2.head()

In [11]:
# Embeddings_1= model.encode(df.chunked_content)

In [34]:
df1.head()

Unnamed: 0,num,name,file_content,chunked_content,embedding
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.045489475, 0.10605931, -0.055751156, -0.0..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,[ah there s princess dawn and terry with the b...,"[[-0.0816742, -0.005627826, 0.046518527, -0.06..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,[i yumi s cell i i episode extremely polite yu...,"[[-0.14068541, -0.15667278, 0.06697712, -0.054..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.09335469, -0.10753913, 0.041937325, -0.03..."
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,[watch any video online with open subtitle fre...,"[[-0.08041111, -0.026260642, 0.054313846, -0.0..."


In [None]:
# BERT

In [None]:
# from transformers import BertTokenizer, BertModel
# import torch

# # Load pre-trained BERT tokenizer and model
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertModel.from_pretrained('bert-base-uncased')

# # Define a function to apply BERT on text data
# def apply_bert(text):
#     # Tokenize text
#     inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)

#     # Get BERT model outputs
#     outputs = model(**inputs)

#     # Extract relevant features from BERT output if needed
#     pooled_output = outputs.pooler_output

#     return pooled_output

# # Example usage: Apply BERT on the 'overlapping_chunks' column in your DataFrame
# df['bert_features'] = df['overlapping_chunks_file_content'].apply(lambda x: apply_bert(x))


In [4]:
# !pip install huggingface_hub
# !python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('MY_HUGGINGFACE_TOKEN_HERE')"

In [None]:
# import pandas as pd

# # Assuming you already have a DataFrame named 'df'

# # Example DataFrame
# # df = pd.DataFrame({'overlapping_chunks_file_content': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})

# # Explode the list-like column
# df = df.explode('overlapping_chunks_file_content')

# # Reset index if needed
# df.reset_index(drop=True, inplace=True)

# # Now each value in the list is in its own row
# print(df)


           num                                               name  \
0      9180533                         the.message.(1976).eng.1cd   
1      9180583  here.comes.the.grump.s01.e09.joltin.jack.in.bo...   
2      9180592    yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd   
3      9180594    yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd   
4      9180600                              broker.(2022).eng.1cd   
...        ...                                                ...   
29996  9301692                  renfroes.christmas.(1997).eng.1cd   
29997  9301705                              whiffs.(1975).eng.1cd   
29998  9301706                   american.murderer.(2022).eng.1cd   
29999  9301708                             rapture.(1993).eng.1cd   
30000  9301709                        shutterspeed.(2000).eng.1cd   

                                            file_content  \
0      watch any video online with open subtitle free...   
1      ah there s princess dawn and terry with the 

In [None]:
df.head()

Unnamed: 0,num,name,file_content,overlapping_chunks_file_content
0,9180533,the.message.(1976).eng.1cd,watch any video online with open subtitle free...,watch any video online with open subtitle free...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,ah there s princess dawn and terry with the bl...,ah there s princess dawn and terry with the bl...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,i yumi s cell i i episode extremely polite yum...,i yumi s cell i i episode extremely polite yum...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,watch any video online with open subtitle free...,watch any video online with open subtitle free...
4,9180600,broker.(2022).eng.1cd,watch any video online with open subtitle free...,watch any video online with open subtitle free...


In [None]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# import pandas as pd

# # Assuming df contains a column named 'text_column' containing the text data
# # You may need to adjust column names accordingly

# # Initialize the TF-IDF vectorizer
# tfidf_vectorizer = TfidfVectorizer()

# # Fit and transform the text data in your new column
# tfidf_matrix = tfidf_vectorizer.fit_transform(df['overlapping_chunks_file_content'])

# # Convert the TF-IDF matrix into a DataFrame
# tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# # Concatenate the TF-IDF DataFrame with the original DataFrame if needed
# # df = pd.concat([df, tfidf_df], axis=1)
# tfidf_df.head()

In [None]:
# import pandas as pd
# from sklearn.feature_extraction.text import CountVectorizer

# # Assuming you already have a DataFrame named 'df' with the 'overlapping_chunks_file_content' column exploded

# # Example DataFrame
# # df = pd.DataFrame({'overlapping_chunks_file_content': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})

# # Convert the column into strings
# df['overlapping_chunks_file_content'] = df['overlapping_chunks_file_content'].astype(str)

# # Initialize CountVectorizer
# vectorizer = CountVectorizer()

# # Fit and transform the text data
# bow_representation = vectorizer.fit_transform(df['overlapping_chunks_file_content'])

# # Convert the BoW representation into a DataFrame
# bow_df = pd.DataFrame(bow_representation.toarray(), columns=vectorizer.get_feature_names_out())

# # Concatenate the BoW DataFrame with the original DataFrame
# df_bow = pd.concat([df, bow_df], axis=1)

# # Print the resulting DataFrame with BoW representation
# print(df_bow)


In [None]:
# pip install chromadb
# chroma run # run the server

In [56]:
import pandas as pd

In [58]:
dd = pd.read_csv("anu.csv")
dd.head()

Unnamed: 0,SubtitleID_index,subtitle_id,name,content,clean_text_lemma,text_length_lemma,final_embeddings,chunks
0,5,9180607,the.myth.(2005).eng.1cd,general the princesss convoy has entered our ...,general the princess convoy ha entered our rea...,4005,"[-0.1895383894443512, 0.1699831336736679, 0.49...",['general the princess convoy ha entered our r...
1,6,9180608,the.great.beauty.(2013).eng.1cd,apiopensubtitlesorg is deprecated please impl...,apiopensubtitlesorg is deprecated please imple...,9208,"[-0.13504022359848022, 0.17723196744918823, 0....",['apiopensubtitlesorg is deprecated please imp...
2,9,9180694,rudrabinar.obhishaap.s02.e03.anandagarher.akhh...,so youre assuming that my grandma mumtaz is ...,so youre assuming that my grandma mumtaz is ab...,1178,"[-0.13107889890670776, 0.1428164541721344, 0.2...",['so youre assuming that my grandma mumtaz is ...
3,10,9180695,rudrabinar.obhishaap.s02.e04.udara.(2022).eng.1cd,you know that naads have less patience who s...,you know that naads have le patience who smugg...,1359,"[-0.08960047364234924, 0.04438433051109314, 0....",['you know that naads have le patience who smu...
4,11,9180696,rudrabinar.obhishaap.s02.e05.saat.surer.mejaj....,use the free code joinnow at wwwplayshipseu ...,use the free code joinnow at wwwplayshipseu ge...,1017,"[-0.11988365650177002, 0.22249145805835724, 0....",['use the free code joinnow at wwwplayshipseu ...


In [64]:
dd.final_embeddings[0]

'[-0.1895383894443512, 0.1699831336736679, 0.49176275730133057, -0.2788313627243042, 0.10992535948753357, -0.24937047064304352, 0.28042855858802795, 0.294384241104126, -0.006988018751144409, -0.3986762762069702, 0.023242846131324768, -0.5627375841140747, -0.2581726014614105, 0.3621686100959778, -0.30477577447891235, 0.5283430814743042, 0.1338929533958435, 0.16766758263111115, -0.2732051610946655, 0.5424041748046875, 0.34570762515068054, -0.013714262284338474, 0.1275763213634491, 0.18283139169216156, 0.5182170867919922, 0.0532633513212204, 0.20835082232952118, 0.02890435978770256, -0.07623176276683807, -0.20010600984096527, 0.40540367364883423, 0.26352956891059875, -0.1218508630990982, -0.1834336370229721, -0.10975711047649384, 0.08223343640565872, 0.09358371794223785, -0.1507522016763687, -0.06572076678276062, 0.3185150921344757, -0.6149031519889832, -0.5085240006446838, 0.12051786482334137, -0.019797172397375107, -0.24596992135047913, -0.7106381058692932, 0.21036896109580994, 0.122664
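
The leading quote in the output above shows that `final_embeddings` comes back from the CSV as a string, not a list of floats. A minimal sketch of parsing such a cell, assuming every cell is a JSON-compatible flat list. The helper name `parse_embedding` and the shortened sample string are illustrative.

```python
import json


def parse_embedding(cell):
    """Parse a stringified list such as '[0.1, -0.2]' into a list of floats."""
    values = json.loads(cell)
    return [float(v) for v in values]


# Shortened sample in the same format as dd.final_embeddings[0]
sample = "[-0.1895383894443512, 0.1699831336736679, 0.49176275730133057]"
vector = parse_embedding(sample)
print(type(vector), len(vector))
```

Applied to the whole column this would be `dd['final_embeddings'].apply(parse_embedding)`.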

In [62]:
# Initialize SQLite database in the specified path
db_path = 'chromadb_2.db'
conn = sqlite3.connect(db_path)

# Save DataFrame to SQLite
df_emb.to_sql("dd", conn, if_exists="replace", index=False)

# Report the row count rather than interpolating the whole DataFrame into the message
print(f"DataFrame with {len(df_emb)} rows saved to SQLite table 'dd' in {db_path}")

In [65]:
import json
import sqlite3
import numpy as np

# Connect to SQLite database
CHROMA_DB_FILE = "subtitle_embeddings5.db"
chroma_conn = sqlite3.connect(CHROMA_DB_FILE)
chroma_c = chroma_conn.cursor()

# Create table with columns 'num', 'name', 'clean_content', and 'content_embeddings'.
# IF NOT EXISTS avoids "OperationalError: table embeddings already exists" when the cell is re-run.
chroma_c.execute('''
    CREATE TABLE IF NOT EXISTS embeddings (
        num INTEGER,
        name TEXT,
        clean_content TEXT,
        content_embeddings BLOB
    )
''')

# Function to insert data into the SQLite database
def insert_data(row):
    # `dd` (read from anu.csv) uses different column names; the mapping below is an assumption
    num = int(row['subtitle_id'])
    name = row['name']
    clean_content = row['clean_text_lemma']
    # The CSV stores each embedding as a string like '[-0.18, 0.16, ...]', so parse it first
    content_embeddings = json.loads(row['final_embeddings'])

    # A fixed dtype makes the stored bytes decodable later with np.frombuffer
    content_embeddings_array = np.array(content_embeddings, dtype=np.float32)

    # Convert the numpy array to binary format using tobytes()
    content_embeddings_blob = sqlite3.Binary(content_embeddings_array.tobytes())

    # Insert data into the database
    chroma_c.execute('''
        INSERT INTO embeddings (num, name, clean_content, content_embeddings)
        VALUES (?, ?, ?, ?)
    ''', (num, name, clean_content, content_embeddings_blob))

# Iterate through each row in dd and insert data
for idx, row in dd.iterrows():
    insert_data(row)

# Commit the changes and close the connection
chroma_conn.commit()
chroma_conn.close()
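
Embeddings written with `tobytes()` come back from SQLite as raw bytes, so reading them needs the same dtype that was used when writing. A minimal round-trip sketch on an in-memory database. The `float32` dtype, the reduced two-column table, and the three-value vector are assumptions for illustration.

```python
import sqlite3

import numpy as np

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS embeddings (num INTEGER, content_embeddings BLOB)"
)

# Write: fix the dtype explicitly so the byte layout is known
original = np.array([0.25, -0.5, 0.75], dtype=np.float32)
conn.execute(
    "INSERT INTO embeddings (num, content_embeddings) VALUES (?, ?)",
    (1, sqlite3.Binary(original.tobytes())),
)
conn.commit()

# Read: decode the BLOB with the same dtype that was used for writing
blob = conn.execute(
    "SELECT content_embeddings FROM embeddings WHERE num = ?", (1,)
).fetchone()[0]
restored = np.frombuffer(blob, dtype=np.float32)
print(restored)

conn.close()
```

Decoding with a different dtype than the one used for writing does not raise an error in general; it silently yields wrong numbers, which is why the dtype is pinned on both sides here.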

In [1]:
import PIL
print('PIL',PIL.__version__)

PIL 10.1.0


In [2]:
import pandas as pd

In [4]:
dg = pd.read_csv('df_e.csv')
dg.head()

ParserError: Error tokenizing data. C error: out of memory
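
The out-of-memory error above comes from parsing the whole file in one pass. A minimal sketch of reading a CSV in pieces with the `chunksize` parameter of `pd.read_csv`, optionally restricted to selected columns with `usecols`. The helper name, the chunk size, and the small demo file are illustrative, and whether this is enough for `df_e.csv` depends on how large its individual rows are.

```python
import os
import tempfile

import pandas as pd


def count_rows_in_chunks(csv_path, chunksize=1000, usecols=None):
    """Stream a CSV in fixed-size chunks and return the total number of rows."""
    total = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, usecols=usecols):
        # Each chunk is an ordinary DataFrame; process it here and let it go out of scope
        total += len(chunk)
    return total


# Demo on a small temporary file standing in for the real CSV
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    demo = pd.DataFrame({"num": range(10), "name": [f"title_{i}" for i in range(10)]})
    demo.to_csv(demo_path, index=False)
    total_rows = count_rows_in_chunks(demo_path, chunksize=4, usecols=["num", "name"])

print(total_rows)
```

Inside the loop each chunk can be written to a database or reduced to the needed columns, so only one chunk is held in memory at a time.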