## **Project Hint - Reading the Data from Database**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import sqlite3
import pandas as pd

## **Step 1 - Reading the Tables from Database file**

In [4]:
# Read the code below and write your observation in the next cell

conn = sqlite3.connect(r'/content/drive/MyDrive/Search_Engine/eng_subtitles_database.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('zipfiles',)]


**In the above cell, I am able to read the table inside the database. As mentioned earlier, table name is `zipfiles`. We also know from README.txt that this table contains three columns: 'num', 'name' and 'content'.**

## **Step 2 - Reading the columns of Table**

In [5]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

num
name
content


**The above code helps in checking the column names in the database table.**

**Let's now use `SELECT * FROM zipfiles` to read all the data into a `df` variable.**

## **Step 3 - Loading the Database Table inside a Pandas DataFrame**

In [6]:
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


**Looks like the `content` column donot contain the subtitles text. Instead as mentioned in README.txt, it might be latin-1 encoded.**

## **Step 4 - Printing `content` of 0th Row**

In [8]:
b_data = df.iloc[0, 2]

# here 2 represent the index of content column
# 0 represents the row number

In [9]:
print(b_data)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x99V\x9fx\x96\xf0\x8c\x9e\x00\x00\x86\x9b\x01\x00;\x00\x00\x00The.Message.1976.REMASTERED.1080p.BluRay.x264-PiGNUS.EN.srt\xad\xbdm\x93\xdc\xc6\x91.\xfa\x9d\x11\xfc\x0f-}\xe1=\x11-\x9d\x06P\x85\x17\x9d\x8d\xd5%%[\xa4-Y>&u\x15>\xdf\xd0\xd3\x98\x19x\xfae\x0cts<\xfe\xf57\x9f\'\xb3\n\xd9\xa4\xbc\xbb\xf7\xc6Fl\xacELW\xa2\xaa\x90\x95\x95\xafO\x16/_l6\xdf\xe0\xff\xea\xf5f\xb3Y}\xf5\xd5\xbf\xaf\xf4AQ\xae7Mx\xf9\xe2\xd7\xfe|s\xbf\xea\x8f\xcf\xab\x8f\xe3n8\xadN\xc7\xfdx\x1cVO\xe3\xf9~\xf5\xf3\xe3p\xfc\xea\xfd/o>\xbc\xfb\xf0\xe3\xef\xde\xbf|\xf1\xfbi\x18Vo\xa6\xd3\xd3<L\xab\xe1\x1f\xe7\xe18\x8f\xa7\xe37\xab\xd3\xbc\xdb~-\xc3\x1e\xfe\xa7<|\xf9\xe2\xe5\x8bR_[~S\xd6\xeb\xa2k\xf3k\xe5A\xb7\xeeb\xf5\xf2\xc5\xbb\xe3\xea|?\xac\x8e\xfdaX\x9dnW?\x9cvk>8\x9c\xe6\xf3\xean\xeao\xc6\xd3ev\x8f~\x1a\xa6\x9b\xf1\xf6\xb2\xff\x1a\xe4\xabD\xbe*d\x11\xa5#_U\xeb\xaa\xd9`\xa6\xa7\xc3\xea\xa7\xcb}\x7f8\xf4F\xf9\xa7a\x9e\x87\xe3\x9d\xcc\\\xdf\x07B!\x13\xaa\xd61n<!\xd9\xaf\xd0\

**From the content, it appears to start with the bytes "PK\x03\......", which suggests that it might be a ZIP archive file. How do I know it? Experience! I have worked with something similar earlier.**

## **Step 5 - Unzipping the content of 385th row and decoding using `latin-1`**

In [10]:
import zipfile
import io

# Assuming 'content' is the binary data from your database
binary_data = df.iloc[385, 2]

# Decompress the binary data using the zipfile module
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        # Reading only one file in the ZIP archive
        subtitle_content = zip_file.read(zip_file.namelist()[0])

# Now 'subtitle_content' should contain the extracted subtitle content
print(subtitle_content.decode('latin-1'))  # Assuming the content is latin-1 encoded text

1
00:00:06,000 --> 00:00:12,074
Watch any video online with Open-SUBTITLES
Free Browser extension: osdb.link/ext

2
00:00:15,370 --> 00:00:16,506
You lose everything, my girl.

3
00:00:16,530 --> 00:00:19,360
So you've said - four times.

4
00:00:20,330 --> 00:00:22,120
I definitely had
it on yesterday.

5
00:00:22,465 --> 00:00:25,785
Your gloves, your keys, that
handkerchief I embroidered for you

6
00:00:25,809 --> 00:00:26,168
Everything!

7
00:00:26,192 --> 00:00:27,280
Five times.

8
00:00:31,610 --> 00:00:32,920
Miss Scarlet?
- Yes.

9
00:00:36,390 --> 00:00:37,390
I'm Miss Scarlet.

10
00:00:37,872 --> 00:00:40,880
May I inquire if
you've lost something?

11
00:00:41,350 --> 00:00:42,530
Some jewellery perhaps?

12
00:00:42,870 --> 00:00:45,130
Yes, my mother's wedding ring.

13
00:00:45,220 --> 00:00:45,840
Have you found it?

14
00:00:45,950 --> 00:00:47,656
Does your ring have
an inscription?

15
00:00:48,650 -->

**Look's like it worked.**

## **Step 6 - Applying the above Function on the Entire Data**

In [11]:
import zipfile
import io

count = 0

def decode_method(binary_data):
    global count
    # Decompress the binary data using the zipfile module
    # print(count, end=" ")
    count += 1
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])

    # Now 'subtitle_content' should contain the extracted subtitle content
    return subtitle_content.decode('latin-1')  # Assuming the content is UTF-8 encoded text

In [12]:
df['file_content'] = df['content'].apply(decode_method)

df.head()

Unnamed: 0,num,name,content,file_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."


In [13]:
import nltk

# Download the punctuations
nltk.download('punkt')
# Download the stop words corpus
nltk.download('stopwords')
# Downloading wordnet before applying Lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [14]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [15]:
stemmer = PorterStemmer()
## We can also use Lemmatizer instead of Stemmer
lemmatizer = WordNetLemmatizer()

In [16]:
def preprocess(raw_text, flag):
    # Removing special characters and digits
    sentence = re.sub("[^a-zA-Z]", " ", str(raw_text))

    # change sentence to lower case
    sentence = sentence.lower()

    # tokenize into words
    tokens = sentence.split()

    # remove stop words
    clean_tokens = [t for t in tokens if not t in stopwords.words("english")]

    # Stemming/Lemmatization
    if(flag == 'stem'):
        clean_tokens = [stemmer.stem(word) for word in clean_tokens]
    else:
        clean_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]

    return pd.Series([" ".join(clean_tokens), len(clean_tokens)])

In [17]:
df=df.sample(n=5000)

In [18]:
df

Unnamed: 0,num,name,content,file_content
27368,9290679,the.legend.of.lizzie.borden.(1975).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1b\xa5\x...,"ï»¿1\r\n00:00:02,600 --> 00:00:05,110\r\n[Chur..."
42709,9355071,strange.evidence.s04.e06.aliens.of.hell.highwa...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00%\x08\x9aV...,"ï»¿1\n00:00:00,901 --> 00:00:02,601\n[ camera ..."
2588,9192252,the.killer.a.girl.who.deserves.to.die.(2022).e...,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\x82\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
47261,9379002,16.to.life.(2009).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x98\x15\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nSupport ..."
30620,9304025,masters.of.sex.s03.e05.matters.of.gravity.(201...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00 \xac\x99V...,"ï»¿1\r\n00:00:01,385 --> 00:00:03,411\r\nVIRGI..."
...,...,...,...,...
74743,9489318,kiff.s01.e05.big.barry.on.campusclub.book.(202...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x8f~\x9aV...,"ï»¿1\r\n00:00:01,626 --> 00:00:03,420\r\n(them..."
68378,9462509,upstairs.downstairs.s02.e01.the.new.man.(1972)...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x0cT\x9aV...,"ï»¿1\r\n00:00:01,133 --> 00:00:04,000\r\n(orch..."
479,9183522,survivor.s15.e04.ride.the.workhorse.till.the.t...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00k\xa9\x99V...,"1\r\n00:00:00,125 --> 00:00:02,669\r\n>> JEFF ..."
64183,9446827,the.ray.bradbury.theater.s02.e04.gotcha.(1988)...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00GG\x9aV\x9...,"ï»¿1\r\n00:00:04,972 --> 00:00:09,009\r\n[musi..."


In [19]:
from tqdm import tqdm, tqdm_notebook

In [20]:
tqdm.pandas()

In [21]:
temp_df1 = df["file_content"].progress_apply(lambda x: preprocess(x, 'lemma'))

100%|██████████| 5000/5000 [54:14<00:00,  1.54it/s]


In [22]:
temp_df1

Unnamed: 0,0,1
27368,church bell tolling bird chirping watch video ...,3879
42709,camera whir narrator worldwide billion camera ...,3085
2588,watch video online open subtitle free browser ...,2453
47261,support u become vip member remove ad www open...,4125
30620,virginia previously master sex chuck kind smut...,3071
...,...,...
74743,theme song play chorus kiff kiff kiff kiff kif...,1226
68378,orchestal music support u become vip member re...,3315
479,jeff probst previously survivor oh wow probst ...,2747
64183,music playing support u become vip member remo...,716


In [23]:
temp_df1.columns = ['clean_text_lemma', 'text_length_lemma']

temp_df1.head()

Unnamed: 0,clean_text_lemma,text_length_lemma
27368,church bell tolling bird chirping watch video ...,3879
42709,camera whir narrator worldwide billion camera ...,3085
2588,watch video online open subtitle free browser ...,2453
47261,support u become vip member remove ad www open...,4125
30620,virginia previously master sex chuck kind smut...,3071


In [24]:
temp_df1.to_csv('/content/drive/MyDrive/Search_Engine/subtitile.csv',index=False)

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
vocab2 = TfidfVectorizer()
subtitiles_tfidf1 = vocab2.fit_transform(temp_df1['clean_text_lemma'])

In [26]:
subtitiles_tfidf1

<5000x116175 sparse matrix of type '<class 'numpy.float64'>'
	with 4258519 stored elements in Compressed Sparse Row format>

In [27]:
user_query = pd.Series([input('Enter the query:')])

user_query.progress_apply(lambda x: preprocess(x,flag='lemma'))

query_vector = vocab2.transform(user_query)

Enter the query:here comes the grump


100%|██████████| 1/1 [00:00<00:00, 142.45it/s]


In [28]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity =  cosine_similarity(query_vector,subtitiles_tfidf1).flatten()

A = 10

top_A_indices = cosine_similarity.argsort()[-A:][::-1]
top_A_subtitles = temp_df1.iloc[top_A_indices] [:]


In [29]:
top_A_indices

array([ 610, 4928, 4942, 4700, 4632,  784, 1385, 3852, 4999, 1661])

In [30]:
top_A_subtitles

Unnamed: 0,clean_text_lemma,text_length_lemma
13111,grump continues relentless chase stop terry pr...,532
24130,watch video online open subtitle free browser ...,535
5465,crowd cheering tv joe murray catch screen pas ...,1881
1780,advertise product brand contact www opensubtit...,4248
75679,octopus episode five read dossier could give s...,2013
62179,whoo dup dubbity dup dup dup duppup dup two th...,1434
16142,advertise product brand contact www opensubtit...,4683
75947,jeff previously survivor merge war domenick ch...,2997
72062,light guitar music playing support u become vi...,1415
76876,phillips lividity fixed consistent body positi...,2356


In [31]:
import pandas as pd

In [32]:
temp_df1 = pd.read_csv(r"/content/drive/MyDrive/Search_Engine/subtitile.csv")

In [33]:
temp_df1.dropna(inplace=True)

In [34]:
temp_df1

Unnamed: 0,clean_text_lemma,text_length_lemma
0,church bell tolling bird chirping watch video ...,3879
1,camera whir narrator worldwide billion camera ...,3085
2,watch video online open subtitle free browser ...,2453
3,support u become vip member remove ad www open...,4125
4,virginia previously master sex chuck kind smut...,3071
...,...,...
4995,theme song play chorus kiff kiff kiff kiff kif...,1226
4996,orchestal music support u become vip member re...,3315
4997,jeff probst previously survivor oh wow probst ...,2747
4998,music playing support u become vip member remo...,716


In [35]:
df = pd.concat([df, temp_df1], axis=1)

df.head()

Unnamed: 0,num,name,content,file_content,clean_text_lemma,text_length_lemma
27368,9290679.0,the.legend.of.lizzie.borden.(1975).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1b\xa5\x...,"ï»¿1\r\n00:00:02,600 --> 00:00:05,110\r\n[Chur...",,
42709,9355071.0,strange.evidence.s04.e06.aliens.of.hell.highwa...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00%\x08\x9aV...,"ï»¿1\n00:00:00,901 --> 00:00:02,601\n[ camera ...",,
2588,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\x82\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",script info title english u original script tv...,2125.0
47261,9379002.0,16.to.life.(2009).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x98\x15\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nSupport ...",,
30620,9304025.0,masters.of.sex.s03.e05.matters.of.gravity.(201...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00 \xac\x99V...,"ï»¿1\r\n00:00:01,385 --> 00:00:03,411\r\nVIRGI...",,


In [36]:
df.isnull().sum()

num                  4680
name                 4680
content              4680
file_content         4680
clean_text_lemma     4680
text_length_lemma    4680
dtype: int64

In [37]:
df = df[['num',"name",'clean_text_lemma']]

df.head()

Unnamed: 0,num,name,clean_text_lemma
27368,9290679.0,the.legend.of.lizzie.borden.(1975).eng.1cd,
42709,9355071.0,strange.evidence.s04.e06.aliens.of.hell.highwa...,
2588,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
47261,9379002.0,16.to.life.(2009).eng.1cd,
30620,9304025.0,masters.of.sex.s03.e05.matters.of.gravity.(201...,


In [38]:
df.dropna(inplace=True)

In [39]:
df.shape

(320, 3)

In [40]:
df

Unnamed: 0,num,name,clean_text_lemma
2588,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
4744,9201995.0,beautiful.days.s01.e12.episode.1.12.(2001).eng...,narrator listen spooky episode time got see do...
467,9183459.0,better.call.saul.s06.e10.nippy.(2022).eng.1cd,innkeeper another round ch ch ch chee eers tha...
1355,9187288.0,survivor.s17.e08.the.brains.behind.everything....,watch video online open subtitle free browser ...
1903,9189480.0,ghost.adventures.s14.e01.stone.lion.inn.(2017)...,insect chirping bird calling rustling advertis...
...,...,...,...
743,9184706.0,ice.cream.man.(1995).eng.1cd,get said get episode happened gosh idiot hurt ...
535,9183805.0,teen.wolf.s03.e09.the.girl.who.knew.too.much.(...,kid capri straight new york city russell simmo...
1034,9186150.0,the.jeffersons.s05.e01.louises.painting.(1978)...,script info title default file scripttype v wr...
745,9184739.0,ruling.of.the.heart.(2018).eng.1cd,called forbidden zone one know lie beneath san...


In [41]:
def chunk_document(corpous, id_col, chunk_size=300):
    data = []

    for doc,id  in zip(corpous, id_col):

        words = doc.split()
        for i in range(0, len(words), chunk_size):
            chunk = ' '.join(words[i:i + chunk_size])


            data.append((id,chunk))

    df = pd.DataFrame(data)

    return df

In [42]:
chuncked_df = chunk_document(df.clean_text_lemma,df.num)

In [43]:
chuncked_df

Unnamed: 0,0,1
0,9192252.0,script info title english u original script tv...
1,9192252.0,h b h style sign episode arial h fffadb h ff h...
2,9192252.0,dialogue main gu last word activate dialogue m...
3,9192252.0,main gu seen heart dialogue main gu ready dial...
4,9192252.0,n battle though dialogue main gu goal get card...
...,...,...
2882,9183522.0,row right behind home zone tonight game two se...
2883,9183522.0,jack speech honestly calling buck manzell hurt...
2884,9183522.0,remember grandma earlier begged get cooky oven...
2885,9183522.0,guy unnervingly tall two time wife got scraggl...


In [44]:
chuncked_df.columns =['num','file_content_chunks']

In [45]:
chuncked_df

Unnamed: 0,num,file_content_chunks
0,9192252.0,script info title english u original script tv...
1,9192252.0,h b h style sign episode arial h fffadb h ff h...
2,9192252.0,dialogue main gu last word activate dialogue m...
3,9192252.0,main gu seen heart dialogue main gu ready dial...
4,9192252.0,n battle though dialogue main gu goal get card...
...,...,...
2882,9183522.0,row right behind home zone tonight game two se...
2883,9183522.0,jack speech honestly calling buck manzell hurt...
2884,9183522.0,remember grandma earlier begged get cooky oven...
2885,9183522.0,guy unnervingly tall two time wife got scraggl...


In [46]:
merged_df = chuncked_df.merge(df,left_on="num",right_on="num",how="left")

In [47]:
merged_df

Unnamed: 0,num,file_content_chunks,name,clean_text_lemma
0,9192252.0,script info title english u original script tv...,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
1,9192252.0,h b h style sign episode arial h fffadb h ff h...,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
2,9192252.0,dialogue main gu last word activate dialogue m...,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
3,9192252.0,main gu seen heart dialogue main gu ready dial...,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
4,9192252.0,n battle though dialogue main gu goal get card...,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
...,...,...,...,...
2882,9183522.0,row right behind home zone tonight game two se...,survivor.s15.e04.ride.the.workhorse.till.the.t...,watch video online open subtitle free browser ...
2883,9183522.0,jack speech honestly calling buck manzell hurt...,survivor.s15.e04.ride.the.workhorse.till.the.t...,watch video online open subtitle free browser ...
2884,9183522.0,remember grandma earlier begged get cooky oven...,survivor.s15.e04.ride.the.workhorse.till.the.t...,watch video online open subtitle free browser ...
2885,9183522.0,guy unnervingly tall two time wife got scraggl...,survivor.s15.e04.ride.the.workhorse.till.the.t...,watch video online open subtitle free browser ...


In [48]:
merged_df= merged_df[['num','name','file_content_chunks']]

In [49]:
merged_df

Unnamed: 0,num,name,file_content_chunks
0,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,script info title english u original script tv...
1,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,h b h style sign episode arial h fffadb h ff h...
2,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,dialogue main gu last word activate dialogue m...
3,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,main gu seen heart dialogue main gu ready dial...
4,9192252.0,the.killer.a.girl.who.deserves.to.die.(2022).e...,n battle though dialogue main gu goal get card...
...,...,...,...
2882,9183522.0,survivor.s15.e04.ride.the.workhorse.till.the.t...,row right behind home zone tonight game two se...
2883,9183522.0,survivor.s15.e04.ride.the.workhorse.till.the.t...,jack speech honestly calling buck manzell hurt...
2884,9183522.0,survivor.s15.e04.ride.the.workhorse.till.the.t...,remember grandma earlier begged get cooky oven...
2885,9183522.0,survivor.s15.e04.ride.the.workhorse.till.the.t...,guy unnervingly tall two time wife got scraggl...


In [50]:
for id, chunk in enumerate(merged_df):
    print(chunk)

num
name
file_content_chunks


In [51]:
merged_df.to_csv("/content/drive/MyDrive/Search_Engine/merged_df.csv",index=False)