# **INNOMATICS Project - Reading the Data from Database**

In [1]:
import sqlite3
import pandas as pd

## **Reading the Tables from the Database File**

In [2]:
conn = sqlite3.connect("C:\\Users\\THANUJA\\Downloads\\eng_subtitles_database.db")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('zipfiles',)]


**The cell above lists the tables in the database. It contains a single table, `zipfiles`. README.txt also states that this table has three columns: `num`, `name` and `content`.**

## **Reading the columns of Table**

In [3]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

num
name
content


**The code above prints the column names of the `zipfiles` table, which match the three columns listed in README.txt.**
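
Each row returned by `PRAGMA table_info` is a tuple of the form `(cid, name, type, notnull, dflt_value, pk)`, which is why `col[1]` gives the column name. The following is a minimal sketch on an in-memory database. The `zipfiles` schema used here, including the column types, is an assumed stand-in and not the declared schema of the real file.

```python
import sqlite3

# In-memory stand-in for the real database; the column types are assumptions.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")

# Each row is (cid, name, type, notnull, dflt_value, pk).
table_info = demo_conn.execute("PRAGMA table_info('zipfiles')").fetchall()
column_names = [row[1] for row in table_info]
column_types = {row[1]: row[2] for row in table_info}

print(column_names)
print(column_types)
demo_conn.close()
```

Reading index 2 of each row in the same way would show the declared type of `content` in the real database.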

## **Loading the Database Table inside a Pandas DataFrame**

In [4]:
#reading all the data into a `df` variable
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


**The `content` column does not hold the subtitle text directly; it holds raw bytes. As mentioned in README.txt, the underlying text may be latin-1 encoded.**

In [6]:
# Print the content of row 0.
# In df.iloc[0, 2], 0 is the row position and 2 is the position of the content column.
b_data = df.iloc[0, 2]
print(b_data)

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x99V\x9fx\x96\xf0\x8c\x9e\x00\x00\x86\x9b\x01\x00;\x00\x00\x00The.Message.1976.REMASTERED.1080p.BluRay.x264-PiGNUS.EN.srt\xad\xbdm\x93\xdc\xc6\x91.\xfa\x9d\x11\xfc\x0f-}\xe1=\x11-\x9d\x06P\x85\x17\x9d\x8d\xd5%%[\xa4-Y>&u\x15>\xdf\xd0\xd3\x98\x19x\xfae\x0cts<\xfe\xf57\x9f\'\xb3\n\xd9\xa4\xbc\xbb\xf7\xc6Fl\xacELW\xa2\xaa\x90\x95\x95\xafO\x16/_l6\xdf\xe0\xff\xea\xf5f\xb3Y}\xf5\xd5\xbf\xaf\xf4AQ\xae7Mx\xf9\xe2\xd7\xfe|s\xbf\xea\x8f\xcf\xab\x8f\xe3n8\xadN\xc7\xfdx\x1cVO\xe3\xf9~\xf5\xf3\xe3p\xfc\xea\xfd/o>\xbc\xfb\xf0\xe3\xef\xde\xbf|\xf1\xfbi\x18Vo\xa6\xd3\xd3<L\xab\xe1\x1f\xe7\xe18\x8f\xa7\xe37\xab\xd3\xbc\xdb~-\xc3\x1e\xfe\xa7<|\xf9\xe2\xe5\x8bR_[~S\xd6\xeb\xa2k\xf3k\xe5A\xb7\xeeb\xf5\xf2\xc5\xbb\xe3\xea|?\xac\x8e\xfdaX\x9dnW?\x9cvk>8\x9c\xe6\xf3\xean\xeao\xc6\xd3ev\x8f~\x1a\xa6\x9b\xf1\xf6\xb2\xff\x1a\xe4\xabD\xbe*d\x11\xa5#_U\xeb\xaa\xd9`\xa6\xa7\xc3\xea\xa7\xcb}\x7f8\xf4F\xf9\xa7a\x9e\x87\xe3\x9d\xcc\\\xdf\x07B!\x13\xaa\xd61n<!\xd9\xaf\xd0\

**The content starts with the bytes `PK\x03\x04`, the signature of a ZIP local file header, which suggests that each value is a ZIP archive.**
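
The signature check can also be done in code rather than by eye. The sketch below builds a tiny ZIP archive in memory as a stand-in for one `content` value (the real bytes are not reproduced here) and tests it two ways. `looks_like_zip` is a hypothetical helper name; `zipfile.is_zipfile` is the standard-library check.

```python
import io
import zipfile

def looks_like_zip(blob: bytes) -> bool:
    """Return True if the bytes start with the ZIP local-file-header signature."""
    return blob[:4] == b"PK\x03\x04"

# Build a tiny ZIP archive in memory to stand in for one `content` value.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("sample.srt", "1\n00:00:01,000 --> 00:00:02,000\nHello\n")
sample_blob = buffer.getvalue()

print(looks_like_zip(sample_blob))                  # quick signature check
print(zipfile.is_zipfile(io.BytesIO(sample_blob)))  # stricter stdlib check
print(looks_like_zip(b"plain text"))
```

Running a check like this over the whole `content` column before unzipping would flag any rows that are not ZIP archives.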

## **Unzipping the Content of the 0th Row and Decoding It as latin-1**

In [7]:
import zipfile
import io

# Assuming 'content' is the binary data from your database
binary_data = df.iloc[0, 2]

# Decompress the binary data using the zipfile module
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        # Reading only one file in the ZIP archive
        subtitle_content = zip_file.read(zip_file.namelist()[0])

# Now 'subtitle_content' should contain the extracted subtitle content
print(subtitle_content.decode('latin-1'))  # Assuming the content is latin-1 encoded text

1
00:00:06,000 --> 00:00:12,074
Watch any video online with Open-SUBTITLES
Free Browser extension: osdb.link/ext

2
00:02:26,198 --> 00:02:29,953
In the name of God, the most gracious, the most Merciful.

3
00:02:31,072 --> 00:02:33,370
From Muhammad, the Messenger of God

4
00:02:33,550 --> 00:02:36,047
to Heraclius, the emperor of Byzantium.

5
00:02:36,407 --> 00:02:39,464
greetings to him who is the
follower of righteous guidance.

6
00:02:39,783 --> 00:02:42,591
I bid you to hear the divine call.

7
00:02:43,160 --> 00:02:45,817
I am the messenger of God to the people;

8
00:02:46,337 --> 00:02:48,784
accept Islam for your salvation.

9
00:02:52,231 --> 00:02:54,709
He speaks of a new prophet in Arabia.

10
00:02:55,068 --> 00:02:57,825
Was it like this when John, the Baptist
came to king Herod

11
00:02:58,145 --> 00:03:01,272
out of the desert, crying about salvation?

12
00:03:26,136 --> 00:03:28,903
To Muqawqis, Patriarch of Ale

**Looks like it worked.**

## **Applying the Above Steps to the Entire Data**

In [8]:
import zipfile
import io

count = 0

def decode_method(binary_data):
    global count
    # Decompress the binary data using the zipfile module
    # print(count, end=" ")
    count += 1
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])
    
    # Now 'subtitle_content' should contain the extracted subtitle content
    return subtitle_content.decode('latin-1')  # Assuming the content is latin-1 encoded text

## **Taking a Random 30% Sample of the Data**

In [9]:
# df DataFrame containing the subtitle data
# Sample 30% of the data
df1 = df.sample(frac=0.3, random_state=42)  # Using a specific random_state for reproducibility
print(df.head())
# Print the shape of the sampled data 
print("Shape of sampled data:", df1.shape)

# resetting the index of the sampled data (not compulsory)
df1.reset_index(drop=True, inplace=True)


       num                                               name  \
0  9180533                         the.message.(1976).eng.1cd   
1  9180583  here.comes.the.grump.s01.e09.joltin.jack.in.bo...   
2  9180592    yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd   
3  9180594    yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd   
4  9180600                              broker.(2022).eng.1cd   

                                             content  
0  b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...  
1  b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...  
2  b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...  
3  b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...  
4  b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...  
Shape of sampled data: (24749, 3)


In [10]:
df1['file_content'] = df1['content'].apply(decode_method)

df1.head()

Unnamed: 0,num,name,content,file_content
0,9251120,maybe.this.time.(2014).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x89\x9a\x...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."
1,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,b'PK\x03\x04\x14\x00\x00\x00\x08\x007\x8f\x99V...,"1\r\n00:00:09,275 --> 00:00:11,876\r\n¶ Oh, I ..."
2,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x8f\x19\x...,"1\r\n00:00:07,140 --> 00:00:14,220\r\n<i>Timin..."
3,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00[\xaa\x99V...,"1\r\n00:00:06,133 --> 00:00:08,900\r\n[etherea..."
4,9408707,battlebots.(2015).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xf4<\x9aV...,"ï»¿1\r\n00:00:01,480 --> 00:00:03,570\r\n[Chri..."


In [11]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24749 entries, 0 to 24748
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   num           24749 non-null  int64 
 1   name          24749 non-null  object
 2   content       24749 non-null  object
 3   file_content  24749 non-null  object
dtypes: int64(1), object(3)
memory usage: 773.5+ KB


In [12]:
#Viewing the single file content
df1['file_content'][0]

'ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch any video online with Open-SUBTITLES\r\nFree Browser extension: osdb.link/ext\r\n\r\n2\r\n00:00:37,328 --> 00:00:39,706\r\n<i>It could\'ve been\r\njust another summer.</i>\r\n\r\n3\r\n00:00:40,790 --> 00:00:43,042\r\n<i>But as I set foot on the sand,</i>\r\n\r\n4\r\n00:00:43,209 --> 00:00:46,212\r\n<i>that summer\r\nsuddenly felt different.</i>\r\n\r\n5\r\n00:00:55,221 --> 00:00:56,973\r\n<i>Like it was going to be the summer</i>\r\n\r\n6\r\n00:00:57,098 --> 00:00:59,142\r\n<i>that would change my life.</i>\r\n\r\n7\r\n00:00:59,350 --> 00:01:01,770\r\n<i>The summer of freedom.</i>\r\n\r\n8\r\n00:01:02,562 --> 00:01:05,607\r\n<i>The summer of\r\nendless possibilities.</i>\r\n\r\n9\r\n00:01:06,274 --> 00:01:09,402\r\n<i>The summer of 2007.</i>\r\n\r\n10\r\n00:01:16,493 --> 00:01:18,036\r\nOoh, aah!\r\n\r\n11\r\n00:01:24,459 --> 00:01:26,169\r\nOoh, oh!\r\n\r\n12\r\n00:01:26,377 --> 00:01:28,254\r\n<i>â\x99ª Oh, oh, ooh â\x99ª</i>\r\n\r\n13\
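
The `ï»¿` prefix and the `â\x99ª` sequences in the output above are what a UTF-8 byte-order mark and a UTF-8 `♪` look like when their bytes are decoded as latin-1, so at least some of the files appear to be UTF-8 rather than latin-1. One possible way to handle both cases, sketched here under that assumption, is to try `utf-8-sig` first and fall back to latin-1. `decode_subtitle_bytes` is a hypothetical helper, not part of the notebook.

```python
def decode_subtitle_bytes(raw: bytes) -> str:
    """Decode as UTF-8 (dropping a BOM if present), else fall back to latin-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")

# UTF-8 bytes with a BOM and two musical notes, as in the output above.
utf8_with_bom = b"\xef\xbb\xbf1\r\n\xe2\x99\xaa Oh, oh \xe2\x99\xaa"
# Bytes that are valid latin-1 but not valid UTF-8.
latin1_only = "caf\u00e9".encode("latin-1")

print(repr(decode_subtitle_bytes(utf8_with_bom)))
print(repr(decode_subtitle_bytes(latin1_only)))
```

The fallback order matters: latin-1 accepts any byte sequence, so it has to come last or the UTF-8 attempt would never run.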

In [13]:
df1.drop('content',axis=1,inplace=True)
df1.head()

Unnamed: 0,num,name,file_content
0,9251120,maybe.this.time.(2014).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."
1,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,"1\r\n00:00:09,275 --> 00:00:11,876\r\n¶ Oh, I ..."
2,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,"1\r\n00:00:07,140 --> 00:00:14,220\r\n<i>Timin..."
3,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,"1\r\n00:00:06,133 --> 00:00:08,900\r\n[etherea..."
4,9408707,battlebots.(2015).eng.1cd,"ï»¿1\r\n00:00:01,480 --> 00:00:03,570\r\n[Chri..."


In [14]:
"""import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import sqlite3
from chromadb import ChromaDB

# Define the document chunker function
def document_chunker(text, chunk_size=500, overlap_size=50):
    chunks = []
    words = text.split()
    for i in range(0, len(words), chunk_size - overlap_size):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Initialize ChromaDB
db = ChromaDB()

# Define TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Iterate over each row in the DataFrame
for index, row in df1.iterrows():
    # Apply document chunker to file content
    chunks = document_chunker(row['file_content'])
    
    # Vectorize each chunk
    chunk_vectors = tfidf_vectorizer.fit_transform(chunks)
    
    # Store embeddings in ChromaDB
    for i, chunk_vector in enumerate(chunk_vectors):
        # Convert sparse matrix to dense array
        chunk_embedding = np.array(chunk_vector.todense()).flatten()
        
        # Store embedding in ChromaDB
        db.set_embedding(f"document_{row['num']}_chunk_{i}", chunk_embedding)"""



In [15]:
"""import chromadb

# Initialize a ChromaDB client
chroma_client = chromadb.Client()

# Create a new collection for storing subtitle documents
subtitle_collection = chroma_client.create_collection(name="subtitle_collection")

# Assuming df1 contains the DataFrame with subtitle documents
for index, row in df1.iterrows():
    # Add each subtitle document to the collection
    subtitle_collection.add(
        documents=[row['file_content']],  # Assuming 'file_content' contains subtitle text
        metadatas=[{"name": row['name']}],  # Store metadata such as subtitle name
        ids=[str(row['num'])]  # Unique ID for each subtitle document
    )

# Define a function to query subtitle collection based on user query
def search_subtitles(query_text, n_results=5):
    # Query the subtitle collection
    results = subtitle_collection.query(
        query_texts=[query_text],
        n_results=n_results
    )
    return results

# Example query
query_text = "Search query for subtitles"
search_results = search_subtitles(query_text)
print(search_results)
"""


In [16]:
"""import chromadb

# Initialize ChromaDB client and create a collection
chroma_client = chromadb.Client()
#subtitle_collection = chroma_client.create_collection(name="subtitle_collection")
subtitle_collection = chroma_client.create_collection(name="unique_subtitle_collection")

from sentence_transformers import SentenceTransformer

# Load a pre-trained BERT-based model
model = SentenceTransformer('bert-base-nli-mean-tokens')



# Example function for chunking a document
def chunk_document(document, chunk_size=500):
    # Split the document into chunks of specified size
    chunks = [document[i:i+chunk_size] for i in range(0, len(document), chunk_size)]
    return chunks

# Example function for embedding a chunk
def embed_chunk(chunk):
    # Encode the chunk using the loaded model
    embedding = model.encode(chunk)
    return embedding

# Example function for concatenating vectors
def concatenate_vectors(vector_list):
    # Concatenate vectors into a single vector
    concatenated_vector = [value for sublist in vector_list for value in sublist]
    return concatenated_vector

# Assuming df1 contains the DataFrame with subtitle documents
for index, row in df1.iterrows():
    # Chunk the document
    chunks = chunk_document(row['file_content'])
    
    # Embed each chunk
    chunk_embeddings = [embed_chunk(chunk) for chunk in chunks]
    
    # Concatenate the vectors of all chunks into one single vector
    concatenated_vector = concatenate_vectors(chunk_embeddings)
    
    # Add document metadata, ID, document content, and vector to the collection
    subtitle_collection.add(
        documents=[row['file_content']],
        metadatas=[{"name": row['name'], "chunk_count": len(chunks)}],
        ids=[str(row['num'])],
        vectors=[concatenated_vector]
    )

# Retrieve metadata, document IDs, documents, and vectors from the collection
metadata_list = []
document_ids = []
documents_list = []
vectors_list = []

for result in subtitle_collection.query():
    metadata_list.append(result.metadata)
    document_ids.append(result.id)
    documents_list.append(result.document)
    vectors_list.append(result.vector)

# Now you have lists containing metadata, document IDs, documents, and vectors for all documents in the collection
"""

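Concatenating the chunk embeddings, as the disabled cell above does, gives each document a vector whose length grows with its number of chunks, whereas similarity search generally needs every vector in a collection to have the same dimension. One common alternative, which is a different technique from the concatenation above, is to average the chunk vectors element-wise. The sketch below uses toy 3-dimensional lists in place of real model outputs; `mean_pool` is a hypothetical helper.

```python
from typing import List, Sequence

def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length vectors element-wise into one vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy 3-dimensional "embeddings" for documents with 2 and 3 chunks.
doc_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
doc_b = [[0.0, 3.0, 3.0], [3.0, 0.0, 3.0], [3.0, 3.0, 0.0]]

pooled_a = mean_pool(doc_a)
pooled_b = mean_pool(doc_b)
print(pooled_a, pooled_b)
```

Both pooled vectors have length 3 regardless of chunk count. Averaging does lose per-chunk detail, so storing one entry per chunk is another option worth weighing.
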

In [17]:
"""#Data Preprocessing 

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize WordNet lemmatizer
lemmatizer = WordNetLemmatizer()"""


In [18]:
"""
# Function to preprocess file_content
def preprocess_text(text):
    lines = text.split('\r\n')
    cleaned_lines = [line.strip() for line in lines if line.strip()]
    return ' '.join(cleaned_lines)

# Apply preprocessing function to file_content column
df1['file_content'] = df1['file_content'].apply(preprocess_text)

# Display the preprocessed DataFrame
print(df1)"""


In [19]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(raw_text, flag):
    # Remove SRT timestamp lines, then replace every non-letter character with a space
    text = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', str(raw_text))
    sentence = re.sub("[^a-zA-Z]", " ", str(text))
    
    # change sentence to lower case
    sentence = sentence.lower()

    # tokenize into words
    tokens = sentence.split()
    
    # remove stop words                
    clean_tokens = [t for t in tokens if not t in stopwords.words("english")]
    
    # Stemming/Lemmatization
    if(flag == 'stem'):
        clean_tokens = [stemmer.stem(word) for word in clean_tokens]
    else:
        clean_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]
    
    return pd.Series([" ".join(clean_tokens)])



In [20]:
df1=df1.sample(n=10000)

In [21]:
df1

Unnamed: 0,num,name,file_content
4850,9490214,everybody.hates.chris.s04.e15.everybody.hates....,"1\n00:00:01,635 --> 00:00:03,604\n*\n\n2\n00:0..."
15148,9469683,fringe.s01.e12.the.nobrainer.(2009).eng.1cd,"ï»¿1\r\n00:00:06,632 --> 00:00:09,431\r\nNo, m..."
15194,9271251,ncis.s20.e04.leave.no.trace.(2022).eng.1cd,"ï»¿1\r\n00:00:03,578 --> 00:00:05,366\r\n30 ye..."
1134,9417579,the.awesomes.s03.e04.awesomes.for.hire.(2015)....,"ï»¿1\r\n00:00:01,000 --> 00:00:02,300\r\nPrevi..."
24011,9435262,northern.exposure.s05.e09.a.cup.of.joe.(1993)....,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
...,...,...,...
20133,9299557,greys.anatomy.s05.e17.i.will.follow.you.into.t...,"ï»¿1\r\n00:00:02,069 --> 00:00:03,964\r\n<i>[D..."
2400,9231311,monster.s01.e10.the.past.that.was.erased.(2004...,ï»¿[Script Info]\r\n; Script generated by Aegi...
19550,9501000,alfred.hitchcock.presents.s01.e31.the.gentlema...,ï»¿[Script Info]\r\nTitle: Default file\r\nScr...
1352,9231152,ncis.s11.e07.better.angels.(2013).eng.1cd,"ï»¿1\r\n00:00:06,673 --> 00:00:09,404\r\nMcGEE..."
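
Several rows above begin with `[Script Info]`, which is the header of the SubStation Alpha (ASS/SSA) subtitle format rather than SRT, so the SRT timestamp pattern in `preprocess_text` would not match them. A minimal sketch for telling the two apart is shown below. `guess_subtitle_format` is a hypothetical helper, and the samples are shortened stand-ins for real rows.

```python
import re

SRT_TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
# A UTF-8 BOM as it appears after latin-1 decoding (the three characters ï»¿).
LATIN1_BOM = "\xef\xbb\xbf"

def guess_subtitle_format(text: str) -> str:
    """Return 'ass', 'srt', or 'unknown' based on simple markers in the text."""
    head = text.lstrip("\ufeff")          # drop a real BOM, if any
    if head.startswith(LATIN1_BOM):       # drop a latin-1-decoded BOM, if any
        head = head[len(LATIN1_BOM):]
    head = head.lstrip()
    if head.startswith("[Script Info]"):
        return "ass"
    if SRT_TIMING.search(head):
        return "srt"
    return "unknown"

samples = [
    LATIN1_BOM + "[Script Info]\r\nTitle: Default file\r\n",
    "1\r\n00:00:06,632 --> 00:00:09,431\r\nNo, m...",
    "just some text",
]
formats = [guess_subtitle_format(s) for s in samples]
print(formats)
```

Applied to `df1['file_content']`, a column of these labels would show how many rows need a separate cleaning path.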


In [24]:
from tqdm import tqdm

In [25]:
tqdm.pandas()


In [26]:
temp_df1 = df1["file_content"].progress_apply(lambda x: preprocess_text(x, 'lemma'))

  8%|█████▉                                                                      | 786/10000 [13:54<2:42:58,  1.06s/it]


KeyboardInterrupt: 
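
The run above was interrupted at roughly one second per document. A likely contributor is that `stopwords.words("english")` is called inside the list comprehension in `preprocess_text`, so the stop-word list is reloaded and scanned linearly for every token. The sketch below shows the usual remedy of building a set once. It uses a short hard-coded stop-word set as a stand-in so that it runs without NLTK data; `clean_tokens_fast` is a hypothetical name, and stemming or lemmatization is left out.

```python
import re

# Stand-in stop-word list so the sketch runs without NLTK data; in the notebook
# this would be built once as set(stopwords.words("english")).
STOP_WORDS = frozenset({"i", "the", "a", "to", "of", "in", "is", "you"})

TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def clean_tokens_fast(raw_text: str) -> list:
    """Strip SRT timings and non-letters, lowercase, and drop stop words."""
    text = TIMING.sub("", raw_text)
    text = re.sub("[^a-zA-Z]", " ", text).lower()
    # Set membership is O(1) per token; nothing is rebuilt inside the loop.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

sample = "1\r\n00:02:52,231 --> 00:02:54,709\r\nHe speaks of a new prophet in Arabia."
tokens = clean_tokens_fast(sample)
print(tokens)
```

The same idea carries over to the notebook function by defining the set once at module level and testing `t not in` that set.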

In [None]:
temp_df1.columns = ['clean_text_lemma']

temp_df1.head()

In [None]:
temp_df1.to_csv("subtitile.csv",index=False)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vocab2 = TfidfVectorizer()
subtitiles_tfidf1 = vocab2.fit_transform(temp_df1['clean_text_lemma'])

In [None]:
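# Apply the preprocessing function to the file_content column.
# preprocess_text requires a flag and returns a one-element Series, so take element 0.
df1['file_content'] = df1['file_content'].apply(lambda x: preprocess_text(x, 'lemma')[0])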

# Display the preprocessed DataFrame
print(df1)