<a href="https://colab.research.google.com/github/August-murr/Data_science_Demonstration/blob/main/Parallel%20Subtitle%20Collection%20and%20Alignment/Parallel_Subtitle_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parallel Subtitle Collection for Machine Translation

This notebook gathers and prepares parallel movie subtitle data for training or fine-tuning machine translation models.

## Project Objective
The primary objective of this notebook is to collect movie subtitle files from Subscene.com, a website with an extensive collection of movie subtitles. We use it to obtain subtitle files for the same movie in two languages.

## Data Preparation
The core of our data preparation is converting a pair of SRT (SubRip) subtitle files, one per language, into a single CSV file of aligned (parallel) text with one column per language. Parallel text of this kind is essential for training and evaluating machine translation models.

Please note that while the code provided in this notebook accomplishes the task, there is room for further optimization and refinement in the future, which may be achieved through ongoing development or external contributions.

Let's dive into the process of collecting, processing, and organizing subtitle data to support machine translation tasks.


## Libraries and dependencies

In [None]:
# Install the langdetect library for language detection
!pip install langdetect

# Import the necessary libraries and modules
import requests              # Used for making HTTP requests to websites
from bs4 import BeautifulSoup  # Helps parse and extract data from HTML pages
import os                    # Enables interaction with the file system
from time import sleep       # Adds delays to prevent overloading web servers
import shutil                # Provides file operations, useful for copying and moving files
import zipfile               # Helps with handling zip files
import pandas as pd          # Used for data manipulation and creating DataFrames
import chardet              # Detects the encoding of text files
from langdetect import detect  # Detects the language of text
import langdetect           # Library for language detection
import re                   # Allows working with regular expressions for text processing
import langdetect.lang_detect_exception  # Exception handling for language detection

## Subtitle Zip File Download with web scraping

The `wait` function serves a crucial role in managing HTTP requests to the Subscene website. Subscene's HTTP request limits can behave in unpredictable ways, sometimes requiring longer waiting times between responses and at other times not needing any delay at all. To navigate these variations and avoid overloading the server, we introduce controlled pauses between requests.

In [2]:
def wait(amount=2):
  """
  This function is used to introduce delays in the code to avoid making too many requests in quick succession,
  which can lead to server overload or rate limiting by websites.
  The default duration is set to 2 seconds, but you can customize it by providing a different 'amount' in seconds.
  """
  sleep(amount)
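
Because the delay Subscene needs is unpredictable, a fixed pause can be too short on some requests and wasteful on others. One possible refinement is to retry a failed request with pauses that double each time. The sketch below is not part of the original notebook: `fetch_with_backoff`, `max_attempts`, `base_delay`, and `sleep_fn` are hypothetical names, and `fetch` stands for any zero-argument callable that returns an object with a `status_code` attribute.

```python
from time import sleep


def fetch_with_backoff(fetch, max_attempts=4, base_delay=2, sleep_fn=sleep):
    """Call `fetch()` until it returns status 200 or the attempts run out.

    The pause doubles after every failed attempt: base_delay, 2 * base_delay, ...
    Returns the last response received.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code == 200:
            return response
        # Do not sleep after the final attempt
        if attempt < max_attempts - 1:
            sleep_fn(base_delay * 2 ** attempt)
    return response
```

With `requests`, it could be called as `fetch_with_backoff(lambda: requests.get(url, headers=headers))`.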

In the upcoming sections, we'll explore a set of functions that search Subscene.com for a movie by title and list the subtitles available in each language. The objective is to find a subtitle name that appears in both languages for the same movie. Subtitles with matching names often have synchronized timestamps, which simplifies the alignment process.

However, it's important to note that these functions are optimized for movie subtitles and may not perform as effectively with TV series subtitles. TV series subtitle websites often follow different structural patterns, which may require separate handling.

In [3]:
def get_subscene_subtitle_link(movie_title):
    # Define the URL and parameters for the search
    url = "https://subscene.com/subtitles/searchbytitle"
    params = {"query": movie_title}

    # Define headers to prevent errors (you can customize this as needed)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }
    # Send a GET request to Subscene with the specified parameters and headers
    response = requests.get(url, params=params, headers=headers)
    wait()
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the first result's link
        first_result = soup.find("div", class_="title")
        if first_result:
            link = first_result.find("a")["href"]
            full_link = "https://subscene.com" + link
            return full_link
        else:
            print(f"No subtitle link found for '{movie_title}'.")
            return None
    else:
        print(f"Failed to retrieve the movie's subtitle link for '{movie_title}'. Status code:", response.status_code)
        return None

In [4]:
def get_subscene_subtitle_names(movie_name, language):
    # Get the subtitle link for the movie
    movie_link = get_subscene_subtitle_link(movie_name)

    if movie_link:
        # Construct the URL for the specific language's subtitle page
        language_subtitle_url = f"{movie_link}/{language}"

        # Define headers to prevent errors (you can customize this as needed)
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        }
        # Send a GET request to the language-specific subtitle page
        response = requests.get(language_subtitle_url, headers=headers)
        wait()
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content of the page
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all subtitle entries
            subtitle_entries = soup.find_all("td", class_="a1")

            # Extract and return the names of positively rated subtitles
            subtitle_names = []
            for entry in subtitle_entries:
                span_elements = entry.find_all("span")
                if len(span_elements) == 2 and "positive-icon" in span_elements[0].get("class", []):
                    subtitle_name = span_elements[1].text.strip()
                    subtitle_names.append(subtitle_name)

            return subtitle_names
        else:
            print("Failed to retrieve the language-specific subtitle page. Status code:", response.status_code)
            return None
    else:
        print(f"No results found for the movie title: {movie_name}")
        return None

In [5]:
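def find_common_subtitles(list1, list2, priority=0):
    # Either list is None or empty when the language page could not be read
    if not list1 or not list2:
        print("No common subtitle names found.")
        return None

    # Keep the names that appear in both lists, in the order they appear in list1.
    # Indexing a set intersection directly gives an arbitrary order, so the same
    # `priority` could select a different subtitle on each run.
    names_in_list2 = set(list2)
    common_subtitles = []
    for name in list1:
        if name in names_in_list2 and name not in common_subtitles:
            common_subtitles.append(name)

    if not common_subtitles:
        print("No common subtitle names found.")
        return None

    # Guard against a priority that is larger than the number of matches
    if priority >= len(common_subtitles):
        print(f"Only {len(common_subtitles)} common subtitle names found; priority {priority} is out of range.")
        return None

    return common_subtitles[priority]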
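
Exact string comparison can miss subtitles whose names differ only in case or in the separator used between words. As a hedged illustration, the names could be normalized before comparing. Everything in this sketch is an assumption rather than part of the notebook: the helper names `normalize_release_name` and `common_release_names` are hypothetical, and the sample names in the usage note are made up.

```python
import re


def normalize_release_name(name):
    """Lowercase a name and collapse runs of spaces, dots, underscores and hyphens."""
    return re.sub(r"[\s._-]+", ".", name.strip().lower())


def common_release_names(names_1, names_2):
    """Return (name_1, name_2) pairs whose normalized forms match, in names_1 order."""
    by_normalized = {}
    for name in names_2:
        # Keep the first original spelling seen for each normalized form
        by_normalized.setdefault(normalize_release_name(name), name)

    pairs = []
    seen = set()
    for name in names_1:
        key = normalize_release_name(name)
        if key in by_normalized and key not in seen:
            pairs.append((name, by_normalized[key]))
            seen.add(key)
    return pairs
```

For example, `common_release_names(["Movie 2010 DVDRip"], ["movie.2010.dvdrip"])` pairs the two spellings. Returning the original spelling for each language matters because the page lookup searches for the name exactly as it is displayed.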

In [6]:
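def get_subtitle_href(movie_title, language, subtitle_name):
    movie_page_link = get_subscene_subtitle_link(movie_title)
    # Nothing to look up if the movie page or the subtitle name is missing
    if movie_page_link is None or subtitle_name is None:
        return None
    movie_page_link = movie_page_link + f"/{language}"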

    # Define headers to prevent errors (you can customize this as needed)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }

    # Send a GET request to the movie page
    response = requests.get(movie_page_link, headers=headers)
    wait()

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all subtitle entries
        subtitle_entries = soup.find_all("a", href=True)

        # Search for the subtitle name and return the href for positively rated subtitles
        for entry in subtitle_entries:
            spans = entry.find_all("span")
            if len(spans) >= 2 and subtitle_name in spans[1].text:
                if "positive-icon" in spans[0].get("class", []):
                    href = entry["href"]
                    return f"https://subscene.com{href}"

        print(f"No positively rated subtitle '{subtitle_name}' found on the page.")
        return None
    else:
        print("Failed to retrieve the movie page. Status code:", response.status_code)
        return None

The `download_subtitle_to` function, defined after the helper functions below, creates a folder named after the movie in your chosen directory and downloads the two zip files into it. It requires two languages as input because it is designed to find subtitles that are synced between those two languages. If you need subtitles in only one language, this code may not suit your needs.


In [7]:
def get_download_link(subtitle_page_link):
    # Define headers to prevent errors (you can customize this as needed)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }
    # Send a GET request to the subtitle page
    response = requests.get(subtitle_page_link,headers=headers)
    wait()
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the download link element
        download_link = soup.find("a", id="downloadButton")

        if download_link:
            # Extract the href attribute and create the full download link
            href = download_link.get("href")
            full_download_link = "https://subscene.com" + href
            return full_download_link
        else:
            print("Download link not found on the page.")
            return None
    else:
        print("Failed to retrieve the subtitle page. Status code:", response.status_code)
        return None

In [8]:
# Function to create a folder if it doesn't exist
def create_folder(folder_path):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

# Function to download a file and save it with a specific name
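def download_file(url, destination):
    # Define headers to prevent errors (you can customize this as needed)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }
    response = requests.get(url, headers=headers, stream=True)
    wait()
    # Raise on an HTTP error instead of saving an error page as a .zip file
    response.raise_for_status()
    with open(destination, 'wb') as out_file:
        # iter_content decodes gzip/deflate content encoding, which response.raw does not
        for chunk in response.iter_content(chunk_size=8192):
            out_file.write(chunk)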
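
A server that is rate limiting may return an HTML page where the archive was expected, and that page would then be saved with a `.zip` extension. As a minimal sketch of a check that could run after each download, the standard library's `zipfile` module can confirm that the file really is an archive. The helper name `is_valid_subtitle_zip` is hypothetical and not part of the notebook.

```python
import zipfile


def is_valid_subtitle_zip(path):
    """Return True if `path` is a readable zip archive with at least one entry."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        return archive.testzip() is None and len(archive.namelist()) > 0
```

A caller could delete the file and retry the download when this returns `False`.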

In [9]:
def download_subtitle_to(movie_title, language_1, language_2, save_to,priority=0):
    # Get subtitle lists for both languages
    language_1_list = get_subscene_subtitle_names(movie_title, language_1)
    language_2_list = get_subscene_subtitle_names(movie_title, language_2)
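    # Stop early if either language page could not be read
    if language_1_list is None or language_2_list is None:
        return None

    # Find the common subtitle name
    subtitle_name = find_common_subtitles(language_1_list, language_2_list, priority)
    if subtitle_name is None:
        return None

    # Get subtitle page links
    subtitle_page_link_1 = get_subtitle_href(movie_title, language_1, subtitle_name)
    subtitle_page_link_2 = get_subtitle_href(movie_title, language_2, subtitle_name)
    if subtitle_page_link_1 is None or subtitle_page_link_2 is None:
        return None

    # Get download links
    download_link_1 = get_download_link(subtitle_page_link_1)
    download_link_2 = get_download_link(subtitle_page_link_2)
    if download_link_1 is None or download_link_2 is None:
        return None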

    # Create the movie folder inside the "save_to" path
    movie_folder = os.path.join(save_to, movie_title)
    create_folder(movie_folder)

    # Download subtitle files directly to the "save_to" path
    download_file(download_link_1, os.path.join(movie_folder, f"{movie_title}_{language_1}.zip"))
    download_file(download_link_2, os.path.join(movie_folder, f"{movie_title}_{language_2}.zip"))
    return movie_folder
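
The folder and zip file names are built directly from `movie_title`, so a title containing a path separator or another reserved character (for example `Face/Off`) would produce an unexpected path. The sketch below shows one way the title could be cleaned first. `safe_folder_name` is a hypothetical helper, and the set of replaced characters is an assumption based on common Windows and POSIX restrictions.

```python
import re


def safe_folder_name(movie_title):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", movie_title)
    # Trailing dots and spaces are problematic on Windows
    cleaned = cleaned.strip().rstrip(". ")
    return cleaned or "untitled"
```

The result could be used in place of `movie_title` when building `movie_folder` and the zip file names, while the original title is still used for the search.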

## Text Alignment

Now that we have a pair of zip files containing subtitles for a movie, our next steps are to unzip them and use the data to create dataframes.

We have two types of dataframes: line_by_line and time_based.

1. **Line-by-line dataframes:** Each subtitle line in one language is paired with the line in the other language whose start time falls within a small threshold (one second by default). This approach is simpler and more convenient because each row holds one short, directly comparable pair. However, it can lead to information loss, because lines without a close enough match in the other file are dropped.

2. **Time-based dataframes:** All subtitles that start within the same minute of the movie are combined into a single row. This method is advantageous for languages with very different structures and grammar: not all languages can be translated sentence by sentence, and some translations depend on longer context.

Each method has its own set of advantages and disadvantages, and the choice between them depends on the specific characteristics of the subtitle content and the translation goals.
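
To make the difference concrete, here is a small standard-library sketch that parses a made-up three-block SRT sample and then groups the same blocks by minute. It does not use the notebook's functions. `SAMPLE_SRT`, `parse_blocks`, and `group_by_minute` are illustrative names only.

```python
SAMPLE_SRT = """1
00:00:58,000 --> 00:00:59,500
Hello.

2
00:01:02,000 --> 00:01:03,000
How are you?

3
00:01:40,000 --> 00:01:42,000
Fine, thanks.
"""


def parse_blocks(srt_text):
    """Return a list of (start_timestamp, text) tuples, one per subtitle block."""
    rows = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.strip().split("\n")
        if len(lines) >= 3 and " --> " in lines[1]:
            start = lines[1].split(" --> ")[0]
            rows.append((start, "\n".join(lines[2:])))
    return rows


def group_by_minute(rows):
    """Concatenate the text of all blocks that start within the same (hour, minute)."""
    grouped = {}
    for start, text in rows:
        # Timestamps look like HH:MM:SS,mmm
        hour, minute = int(start[0:2]), int(start[3:5])
        grouped.setdefault((hour, minute), []).append(text)
    return {key: "\n".join(texts) for key, texts in grouped.items()}
```

The line-by-line view keeps three rows, each with its own start time. The time-based view keeps two rows, because the second and third blocks both start in minute 1.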


In [10]:
def unzip_zipfile(zipfile_path):
    try:
        # Get the folder where the zip file is located
        folder = os.path.dirname(zipfile_path)

        # Create a list to store the paths of the extracted files
        extracted_file_paths = []

        # Open the zip file
        with zipfile.ZipFile(zipfile_path, 'r') as zip_ref:
            # Extract all files in the zip to the same folder
            zip_ref.extractall(folder)

            # Get the names of the extracted files
            extracted_files = zip_ref.namelist()

            # Create paths for the extracted files
            for extracted_file in extracted_files:
                extracted_file_path = os.path.join(folder, extracted_file)
                extracted_file_paths.append(extracted_file_path)

        return extracted_file_paths

    except Exception as e:
        print("Error:", str(e))
        return None
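
The alignment functions later in the notebook take the first path returned by `unzip_zipfile`, which is the first entry in the archive. If an archive also contains other files, that entry may not be the subtitle. A hedged alternative is to select the first `.srt` member explicitly. `extract_first_srt` is a hypothetical helper, not part of the notebook.

```python
import os
import zipfile


def extract_first_srt(zipfile_path):
    """Extract the first .srt member next to the archive and return its path.

    Returns None if the archive contains no .srt file.
    """
    folder = os.path.dirname(zipfile_path)
    with zipfile.ZipFile(zipfile_path, "r") as archive:
        srt_members = [name for name in archive.namelist() if name.lower().endswith(".srt")]
        if not srt_members:
            return None
        # extract() returns the path the member was written to
        return archive.extract(srt_members[0], folder)
```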

In [11]:
def remove_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
    clean_text = soup.get_text()
    return clean_text

In [12]:
import os
import chardet
import pandas as pd

def open_srt_as_dataframe(file_path):
    try:
        # Determine the file size to read a portion for encoding detection
        file_size = os.path.getsize(file_path)
        bytes_to_read = min(1024, file_size)  # Read at most 1024 bytes for detection

        # Read a portion of the file for encoding detection
        with open(file_path, 'rb') as file:
            raw_data = file.read(bytes_to_read)

        # Detect the encoding
        result = chardet.detect(raw_data)

        # Get the detected encoding
        detected_encoding = result['encoding']

        # Open the file with the detected encoding
        with open(file_path, 'r', encoding=detected_encoding, errors='replace') as file:
            srt_content = file.read()

        # Remove HTML tags from SRT content
        clean_srt_content = remove_html_tags(srt_content)

        # Split cleaned SRT content into individual subtitle blocks
        subtitle_blocks = clean_srt_content.strip().split('\n\n')

        # Parse the subtitle blocks into a DataFrame
        data = {'number': [], 'start': [], 'end': [], 'content': []}
        for block in subtitle_blocks:
            lines = block.strip().split('\n')
            if len(lines) >= 3:
                try:
                    data['number'].append(int(lines[0]))
                    start, end = lines[1].split(' --> ')
                    data['start'].append(pd.to_datetime(start, format='%H:%M:%S,%f'))
                    data['end'].append(pd.to_datetime(end, format='%H:%M:%S,%f'))
                    data['content'].append('\n'.join(lines[2:]))
                except ValueError:
                    continue  # Skip subtitle blocks that cannot be parsed
            else:
                continue  # Skip incomplete subtitle blocks

        df = pd.DataFrame(data)

        return df

    except Exception as e:
        print("Error:", str(e))
        return None

After extensive testing of these functions on numerous movie subtitles, it became evident that many Arabic subtitles were still opened with the wrong encoding, even though an encoding had been detected. To address this, several additional functions were introduced, each with the suffix `_arabic`. They are designed specifically for Arabic subtitles and should be used whenever one of the two languages is Arabic.
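
The `_arabic` functions in the notebook decide whether to fall back to `windows-1256` by running language detection on the decoded text. A simpler alternative, sketched here as an assumption and not as the notebook's method, is to attempt a strict UTF-8 decode first and fall back only if it fails. Non-ASCII text stored in a legacy single-byte encoding is rarely valid UTF-8, so a successful strict decode is a reasonable signal. `decode_subtitle_bytes` is a hypothetical helper.

```python
def decode_subtitle_bytes(raw_bytes, fallback_encoding="windows-1256"):
    """Decode strictly as UTF-8 first; fall back to `fallback_encoding` if that fails.

    Returns a (text, encoding_used) tuple.
    """
    try:
        # "utf-8-sig" also strips a leading byte-order mark if one is present
        return raw_bytes.decode("utf-8-sig"), "utf-8-sig"
    except UnicodeDecodeError:
        return raw_bytes.decode(fallback_encoding, errors="replace"), fallback_encoding
```

The bytes would come from reading the whole file in binary mode, for example `open(file_path, 'rb').read()`.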


In [13]:
def open_srt_as_dataframe_arabic(file_path):
    try:
        # Determine the file size to read a portion for encoding detection
        file_size = os.path.getsize(file_path)
        bytes_to_read = min(1024, file_size)  # Read at most 1024 bytes for detection

        # Read a portion of the file for encoding detection
        with open(file_path, 'rb') as file:
            raw_data = file.read(bytes_to_read)

        # Detect the encoding
        result = chardet.detect(raw_data)

        # Get the detected encoding
        detected_encoding = result['encoding']

        # Open the file with the detected encoding
        with open(file_path, 'r', encoding=detected_encoding, errors='replace') as file:
            srt_content = file.read()

        # Detect the language of the first 1000 characters using langdetect
        try:
            detected_language = detect(srt_content[:1000])
        except langdetect.lang_detect_exception.LangDetectException:
            detected_language = "unknown"

        # If the detected language is not Arabic, the detected encoding was probably wrong:
        # read the file again as windows-1256
        if detected_language != "ar":
            detected_encoding = "windows-1256"
            with open(file_path, 'r', encoding=detected_encoding, errors='replace') as file:
                srt_content = file.read()

        # Remove HTML tags from SRT content
        clean_srt_content = remove_html_tags(srt_content)

        # Remove text within curly braces like {text}
        clean_srt_content = re.sub(r'\{.*?\}', '', clean_srt_content)

        # Split cleaned SRT content into individual subtitle blocks
        subtitle_blocks = clean_srt_content.strip().split('\n\n')

        # Parse the subtitle blocks into a DataFrame
        data = {'number': [], 'start': [], 'end': [], 'content': []}
        for block in subtitle_blocks:
            lines = block.strip().split('\n')
            if len(lines) >= 3:
                try:
                    data['number'].append(int(lines[0]))
                    start, end = lines[1].split(' --> ')
                    data['start'].append(pd.to_datetime(start, format='%H:%M:%S,%f'))
                    data['end'].append(pd.to_datetime(end, format='%H:%M:%S,%f'))
                    data['content'].append('\n'.join(lines[2:]))
                except ValueError:
                    continue  # Skip subtitle blocks that cannot be parsed
            else:
                continue  # Skip incomplete subtitle blocks

        df = pd.DataFrame(data)

        return df

    except Exception as e:
        print("Error:", str(e))
        return None

In [14]:
def remove_curly_braces(text):
    return re.sub(r'\{.*?\}', '', text)

def open_srt_as_dataframe_time_based(file_path):
    try:
        # Determine the file size to read a portion for encoding detection
        file_size = os.path.getsize(file_path)
        bytes_to_read = min(1024, file_size)  # Read at most 1024 bytes for detection

        # Read a portion of the file for encoding detection
        with open(file_path, 'rb') as file:
            raw_data = file.read(bytes_to_read)

        # Detect the encoding
        result = chardet.detect(raw_data)

        # Get the detected encoding
        detected_encoding = result['encoding']

        # Open the file with the detected encoding
        with open(file_path, 'r', encoding=detected_encoding, errors='replace') as file:
            srt_content = file.read()

        # Remove HTML tags from SRT content
        clean_srt_content = remove_html_tags(srt_content)

        # Split cleaned SRT content into individual subtitle blocks
        subtitle_blocks = clean_srt_content.strip().split('\n\n')

        # Initialize data for the DataFrame
        data = {'hour': [], 'minute': [], 'content': []}

        current_hour, current_minute = 0, 0
        current_content = ""

        for block in subtitle_blocks:
            lines = block.strip().split('\n')

            if len(lines) >= 3:
                # Extract the start time; skip blocks whose timestamp cannot be parsed
                try:
                    start_time = pd.to_datetime(lines[1].split(' --> ')[0], format='%H:%M:%S,%f')
                except ValueError:
                    continue
                hour, minute = start_time.hour, start_time.minute

                if hour != current_hour or minute != current_minute:
                    if current_content:
                        data['hour'].append(current_hour)
                        data['minute'].append(current_minute)
                        data['content'].append(current_content)

                    current_hour, current_minute = hour, minute
                    current_content = remove_curly_braces('\n'.join(lines[2:]))
                else:
                    current_content += "\n" + remove_curly_braces('\n'.join(lines[2:]))

        if current_content:
            data['hour'].append(current_hour)
            data['minute'].append(current_minute)
            data['content'].append(current_content)

        df = pd.DataFrame(data)

        return df

    except Exception as e:
        print("Error:", str(e))
        return None

In [15]:
def create_parallel_dataset(language_1_subtitles, language_2_subtitles, time_threshold=1):
    # Pair every language-1 subtitle with every language-2 subtitle using a cross join
    merged_df = pd.merge(language_1_subtitles, language_2_subtitles, how='cross')

    # Calculate the time difference in seconds
    merged_df["time_difference"] = (merged_df["start_y"] - merged_df["start_x"]).dt.total_seconds()

    # Filter the rows where the time difference is less than the threshold
    parallel_df = merged_df[abs(merged_df["time_difference"]) < time_threshold]

    # Create a new dataframe with "language_1" and "language_2" columns
    language_1_language_2 = parallel_df[["content_x", "content_y"]]

    # Rename the columns to "language_1" and "language_2"
    language_1_language_2 = language_1_language_2.rename(columns={"content_x": "language_1", "content_y": "language_2"})

    return language_1_language_2
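
The cross join builds one row for every combination of lines, so two files of about a thousand lines each produce about a million intermediate rows. A possible alternative is `pandas.merge_asof`, which matches each row to the nearest key within a tolerance without forming all pairs. The sketch below is not the notebook's method, and its behavior differs in two ways: each language-1 line gets at most one partner, and the tolerance is inclusive, whereas the filter above uses a strict `<`. `align_nearest` is a hypothetical name. The inputs are assumed to have the `start` and `content` columns produced by `open_srt_as_dataframe`.

```python
import pandas as pd


def align_nearest(language_1_subtitles, language_2_subtitles, time_threshold=1):
    """Pair each language-1 line with the nearest-starting language-2 line, if close enough."""
    # merge_asof requires both frames to be sorted by the key column
    left = language_1_subtitles[["start", "content"]].sort_values("start")
    right = language_2_subtitles[["start", "content"]].sort_values("start")
    merged = pd.merge_asof(
        left,
        right,
        on="start",
        direction="nearest",
        tolerance=pd.Timedelta(seconds=time_threshold),
        suffixes=("_x", "_y"),
    )
    # Lines without a partner inside the tolerance get NaN in content_y
    merged = merged.dropna(subset=["content_y"])
    merged = merged.rename(columns={"content_x": "language_1", "content_y": "language_2"})
    return merged[["language_1", "language_2"]].reset_index(drop=True)
```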

In [16]:
def create_parallel_dataset_time_based(df1, df2):
    # Merge the 'content' columns from both DataFrames
    merged_df = pd.merge(df1, df2, left_on=['hour', 'minute'], right_on=['hour', 'minute'], how='outer')

    # Rename the 'content' columns from each DataFrame
    merged_df.rename(columns={'content_x': 'language_1', 'content_y': 'language_2'}, inplace=True)

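    # Fill NaN values with empty strings. Assign the result back: calling
    # fillna(inplace=True) on a selected column is deprecated in recent pandas
    # versions and may leave merged_df unchanged.
    merged_df['language_1'] = merged_df['language_1'].fillna('')
    merged_df['language_2'] = merged_df['language_2'].fillna('')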

    # Drop the 'hour' and 'minute' columns
    merged_df.drop(['hour', 'minute'], axis=1, inplace=True)

    return merged_df

In [17]:
def align_time_based(directory_path):
    # List all zip files in the specified directory
    zip_files = list_zip_files(directory_path)

    # Sort the zip files to ensure the "english.zip" comes first
    zip_files = sorted(zip_files, key=custom_sort_key)

    # Extract and parse the first SRT file from the first zip file
    zipfile_path = zip_files[0]
    language_1_subtitles = open_srt_as_dataframe_time_based(unzip_zipfile(zipfile_path)[0])

    # Extract and parse the second SRT file from the second zip file
    zipfile_path = zip_files[1]
    language_2_subtitles = open_srt_as_dataframe_time_based(unzip_zipfile(zipfile_path)[0])

    # Create a parallel dataset with aligned subtitles
    parallel_dataset = create_parallel_dataset_time_based(language_1_subtitles, language_2_subtitles)

    # Save the parallel dataset as a CSV in the same directory
    csv_path = os.path.join(directory_path, "parallel_subtitle_time_based.csv")
    parallel_dataset.to_csv(csv_path, index=False)

    # Print a success message
    print("Alignment and Time Based CSV creation completed successfully.")

In [18]:
def align_time_based_arabic(directory_path):
    # List all zip files in the specified directory
    zip_files = list_zip_files(directory_path)

    # Sort the zip files to ensure the "english.zip" comes first
    zip_files = sorted(zip_files, key=custom_sort_key)

    # Extract and parse the first SRT file from the first zip file
    zipfile_path = zip_files[0]
    language_1_subtitles = open_srt_as_dataframe_time_based(unzip_zipfile(zipfile_path)[0])

    # Extract and parse the second SRT file from the second zip file
    zipfile_path = zip_files[1]
    language_2_subtitles = open_srt_as_dataframe_time_based_arabic(unzip_zipfile(zipfile_path)[0])

    # Create a parallel dataset with aligned subtitles
    parallel_dataset = create_parallel_dataset_time_based(language_1_subtitles, language_2_subtitles)

    # Save the parallel dataset as a CSV in the same directory
    csv_path = os.path.join(directory_path, "parallel_subtitle_time_based.csv")
    parallel_dataset.to_csv(csv_path, index=False)

    # Print a success message
    print("Alignment and Time Based CSV creation completed successfully.")

In [19]:
def list_zip_files(directory):
    zip_files = []

    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".zip"):
                zip_files.append(os.path.join(root, file))

    return zip_files

In [20]:
# Custom sorting key function
def custom_sort_key(item):
    if item.endswith("english.zip"):
        return (0, item)  # Assign a lower sort key for strings ending with "english.zip"
    else:
        return (1, item)  # Assign a higher sort key for other strings

In [21]:
def open_srt_as_dataframe_time_based_arabic(file_path):
    try:
        # Determine the file size to read a portion for encoding detection
        file_size = os.path.getsize(file_path)
        bytes_to_read = min(1024, file_size)  # Read at most 1024 bytes for detection

        # Read a portion of the file for encoding detection
        with open(file_path, 'rb') as file:
            raw_data = file.read(bytes_to_read)

        # Detect the encoding
        result = chardet.detect(raw_data)

        # Get the detected encoding
        detected_encoding = result['encoding']

        # Open the file with the detected encoding
        with open(file_path, 'r', encoding=detected_encoding, errors='replace') as file:
            srt_content = file.read()

        # Detect the language of the first 1000 characters using langdetect
        try:
            detected_language = detect(srt_content[:1000])
        except langdetect.lang_detect_exception.LangDetectException:
            detected_language = "unknown"

        # If the detected language is not Arabic, the detected encoding was probably wrong:
        # read the file again as windows-1256
        if detected_language != "ar":
            detected_encoding = "windows-1256"
            with open(file_path, 'r', encoding=detected_encoding, errors='replace') as file:
                srt_content = file.read()

        # Remove text within curly braces like {text}
        srt_content = re.sub(r'\{.*?\}', '', srt_content)

        # Remove HTML tags from SRT content
        clean_srt_content = re.sub(r'<[^>]*>', '', srt_content)

        # Split cleaned SRT content into individual subtitle blocks
        subtitle_blocks = clean_srt_content.strip().split('\n\n')

        # Initialize data for the DataFrame
        data = {'hour': [], 'minute': [], 'content': []}

        current_hour, current_minute = 0, 0
        current_content = ""

        for block in subtitle_blocks:
            lines = block.strip().split('\n')

            if len(lines) >= 3:
                # Extract the start time; skip blocks whose timestamp cannot be parsed
                try:
                    start_time = pd.to_datetime(lines[1].split(' --> ')[0], format='%H:%M:%S,%f')
                except ValueError:
                    continue
                hour, minute = start_time.hour, start_time.minute

                if hour != current_hour or minute != current_minute:
                    if current_content:
                        data['hour'].append(current_hour)
                        data['minute'].append(current_minute)
                        data['content'].append(current_content)

                    current_hour, current_minute = hour, minute
                    current_content = remove_curly_braces('\n'.join(lines[2:]))
                else:
                    current_content += "\n" + remove_curly_braces('\n'.join(lines[2:]))

        if current_content:
            data['hour'].append(current_hour)
            data['minute'].append(current_minute)
            data['content'].append(current_content)

        df = pd.DataFrame(data)

        return df

    except Exception as e:
        print("Error:", str(e))
        return None

In [22]:
def align(directory_path, time_threshold=1):
    # List all zip files in the specified directory
    zip_files = list_zip_files(directory_path)

    # Sort the zip files to ensure the "english.zip" comes first
    zip_files = sorted(zip_files, key=custom_sort_key)

    # Extract and parse the first SRT file from the first zip file
    zipfile_path = zip_files[0]
    language_1_subtitles = open_srt_as_dataframe(unzip_zipfile(zipfile_path)[0])

    # Extract and parse the second SRT file from the second zip file
    zipfile_path = zip_files[1]
    language_2_subtitles = open_srt_as_dataframe(unzip_zipfile(zipfile_path)[0])

    # Create a parallel dataset with aligned subtitles
    parallel_dataset = create_parallel_dataset(language_1_subtitles, language_2_subtitles, time_threshold)

    # Reset the index after filtering, discarding the old index values
    parallel_dataset = parallel_dataset.reset_index(drop=True)

    # Save the parallel dataset as a CSV in the same directory
    csv_path = os.path.join(directory_path, "parallel_subtitle_line_by_line.csv")
    parallel_dataset.to_csv(csv_path, index=False)

    # Print a success message
    print("Alignment and CSV creation completed successfully.")

In [23]:
def align_arabic(directory_path, time_threshold=1):
    # List all zip files in the specified directory
    zip_files = list_zip_files(directory_path)

    # Sort the zip files to ensure the "english.zip" comes first
    zip_files = sorted(zip_files, key=custom_sort_key)

    # Extract and parse the first SRT file from the first zip file
    zipfile_path = zip_files[0]
    language_1_subtitles = open_srt_as_dataframe(unzip_zipfile(zipfile_path)[0])

    # Extract and parse the second SRT file from the second zip file
    zipfile_path = zip_files[1]
    language_2_subtitles = open_srt_as_dataframe_arabic(unzip_zipfile(zipfile_path)[0])

    # Create a parallel dataset with aligned subtitles
    parallel_dataset = create_parallel_dataset(language_1_subtitles, language_2_subtitles, time_threshold)

    # Reset the index after filtering, discarding the old index values
    parallel_dataset = parallel_dataset.reset_index(drop=True)

    # Save the parallel dataset as a CSV in the same directory
    csv_path = os.path.join(directory_path, "parallel_subtitle_line_by_line.csv")
    parallel_dataset.to_csv(csv_path, index=False)

    # Print a success message
    print("Alignment and CSV creation completed successfully.")

In [24]:
def has_csv_file(file_path):
    # A missing path, or a path that is not a directory, cannot contain a CSV
    if not os.path.isdir(file_path):
        return False

    # List all files in the directory
    files = os.listdir(file_path)

    # Check if any file has a .csv extension
    for file in files:
        if file.endswith('.csv'):
            return True

    return False

In [25]:
def delete_folder_and_contents(folder_path):
    try:
        # Check if the folder exists
        if not os.path.exists(folder_path) or not os.path.isdir(folder_path):
            print("Folder does not exist or is not a directory.")
            return

        # Delete the entire folder and its contents
        shutil.rmtree(folder_path)
        print(f"Deleted folder and its contents: {folder_path}")

    except Exception as e:
        print(f"Error: {str(e)}")

In [26]:
def find_csv_in_folder(folder_path, filename="parallel_subtitle_line_by_line.csv"):
    try:
        # Check if the folder exists
        if not os.path.exists(folder_path) or not os.path.isdir(folder_path):
            print("Folder does not exist or is not a directory.")
            return None

        # Iterate through the files in the folder
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                if file == filename:
                    # Found the CSV file with the specified name
                    return os.path.join(root, file)

        print(f"CSV file '{filename}' not found in the folder.")
        return None

    except Exception as e:
        print(f"Error: {str(e)}")
        return None

A further helper detects the language of each column in the generated CSV. This is a sanity check that the downloaded subtitle pair really is in the two requested languages and that the files are not corrupted.


In [27]:
def detect_language_from_csv_column(csv_path, column_name):
    try:
        # Read the CSV file into a DataFrame
        df = pd.read_csv(csv_path)

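        # Join the first 20 non-empty rows of the column, separated by spaces
        # so that words from adjacent rows are not glued together
        first_20_rows = df[column_name].dropna().head(20).astype(str).str.cat(sep=" ")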

        # Detect the language of the concatenated string
        detected_language = detect(first_20_rows)

        return detected_language
    except Exception as e:
        print("Error:", str(e))
        return None
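
The detector reports short codes such as `en` or `fr`, while the download functions take language names such as `english`. The following is a minimal sketch of turning the printed check into an automatic one. The `NAME_TO_CODE` mapping and the `language_pair_matches` helper are assumptions for illustration and are not part of this notebook.

```python
# Hypothetical mapping from the language names used in this notebook
# to the codes reported by the detector (extend as needed).
NAME_TO_CODE = {"english": "en", "french": "fr", "arabic": "ar"}


def language_pair_matches(requested, detected, name_to_code=NAME_TO_CODE):
    """Return True when every detected code matches the requested language name.

    requested: tuple of language names, e.g. ("english", "french")
    detected:  tuple of detector outputs, e.g. ("en", "fr"); None means detection failed
    """
    if len(requested) != len(detected):
        return False
    for name, code in zip(requested, detected):
        expected = name_to_code.get(name.lower())
        # Unknown language names and failed detections count as a mismatch
        if expected is None or code != expected:
            return False
    return True


print(language_pair_matches(("english", "french"), ("en", "fr")))  # True
print(language_pair_matches(("english", "french"), ("en", None)))  # False
```

A mismatch could be treated like a failed alignment, so that another subtitle pair is tried.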

In [28]:
def delete_non_zip_csv_files(folder_path):
    try:
        for root, dirs, files in os.walk(folder_path):
            for filename in files:
                file_path = os.path.join(root, filename)
                if not (filename.endswith(".zip") or filename.endswith(".csv")):
                    os.remove(file_path)
        return True
    except Exception as e:
        print("Error:", str(e))
        return False

The final function performs a fixed sequence of tasks. It creates a folder named after the movie and downloads two subtitle zip files into it, then calls the align function to generate a parallel subtitle file. If no CSV is produced, the folder is deleted, a different pair of zip files is downloaded (`priority=1`, then `priority=2`), and the align function is run again, for at most three attempts in total. The language detector then prints the detected language of each CSV column for verification. Finally, the function removes every file that is not a zip or CSV file to keep the folder tidy. The same wrapper is defined four times, once per align function: line by line, time based, and the Arabic variant of each.


In [29]:
def create_parallel_subtitles_line_by_line(movie_title, language_1, language_2, save_to, time_threshold=0.7):
    # Attempt 0 downloads the default subtitle pair; attempts 1 and 2 ask for another pair via `priority`
    for attempt in range(3):
        extra = {"priority": attempt} if attempt else {}
        folder = download_subtitle_to(movie_title, language_1, language_2, save_to, **extra)
        try:
            align(folder, time_threshold)
        except Exception as e:
            print(f"Alignment attempt {attempt + 1} failed: {e}")
        if has_csv_file(folder):
            break
        # No CSV was produced: discard this pair before retrying (the last attempt is kept)
        if attempt < 2:
            delete_folder_and_contents(folder)
    csv_path = find_csv_in_folder(folder)
    # Only report the language pair when a CSV actually exists
    if csv_path is not None:
        first_lan = detect_language_from_csv_column(csv_path, "language_1")
        second_lan = detect_language_from_csv_column(csv_path, "language_2")
        print(f"language pair of {os.path.basename(folder)} is {first_lan}:{second_lan}")
    delete_non_zip_csv_files(folder)

In [30]:
def create_parallel_subtitles_time_based(movie_title, language_1, language_2, save_to, time_threshold=0.7):
    # `time_threshold` is not used by the time-based aligner; it is kept for a uniform signature
    # Attempt 0 downloads the default subtitle pair; attempts 1 and 2 ask for another pair via `priority`
    for attempt in range(3):
        extra = {"priority": attempt} if attempt else {}
        folder = download_subtitle_to(movie_title, language_1, language_2, save_to, **extra)
        try:
            align_time_based(folder)
        except Exception as e:
            print(f"Alignment attempt {attempt + 1} failed: {e}")
        if has_csv_file(folder):
            break
        # No CSV was produced: discard this pair before retrying (the last attempt is kept)
        if attempt < 2:
            delete_folder_and_contents(folder)
    csv_path = find_csv_in_folder(folder)
    # Only report the language pair when a CSV actually exists
    if csv_path is not None:
        first_lan = detect_language_from_csv_column(csv_path, "language_1")
        second_lan = detect_language_from_csv_column(csv_path, "language_2")
        print(f"language pair of {os.path.basename(folder)} is {first_lan}:{second_lan}")
    delete_non_zip_csv_files(folder)

In [31]:
def create_parallel_subtitles_arabic_line_by_line(movie_title, language_1, language_2, save_to, time_threshold=0.7):
    # Attempt 0 downloads the default subtitle pair; attempts 1 and 2 ask for another pair via `priority`
    for attempt in range(3):
        extra = {"priority": attempt} if attempt else {}
        folder = download_subtitle_to(movie_title, language_1, language_2, save_to, **extra)
        try:
            align_arabic(folder, time_threshold)
        except Exception as e:
            print(f"Alignment attempt {attempt + 1} failed: {e}")
        if has_csv_file(folder):
            break
        # No CSV was produced: discard this pair before retrying (the last attempt is kept)
        if attempt < 2:
            delete_folder_and_contents(folder)
    csv_path = find_csv_in_folder(folder)
    # Only report the language pair when a CSV actually exists
    if csv_path is not None:
        first_lan = detect_language_from_csv_column(csv_path, "language_1")
        second_lan = detect_language_from_csv_column(csv_path, "language_2")
        print(f"language pair of {os.path.basename(folder)} is {first_lan}:{second_lan}")
    delete_non_zip_csv_files(folder)

In [32]:
def create_parallel_subtitles_arabic_time_based(movie_title, language_1, language_2, save_to, time_threshold=0.7):
    # `time_threshold` is not used by the time-based aligner; it is kept for a uniform signature
    # Attempt 0 downloads the default subtitle pair; attempts 1 and 2 ask for another pair via `priority`
    for attempt in range(3):
        extra = {"priority": attempt} if attempt else {}
        folder = download_subtitle_to(movie_title, language_1, language_2, save_to, **extra)
        try:
            align_time_based_arabic(folder)
        except Exception as e:
            print(f"Alignment attempt {attempt + 1} failed: {e}")
        if has_csv_file(folder):
            break
        # No CSV was produced: discard this pair before retrying (the last attempt is kept)
        if attempt < 2:
            delete_folder_and_contents(folder)
    csv_path = find_csv_in_folder(folder)
    # Only report the language pair when a CSV actually exists
    if csv_path is not None:
        first_lan = detect_language_from_csv_column(csv_path, "language_1")
        second_lan = detect_language_from_csv_column(csv_path, "language_2")
        print(f"language pair of {os.path.basename(folder)} is {first_lan}:{second_lan}")
    delete_non_zip_csv_files(folder)
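
The four wrappers above differ only in which align function they call. The following is a minimal sketch of how the shared retry logic could live in one helper. `retry_with_priorities` and its `download`, `process`, `succeeded` and `discard` callables are hypothetical names for illustration, and the demo uses stand-in functions rather than the notebook's own. Treating priority 0 as the default download is also an assumption.

```python
def retry_with_priorities(download, process, succeeded, priorities=(0, 1, 2), discard=None):
    """Try each priority in turn until `succeeded(folder)` is true.

    download(priority) returns a folder, process(folder) may raise,
    discard(folder) cleans up a failed attempt. Returns (folder, ok).
    """
    folder = None
    for i, priority in enumerate(priorities):
        folder = download(priority)
        try:
            process(folder)
        except Exception as error:
            print(f"attempt with priority {priority} failed: {error}")
        if succeeded(folder):
            return folder, True
        # Keep the last failed folder so the caller can inspect it
        if discard is not None and i < len(priorities) - 1:
            discard(folder)
    return folder, False


# Stand-in callables: only the pair with priority 2 aligns successfully
done = set()


def fake_download(priority):
    return f"/tmp/movie_p{priority}"


def fake_align(folder):
    if not folder.endswith("p2"):
        raise ValueError("subtitle files do not line up")
    done.add(folder)


folder, ok = retry_with_priorities(fake_download, fake_align, lambda f: f in done)
print(folder, ok)  # /tmp/movie_p2 True
```

In this notebook, `download` would wrap `download_subtitle_to`, `process` would be one of the align functions, `succeeded` would be `has_csv_file`, and `discard` would be `delete_folder_and_contents`.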

### Example

In [33]:
movie = "Blade Runner 2049"
language_1 = "english"
language_2 = "french"
save_to = "/content"  # path to save the downloaded subtitles

In [34]:
create_parallel_subtitles_line_by_line(movie, language_1, language_2, save_to)
create_parallel_subtitles_time_based(movie, language_1, language_2, save_to)

Alignment and CSV creation completed successfully.
language pair of Blade Runner 2049 is en:fr
Alignment and Time Based CSV creation completed successfully.
language pair of Blade Runner 2049 is en:fr


I've also uploaded the parallel subtitle files to my Google Drive, from which they can be downloaded with gdown.

In [None]:
!pip install gdown

In [36]:
import gdown

In [37]:
# Define the file ID and the output filename
file_id = "1Iw0TonQcDTAfJfzElasg6pt92322zem6"
output_filename = "Blade_Runner_2049_line_by_line.csv"

# Download the file from Google Drive
gdown.download(f"https://drive.google.com/uc?id={file_id}", output_filename, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1Iw0TonQcDTAfJfzElasg6pt92322zem6
To: /content/Blade_Runner_2049_line_by_line.csv
100%|██████████| 61.4k/61.4k [00:00<00:00, 57.0MB/s]


'Blade_Runner_2049_line_by_line.csv'

In [38]:
# Define the file ID and the output filename
file_id = "16zSr8T7TOMg4wkc6-3rLdTYbqr7bG06j"
output_filename = "Blade_Runner_2049_time_based.csv"

# Download the file from Google Drive
gdown.download(f"https://drive.google.com/uc?id={file_id}", output_filename, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=16zSr8T7TOMg4wkc6-3rLdTYbqr7bG06j
To: /content/Blade_Runner_2049_time_based.csv
100%|██████████| 70.9k/70.9k [00:00<00:00, 73.0MB/s]


'Blade_Runner_2049_time_based.csv'

In [39]:
line_by_line = pd.read_csv("/content/Blade_Runner_2049_line_by_line.csv")
time_based = pd.read_csv("/content/Blade_Runner_2049_time_based.csv")

In [40]:
line_by_line

| Unnamed: 0 | language_1 | language_2 |
|---|---|---|
| 0 | K: I hope you don't mind me\ntaking the liberty. | Je me suis permis, Ca ne vous dérange pas,\nj’... |
| 1 | I was careful\nnot to drag in any dirt. | J’ai fait attention à ne pas laisser\nentrer d... |
| 2 | SAPPER: I don't mind the dirt. | Je me fiche de la poussière. |
| 3 | I do mind | Mais pas pareil |
| 4 | unannounced visits. | …pour visites inattendues. |
| ... | ... | ... |
| 950 | Who am I to you? | Qu'est ce que je suis pour toi? |
| 951 | Go meet your daughter. | Va voir ta fille. |
| 952 | You okay? | Ca va aller? |
| 953 | Just a moment. | Un instant s'il vous plait? |


In [41]:
time_based

| Unnamed: 0 | language_1 | language_2 |
|---|---|---|
| 0 | \nSubtitles by explosiveskull | \nPh3nIc3\nEnjoy and Merry Christmas !!! Joyeu... |
| 1 | (ALARM BUZZES)\n(BREATHES DEEPLY) | |
| 2 | (AIRCRAFT APPROACHING) | |
| 3 | (SPRAY HISSING)\n(HISSING STOPS)\n(DRONE HOVER... | |
| 4 | (POT BOILING)\n(TAP RUNNING)\n(TAP STOPS)\nK: ... | Je me suis permis, Ca ne vous dérange pas,\nj’... |
| ... | ... | ... |
| 141 | You shoulda let me\ndie out there.\nK: You did... | Tu aurais du me laisser mourir là dedans.\nC'e... |
| 142 | All the best memories\nare hers.\nWhy?\nWho am... | Tous les meilleurs souvenirs sont les siens.\n... |
| 143 | Just a moment.\nBeautiful, isn't it?\nSubtitle... | Un instant s'il vous plait?\nC'est magnifique,... |
| 144 | | L'effondrement de l'Ecosystème au milieu des a... |
