<a href="https://colab.research.google.com/github/August-murr/Data_science_Demonstration/blob/main/Data%20Collection%20and%20Web%20Scraping%20for%20Machine%20Translation/Data_collection_and_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This Jupyter notebook collects data for training or fine-tuning a Large Language Model (LLM) for movie translation and subtitle generation. The goal is to give the model more context about each movie so that it can produce more nuanced and accurate translations across a diverse range of films.

The data collection process is structured into five distinct parts, each contributing valuable insights and content to enhance the LLM's understanding of movie content:

**Subtitle Collection:** In the initial phase, we gather movie subtitles in various languages based on specific search queries. These subtitles play a pivotal role in understanding the linguistic diversity within the movie content.

**Movie Metadata Retrieval:** The second part of our endeavor involves collecting essential movie metadata from IMDb (Internet Movie Database). This information provides a foundational understanding of each movie's background and attributes.

**Synopsis Acquisition:** A comprehensive understanding of a movie necessitates a clear overview of its storyline. To this end, we procure detailed synopses from IMDb, encapsulating the complete narrative of each movie.

**Script Compilation:** The script of a movie often contains intricate details that offer profound insights into the storyline, dialogues, and character interactions. This segment of our project is dedicated to gathering movie scripts to provide a wealth of contextual information.

**Speech Recognition and Transcription:** In the final phase, we employ speech recognition technology to transcribe specific scenes from movies. This method captures the spoken dialogue and interactions, enriching our dataset with valuable audio content.

Together, these data sources give the LLM broader context about each movie, which supports more accurate, context-aware translations. This notebook is one step toward improving language models for movie translation and subtitle generation.



# Dependencies

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import os
import time

The **requests** library makes HTTP requests, allowing us to retrieve data from web pages.

**Beautiful Soup (bs4)** parses HTML documents and extracts information from them, making it easier to navigate and manipulate web content.

**re (regular expressions)** handles string manipulation and pattern matching, which is useful for extracting specific patterns from text.

**os** provides tools for working with the file system, enabling us to create folders and files for organized data storage.

**time**, specifically its **sleep** function, introduces pauses in code execution. We use it to rate-limit requests and avoid sending excessive traffic to web servers.

The following function is designed to remove all files and folders within a specified directory. This is particularly useful during testing phases to prevent clutter and avoid the creation of redundant files and folders.


In [2]:
def delete_files_and_folders(directory):
    # Iterates through the directory structure in a bottom-up approach, starting from the deepest nested files and folders.
    for root, dirs, files in os.walk(directory, topdown=False):
        for file in files:
            # Removes individual files located within the directory.
            file_path = os.path.join(root, file)
            os.remove(file_path)
        for folder in dirs:
            # Removes entire folders within the directory, ensuring a clean slate for subsequent testing.
            folder_path = os.path.join(root, folder)
            os.rmdir(folder_path)
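
As a quick check of the cleanup helper above, the sketch below builds a throwaway directory tree with `tempfile` and confirms that its contents are removed while the top-level directory itself is kept. The file and folder names are made up for illustration, and the function body is repeated so that the snippet runs on its own.

```python
import os
import tempfile


def delete_files_and_folders(directory):
    # Walk bottom-up so that folders are already empty when os.rmdir reaches them.
    for root, dirs, files in os.walk(directory, topdown=False):
        for file in files:
            os.remove(os.path.join(root, file))
        for folder in dirs:
            os.rmdir(os.path.join(root, folder))


# Build a small throwaway tree: <tmp>/movie/subs/a.txt and <tmp>/b.txt
sandbox = tempfile.mkdtemp()
os.makedirs(os.path.join(sandbox, "movie", "subs"))
for relative_path in (os.path.join("movie", "subs", "a.txt"), "b.txt"):
    with open(os.path.join(sandbox, relative_path), "w") as f:
        f.write("placeholder")

delete_files_and_folders(sandbox)

remaining = os.listdir(sandbox)  # contents are gone, the directory itself remains
os.rmdir(sandbox)  # remove the now-empty sandbox as well
```

Because the function leaves the top-level directory in place, it can be called repeatedly on the same output folder between test runs.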

### Collecting the Names of Top 250 IMDb Movies
 We will use this list to download subtitles for multiple movies in different languages later in the notebook.

In [3]:
# Collecting the Names of Top 250 IMDb Movies
# We will use this list to download subtitles for multiple movies in different languages later in the notebook.

# Define user-agent headers to simulate a real browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

# Define the URL of the IMDb top 250 page
url = "https://www.imdb.com/chart/top/"

# Send an HTTP GET request to the URL, simulating a real browser request using user-agent headers
response = requests.get(url, headers=headers)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Find all the h3 elements with the class "ipc-title__text" which contain movie names
movie_elements = soup.find_all("h3", class_="ipc-title__text")

# Extract movie names from the elements and store them in a list, removing unwanted elements
movie_names = [element.text.strip() for element in movie_elements][1:-12]

# Prepare the IMDb Top 250 movies list by removing the ranking numbers
imdb_top_250_movies_list = [movie.split('. ', 1)[1] for movie in movie_names]
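
The `split('. ', 1)[1]` expression above raises an `IndexError` for any entry that lacks a `". "` separator, which can happen if the page layout changes and the slice lets a non-movie heading through. A possible alternative, sketched here with a hypothetical helper named `strip_rank` and made-up sample entries, strips a leading rank only when one is present:

```python
import re


def strip_rank(title):
    # Remove a leading "<number>. " prefix; leave the title untouched otherwise.
    return re.sub(r"^\d+\.\s+", "", title)


ranked = ["1. The Shawshank Redemption", "5. 12 Angry Men", "IMDb Charts"]
cleaned = [strip_rank(title) for title in ranked]
```

Titles that themselves start with a number, such as "12 Angry Men", are preserved because only a number followed by a period and whitespace is treated as a rank.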


## Downloading Subtitles for IMDb Top 10 Movies



The following Python code defines a function and a loop for downloading movie subtitles in different languages from the Subscene website for the top 10 IMDb movies. This task is accomplished in a step-by-step process:

1. **`download_subtitle_zip_by_name_and_language` Function:**
    - This function takes two parameters, `movie_name` (the name of the movie) and `language` (the desired subtitle language).
    - It begins by generating a subtitle link for the given movie and language:
      - Sends an HTTP GET request to Subscene's search page to search for subtitles based on the movie name.
      - Parses the search results to find the first search result (assuming there is at least one) and extracts its URL.
      - Appends the specified language to the URL to obtain the subtitles in that language.
    - If successful, the function proceeds to download the subtitles:
      - Sends an HTTP GET request to the subtitle page.
      - Extracts the download link for the subtitles and constructs the full download URL.
      - Extracts the filename from the response headers.
      - Creates a folder with the same name as the movie.
      - Saves the downloaded zip file to the folder.
    - The function returns the path to the downloaded zip file.

2. **Loop for Downloading Subtitles:**
    - The code includes a loop that iterates through a list of specified languages (e.g., English, French, Thai).
    - For each language, it further loops through the top 10 movies from IMDb's top 250 movies list (`imdb_top_250_movies_list[:10]`).
    - It attempts to download subtitles for each movie in the given language, and a pause of 2 seconds (`time.sleep(2)`) is introduced between requests.
    - In case of any exceptions, an exception handler is used to continue the loop without interrupting the execution.

This code effectively automates the process of searching and downloading subtitles for a set of top IMDb movies in various languages, making it a useful tool for collecting subtitle data for machine translation and subtitle generation tasks.


In [4]:
def download_subtitle_zip_by_name_and_language(movie_name, language):
    def generate_subtitle_link(query, language):
        # Define the base URL
        base_url = "https://www.subscene.com"

        # Define user-agent headers to simulate a real browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        }

        # Send a GET request to perform the search
        search_url = f"{base_url}/subtitles/searchbytitle"
        params = {"query": query}
        response = requests.get(search_url, headers=headers, params=params)

        # Parse the search results page using BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")

        # Find the first search result (assuming there is at least one result)
        first_result = soup.find("div", class_="search-result").find("a")

        # Get the URL of the first result
        first_result_url = first_result["href"]

        # Extract the movie name from the URL (last part of the URL)
        movie_name = first_result_url.split("/")[-1]

        # Append language to the URL to get subtitles in the specified language
        subtitle_url = f"{base_url}{first_result_url}/{language}"

        # Send an HTTP GET request to the subtitle page
        response = requests.get(subtitle_url,headers=headers)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse the HTML content of the page using BeautifulSoup
            soup = BeautifulSoup(response.text, "html.parser")

            # Find the first "positive-icon" element and extract its href attribute
            positive_icon_tag = soup.find("span", class_="positive-icon")
            if positive_icon_tag:
                href = positive_icon_tag.find_parent("a")["href"]

                # Construct the full subtitle link by adding the base URL
                full_subtitle_link = f"https://subscene.com{href}"

                return full_subtitle_link
            else:
                print(f"No positive-rated {language} subtitles found for {movie_name}")
        else:
            print(f"Failed to retrieve the subtitle page. Status code: {response.status_code}")
    # Generate the subtitle link using the movie name and language
    subtitle_link = generate_subtitle_link(movie_name, language)

    if subtitle_link:
        # Define user-agent headers to simulate a real browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
        }
        # Send an HTTP GET request to the subtitle page
        response = requests.get(subtitle_link,headers=headers)

        # Extract the download link from the subtitle page
        soup = BeautifulSoup(response.text, "html.parser")
        download_link_tag = soup.find("a", class_="button positive")

        if download_link_tag:
            href = download_link_tag.get("href")

            # Construct the full download URL
            full_download_url = f"https://subscene.com{href}"

            # Extract the filename from the response headers
            # Define user-agent headers to simulate a real browser request
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
            }

            zip_file_response = requests.get(full_download_url,headers=headers)
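            content_disposition = zip_file_response.headers.get("content-disposition")
            if not content_disposition or "filename=" not in content_disposition:
                # Without a filename in the headers there is nothing sensible to save under
                print("Download response did not include a filename.")
                return None
            # Strip surrounding quotes so they do not end up in the saved filename
            filename = content_disposition.split("filename=")[1].strip().strip('"')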

            # Create a folder with the same name as the movie
            folder_name = movie_name
            os.makedirs(folder_name, exist_ok=True)

            # Save the zip file to the folder
            zip_file_path = os.path.join(folder_name, filename)
            with open(zip_file_path, "wb") as f:
                f.write(zip_file_response.content)

            return zip_file_path
        else:
            print("Download link not found on the subtitle page.")
    return None
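
The filename step in the function above depends on the server sending a `Content-Disposition` header in a particular shape. As a minimal sketch of a more defensive approach, the standard library's `email.message.Message` can parse the header instead. The helper name `filename_from_content_disposition` and the fallback name `subtitle.zip` are assumptions made for this illustration and are not part of the original function.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value, default="subtitle.zip"):
    """Return the filename carried by a Content-Disposition header, or `default`."""
    if not header_value:
        return default
    message = Message()
    message["Content-Disposition"] = header_value
    name = message.get_filename()
    # basename() drops any directory components a server might include
    return os.path.basename(name) if name else default


quoted = filename_from_content_disposition('attachment; filename="joker_english.zip"')
missing = filename_from_content_disposition(None)
```

Parsing the header this way handles both quoted and unquoted filenames and avoids an exception when the header is absent.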

In [5]:
# List of languages for which subtitles will be downloaded
languages = ["english", "french", "thai"]

# Loop through each language and the top 10 IMDb movies
for language in languages:
    for query in imdb_top_250_movies_list[:10]:
        try:
            # Attempt to download subtitles for the current movie and language
            download_subtitle_zip_by_name_and_language(query, language)

            # Introduce a 2-second pause to be considerate of server requests
            time.sleep(2)
        except Exception:
            # Skip this movie/language pair on any error and continue with the next one
            pass

No positive-rated thai subtitles found for 12-angry-men
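
Silently skipping failures makes it hard to tell afterwards which subtitles are missing. One possible variation of the loop, sketched below, records every failure together with its reason. The names `download_all` and `fake_download` are hypothetical. `fake_download` stands in for `download_subtitle_zip_by_name_and_language` so that the snippet runs without network access.

```python
import time


def download_all(movies, languages, download_fn, delay_seconds=2.0):
    """Call download_fn(movie, language) for every pair and record what happened."""
    saved, failed = [], []
    for language in languages:
        for movie in movies:
            try:
                path = download_fn(movie, language)
            except Exception as error:  # keep going, but remember why it failed
                failed.append((movie, language, repr(error)))
            else:
                if path:
                    saved.append(path)
                else:
                    failed.append((movie, language, "no subtitle found"))
            # Pause after every attempt, successful or not, to limit request rate
            time.sleep(delay_seconds)
    return saved, failed


# Offline stand-in for the real download function (illustration only).
def fake_download(movie, language):
    if language == "thai":
        return None
    return f"{movie}/{language}.zip"


saved, failed = download_all(["joker"], ["english", "thai"], fake_download, delay_seconds=0)
```

Unlike the loop above, this version also pauses after a failed attempt, so repeated errors do not turn into a burst of requests.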


**Important Notes:**

- The `download_subtitle_zip_by_name_and_language` function, while functional, may not be flawless in all cases. However, it should work effectively in most instances.

- If you encounter issues with the function, consider the following troubleshooting steps:
  - Ensure that there are well-rated subtitles available for the specific movie you are trying to download. The function is designed to download subtitles with positive ratings.
  - Keep in mind that Subscene's search engine, while suitable for movies, can be less reliable when dealing with TV series.
  - Pay close attention to the `language` parameter. Make sure it is spelled exactly as it appears on Subscene.com. For instance, "France" will not work if the correct language name is "French."


# IMDb Movie Information Retrieval

In this section, we define two Python functions that work in tandem to retrieve essential information about a movie from IMDb based on a provided search query.

### `get_imdb_movie_link(query)`

- **Purpose**: This function takes a movie title or query as input and returns the IMDb link to the most relevant movie search result.
- **Steps**:
  1. It sends an HTTP GET request to the IMDb search page, simulating a real browser request using user-agent headers.
  2. Parses the search results page to find the first link to the movie (assuming there is a result).
  3. Extracts the IMDb link from the link found and returns it.
- **Error Handling**:
  - If there is an error during the process, it prints an error message and returns `None`.

### `extract_movie_info(imdb_link)`

- **Purpose**: This function takes an IMDb link as input and extracts detailed information about the movie.
- **Steps**:
  1. It sends an HTTP GET request to the IMDb movie page with user-agent headers.
  2. Parses the HTML content of the page using BeautifulSoup.
  3. Extracts various movie details such as title, release year, age rating, and genres.
  4. Compiles the extracted information into a dictionary.
- **Error Handling**:
  - If an error occurs, it prints an error message and returns `None`.

### Example Usage:

```python
# Usage Example
movie_link = get_imdb_movie_link("Joker")  # Get IMDb link for the movie "Joker"
movie_info = extract_movie_info(movie_link)  # Extract detailed information about the movie

# Output is a dictionary containing movie information, e.g.:
# {'Title': 'Joker', 'Release Year': '2019', 'Age Rating': 'R', 'Genres': ['Crime', 'Drama', 'Thriller']}
```


In [6]:
def get_imdb_movie_link(query):
    try:
        # Send a search request to IMDb
        url = "https://www.imdb.com/find"
        # Define user-agent headers to simulate a real browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
            "Accept-Language": "en-US"
        }
        # Pass the query via params so that requests URL-encodes spaces and special characters
        response = requests.get(url, params={"q": query}, headers=headers)

        # Check if the request was successful (status code 200)
        response.raise_for_status()

        # Parse the search results page
        soup = BeautifulSoup(response.text, "html.parser")

        # Find the first <a> element with the specified class attribute
        first_link = soup.find("a", class_="ipc-metadata-list-summary-item__t")

        # Extract the href attribute (IMDb link) from the first <a> element
        if first_link:
            imdb_link = first_link.get("href")
            imdb_full_link = f"https://www.imdb.com{imdb_link}"
            return imdb_full_link
        else:
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
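
Movie titles often contain spaces or characters such as `&` that have a special meaning inside a query string. The sketch below shows how such a title could be encoded with the standard library before it is placed in the search URL. The helper name `build_imdb_search_url` is an assumption made for this example.

```python
from urllib.parse import urlencode


def build_imdb_search_url(query):
    # urlencode percent-encodes reserved characters and turns spaces into "+"
    return "https://www.imdb.com/find?" + urlencode({"q": query})


url = build_imdb_search_url("Fast & Furious")
```

Without encoding, the `&` in this title would be read as the start of a second query parameter, and the search would only see "Fast ".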

In [7]:
def extract_movie_info(imdb_link):
    try:
        # Define user-agent headers to simulate a real browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
            "Accept-Language": "en-US"
        }

        # Send an HTTP GET request to the IMDb page with headers
        response = requests.get(imdb_link, headers=headers)
        response.raise_for_status()  # Check if the request was successful

        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract the movie title
        title_element = soup.find("span", class_="sc-afe43def-1")
        title = title_element.text.strip() if title_element else None

        # Extract the release year
        release_year_element = soup.find("a", href=lambda x: x and "/releaseinfo?ref_=tt_ov_rdat" in x)
        release_year = release_year_element.text.strip() if release_year_element else None

        # Extract the age rating
        age_rating_element = soup.find("a", href=lambda x: x and "/parentalguide/certificates?ref_=tt_ov_pg" in x)
        age_rating = age_rating_element.text.strip() if age_rating_element else None

        # Extract all genres
        genres_elements = soup.find_all("a", class_="ipc-chip ipc-chip--on-baseAlt")
        genres = [genre.find("span", class_="ipc-chip__text").text.strip() for genre in genres_elements]

        # Create a dictionary with the extracted information
        movie_info = {
            "Title": title,
            "Release Year": release_year,
            "Age Rating": age_rating,
            "Genres": genres
        }

        return movie_info
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

In [8]:
extract_movie_info(get_imdb_movie_link("Joker"))

{'Title': 'Joker',
 'Release Year': '2019',
 'Age Rating': 'R',
 'Genres': ['Crime', 'Drama', 'Thriller']}
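
Class names such as `sc-afe43def-1` look auto-generated and may change whenever the site is rebuilt. A possible fallback, sketched below, reads the same fields from a JSON-LD `<script>` block. Whether IMDb pages embed such a block, and which keys it carries, is an assumption that has not been verified in this notebook. The sample HTML and the helper name `movie_info_from_json_ld` are made up for illustration.

```python
import json
import re

JSON_LD_PATTERN = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def movie_info_from_json_ld(html):
    """Build the same dictionary as extract_movie_info from a JSON-LD block, if any."""
    match = JSON_LD_PATTERN.search(html)
    if not match:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    genres = data.get("genre") or []
    if isinstance(genres, str):
        genres = [genres]  # a single genre may be given as a plain string
    published = data.get("datePublished") or ""
    return {
        "Title": data.get("name"),
        "Release Year": published[:4] or None,
        "Age Rating": data.get("contentRating"),
        "Genres": genres,
    }


# Hand-written sample page, not real IMDb output.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Movie", "name": "Joker", "datePublished": "2019-10-04",
 "contentRating": "R", "genre": ["Crime", "Drama", "Thriller"]}
</script>
</head></html>
"""

info = movie_info_from_json_ld(sample_html)
```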

# Saving Movie Synopsis to a Text File

The following Python function, `save_plot_synopsis_to_txt`, is designed to retrieve the plot synopsis of a movie from IMDb and save it to a text file. It takes the IMDb movie page URL and the path to the output text file as input.

### `save_plot_synopsis_to_txt(movie_url, output_file_path)`

- **Purpose**: This function extracts the plot synopsis of a movie and saves it as a text file.
- **Functionality**:
    1. The function first converts the IMDb movie page URL to the URL for the plot summary of the same movie.
    2. It then sends an HTTP GET request to the synopsis URL, simulating a real browser request using user-agent headers.
    3. Upon a successful request (HTTP status code 200), it parses the HTML content of the synopsis page using BeautifulSoup.
    4. It locates and extracts the plot synopsis text from the inner divs of the HTML content, usually present in the last inner div.
    5. The extracted synopsis text is saved to a text file at the provided output file path.
- **Error Handling**:
    - If something goes wrong during the process or no synopsis is found, the function returns `False`.
- **Example Usage**:

```python
# Usage Example
movie_url = get_imdb_movie_link("Joker")  # Get IMDb link for the movie "Joker"
output_file_path = "joker_synopsis.txt"  # Path for the output text file

# Attempt to save the plot synopsis to the text file
result = save_plot_synopsis_to_txt(movie_url, output_file_path)

# 'result' will be True if the synopsis is successfully saved; otherwise, it will be False.
```


In [9]:
def save_plot_synopsis_to_txt(movie_url, output_file_path):
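    # get_imdb_movie_link returns None when the search fails, so bail out early
    if not movie_url:
        return False

    # Convert the movie page URL to the synopsis URL (dots escaped so they match literally)
    synopsis_url = re.sub(r"(https://www\.imdb\.com/title/)(tt\d+)(/.*)", r"\1\2/plotsummary/?ref_=tt_stry_pl", movie_url)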

    # Send a request to the synopsis URL
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Accept-Language": "en-US"
    }
    response = requests.get(synopsis_url, headers=headers)

    if response.status_code == 200:
        # Parse the HTML content of the synopsis page
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all the HTML content inner divs
        inner_divs = soup.find_all("div", class_="ipc-html-content-inner-div")

        # Get the text from the last inner div (which usually contains the synopsis)
        if inner_divs:
            synopsis_text = inner_divs[-1].get_text()

            # Save the synopsis to a .txt file
            with open(output_file_path, "w", encoding="utf-8") as file:
                file.write(synopsis_text)
            return True

    # If something goes wrong or no synopsis is found, return False
    return False

In [10]:
save_plot_synopsis_to_txt(get_imdb_movie_link("Joker"),"joker_synopsis.txt")

True
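
The URL conversion step can also be written around the title ID alone, which avoids depending on what follows the ID in the movie link. The sketch below is one way to do this. The helper name `synopsis_url_from_movie_url` is hypothetical, and the tracking suffix `?ref_=tt_stry_pl` used in the function above is dropped on the assumption that the page loads without it.

```python
import re


def synopsis_url_from_movie_url(movie_url):
    """Return the plot summary URL for an IMDb title link, or None if no ID is found."""
    if not movie_url:
        return None
    match = re.search(r"/title/(tt\d+)", movie_url)
    if not match:
        return None
    return f"https://www.imdb.com/title/{match.group(1)}/plotsummary/"


synopsis_url = synopsis_url_from_movie_url(
    "https://www.imdb.com/title/tt7286456/?ref_=fn_al_tt_1"
)
```

Returning `None` for a missing or unrecognized link lets the caller decide how to report the failure instead of sending a request to a malformed URL.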

# Downloading Movie Scripts from ScriptSlug

In this section, two Python functions are defined to facilitate the download of movie scripts from the ScriptSlug website. The first function creates a valid link to the script, and the second function handles the actual download process.

### `script_page_link(query)`

- **Purpose**: This function generates a link to a movie script on ScriptSlug based on the movie title.
- **Functionality**:
  1. It constructs a base URL for movie scripts on the ScriptSlug website.
  2. Extracts movie information using the `extract_movie_info` function, which retrieves IMDb information for the given movie title.
  3. Formats the movie name into a structure acceptable by the website (e.g., lowercase, spaces replaced with hyphens).
  4. Combines the movie name, release year, and the base URL to create a valid script page link.
  5. Returns the script page link.
- **Error Handling**:
  - This function does not have explicit error handling, and any issues may be propagated from the `extract_movie_info` function.

### `download_movie_script(url, download_folder="scripts")`

- **Purpose**: This function is responsible for downloading the movie script from a provided URL.
- **Functionality**:
  1. It sends an HTTP GET request to the provided URL, aiming to access the movie script page.
  2. If the request is successful (HTTP status code 200), it parses the HTML content of the page using BeautifulSoup.
  3. It searches for the download link with a .pdf extension and extracts it.
  4. If a download link is found, it proceeds to download the script:
     - Creates the download folder (default is "scripts") if it doesn't exist.
     - Extracts the filename from the download link's URL.
     - Sends an HTTP GET request to the PDF download link.
     - If the download of the PDF script is successful (HTTP status code 200), it saves the script to the specified folder and returns the path.
- **Error Handling**:
  - If any part of the download process fails, the function returns `None`.

### Example Usage:

```python
# Usage Example
movie_script_url = script_page_link("Joker")  # Get the URL for the script of the movie "Joker"
output_file = download_movie_script(movie_script_url)  # Download and save the movie script

# 'output_file' will be the path to the downloaded script if successful, otherwise it will be None.
```


In [11]:
# Function to generate a link to a movie script page on ScriptSlug
def script_page_link(query):
    # Define the base URL for movie scripts on ScriptSlug
    link = "https://www.scriptslug.com/script/"

    # Extract movie information from IMDb using the 'extract_movie_info' function
    dic = extract_movie_info(get_imdb_movie_link(query))

    # Format the movie name for the URL, converting to lowercase and replacing spaces with hyphens
    movie_name = dic["Title"].lower().replace(" ", "-")

    # Remove any characters from the movie name that are not alphanumeric or hyphens
    movie_name = re.sub(r'[^a-z0-9-]', '', movie_name)

    # Combine the formatted movie name and release year to create the script page link
    script_page_link = link + movie_name + "-" + dic['Release Year']

    # Return the generated script page link
    return script_page_link
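
The name formatting inside `script_page_link` can be isolated into a pure function, which makes it easy to try on different titles without any network requests. The sketch below follows the same lowercase, hyphenate, and strip steps and additionally collapses repeated hyphens. That last step, and the helper name `script_slug`, are additions for this example. Whether ScriptSlug uses exactly these slugs for every title has not been verified.

```python
import re


def script_slug(title, release_year):
    slug = title.lower().replace(" ", "-")
    # Drop everything except lowercase letters, digits, and hyphens
    slug = re.sub(r"[^a-z0-9-]", "", slug)
    # Collapse runs of hyphens left behind by removed punctuation
    slug = re.sub(r"-{2,}", "-", slug).strip("-")
    return f"{slug}-{release_year}"


example = script_slug("Schindler's List", "1993")
```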


In [12]:
def download_movie_script(url, download_folder="scripts"):
    # Send a GET request to the movie script page
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")

        # Find the download link with .pdf extension
        download_link = None
        for link in soup.find_all("a", href=True):
            if link["href"].endswith(".pdf"):
                download_link = link["href"]
                break

        if download_link:
            # Create the download folder if it doesn't exist
            os.makedirs(download_folder, exist_ok=True)

            # Extract the filename from the URL
            filename = os.path.join(download_folder, os.path.basename(download_link))

            # Send a GET request to the PDF download link
            pdf_response = requests.get(download_link)
            if pdf_response.status_code == 200:
                # Save the PDF script to the specified folder
                with open(filename, "wb") as pdf_file:
                    pdf_file.write(pdf_response.content)
                return filename

    # Return None if download fails
    return None

In [13]:
download_movie_script(script_page_link("joker"))

'scripts/joker-2019.pdf'
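
`download_movie_script` passes the `href` it finds straight to `requests.get`, which only works when the page uses absolute links. If a page used relative links instead, they could be resolved against the page URL first, as in this sketch. The helper name `resolve_pdf_link` and the example paths are hypothetical and not taken from the actual site.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_pdf_link(page_url, href):
    # urljoin leaves absolute links unchanged and resolves relative ones
    absolute_url = urljoin(page_url, href)
    # Take the filename from the path only, ignoring any query string
    filename = os.path.basename(urlparse(absolute_url).path)
    return absolute_url, filename


resolved = resolve_pdf_link(
    "https://www.scriptslug.com/script/joker-2019",
    "/assets/scripts/joker-2019.pdf?v=1",
)
```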

# Audio Transcription From Movie's Audio

To run speech recognition with the OpenAI Whisper model and download audio files from Google Drive, we install two libraries:

[Transformers](https://huggingface.co/transformers)

- **Purpose**: Transformers is Hugging Face's library for loading and running pretrained models. Here it provides access to OpenAI's Whisper, an automatic speech recognition (ASR) model that transcribes spoken language into written text.

[gdown](https://pypi.org/project/gdown/)

- **Purpose**: gdown is a Python library that simplifies downloading files from Google Drive links. In this context, it retrieves the audio file used for speech recognition from a specified Google Drive link.



For faster transcription, change the runtime type to GPU.


In [None]:
!pip install transformers
!pip install gdown

In [15]:
from transformers import pipeline
import gdown
from IPython.display import Audio

Link to the audio file on Google Drive:
https://drive.google.com/file/d/1Eo_yIAnXmZOmF2PknakrSUoKKRiXZ_c5/view?usp=sharing

The audio is from the movie **Joker (2019)**.

In [16]:
# Define the file ID and the output filename
file_id = "1Eo_yIAnXmZOmF2PknakrSUoKKRiXZ_c5"
output_filename = "How about another joke, Murray?.mp3"

# Download the file from Google Drive
gdown.download(f"https://drive.google.com/uc?id={file_id}", output_filename, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1Eo_yIAnXmZOmF2PknakrSUoKKRiXZ_c5
To: /content/How about another joke, Murray?.mp3
100%|██████████| 4.76M/4.76M [00:00<00:00, 30.6MB/s]


'How about another joke, Murray?.mp3'
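
The file ID in the cell above was copied by hand from the sharing link. For more than a handful of files, it could be extracted programmatically, as in the sketch below. The helper name `drive_download_url` is an assumption, and the pattern only covers sharing links of the `/file/d/<id>/` form shown in this notebook.

```python
import re


def drive_download_url(share_url):
    """Turn a Google Drive sharing link into the uc?id= form passed to gdown above."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if not match:
        return None
    return f"https://drive.google.com/uc?id={match.group(1)}"


download_url = drive_download_url(
    "https://drive.google.com/file/d/1Eo_yIAnXmZOmF2PknakrSUoKKRiXZ_c5/view?usp=sharing"
)
```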

In [17]:
audio_file_path = "/content/How about another joke, Murray?.mp3"

In [18]:
Audio(audio_file_path)

In [19]:
recognizer = pipeline("automatic-speech-recognition", model="openai/whisper-small")


Note that the recognizer is limited to only the first 30 seconds of the audio file.

In [20]:
recognizer(audio_file_path)

{'text': " Kill those three Wall Street guys. Okay, I'm waiting for the punchline. There's no punchline. It's not a joke. You're serious, aren't you? You're telling us you killed those three young men on the subway? And why should we believe that?"}
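
Because only the first 30 seconds are transcribed, longer scenes need to be processed in pieces. The Transformers speech recognition pipeline documents a `chunk_length_s` argument for long-form audio. The sketch below is not that implementation. It only illustrates the underlying idea of covering a recording with overlapping 30-second windows. The helper name `chunk_windows` and the 5-second overlap are assumptions chosen for this example.

```python
def chunk_windows(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds that cover the audio with overlapping windows."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    windows = []
    start = 0.0
    step = chunk_s - overlap_s
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window reached the end of the audio
        start += step
    return windows


windows = chunk_windows(70)
```

The overlap gives each window some shared audio with its neighbor, which helps when a word falls on a window boundary. Merging the overlapping transcripts is a separate step that this sketch does not cover.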