<a href="https://colab.research.google.com/github/MarMarhoun/freelance_work/blob/main/side_projects/NLP_projs/LLMs_with_Gradio/llm_traduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Application for Video Content Analysis and Translation using Gradio

This code provides a comprehensive tool for users to analyze video content by extracting subtitles, summarizing them, checking for similarity with a given prompt, and translating text into multiple languages. The use of Gradio allows for an interactive and visually appealing user experience, making it easy for users to engage with the functionalities offered by the application.

**Key Components and Functionalities**



1.   **Model Initialization:** The script initializes a similarity model using `SentenceTransformer` and a summarization model from the Hugging Face Transformers library. It defines a dictionary of translation models for various languages, including Arabic, Spanish, French, German, and Chinese, and loads them into translation pipelines.


2.   **Subtitle Extraction:** The `extract_subtitles` function takes a YouTube video URL, extracts the video ID, and retrieves the transcript. It includes error handling for scenarios such as missing transcripts or unavailable videos, ensuring that users receive informative feedback.

3.  **Text Processing Functions:** Several functions are defined to handle text processing tasks. The `save_extracted_text` function saves the extracted subtitles to a text file. The `create_wordcloud` function generates a word cloud image from the provided text and saves it as a PNG file. The `summarize_text` function summarizes the input text using the summarization model, truncating it if necessary. The `translate_text` function translates the input text into the selected target language, splitting long texts into smaller chunks to avoid errors during translation.

4. **Text Extraction and Display:** The `extract_and_display_text` function combines the functionalities of subtitle extraction, word cloud generation, and text summarization. It returns the extracted text, summarized text, and paths to the generated word clouds, providing a comprehensive overview of the video content.

5. **Gradio Interface:** The script creates a Gradio app with a user-friendly interface that includes input fields for the YouTube video URL, prompt text, text to translate, and target language selection. Output fields are provided for displaying extracted text, summarized text, similarity scores, missing points, and translated text. Buttons are included to trigger the extraction, analysis, and translation processes.
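
Pasted YouTube links often carry extra query parameters (for example `&ab_channel=...`) or use the short `youtu.be` form, so taking everything after `v=` is fragile. The following is a minimal sketch of a more tolerant ID parser built on the standard library; `extract_video_id` is a hypothetical helper name, not part of the notebook, and it only covers the two URL shapes shown in the comments.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(video_url: str) -> str:
    """Return the video ID from a YouTube watch URL or short URL.

    Hypothetical helper for illustration; handles only the two
    URL shapes commented below.
    """
    parsed = urlparse(video_url.strip())
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0]

    # Standard links: https://www.youtube.com/watch?v=<id>&other=params
    query = parse_qs(parsed.query)
    if "v" in query:
        return query["v"][0]

    raise ValueError(f"Could not find a video ID in: {video_url!r}")
```

The returned ID could then be passed to the transcript lookup in place of the raw split result, with the `ValueError` reported to the user like the other error messages.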



## Install the libraries

In [20]:
!pip install gradio youtube-transcript-api transformers torch sentence-transformers wordcloud matplotlib



# Text Similarity Analysis from Video - with Summarization

In [21]:
import gradio as gr
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, VideoUnavailable
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Load the similarity model and summarization model once at the start
similarity_model = SentenceTransformer('all-MiniLM-L6-v2')
summarization_model = pipeline("summarization")

# Function to extract subtitles from a YouTube video with error handling
def extract_subtitles(video_url):
    try:
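        # Take the value after "v=" and drop any trailing parameters (e.g. "&ab_channel=...")
        video_id = video_url.split("v=")[-1].split("&")[0]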
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine the text from the transcript
        full_text = " ".join([entry['text'] for entry in transcript])
        return full_text
    except NoTranscriptFound:
        return "Error: No transcript available for this video."
    except VideoUnavailable:
        return "Error: The video is unavailable or does not have subtitles."
    except Exception as e:
        return f"Error: An unexpected error occurred: {str(e)}"

# Function to save extracted text to a file
def save_extracted_text(video_id, text):
    with open(f"{video_id}_transcript.txt", "w") as f:
        f.write(text)

# Function to create a word cloud image
def create_wordcloud(text, filename, title=None):
    if not text.strip():  # Check if the text is empty
        return None

    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    plt.figure(figsize=(18, 6))
    plt.imshow(wordcloud, interpolation='bicubic')
    plt.axis('off')

    # Add title if provided
    if title:
        plt.title(title, fontsize=24, color='black')  # Customize title appearance as needed

    plt.savefig(filename, format='png')
    plt.close()
    return filename

# Function to summarize text
def summarize_text(text):
    if len(text) > 1024:  # Adjust the length limit as needed
        text = text[:1024]  # Truncate to the first 1024 characters
    summarized = summarization_model(text, max_length=150, min_length=30, do_sample=False)
    return summarized[0]['summary_text']

# Function to handle text extraction and update the output
def extract_and_display_text(video_url):
    extracted_text = extract_subtitles(video_url)
    if extracted_text.startswith("Error:"):  # Match only our own error messages, not transcripts that mention "Error"
        return extracted_text, "", None, None  # Return error message and empty summary

    # Create the word cloud for the extracted text with a title
    extracted_wordcloud_path = create_wordcloud(extracted_text, 'extracted_wordcloud_sum.png', title='Extracted Text WordCloud')

    # Summarize the extracted text
    summarized_text = summarize_text(extracted_text)

    # Create a word cloud for the summarized text
    summarized_wordcloud_path = create_wordcloud(summarized_text, 'summarized_wordcloud_sum.png', title='Summarized Text WordCloud')

    return extracted_text, summarized_text, extracted_wordcloud_path, summarized_wordcloud_path

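# Gradio callback for the "Analyze" button
# NOTE: analyze_text is not defined in this notebook; it must be defined before
# this callback runs, otherwise clicking "Analyze" raises a NameError.
def gradio_interface(prompt, video_url):
    score, gaps, missing_wordcloud_path = analyze_text(prompt, video_url)
    return score, gaps, missing_wordcloud_path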

# Create Gradio app
with gr.Blocks() as app:
    gr.Markdown("<h2 style='text-align: center;'>Text Similarity Analysis from Video - with Summarization </h2>")

    # Example video URLs
    gr.Markdown("### Example YouTube Video URLs")
    gr.Markdown("- [DNA Structure and Replication](https://www.youtube.com/watch?v=8kK2zwjRV0M)")
    gr.Markdown("- [Startup Success](https://www.youtube.com/watch?v=0lJKucu6HJc&ab_channel=YCombinator)")

    with gr.Row():
        with gr.Column():
            video_url = gr.Textbox(label="YouTube Video URL", placeholder="Enter the YouTube video URL here...")
            extracted_text_output = gr.Textbox(label="Extracted Text", interactive=False, lines=10)
            summarized_text_output = gr.Textbox(label="Summarized Text", interactive=False, lines=5)  # New output for summarized text
            extract_btn = gr.Button("Extract Text")
            extracted_wordcloud = gr.Image(label="Extracted Text Word Cloud", interactive=False)
            summarized_wordcloud = gr.Image(label="Summarized Text Word Cloud", interactive=False)


        with gr.Column():
            prompt = gr.Textbox(label="Prompt Text", placeholder="Enter your prompt text here...")

            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            gaps_output = gr.Textbox(label="Missing Points", interactive=False)

            missing_wordcloud = gr.Image(label="Missing Points Word Cloud", interactive=False)
            submit_btn = gr.Button("Analyze")

    # Button click events
    extract_btn.click(extract_and_display_text, inputs=video_url, outputs=[extracted_text_output, summarized_text_output, extracted_wordcloud, summarized_wordcloud])
    submit_btn.click(gradio_interface, inputs=[prompt, video_url], outputs=[similarity_output, gaps_output, missing_wordcloud])

    gr.Markdown("### Example Prompts")
    gr.Markdown("- Explain the structure of DNA and its significance in genetics.")
    gr.Markdown("- DNA sequencing is used in modern medicine")
    gr.Markdown("- DNA plays a weak role in the process of protein synthesis.")
    gr.Markdown("- DNA sequencing has major implications in modern medicine.")
    gr.Markdown("- How does DNA replication occur, and why is it important for cell division?")
    gr.Markdown("- What are the ethical considerations surrounding genetic engineering and DNA manipulation?")

# Launch the app
app.launch()

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://9af22291332461dde7.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
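
The `gradio_interface` callback relies on an `analyze_text` function that does not appear in the cells shown here. As a rough sketch of what its core could look like (an assumption, not the author's implementation), the code below scores a prompt against a transcript with cosine similarity and lists the prompt sentences that no transcript sentence covers. The names `cosine`, `split_sentences`, `score_and_gaps`, the `embed` parameter, and the `0.5` threshold are all hypothetical.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return float(dot / norm) if norm else 0.0


def split_sentences(text: str) -> List[str]:
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def score_and_gaps(
    prompt: str,
    transcript: str,
    embed: Callable[[List[str]], Sequence[Vector]],
    threshold: float = 0.5,  # assumed cut-off; tune for the embedding model used
) -> Tuple[float, List[str]]:
    """Return (overall similarity, prompt sentences not covered by the transcript)."""
    prompt_vec, transcript_vec = embed([prompt, transcript])
    score = cosine(prompt_vec, transcript_vec)

    prompt_sents = split_sentences(prompt)
    transcript_sents = split_sentences(transcript)
    if not prompt_sents or not transcript_sents:
        return score, prompt_sents

    transcript_vecs = embed(transcript_sents)
    gaps = []
    for sentence, vec in zip(prompt_sents, embed(prompt_sents)):
        # A prompt sentence counts as "missing" when its best match is weak.
        best = max(cosine(vec, t) for t in transcript_vecs)
        if best < threshold:
            gaps.append(sentence)
    return score, gaps
```

In the notebook, `embed` could plausibly be `similarity_model.encode`, and an `analyze_text(prompt, video_url)` wrapper would call `extract_subtitles`, pass the result to this function, and feed the joined gaps to `create_wordcloud`. Auto-generated transcripts often lack punctuation, so the naive sentence splitter may return the whole transcript as a single sentence.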




# Translation from English to Arabic

In [22]:
import gradio as gr
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, VideoUnavailable
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Load the similarity model, summarization model, and translation model once at the start
similarity_model = SentenceTransformer('all-MiniLM-L6-v2')
summarization_model = pipeline("summarization")
translation_model = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")  # English to Arabic translation model

# Function to extract subtitles from a YouTube video with error handling
def extract_subtitles(video_url):
    try:
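        # Take the value after "v=" and drop any trailing parameters (e.g. "&ab_channel=...")
        video_id = video_url.split("v=")[-1].split("&")[0]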
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine the text from the transcript
        full_text = " ".join([entry['text'] for entry in transcript])
        return full_text
    except NoTranscriptFound:
        return "Error: No transcript available for this video."
    except VideoUnavailable:
        return "Error: The video is unavailable or does not have subtitles."
    except Exception as e:
        return f"Error: An unexpected error occurred: {str(e)}"

# Function to save extracted text to a file
def save_extracted_text(video_id, text):
    with open(f"{video_id}_transcript.txt", "w") as f:
        f.write(text)

# Function to create a word cloud image
def create_wordcloud(text, filename, title=None):
    if not text.strip():  # Check if the text is empty
        return None

    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    plt.figure(figsize=(18, 6))
    plt.imshow(wordcloud, interpolation='bicubic')
    plt.axis('off')

    # Add title if provided
    if title:
        plt.title(title, fontsize=24, color='black')  # Customize title appearance as needed

    plt.savefig(filename, format='png')
    plt.close()
    return filename

# Function to summarize text
def summarize_text(text):
    if len(text) > 1024:  # Adjust the length limit as needed
        text = text[:1024]  # Truncate to the first 1024 characters
    summarized = summarization_model(text, max_length=150, min_length=30, do_sample=False)
    return summarized[0]['summary_text']

# Function to translate text from English to Arabic
def translate_text(text):
    if not text.strip():
        return ""
    translated = translation_model(text)
    return translated[0]['translation_text']

# Function to handle text extraction and update the output
def extract_and_display_text(video_url):
    extracted_text = extract_subtitles(video_url)
    if extracted_text.startswith("Error:"):  # Match only our own error messages, not transcripts that mention "Error"
        return extracted_text, "", None, None  # Return error message and empty summary

    # Create the word cloud for the extracted text with a title
    extracted_wordcloud_path = create_wordcloud(extracted_text, 'extracted_wordcloud_sum.png', title='Extracted Text WordCloud')

    # Summarize the extracted text
    summarized_text = summarize_text(extracted_text)

    # Create a word cloud for the summarized text
    summarized_wordcloud_path = create_wordcloud(summarized_text, 'summarized_wordcloud_sum.png', title='Summarized Text WordCloud')

    return extracted_text, summarized_text, extracted_wordcloud_path, summarized_wordcloud_path

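# Gradio callback for the "Analyze" button
# NOTE: analyze_text is not defined in this notebook; it must be defined before
# this callback runs, otherwise clicking "Analyze" raises a NameError.
def gradio_interface(prompt, video_url):
    score, gaps, missing_wordcloud_path = analyze_text(prompt, video_url)
    return score, gaps, missing_wordcloud_path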

# Create Gradio app
with gr.Blocks() as app:
    gr.Markdown("<h2 style='text-align: center;'>Text Translation & Similarity Analysis from Video - with Summarization </h2>")

    # Example video URLs
    gr.Markdown("### Example YouTube Video URLs")
    gr.Markdown("- [DNA Structure and Replication](https://www.youtube.com/watch?v=8kK2zwjRV0M)")
    gr.Markdown("- [Startup Success](https://www.youtube.com/watch?v=0lJKucu6HJc&ab_channel=YCombinator)")

    with gr.Row():
        with gr.Column():
            video_url = gr.Textbox(label="YouTube Video URL", placeholder="Enter the YouTube video URL here...")
            extracted_text_output = gr.Textbox(label="Extracted Text", interactive=False, lines=10)
            summarized_text_output = gr.Textbox(label="Summarized Text", interactive=False, lines=5)  # New output for summarized text
            extracted_wordcloud = gr.Image(label="Extracted Text Word Cloud", interactive=False)
            summarized_wordcloud = gr.Image(label="Summarized Text Word Cloud", interactive=False)
            extract_btn = gr.Button("Extract Text")

        with gr.Column():
            prompt = gr.Textbox(label="Prompt Text", placeholder="Enter your prompt text here...")
            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            gaps_output = gr.Textbox(label="Missing Points", interactive=False)
            missing_wordcloud = gr.Image(label="Missing Points Word Cloud", interactive=False)
            submit_btn = gr.Button("Analyze")

    with gr.Row():
        with gr.Column():
            translation_input = gr.Textbox(label="Text to Translate", placeholder="Enter text in English here...")
            translate_btn = gr.Button("Translate")
        with gr.Column():
            translation_output = gr.Textbox(label="Translated Text (Arabic)", interactive=False)  # Output for translation


    # Button click events

    extract_btn.click(extract_and_display_text, inputs=video_url, outputs=[extracted_text_output, summarized_text_output, extracted_wordcloud, summarized_wordcloud])
    submit_btn.click(gradio_interface, inputs=[prompt, video_url], outputs=[similarity_output, gaps_output, missing_wordcloud])
    translate_btn.click(translate_text, inputs=translation_input, outputs=translation_output)


    gr.Markdown("### Example Prompts")
    gr.Markdown("- Explain the structure of DNA and its significance in genetics.")
    gr.Markdown("- DNA sequencing is used in modern medicine")
    gr.Markdown("- DNA plays a weak role in the process of protein synthesis.")
    gr.Markdown("- DNA sequencing has major implications in modern medicine.")
    gr.Markdown("- How does DNA replication occur, and why is it important for cell division?")
    gr.Markdown("- What are the ethical considerations surrounding genetic engineering and DNA manipulation?")

# Launch the app
app.launch()

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Device set to use cpu


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://6adccc3771ddf0fa69.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
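
In the cell above, `summarize_text` keeps only the first 1,024 characters of the transcript and `translate_text` sends its whole input in a single call, so long inputs are either cut short or may exceed what the model accepts. One possible workaround, sketched here under the assumption that processing pieces independently is acceptable, is to split the text at word boundaries and apply a function to each piece. `split_into_word_chunks` and `apply_in_chunks` are hypothetical helper names.

```python
from typing import Callable, List


def split_into_word_chunks(text: str, max_chars: int = 1024) -> List[str]:
    """Split text into chunks of at most max_chars, breaking only at whitespace.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            # Adding this word would overflow the chunk, so start a new one.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def apply_in_chunks(text: str, fn: Callable[[str], str], max_chars: int = 1024) -> str:
    """Apply fn to each word-aligned chunk and join the results with spaces."""
    return " ".join(fn(chunk) for chunk in split_into_word_chunks(text, max_chars))
```

For example, `apply_in_chunks(extracted_text, summarize_text)` would summarize every part of a transcript rather than only its opening, at the cost of one model call per chunk. The character limit is only a proxy for the model's token limit, so a smaller `max_chars` may be needed in practice.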




# Translation from English to Multiple Languages


In [23]:
import gradio as gr
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, VideoUnavailable
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Load the similarity model and summarization model once at the start
similarity_model = SentenceTransformer('all-MiniLM-L6-v2')
summarization_model = pipeline("summarization")

# Define a dictionary for translation models
translation_models = {
    "Arabic": "Helsinki-NLP/opus-mt-en-ar",
    "Spanish": "Helsinki-NLP/opus-mt-en-es",
    "French": "Helsinki-NLP/opus-mt-en-fr",
    "German": "Helsinki-NLP/opus-mt-en-de",
    "Chinese": "Helsinki-NLP/opus-mt-en-zh",
}

# Load translation models
translation_pipelines = {lang: pipeline("translation", model=model) for lang, model in translation_models.items()}

# Function to extract subtitles from a YouTube video with error handling
def extract_subtitles(video_url):
    try:
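        # Take the value after "v=" and drop any trailing parameters (e.g. "&ab_channel=...")
        video_id = video_url.split("v=")[-1].split("&")[0]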
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine the text from the transcript
        full_text = " ".join([entry['text'] for entry in transcript])
        return full_text
    except NoTranscriptFound:
        return "Error: No transcript available for this video."
    except VideoUnavailable:
        return "Error: The video is unavailable or does not have subtitles."
    except Exception as e:
        return f"Error: An unexpected error occurred: {str(e)}"

# Function to save extracted text to a file
def save_extracted_text(video_id, text):
    with open(f"{video_id}_transcript.txt", "w") as f:
        f.write(text)

# Function to create a word cloud image
def create_wordcloud(text, filename, title=None):
    if not text.strip():  # Check if the text is empty
        return None

    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    plt.figure(figsize=(18, 6))
    plt.imshow(wordcloud, interpolation='bicubic')
    plt.axis('off')

    # Add title if provided
    if title:
        plt.title(title, fontsize=24, color='black')  # Customize title appearance as needed

    plt.savefig(filename, format='png')
    plt.close()
    return filename

# Function to summarize text
def summarize_text(text):
    if len(text) > 1024:  # Adjust the length limit as needed
        text = text[:1024]  # Truncate to the first 1024 characters
    summarized = summarization_model(text, max_length=150, min_length=30, do_sample=False)
    return summarized[0]['summary_text']

# Function to translate text from English to the selected language
def translate_text(text, target_language):
    if not text.strip() or target_language not in translation_pipelines:
        return ""

    # Split the text into chunks of 512 characters
    chunk_size = 512
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Translate each chunk and join the results
    translated_chunks = []
    for chunk in chunks:
        translated = translation_pipelines[target_language](chunk)
        translated_chunks.append(translated[0]['translation_text'])

    return " ".join(translated_chunks)

# Function to handle text extraction and update the output
def extract_and_display_text(video_url):
    extracted_text = extract_subtitles(video_url)
    if extracted_text.startswith("Error:"):  # Match only our own error messages, not transcripts that mention "Error"
        return extracted_text, "", None, None  # Return error message and empty summary

    # Create the word cloud for the extracted text with a title
    extracted_wordcloud_path = create_wordcloud(extracted_text, 'extracted_wordcloud_sum.png', title='Extracted Text WordCloud')

    # Summarize the extracted text
    summarized_text = summarize_text(extracted_text)

    # Create a word cloud for the summarized text
    summarized_wordcloud_path = create_wordcloud(summarized_text, 'summarized_wordcloud_sum.png', title='Summarized Text WordCloud')

    return extracted_text, summarized_text, extracted_wordcloud_path, summarized_wordcloud_path

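# Gradio callback for the "Analyze" button
# NOTE: analyze_text is not defined in this notebook; it must be defined before
# this callback runs, otherwise clicking "Analyze" raises a NameError.
def gradio_interface(prompt, video_url):
    score, gaps, missing_wordcloud_path = analyze_text(prompt, video_url)
    return score, gaps, missing_wordcloud_path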

# Create Gradio app
with gr.Blocks() as app:
    gr.Markdown("<h2 style='text-align: center;'>Text Translation & Similarity Analysis from Video - with Summarization</h2>")

    # Example video URLs
    gr.Markdown("### Example YouTube Video URLs")
    gr.Markdown("- [DNA Structure and Replication](https://www.youtube.com/watch?v=8kK2zwjRV0M)")
    gr.Markdown("- [Startup Success](https://www.youtube.com/watch?v=0lJKucu6HJc&ab_channel=YCombinator)")

    with gr.Row():
        with gr.Column():
            video_url = gr.Textbox(label="YouTube Video URL", placeholder="Enter the YouTube video URL here...")
            extracted_text_output = gr.Textbox(label="Extracted Text", interactive=False, lines=10)
            summarized_text_output = gr.Textbox(label="Summarized Text", interactive=False, lines=5)  # New output for summarized text
            extract_btn = gr.Button("Extract Text")
            extracted_wordcloud = gr.Image(label="Extracted Text Word Cloud", interactive=False)
            summarized_wordcloud = gr.Image(label="Summarized Text Word Cloud", interactive=False)


        with gr.Column():
            prompt = gr.Textbox(label="Prompt Text", placeholder="Enter your prompt text here...")
            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            gaps_output = gr.Textbox(label="Missing Points", interactive=False)
            missing_wordcloud = gr.Image(label="Missing Points Word Cloud", interactive=False)
            submit_btn = gr.Button("Analyze")

    with gr.Row():
        with gr.Column():
            translation_input = gr.Textbox(label="Text to Translate", placeholder="Enter text in English here...")
            target_language = gr.Dropdown(label="Select Target Language", choices=list(translation_models.keys()), value="Arabic")
            translate_btn = gr.Button("Translate")
        with gr.Column():
            translation_output = gr.Textbox(label="Translated Text", interactive=False)  # Output for translation

    # Button click events
    extract_btn.click(extract_and_display_text, inputs=video_url, outputs=[extracted_text_output, summarized_text_output, extracted_wordcloud, summarized_wordcloud])
    submit_btn.click(gradio_interface, inputs=[prompt, video_url], outputs=[similarity_output, gaps_output, missing_wordcloud])
    translate_btn.click(translate_text, inputs=[translation_input, target_language], outputs=translation_output)

    gr.Markdown("### Example Prompts")
    gr.Markdown("- Explain the structure of DNA and its significance in genetics.")
    gr.Markdown("- DNA sequencing is used in modern medicine")
    gr.Markdown("- DNA plays a weak role in the process of protein synthesis.")
    gr.Markdown("- DNA sequencing has major implications in modern medicine.")
    gr.Markdown("- How does DNA replication occur, and why is it important for cell division?")
    gr.Markdown("- What are the ethical considerations surrounding genetic engineering and DNA manipulation?")

# Launch the app
app.launch()

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://5e59cbbced35fd7505.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
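
Building all five translation pipelines at startup means every model is downloaded and held in memory even if only one language is ever selected. A possible alternative, sketched below as an assumption rather than the notebook's approach, is to create each pipeline on first use and cache it. `make_lazy_loader` and its `load` parameter are hypothetical; the model names are the ones listed in the cell above.

```python
from functools import lru_cache
from typing import Callable, Dict

TRANSLATION_MODELS: Dict[str, str] = {
    "Arabic": "Helsinki-NLP/opus-mt-en-ar",
    "Spanish": "Helsinki-NLP/opus-mt-en-es",
    "French": "Helsinki-NLP/opus-mt-en-fr",
    "German": "Helsinki-NLP/opus-mt-en-de",
    "Chinese": "Helsinki-NLP/opus-mt-en-zh",
}


def make_lazy_loader(load: Callable[[str], object]) -> Callable[[str], object]:
    """Return a function that builds the pipeline for a language once, then reuses it.

    `load` receives a model name and returns the loaded pipeline object.
    """

    @lru_cache(maxsize=None)
    def get_pipeline(language: str) -> object:
        # Raises KeyError for languages missing from TRANSLATION_MODELS;
        # lru_cache does not cache calls that raise.
        return load(TRANSLATION_MODELS[language])

    return get_pipeline
```

In the notebook, `load` could be something like `lambda name: pipeline("translation", model=name)`, and `translate_text` would call `get_pipeline(target_language)` instead of indexing a pre-built dictionary. The first translation into each language would then be slower, since it includes the model load.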


