<a href="https://colab.research.google.com/github/MarMarhoun/freelance_work/blob/main/side_projects/NLP_projs/LLMs_with_Gradio/startup_demo_marouane.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web App for Text Similarity Using Gradio and Hugging Face's Language Models - Demo

This web app uses Gradio and Hugging Face language models to measure how similar a user prompt is to the text of a video. It works on video subtitles or transcriptions: you can provide a link to a video (such as one from YouTube) or upload a transcription file locally. The demo code in this notebook implements the YouTube-link path.

The goal is to help you check how well a prompt covers the video content and to spot essential information that the prompt leaves out.

## Key Features of the Web App:


1. **Text Similarity Analysis:** The app compares the text extracted from the video with the user-provided prompt and calculates a similarity score that indicates how closely the prompt aligns with the video content.
2. **Similarity Scoring:** The app reports a score indicating whether the prompt falls within the scope of the extracted text, which helps users judge how relevant their prompt is to the video content.
3. **Gap Identification:** The app highlights significant gaps or major points from the extracted text that are missing from the prompt, so users are aware of critical information their prompt leaves out.

### Output:

+ Similarity Score: A numerical value representing the degree of similarity between the prompt and the extracted text.
+ Gap Analysis: A detailed report highlighting key points or concepts that are present in the extracted text but absent from the prompt, providing users with a comprehensive understanding of the content.
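
The similarity score computed later in this notebook is a cosine similarity between two embedding vectors. As a minimal sketch of what that number means, the example below computes it by hand on toy vectors. The vectors are hypothetical values chosen for illustration, not model output, and `cosine_similarity` is a helper written for this sketch rather than a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
prompt_vec = [1.0, 2.0, 0.0]
video_vec = [2.0, 4.0, 0.0]   # points in the same direction as prompt_vec
other_vec = [0.0, 0.0, 5.0]   # orthogonal to prompt_vec

same_direction = cosine_similarity(prompt_vec, video_vec)
unrelated = cosine_similarity(prompt_vec, other_vec)

# Same percentage formatting as the app's similarity message
print(f"{same_direction * 100:.2f}%")
print(f"{unrelated * 100:.2f}%")
```

Cosine similarity ranges from -1 to 1, so the percentage shown by the app is a rescaled similarity rather than a probability, and it can be negative for unrelated texts.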

### Test Video:
To demonstrate the functionality of this web app, please provide a test video link or upload a transcription file.

> Test video: https://www.youtube.com/watch?v=8kK2zwjRV0M

```
For any inquiries or support, feel free to reach out to Marouane MARHOUN @t marmarhoun@gmail.com

Github profile: https://github.com/MarMarhoun/
LinkedIn profile: https://www.linkedin.com/in/marmarhoun/
```





The notebook is built in three steps:

1. **Extract subtitles from a YouTube video:** The `youtube-transcript-api` library fetches the subtitles.
2. **Use a pre-trained language model:** A model from Hugging Face computes the text similarity.
3. **Build the Gradio interface:** Users upload a video or provide a YouTube link, enter a prompt, and see the result.
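
Step 1 needs the video ID, which has to be pulled out of the URL the user pastes. The sketch below shows one way to do that with the standard library, assuming only two URL shapes: `youtube.com/watch?v=<id>` and `youtu.be/<id>`. The helper name `get_video_id` is hypothetical, and other shapes (embed or shorts links, for example) are not handled.

```python
from urllib.parse import urlparse, parse_qs

def get_video_id(video_url):
    """Return the YouTube video ID from a URL, or None if none is found."""
    parsed = urlparse(video_url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host.endswith("youtube.com"):
        # Watch links carry the ID in the "v" query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

# Extra query parameters such as a timestamp do not leak into the ID
print(get_video_id("https://www.youtube.com/watch?v=8kK2zwjRV0M&t=42s"))
```

Returning `None` for unrecognized input lets the caller show a clear message instead of sending a malformed ID to the transcript API.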




## Install the libraries

In [1]:
!pip install gradio youtube-transcript-api transformers torch


In [2]:
!pip install sentence-transformers wordcloud matplotlib



# Before Summarization of the Extracted Text

In [26]:
import gradio as gr
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, VideoUnavailable
from sentence_transformers import SentenceTransformer, util
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os


# Load the similarity model once at the start
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to extract subtitles from a YouTube video with error handling
def extract_subtitles(video_url):
    try:
        # Take the value after "v=" and drop extra query parameters such as &t=
        video_id = video_url.split("v=")[-1].split("&")[0]
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine the text from the transcript
        full_text = " ".join([entry['text'] for entry in transcript])
        return full_text
    except NoTranscriptFound:
        return "Error: No transcript available for this video."
    except VideoUnavailable:
        return "Error: The video is unavailable or does not have subtitles."
    except Exception as e:
        return f"Error: An unexpected error occurred: {str(e)}"

# Function to save extracted text to a file
def save_extracted_text(video_id, text):
    with open(f"{video_id}_transcript.txt", "w") as f:
        f.write(text)

# Function to create a word cloud image
def create_wordcloud(text, filename, title=None):
    if not text.strip():  # Check if the text is empty
        return None

    wordcloud = WordCloud(width=800, height=400, background_color='black').generate(text)

    plt.figure(figsize=(18, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')

    # Add title if provided
    if title:
        plt.title(title, fontsize=24, color='black')  # Customize title appearance as needed

    plt.savefig(filename, format='png')
    plt.close()
    return filename

# Function to compute similarity and gaps
def analyze_text(prompt, video_url):
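    # Extract subtitles
    extracted_text = extract_subtitles(video_url)

    # extract_subtitles reports failures as a message starting with "Error:"
    if extracted_text.startswith("Error:"):
        return extracted_text, "", None  # one value per Gradio output

    # Save the extracted text for future use
    video_id = video_url.split("v=")[-1].split("&")[0]  # drop extra query parameters such as &t=
    save_extracted_text(video_id, extracted_text)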

    # Ensure prompt and extracted_text are not empty
    if not prompt or not extracted_text:
        return "Error: Prompt or extracted text is empty.", "", None

    # Compute similarity score
    try:
        embeddings1 = model.encode(prompt, convert_to_tensor=True)
        embeddings2 = model.encode(extracted_text, convert_to_tensor=True)
        similarity_score = util.cos_sim(embeddings1, embeddings2).item()
    except Exception as e:
        return f"Error: An issue occurred while computing similarity: {str(e)}", "", None

    # Create a similarity message
    #similarity_message = f"Similarity Score: {similarity_score:.2f}"
    similarity_message = f"Similarity Score: {similarity_score * 100:.2f}% (This score indicates how closely the prompt matches the extracted text.)"

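    # Find gaps: split the transcript into sentences and keep every sentence
    # that does not contain the full prompt as a substring. This is a strict
    # test, so most sentences are usually reported as missing.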
    extracted_sentences = re.split(r'(?<=[.!?]) +', extracted_text)
    missing_points = [sentence for sentence in extracted_sentences if prompt.lower() not in sentence.lower()]
    # Create the word cloud for missing points
    missing_wordcloud_path = create_wordcloud(" ".join(missing_points), 'missing_wordcloud.png', title='Missing Points WordCloud') if missing_points else None


    # Format missing points for better presentation
    if missing_points:
        formatted_missing_points = "\n".join([f"- {point.strip()}" for point in missing_points])
        missing_points_message = f"Missing Points ({len(missing_points)}):\n{formatted_missing_points}"
    else:
        missing_points_message = "No missing points found."

    return similarity_message, missing_points_message, missing_wordcloud_path

# Function to handle text extraction and update the output
def extract_and_display_text(video_url):
    extracted_text = extract_subtitles(video_url)
    # Create the word cloud for the extracted text with a title
    extracted_wordcloud_path = create_wordcloud(extracted_text, 'extracted_wordcloud.png', title='Extracted Text WordCloud')

    return extracted_text, extracted_wordcloud_path

# Gradio interface
def gradio_interface(prompt, video_url):
    score, gaps, missing_wordcloud_path = analyze_text(prompt, video_url)
    return score, gaps, missing_wordcloud_path

# Create Gradio app
with gr.Blocks() as app:
    gr.Markdown("<h2 style='text-align: center;'>Text Similarity Analysis from Video</h2>")

    # Example video URLs
    gr.Markdown("### Example YouTube Video URLs")
    gr.Markdown("- [DNA Structure and Replication](https://www.youtube.com/watch?v=8kK2zwjRV0M)")

    with gr.Row():
        with gr.Column():
            video_url = gr.Textbox(label="YouTube Video URL", placeholder="Enter the YouTube video URL here...")
            extracted_text_output = gr.Textbox(label="Extracted Text", interactive=False, lines=10)
            extracted_wordcloud = gr.Image(label="Extracted Text Word Cloud", interactive=False)
            extract_btn = gr.Button("Extract Text")

        with gr.Column():
            prompt = gr.Textbox(label="Prompt Text", placeholder="Enter your prompt text here...")

            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            gaps_output = gr.Textbox(label="Missing Points", interactive=False)

            missing_wordcloud = gr.Image(label="Missing Points Word Cloud", interactive=False)
            submit_btn = gr.Button("Analyze")

    # Button click events
    extract_btn.click(extract_and_display_text, inputs=video_url, outputs=[extracted_text_output, extracted_wordcloud])
    submit_btn.click(gradio_interface, inputs=[prompt, video_url], outputs=[similarity_output, gaps_output, missing_wordcloud])

    gr.Markdown("### Example Prompts")
    gr.Markdown("- Explain the structure of DNA and its significance in genetics.")
    gr.Markdown("- DNA sequencing is used in modern medicine")
    gr.Markdown("- DNA plays a weak role in the process of protein synthesis.")
    gr.Markdown("- DNA sequencing has major implications in modern medicine.")
    gr.Markdown("- How does DNA replication occur, and why is it important for cell division?")
    gr.Markdown("- What are the ethical considerations surrounding genetic engineering and DNA manipulation?")

# Launch the app
app.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://46bf89108ea42410a0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
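
The gap check in the cell above keeps every transcript sentence that does not contain the whole prompt as a substring, so nearly every sentence ends up in the "missing" list. A possible refinement, sketched here under stated assumptions, is to score each sentence against the prompt and report only the sentences that score below a threshold. To stay self-contained, the sketch swaps the sentence-embedding model for a toy bag-of-words vector. The names `split_sentences`, `bag_of_words`, `find_gaps` and the `threshold` value are hypothetical; with real embeddings the same loop would apply, but the threshold would need tuning.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_gaps(prompt, transcript, threshold=0.2):
    """Return transcript sentences whose similarity to the prompt is below threshold."""
    prompt_vec = bag_of_words(prompt)
    return [
        sentence
        for sentence in split_sentences(transcript)
        if cosine(prompt_vec, bag_of_words(sentence)) < threshold
    ]

transcript = (
    "DNA is a double helix. "
    "Replication copies the DNA before cell division. "
    "Ribosomes build proteins from amino acids."
)
gaps = find_gaps("DNA is shaped like a double helix", transcript)
for sentence in gaps:
    print("-", sentence)
```

Both this sketch and the original cell rely on sentence punctuation: a transcript with no `.`, `!` or `?` comes back from the split as a single sentence, so it is either reported as one gap or not at all.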




# After Summarization of the Extracted Text

In [28]:
import gradio as gr
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, VideoUnavailable
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Load the similarity model and summarization model once at the start
similarity_model = SentenceTransformer('all-MiniLM-L6-v2')
summarization_model = pipeline("summarization")

# Function to extract subtitles from a YouTube video with error handling
def extract_subtitles(video_url):
    try:
        # Take the value after "v=" and drop extra query parameters such as &t=
        video_id = video_url.split("v=")[-1].split("&")[0]
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine the text from the transcript
        full_text = " ".join([entry['text'] for entry in transcript])
        return full_text
    except NoTranscriptFound:
        return "Error: No transcript available for this video."
    except VideoUnavailable:
        return "Error: The video is unavailable or does not have subtitles."
    except Exception as e:
        return f"Error: An unexpected error occurred: {str(e)}"

# Function to save extracted text to a file
def save_extracted_text(video_id, text):
    with open(f"{video_id}_transcript.txt", "w") as f:
        f.write(text)


# Function to create a word cloud image
def create_wordcloud(text, filename, title=None):
    if not text.strip():  # Check if the text is empty
        return None

    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    plt.figure(figsize=(18, 6))
    plt.imshow(wordcloud, interpolation='bicubic')
    plt.axis('off')

    # Add title if provided
    if title:
        plt.title(title, fontsize=24, color='black')  # Customize title appearance as needed

    plt.savefig(filename, format='png')
    plt.close()
    return filename


# Function to compute similarity and gaps
def analyze_text(prompt, video_url):
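    # Extract subtitles
    extracted_text = extract_subtitles(video_url)

    # extract_subtitles reports failures as a message starting with "Error:"
    if extracted_text.startswith("Error:"):
        return extracted_text, "", None  # one value per Gradio output

    # Save the extracted text for future use
    video_id = video_url.split("v=")[-1].split("&")[0]  # drop extra query parameters such as &t=
    save_extracted_text(video_id, extracted_text)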

    # Ensure prompt and extracted_text are not empty
    if not prompt or not extracted_text:
        return "Error: Prompt or extracted text is empty.", "", None

    # Compute similarity score
    try:
        embeddings1 = similarity_model.encode(prompt, convert_to_tensor=True)
        embeddings2 = similarity_model.encode(extracted_text, convert_to_tensor=True)
        similarity_score = util.cos_sim(embeddings1, embeddings2).item()
    except Exception as e:
        return f"Error: An issue occurred while computing similarity: {str(e)}", "", None

    # Create a similarity message
    #similarity_message = f"Similarity Score: {similarity_score:.2f}"
    similarity_message = f"Similarity Score: {similarity_score * 100:.2f}% (This score indicates how closely the prompt matches the extracted text.)"

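    # Find gaps: split the transcript into sentences and keep every sentence
    # that does not contain the full prompt as a substring. This is a strict
    # test, so most sentences are usually reported as missing.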
    extracted_sentences = re.split(r'(?<=[.!?]) +', extracted_text)
    missing_points = [sentence for sentence in extracted_sentences if prompt.lower() not in sentence.lower()]
    # Create the word cloud for missing points
    missing_wordcloud_path = create_wordcloud(" ".join(missing_points), 'missing_wordcloud_sum.png', title='Missing Points WordCloud from Summarized text') if missing_points else None



    # Summarize the major missing points using the summarization model
    if missing_points:
        # Join missing points into a single string for summarization
        missing_text = " ".join(missing_points)

        # Ensure the text is not too long for the summarization model
        if len(missing_text) > 1024:  # Adjust the length limit as needed
            missing_text = missing_text[:1024]  # Truncate to the first 1024 characters

        summarized_missing_points = summarization_model(missing_text, max_length=150, min_length=30, do_sample=False)
        major_missing_points = summarized_missing_points[0]['summary_text']
    else:
        major_missing_points = "No missing points found."

    return similarity_message, major_missing_points, missing_wordcloud_path

# Function to handle text extraction and update the output
def extract_and_display_text(video_url):
    extracted_text = extract_subtitles(video_url)
    # Create the word cloud for the extracted text with a title
    extracted_wordcloud_path = create_wordcloud(extracted_text, 'extracted_wordcloud_sum.png', title='Extracted Text WordCloud')

    return extracted_text, extracted_wordcloud_path

# Gradio interface
def gradio_interface(prompt, video_url):
    score, gaps, missing_wordcloud_path = analyze_text(prompt, video_url)
    return score, gaps, missing_wordcloud_path

# Create Gradio app
with gr.Blocks() as app:
    gr.Markdown("<h2 style='text-align: center;'>Text Similarity Analysis from Video - with Summarization </h2>")

    # Example video URLs
    gr.Markdown("### Example YouTube Video URLs")
    gr.Markdown("- [DNA Structure and Replication](https://www.youtube.com/watch?v=8kK2zwjRV0M)")

    with gr.Row():
        with gr.Column():
            video_url = gr.Textbox(label="YouTube Video URL", placeholder="Enter the YouTube video URL here...")
            extracted_text_output = gr.Textbox(label="Extracted Text", interactive=False, lines=10)
            extracted_wordcloud = gr.Image(label="Extracted Text Word Cloud", interactive=False)
            extract_btn = gr.Button("Extract Text")


        with gr.Column():
            prompt = gr.Textbox(label="Prompt Text", placeholder="Enter your prompt text here...")

            similarity_output = gr.Textbox(label="Similarity Score", interactive=False)
            gaps_output = gr.Textbox(label="Missing Points", interactive=False)

            missing_wordcloud = gr.Image(label="Missing Points Word Cloud", interactive=False)
            submit_btn = gr.Button("Analyze")

    # Button click events
    extract_btn.click(extract_and_display_text, inputs=video_url, outputs=[extracted_text_output, extracted_wordcloud])
    submit_btn.click(gradio_interface, inputs=[prompt, video_url], outputs=[similarity_output, gaps_output, missing_wordcloud])

    gr.Markdown("### Example Prompts")
    gr.Markdown("- Explain the structure of DNA and its significance in genetics.")
    gr.Markdown("- DNA sequencing is used in modern medicine")
    gr.Markdown("- DNA plays a weak role in the process of protein synthesis.")
    gr.Markdown("- DNA sequencing has major implications in modern medicine.")
    gr.Markdown("- How does DNA replication occur, and why is it important for cell division?")
    gr.Markdown("- What are the ethical considerations surrounding genetic engineering and DNA manipulation?")

# Launch the app
app.launch()

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://9301be49f0a3723127.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
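
The cell above keeps only the first 1024 characters of the missing text before summarizing, so anything later in the transcript never reaches the summarizer. One alternative, sketched here as an assumption rather than as part of the original app, is to group sentences into chunks that each fit the budget and summarize the chunks one by one. The helper name `chunk_sentences` and its `max_chars` parameter are hypothetical. Model input limits are usually counted in tokens rather than characters, so a character budget is only a rough proxy.

```python
import re

def chunk_sentences(text, max_chars=1024):
    """Group sentences into chunks of at most max_chars characters each.

    A single sentence longer than max_chars is hard-split into pieces,
    so no chunk ever exceeds the limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split oversized sentences into max_chars-sized pieces
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The current chunk is full: store it and start a new one
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks

text = "First point about DNA. Second point about RNA. Third point about proteins."
chunks = chunk_sentences(text, max_chars=50)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk could then be passed to `summarization_model` separately and the partial summaries joined, at the cost of one model call per chunk.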


