# PolitoSbobinatore

### Installation of Dependencies and Environment Setup
The next cell installs the dependencies and sets up the environment needed to run the rest of the notebook: the Python libraries Whisper, MoviePy, Transformers, PyTorch, Selenium, and Requests, plus ffmpeg and the Chromium browser with its driver.

In [None]:
!pip install git+https://github.com/openai/whisper.git
!pip install --upgrade transformers torch
!pip install moviepy selenium webdriver_manager requests
!apt-get update -y
!apt-get install -y ffmpeg
!apt-get install -y chromium-browser
!apt-get install -y chromedriver
!apt-get install -y chromium-chromedriver
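
Once the installation cell has run, a quick check can confirm that the pieces the later cells rely on are actually available. The sketch below is only an assumption about what is worth checking: the import names in `REQUIRED_MODULES` are assumed to match the packages installed above, and the helper name `check_environment` is hypothetical.

```python
import importlib.util
import shutil

# Import names assumed to correspond to the packages installed above
REQUIRED_MODULES = ["whisper", "moviepy", "transformers", "torch", "selenium", "requests"]
# Executables assumed to be needed on the PATH
REQUIRED_BINARIES = ["ffmpeg"]

def check_environment(modules=REQUIRED_MODULES, binaries=REQUIRED_BINARIES):
    """Return the names of modules and executables that could not be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in binaries:
        # shutil.which returns None when the executable is not on the PATH
        if shutil.which(name) is None:
            missing.append(name)
    return missing

missing = check_environment()
print("Missing:", ", ".join(missing) if missing else "nothing")
```

An empty result means every listed module and executable was found. It does not prove that the installed versions are compatible with each other.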

# Insert Your PoliTo Credentials and Course Information

Before running the script, fill in your PoliTo credentials and the course details:


In [None]:
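# INSERT YOUR DATA HERE
USERNAME = ""       # PoliTo username
PASSWORD = ""       # PoliTo password
COURSE_TITLE = ""   # Course name, exactly as its link appears on the portal
LECTURE_TITLE = ""  # Lecture title, exactly as its link appears in the Virtual classroom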
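
Typing a password directly into a notebook cell leaves it in the saved file. One possible alternative, sketched below, is to read each value from an environment variable and fall back to an interactive prompt. The helper `read_setting` and the variable names `POLITO_USERNAME` and `POLITO_PASSWORD` are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass

def read_setting(env_name, prompt, secret=False):
    """Return the value of an environment variable, or ask for it interactively."""
    value = os.environ.get(env_name)
    if value:
        return value
    # getpass hides the typed characters; input echoes them
    return getpass(prompt) if secret else input(prompt)

# Possible usage (hypothetical variable names):
# USERNAME = read_setting("POLITO_USERNAME", "PoliTo username: ")
# PASSWORD = read_setting("POLITO_PASSWORD", "PoliTo password: ", secret=True)
```

With this approach the credentials never appear in the notebook source or in its saved output.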


# Automated Login and Video Download from Virtual Classroom

This script uses Selenium to automate the login to the Politecnico di Torino portal, navigate through the platform, and download the lecture video from the Virtual Classroom. The process includes:

1. Automatic login to the PoliTo portal with credentials.
2. Navigation to the specified course and desired lecture.
3. Extraction of the video URL from the lecture page.
4. Downloading the video as an MP4 file to the local system.

The browser runs in headless mode (without a graphical interface), so the whole process needs no manual interaction.
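
The cell below pauses for a fixed 10 seconds after the login and after opening the lecture. A fixed pause is either too long or too short depending on how fast the page loads. A common alternative is to poll until the element is there. The sketch below shows that idea in plain Python so it can be read without a browser. The helper name `wait_until` and its default values are assumptions for illustration.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception as error:  # e.g. an element that is not in the page yet
            last_error = error
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} s") from last_error
        time.sleep(poll_interval)
```

With a Selenium driver this could be called as `wait_until(lambda: driver.find_element(By.LINK_TEXT, COURSE_TITLE))`. Selenium also ships its own version of the pattern: `WebDriverWait(driver, 10).until(...)` from `selenium.webdriver.support.ui`, used together with the conditions in `selenium.webdriver.support.expected_conditions`.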


In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import requests

# Settings to use Chromium in headless mode
chrome_options = Options()
chrome_options.add_argument('--headless')  # Headless mode
chrome_options.add_argument('--no-sandbox')  # Necessary for running in a container environment
chrome_options.add_argument('--disable-dev-shm-usage')  # Necessary to avoid memory issues

# Configure the Chromium driver
driver = webdriver.Chrome(options=chrome_options)

# Login URL and selectors for the login form elements
URL = "https://idp.polito.it/home"
USERNAME_FIELD_ID = "username"
PASSWORD_FIELD_ID = "password"
LOGIN_BUTTON_CSS_SELECTOR = "button.login.button"

# Maximize the window
driver.maximize_window()

try:
    # Open the login page
    driver.get(URL)

    # Find the username field and enter the value
    username_box = driver.find_element(By.ID, USERNAME_FIELD_ID)
    username_box.send_keys(USERNAME)

    # Find the password field and enter the value
    password_box = driver.find_element(By.ID, PASSWORD_FIELD_ID)
    password_box.send_keys(PASSWORD)

    # Find and click the login button
    login_button = driver.find_element(By.CSS_SELECTOR, LOGIN_BUTTON_CSS_SELECTOR)
    login_button.click()

    # Wait 10 seconds for the page to load
    time.sleep(10)

    # Navigate to the specified course using the course title
    link_Course = driver.find_element(By.LINK_TEXT, COURSE_TITLE)
    link_Course.click()

    # Click the link for the "Virtual classroom"
    link_virtualClass = driver.find_element(By.LINK_TEXT, "Virtual classroom")
    link_virtualClass.click()

    # Find and click the link whose text matches LECTURE_TITLE
    link_vc = driver.find_element(By.XPATH, f"//a[text()='{LECTURE_TITLE}']")
    link_vc.click()

    # Wait 10 seconds for the video to load
    time.sleep(10)

    # Find the video element using the class 'video-js' and get the video URL
    video_element = driver.find_element(By.CLASS_NAME, "video-js")
    video_url = video_element.find_element(By.TAG_NAME, "source").get_attribute("src")

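    # Download the video using the obtained URL
    print(f"Video URL found: {video_url}")
    # Stream the response so the whole video is never held in memory at once
    with requests.get(video_url, stream=True, timeout=60) as video_response:
        video_response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
        # Save the video to disk with the name "video.mp4"
        with open("video.mp4", "wb") as video_file:
            for chunk in video_response.iter_content(chunk_size=1024 * 1024):
                video_file.write(chunk)
    print("Video downloaded successfully!")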

    # Print the page title for verification
    print("Page title:", driver.title)

finally:
    # Wait 10 seconds to see the result
    time.sleep(10)
    # Close the browser
    driver.quit()
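
The download above is made with `requests`, which does not share the login session held by the Selenium browser. Whether the video URL needs that session is an assumption that depends on the server. If it does, the browser's cookies can be forwarded with the request. The sketch below converts the list of dictionaries returned by `driver.get_cookies()` into a `Cookie` header value. The helper name `cookies_to_header` and the sample values are made up for illustration.

```python
def cookies_to_header(selenium_cookies):
    """Build a Cookie header value from the list returned by driver.get_cookies()."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)

# Example with the dictionary shape Selenium uses (values are made up)
sample = [{"name": "session", "value": "abc123"}, {"name": "lang", "value": "en"}]
print(cookies_to_header(sample))  # session=abc123; lang=en
```

The header would then be passed as `requests.get(video_url, headers={"Cookie": cookies_to_header(driver.get_cookies())})`, and the cookies must be read before `driver.quit()` is called.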

# Extracting Audio from Video and Transcription with Whisper

This script uses `moviepy` to extract the audio track from the video file and save it as an MP3 file, and it loads OpenAI's Whisper model for the transcription performed in the next section. The process includes:

1. Loading the video file in MP4 format.
2. Extracting the audio and saving it as an MP3 file.
3. Loading the Whisper model; the transcription itself runs in the next cell.

The result is an MP3 file that contains the audio from the video, ready for transcription.


In [None]:
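# MoviePy 2.x removed the `moviepy.editor` module; fall back to it only on 1.x
try:
    from moviepy import VideoFileClip
except ImportError:
    from moviepy.editor import VideoFileClip
import whisper

# Load the Whisper model
model = whisper.load_model("medium")  # You can use "tiny", "base", "small", "medium", or "large"

# Path to the MP4 file
input_file = "video.mp4"
# Name of the output MP3 file
output_file = "audio.mp3"

# Load the video
video = VideoFileClip(input_file)

# Extract the audio and save it as MP3
video.audio.write_audiofile(output_file)

# Release the file handles held by the clip
video.close()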

print(f"Conversion completed! File saved as: {output_file}")
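
The audio can also be extracted by calling ffmpeg directly. This is a different route from the MoviePy call above, shown only as a sketch. Whisper decodes its input to 16 kHz mono audio, so the example asks ffmpeg for that format up front. The helper name `build_ffmpeg_command` and the output name `audio.wav` are assumptions for illustration.

```python
import subprocess

def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Return the ffmpeg argument list that extracts a mono audio track."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # mix down to a single channel
        "-ar", str(sample_rate),   # resample to the requested rate
        audio_path,
    ]

command = build_ffmpeg_command("video.mp4", "audio.wav")
print(" ".join(command))

# To actually run it (requires ffmpeg and the input file to exist):
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.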

# Audio Transcription with Whisper

This script uses OpenAI's Whisper model to transcribe audio from an MP3 file into text. The process includes:

1. Loading the audio file in MP3 format.
2. Transcribing the audio using the Whisper model.
3. Saving the transcription as a text file.

The result is a text file containing the full transcription of the audio, ready for review or further processing.
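
Besides the plain `"text"` entry, the dictionary returned by Whisper's `transcribe` also has a `"segments"` list in which every item carries `"start"`, `"end"`, and `"text"`. For lecture notes it can help to keep those timestamps. The sketch below formats such a list. The helper names and the sample segments are made up for illustration and stand in for a real transcription result.

```python
def format_timestamp(seconds):
    """Convert a number of seconds to HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def format_segments(segments):
    """Render one '[start - end] text' line per segment."""
    return "\n".join(
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    )

# Made-up data in the shape of Whisper's result["segments"]
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the lecture."},
    {"start": 4.2, "end": 3725.9, "text": " Today we talk about fine-tuning."},
]
print(format_segments(sample_segments))
```

With a real result, `format_segments(result["segments"])` could be written to a second file next to the plain transcription.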


In [None]:
# Path to the MP3 file
input_audio = "audio.mp3"
# Name of the output file
output_text = "transcription.txt"

# Transcribe the audio
result = model.transcribe(input_audio)

# Save the transcription to a text file
with open(output_text, "w", encoding="utf-8") as file:
    file.write(result["text"])

# Text Summarization with BART

This script prepares the BART model from Hugging Face's Transformers library for summarizing the transcription. The process includes:

1. Reading the transcription text from a file.
2. Loading the BART model and tokenizer for summarization (`facebook/bart-large-cnn`).
3. Storing the text in `transcription_text` for the summarization step.

This script only sets up the model and tokenizer; the summary itself is generated in the next section.


In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Name of the file to read
file_path = "transcription.txt"

# Open and read the content of the file
with open(file_path, "r", encoding="utf-8") as file:
    content = file.read()

transcription_text = content

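# Load the BART model and tokenizer
# Note: this rebinds `model`, so the Whisper model loaded earlier is no longer reachable by that name
print("Loading the BART model for summarization...")
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')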


# Chunk-Based Summarization of Transcription with BART

This script generates a summary for a large transcription by breaking it into smaller chunks, each of which is summarized separately. The process includes:

1. Splitting the transcription text into chunks of a maximum length (1024 characters).
2. Using the BART model to summarize each chunk with the context of previous summaries.
3. Concatenating the individual summaries to create the final summarized text.
4. Saving the final summary as a text file.

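This approach lets long transcriptions be summarized even though they exceed BART's 1024-token input limit. Passing the previous summary along with each chunk is intended to keep the individual summaries coherent with one another.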
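
Splitting every 1024 characters can cut a sentence, or even a word, in half. A possible refinement, sketched below, groups whole sentences into chunks that stay under the same limit. It is not what the cell below does. The helper name `split_into_chunks` is hypothetical, and the sentence detection is a simple assumption: a sentence ends at `.`, `!`, or `?` followed by whitespace.

```python
import re

def split_into_chunks(text, max_chars=1024):
    """Group whole sentences into chunks of at most max_chars characters."""
    # Split after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut by characters
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_into_chunks("One. Two three! Four five six? Seven.", max_chars=16))
```

The limit is still measured in characters, so it only approximates the model's token limit, as in the original approach.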


In [None]:
# Maximum length for each chunk (approximation in characters)
max_length = 1024

# Split the text into chunks based on the maximum length in characters
chunks = [transcription_text[i:i + max_length] for i in range(0, len(transcription_text), max_length)]
# Generate the summary for each chunk
summaries = []

# Summary of the previous chunk, used as context for the next one
summary = ""

for i, chunk in enumerate(chunks):
    # Prompt for the summary with the context of the previous summary
    prompt = f"""SUMMARIZE THIS TRANSCRIPTION OF A LECTURE OF {COURSE_TITLE} AS A UNIVERSITY STUDENT THAT IS TAKING NOTES: {summary if summary else ''}. {chunk}. RESPONSE: """
    print(prompt)
    # Tokenize the prompt (inputs longer than the model limit are truncated)
    prompt_tokenized = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    # Generate the summary using input_ids and attention_mask
    outputs = model.generate(
        prompt_tokenized["input_ids"],
        attention_mask=prompt_tokenized["attention_mask"],
        max_length=1023,
        min_length=100,
        length_penalty=2.0,
        do_sample=True,  # temperature and top_p only take effect when sampling is enabled
        temperature=0.4,
        top_p=0.9
    )

    # Decode the result
    new_summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Add the summary to the final result
    summaries.append(new_summary)

    # Update the summary for the next chunk
    summary = new_summary

print(summary)

# Combine the generated summaries
final_summary = " ".join(summaries)

# Save the summary
output_summary = "final_summary.txt"
with open(output_summary, "w", encoding="utf-8") as file:
    file.write(final_summary)

print(f"Summary completed! Saved in: {output_summary}")
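
The loop above can be separated from the model so that the chaining logic is easier to test or reuse. The sketch below is one way to do that, not the notebook's own implementation. The function `summarize_with_context`, the `context_chars` parameter, and the stand-in summarizer `last_three_words` are all hypothetical names.

```python
def summarize_with_context(chunks, summarize, context_chars=300):
    """Summarize chunks in order, prefixing each with the tail of the previous summary."""
    summaries = []
    previous = ""
    for chunk in chunks:
        # Keep only the tail of the previous summary so the context
        # cannot crowd the chunk out of the model's input window
        context = previous[-context_chars:] if context_chars > 0 else ""
        text = f"{context} {chunk}".strip()
        previous = summarize(text)
        summaries.append(previous)
    return " ".join(summaries)

# Stand-in summarizer for illustration only: keeps the last three words
def last_three_words(text):
    return " ".join(text.split()[-3:])

print(summarize_with_context(["alpha beta gamma delta", "epsilon zeta"], last_three_words))
```

To use BART here, `summarize` would be a function wrapping the tokenizer call, `model.generate`, and `tokenizer.decode` from the cell above.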
