# Transcription and summarization notebook with AIs

---



Repository: https://github.com/martinopiaggi/summarize

In [1]:
Source = "https://www.youtube.com/watch?v=n6M6MK9JVoA" #@param {type:"string"}
Type_of_source = "Youtube video or playlist" #@param ['Text', 'Youtube video or playlist', 'Videos on Google Drive folder','Dropbox video link']

Type = Type_of_source

if Type == 'Youtube video or playlist':
  URL = Source

if Type == 'Dropbox video link':
  dropbox_URL = Source

#@markdown ---
#@markdown ## Advanced settings

#@markdown If the source is a video, include timestamps in the final summary?
Timestamps = True #@param {type:"boolean"}

#@markdown Number of iterations of the summarization
Iterations = 1 #@param {type:"number"}

# TODO: dynamic iterations
# Adaptive iteration only works well with podcasts or videos that contain speech

In [2]:
#@markdown ### Installation of libraries
#@markdown Installs and imports the libraries used by the following cells

!pip install transformers
!pip install tensorflow
from transformers import pipeline,BartTokenizer, BartForConditionalGeneration
from torch.utils.data import Dataset, DataLoader

import re
import math

# Membership test: comparing with `==` against an `or` chain only checks the first string
if Type in ("Youtube video or playlist",
            "Videos on Google Drive folder",
            "Dropbox video link"):

  !pip install faster-whisper
  from faster_whisper import WhisperModel
  from pathlib import Path
  import subprocess
  import torch
  import shutil
  import numpy as np

  if Type == "Youtube video or playlist":
    !pip install -U --pre yt-dlp
    import yt_dlp


  if Type == ("Dropbox video link"):
    # -y answers the install prompt, which cannot be confirmed interactively in a notebook
    !sudo apt update && sudo apt install -y ffmpeg
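
A note on checking `Type` against several options: in Python, `x == ("a" or "b")` evaluates the `or` first, which returns the first non-empty string, so only that one option is ever compared. A minimal sketch of the difference follows; `source_type` and `VIDEO_TYPES` are illustrative names that do not appear in the notebook.

```python
source_type = "Dropbox video link"

# `or` between non-empty strings returns the first one, so this expression
# compares source_type with "Youtube video or playlist" only.
buggy = source_type == ("Youtube video or playlist"
                        or "Videos on Google Drive folder"
                        or "Dropbox video link")

# A membership test compares against every option.
VIDEO_TYPES = (
    "Youtube video or playlist",
    "Videos on Google Drive folder",
    "Dropbox video link",
)
fixed = source_type in VIDEO_TYPES

print(buggy, fixed)  # False True
```

With the `==`/`or` form, a Dropbox or Google Drive source would skip the block that installs and imports `faster-whisper`.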




In [3]:
#@markdown ### Video downloads
#@markdown Downloading video sources

if Type == "Youtube video or playlist":
  video_path_local_list = []

# Function to download and extract audio from YouTube videos or playlists
def download_youtube_audio(url, download_path='.'):
    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': f'{download_path}/%(id)s.%(ext)s',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
            'preferredquality': '192',
        }]
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(url, download=True)
        filename = ydl.prepare_filename(info_dict)
        return Path(filename).with_suffix('.wav')

# Function to download and convert Dropbox video to audio
def download_dropbox_video(dropbox_url, output_audio_path='dropbox_video_audio.wav'):
    subprocess.run(['wget', '-O', 'dropbox_video.mp4', dropbox_url], check=True)
    subprocess.run(['ffmpeg', '-i', 'dropbox_video.mp4', '-vn', '-acodec', 'pcm_s16le', '-ar', '16000', '-ac', '1', output_audio_path], check=True)



video_path_local_list = []

if Type == "Youtube video or playlist":
    audio_path = download_youtube_audio(URL)
    video_path_local_list.append(audio_path)
    print(f"Downloaded and converted audio path: {audio_path}")

elif Type == "Dropbox video link":
    # URL is only set for YouTube sources; Dropbox links are stored in dropbox_URL
    download_dropbox_video(dropbox_URL)
    video_path_local_list.append("dropbox_video_audio.wav")
    print("Downloaded and converted Dropbox video to audio.")


[youtube] Extracting URL: https://www.youtube.com/watch?v=n6M6MK9JVoA
[youtube] n6M6MK9JVoA: Downloading webpage
[youtube] n6M6MK9JVoA: Downloading ios player API JSON
[youtube] n6M6MK9JVoA: Downloading android player API JSON
[youtube] n6M6MK9JVoA: Downloading player 5e928255
[youtube] n6M6MK9JVoA: Downloading m3u8 information
[info] n6M6MK9JVoA: Downloading 1 format(s): 251
[download] Destination: ./n6M6MK9JVoA.webm
[download] 100% of  199.00MiB in 00:00:05 at 38.50MiB/s  
[ExtractAudio] Destination: ./n6M6MK9JVoA.wav
Deleting original file ./n6M6MK9JVoA.webm (pass -k to keep)
Downloaded and converted audio path: n6M6MK9JVoA.wav


In [4]:
#@markdown ### Transcription
#@markdown Transcription using Whisper model

!pip install faster-whisper
from faster_whisper import WhisperModel



In [5]:
print(audio_path)

n6M6MK9JVoA.wav


In [6]:
language = "en"
initial_prompt = ""
Text = ""
TextTimestamps = ""

video_path_local = str(video_path_local_list[0])

def seconds_to_time_format(s):
    hours = s // 3600
    s %= 3600
    minutes = s // 60
    s %= 60
    seconds = s // 1
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"

if Type not in ["Text", "Text from Google Drive"]:
    model = WhisperModel('small', device="cuda", compute_type='int8')
    segments, info = model.transcribe(str(video_path_local), beam_size=5,
                                      language=None if language == "auto" else language,
                                      task="translate",
                                      initial_prompt=initial_prompt)

    transcript_file_name = video_path_local.replace(".wav", ".txt")
    transcript_file_name_timestamps = video_path_local.replace(".wav", "") + "Timestamps" + ".txt"
    with open(transcript_file_name, 'w') as f:
        for segment in segments:
            start_time = seconds_to_time_format(segment.start)
            Text += segment.text.strip() + " "
            TextTimestamps += f"[{start_time}] {segment.text.strip()} "
        f.write(Text)
        with open(transcript_file_name_timestamps, 'w') as ft:
          ft.write(TextTimestamps)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
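
The `[hh:mm:ss]` markers in the timestamped transcript come from converting each segment's start time, which is a float number of seconds. The conversion can be sketched with `divmod`, as below. `format_timestamp` is an illustrative name, and the `segments` list is a hypothetical stand-in for Whisper output, not real data from the run.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a duration in seconds to hh:mm:ss, dropping fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hypothetical (start_seconds, text) pairs standing in for transcription segments
segments = [(0.0, "Okay, so this is wild for me."), (7.4, "I don't care.")]

timestamped = " ".join(
    f"[{format_timestamp(start)}] {text}" for start, text in segments
)
print(timestamped)
```

Truncating to whole seconds keeps the markers compatible with a `\[\d{2}:\d{2}:\d{2}\]` pattern when they are parsed back out later.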



In [7]:
# Free GPU memory to avoid Colab crashes when the notebook is run several times
# torch.cuda.empty_cache()
# The Whisper model is no longer needed
del model
# Load the summarization model
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).to("cuda:0")

# `is not` compares object identity; use a membership test on the values instead
if Type not in ("Text", "Text from Google Drive"):
  Text = open(transcript_file_name, "r").read()
  TextTimestamps = open(transcript_file_name_timestamps, "r").read()
  if Type == "Dropbox video link":
    Text = open("dropbox_video_audio.txt", "r").read()

if Type == "Text":
  transcript_file_name = "Text.txt"
  Text = Source

# Define the TextDataset
class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]


for i in range(Iterations):
  # Split the text into overlapping chunks (sizes are in characters, not tokens)
  chunk_size = 1024
  overlap_size = 25
  texts = [Text[i:i+chunk_size] for i in range(0, len(Text), chunk_size - overlap_size)]

  # Calculate the total number of chunks we've got
  number_of_chunks = len(texts)

  # Now, let's calculate the ratio of the length of `TextTimestamps` to `Text`
  ratio = len(TextTimestamps) / len(Text)

  # We will use this ratio to determine the size of chunks for `TextTimestamps`
  # Calculate the size for chunks for TextTimestamps accounting for the ratio
  timestamps_chunk_size = int(chunk_size * ratio)
  timestamps_overlap_size = int(overlap_size * ratio)

  # Chunk the TextTimestamps similarly
  text_timestamps_chunks = [TextTimestamps[i:i+timestamps_chunk_size] for i in range(0, len(TextTimestamps) - int((chunk_size - overlap_size) * ratio), timestamps_chunk_size -timestamps_overlap_size)]
  print(text_timestamps_chunks)
  print(len(text_timestamps_chunks))
  def extract_timestamp_ranges(text_timestamp_chunks):
    # This regex matches the pattern [hh:mm:ss]
    timestamp_pattern = re.compile(r'\[(\d{2}:\d{2}:\d{2})]')
    ranges = []

    for chunk in text_timestamp_chunks:
        # Find all matches of the timestamp pattern
        matches = timestamp_pattern.findall(chunk)

        if matches:
            # Take the first start time
            start_time = matches[0]
            #add to the list
            ranges.append(f"[{start_time}]")
    return ranges

  # Collect the first timestamp found in each timestamped chunk
  timestamp_ranges = extract_timestamp_ranges(text_timestamps_chunks)


  dataset = TextDataset(texts)
  dataloader = DataLoader(dataset, batch_size=20, shuffle=False)

  summary = ''
  summaryTimestamps = ''

  ts_idx=0
  for batch in dataloader:

    # Convert the batch to the appropriate format
    encoded_inputs = tokenizer(batch, truncation=True, padding=True, return_tensors="pt", max_length=1024).to("cuda:0")

    summaries_output = model.generate(
        input_ids=encoded_inputs["input_ids"],
        attention_mask=encoded_inputs["attention_mask"],
        num_beams=4,
        length_penalty=0.6,
        no_repeat_ngram_size=2
    )

    # Decode each summarized output and append to the final summary
    for idx, output in enumerate(summaries_output):
      decoded_summary = tokenizer.decode(output, skip_special_tokens=True)

      if (ts_idx<(len(timestamp_ranges)) and ts_idx%2==0):
        summary += decoded_summary
        summaryTimestamps += "\n" + timestamp_ranges[ts_idx] + " " + decoded_summary
      else:
        summary += decoded_summary + "\n"
        summaryTimestamps += decoded_summary + "\n"
      ts_idx+=1
  #in case of two or more iterations
  Text = summary
  TextTimestamps = summaryTimestamps
  print("iter finished")


# Save the final summary
final_name = 'summary_' + transcript_file_name if Type != "Dropbox video link" else "summary_dropbox_video_audio.txt"

with open(final_name, 'w') as f:
  f.write(summaryTimestamps)


["[00:00:00] Okay, so this is wild for me. This is a very big day for me. I don't get star-struck by celebrities [00:00:07] I don't care. I find most of them disappointing frankly. I do get star-struck by [00:00:12] neuroscientists and neurology professors [00:00:15] So you this this actually is a very big deal for me ask the producers [00:00:20] I changed did I not wear a dress and then change into something else before he got here I [00:00:25] I [00:00:27] Put a dress on and then I started sweating and panicking so then I took it off and now I'm put a sweater on which was even [00:00:34] Dumber, so I'm fully flustered. I want to just say that out loud. I don't get flustered by celebrities [00:00:41] I get flustered by scientists [00:00:43] People that listen to this podcast know that I'm obsessed with neurology [00:00:47] In fact, I believe I know more than you do [00:00:50] And you very very well may and if you disagree with anything I say, I'll just cut it out [00:00:56] That's goo
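
The summarization cell splits the transcript by character offsets, while the `max_length=1024` passed to the tokenizer is measured in tokens, so the two limits are not the same unit. The sketch below isolates the two pure-Python steps involved: overlapping character chunks, and picking the first `[hh:mm:ss]` marker in a chunk. The helper names `chunk_text` and `first_timestamp` are illustrative and do not appear in the notebook.

```python
import re


def chunk_text(text, chunk_size, overlap):
    """Split text into character chunks; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


TIMESTAMP = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]")


def first_timestamp(chunk):
    """Return the first hh:mm:ss marker in the chunk, or None if there is none."""
    match = TIMESTAMP.search(chunk)
    return match.group(1) if match else None


sample = "[00:00:00] Okay, so this is wild for me. [00:00:07] I don't care."
chunks = chunk_text(sample, chunk_size=30, overlap=5)
for chunk in chunks:
    print(first_timestamp(chunk), repr(chunk))
```

A chunk boundary can fall inside a marker, in which case `first_timestamp` returns `None` for that chunk; callers need to handle that case rather than assume every chunk carries a timestamp.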

# Test with OpenAI calls

In [25]:
!pip install openai

import openai


ERROR: Operation cancelled by user

NameError: name 'os' is not defined

In [28]:
import os

# Read the API key from the environment instead of hard-coding a secret in the notebook
openai_api_key = os.environ["OPENAI_API_KEY"]

client = openai.OpenAI(
    api_key=openai_api_key,
)
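
API keys are safer kept out of the notebook source, since notebooks are often shared or committed. A minimal sketch, assuming the key is exported as an environment variable before the session starts; `load_api_key` is an illustrative helper and not part of the OpenAI library, and `DEMO_API_KEY` is a placeholder used only so the example runs without a real key.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in `env_var`, failing loudly if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return key


# Demo with a placeholder variable so no real secret is needed
os.environ.setdefault("DEMO_API_KEY", "placeholder-not-a-real-key")
print("key loaded:", bool(load_api_key("DEMO_API_KEY")))
```

The returned value can then be passed as `api_key` when constructing the client. Failing early with a clear message avoids a less obvious authentication error on the first request.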




In [58]:
summary_prompt = """Rewrite this video transcript excerpt into a concise summary. Correct any transcription errors. Start the summary with direct statements about the content, completely omitting any form of introduction or mention of 'summary', 'the speaker', 'this video', or 'this transcript'. Focus solely on the essence of the content as if you are continuing a conversation without needing to signal a beginning."""


def query_openai_gpt(prompt, model="gpt-3.5-turbo", max_tokens=1024):
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content":  prompt}
            ],
            max_tokens=max_tokens
        )
        # The first choice holds the generated message text
        return completion.choices[0].message.content
    except Exception as e:  # General exception handling
        return f"An error occurred: {str(e)}"

In [59]:
import requests
import re
from torch.utils.data import Dataset, DataLoader

# Define the TextDataset
class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

# Process and summarize text
def process_and_summarize(Text, TextTimestamps=None, Type=None):
    chunk_size = 2048
    overlap_size = 100

    texts = [Text[i:i+chunk_size] for i in range(0, len(Text), chunk_size - overlap_size)]
    dataset = TextDataset(texts)
    dataloader = DataLoader(dataset, batch_size=1, shuffle=False)  # Batch size set to 1 to avoid exceeding token limits

    summary = ''
    summaryTimestamps = ''

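    for batch in dataloader:
        text_chunk = batch[0]
        print("CHUNK:", text_chunk)
        # Send the instructions followed by the chunk to the chat model
        summarized_chunk = query_openai_gpt(summary_prompt + "\n\n" + text_chunk)
        print(summarized_chunk)
        summary += summarized_chunk + "\n"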

        if TextTimestamps:
            # Handle timestamps similarly if provided
            pass  # Add logic for handling timestamps here

    # Save the final summary
    final_name = 'summary_' + transcript_file_name if Type != "Dropbox video link" else "summary_dropbox_video_audio.txt"
    with open(final_name, 'w') as f:
        f.write(summary)


# `is not` compares object identity; use a membership test on the values instead
if Type not in ("Text", "Text from Google Drive"):
  Text = open(transcript_file_name, "r").read()
  TextTimestamps = open(transcript_file_name_timestamps, "r").read()
  if Type == "Dropbox video link":
    Text = open("dropbox_video_audio.txt", "r").read()

if Type == "Text":
  transcript_file_name = "Text.txt"
  Text = Source


process_and_summarize(Text, Type=Type)


CHUNK: Okay, so this is wild for me. This is a very big day for me. I don't get star-struck by celebrities I don't care. I find most of them disappointing frankly. I do get star-struck by neuroscientists and neurology professors So you this this actually is a very big deal for me ask the producers I changed did I not wear a dress and then change into something else before he got here I I Put a dress on and then I started sweating and panicking so then I took it off and now I'm put a sweater on which was even Dumber, so I'm fully flustered. I want to just say that out loud. I don't get flustered by celebrities I get flustered by scientists People that listen to this podcast know that I'm obsessed with neurology In fact, I believe I know more than you do And you very very well may and if you disagree with anything I say, I'll just cut it out That's good. So this is like a crazy honor for me. Do you want to just introduce yourself? So I don't screw it up Sure. Well, first of all, it's gre

KeyboardInterrupt: 
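
Because each run of the loop above makes one network call per chunk, it can help to test the chunk-and-join logic with a stand-in summarizer before spending API calls. The sketch below takes the summarizer as a parameter; `summarize_in_chunks` and `fake_summarizer` are hypothetical names, and the fake simply keeps the first sentence of each chunk rather than calling any model.

```python
from typing import Callable


def summarize_in_chunks(
    text: str,
    summarize_fn: Callable[[str], str],
    chunk_size: int = 2048,
    overlap: int = 100,
) -> str:
    """Summarize overlapping character chunks and join the results with newlines."""
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    return "\n".join(summarize_fn(chunk) for chunk in chunks)


def fake_summarizer(chunk: str) -> str:
    # Stand-in for an API call: keep only the text before the first period
    return chunk.split(".")[0].strip() + "."


text = "First point. Details follow here. " * 3
print(summarize_in_chunks(text, fake_summarizer, chunk_size=40, overlap=5))
```

Swapping `fake_summarizer` for a function that wraps the real API call leaves the chunking code unchanged.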