<a href="https://colab.research.google.com/github/JanEggers-hr/youtube-scraper/blob/main/youtube_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Youtube-Scraper v05

Get the audio for all public videos of a YouTube channel or playlist, convert speech to text, and summarize the transcripts with an AI Large Language Model. All data is stored in the account's Google Drive.

- Use [yt-dlp](https://github.com/yt-dlp/yt-dlp), a fork of youtube-dl, to collect all video metadata in an Excel sheet. 
- Use yt-dlp to download each video's audio-only .M4A version with the "-f 140" option (thx Cappucchino)
- Use [OpenAI's Whisper library](https://github.com/openai/whisper) to do multi-language speech-to-text conversion
- Use a Large Language Model to summarize the transcripts: the default is [Aleph Alpha's](https://www.aleph-alpha.com/luminous) Luminous Extended summarizer (an API key is necessary, and usage incurs cost)

M4A files, transcripts, and metadata with AI annotations/summaries are written to the ```output_dir``` folder, which defaults to ```youtube-scraper/output``` in the Google Drive.

## Tips for running this colab

- Activate the GPU in the colab environment (menu "Runtime"/"Change Runtime type") - this speeds up the Whisper conversion immensely
- Use a browser plugin like [Colab Auto Clicker](https://addons.mozilla.org/en-US/firefox/addon/colab-automatic-clicker/) for Firefox to hold the connection to the Notebook while it's doing the work, and leave the browser tab open
- Get an API key for Aleph Alpha or GPT3 - and calculate the cost before running the AI summary cells. 

----
There is a Changelog and Todo/Ideas list at the end of this Notebook. 

## Target channel/playlist

Put the channel/playlist to scrape here. If necessary, change the target directory as well.

In [None]:
channel_url = "https://www.youtube.com/@audiopilz"
output_dir = "/content/gdrive/MyDrive/youtube-scraper/output"

In [None]:
# Install the YouTube downloader module yt-dlp, a fork of the seemingly abandoned youtube-dl
!pip install yt-dlp

In [None]:
# Connect to Google Drive to export data
import os
from google.colab import drive
drive.mount('/content/gdrive')

# Create the output directory, including any missing parent folders
os.makedirs(output_dir, exist_ok=True)

os.chdir(output_dir)

As yt-dlp sometimes fails with a 403 error ("Forbidden"), it is better to generate a list of all files to download first. Get the metadata - views, upload date, etc. - as well, and save it as an XLSX table file.

In [None]:
from __future__ import unicode_literals
import yt_dlp
import pandas as pd

# Options for downloading metadata only.
# The Python API expects underscore keys and real booleans, not CLI-style strings.
ydl_opts = { 
    'quiet': True,
    'skip_download': True
    }

# Get the metadata
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    metadata = ydl.extract_info(channel_url, download=False) 

# This is very much "dictionary of dictionaries of dictionaries" style. 
# Found out how to unwrap by pure experimentation.  
videos_df = pd.DataFrame(metadata['entries'], 
                         columns=["id","upload_date","description",
                                  "duration","view_count","comment_count",
                                  "like_count","average_rating",
                                  "availability","age_limit","categories","tags"])

# Sort list by upload date in ascending order and save.
# sort_values() returns a new DataFrame, so the result has to be assigned.
videos_df = videos_df.sort_values("upload_date")
videos_df.to_excel("video_list.xlsx")
print(len(videos_df)," video IDs found in playlist/channel.")
videos_df.head(5)
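
The "dictionary of dictionaries" structure mentioned above can be unwrapped explicitly. The following is a minimal sketch, not the notebook's own method: it assumes that a channel URL may return nested playlists (for example one per channel tab) and that each video entry may carry an `availability` field. The helper names `iter_videos` and `public_only` are hypothetical, and the sample data is made up.

```python
def iter_videos(info):
    """Yield video entries from a (possibly nested) yt-dlp info dictionary."""
    for entry in info.get("entries") or []:
        if entry is None:
            # Entries that could not be extracted may show up as None.
            continue
        if entry.get("entries") is not None:
            # Nested playlist (assumption: e.g. a channel tab) - descend into it.
            yield from iter_videos(entry)
        else:
            yield entry


def public_only(entries):
    """Keep entries marked as public; entries without the field are kept too (assumption)."""
    return [e for e in entries if e.get("availability") in (None, "public")]


# Made-up sample data that mimics the nested structure
sample = {
    "entries": [
        {"title": "Videos", "entries": [
            {"id": "aaa", "availability": "public"},
            {"id": "bbb", "availability": "private"},
        ]},
        {"id": "ccc", "availability": "public"},
        None,
    ]
}

videos = public_only(iter_videos(sample))
print([v["id"] for v in videos])
```

In the notebook, the resulting list could be passed to `pd.DataFrame(...)` in place of `metadata['entries']`.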

## Download Videos as Audio Files

Yes, it's possible to download Audio only - which is of course much, much faster than getting the entire video file. 

The download may still take so long that your browser closes the connection to the Colab VM.

Another problem: Sometimes, the yt-dlp call returns an error, e.g. a "403 Forbidden" from YouTube. If that happens, the bulk download function calls itself again, continuing with the videos not downloaded yet. Python aborts once its recursion limit is reached (1000 nested calls by default).
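
If the recursion limit is a concern, the retry can also be written as a bounded loop. This is only a sketch under assumptions: `download_with_retries`, `is_done` and `download` are hypothetical names, and the downloader here is simulated so that the example runs without network access.

```python
def download_with_retries(video_ids, is_done, download, max_attempts=5):
    """Retry until nothing is pending or max_attempts is reached.

    Returns the list of IDs that are still missing afterwards.
    """
    pending = [v for v in video_ids if not is_done(v)]
    for attempt in range(1, max_attempts + 1):
        if not pending:
            break
        print(f"Attempt {attempt}: {len(pending)} videos left to download...")
        try:
            download(pending)
        except Exception as err:
            print("Download interrupted:", err)
        # Re-check which files exist now, whether or not the call failed.
        pending = [v for v in pending if not is_done(v)]
    return pending


# Simulated downloader: fetches one file per call, then fails like a 403.
done = set()
calls = {"n": 0}

def fake_download(ids):
    calls["n"] += 1
    done.add(ids[0])
    if len(ids) > 1:
        raise RuntimeError("HTTP Error 403: Forbidden")

missing = download_with_retries(["a", "b", "c"], lambda v: v in done, fake_download)
print("Still missing:", missing)
```

In the notebook, `is_done` could check for the `.m4a` file in `output_dir`, and `download` could wrap the `ydl.download(...)` call.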

In [None]:
# Define a download function getting all URLs in a list as .m4a audio (format 140).
# This is much faster than downloading video and converting to MP3. 

def download_m4a(videos_list):
    # Prefer format 140 (M4A audio); fall back to the best available audio.
    # Download errors must raise so that the retry below is triggered.
    ydl_opts = {'format': '140/bestaudio',
                'outtmpl': '%(id)s.%(ext)s'
    }

    # Collect all videos whose audio file does not exist yet.
    new_urls = []
    for video_id in videos_list:
        f = os.path.join(output_dir, video_id + ".m4a")
        if not os.path.exists(f):
            new_urls.append(video_id)

    if len(new_urls) > 0:
        print(len(new_urls)," videos left to download...")
        # Give that list to the downloader. 
        # Sometimes, the download fails. Try again then. 
        try: 
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                ydl.download(new_urls)
        except Exception:
            # Retry by recursion; files already on disk are skipped.
            print("Retrying...")
            return download_m4a(videos_list)

    print("Downloaded audio of all available videos.")
    return True

# Do it, man!
download_m4a(videos_df["id"])

All audio files are now .M4A files in the output directory, named (id).m4a. Send them to the speech-to-text converter now, using [OpenAI's Whisper model/library](https://github.com/openai/whisper), which can be run locally.

Whisper detects the language from the first 30 seconds of the audio's spectrogram before transcribing it. I guess it doesn't get along too well with mixed-language files, but the transcriptions are fairly good.

In [None]:
!pip install git+https://github.com/openai/whisper.git 

Did installing the Whisper library work? If yes, the conversion is fairly simple: Just call whisper on the audio files. 

This notebook uses the medium-sized multilingual model (which needs about 5 GB of VRAM); for better accuracy, switch to "large" (about 10 GB), for faster transcription, use "small" (about 2 GB).

Remember to switch on the GPU in Colab, or conversion will be really, really slow. **But even with the GPU enabled, the conversion takes some time** - approx. one minute for every five minutes of video with the medium model - so be patient! If you lose the connection to the Colab VM, reconnect and rerun the cell - it will continue with only those audio files it has not converted yet.
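
To get a feeling for the waiting time, the rule of thumb above (one minute of processing per five minutes of audio) can be turned into a quick estimate. This is a rough sketch: the function name and the sample durations are made up, and the ratio will vary with the GPU and model.

```python
def estimate_transcription_minutes(durations_seconds, audio_minutes_per_minute=5.0):
    """Estimate processing time from video durations given in seconds."""
    total_audio_minutes = sum(durations_seconds) / 60
    return total_audio_minutes / audio_minutes_per_minute


# Made-up durations: three videos of 10, 25 and 15 minutes
durations = [600, 1500, 900]
print(round(estimate_transcription_minutes(durations), 1), "minutes (rough estimate)")
```

In the notebook, `videos_df["duration"]` holds the durations in seconds and could be passed in directly.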

One thing that Whisper does not do for you: insert paragraphs, line breaks, indentations, emphases. Anything that makes the text block more readable is missing. Sorry. 

In [None]:
import os
import whisper
import pandas as pd
model = whisper.load_model("medium")

# Get the index file just in case. 
videos_df = pd.read_excel("video_list.xlsx",index_col=0)

# List of all videos for which there is no transcript yet
new_urls = []
for video_id in videos_df["id"]:
    f = output_dir + "/" + video_id + "_transcript.txt"
    if not os.path.exists(f):
        new_urls.append(video_id)
print(len(new_urls)," M4A files to transcribe.")

# M4A files have to exist - if there is an ID in the index but it has not been 
# downloaded, the run will fail. Run the audio acquisition cells again. 

for i, video_id in enumerate(new_urls, start=1):
    m4a_fname = output_dir + "/" + video_id + ".m4a"
    txt_fname = output_dir + "/" + video_id + "_transcript.txt"

    result = model.transcribe(m4a_fname)
    # Write transcription to a text file
    with open(txt_fname, 'w', encoding='utf-8') as f:
        f.write(result["text"])
    print(i," - ",txt_fname," generated")

print("Done - ",len(new_urls)," files converted.")

## AI-powered Summary

Use an AI Large Language Model (LLM) as a summarizer, and for keyword extraction. 

AI LLMs are capable of doing a semantic summary. We are using a service that creates a bullet-point list for every text, reducing the text amount by about two-thirds, and making it more easily scannable.

Using the LLM costs about a tenth of a cent per text. You need a prepaid account with the provider.
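
A back-of-the-envelope calculation before running the summary cells might look like this. The per-text price is just the rough figure quoted above, not an official rate, and the function name is hypothetical.

```python
def estimate_summary_cost_cents(n_texts, cents_per_text=0.1):
    """Rough cost estimate in cents; 0.1 is 'a tenth of a cent' per text."""
    return n_texts * cents_per_text


# Example: a channel with 250 transcripts
print(round(estimate_summary_cost_cents(250), 2), "cents (rough estimate)")
```

Check the provider's current price list for actual numbers, as long transcripts may cost more than short ones.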

### Using the Luminous Extended LLM by Aleph Alpha

Luminous Extended, by the German startup [Aleph Alpha](https://www.aleph-alpha.com/luminous), is an LLM comparable to OpenAI's GPT-3 Curie model. It is not quite as powerful as the largest models available - no ChatGPT skills here - but it features a very nice summarizer. 

Luminous has a maximum prompt size of 2048 tokens, approximately 600-800 words. Its dedicated summarizer appears to work on chunks of roughly 400 words, condensing each chunk into a one-line bullet point. 

**You need an account with Aleph Alpha, and an access token.** Write the token into a text file called ```aleph_alpha_key.txt```, and put it into the MyDrive folder of your GDrive. 

### Keywords and paragraphs

Target: Split the text files into single paragraphs (which may then be used as chunks). Reduce the summaries even further to keywords. Working on it. 
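
One possible way to get chunks is to group whole sentences until a word limit is reached. This is a sketch under assumptions, not a finished solution: sentence boundaries are guessed with a simple regular expression, the 400-word default mirrors the chunk size mentioned above, and `split_into_chunks` is a hypothetical helper name.

```python
import re


def split_into_chunks(text, max_words=400):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six! Seven eight nine? Ten."
for chunk in split_into_chunks(demo, max_words=6):
    print(chunk)
```

The naive split will stumble over abbreviations and numbers like "3.5", so the chunk boundaries are only approximate.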

### ...and GPT-3? 

Can you use GPT-3 for all this? Of course you can. [Find a summarizer that uses GPT3-Davinci here](https://github.com/emlynoregan/newaiexp) - it has a "sliding window" approach.  

I ran my own experiments: giving the model a sample summary to learn from, and using the summary of the first chunk as the example for the next chunk, to provide some sort of context. 

I found that GPT-3 is more expensive but not necessarily better, although I may have used the wrong settings. This is still very much work in progress. 
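
The idea of feeding each chunk's summary into the next request can be sketched as a simple loop. This is an illustration only, not the code used in those experiments: `summarize_with_context` is a hypothetical helper, and `fake_summarize` is a stand-in for a real LLM call so that the example runs offline.

```python
def summarize_with_context(chunks, summarize):
    """Summarize chunks in order, passing the previous summary along as context.

    summarize(chunk, previous_summary) -> str is supplied by the caller.
    """
    summaries = []
    previous = ""
    for chunk in chunks:
        summary = summarize(chunk, previous)
        summaries.append(summary)
        previous = summary
    return summaries


def fake_summarize(chunk, previous_summary):
    # Stand-in for an LLM call: keeps the first three words of the chunk
    # and marks whether any context was passed in.
    prefix = "(cont.) " if previous_summary else ""
    return prefix + " ".join(chunk.split()[:3])


result = summarize_with_context(
    ["alpha beta gamma delta", "epsilon zeta eta theta"], fake_summarize
)
print(result)
```

With a real model, `summarize` would build a prompt that contains the previous summary as an example, followed by the new chunk.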

In [None]:
# Get the API library from Aleph Alpha
!pip install aleph_alpha_client

OK - now that the client library is installed, let's connect to the model and summarize the transcripts. 

**Running the summarizer some time after finishing the STT conversion:** The code may run "from scratch", i.e. without executing the cells above, with one exception: It expects the variable ```output_dir``` to be set. If the code below stops with an error saying that it does not know that variable, jump to the very top of this notebook, and run the first cell defining ```output_dir```.
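
Alternatively, a small guard at the start of the cell could fall back to the default folder when ```output_dir``` is not set. This is a sketch; `DEFAULT_OUTPUT_DIR` is a hypothetical name, and the path is the default from the top of this notebook.

```python
# Fall back to the notebook's default folder if output_dir has not been defined yet
DEFAULT_OUTPUT_DIR = "/content/gdrive/MyDrive/youtube-scraper/output"
output_dir = globals().get("output_dir", DEFAULT_OUTPUT_DIR)
print("Using output directory:", output_dir)
```

If you changed the target directory earlier, running the first cell remains the safer option, as the fallback only knows the default.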

In [None]:
# Reload the GDrive, and get files to summarize, so that you may use this cell
# without the previous ones. 

import hashlib
import os
import pandas as pd
from google.colab import drive

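# Helper function to read a text file (returns "" if it cannot be opened)
def gettext(fname):
    try: 
        with open(fname, 'r', encoding='utf-8') as textfile:
            text = textfile.read()
    except OSError:
        print("**File ",fname," not found!**")
        return ""
    # Collapse line breaks so that the token/transcript is a single line
    return text.replace("\n"," ").strip()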

drive.mount('/content/gdrive')
path = output_dir
os.chdir(path)

# Get Aleph Alpha access token first
aa_token = gettext('/content/gdrive/MyDrive/aleph_alpha_key.txt')

# Use the token to load the model
# Boilerplate code on https://github.com/Aleph-Alpha/examples/
from aleph_alpha_client import AlephAlphaModel, SummarizationRequest, EvaluationRequest, Document

model = AlephAlphaModel.from_model_name(model_name="luminous-extended", token = aa_token)

print("AlephAlpha Token (MD5) ", hashlib.md5(aa_token.encode('utf-8')).hexdigest()," works.")

# Function to summarize the transcript of one video. 
def generate_summary(id: str):
    text = gettext(path + "/" + id + "_transcript.txt")
    request = SummarizationRequest(document=Document.from_text(text))
    result = model.summarize(request)
    print(text[:60],"... condensed to ",len(result.summary)," chars")
    return result.summary

# Get the index file again
videos_df = pd.read_excel("video_list.xlsx",index_col=0)
# sort_values() returns a new DataFrame, so the result has to be assigned.
videos_df = videos_df.sort_values("upload_date",ascending=True)

# Summaries for every line of the index file
videos_df["summary"] = videos_df["id"].map(generate_summary)

# Save first, then show a preview as the cell output
videos_df.to_excel("video_liste_annotiert.xlsx")
videos_df.head(10)

### Changelog

* v05 - Changed notebook language to English; switched downloader from youtube-dl to the yt-dlp fork; added check whether the availability of videos is public
* v04 - Added the like_count variable to the overview
* v03 - Integrated summaries via Aleph Alpha and GPT-3; sorting in ascending order by date
* v02 - Catch download errors automatically (very simple: start the download again)
* v01 - Search for videos not downloaded yet; completion
* v00 - Works

### Todo

- Better format for summary
- Keywords, semantic similarity to focus summary