# Download YouTube Videos
**Author:** Scott Campit

This notebook adpted code from Yan Gobeil's [Medium article](https://towardsdatascience.com/making-an-image-dataset-from-youtube-videos-5116252d20a3) convert YouTube video clips into images.

The following modifications were made to ensure the code ran in Python 3.8+ with the latest dependency versions (which can be found in the `requirements.txt` file):

  1. Replaced Beautiful Soup with Selenium to get urls due to JavaScript file being pulled and not being parsed in bs4.
  2. Resolved several errors that popped up during the run. 
  3. Simplified syntax, arguments, and number of operations performed.

Some issues and/or extensions that would be appreciated:

  * Some videos do not contain relevant images - rather, they are either a lecture, are based on cartoons, or have really bad resolution. 
    * We need an automatic or semi-automatic way to tease apart what videos/images are relevant.
  * Extending the number of pages that are queried on YouTube. 

## Import Libraries
First, we need to install some important libraries.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from pytube import YouTube
import cv2
import os
import glob

## Setting up functions for downloading YouTube videos as images

### Get video URLs
Next, we'll define the first function, which searches a word or phrase on Youtube and returns urls for the videos.

In [3]:
def get_urls(query, site_to_query="https://www.youtube.com/", wait_duration=5, path_to_chrome_driver=r'C:/Users/Scott/Software/Browser_Utils/chromedriver.exe'):
    '''
    Search Youtube based on query and return a list of links to the videos resulting from the search. A maximum of results can be set (default 10).

    INPUTS
        * query: a string denoting the keyword or phrase to search on YouTube.
        * wait_duration: an integer denoting the amount of time to wait per click. This prevents us from getting into trouble by Google. The default value is 5s.
        * path_to_chrome_driver: a string denoting the path to your Chrome driver (needed to query on the Chrome browser using Selenium).

    OUTPUT
        * urls: a list of urls corresponding to the query.
    '''
    
    # Initialize on Chrome 
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(executable_path=path_to_chrome_driver, chrome_options=options)
    
    # Search on YouTube
    driver.get(site_to_query)
    WebDriverWait(driver, wait_duration).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#search"))).send_keys(query)
    driver.find_element_by_css_selector("button.style-scope.ytd-searchbox#search-icon-legacy").click()
    
    # Get URLs
    urls = [my_href.get_attribute("href") for my_href in WebDriverWait(driver, wait_duration).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.yt-simple-endpoint.style-scope.ytd-video-renderer#video-title")))]
    return urls

### Download images to local
Next, we'll download the videos to your local system using [pytube](https://python-pytube.readthedocs.io/en/latest/). Note that this code may temporarily occupy a lot of storage, as it saves the video on your disk before deleting. 

In [4]:
def download_video(url, path=None):
    """
    download_video downloads all YouTube videos from the list of urls using pytube.
    
    INPUTS:
        * url: a list of urls correspond to the query.
    """

    # if the url is bad or doesn't point to a YouTube video, the following code doesn't work.
    try:
        yt = YouTube(url)
        yt = yt.streams.filter(file_extension='mp4').first()
        out_file = yt.download(path)
        file_name = out_file.split("\\")[-1]
        print(f"Downloaded {file_name} correctly!")
    except Exception as exc:
        print(f"Download of {url} did not work because of {exc}...")

### Extract images from video
Finally, this function will extract images using [OpenCV](https://opencv.org/).

In [5]:
def extract_images_from_video(video, savepath=os.getcwd(), delay=30, name="file", max_images=20, is_silent=True):
    """
    extract_images_from_video turns a YouTube video (.mp4) into a series of images using the opencv project. 

    INPUTS:
        * video: a string denoting the path to an .mp4 file.
        * savepath: a string denoting where to save the images. Default is the current working directory.
        * delay: 
        * name: a string denoting the name of the file.
        * max_images: an integer denoting the number of images to generate per video
        * is_silent: a boolean denoting whether to print the status of image generation or not.
    """    
    vidcap = cv2.VideoCapture(video)
    count = 0
    num_images = 0
    label = 0
    success = True
    fps = int(vidcap.get(cv2.CAP_PROP_FPS))
    
    # The code fails if the video is too short w.r.t the delay and number of images. 
    # So I made this try/except statement to get around that.
    try:
        while success and num_images < max_images:
            success, image = vidcap.read()
            num_images += 1
            label += 1
            filename = name + "_" + str(label) + ".jpg"
            path = os.path.join(savepath, filename)
            cv2.imwrite(path, image)
            if cv2.imread(path) is None:
                os.remove(path)
            else:
                if not is_silent:
                    print(f'Image successfully written at {path}')
            count += delay*fps
            vidcap.set(1, count)
    except Exception as exc:
        print(f"Image separation stopped because of {exc}...")
        pass

# Putting it all together
Now let's aggregate all the functions we just developed into a single function to call for downloading images based on YouTube queries.


In [17]:
def extract_images_from_word(query, delete_video=True, image_delay=5, 
                             max_images=30, savepath=os.getcwd(), silent=True):
    """
    """
    if not os.path.exists(savepath):
        os.mkdir(savepath)
    
    # Get urls
    urls = get_urls(query, 
                    site_to_query="https://www.youtube.com/", 
                    wait_duration=5, 
                    path_to_chrome_driver=r'C:/Users/Scott/Software/Browser_Utils/chromedriver.exe')

    # Download videos using the URLs we just extracted
    for url in urls:
        download_video(url, savepath)
    
    # Find all .mp4s and create images from them.
    for i, video in enumerate(glob.glob("*.mp4")):
        extract_images_from_video(str(video), folder=savepath, delay=image_delay, name=query+'_'+str(i), max_images=max_images, is_silent=silent)
        
        # Save on memory when possible.
        if delete_video:
            os.remove(video)
    
    print("Finished extracting images from videos!")

## Extracting Thyroidectomy videos (automatically)
Now, let's download some Thyroidectomy videos from YouTube. Gross.

In [4]:
path_to_img = "D:/Data/InterOp/Thyroidectomy/"
extract_images_from_word(query="thyroidectomy", delete_video=True, image_delay=5, max_images=50, savepath=path_to_img, silent=False)

# Extracting Thyroidectomy videos (manually)
First, you would need to copy and paste urls into this list. The urls below correspond to the query "Thyroidectomy" specifically.

In [9]:
urls = [
    "https://www.youtube.com/watch?v=IiBg-fSNSxc",
    "https://www.youtube.com/watch?v=WJ2jS88EUmo",
    "https://www.youtube.com/watch?v=2tCajgpPcGo&has_verified=1",
    "https://www.youtube.com/watch?v=cRKcRu2ugoA&has_verified=1",
    "https://www.youtube.com/watch?v=biS97SAiNCA",
    "https://www.youtube.com/watch?v=EhQ-yXqB8no",
    "https://www.youtube.com/watch?v=23uZbHfnWnU",
    "https://www.youtube.com/watch?v=eGs_JNbH1Xs",
    "https://www.youtube.com/watch?v=QKNQe-oXFKQ",
    "https://www.youtube.com/watch?v=zLaaIYtSXnk",
    "https://www.youtube.com/watch?v=EkESrh4f5ao",
    "https://www.youtube.com/watch?v=mYzL383plFw",
    "https://www.youtube.com/watch?v=tEJagKxruw8"
]

Let's modify the function so that it takes in a list of urls to query.

In [20]:
def extract_images_from_url_manually(query, urls, delete_video=True, image_delay=5, 
                             max_images=30, savepath=os.getcwd(), silent=True):
    """
    """
    if not os.path.exists(savepath):
        os.mkdir(savepath)
    os.chdir(savepath)
    
    # Download videos using the URLs we just extracted
    for url in urls:
        download_video(url, savepath)
    
    # Find all .mp4s and create images from them.
    for i, video in enumerate(glob.glob("*.mp4")):
        extract_images_from_video(video, savepath=savepath, delay=image_delay, name=query+'_'+str(i), max_images=max_images, is_silent=silent)
        
        # Save on memory when possible.
        if delete_video:
            os.remove(video)
    
    print("Finished extracting images from videos!")

We'll grab them now.

In [21]:
path_to_img = "D:/Data/InterOp/Thyroidectomy/"
query = "thyroidectomy"
extract_images_from_url_manually(query, urls, delete_video=True, image_delay=5, 
                             max_images=50, savepath=path_to_img, silent=True)

Downloaded D:/Data/InterOp/Thyroidectomy/Total Thyroidectomy and Central Neck Dissection.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Thyroid lobectomy.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Thyroidectomy using HARMONIC FOCUS®+ Shears with Dr Pellitteri.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/THE MOUNT SINAI SURGICAL FILM ATLAS Transaxillary Thyroidectomy.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Total Thyroidectomy - Thyroid Cancer - Operative Surgery.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Left Thyroid Lobectomy.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Thyroidectomy Surgery - THUNDERBEAT - Olympus Surgical  Dr Sam Van Slycke.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Total Thyroidectomy For Substernal Goiter in a Previously Operated Patient.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/Minimally Invasive Thyroidectomy.mp4 correctly!
Downloaded D:/Data/InterOp/Thyroidectomy/