
# YouTube - Scraper

In this Jupyter notebook we simulate a browser to scrape data such as comments, likes, dislikes and other information from YouTube videos. 
Initially, this didn't work at all because of Google's anti-scraping measures. After further investigation we managed to bypass them by sending fake browser information and by carefully navigating and scrolling in the simulated browser, so that our scraper behaves like a real person. The user agent, and with it the reported operating system, is selected randomly, so YouTube can't detect the repetitive access patterns that naive web scrapers normally display. 



In [1]:
from pyppeteer import launch
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import time
import asyncio

# for extended documentation visit --> https://miyakogi.github.io/pyppeteer/
# !!! scrape is a coroutine, so it must be called with await !!!
async def scrape(url_: str, selector_: str, page_function_ = "(element) => element.outerHTML",
                 bypass_google_anti_scrape_algorithm_ = False, log_ = True):
    if log_ : print("-------------------------Scrape Log Begin--------------------------", "\n")
    # create a random user agent so YouTube's algorithm gets bypassed
    ua = UserAgent()
    agent = ua.random
    
    # create browser, incognito context and page
    browser = await launch(options={"ping_interval": None}, ping_interval=None)
    context = await browser.createIncognitoBrowserContext()
    page = await context.newPage()

    await page.bringToFront() # switch to current page (tab switch) --> just for safety
    
    if log_ : print("Browser, Incognito Context and Page created")
    
    request_result = ""
    
    try:
        # set user agent
        await page.setUserAgent(agent)
        if log_ : print("User Agent:", agent)

        # open url
        await page.goto(url_)
        if log_ : print("Url opened:", url_)
        if log_ : print("Target:", selector_)

        if bypass_google_anti_scrape_algorithm_:
            if log_ : print("Info: Start bypassing scrape algorithm")
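            # wait until the title exists before clicking it; running both
            # concurrently could click before the element is present
            await page.waitForSelector("h1.title")
            await page.click("h1.title")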

            # scroll to the page end several times to load more comments
            for i in range(12):
                # asyncio.sleep pauses without blocking the event loop (time.sleep would stall it)
                await asyncio.sleep(4)
                await page.keyboard.press("End", delay=20)
                if log_ : print("Scraper Info: scrolled to page end to load more comments. Iteration:", i + 1)
                    
            await asyncio.sleep(2)

        # wait for the target element first, then click it
        await page.waitForSelector(selector_)
        await page.click(selector_)
        if log_ : print("Selector loaded:", selector_)

        # get element from query selector and relating function
        request_result = await page.querySelectorEval(selector_, page_function_)
        if log_ : print("Request finished")
    except Exception:
        # re-raise unchanged so the original exception type and traceback are kept
        raise
    finally:
        await page.close()
        if log_ : print("Page closed")
        await context.close()
        if log_ : print("Incognito context closed")
        # close browser
        await browser.disconnect()
        await browser.close()
        if log_ : print("Browser closed", "\n")
        if log_ : print("-------------------------Scrape Log End----------------------------", "\n")
    
    return request_result

## Let's test the "scrape" method: ##

As you can see, our `scrape` method can be used to download any type of content from any webpage. 

It has the following parameters:
- (necessary) - `url_`: The URL of the page to open, here the URL of a YouTube video.
- (necessary) - `selector_`: A CSS selector like "h1.title", which would match `<h1 class="title"></h1>`.
- (optional) - `page_function_`: A JavaScript arrow function applied to the first element that matches the selector. For example: `(element) => element.firstChild.innerHTML`. By default it returns the element's `outerHTML`.
- (optional) - `bypass_google_anti_scrape_algorithm_`: A boolean flag that controls whether the scraper clicks the title and scrolls to the page end several times before reading the selector. Defaults to `False`.
- (optional) - `log_`: This flag enables logging. Defaults to `True`.


In [2]:
# get YouTube Video Title

url = "https://www.youtube.com/watch?v=dyN_WtjdfpA&list=PLhTjy8cBISEoOtB5_nwykvB9wfEDscuEo"
query_selector = "h1.title"
function = "(element) => element.firstChild.innerHTML"

#title = await scrape(url, query_selector, function)                    
#print(title)

In [3]:
# get comments and their authors as html

url = "https://www.youtube.com/watch?v=dyN_WtjdfpA&list=PLhTjy8cBISEoOtB5_nwykvB9wfEDscuEo"
query_selector = "ytd-comments"
function = "(element) => element.outerHTML"

#html = await scrape(url, query_selector, function, True)

## Let's play around with BeautifulSoup for html parsing: ##

Since our scraper returns HTML code, we need to extract the relevant data from it.
The following method extracts the comments, their authors and their like counts from the raw HTML of the comment section.

In [4]:
# parse the comment-section html into (author, comment, likes) tuples

def _parse_comments_with_corresponding_authors(html_, log_ = True):
    soup = BeautifulSoup(html_, features="html.parser")

    # get authors of comments and clear html data
    authors = [item.text.strip() for item in soup.select("a[id=author-text] > span")]

    # get comments and clear html data
    comments = [
        item.text.strip().replace("\r\n", " ").replace("\n", " ").replace("\"", "'") 
        for item in soup.select("yt-formatted-string[id=content-text]")
    ]
    #print(comments)

    likes = [
        item.text.strip().replace("\r\n", " ").replace("\n", " ").replace(".","").replace(",","")
        for item in soup.select("span[id=vote-count-middle]")
    ]
    
    #<span id="vote-count-middle" class="style-scope ytd-comment-action-buttons-renderer" aria-label="2&nbsp;&quot;Mag ich&quot;-Bewertungen">
    #print(likes)
    comments_with_authors_and_likes = list(zip(authors, comments, likes))

    
    if log_:
        print("Finished parsing")
        #for author, comment, likes in comments_with_authors_and_likes:
        #    print(author, "wrote:\n -" + comment + " with "+likes+" likes")
    
    return comments_with_authors_and_likes

# Let's try it:
#_parse_comments_with_corresponding_authors(html, False)

## Let's build parser methods: ##

The following method defines an algorithm which scrapes video metadata such as 
the title, likes, dislikes, date and hashtags. <br>

Internally, it calls the `scrape()` method to get the raw HTML code from the given url.

In [5]:
# scrape and parse video metadata

# returns metadata as dict
# function is asynchronous and therefore it has to be awaited
async def _scrape_and_parse_video_meta_data(url: str, log_ = True):
    if "youtube.com" in url:
        
        max_trials = 5
        
        # trials runs from 1 to max_trials - 1, so at most four attempts are made
        for trials in range(1, max_trials):
            try:
                html = await scrape(
                    url, 
                    "div#info-contents",
                    "(element) => element.outerHTML", 
                    bypass_google_anti_scrape_algorithm_ = (trials > 2),
                    log_ = log_
                )
                break  # success: stop retrying
            except Exception as e:
                print('WARNING! : Metadata Scraping trial',trials,'failed for url="',url,'"!')
                print('ERROR:\n',e)
                print('Trying again...')
        else:
            # the loop finished without a single successful scrape
            raise Exception("Meta-Data scraping trials all failed!! :(")
       
        soup = BeautifulSoup(html, features="html.parser")

        title = soup.find("h1", {"class": "title"}).find("yt-formatted-string").text
        primary_info = soup.find_all("yt-formatted-string", {"class": "ytd-video-primary-info-renderer"})
        
        date = (primary_info[len(primary_info) - 1].text)
    
        hashtags = [tag.text.strip() for tag in primary_info[0].find_all("a") if tag is not None]
        
        likes = soup.select("yt-formatted-string[id=text]")[0].text.replace(".", "").replace(",","").replace("\xa0Mio","0"*6)
        dislikes = soup.select("yt-formatted-string[id=text]")[1].text.replace(".", "").replace(",","").replace("\xa0Mio","0"*6)
        
        return {"title": title, "date": date, "hashtags": hashtags, "likes": likes, "dislikes": dislikes}
    else:
        print("Wrong url format given!")

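A note on the like and dislike counts: the replacements above assume that a suffix like `\xa0Mio` always follows a whole number. If YouTube renders a value such as `1,2\xa0Mio.`, stripping the comma first would yield 12000000 instead of 1200000. Below is a minimal sketch of a more careful conversion. The helper name `parse_count` is hypothetical, and the German number format (`.` as thousands separator, `,` as decimal separator in front of `Mio`) is an assumption based on the strings handled above, not something the notebook confirms.

```python
import re


def parse_count(text: str) -> int:
    """Convert a count such as '1.234' or '1,2\xa0Mio.' into an integer.

    Hypothetical helper; the accepted formats are an assumption.
    """
    cleaned = text.replace("\xa0", " ").strip()
    match = re.fullmatch(r"([\d.,]+)\s*(Mio)?\.?", cleaned)
    if match is None:
        raise ValueError(f"unrecognised count: {text!r}")
    number, suffix = match.groups()
    if suffix:
        # With a suffix, ',' is read as the decimal separator: '1,2' -> 1.2
        value = float(number.replace(".", "").replace(",", "."))
        return round(value * 1_000_000)
    # Without a suffix, '.' and ',' are read as thousands separators.
    return int(number.replace(".", "").replace(",", ""))


plain = parse_count("1.234")
abbreviated = parse_count("1,2\xa0Mio.")
whole_millions = parse_count("3\xa0Mio")
print(plain, abbreviated, whole_millions)
```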
Next we need to gather the comments for a specific video! <br>
The following method does exactly that.
It takes a video url, runs the `scrape()` method internally and then parses the result with the comment parser defined above.

In [6]:
# scrape and parse comments with authors

# returns list of tuples [(Author, Comment, Likes), (...), ...]
# function is asynchronous and therefore it has to be awaited
async def _scrape_and_parse_youtube_comments(url: str, log_ = True):
    
    if "youtube.com" in url:
        trials = 1
        max_trials = 5
        
        while trials < max_trials:
            try:
                html = await scrape(url, "ytd-comments", "(element) => element.outerHTML", bypass_google_anti_scrape_algorithm_ = True, log_ = log_)
        
                return _parse_comments_with_corresponding_authors(html, log_ = log_)
        
            except Exception as e:
                print('WARNING! : Comment scraping trial',trials,'failed for url="',url,'"!')
                print('ERROR:\n',e)
                print('Trying again...')
                trials = trials + 1

        if trials == max_trials : 
            raise Exception("Comment scraping trials all failed!! :(")
    else:
        print("Wrong url format given!")
            

# Let's test it:      
#await _scrape_and_parse_youtube_comments("https://www.youtube.com/watch?v=dyN_WtjdfpA&list=PLhTjy8cBISEoOtB5_nwykvB9wfEDscuEo", log_ = False)

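Both scrape-and-parse methods above repeat the same retry loop. As a rough sketch, that loop could be factored into one helper. The name `retry_async` and the stand-in coroutine `flaky` are hypothetical; `flaky` only imitates a `scrape()` call that fails twice before it succeeds. The attempt number is passed to the callback so that a caller could, for example, switch on the bypass flag from the third trial onwards. Inside a notebook the helper would be called with `await retry_async(...)`; `asyncio.run` is used here only so that the sketch also runs as a plain script.

```python
import asyncio


async def retry_async(make_attempt, max_attempts=4, log=print):
    """Await make_attempt(attempt_number) until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_attempt(attempt)
        except Exception as error:  # broad on purpose, mirroring the notebook
            last_error = error
            log(f"WARNING: trial {attempt} failed: {error}")
    raise RuntimeError(f"all {max_attempts} trials failed") from last_error


# Hypothetical stand-in for scrape(): fails on the first two attempts.
async def flaky(attempt):
    if attempt < 3:
        raise ValueError("selector not found")
    return f"<html>attempt {attempt}</html>"


messages = []
result = asyncio.run(retry_async(flaky, log=messages.append))
print(result)
```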
Now we need to store the data into the database! <br>
This is done via the following two methods.
The first one handles storing video metadata and the second one
stores the video comments. <br>
As you can see, we use Cypher queries, which are sent to the Neo4j database. More information on Cypher: https://neo4j.com/docs/cypher-manual/current/

In [7]:
# Storage transactions:

def _store_video_metadata(tx, httpUrl_, metadata_): # "tx" is a neo4j transaction...
     
    merges = '\n'.join([
        "MERGE(t"+str(i)+':Tag{name:"'+str(t)+'"})\n' + 
        "MERGE(t"+str(i)+")-[:REFERENCES]->(v)" 
        for i, t in enumerate(metadata_['hashtags'])
    ])
    result = tx.run(
        "MERGE (v:Video {url: $url}) "
        "SET v = {title: $title, date: $date, likes: $likes, dislikes: $dislikes, url: $url}"
        "\n"+merges+"\n"
        "RETURN v.title + ', from node ' + id(v)", 
        title=metadata_["title"], 
        date=metadata_["date"], 
        likes=metadata_["likes"], 
        dislikes=metadata_["dislikes"],
        url=httpUrl_
    )
    print('Video Metadata', metadata_,' sent to database...')
    return result

def _store_video_comments(tx, httpUrl_, data):
    comments_with_authors_, ratio = data
    author_result = []
    comment_result = []
    for author, comment, likes in comments_with_authors_:
        author_result.append(tx.run(
            "MERGE (a:Author{name: $name})"
            "RETURN a.name + ', created as Author with id ' + id(a)", 
            name=author
        )) 
        comment_result.append(tx.run(
            "MATCH (v:Video), (a:Author) "
            "WHERE v.url = $url AND a.name = $name "
            "MERGE (a) - [r: COMMENTED {text: $comment_text}] -> (v) "
            "SET r = {text: $comment_text, likes: $likes, score: $score } "
            "RETURN v.title, type(r), r.text, a.name ;",
            url=httpUrl_,
            name=author,
            #relation="COMMENTED",
            comment_text=comment,
            likes=likes, 
            score=ratio*int(likes)
        ))
        
    print('Comments sent to database...')
    return zip(author_result, comment_result)

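In `_store_video_metadata` the hashtag `MERGE` statements are assembled by string concatenation, so a hashtag that contains a double quote would break the query. A possible alternative, sketched under the assumption that the `Video` node already exists, is to pass all hashtags as one list parameter and let Cypher's `UNWIND` clause iterate over it. The helper name `build_tag_query` is hypothetical; the sketch only builds the query and its parameters, which could then be sent with `tx.run(query, **params)`.

```python
def build_tag_query(url, hashtags):
    """Return (query, parameters) that link every hashtag to one video.

    Hypothetical helper: the hashtag text only travels as a parameter
    and never becomes part of the query string.
    """
    query = (
        "MATCH (v:Video {url: $url}) "
        "UNWIND $tags AS tag_name "
        "MERGE (t:Tag {name: tag_name}) "
        "MERGE (t)-[:REFERENCES]->(v)"
    )
    return query, {"url": url, "tags": list(hashtags)}


query, params = build_tag_query(
    "https://www.youtube.com/watch?v=dyN_WtjdfpA",
    ["#data", '#say "hi"'],  # the second tag would break a concatenated query
)
print(query)
print(params)
```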
## Using Neo4j for data storage : ##

Now that we have everything set up we will proceed to 
go through a list of video urls and scrape them for their data! :)<br>
All we need to do is call the previously defined methods for scraping and storing urls...

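One detail of the loop below is worth spelling out. Before the comments are stored, the loop computes `ratio` as `likes / (likes + dislikes) * 2 - 1`, which rescales the share of likes from [0, 1] to [-1, 1], and `_store_video_comments` then multiplies it by each comment's like count to get the `score`. For a video without any votes the division fails with a `ZeroDivisionError`, which the surrounding `try`/`except` catches, so the video lands in `error_urls`. The following sketch isolates the arithmetic. The function names are hypothetical, and returning 0.0 for a video without votes is an assumption, not the notebook's behaviour.

```python
def like_ratio(likes: int, dislikes: int) -> float:
    """Map the share of likes among all votes from [0, 1] to [-1, 1]."""
    total = likes + dislikes
    if total == 0:
        # Assumption: a video without any votes is treated as neutral.
        return 0.0
    return likes / total * 2 - 1


def comment_score(ratio: float, comment_likes: int) -> float:
    """Weight a comment's like count by the video-level ratio."""
    return ratio * comment_likes


ratio = like_ratio(3, 1)          # 3 of 4 votes are likes
score = comment_score(ratio, 10)  # a comment with 10 likes
print(ratio, score)
```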
In [8]:
from neo4j import GraphDatabase

uri, user, password = 'bolt://localhost:7687', 'neo4j', 'neo4j_'

In [9]:
import os

# recursively collect all files below the given path
def all_files_at(path):
    files = []
    for file in os.listdir(path):
        if os.path.isdir(os.path.join(path, file)): 
            files.extend(all_files_at(os.path.join(path, file)))
        else:
            files.append(os.path.join(path, file))
    return files
            
httpUrls = []

# read urls from the files and store them in a list, skipping blank lines
for file in all_files_at('data_sources'):
    with open(file) as openfileobject:
        for line in openfileobject:
            if line.strip():
                httpUrls.append(line.strip())

# let's take a look
print("First three urls:", httpUrls[:3], "...", "\n")

# removing duplicate videos if some exist ... and sort them
httpUrls = sorted(list(set(httpUrls))) 

import random
random.seed(66642999)
random.shuffle(httpUrls) # First we shuffle the urls so that the data is randomly stored!

driver = GraphDatabase.driver(uri, auth=(user, password))

# resetting database

with driver.session() as session:
    def _q(query) : return session.run(query)
    #---------------------------------------

    #_q("MATCH (n) DETACH DELETE n") # remove all graphs and nodes! BE CAREFUL!

    #---------------------------------------
# the driver stays open here because the scraping loop below reuses it

# activate logging for more details
log_process = True

counter = 0
error_urls = []

with driver.session() as session:
    for url in httpUrls :
        
        # |--------------------------------------------------------------------------------
        # | start
        # | Videos which are already in the database do not need to be scraped multiple times
        # | Therefore, we check if a video with that url already exists
        # |--------------------------------------------------------------------------------
        
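        video_exists = False
        
        # Reuse the outer session: opening another one under the same name would
        # leave `session` closed for the write transactions further down, and
        # closing the driver here would break every later iteration.
        #---------------------------------------
        # the url is passed as a query parameter instead of being concatenated in
        result = session.run("MATCH (v:Video {url: $url}) RETURN count(v)", url=url)
        for record in result:
            if record["count(v)"] >= 1:
                video_exists = True
        #---------------------------------------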
        
        if video_exists :
            print("Video with url", url, "already exists in database. Skipping ...", "(",counter, "of", len(httpUrls), "Videos scraped.)")
            counter = counter + 1
            continue
            
        # |--------------------------------------------------------------------------------
        # | end 
        # |--------------------------------------------------------------------------------
        
        # run await outside of transaction because asynchronous transactions for Neo4j are not yet available for Python
        print()
        print("|===========================================================================================================|")
        print("| STARTING --> SCRAPING, PARSING AND STORING VIDEO : " + url)
        print("|===========================================================================================================|")
        try:
            metadata = await _scrape_and_parse_video_meta_data(url, log_ = log_process)
            result = session.write_transaction(_store_video_metadata, url, metadata)
            print("Starting scraping and storing comments...")

            comments_with_authors = await _scrape_and_parse_youtube_comments(url, log_ = log_process)
            if len(comments_with_authors) == 0 :
                print("Video without comments found! This might be wrong!")
                print("Let's try again...")
                comments_with_authors = await _scrape_and_parse_youtube_comments(url, log_ = log_process)

            ratio = (int(metadata['likes']))/(int(metadata['likes'])+int(metadata['dislikes']))*2 - 1 

            data = (comments_with_authors, ratio)
            result = session.write_transaction(_store_video_comments, url, data)
            #print(result)
            
            counter = counter + 1
            print("Info:", counter, "of", len(httpUrls), "Videos scraped.")            
        except Exception as e:
            print('Failed loading url:', url)
            print("Error:", "\n", e)
            error_urls.append(url)
            print("Videos which failed to be scraped:", error_urls, "\n")
    
        print("|==================================================|")
        print("| FINISHED --> SCRAPING, PARSING AND STORING VIDEO |")  
        print("|==================================================|")
        print()
            
    print("Whole scraping finished. Results:")
    print("Scraped videos:", counter)
    print("Unscraped videos:", len(error_urls))
    print("The Following videos were not scraped:", error_urls)

driver.close()

First three urls: ['https://www.youtube.com/watch?v=dyN_WtjdfpA', 'https://www.youtube.com/watch?v=Ul0ZgDoamco', 'https://www.youtube.com/watch?v=lcgqP8g6i84'] ... 

Video with url https://www.youtube.com/watch?v=ASFSXNQKPDI already exists in database. Skipping ...
Video with url https://www.youtube.com/watch?v=HtYweBOCp7A already exists in database. Skipping ...
Video with url https://www.youtube.com/watch?v=1vWTJzJx0i4 already exists in database. Skipping ...
Video with url https://www.youtube.com/watch?v=L0H6xYwMQnk already exists in database. Skipping ...

| STARTING --> SCRAPING, PARSING AND STORING VIDEO : https://www.youtube.com/watch?v=Bd0cMmBvqWc
-------------------------Scrape Log Begin-------------------------- 

Browser, Incognito Context and Page created
User Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2117.157 Safari/537.36
Url opened: https://www.youtube.com/watch?v=Bd0cMmBvqWc
Target: div#info-contents
Selector loaded: div#info