# Project 9 - Scraping YouTube Comments With Selenium
---
The objective of this mini-project is to write a guide on how to scrape the comments from a YouTube video page. The focus is on automating the scroll-down process on JavaScript-based websites such as YouTube.

To scrape:
- Main comments by users and the channel owner (contained in the comment section). Reply comments are not intended to be scraped.
- The usernames associated with each comment.

In [15]:
import pandas as pd
import re
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

The webdriver required to perform the task changes versions constantly to keep up with browser updates, so it is necessary to download the latest, or at least a compatible, version of the driver. The webdriver for Chrome-based browsers such as Brave can be found [here](https://chromedriver.chromium.org/downloads).

To set up the webdriver:
1. Download, unzip and store the file in a desired location.
2. Create a symbolic link to the unzipped `chromedriver` file in `/usr/bin`.
    - `sudo ln -s source_directory/chromedriver /usr/bin`
3. Make sure the directory that holds the link is on `PATH` (`/usr/bin` usually is already). If it is not, add the directory, not the executable itself, in `.bashrc` and refresh it:
    - `export PATH="/usr/bin:$PATH"`
    - `source ~/.bashrc`
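
Before starting the browser, it can help to confirm that the driver is actually reachable from Python. The following is a minimal sketch that uses only the standard library; the helper name `locate_driver` is hypothetical and not part of Selenium:

```python
import shutil


def locate_driver(name="chromedriver"):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {path}")
```

If the result is `None`, the symbolic link or the `PATH` entry from the steps above is the first thing to check.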

Activating the webdriver:

In [16]:
website = "https://www.youtube.com/watch?v=LDlS-A0kF5w"

driver_path = "/usr/bin/chromedriver"

brave_path = "/usr/bin/brave-browser-stable"

option_1 = Options()

option_1.binary_location = brave_path

driver = webdriver.Chrome(executable_path=driver_path, options=option_1)

driver.get(website)

driver.maximize_window()
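
Newer Selenium 4 releases no longer accept the `executable_path` argument used above; the driver path is passed through a `Service` object instead. The sketch below assumes such a release is installed; the function name `build_driver` is hypothetical, and the paths are the same ones defined in the cell above:

```python
def build_driver(driver_path, browser_path):
    """Start a Chromium-based browser through Selenium 4's Service object."""
    # Imported inside the function so the sketch can be loaded
    # even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.binary_location = browser_path  # e.g. the Brave executable

    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


# Usage, with the same paths as in the cell above:
# driver = build_driver("/usr/bin/chromedriver", "/usr/bin/brave-browser-stable")
```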

If the cookies/monitoring window appears:
 - Wait 3 seconds for the main page to load.
 - Wait up to 5 seconds for the cookie window to appear.
 - Click the 'reject all' button to reject all cookies.
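
The "wait up to 5 seconds" step relies on polling: a condition is checked repeatedly until it succeeds or the time runs out. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation of `WebDriverWait`; the names `wait_until` and `button_appears` are hypothetical stand-ins.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for "the button shows up on the third look".
attempts = {"count": 0}


def button_appears():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


print(wait_until(button_appears, timeout=2.0, poll=0.01))  # button
```

A fixed `time.sleep(3)` always costs three seconds, whereas a polling wait returns as soon as the condition holds.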

In [17]:
try:
    time.sleep(3)
    reject_cookies_xpath = '//*[@id="content"]/div[2]/div[6]/div[1]/ytd-button-renderer[1]'

    reject_all_cookies_button =(
        WebDriverWait(driver, 5)
        .until(EC.presence_of_element_located((By.XPATH, reject_cookies_xpath)))
    )

    reject_all_cookies_button.click()
except Exception:
    # No cookie window appeared (or the button could not be clicked); carry on.
    pass

Next:
- Pause the YouTube video.
- Scroll 800 pixels down so that the beginning of the comment section loads (the exact number of pixels can vary; 800 does the job).
- Wait 3 seconds for the comments to load properly.

In [18]:
try:
    pause_video = driver.find_element_by_xpath('//*/button[@aria-label="Pause (k)"]')

    pause_video.click()
except Exception:
    # The pause button was not found or could not be clicked; carry on.
    pass

driver.execute_script("window.scrollTo(0, 800);")

time.sleep(3)

In contrast with the comment boxes of regular users, we are not able to retrieve the username of the channel owner from their comment box if we use `.text` on the element found by the XPath.

The first comment is from the channel owner; let's use `.text` to see what is retrieved.

In [19]:
channel_owner_name_1 = driver.find_element_by_xpath('.//*[@id="author-text"]').text

channel_owner_name_1

'Forthright Gambitia'

With `get_attribute("textContent")` we are able to retrieve the raw string.

In [20]:
channel_owner_name_2 = driver.find_element_by_xpath('.//*[@id="author-text"]').get_attribute("textContent")

channel_owner_name_2

'\n            \n              Forthright Gambitia\n            \n          '

We use regex to clean the last string.

In [21]:
re.sub(r'\n\s+', '', channel_owner_name_2)

'Forthright Gambitia'
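
The cleaning step can be wrapped in a small helper so that it is applied the same way everywhere. This is a minimal sketch; the function name `clean_username` is hypothetical, and the input is the raw string shown in the output above:

```python
import re


def clean_username(raw):
    """Remove the padding that `textContent` returns around a username."""
    # Drop every newline together with the whitespace that follows it,
    # then trim any spaces left at either end.
    return re.sub(r"\n\s+", "", raw).strip()


raw_name = "\n            \n              Forthright Gambitia\n            \n          "
print(repr(clean_username(raw_name)))  # 'Forthright Gambitia'
```

The raw-string prefix (`r"..."`) keeps `\s` from being read as a string escape, which recent Python versions warn about.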

Sometimes a comment is lengthy, and for the webpage to load the full comment we must click the 'Read more' button. With the `try` clause we always check whether such a button is available for a given comment.

In [22]:
data_0 = []

comment_boxes_xpath = '//*[(@id="main") and (@class="style-scope ytd-comment-renderer")]'

comment_boxes = (
    WebDriverWait(driver, 5)
    .until(EC.presence_of_all_elements_located((By.XPATH, comment_boxes_xpath)))
)

for box in comment_boxes[:3]:
    
    # Usernames.
    user = (
        box
        .find_element_by_xpath(".//*[@id='author-text']")
        .text
    )
    
    if user == '':
        user = (
            box
            .find_element_by_xpath(".//*[@id='author-text']")
            .get_attribute("textContent")
        )
        
        user = re.sub(r'\n\s+', '', user)
    
    
    # Comments.
    try:
        read_more_button = box.find_element_by_id("more")
        read_more_button.click()
    except Exception:
        # No clickable 'Read more' button for this comment.
        pass
    
    comment_list = box.find_elements_by_xpath(".//*[@id='content-text']")
    
    full_comment = ''
    
    for paragraph in comment_list:
        if len(full_comment) == 0:
            full_comment = paragraph.text
        else:
            full_comment += '\n' + paragraph.text
            
    value = user + " | " + full_comment        
    
    data_0.append(value)


We join the username and the comment in a single string with ' | ' so that we can easily avoid appending duplicates in case the algorithm tries to scrape the same comment more than once. Later, it is easy to split the string back into username and comment.
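
The join, de-duplicate and split-back idea can be sketched in plain Python as follows. The helper names and the sample values are made up for illustration; splitting at the first separator only keeps any ' | ' inside the comment itself intact.

```python
SEPARATOR = " | "


def add_unique(records, user, comment):
    """Append 'user | comment' to `records` unless it is already stored."""
    value = user + SEPARATOR + comment
    if value not in records:
        records.append(value)
    return records


def split_record(value):
    """Split a joined record back into (user, comment) at the first separator."""
    user, _, comment = value.partition(SEPARATOR)
    return user, comment


records = []
add_unique(records, "user_a", "first part | second part")
add_unique(records, "user_a", "first part | second part")  # duplicate, ignored
print(len(records))              # 1
print(split_record(records[0]))  # ('user_a', 'first part | second part')
```

One limitation of this scheme: a username that itself contains ' | ' would be split in the wrong place.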

In [23]:
for i in data_0[:10]:
    print(i, '\n')

Forthright Gambitia | I have to say, despite not being much of a monarchist, this event, like others of recent years, feels me with a certain sense of foreboding for the future. 


Lorem Ipsum | I'm not a monarchist though my parents certainly are, but as a public figure the Queen was the perfect example of what some would say 'old-fashioned' values that 'came with the job' and are slowly vanishing from society. Her sense of duty was something that has to be admired and deserves the highest respect. I think with her passing we will see she was the last link we had to a Britain of the past. People's values and behaviour have changed so much and I don't think any of it has been to anyone's benefit. We've descended into the "so, you think you're better than me?" society where so much means so little to so many.

One question, Vlad, if I may? You used the phrase 'our culture' and all citizens -- do you think there is a shared single culture, even shared values in Britain today? Was that a 

The function below encapsulates the tasks we've been working on: it takes a comment box as input and returns, as seen previously, a single string joining the username and its comment.

In [24]:
def get_info(comment_box):
    
    # Usernames:
    user = comment_box.find_element_by_xpath(".//*[@id='author-text']").text

    if user == '':
        user = comment_box.find_element_by_xpath(".//*[@id='author-text']").get_attribute("textContent")

        user = re.sub(r'\n\s+', '', user)


    # Comments
    try:
        read_more_button = comment_box.find_element_by_id("more")
        read_more_button.click()
    except Exception:
        # No clickable 'Read more' button for this comment.
        pass

    comment_list = comment_box.find_elements_by_xpath(".//*[@id='content-text']")

    full_comment = ''

    for paragraph in comment_list:
        if len(full_comment) == 0:
            full_comment = paragraph.text
        else:
            full_comment += '\n' + paragraph.text

    value = user + " | " + full_comment   
    
    return value
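
The `find_element_by_*` helpers used above were removed in later Selenium 4 releases in favour of `find_element(by, value)` and `find_elements(by, value)`. The sketch below shows how the same function might look with those calls. The name `get_info_v4` is hypothetical, and `FakeElement` is only a stand-in so the function can be tried without a browser.

```python
XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH
ID = "id"        # same string as By.ID


def get_info_v4(comment_box):
    """Return 'username | comment' for one comment box."""
    author = comment_box.find_element(XPATH, ".//*[@id='author-text']")
    # Fall back to textContent when .text is empty; strip() removes the padding.
    user = author.text or (author.get_attribute("textContent") or "").strip()

    # find_elements returns an empty list instead of raising when nothing matches.
    for button in comment_box.find_elements(ID, "more")[:1]:
        try:
            button.click()
        except Exception:  # e.g. the button exists but is hidden
            pass

    paragraphs = comment_box.find_elements(XPATH, ".//*[@id='content-text']")
    return user + " | " + "\n".join(p.text for p in paragraphs)


class FakeElement:
    """Hypothetical stand-in for a WebElement, for offline experiments only."""

    def __init__(self, text="", text_content="", children=None):
        self.text = text
        self._text_content = text_content
        self._children = children or {}

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None

    def find_elements(self, by, value):
        return self._children.get((by, value), [])

    def find_element(self, by, value):
        return self.find_elements(by, value)[0]

    def click(self):
        pass


box = FakeElement(children={
    (XPATH, ".//*[@id='author-text']"): [FakeElement(text_content="\n   user_a\n  ")],
    (XPATH, ".//*[@id='content-text']"): [FakeElement(text="line one"),
                                          FakeElement(text="line two")],
})
print(repr(get_info_v4(box)))  # 'user_a | line one\nline two'
```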

To retrieve all the comments we have to scroll down the webpage so that the comment boxes load. When scrolling down, content which is not visible in the window may not be loaded; therefore, the best way to scrape the desired information is a while loop that allows us to 'scrape as you scroll'. The logic is as follows:
- Set `scrolling = True` to activate the outer while loop.
- At the beginning of each iteration, scrape the comment boxes currently available, using waits.
- Store the current webpage height. This [height](https://developer.mozilla.org/en-US/docs/Web/API/Element/scrollHeight) gives the total number of vertical pixels of the webpage (when loading the YouTube page for the first time it is set at 2210 in our case).
- Inside the nested `while True` loop, scroll down once and store the new height:
     - if the new height equals the current height, we have reached the bottom of the webpage: `break` ends the nested loop and setting `scrolling = False` ends the outer loop.
     - otherwise, the new height becomes the current height, `break` ends the nested loop, and the outer loop starts over until the current and new heights are the same.

Notes: 
- after scrolling down continuously, the end of the webpage is eventually reached. After that, the command that scrolls down takes no action and returns neither a warning nor an error, so the new height stored equals the current height; as they coincide, the loop ends.


- some pages may keep all the comments loaded by previous scrolls; regardless, the 'scrape as you go' method allows us to scrape without knowing in advance whether the webpage omits previously loaded content.
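
The loop logic described above can be tried out without a browser. The sketch below assumes only that the driver exposes `execute_script`, as Selenium drivers do. The names `scroll_and_collect` and `FakePage` are hypothetical, and the heights are sample values.

```python
import time


def scroll_and_collect(driver, collect, pause=5.0):
    """Alternate between collecting visible items and scrolling to the bottom.

    `collect` is called once per pass; the loop ends when a scroll no longer
    increases the page height. Returns the number of passes made.
    """
    read_height = "return document.documentElement.scrollHeight"
    passes = 0
    while True:
        collect()
        passes += 1
        current_height = driver.execute_script(read_height)
        driver.execute_script(f"window.scrollTo(0, {current_height})")
        time.sleep(pause)  # give the page time to load more comments
        new_height = driver.execute_script(read_height)
        if new_height == current_height:
            return passes


class FakePage:
    """Stand-in for a driver: the page grows on each scroll until it runs out."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # Scrolling loads the next chunk, if there is one.
            self._index = min(self._index + 1, len(self._heights) - 1)
            return None
        return self._heights[self._index]


page = FakePage([2210, 4113, 7171])
seen = []
passes = scroll_and_collect(page, lambda: seen.append(len(seen)), pause=0)
print(passes)  # 3
```

A single loop is enough in this sketch because the height is read again after every scroll; the last pass is the one where scrolling changes nothing.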

The line shown below should display the current webpage height after having scrolled down 800 pixels, but it sometimes fails:

    driver.execute_script("return document.documentElement.scrollHeight")
    
Instead, we use this one:

In [25]:
driver.execute_script("return document.documentElement.scrollHeight")

4113

We also print `current_height` and `new_height` at each iteration to help understand the process.

In [26]:
data = []

scrolling = True

while scrolling:
    container_xpath = '//*[(@id="main") and (@class="style-scope ytd-comment-renderer")]'
    
    # The container that has all comments plus replies, likes, etc.
    comments_container = (
        WebDriverWait(driver, 5)
        .until(
            EC.presence_of_all_elements_located((By.XPATH, container_xpath))
        )
    )
    
    for comment_box in comments_container:
        # Only keep records that were not stored in a previous pass.
        value = get_info(comment_box)
        if value not in data:
            data.append(value)
     
    # Get the initial scroll height.
    current_height = driver.execute_script("return document.documentElement.scrollHeight")
    print(f'{current_height = }')
    
    while True:
        # Scroll down to bottom.
        driver.execute_script(f"window.scrollTo(0, {current_height})")
        
        # Wait to load page: very important to set a fair period of time! 3 seconds wasn't enough.
        time.sleep(5)
        
        # Calculate new scroll height and compare it with last scroll height.
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        print(f'{new_height = }')
        
        # Condition: if the new and last height are equal, it means that there isn't any 
        # new page to load, so we stop scrolling.
        if new_height == current_height: 
            scrolling = False
            break

        else:
            current_height = new_height
            break
                   

current_height = 4474
new_height = 7171
current_height = 7351
new_height = 10272
current_height = 11992
new_height = 14730
current_height = 15170
new_height = 17612
current_height = 17773
new_height = 17773


As the last two values above show, the current and new heights are the same, hence the loop stopped.

The scraping is done; we can close and quit the driver.

In [27]:
driver.close()

driver.quit()

Showing the results:
- Split the username from the comment and display the results in a DataFrame.
- If any empty usernames/comments appear, they can be dropped.

In [28]:
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

df = (
    pd.Series(data, name='username')
    .str.split(r' \| ', n=1, expand=True)
    .rename(columns={0: 'username', 1: 'comment'})
    .dropna()
    .reset_index(drop=True)
)

In [29]:
df

Unnamed: 0,username,comment
0,Forthright Gambitia,"I have to say, despite not being much of a monarchist, this event, like others of recent years, feels me with a certain sense of foreboding for the future."
1,Vlad Vexler Chat,"The Queen's death is an unsettling time for many in the Uk and all over the world. Hers is an extraordinary history. There is a special and moving way in which she understood that minimalism was a virtue in her role. I appreciate that an analytical video about our politics may not be right to watch for many at this time. So I just want to put that warning here. As well as mention that the BBC, a great institution that in many ways is vulnerable and at risk, is doing a fantastic job with coverage today."
2,Lorem Ipsum,"I'm not a monarchist though my parents certainly are, but as a public figure the Queen was the perfect example of what some would say 'old-fashioned' values that 'came with the job' and are slowly vanishing from society. Her sense of duty was something that has to be admired and deserves the highest respect. I think with her passing we will see she was the last link we had to a Britain of the past. People's values and behaviour have changed so much and I don't think any of it has been to anyone's benefit. We've descended into the ""so, you think you're better than me?"" society where so much means so little to so many.\n\nOne question, Vlad, if I may? You used the phrase 'our culture' and all citizens -- do you think there is a shared single culture, even shared values in Britain today? Was that a generalisation? It's just not what I have found in 52 years, sadly."
3,Stephen Rose,"Many years ago, a friend of mine from Africa, told me of coming to the UK in 1946.\nHe had witnessed WW2 from a country untouched by the conflict. . Out of curiosity he went to a music hall, in South London. The theatre was full to capacity, the audience largely from the poorer classes. They were singing all the old songs and laughing along in good spirits. He had wondered how an unprepared democracy could resist a totalitarian state. In that moment he realised why they had prevailed . The sense of good hearted solidarity, struck him as a young communist and he made his home here."
4,W,"As a Swiss the significance of a monarch or the monarchy for a country or a people in whatever form completly escapes me. Even the concept of ""important national leaders"" seems strange to me. If you ask me who our head of state is, I always have to think for a second or two which one of the seven members of the federal council it is. Because they take turn every year. Swiss politics is boring as hell in a good way. We almost have no really disruptive power changes, no single ruling party. Also no significant losses of certain national figures, that could shake the whole country."
5,Jennifer,"It’s strange, all these things seem to be at the sub conscious level. It’s telling when you rarely pay much attention and aren’t a royalist. But on hearing the news you feel strangely sad and unsettled. Which is why I like your conversations so much- they bring things to the fore."
6,KernowPolski,"Well put Vlad. The Queen for me was the perfect symbol of national unity beyond politics and the right sense of duty and political silence that is the requirement of a constitutional monarch. My proudest inheritance from my father are two plaques: his award of the Polish Virtuti Militari medal his bravery in the War and the post-war commission as a Royal Air Force officer he received from the Queen's father King George VI.\nWith her passing I feel the huge gap of that departing generation, but I hope her loss will renew interest in a sense of duty to our civilisation which she so embodied, a move away from narcissism and a return to working for the common good. Our national anthem will or course change which will be hard to get used to.\nNothing is immortal, we have to keep renewing the good things!"
7,Duck,"one of the most frustrating parts of hyper identity politics is that any criticism of it from other leftists comes off, to the more identity-ey people, like dismissing group-specific problems. Perhaps I just don’t know the lingo to talk about it yet, but we can talk about issues that affect Black people more without this wall of identity around it. we should do things that help the Black community because they’re our neighbors and that’s what we do. and we can recognize the background. but idk— identity is important but there’s too many barriers all around us now. it’s like a maze with no corridors."
8,CS79N,"Thanks Vlad. I'm sorry to see the Queen go - I'm neither an abolitionist or monarchist, I feel that the monarchy while imperfect (cloaked as it is in religiosity and so on) is as good a system as any. I think/hope we in Britain can cohere in a time of need and it would be useful for us to follow the Queen's example in a crisis; to be clearheaded and rational as opposed to being led by emotion.\n\nOn your wider point I think I see more and more people with a childlike level of political understanding. You don't have to scroll far through YouTube comments or the ""Have Your Say"" section on the BBC to see hackneyed or cliched political discourse. You know the sort of thing - the Tories are only interested in helping their ""rich CEO friends"" or Labour want to draw and quarter every bourgeois they can lay hands upon and hang their steaming entrails on the tower of London. Anyone in political life is presented as a comic book villain and there's little or no understanding that the world isn't really like that.\n\nI think we need an adult to gently explain to people that, believe it or not, the vast majority of people in public life are there because they genuinely want to contribute towards the advancement of our society such that the maximum number are able to ascend the maximum height up their hierarchy of needs, and that the disagreements in politics are bona fide, between individuals who are political adversaries but public service colleagues, about how best we procure and marshal resources to do so.\n\nA good friend of mine remarked that he couldn't believe he saw Boris Johnson and Keir Starmer talking good naturedly in Parliament and that they should hate each other, and the fact that they did was evidence of some corruption or malpractice, and I think my friend might unfortunately be characteristic of the rest of the country. This isn't Game of Thrones, this is real life. 
People need to understand that - as today has proven - rich or poor, urchin or Queen, we all die in the end, and that until that happens we all need somewhere to sleep, food to eat, and our goals in life aren't that dissimilar. Maybe then we could return to some common ground."
9,Aaron,"Second! It's hard to imagine a world without the Queen, my condolences to the British people."


\[End of Project\]
