# YouTube video scraping

This notebook scrapes YouTube comments from the videos listed in `YoutubeLinks.tsv`, for example: https://www.youtube.com/watch?v=V79x7045Bp0

## Import Important Libraries

In [1]:
import time

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
import pandas as pd

## Define Empty Data Lists

In [2]:
comment_data = []
author_data = []

## Define the Driver Service and Timeouts

In [3]:
service = Service(executable_path=r'C:\Users\steam\OneDrive\Desktop\Coding Kuliah\Python\Text Mining\.env\chromedriver\chromedriver.exe')
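timeout = 10  # seconds WebDriverWait waits before raising TimeoutException
long_sleep_time = 10  # seconds to pause after each scroll while comments load
short_sleep_time = 3  # seconds to pause after clicks and short scrolls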
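
The hard-coded `chromedriver.exe` path above only works on one machine. A minimal sketch of one alternative reads the path at run time from an environment variable. The variable name `CHROMEDRIVER_PATH` and the helper `resolve_driver_path` are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_driver_path(env_var: str = "CHROMEDRIVER_PATH") -> Optional[Path]:
    """Return the driver path stored in `env_var`, or None if it is unset or not a file.

    `CHROMEDRIVER_PATH` is a hypothetical variable name chosen for this sketch.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return None
    path = Path(raw).expanduser()
    return path if path.is_file() else None


driver_path = resolve_driver_path()
# With a valid path, the service could then be built as in the cell above:
# service = Service(executable_path=str(driver_path))
```

Returning `None` instead of raising leaves it to the caller to decide whether to stop or fall back to another way of locating the driver.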

## Web Scraping

### Scrape from the links in YoutubeLinks.tsv

In [4]:
links = pd.read_csv('../Data/YoutubeLinks.tsv', sep='\t')
links

Unnamed: 0,creator,title,link
0,@gradehacker,Coursera Review: Our Experience and How it Works,https://www.youtube.com/watch?v=l5V2BaoYnWo&pp...
1,@RichardWalls,Coursera Review | My Thoughts After 5 Years an...,https://www.youtube.com/watch?v=V79x7045Bp0&pp...
2,@Daniel-Dann,Coursera Review (2024) - Is Coursera Worth it?...,https://www.youtube.com/watch?v=rnpzU7GBHlI&pp...
3,@gradehacker,Top 5 Online Learning Platforms 2024 | Review ...,https://www.youtube.com/watch?v=wY5n3uGZ6Js&pp...
4,@loistalagrand,Coursera Review (The Best E-learning Site?),https://www.youtube.com/watch?v=LdQBrMWAU_w&pp...
5,@MotasemHamdan,Palo Alto Networks Cybersecurity Professional ...,https://www.youtube.com/watch?v=Y6YNM-2P32Y&pp...
6,@JonGoodCyber,ONLY UNSPONSORED Review of the Google Cybersec...,https://www.youtube.com/watch?v=lZ6p_djgNWI&pp...
7,@vitaliylahno,Coursera Review: Why Is It the Best Online Lea...,https://www.youtube.com/watch?v=91w68nfT3Qw&pp...
8,@Khosomaty,Coursera Plus 2023 Review 7000+ Online Courses...,https://www.youtube.com/shorts/Lds9UVRlzlQ
9,@thesocialguide7659,Coursera Review - Best Platform for Courses?,https://www.youtube.com/watch?v=QXb9gNPLB4A&pp...
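
The table mixes regular `watch?v=` links (with extra query parameters, truncated in the display) and one `/shorts/` link. If a stable key per video is needed, a small helper along these lines could extract the video ID. `extract_video_id` is a hypothetical name, and the sketch assumes only the two URL shapes visible above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube watch or Shorts URL, or None if not found."""
    parsed = urlparse(url)
    # Regular videos carry the ID in the `v` query parameter.
    if parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    # Shorts carry the ID as the path segment after /shorts/.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "shorts":
        return parts[1]
    return None


print(extract_video_id("https://www.youtube.com/watch?v=V79x7045Bp0"))  # V79x7045Bp0
print(extract_video_id("https://www.youtube.com/shorts/Lds9UVRlzlQ"))   # Lds9UVRlzlQ
```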


### Scraping!

In [5]:
with Chrome(service=service) as driver:
    
    wait = WebDriverWait(driver, timeout)

    for i in range(len(links)):
        url = links['link'][i]
        driver.get(url)
    
        # Get initial scroll height
        previous_height = driver.execute_script("return document.documentElement.scrollHeight")

        while True:

            # Scroll to the bottom of page
            wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)

            # Wait for page to load using long sleep time
            time.sleep(long_sleep_time)

            # Get scroll height
            new_height = driver.execute_script("return document.documentElement.scrollHeight")

            # If the page doesn't move, stop the loop
            if (previous_height == new_height):
                break
            
            previous_height = new_height

        # Check if there is an expand comment button
        try:

            # Find the expand comment buttons
            for reply_button in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#more-replies-icon'))):

                driver.execute_script('arguments[0].scrollIntoView(true)', reply_button)

                # If the button is not hidden, click the button
                if (reply_button.is_displayed()):

                    driver.execute_script('arguments[0].click()', reply_button)

                # Give a short delay
                time.sleep(short_sleep_time)

        except TimeoutException:
            print(f'The video \"{links["title"][i]}\" does not have any reply, proceeding to next statement...')

        # Scroll back to the top of the page
        driver.execute_script("window.scrollTo(0, 0)")
        
        # Give a short delay
        time.sleep(short_sleep_time)

        # Scroll to the bottom of the page again so the loaded comments are rendered
        wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
        
        # Give a short delay
        time.sleep(short_sleep_time)

        # Check if there are any comments
        try:

            # Save each comment's text, escaping newlines so a comment stays on one TSV row
            for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content-text"))):
                comment_data.append(comment.text.replace('\n', '\\n'))
                print(comment.text)

            # Save each comment's author (comment writer)
            for author in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#author-text"))):
                author_data.append(author.text.replace('\n', '\\n'))

                # Drop comments written by the video's creator
                # (assumes the 'creator' column holds the channel handle, e.g. '@gradehacker')
                if (author.text.strip() == links['creator'][i]):
                    comment_data.pop(len(author_data) - 1)
                    author_data.pop(len(author_data) - 1)

                print(author.text)

        # Only a timeout means that no comment was found; other errors should surface
        except TimeoutException:
            print(f'The video \"{links["title"][i]}\" does not have any comment, proceeding to next video...')

Have you taken any course in Coursera? How was your experience with it?
I’m interested in doing tech courses
i completed Everyday Excel Part 1 and everything went well. now i'm having issues with Everyday Excel Part 2 because the 2nd Assignment it's very far from the content of videos of the week, is definetely beyond the scope of the videos and i'm stuck there from a while now. i'm considering to give up 
 @fabfitforever4934  There's plenty there for you to choose!
 @teclasicbaldi  Don't give up! It may be challenge that will pay off later. These are skills you can add to your resume :)
Not yet, there are many things I need to understand about this platform? How can we communicate
Idk about the certificates but I use their top university lessons to learn korean for free. I did one lesson today and I'm impressed.
Do they take the lessons on meet? or are they video explanations?
 @berryberrystrawberry2428  Video explanations but what I have used is very interractive and at the end of ea
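
The scroll loop in the scraping cell keeps scrolling until the page height stops changing, so it has no upper bound if a page keeps growing. A minimal sketch of the same idea with a round limit is shown here. The callables are injected so the logic can be tried without a browser; `scroll_until_stable` and `max_rounds` are hypothetical names, not part of the original notebook.

```python
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_down: Callable[[], None],
    pause: Callable[[], None],
    max_rounds: int = 50,
) -> int:
    """Scroll until the page height stops growing or `max_rounds` is reached.

    Returns the number of scroll rounds performed.
    """
    previous_height = get_height()
    for round_number in range(1, max_rounds + 1):
        scroll_down()
        pause()
        new_height = get_height()
        if new_height == previous_height:
            return round_number
        previous_height = new_height
    return max_rounds


# Stand-in "page" whose height grows twice and then stays constant.
heights = iter([100, 200, 300, 300])
rounds = scroll_until_stable(lambda: next(heights), lambda: None, lambda: None)
print(rounds)  # 3
```

With Selenium, the three callables could wrap the calls already used in the cell: the `execute_script` height query, `send_keys(Keys.END)` on the body element, and `time.sleep(long_sleep_time)`.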

## Save the data as a DataFrame

In [6]:
df = pd.DataFrame({'author': author_data, 'comment': comment_data})
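
`pd.DataFrame` raises a `ValueError` when the two lists differ in length, which could happen if one of the two element lookups fails for a video while the other succeeds. A minimal sketch of an explicit check before building the frame is shown here. `pair_rows` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Dict, List


def pair_rows(authors: List[str], comments: List[str]) -> List[Dict[str, str]]:
    """Pair each author with the comment at the same position.

    Raises ValueError when the lists differ in length, which would mean
    at least one comment can no longer be matched to its author.
    """
    if len(authors) != len(comments):
        raise ValueError(
            f"{len(authors)} authors but {len(comments)} comments; lists are misaligned"
        )
    return [{"author": a, "comment": c} for a, c in zip(authors, comments)]


rows = pair_rows(["@a", "@b"], ["first", "second"])
print(rows[1])  # {'author': '@b', 'comment': 'second'}
```

The returned list of dictionaries can be passed directly to `pd.DataFrame(rows)`.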

## Remove missing values immediately

In [7]:
df = df.replace('', pd.NA)
df = df.dropna()

## Save the data to TSV

Since this is text data, the comments are likely to contain many commas, so the file is saved as tab-separated values (TSV) rather than CSV.

In [8]:
df.to_csv('../Data/comment_data.tsv', index=False, sep='\t')
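
Because the scraper stores line breaks as the literal two characters `\n`, anyone reading the TSV back may want to restore them. A minimal sketch using only the standard library is shown here. `unescape_newlines` is a hypothetical helper, and the sample row is made up for illustration.

```python
import csv
import io


def unescape_newlines(text: str) -> str:
    """Turn the literal two-character sequence backslash-n back into a newline."""
    return text.replace("\\n", "\n")


# Write one row the same way the scraper stores it: newline escaped as '\\n'.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["author", "comment"])
writer.writerow(["@example", "first line\\nsecond line, with a comma"])

# Read it back and restore the original line break.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
restored = [unescape_newlines(row["comment"]) for row in reader]
print(restored[0])
```

One caveat of this scheme: a comment that genuinely contained a backslash followed by `n` cannot be told apart from an escaped line break after the round trip.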