# Theory

1. Install packages and libraries for `webdriver`.
2. Review the webpage's `HTML` structure and determine the path to obtain the data we want to scrape.
3. Start scraping the data using `Selenium` and `pandas`.
4. Output the processed data into a `CSV` file.

The notebook below runs on [Google Colab](https://colab.research.google.com/?utm_source=scs-index), which comes with many preinstalled libraries such as `pandas`. Colab also runs on Linux, so shell commands such as `apt-get` are available; these do not work the same way on Windows.

[Link to the Colab Notebook](https://colab.research.google.com/drive/1bsTGqjkcSL4J3TJJnDVxGwPUmsQwHFFQ?usp=sharing)

# Install Packages

In [1]:
# Install the package.
!pip install selenium

# Import the required libraries.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

import pandas as pd

# Install the Chromium web driver with apt.
!apt-get update
!apt install chromium-chromedriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Pass the options with the `options` keyword; `chrome_options` is deprecated in Selenium 4.
driver = webdriver.Chrome(options=chrome_options)

The web driver is a key component of Selenium. It is a browser automation interface: it accepts commands from your script, sends them to a browser, and returns the results, which lets the script interact with web applications.

Selenium supports multiple web browsers, and each browser has its own web driver. This notebook installs the Chromium driver with `apt` and starts it through Selenium. Alternatively, you can download the web driver for your specific browser and store it in a location that is easy to reference (for example, `C:\users\webdriver\chromedriver.exe`). You can find the web driver downloads for each browser on [this site](https://selenium-python.readthedocs.io/installation.html#:~:text=Selenium%20requires%20a-,driver,-to%20interface%20with).
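
If you take the local-driver route, a minimal sketch of starting Chrome from a downloaded driver could look like the following. It assumes Selenium 4, where the driver path is passed through a `Service` object. The function name `make_local_driver` is hypothetical, and the path in the example call is only the sample location mentioned above.

```python
def make_local_driver(driver_path):
    """Start headless Chrome using a web driver binary stored at driver_path."""
    # Imported inside the function so the sketch can be defined
    # even where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

    # Selenium 4 takes the driver location through a Service object.
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


# Example call (commented out because it needs Chrome and the driver on disk):
# driver = make_local_driver(r'C:\users\webdriver\chromedriver.exe')
```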

# Review Web Page's `HTML` Structure

Things to scrape:

| Item | Common Structure |
| :--- | :--- |
| Title | div (ptam-block-post-grid-text) > h6 (ptam-block-post-grid-title) > a (local-link) |
| Date | div (ptam-block-post-grid-text) > div (ptam-block-post-grid-byline) > time (ptam-block-post-grid-date) |
| Site URL | div (ptam-block-post-grid-image) > a (local-link) > "href" |
| Image URL (Optional) | div (ptam-block-post-grid-image) > a (local-link) > img (attachment-medium size-medium jetpack-lazy-image jetpack-lazy-image--handled lazyloaded) > "src" |
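
To see how these structure paths translate into XPath expressions, here is a small offline sketch. It uses the standard library's `xml.etree.ElementTree` instead of Selenium. The `SAMPLE` fragment is hand-written to mimic the structure in the table; it is an assumption, not a copy of the real page, and the `example.com` URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that follows the structure in the table above.
SAMPLE = """
<section>
  <div class="ptam-block-post-grid-image">
    <a class="local-link" href="https://example.com/sample-post/">
      <img class="attachment-medium" src="https://example.com/sample.jpg" />
    </a>
  </div>
  <div class="ptam-block-post-grid-text">
    <h6 class="ptam-block-post-grid-title">
      <a class="local-link" href="https://example.com/sample-post/">Sample Title</a>
    </h6>
    <div class="ptam-block-post-grid-byline">
      <time class="ptam-block-post-grid-date">JUNE 24, 2022</time>
    </div>
  </div>
</section>
"""

root = ET.fromstring(SAMPLE)

# Each ">" step in the table becomes a "/" step in the path.
title = root.find(
    ".//div[@class='ptam-block-post-grid-text']"
    "/h6[@class='ptam-block-post-grid-title']"
    "/a[@class='local-link']"
).text
date = root.find(
    ".//div[@class='ptam-block-post-grid-text']"
    "/div[@class='ptam-block-post-grid-byline']/time"
).text
site_url = root.find(".//div[@class='ptam-block-post-grid-image']/a").get('href')
image_url = root.find(".//div[@class='ptam-block-post-grid-image']/a/img").get('src')

print(title, date, site_url, image_url, sep='\n')
```

Note that `[@class='...']` compares the whole attribute string, so it only matches elements whose `class` attribute is exactly that value. For elements that carry several classes, such as the image, matching on the parent path is the simpler option.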

## Try Page 1

1. Point the scraper at the main page.

In [2]:
# Navigate to main page
driver.get('https://butchixanh.com/')
page = 1

2. Scrape the data on page 1.

In [3]:
# Get elements from HTML
raw_titles = driver.find_elements(By.XPATH, "(//div[@class='ptam-block-post-grid-text']/h6[@class='ptam-block-post-grid-title']/a[@class='local-link'])")

# Copy elements into list
movie_titles = []
for title in raw_titles: movie_titles.append(title.text)

print(movie_titles)

['Money Heist: Korea – Joint Economic Area', 'Cafe Minamdang', 'Item', 'Brilliant Legacy', 'Alchemy of Souls', 'The Witch Is Alive', 'It’s Beautiful Now', 'Miracle', 'Ultimate Weapon Alice', 'Anna', 'Why Her?', 'Doctor Lawyer', 'Yumi’s Cells 2', 'There is No Goo Pil-Soo', 'Jejungwon', 'Scandal in Old Seoul', 'Father, I’ll Take Care of You', 'Eve', 'Insider', 'Jinxed at First']


In [4]:
# Get date elements from HTML
raw_dates = driver.find_elements(By.CLASS_NAME, "ptam-block-post-grid-date")

# Copy elements into list
dates = []
for date in raw_dates: dates.append(date.text)

print(dates)

['JUNE 24, 2022', 'JUNE 27, 2022', 'JUNE 27, 2022', 'JUNE 27, 2022', 'JUNE 25, 2022', 'JUNE 25, 2022', 'JUNE 25, 2022', 'JUNE 25, 2022', 'JUNE 25, 2022', 'JUNE 25, 2022', 'JUNE 24, 2022', 'JUNE 24, 2022', 'JUNE 24, 2022', 'JUNE 24, 2022', 'JUNE 23, 2022', 'JUNE 23, 2022', 'JUNE 23, 2022', 'JUNE 23, 2022', 'JUNE 22, 2022', 'JUNE 22, 2022']


In [5]:
# Get site url attributes from HTML
raw_site_urls = driver.find_elements(By.XPATH, "//div[@class='ptam-block-post-grid-image']/a")

# Copy elements into list
site_urls = []
for site in raw_site_urls: site_urls.append(site.get_attribute('href'))

print(site_urls)

['https://butchixanh.com/money-heist-korea-joint-economic-area/', 'https://butchixanh.com/cafe-minamdang/', 'https://butchixanh.com/item/', 'https://butchixanh.com/brilliant-legacy/', 'https://butchixanh.com/alchemy-of-souls/', 'https://butchixanh.com/the-witch-is-alive/', 'https://butchixanh.com/the-present-is-beautiful/', 'https://butchixanh.com/miracle-2/', 'https://butchixanh.com/ultimate-weapon-alice/', 'https://butchixanh.com/anna/', 'https://butchixanh.com/why-her/', 'https://butchixanh.com/doctor-lawyer/', 'https://butchixanh.com/yumis-cells-2/', 'https://butchixanh.com/there-is-no-goo-pil-soo/', 'https://butchixanh.com/jejungwon/', 'https://butchixanh.com/scandal-in-old-seoul/', 'https://butchixanh.com/father-ill-take-care-of-you/', 'https://butchixanh.com/eve/', 'https://butchixanh.com/insider/', 'https://butchixanh.com/jinxs-lover/']


3. Check for missing values.

In [6]:
# Check that all three lists have the same length
print(len(movie_titles), len(dates), len(site_urls))

20 20 20
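
The same check can be wrapped in a small helper so it returns a clear yes/no answer instead of numbers to compare by eye. This is only a sketch: `lengths_match` and the `sample_*` lists are names introduced for illustration, with one URL left out on purpose.

```python
def lengths_match(*columns):
    """Return True when every list has the same number of items."""
    return len({len(column) for column in columns}) <= 1


sample_titles = ['Anna', 'Eve', 'Insider']
sample_dates = ['JUNE 25, 2022', 'JUNE 23, 2022', 'JUNE 22, 2022']
# One URL is missing on purpose to show a failing check.
sample_urls = ['https://butchixanh.com/anna/', 'https://butchixanh.com/eve/']

print(lengths_match(sample_titles, sample_dates))               # True
print(lengths_match(sample_titles, sample_dates, sample_urls))  # False
```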


## Continue From Page 2 to Page 67

For each remaining page, repeat the following:
1. Change the URL (here, the page number at the end of the URL) and load the new page.
2. Locate the elements again on the new page.
3. Append the scraped data to the existing lists.

In [8]:
while page < 67:
  
  # Flip Page
  page += 1
  new_url = 'https://butchixanh.com/page/' + str(page)
  driver.get(new_url)

  # Reset the temporary lists (optional, since they are reassigned below)
  raw_titles = []
  raw_dates = []
  raw_site_urls = []
  
  # Get Titles
  raw_titles = driver.find_elements(By.XPATH, "(//div[@class='ptam-block-post-grid-text']/h6[@class='ptam-block-post-grid-title']/a[@class='local-link'])")
  for title in raw_titles: movie_titles.append(title.text)

  # Get Dates
  raw_dates = driver.find_elements(By.CLASS_NAME, "ptam-block-post-grid-date")
  for date in raw_dates: dates.append(date.text)

  # Get Site URL
  raw_site_urls = driver.find_elements(By.XPATH, "//div[@class='ptam-block-post-grid-image']/a")
  for site in raw_site_urls: site_urls.append(site.get_attribute('href'))

print(len(movie_titles), len(dates), len(site_urls))

1339 1339 1339
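
As a sketch of an alternative, the page URLs visited by the loop can be generated up front with `range()` instead of a manual counter. `BASE_URL`, `LAST_PAGE` and `page_urls` are names introduced for illustration; each URL would then be passed to `driver.get()` exactly as in the loop above.

```python
BASE_URL = 'https://butchixanh.com/page/'
LAST_PAGE = 67

# Pages 2 to 67, the same pages visited by the while loop.
page_urls = [BASE_URL + str(page) for page in range(2, LAST_PAGE + 1)]

print(len(page_urls))   # 66
print(page_urls[0])     # https://butchixanh.com/page/2
print(page_urls[-1])    # https://butchixanh.com/page/67
```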


Here, `movie_titles`, `dates` and `site_urls` all contain the same number of values (1339), so no item is missing. One downside of this scraping method is that a missing element is skipped entirely: nothing is appended in its place, not even `None`. A single gap therefore shifts every later entry in that list and misaligns it with the other lists, which is a problem when we convert the lists into tabular data.
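
One way to avoid this offset is to extract the fields post by post rather than list by list, recording `None` when a field is absent. The sketch below shows the idea offline with `xml.etree.ElementTree`. The `<article>` wrapper around each post is an assumption made for illustration (the real container element would need to be confirmed on the page), and `text_or_none` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <article> wrapper is an assumption, and the
# second post deliberately has no date.
SAMPLE = """
<section>
  <article>
    <div class="ptam-block-post-grid-text">
      <h6 class="ptam-block-post-grid-title"><a class="local-link">First Post</a></h6>
      <time class="ptam-block-post-grid-date">JUNE 24, 2022</time>
    </div>
  </article>
  <article>
    <div class="ptam-block-post-grid-text">
      <h6 class="ptam-block-post-grid-title"><a class="local-link">Second Post</a></h6>
    </div>
  </article>
</section>
"""


def text_or_none(parent, path):
    """Return the text of the first match under parent, or None if absent."""
    node = parent.find(path)
    return node.text if node is not None else None


rows = []
for card in ET.fromstring(SAMPLE).iter('article'):
    # Look up both fields inside the same post so they stay aligned.
    rows.append((text_or_none(card, './/h6/a'), text_or_none(card, './/time')))

print(rows)  # [('First Post', 'JUNE 24, 2022'), ('Second Post', None)]
```

With Selenium, the equivalent would be to call `find_elements` on each post's container element instead of on `driver`, and to append `None` when the returned list is empty.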

## Convert Scraped Data

We will combine the lists created above with `zip()` and wrap the result in `list()` to get a single list of row tuples. Then, we will create a `pandas` DataFrame to view the data in tabular form.

In [9]:
# Combine lists
data = list(zip(movie_titles, dates, site_urls))

# Create Dataframe
df = pd.DataFrame(data, columns=['Movie Titles', 'Post Date', 'Site URL'])
print(df)

                                  Movie Titles        Post Date  \
0     Money Heist: Korea – Joint Economic Area    JUNE 24, 2022   
1                               Cafe Minamdang    JUNE 27, 2022   
2                                         Item    JUNE 27, 2022   
3                             Brilliant Legacy    JUNE 27, 2022   
4                             Alchemy of Souls    JUNE 25, 2022   
...                                        ...              ...   
1334                                   Watcher  AUGUST 26, 2019   
1335             Love Affairs in the Afternoon  AUGUST 25, 2019   
1336                        The King’s Letters  AUGUST 21, 2019   
1337              Designated Survivor: 60 Days  AUGUST 20, 2019   
1338                                  Parasite  AUGUST 20, 2019   

                                               Site URL  
0     https://butchixanh.com/money-heist-korea-joint...  
1                https://butchixanh.com/cafe-minamdang/  
2                    
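
The `zip()` step used above can be tried on a few short lists. This sketch also shows why the earlier length check matters: `zip()` stops at the shortest input without raising an error. The `sample_*` names are introduced for illustration, and one date is left out on purpose.

```python
sample_titles = ['Cafe Minamdang', 'Item', 'Anna']
sample_dates = ['JUNE 27, 2022', 'JUNE 27, 2022']  # one date short on purpose

# zip() pairs items by position and silently drops the unmatched title.
sample_rows = list(zip(sample_titles, sample_dates))

print(sample_rows)
# [('Cafe Minamdang', 'JUNE 27, 2022'), ('Item', 'JUNE 27, 2022')]
```

In Python 3.10 and later, `zip(..., strict=True)` raises a `ValueError` instead of truncating, which may be a safer default for this kind of data.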

Finally, we can export the DataFrame to a `CSV` file for later analysis.

In [12]:
df.to_csv('Korean_Drama.csv', index = False)
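
One detail worth knowing about the exported file: the post dates contain a comma (`JUNE 24, 2022`), so the field has to be quoted in the CSV. `to_csv` does this by default with minimal quoting, the same default as the standard library's `csv` module. The sketch below uses `csv` on an in-memory buffer to show what such rows look like and that they read back unchanged; `rows`, `buffer` and `restored` are names introduced for illustration.

```python
import csv
import io

rows = [
    ('Cafe Minamdang', 'JUNE 27, 2022', 'https://butchixanh.com/cafe-minamdang/'),
    ('Item', 'JUNE 27, 2022', 'https://butchixanh.com/item/'),
]

# Write a header and the rows to an in-memory CSV buffer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Movie Titles', 'Post Date', 'Site URL'])
writer.writerows(rows)

# The date is wrapped in quotes because it contains a comma.
print(buffer.getvalue().splitlines()[1])
# Cafe Minamdang,"JUNE 27, 2022",https://butchixanh.com/cafe-minamdang/

# Read the data back, skipping the header row.
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer)][1:]
print(restored == rows)  # True
```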