# 1. Data Collection

## 1.0. Import Necessary Libraries

In [1]:
import pandas as pd
import time, random, re
import multiprocessing
from multiprocessing import Pool
from selenium import webdriver
from selenium.common import exceptions
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait

This section imports the essential Python libraries and modules required for automating web interactions, data extraction, and processing. Each library and module serves a specific purpose:
- **Selenium** (`webdriver`, `Service`, `WebDriverWait`, `By`, `exceptions`): A tool for automating web browsers. It lets the script interact with page elements programmatically for scraping or testing. Key functionalities include:
    - Automating web browser actions.
    - Locating and interacting with elements on a webpage.
    - Handling exceptions and timeouts during operations.
- **time**: Provides time-related functions to control the execution flow, such as adding delays between automated browser actions to mimic human behavior.
- **random**: Generates random numbers, which can be used to vary wait times so that the scraper is less likely to be detected.
- **re**: A module for working with regular expressions, enabling pattern matching for string processing tasks, such as extracting specific text from webpage content.
- **pandas (pd)**: A robust data manipulation library, used here to organize and store collected data in structured formats like DataFrames for easy analysis.
- **multiprocessing** (`Pool`): Supports parallel processing to speed up data collection by distributing tasks across multiple CPU cores, making scraping large datasets more efficient.
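
The cells in this section pause with fixed `time.sleep(...)` calls. As a minimal sketch of how `time` and `random` could be combined to vary those pauses, consider the helper below. The name `polite_sleep` and its default bounds are hypothetical and are not part of the original notebook.

```python
import random
import time


def polite_sleep(min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Pause for a random duration between min_s and max_s seconds.

    Hypothetical helper: the `sleep` argument exists only so the pause
    can be replaced by a no-op when experimenting.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


# Example: wait somewhere between 1 and 2 seconds before the next action.
waited = polite_sleep(1.0, 2.0, sleep=lambda s: None)  # no real pause here
print(f"would have waited {waited:.2f}s")
```

A call such as `polite_sleep(1, 2)` could stand in for a fixed `time.sleep(1.5)` wherever less regular timing is wanted.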

## 1.1. Extract TVShows Link

This function collects the detail-page URL of each show from a list of result elements. For each element, it locates the first `<a>` tag and reads its `href` attribute.

- **Input:** List of Selenium web elements, one per search result.
- **Output:** List of URL strings, in the same order as the input.

In [29]:
def get_links(products):
    url = []
    for p in products:
        url.append(p.find_element(By.TAG_NAME, value='a').get_attribute('href'))
    return url
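
The raw `href` values may carry query strings (for example tracking parameters), and `get_attribute('href')` returns `None` when the attribute is absent. Whether that happens on this page is an assumption. If it does, a small post-processing step along these lines could normalize and deduplicate the list. The helper name `clean_links` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_links(urls):
    """Drop query strings and fragments, skip empty values, remove duplicates.

    The first occurrence of each URL is kept, so the input order is preserved.
    """
    seen = set()
    cleaned = []
    for u in urls:
        if not u:  # get_attribute('href') can return None
            continue
        parts = urlsplit(u)
        bare = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
        if bare not in seen:
            seen.add(bare)
            cleaned.append(bare)
    return cleaned


# Hypothetical sample values, for illustration only.
sample = [
    'https://www.imdb.com/title/tt0000001/?ref_=a',
    None,
    'https://www.imdb.com/title/tt0000001/?ref_=b',
]
print(clean_links(sample))
```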

## 1.2. Get TVShows Information

This function opens a show's IMDb page with Selenium and extracts detailed information about the show.
- **Input:** A URL pointing to the show's details page.
- **Output:** A list containing the following details:

| **Field**         | **Description**                                                                                     |
|--------------------|-----------------------------------------------------------------------------------------------------|
| **Title**          | The name of the show.                                                                              |
| **Years**          | The years the show was active or released.                                                         |
| **Certification**  | The age rating of the show (e.g., PG, R).                                                          |
| **Runtime**        | Duration of the show or episodes.                                                                  |
| **Rating**         | IMDb user rating of the show.                                                                      |
| **Votes**          | Number of votes contributing to the rating.                                                        |
| **Emmys**          | Number of Primetime Emmy awards won.                                                               |
| **Creators**       | List of creators associated with the show.                                                         |
| **Actors**         | List of main cast members.                                                                         |
| **Genres**         | Categories the show falls into, such as drama, comedy, etc.                                        |
| **Origins**        | Countries of origin for the show.                                                                  |
| **Languages**      | Languages in which the show is available.                                                          |
| **Productions**    | Production companies involved in creating the show.                                                |
| **URL**            | The original URL used to extract the details (for reference).                                      |

**Key Features:**
- Uses Selenium for DOM interaction and dynamic content loading.
- Handles missing elements with `try`/`except` blocks: a field that cannot be found is stored as `pd.NA` (or `0` for Emmys).
- Extracts fields from several parts of the page: the header, awards, cast, storyline (genres), and details (origin, language, production companies).
- Scrolls and waits so that dynamically loaded sections are present before they are read.

**Usage:**
This function is suitable for gathering comprehensive metadata about shows for building datasets or conducting detailed analysis.
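
The Emmy count is parsed from the award text with the regular expression `r'Won (\d{1,2}) Primetime Emmy'`. The sketch below isolates that step so its behavior is easier to see. The function name `count_emmys` and the sample strings are hypothetical, and the strings are not taken from real pages.

```python
import re

# Same pattern as in get_shows_info: one or two digits after "Won".
EMMY_PATTERN = re.compile(r'Won (\d{1,2}) Primetime Emmy')


def count_emmys(award_text):
    """Return the number of Primetime Emmys won, or 0 if the text has no match."""
    match = EMMY_PATTERN.search(award_text)
    return int(match.group(1)) if match else 0


# Hypothetical award strings, for illustration only.
for text in ('Won 16 Primetime Emmys', 'Nominated for 3 Primetime Emmys'):
    print(text, '->', count_emmys(text))
```

Because the pattern requires the word "Won", nominations count as zero. Because it allows at most two digits, a count of 100 or more would not match and would also be recorded as zero.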

In [2]:
def get_shows_info(url):
    CHROMEDRIVER_PATH = '/usr/local/bin/chromedriver'
    service = Service(executable_path=CHROMEDRIVER_PATH)
    driver = webdriver.Chrome(service=service)
    driver.get(url)

    title = WebDriverWait(driver, 2).until(lambda x: x.find_element(By.CLASS_NAME, 'hero__primary-text')).get_attribute('innerHTML')
    try:
        years = driver.find_element(By.XPATH, '//*[@id="__next"]/main/div/section[1]/section/div[3]/section/section/div[2]/div[1]/ul/li[2]/a').text
    except exceptions.NoSuchElementException:
        years = pd.NA
    try:
        certification = driver.find_element(By.XPATH, '//*[@id="__next"]/main/div/section[1]/section/div[3]/section/section/div[2]/div[1]/ul/li[3]/a').text
    except exceptions.NoSuchElementException:
        certification = pd.NA
    try:
        runtime = driver.find_element(By.XPATH, '//*[@id="__next"]/main/div/section[1]/section/div[3]/section/section/div[2]/div[1]/ul/li[4]').text   
    except exceptions.NoSuchElementException:
        runtime = pd.NA
    try:
        rating = driver.find_element(By.XPATH, '//span[@class="sc-d541859f-1 imUuxf"]').get_attribute('innerHTML')
    except exceptions.NoSuchElementException:
        rating = pd.NA
    try:
        votes = driver.find_element(By.XPATH, '//div[@class="sc-d541859f-3 dwhNqC"]').get_attribute('innerHTML')   
    except exceptions.NoSuchElementException:
        votes = pd.NA
    emmys = 0
    try:
        awards = driver.find_element(By.XPATH, '//li[@data-testid="award_information"]//a').get_attribute('innerHTML')
        awards = re.search(r'Won (\d{1,2}) Primetime Emmy', awards)
        emmys = int(awards.group(1)) if awards else 0
    except exceptions.NoSuchElementException:
        pass

    creators = []
    actors = []
    try:
        title_cast = driver.find_element(By.XPATH, '//section[@data-testid="title-cast"]')  
        driver.execute_script("arguments[0].scrollIntoView();", title_cast)
        time.sleep(1)
        try:
            creator = title_cast.find_element(By.XPATH, './/ul[contains(@class, "ipc-metadata-list")]//ul')
            for c in creator.find_elements(By.TAG_NAME, 'a'):
                creators.append(c.get_attribute('innerHTML'))
        except exceptions.NoSuchElementException:
            pass
        for c in title_cast.find_elements(By.XPATH, './/div[@data-testid="title-cast-item"]'):
            try:
                actors.append(c.find_element(By.XPATH, './/a[@data-testid="title-cast-item__actor"]').get_attribute('innerHTML'))
            except exceptions.NoSuchElementException:
                pass
    except exceptions.NoSuchElementException:
        pass

    creators = ', '.join(creators) if creators else pd.NA
    actors = ', '.join(actors) if actors else pd.NA
    # Scroll until the storyline section is present; stays None if it never loads.
    storyline = None
    for _ in range(5):
        try:
            storyline = WebDriverWait(driver, 2).until(lambda x: x.find_element(By.XPATH, '//li[@data-testid="storyline-genres"]'))
            driver.execute_script("arguments[0].scrollIntoView();", storyline)
            time.sleep(1)
            break
        except exceptions.TimeoutException:
            driver.execute_script("window.scrollBy(0, 1000);")
            time.sleep(1)

    time.sleep(1)
    genres = []
    # Guard against a NameError/AttributeError when the section was never found.
    if storyline is not None:
        for e in storyline.find_elements(By.TAG_NAME, 'a'):
            try:
                genres.append(e.get_attribute('innerHTML'))
            except exceptions.StaleElementReferenceException:
                print(title)  # log the show whose genre link went stale
    genres = ', '.join(genres) if genres else pd.NA
    # Scroll until the Details section is present; stays None if it never loads.
    details = None
    for _ in range(5):
        try:
            details = WebDriverWait(driver, 2).until(lambda x: x.find_element(By.XPATH, '//section[@data-testid="Details"]'))
            driver.execute_script("arguments[0].scrollIntoView();", details)
            time.sleep(1)
            break
        except exceptions.TimeoutException:
            driver.execute_script("window.scrollBy(0, 1000);")
            time.sleep(1)

    # Each lookup below is skipped when the Details section was never found.
    origins = []
    if details is not None:
        try:
            og = WebDriverWait(details, 2).until(lambda x: x.find_elements(By.XPATH, './/li[@data-testid="title-details-origin"]//ul//a'))
            for o in og:
                origins.append(o.get_attribute('innerHTML'))
        except exceptions.TimeoutException:
            pass
    origins = ', '.join(origins) if origins else pd.NA

    languages = []
    if details is not None:
        try:
            lang = WebDriverWait(details, 2).until(lambda x: x.find_elements(By.XPATH, './/li[@data-testid="title-details-languages"]//ul//a'))
            for l in lang:
                languages.append(l.get_attribute('innerHTML'))
        except exceptions.TimeoutException:
            pass
    languages = ', '.join(languages) if languages else pd.NA

    productions = []
    if details is not None:
        try:
            prods = WebDriverWait(details, 2).until(lambda x: x.find_elements(By.XPATH, './/li[@data-testid="title-details-companies"]//ul//a'))
            for p in prods:
                productions.append(p.get_attribute('innerHTML'))
        except exceptions.TimeoutException:
            pass
    productions = ', '.join(productions) if productions else pd.NA

    driver.quit()  # end the whole session, including the chromedriver process
    return [title, years, certification, runtime, rating, votes, emmys, creators, actors, genres, origins, languages, productions, url]
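
The function scrolls and retries twice with the same pattern: try to locate a section, and on a timeout scroll down and try again, up to five times. As a sketch only, that pattern could be factored into one helper that returns `None` when the section never appears. The name `locate_with_scroll` and its parameters are hypothetical. The helper takes plain callables, so it does not depend on Selenium itself.

```python
import time


def locate_with_scroll(locate, scroll, attempts=5, pause=1.0, retry_on=(Exception,)):
    """Call locate() up to `attempts` times, scrolling after each failure.

    Returns whatever locate() returns, or None if every attempt raised
    one of the exception types listed in `retry_on`.
    """
    for _ in range(attempts):
        try:
            return locate()
        except retry_on:
            scroll()           # bring more of the page into view
            time.sleep(pause)  # give dynamic content time to load
    return None


# Stand-in callables, used here instead of a real browser.
state = {'tries': 0}

def fake_locate():
    state['tries'] += 1
    if state['tries'] < 3:
        raise LookupError('section not loaded yet')
    return 'section'

print(locate_with_scroll(fake_locate, lambda: None, pause=0, retry_on=(LookupError,)))
```

With Selenium, `locate` would wrap the `WebDriverWait(driver, 2).until(...)` call, `scroll` would wrap `driver.execute_script("window.scrollBy(0, 1000);")`, and `retry_on` would be `(exceptions.TimeoutException,)`. The caller can then test the result against `None` before reading from it.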

## 1.3. Code Explanation and Workflow

This script collects links to TV series and miniseries from IMDb's title search, restricted to titles with at least 25,000 votes and sorted by user rating in descending order.

**Key Components:**
| **Component**             | **Description**                                                                                      |
|----------------------------|------------------------------------------------------------------------------------------------------|
| **CHROMEDRIVER_PATH**      | Specifies the path to the ChromeDriver executable needed for Selenium to control the Chrome browser. |
| **Base URL**               | IMDb search URL for TV series and miniseries with at least 25,000 votes, sorted by user rating (descending). |
| **Driver Setup**           | Initializes the Selenium WebDriver with the specified ChromeDriver service.                         |
| **Navigating the Base URL**| Accesses the IMDb base URL using the WebDriver.                                                     |
| **Handling "Load More" Button** | A loop that clicks the "Load More" button until it can no longer be found (`NoSuchElementException`). If a click is intercepted (`ElementClickInterceptedException`), the button is scrolled into view and the click is retried. |
| **Extracting Show Links** | Uses the `get_links()` function to extract and store the URL of every listed show.               |
| **Saving Links to File**   | Writes the scraped URLs into a text file named `links.txt`.                                         |
| **Driver Closure**         | Closes the browser session once the scraping process is complete.                                   |

**Output:**
- A text file named `links.txt` containing all the scraped URLs.

**Code Behavior:**
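- **Dynamic Loading**: Keeps clicking "Load More" until the button disappears, so that all results are on the page before links are extracted.
- **Resilience**: Scrolls the button into view when a click is intercepted, and exits the loop cleanly once the button is no longer present.
- **Efficiency**: Collects all links in a single pass after loading finishes and writes them to the file one URL per line.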

In [None]:
CHROMEDRIVER_PATH = '/usr/local/bin/chromedriver'
base = 'https://www.imdb.com/search/title/?title_type=tv_series,tv_miniseries&num_votes=25000,&sort=user_rating,desc'
service = Service(executable_path=CHROMEDRIVER_PATH)
driver = webdriver.Chrome(service=service)
driver.get(base)

while True:
    try:
        button = driver.find_element(By.XPATH, '//button[@class="ipc-btn ipc-btn--single-padding ipc-btn--center-align-content ipc-btn--default-height ipc-btn--core-base ipc-btn--theme-base ipc-btn--button-radius ipc-btn--on-accent2 ipc-text-button ipc-see-more__button"]')
        button.click()
    except exceptions.ElementClickInterceptedException:
        driver.execute_script("arguments[0].scrollIntoView();", button)
    except exceptions.NoSuchElementException:
        break

    time.sleep(1.5)
products = driver.find_elements(By.XPATH, value='//li[@class="ipc-metadata-list-summary-item"]')
links = get_links(products)
print(len(links))

with open('links.txt', 'w') as f:
    f.write("\n".join(links))
driver.quit()  # end the whole session, including the chromedriver process
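
The later cells use the in-memory `links` list. If the notebook is restarted after this step, the list could be rebuilt from `links.txt` instead of scraping again. A minimal sketch, assuming the one-URL-per-line layout written above; the helper name `load_links` is hypothetical.

```python
from pathlib import Path


def load_links(path='links.txt'):
    """Return the non-empty lines of the link file, stripped of whitespace."""
    text = Path(path).read_text(encoding='utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]


# Usage (assumes links.txt exists in the working directory):
# links = load_links()
# print(len(links))
```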

## 1.4. Adding Multiprocessing and Saving Data

After scraping the links, we use multiprocessing to gather the detailed information for each TV show in parallel, and then store the results in a CSV file.

**Key Components:**
| **Component**               | **Description**                                                                                     |
|-----------------------------|-----------------------------------------------------------------------------------------------------|
| **Multiprocessing**          | Creates a pool of processes to gather detailed information for each link in the `links` list.       |
| **get_shows_info**           | A function that fetches detailed information for each TV show from the scraped links.               |
| **DataFrame**                | Uses `pandas` to store the detailed information into a DataFrame.                                    |
| **CSV File Output**          | Saves the DataFrame to a CSV file named `tvshows_raw.csv`.                                       |

**Workflow Explanation:**
1. **Setting Up CPU Cores**:
   - The number of CPU cores to be used for multiprocessing is determined. Here, the script uses half of the available CPU cores.
2. **Creating a Pool of Processes**:
   - A `multiprocessing.Pool` is used to create a pool of processes, each of which calls the `get_shows_info()` function to fetch detailed information for each link in the `links` list.
3. **Gathering Detailed Information**:
   - Each process will fetch detailed information about a TV show, including attributes like title, year of release, certification, runtime, rating, number of votes, and more.
4. **Storing Results in a DataFrame**:
   - After gathering the information, the results are stored in a `pandas` DataFrame.
5. **Saving Results to CSV**:
   - The DataFrame is saved to a CSV file named `tvshows_raw.csv`.
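
One consequence of this workflow is that `Pool.map` re-raises the first exception from any worker, so a single page that fails to load would discard the whole batch. As a sketch only, a wrapper like the one below could turn a failure into a placeholder row instead. The names `safe_call` and `N_FIELDS` are hypothetical; the value 14 matches the length of the list returned by `get_shows_info`.

```python
N_FIELDS = 14  # length of the row returned by get_shows_info


def safe_call(fetch, url):
    """Run fetch(url); on any exception return a placeholder row.

    The placeholder keeps the URL in the last position, so failed pages
    can be identified and retried later.
    """
    try:
        return fetch(url)
    except Exception as exc:
        print(f'failed: {url} ({type(exc).__name__})')
        return [None] * (N_FIELDS - 1) + [url]


# Stand-in fetcher that always fails, used here instead of a real scrape.
def broken_fetch(url):
    raise RuntimeError('page did not load')

print(safe_call(broken_fetch, 'https://example.com/show'))
```

With a pool, this could be applied as `p.map(functools.partial(safe_call, get_shows_info), links)`, assuming both functions can be pickled on the platform in use.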

In [None]:
if __name__ == '__main__':
    # Use half of the available cores, but always at least one worker
    # (cpu_count() // 2 is 0 on a single-core machine, which Pool rejects).
    coresNr = max(1, multiprocessing.cpu_count() // 2)
    with Pool(coresNr) as p:
        # Each call opens its own browser and returns one row per link.
        tvshows = p.map(get_shows_info, links)
    df = pd.DataFrame(tvshows, columns=['Title', 'Years', 'Certification', 'Runtime', 'Rating', 'Number of Votes', 'Emmys', 'Creators', 'Actors', 'Genres', 'Countries of origins', 'Languages', 'Production companies', 'Link'])

In [27]:
df.to_csv('tvshows_raw.csv', index=False, header=True)