# Web scraping with Scrapy

---
format:
  html:
      code-tools: true
      page-layout: full
      
author: "Haoran Jia"
date: "2023-10-25"
---

Scraping has become an essential skill for data enthusiasts, researchers, and developers. Today, we'll explore how to extract movie and actor data from The Movie Database (TMDb) using `Scrapy`, a popular Python web scraping framework. Our goal is to scrape data about movies that share actors with a movie of our choice. Specifically, we'll start on the page of a particular movie, navigate to its cast page, and then visit each actor's page to get the list of movies they starred in.

![](https://i.etsystatic.com/12154873/r/il/a4cbbc/2493823260/il_570xN.2493823260_pioo.jpg){fig-align="center"}


## Part I: Let's start scraping!

**A note before you start:**

- For now, add `CLOSESPIDER_PAGECOUNT = 20` to the file `settings.py`. This line just prevents your scraper from downloading too much data while you’re still testing things out. You’ll remove this line later.
- You may run into `403 Forbidden` errors once the website detects that you’re a bot. One of the easiest solutions is to add the following line to the same file:  
`USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'`
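
Taken together, the two testing-time settings might look like this in `settings.py`. This is only a sketch: the user-agent string is the one quoted above, and other realistic browser strings may work just as well.

```python
# settings.py (excerpt): temporary settings while testing the spider

# Stop the crawl after roughly 20 responses; remove this line for the full crawl.
CLOSESPIDER_PAGECOUNT = 20

# Present a browser-like user agent to reduce the chance of 403 responses.
USER_AGENT = (
    "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
)
```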

### Step0: Start a Scrapy project

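After installing Scrapy (for example with `pip install scrapy`), activate your environment (here, a conda environment named `PIC16B-2`) and start a new project with the following commands:  
>`conda activate PIC16B-2`  
>`scrapy startproject TMDB_scraper`  
>`cd TMDB_scraper`  

This creates a `TMDB_scraper` folder containing the necessary project files.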

### Step1: Define a spider class

Pick your favorite movie or TV show, and locate its TMDB page by searching on https://www.themoviedb.org/. Here I chose *Léon: The Professional* as an example. Then we will create a spider class, which defines how to follow links and extract data. Save the class in a new Python file inside the project's `spiders` directory (for example `tmdb_spider.py`; the file name is up to you). In the code, the `name` attribute is a unique identifier for the spider, and `start_urls` contains the initial URL(s) the spider will begin scraping from.

In [None]:
import scrapy


class TmdbSpider(scrapy.Spider):
    """
    A Scrapy Spider class to scrape movie and actor data from The Movie Database (TMDb).

    Attributes:
        name (str): The name of the spider.
        start_urls (list): A list containing the initial URL to start scraping.
    """

    # Create a spider class, and assign a name to the spider
    name = 'tmdb_spider'
    start_urls = ['https://www.themoviedb.org/movie/101-leon-the-professional']

### Step2: Locate the Starting TMDB Page

We will first define a `parse(self, response)` method that starts on the movie page and then navigates to the Full Cast & Crew page, whose URL has the form `<movie_url>/cast`.

In [None]:
def parse(self, response):
    """
    Parses the main movie page and navigates to the cast page.

    Args:
        response (obj): The response object containing the scraped data.

    Yields:
        scrapy.Request: A new request to the cast URL.
    """
    full_credits_url = f'{response.url}/cast' # hardcode the url format
    yield scrapy.Request(full_credits_url, callback=self.parse_full_credits) # Yield a request to the cast URL
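
One detail worth guarding against: if the start URL happens to end with a slash, the f-string above would produce `//cast`. A minimal, standalone sketch of a more defensive version follows. `build_cast_url` is a hypothetical helper name used for illustration, not part of Scrapy.

```python
def build_cast_url(movie_url: str) -> str:
    """Return the cast-page URL for a TMDB movie URL (hypothetical helper)."""
    # Drop any trailing slash so we never produce '//cast'.
    return movie_url.rstrip("/") + "/cast"


print(build_cast_url("https://www.themoviedb.org/movie/101-leon-the-professional"))
# https://www.themoviedb.org/movie/101-leon-the-professional/cast
```

Inside the spider, the same idea would amount to calling `build_cast_url(response.url)` before yielding the request.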


### Step3: Locate the actor's page

Once Step2 has brought us to the cast page, we need a method called `parse_full_credits(self, response)` that extracts the actor links from that page and navigates to each actor's page.

In [None]:
def parse_full_credits(self, response):
    """
    Parses the cast page, extracting actor links and navigating to each actor's page.

    Args:
        response (obj): The response object containing the scraped data.

    Yields:
        scrapy.Request: A new request to each actor's URL.
    """
    # Extract all links to actor's page
    actor_links = response.css('ol.people.credits:not(.crew) div a::attr(href)').extract()
    # Iterate through the links and yield request to each
    pref = "https://www.themoviedb.org"
    for actor_link in actor_links:
        yield scrapy.Request(pref+actor_link, callback=self.parse_actor_page)
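
The hardcoded prefix is needed because the extracted `href` values are site-relative paths. As an alternative, Scrapy responses offer `response.urljoin`, which resolves a relative link against the URL of the current page. The standalone sketch below shows the same kind of resolution with the standard library's `urljoin`; the two actor paths are illustrative examples of the assumed link format, not values taken from a real crawl.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (the cast page from Step2).
page_url = "https://www.themoviedb.org/movie/101-leon-the-professional/cast"

# Illustrative relative links, in the format the selector is assumed to return.
actor_links = ["/person/1003-jean-reno", "/person/64-gary-oldman"]

# A leading slash makes the path replace everything after the domain.
absolute_links = [urljoin(page_url, link) for link in actor_links]
for url in absolute_links:
    print(url)
```

With this approach the domain is never typed by hand, so the spider keeps working if the start URL's scheme or host changes.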

### Step4: Scrape actor's name and a list of movies

Finally we will define a `parse_actor_page(self, response)` method to interact with the page of an actor. It will yield a dictionary with two key-value pairs, of the form `{"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}`. The method will yield one such dictionary for each of the movies or TV shows on which that actor has worked. 

In [None]:
def parse_actor_page(self, response):
    """
    Parses an actor's page, extracting the actor's name and the movies or TV shows they have participated in.

    Args:
        response (obj): The response object containing the scraped data.

    Yields:
        dict: A dictionary containing the actor's name and the name of a movie or TV show they have participated in.
    """
    # Extract actor's name
    actor_name = response.css('head title::text').extract_first().split(" —")[0]
    # Extract the titles this actor appears in, and yield one dictionary per title
    movies = response.css('td.role.true.account_adult_false.item_adult_false bdi::text').extract()
    for movie in movies:
        yield {"actor": actor_name, "movie_or_TV_name": movie}
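
The name extraction relies on the page title starting with the actor's name, followed by a space and a long dash (U+2014). It also assumes the title selector always matches: `extract_first()` returns `None` otherwise, and calling `.split` on `None` raises an error. Below is a small standalone sketch of a more forgiving version. `actor_name_from_title` is a hypothetical helper, and the sample title format is an assumption based on the split used above.

```python
from typing import Optional

EM_DASH = "\u2014"  # the dash character (U+2014) that the spider splits on


def actor_name_from_title(title: Optional[str]) -> Optional[str]:
    """Extract the actor name from a page title (hypothetical helper)."""
    if title is None:
        # extract_first() returns None when the selector matches nothing.
        return None
    # Keep everything before the first " <dash>" separator.
    return title.split(" " + EM_DASH)[0].strip()


# Assumed title format; the exact suffix on the live site may differ.
sample_title = "Jean Reno \u2014 The Movie Database (TMDB)"
print(actor_name_from_title(sample_title))  # Jean Reno
```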


### Step5: Run the spider

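From inside the project folder, run `scrapy crawl tmdb_spider -o <csv_name>.csv` to start the crawl and save the scraped data to a CSV file. Part II below reads a file named `movies.csv`. If you want the full results, remember to remove the `CLOSESPIDER_PAGECOUNT` line from `settings.py` first.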

## Part II: Visualize the result

In [1]:
# For successful rendering of the plot
import plotly.io as pio
pio.renderers.default="iframe"

In [3]:
import pandas as pd
import plotly.express as px

# Read in scraped data
df = pd.read_csv('movies.csv')

In [4]:
# Take a look at the data 
df.head()

| | actor | movie_or_TV_name |
|---|---|---|
| 0 | Peter Linari | Hustle |
| 1 | Peter Linari | Good Time |
| 2 | Peter Linari | Straighten Up and Fly Right |
| 3 | Peter Linari | Season of the Hunted |
| 4 | Peter Linari | The Curse of the Jade Scorpion |


In [5]:
# Count the actors per title and sort in descending order. The slice [1:11] skips
# the first entry (expected to be our favorite movie itself) and keeps the next ten.
top10 = df.groupby('movie_or_TV_name')['actor'].count().sort_values(ascending = False)[1:11]
top10 = pd.DataFrame(top10).reset_index() # Convert to dataframe
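
To see why the slice starts at 1, note that every scraped actor should be credited in the starting movie, so that movie is expected to have the highest count and sit in first position. The toy example below reproduces the same counting logic with `collections.Counter` instead of pandas, on made-up rows. The actor and title names are placeholders, not scraped data.

```python
from collections import Counter

# Made-up (actor, title) rows in the same shape as the scraped CSV.
rows = [
    ("Actor A", "Starting Movie"), ("Actor A", "Show X"), ("Actor A", "Film Y"),
    ("Actor B", "Starting Movie"), ("Actor B", "Show X"),
    ("Actor C", "Starting Movie"),
]

# Count how many actors are credited in each title (one row per actor/title pair).
counts = Counter(title for _, title in rows)

# most_common() sorts by count, descending, like sort_values(ascending=False).
ranked = counts.most_common()

# Skip position 0 (the starting movie itself) and keep up to the next ten.
top_shared = ranked[1:11]
print(top_shared)  # [('Show X', 2), ('Film Y', 1)]
```

If the crawl was cut short, or another title ties with the starting movie, the first entry may not be the starting movie, so it is worth checking the first row before slicing it away.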

In [6]:
# Plot a bar chart, and add title
fig = px.bar(top10, x='movie_or_TV_name', y='actor')
fig.update_layout(title_text="Top 10 movies or TV shows with most shared actors")

fig.show()

From this bar plot, we can see that *Law & Order* shares the most actors with *Léon: The Professional*.