<a href="https://colab.research.google.com/github/GoAshim/WebScraping/blob/main/Web_Scraping_2_Top_1000_Movies_from_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Info on the Top 1000 All-Time Hit Movies from IMDB
In this web scraping exercise, we are going to scrape the top 1000 all-time hit movies from the IMDB site (link [here](https://www.imdb.com/list/ls006266261/)).


## Summary
It's no surprise that IMDB is a go-to source when it comes to movies. It has compiled a list of the 1000 greatest movies of all time and published it at the link above. We are going to collect the following data points for each movie and load them into multiple dataframes for further analysis.


*   Name of the movie to data frame 1
*   Year of release to data frame 1
*   Runtime to data frame 1
*   Rating to data frame 1
*   Gross earning to data frame 1
*   Directors and Stars to data frame 2

We also see that the URL above lists only 100 of the top 1000 movies, so we will programmatically navigate to the following pages to collect the information for all 1000 movies.






### Step 1 - Import required libraries

In [1]:
import requests # To pull data from webpage
from bs4 import BeautifulSoup # To parse data pulled from the webpage
import pandas as pd # To view, modify and store data parsed from the webpage 
import re # Regular expression library, will use to match string


### Step 2 - Extract the content of the webpage

In [2]:
url = "https://www.imdb.com/list/ls006266261/"

# Using requests.get to fetch the source content of the page
page_data = requests.get(url).text

# Using BeautifulSoup to parse the content with the lxml parser
soup = BeautifulSoup(page_data, "lxml")

### Step 3 - Extract and print the names of the movies on the page

In [3]:
# This is a manual step: I inspected the source code of the page in my web browser and found that each movie sits in a div tag with
# class="lister-item mode-detail", so we will use that to extract those div tags
movies = soup.find_all('div', {"class" : "lister-item mode-detail"})

# Then let's print the titles of the first 10 movies
n = 0
for movie in movies:
  movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()
  n += 1
  print(movie_title)

  if n == 10:
    break


The Godfather
Goodfellas
Pulp Fiction
The Usual Suspects
Apocalypse Now
Trainspotting
Fight Club
Schindler's List
Boogie Nights
Reservoir Dogs
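
The counter-and-break pattern in the loop above is one way to stop after ten titles. As a minimal sketch, the same cut-off can be expressed with `itertools.islice`. The `titles` list and the `first_n` helper are hypothetical stand-ins used only for illustration; in the notebook the items would come from `movies`.

```python
from itertools import islice

# Hypothetical stand-in for the titles pulled out of each movie tag
titles = ["The Godfather", "Goodfellas", "Pulp Fiction", "The Usual Suspects"]

def first_n(items, n):
    """Return at most the first n items of any iterable as a list."""
    return list(islice(items, n))

for title in first_n(titles, 2):
    print(title)
```

Because `islice` stops as soon as `n` items have been taken, no separate counter variable is needed.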


### Step 4 - Extract and print all information we need to extract for the first movie

In [4]:
for movie in movies:
  movie_rank = movie.find('span', {'class' : "lister-item-index unbold text-primary"}).get_text()[:-1]
  movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()
  movie_year = movie.find('span', {'class' : "lister-item-year text-muted unbold"}).get_text()[1:-1]
  movie_runtime = movie.find('span', {'class' : "runtime"}).get_text()
  movie_rating = movie.find('span', {'class' : "ipl-rating-star__rating"}).get_text()

  # To find the gross earning of the movie, we first find the tag whose content is 'Gross:' and then go to its next sibling
  movie_earning = movie.find('span', text="Gross:").find_next_sibling().get_text()

  # Extract directors and stars name based on text 'Director' and navigate to the surrounding tags to get the info
  movie_directors = []
  movie_stars = []
  star = 0
  cast_tag = movie.find(text=re.compile("Director")).parent # This extracts the tag which had 'Director' anywhere in the content

  for tag in cast_tag.find_all(['a','span']):
    if tag.name == 'span':
      star = 1
      continue
      
    if star == 0:
      movie_directors.append(tag.get_text())
    else:
      movie_stars.append(tag.get_text())

  print(movie_rank)
  print(movie_title)
  print(movie_year)
  print(movie_runtime)
  print(movie_earning)
  print(movie_directors)
  print(movie_stars)
  
  break

1
The Godfather
1972
175 min
$134.97M
['Francis Ford Coppola']
['Marlon Brando', 'Al Pacino', 'James Caan', 'Diane Keaton']
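
The runtime and gross earning come back as strings such as `175 min` and `$134.97M`, which are awkward to analyse numerically. The sketch below shows one possible way to convert them, assuming every value follows the formats seen in the output above. The function names are hypothetical and not part of the original notebook.

```python
import re

def parse_runtime_minutes(text):
    """Turn a string such as '175 min' into the integer 175, or None if it has no digits."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def parse_gross_millions(text):
    """Turn a string such as '$134.97M' into the float 134.97 (millions of dollars), or None."""
    match = re.fullmatch(r"\$(\d[\d,]*(?:\.\d+)?)M", text.strip())
    return float(match.group(1).replace(",", "")) if match else None

print(parse_runtime_minutes("175 min"))
print(parse_gross_millions("$134.97M"))
```

Both helpers return `None` for text that does not match, so unexpected values can be spotted later instead of raising an error in the middle of a scrape.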


### Step 5 - Loop through the page to extract all required info of all the movies in the page and store that into appropriate dataframes

In [6]:
# First let's create two dataframes
df_movies = pd.DataFrame(columns=['rank','title','year_Released','runtime','rating','earning'])
df_casts = pd.DataFrame(columns=['rank','role','name'])

for movie in movies:
  movie_rank = movie.find('span', {'class' : "lister-item-index unbold text-primary"}).get_text()[:-1]
  movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()
  movie_year = movie.find('span', {'class' : "lister-item-year text-muted unbold"}).get_text()[1:-1]
  movie_runtime = movie.find('span', {'class' : "runtime"}).get_text()
  movie_rating = movie.find('span', {'class' : "ipl-rating-star__rating"}).get_text()

  # To find the gross earning of the movie, we first find the tag whose content is 'Gross:' and then go to its next sibling
  movie_earning = movie.find('span', text="Gross:").find_next_sibling().get_text()

  # Insert the extracted info into the df_movies dataframe (one value per column, in column order)
  df_movies.loc[len(df_movies.index)] = [movie_rank, movie_title, movie_year, movie_runtime, movie_rating, movie_earning]


  # Extract directors and stars name based on text 'Director' and navigate to the surrounding tags to get the info
  n = 0
  cast_tag = movie.find(text=re.compile("Director")).parent # This extracts the tag which had 'Director' anywhere in the content

  for tag in cast_tag.find_all(['a','span']):
    if tag.name == 'span':
      n = 1
      continue
      
    if n == 0:
      director = tag.get_text()
      df_casts.loc[len(df_casts.index)] = [movie_rank, 'Director', director]
    else:
      star = tag.get_text()
      df_casts.loc[len(df_casts.index)] = [movie_rank, 'Star', star]

df_movies.head(10)
df_casts.head(20)

| | rank | role | name |
|---|---|---|---|
| 0 | 1 | Director | Francis Ford Coppola |
| 1 | 1 | Star | Marlon Brando |
| 2 | 1 | Star | Al Pacino |
| 3 | 1 | Star | James Caan |
| 4 | 1 | Star | Diane Keaton |
| 5 | 2 | Director | Martin Scorsese |
| 6 | 2 | Star | Robert De Niro |
| 7 | 2 | Star | Ray Liotta |
| 8 | 2 | Star | Joe Pesci |
| 9 | 2 | Star | Lorraine Bracco |
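
The director/star split in the loop above relies on a flag: links seen before the first `span` are treated as directors, and links seen after it as stars. A minimal sketch of that logic, isolated from BeautifulSoup, is shown here. The `split_cast` function and the `(tag_name, text)` pairs are hypothetical simplifications of what `cast_tag.find_all(['a','span'])` yields.

```python
def split_cast(parts):
    """Split (tag_name, text) pairs into directors and stars.

    Links seen before the first 'span' separator count as directors;
    links seen after it count as stars.
    """
    directors, stars = [], []
    seen_separator = False
    for tag_name, text in parts:
        if tag_name == "span":
            seen_separator = True
            continue
        (stars if seen_separator else directors).append(text)
    return directors, stars

# Hypothetical flattened view of one movie's cast paragraph
parts = [("a", "Francis Ford Coppola"), ("span", "|"), ("a", "Marlon Brando"), ("a", "Al Pacino")]
print(split_cast(parts))
```

Keeping the split in a small function like this makes it possible to check the logic on hand-written input before running it against a live page.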


### Step 6 - Loop through all the pages by finding the link of the next page till the end.

In [13]:
# Define a function to capture the link of the next page 
def getNextPage(url):
  page_data = requests.get(url).text
  soup = BeautifulSoup(page_data, "lxml")

  # Extract the URL of the next page (the link is missing on the last page)
  next_link_tag = soup.find('div', {"class" : "list-pagination"}).find('a', {"class" : "flat-button lister-page-next next-page"})

  if next_link_tag:
    next_link_url = "https://www.imdb.com" + next_link_tag['href']
  else:
    next_link_url = "End"

  return next_link_url
# End of function

# The URL for the starting page 
url = "https://www.imdb.com/list/ls006266261/"

while True:
  next_link = getNextPage(url)

  if next_link == "End":
    break
  else:
    print(next_link)
    url = next_link

https://www.imdb.com/list/ls006266261/?page=2
https://www.imdb.com/list/ls006266261/?page=3
https://www.imdb.com/list/ls006266261/?page=4
https://www.imdb.com/list/ls006266261/?page=5
https://www.imdb.com/list/ls006266261/?page=6
https://www.imdb.com/list/ls006266261/?page=7
https://www.imdb.com/list/ls006266261/?page=8
https://www.imdb.com/list/ls006266261/?page=9
https://www.imdb.com/list/ls006266261/?page=10
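
The printed links follow a `?page=N` pattern. As an alternative sketch, the page URLs could be built directly instead of being discovered one request at a time. This assumes the pattern holds for every page and that the list ends at page 10; both assumptions come only from the output above. `build_page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/list/ls006266261/"

def build_page_urls(base_url, last_page):
    """Return the list URL for pages 1..last_page; page 1 is the bare base URL."""
    urls = [base_url]
    for page in range(2, last_page + 1):
        urls.append(base_url + "?" + urlencode({"page": page}))
    return urls

for page_url in build_page_urls(BASE_URL, 10):
    print(page_url)
```

Following the next-page link, as the notebook does, is more robust when the number of pages is unknown; building the URLs up front is simpler when it is known.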


### Step 7 - Put all the code together to traverse each page, collect the relevant movie info, and populate the data frames. The code continues from one page to the next until it reaches the last page.

In [18]:
# Define a function to scrape movie data from a page and then return the link of the next page 
def getMoviesFromPageThenGoToNextPage(url):
  page_data = requests.get(url).text
  soup = BeautifulSoup(page_data, "lxml")

  # Code from previous steps to scrape movie data from a page and populate it into 2 data frames
  df_movies = pd.DataFrame(columns=['rank','title','year_Released','runtime','rating','earning'])
  df_casts = pd.DataFrame(columns=['rank','role','name'])

  movies = soup.find_all('div', {"class" : "lister-item mode-detail"})

  for movie in movies:
    movie_rank = movie.find('span', {'class' : "lister-item-index unbold text-primary"}).get_text()[:-1]
    movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()

    # Print each movie title as the code scrapes the page, to help debug where the code breaks due to a missing value
    print(movie_title)
    
    movie_year = movie.find('span', {'class' : "lister-item-year text-muted unbold"}).get_text()[1:-1]
    movie_runtime = movie.find('span', {'class' : "runtime"}).get_text()
    movie_rating = movie.find('span', {'class' : "ipl-rating-star__rating"}).get_text()

    # To find the gross earning of the movie, we first find the tag whose content is 'Gross:' and then go to its next sibling.
    # Not every movie lists a gross earning, so fall back to None instead of reusing the previous movie's value
    gross_tag = movie.find('span', text="Gross:")
    movie_earning = gross_tag.find_next_sibling().get_text() if gross_tag else None

    # Insert the extracted info into the df_movies dataframe (one value per column, in column order)
    df_movies.loc[len(df_movies.index)] = [movie_rank, movie_title, movie_year, movie_runtime, movie_rating, movie_earning]

    # Extract directors and stars name based on text 'Director' and navigate to the surrounding tags to get the info
    n = 0
    
    if movie.find(text=re.compile("Director")):
      # When both director and stars information is available

      cast_tag = movie.find(text=re.compile("Director")).parent # This extracts the tag which had 'Director' anywhere in the content

      for tag in cast_tag.find_all(['a','span']):
        if tag.name == 'span':
          n = 1
          continue
          
        # Insert the name of the directors and stars into the df_casts dataframe
        if n == 0:
          director = tag.get_text()
          df_casts.loc[len(df_casts.index)] = [movie_rank, 'Director', director]
        else:
          star = tag.get_text()
          df_casts.loc[len(df_casts.index)] = [movie_rank, 'Star', star]

    else:
      # When only the stars information is available
      cast_tag = movie.find(text=re.compile("Stars:")).parent # This extracts the tag which had 'Stars:' anywhere in the content

      for tag in cast_tag.find_all(['a']):
        star = tag.get_text()
        df_casts.loc[len(df_casts.index)] = [movie_rank, 'Star', star]

  # End of populating data frame from movies in the current page 


  # Extract the URL of the next page (the link is missing on the last page)
  next_link_tag = soup.find('div', {"class" : "list-pagination"}).find('a', {"class" : "flat-button lister-page-next next-page"})

  if next_link_tag:
    next_link_url = "https://www.imdb.com" + next_link_tag['href']
  else:
    next_link_url = "End"

  return next_link_url
# End of the function



# The URL for the starting page 
url = "https://www.imdb.com/list/ls006266261/"
page_number = 1

while True:
  # As with the movie titles above, we print the page number while debugging to find out on which page the code breaks
  print("Page Number: "+str(page_number))
  
  next_link = getMoviesFromPageThenGoToNextPage(url)

  if next_link == "End":
    break
  else:
    page_number += 1
    url = next_link

Page Number: 1
The Godfather
Goodfellas
Pulp Fiction
The Usual Suspects
Apocalypse Now
Trainspotting
Fight Club
Schindler's List
Boogie Nights
Reservoir Dogs
The Shawshank Redemption
Jaws
Taxi Driver
L.A. Confidential
Back to the Future
The Godfather Part II
Fargo
The Dark Knight
The Matrix
Magnolia
Scarface
The Royal Tenenbaums
Donnie Darko
Platoon
Heat
American Beauty
The Big Lebowski
Raging Bull
A Prophet
The Departed
There's Something About Mary
Léon: The Professional
City of God
Once Upon a Time in America
Lock, Stock and Two Smoking Barrels
The Truman Show
Toy Story
Alien
Rushmore
Se7en
Good Will Hunting
Aliens
The Exorcist
One Flew Over the Cuckoo's Nest
Chinatown
Casino
True Romance
Gladiator
Indiana Jones and the Raiders of the Lost Ark
Miller's Crossing
Being John Malkovich
Die Hard
Casablanca
Kick-Ass
The Prestige
Inglourious Basterds
Evil Dead II
Jurassic Park
Batman Begins
The Green Mile
The French Connection
Unforgiven
Drive
The Shining
The Evil Dead
Forrest Gump
The Thin
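
The `while True` loop above depends on the scraping function eventually returning `"End"`. A hedged sketch of the same loop with two safeguards, a cap on the number of pages and a pause between requests, is shown below. `crawl`, `max_pages` and `delay_seconds` are hypothetical names, and the fake scraper stands in for `getMoviesFromPageThenGoToNextPage` so that the example runs without network access.

```python
import time

def crawl(start_url, scrape_page, max_pages=20, delay_seconds=1.0):
    """Follow next-page links until scrape_page returns "End" or max_pages is reached.

    scrape_page(url) must return the next URL, or the string "End".
    Returns the list of URLs visited, in order.
    """
    visited = []
    url = start_url
    while url != "End" and len(visited) < max_pages:
        visited.append(url)
        url = scrape_page(url)
        if url != "End":
            time.sleep(delay_seconds)  # pause before moving on to the next page
    return visited

# Fake scraper: pretends the list has three pages
fake_links = {"page1": "page2", "page2": "page3", "page3": "End"}
print(crawl("page1", fake_links.get, delay_seconds=0))
```

With the real function the call would look like `crawl(url, getMoviesFromPageThenGoToNextPage)`; the page cap keeps the loop from running indefinitely if the pagination markup ever changes.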

### Step 8 - Tune the code in step 7 to accommodate any missing data on a page

In [23]:
# Define a function to scrape movie data from a page and then return the link of the next page 
def getMoviesFromPageThenGoToNextPage(url):
  page_data = requests.get(url).text
  soup = BeautifulSoup(page_data, "lxml")
  movies = soup.find_all('div', {"class" : "lister-item mode-detail"})

  for movie in movies:
    movie_rank = movie.find('span', {'class' : "lister-item-index unbold text-primary"}).get_text()[:-1]
    movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()
    movie_year = movie.find('span', {'class' : "lister-item-year text-muted unbold"}).get_text()[1:-1]
    movie_runtime = movie.find('span', {'class' : "runtime"}).get_text()
    movie_rating = movie.find('span', {'class' : "ipl-rating-star__rating"}).get_text()

    # To find the gross earning of the movie, we first find the tag whose content is 'Gross:' and then go to its next sibling.
    # Not every movie lists a gross earning, so fall back to None instead of reusing the previous movie's value
    gross_tag = movie.find('span', text="Gross:")
    movie_earning = gross_tag.find_next_sibling().get_text() if gross_tag else None

    # Insert the extracted info into the df_movies dataframe (one value per column, in column order)
    df_movies.loc[len(df_movies.index)] = [movie_rank, movie_title, movie_year, movie_runtime, movie_rating, movie_earning]

    # Extract directors and stars name based on text 'Director' and navigate to the surrounding tags to get the info
    n = 0
    
    if movie.find(text=re.compile("Director")):
      # When both director and stars information is available

      cast_tag = movie.find(text=re.compile("Director")).parent # This extracts the tag which had 'Director' anywhere in the content

      for tag in cast_tag.find_all(['a','span']):
        if tag.name == 'span':
          n = 1
          continue
          
        # Insert the name of the directors and stars into the df_casts dataframe
        if n == 0:
          director = tag.get_text()
          df_casts.loc[len(df_casts.index)] = [movie_rank, 'Director', director]
        else:
          star = tag.get_text()
          df_casts.loc[len(df_casts.index)] = [movie_rank, 'Star', star]

    else:
      # When only the stars information is available
      cast_tag = movie.find(text=re.compile("Stars:")).parent # This extracts the tag which had 'Stars:' anywhere in the content

      for tag in cast_tag.find_all(['a']):
        star = tag.get_text()
        df_casts.loc[len(df_casts.index)] = [movie_rank, 'Star', star]

  # End of populating data frame from movies in the current page 


  # Extract the URL of the next page (the link is missing on the last page)
  next_link_tag = soup.find('div', {"class" : "list-pagination"}).find('a', {"class" : "flat-button lister-page-next next-page"})

  if next_link_tag:
    next_link_url = "https://www.imdb.com" + next_link_tag['href']
  else:
    next_link_url = "End"

  return next_link_url
# End of the function


# The URL for the starting page 
url = "https://www.imdb.com/list/ls006266261/"
page_number = 1
df_movies = pd.DataFrame(columns=['rank','title','year_Released','runtime','rating','earning'])
df_casts = pd.DataFrame(columns=['rank','role','name'])

while True:
  next_link = getMoviesFromPageThenGoToNextPage(url)

  if next_link == "End":
    break
  else:
    page_number += 1
    url = next_link
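
Calls such as `movie.find(...).get_text()` raise an `AttributeError` when `find` returns `None`, which happens whenever a field is absent for a movie. A minimal sketch of a guard for that case is shown below. `safe_text` is a hypothetical helper, not part of the original notebook, and the `FakeTag` class only imitates the `get_text()` method of a BeautifulSoup tag so that the example is self-contained.

```python
def safe_text(tag, default=None):
    """Return the stripped text of a tag-like object, or default when the tag is None."""
    if tag is None:
        return default
    return tag.get_text().strip()

class FakeTag:
    """Stand-in for a BeautifulSoup tag; only get_text() is imitated."""
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text

print(safe_text(FakeTag(" 175 min ")))
print(safe_text(None, default="N/A"))
```

Inside the scraping function this could be used as `movie_runtime = safe_text(movie.find('span', {'class' : "runtime"}))`, so a missing runtime becomes `None` in the data frame instead of stopping the run.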