<a href="https://colab.research.google.com/github/GoAshim/WebScraping/blob/main/Web_Scraping_2_Top_1000_Movies_from_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Info on the Top 1000 All-Time Hit Movies from IMDb
In this web scraping exercise we are going to scrape the 1000 all-time hit movies from the IMDb site (link [here](https://www.imdb.com/list/ls006266261/)).


## Summary
IMDb is one of the best-known sources of movie information. The site hosts a list of the 1000 greatest movies of all time at the link above. We are going to collect the following data points for each movie and load them into multiple data frames for further analysis.


*   Name of the movie to data frame 1
*   Year of release to data frame 1
*   Runtime to data frame 1
*   Rating to data frame 1
*   Gross earning to data frame 1
*   Directors and Stars to data frame 2
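
One possible way to lay out the two data frames is sketched below. It is an assumption, not necessarily the layout used later in this notebook: the function name `split_record`, the `role` column, and the use of the title as the link between the two tables are all hypothetical. The sample values are the ones printed for The Godfather in Step 4.

```python
# Hypothetical record for one movie (rating omitted here).
record = {
    "rank": "1",
    "title": "The Godfather",
    "year": "1972",
    "runtime": "175 min",
    "earning": "$134.97M",
    "directors": ["Francis Ford Coppola"],
    "stars": ["Marlon Brando", "Al Pacino", "James Caan", "Diane Keaton"],
}

def split_record(record):
    """Split one scraped record into a movie row and a list of cast rows."""
    # Row for data frame 1: every field except the two name lists.
    movie_row = {key: value for key, value in record.items()
                 if key not in ("directors", "stars")}
    # Rows for data frame 2: one row per person, linked back by title (assumed key).
    cast_rows = [{"title": record["title"], "role": "director", "name": name}
                 for name in record["directors"]]
    cast_rows += [{"title": record["title"], "role": "star", "name": name}
                  for name in record["stars"]]
    return movie_row, cast_rows

movie_row, cast_rows = split_record(record)
```

Collecting every `movie_row` in one list and every `cast_rows` entry in another gives two lists of dictionaries, each of which can be passed to `pd.DataFrame(...)`.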

The URL above shows only 100 of the top 1000 movies per page, so we will programmatically navigate to the following pages to collect the information for all 1000 movies.
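
As a rough sketch of the paging idea, the snippet below builds one URL per page. It assumes the list accepts a `page` query parameter, which is not confirmed anywhere in this notebook; following the "Next" link found on each page would be an alternative. The helper name `list_page_urls` is hypothetical.

```python
from urllib.parse import urlencode

def list_page_urls(base_url, total_items=1000, items_per_page=100):
    """Return one URL per page needed to cover total_items."""
    # Ceiling division, so a partial last page still gets its own URL.
    page_count = -(-total_items // items_per_page)
    # ASSUMPTION: the site paginates through a `page` query parameter.
    return [f"{base_url}?{urlencode({'page': page})}"
            for page in range(1, page_count + 1)]

urls = list_page_urls("https://www.imdb.com/list/ls006266261/")
```

Each URL in `urls` could then be fetched and parsed in the same way as the first page.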






### Step 1 - Import required libraries

In [None]:
import requests # To pull data from webpage
from bs4 import BeautifulSoup # To parse data pulled from the webpage
import pandas as pd # To view, modify and store data parsed from the webpage 
import re # Regular expression library, will use to match string


### Step 2 - Extract the content of the webpage

In [None]:
url = "https://www.imdb.com/list/ls006266261/"

# Using requests.get to fetch the source content of the page
page_data = requests.get(url).text

# Using BeautifulSoup to parse the content with the lxml parser
soup = BeautifulSoup(page_data, "lxml")

### Step 3 - Extract and print the names of the movies on the page

In [None]:
# This is a manual step: I inspected the page source in my web browser and found that each movie sits in a div tag with
# class="lister-item mode-detail", so we use that to collect those div tags
movies = soup.find_all('div', {"class" : "lister-item mode-detail"})

# Then let's print the titles of the first 10 movies
for movie in movies[:10]:
  movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()
  print(movie_title)


The Godfather
Goodfellas
Pulp Fiction
The Usual Suspects
Apocalypse Now
Trainspotting
Fight Club
Schindler's List
Boogie Nights
Reservoir Dogs
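
A side note on the class lookup used above: when the class value passed to `find_all` contains a space, BeautifulSoup compares it against the exact string of the tag's `class` attribute, so the order of the classes matters. A CSS selector via `select` matches the classes in any order. The small sketch below illustrates the difference on a made-up HTML fragment (the three movie names and the `mode-simple` class are placeholders, not taken from the IMDb page).

```python
from bs4 import BeautifulSoup

# Made-up fragment: same two classes in both orders, plus a non-matching div.
demo_html = """
<div class="lister-item mode-detail"><h3 class="lister-item-header"><a>Movie A</a></h3></div>
<div class="mode-detail lister-item"><h3 class="lister-item-header"><a>Movie B</a></h3></div>
<div class="lister-item mode-simple"><h3 class="lister-item-header"><a>Movie C</a></h3></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Exact-string match on the class attribute: class order must be identical.
exact = demo_soup.find_all("div", {"class": "lister-item mode-detail"})
# CSS selector: the tag must carry both classes, in any order.
by_css = demo_soup.select("div.lister-item.mode-detail")

exact_titles = [div.h3.a.get_text() for div in exact]
css_titles = [div.h3.a.get_text() for div in by_css]
print(exact_titles)
print(css_titles)
```

Either approach works as long as the page keeps the classes in the same order; the selector form is a little more tolerant if that ever changes.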


### Step 4 - Extract and print all the information we need for the first movie

In [58]:
for movie in movies:
  movie_rank = movie.find('span', {'class' : "lister-item-index unbold text-primary"}).get_text()[:-1]
  movie_title = movie.find('h3', {"class" : "lister-item-header"}).a.get_text()
  movie_year = movie.find('span', {'class' : "lister-item-year text-muted unbold"}).get_text()[1:-1]
  movie_runtime = movie.find('span', {'class' : "runtime"}).get_text()
  movie_rating = movie.find('span', {'class' : "ipl-rating-star__rating"}).get_text()

  # To find the gross earnings of the movie, we first find the tag whose content is 'Gross:' and then go to its next sibling
  movie_earning = movie.find('span', string="Gross:").find_next_sibling().get_text()

  # Directors and stars share one tag: links before the separator span are directors, links after it are stars
  movie_directors = []
  movie_stars = []
  star = 0
  cast_tag = movie.find(string=re.compile("Director")).parent # This extracts the tag that has 'Director' anywhere in its content

  for tag in cast_tag.find_all(['a','span']):
    if tag.name == 'span':
      star = 1
      continue
      
    if star == 0:
      movie_directors.append(tag.get_text())
    else:
      movie_stars.append(tag.get_text())

  print(movie_rank)
  print(movie_title)
  print(movie_year)
  print(movie_runtime)
  print(movie_earning)
  print(movie_directors)
  print(movie_stars)
  
  break

1
The Godfather
1972
175 min
$134.97M
['Francis Ford Coppola']
['Marlon Brando', 'Al Pacino', 'James Caan', 'Diane Keaton']
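
The values above are still strings such as `175 min` and `$134.97M`. For later analysis it may help to turn them into numbers, and to tolerate missing fields: if a movie has no `Gross:` tag, `movie.find(...)` returns `None` and the chained call raises an `AttributeError`. The helpers below are a minimal sketch of that cleanup; their names are hypothetical, and the accepted formats are assumed from the single sample shown here.

```python
import re

def parse_year(text):
    """'1972' or '(1972)' -> 1972; None or no four-digit year -> None."""
    if not text:
        return None
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

def parse_runtime_minutes(text):
    """'175 min' -> 175; None or unrecognised text -> None."""
    if not text:
        return None
    match = re.search(r"(\d+)\s*min", text)
    return int(match.group(1)) if match else None

def parse_gross_millions(text):
    """'$134.97M' -> 134.97 (millions of dollars); None or unrecognised text -> None."""
    if not text:
        return None
    match = re.fullmatch(r"\$(\d[\d,]*(?:\.\d+)?)M", text.strip())
    return float(match.group(1).replace(",", "")) if match else None

print(parse_year("1972"))
print(parse_runtime_minutes("175 min"))
print(parse_gross_millions("$134.97M"))
print(parse_gross_millions(None))
```

To use these with a possibly missing tag, look the tag up first and pass `None` when it is absent, for example `tag.get_text() if tag is not None else None`.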


### Step 5 - Loop through the page to extract the required info for all the movies on the page and store it in the appropriate data frames