# Scraping Top Rated Action Movies on IMDb using Python

![](https://i.imgur.com/xeA9PoN.png)

IMDb (an abbreviation of Internet Movie Database) is an online database of information related to films, television series, home videos, video games, and streaming content. For example, the page for [The Dark Knight](https://www.imdb.com/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=Y7G20595TZFHGJGMZZ06&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3) gives all the details about the movie, including reviews from the audience.

The page [Top Action Movies](https://www.imdb.com/search/title/?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=T3XD6BSDFKW8NJ3SCYSQ&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1) lists the top-rated action movies from around the world, which helps audiences find quality movies to watch. In this project, we'll retrieve information from this page using _web scraping_: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries [requests](https://requests.readthedocs.io/en/latest/) and [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape data from the page.


Here's an outline of the steps we will follow:

1. Download the page using the `requests` library.
2. Parse the HTML code using the `BeautifulSoup4` library.
3. Extract the title, release year and rating of each movie from the page.
4. Compile the extracted information into Python lists and dictionaries.
5. Extract data from multiple pages.
6. Save the extracted information to a CSV file.


By the end of the project, we'll create a CSV file in the following format:
```
Movie_name,Release_Year,Rating
The Dark Knight,(2008),9.0
```

## How to Run the Code

In order to execute the code, please use the "Run" button at the top of this page and select "Run on Binder". You can edit the notebook and save a personal version to Jovian by executing the cells below:

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="imbd-top-action-movies")

[jovian] Updating notebook "girikunal57/imbd-top-action-movies" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/girikunal57/imbd-top-action-movies


'https://jovian.ai/girikunal57/imbd-top-action-movies'

### Identify the webpages
All the information we need about each movie is on IMDb's [search results page](https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt%27). The results are split across several pages of 50 movies each, and the `start` parameter in the URL selects the first movie shown on a page. We'll first scrape a single page and then extend the code to multiple pages.
![](https://i.imgur.com/Xb7LXlB.png)

Later on, we'll create a function that builds the URLs of the other result pages.

## Download the Webpage using requests

We can use the ***requests*** library to download the web page.

In [4]:
!pip install requests --upgrade --quiet

In [5]:
import requests

The library is now installed and imported.
To download the web page, we can use the `get` function from `requests`, which returns a response object.

In [6]:
topic_url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt%27'

In [7]:
response = requests.get(topic_url)

In [8]:
type(response)

requests.models.Response

To check whether the request was successful, we can use the `.status_code` property of the response object. A status code between 200 and 299 means the request succeeded.

In [9]:
response.status_code

200
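The 200 to 299 rule described above can be written as a small helper. This is only a sketch: `is_success` is a hypothetical name used for illustration and is not part of `requests`. The library itself also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# 200 (OK) counts as success; 404 (Not Found) and 503 (Service Unavailable) do not.
for code in (200, 404, 503):
    print(code, is_success(code))
```

With the response object from the cell above, the check would read `is_success(response.status_code)`.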

The request was successful. Now we can get the contents of the page using `response.text`.

In [10]:
Page_content = response.text

Let us check the number of characters on the page.

In [11]:
len(Page_content)

420795

The page contains over 420,000 characters. Here are the first 1000 characters of the page:

In [12]:
Page_content[:1000]

'\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Feature Film,\nRating Count at least 25,000,\nAction\n(Sorted by IMDb Rating Descending) - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == \'function\') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n\n        <link rel="

We can write the page content to a file, which then allows us to view the page locally within Jupyter using "File > Open". Once you open the file, you will see the same page; the only difference is that the links will not work.

In [13]:
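# Pass an explicit encoding so the write does not depend on the platform default
with open('Top_Rated_Action_Movies.html', 'w', encoding='utf-8') as file:
    file.write(Page_content)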

### BeautifulSoup to parse the HTML source code
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

*Let's install and import the Beautiful Soup library.*

In [14]:
!pip install beautifulsoup4 --upgrade --quiet

In [15]:
from bs4 import BeautifulSoup

Here we will create a BeautifulSoup object, **doc**, which will contain the parsed content of the page.

In [16]:
doc = BeautifulSoup(response.text,'html.parser')

Let's check the type of `doc`:

In [17]:
type(doc)

bs4.BeautifulSoup

BeautifulSoup has several properties and methods which help in extracting the data. For example, we can use `doc.find('title')` (or the equivalent `doc.title` property) to get the `<title>` tag of the page.

In [18]:
doc.find('title')

<title>Feature Film,
Rating Count at least 25,000,
Action
(Sorted by IMDb Rating Descending) - IMDb</title>

### Let us create a function `Get_imdb_Pages_Url` that returns the links of all the pages we want to scrape

In [19]:
Base_Url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start='

def Get_imdb_Pages_Url(Base_Url):
    URL_List = []
    # Each results page shows 50 movies, so `start` takes the values 1, 51, 101, ...
    # Note: range(1, 25000, 50) produces 500 URLs.
    for i in range(1, 25000, 50):
        url = Base_Url + str(i) + "&ref_=adv_nxt%27"
        URL_List.append(url)
    return URL_List

In [20]:
Get_imdb_Pages_Url(Base_Url)

['https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=1&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=51&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=101&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=151&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=201&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=251&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start=301&ref_=adv_nxt%27',
 'https://www.imdb.com/search/title/?title_t
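As an alternative to string concatenation, the page URLs can also be built from a dictionary of query parameters with the standard library's `urllib.parse.urlencode`. The sketch below is an illustration rather than the notebook's own method: `build_page_urls`, `SEARCH_URL`, `num_pages` and `page_size` are hypothetical names, and the trailing `ref_` parameter is left out.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://www.imdb.com/search/title/'


def build_page_urls(num_pages, page_size=50):
    """Build the URLs of the first `num_pages` result pages."""
    urls = []
    for page in range(num_pages):
        query = {
            'title_type': 'feature',
            'num_votes': '25000,',
            'genres': 'action',
            'sort': 'user_rating,desc',
            'start': 1 + page * page_size,  # 1, 51, 101, ...
        }
        # safe=',' keeps the commas readable instead of encoding them as %2C
        urls.append(SEARCH_URL + '?' + urlencode(query, safe=','))
    return urls


for url in build_page_urls(3):
    print(url)
```

Passing the number of pages as an argument makes it easy to try the scraper on two or three pages before requesting many more.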

### Now we can create a function `Parse_Url` to download and parse a web page

In [21]:
def Parse_Url(Base_Url):
    # Download the page
    response = requests.get(Base_Url)
    # Stop early if the request was not successful
    if response.status_code != 200:
        raise Exception("error in getting the page {}".format(Base_Url))
    # Parse the HTML source into a BeautifulSoup object
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc

## Extract Movie Name, Release Year and Rating

### Extracting Movie Names
![](https://i.imgur.com/CkQLevN.png)

We extract the movie names by first selecting the `h3` tags with the class `lister-item-header`. Inside each of them there is an `a` tag which contains the movie name.

In [22]:
def Get_Movie_names(doc):
    Movie_names = []

    # Find all h3 tags with the class "lister-item-header"
    Movie_name_tags = doc.find_all('h3', class_='lister-item-header')

    for tag in Movie_name_tags:
        # The movie name is inside the 'a' tag
        titles = tag.find('a')
        # Store each movie name in Movie_names
        Movie_names.append(titles.text)

    return Movie_names

In [23]:
movie_name = Get_Movie_names(doc)  # store the movie names

In [24]:
movie_name[:6]

['The Dark Knight',
 'The Lord of the Rings: The Return of the King',
 'Inception',
 'The Lord of the Rings: The Two Towers',
 'The Lord of the Rings: The Fellowship of the Ring',
 'The Matrix']

In [25]:
len(movie_name) # as we can see there are 50 movies on one page

50
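To see the same `find_all` and `find` logic without downloading anything, it can be run on a tiny HTML fragment. The fragment below is made up for illustration and only imitates the structure described above; `sample_html`, `sample_doc` and `sample_names` are hypothetical names.

```python
from bs4 import BeautifulSoup

# A made-up fragment that imitates the structure of the movie headers
sample_html = """
<h3 class="lister-item-header">
  <span class="lister-item-index">1.</span>
  <a href="#">The Dark Knight</a>
  <span class="lister-item-year text-muted unbold">(2008)</span>
</h3>
<h3 class="lister-item-header">
  <span class="lister-item-index">2.</span>
  <a href="#">Inception</a>
  <span class="lister-item-year text-muted unbold">(2010)</span>
</h3>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

sample_names = []
for header in sample_doc.find_all('h3', class_='lister-item-header'):
    link = header.find('a')  # the first <a> inside the header holds the title
    sample_names.append(link.text.strip())

print(sample_names)
```

Working on a small fragment like this is a quick way to check a selector before running it on the full page.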

### Extracting Movie Release Year

We extract the release year of each movie by selecting the `span` tags with the class `lister-item-year text-muted unbold`.

![](https://i.imgur.com/bmUTFhP.png)

In [26]:
def Get_release_year(doc):
    # Find all span tags that hold the release year
    Movie_year_tags = doc.find_all('span', class_='lister-item-year text-muted unbold')
    Movie_year = []

    for tag in Movie_year_tags:
        Movie_year.append(tag.text)

    return Movie_year

In [27]:
movie_year = Get_release_year(doc)

In [28]:
len(movie_year)

50

In [29]:
movie_year[:5]

['(2008)', '(2003)', '(2010)', '(2002)', '(2001)']
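The years come back as strings wrapped in parentheses. If numeric years are needed later (for sorting or counting, say), a small helper can pull out the four digits. This is a sketch under assumptions: `clean_year` is a hypothetical name, and the use of a regular expression assumes that some entries may carry extra text around the year, such as `'(I) (2019)'`.

```python
import re


def clean_year(raw_year):
    """Return the four-digit year in a string such as '(2008)' as an int, or None."""
    match = re.search(r'\d{4}', raw_year)
    return int(match.group()) if match else None


print([clean_year(y) for y in ['(2008)', '(2003)', '(2010)']])
```

Returning `None` instead of raising an error keeps one unexpected value from stopping a long scraping run.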

### Extracting Movie Rating
We extract the rating of each movie by selecting the `div` tags with the class `inline-block ratings-imdb-rating`.

![](https://i.imgur.com/NTOa24V.png)

In [30]:

def Get_movie_rating(doc):
    # Find all div tags that hold the IMDb rating
    Movie_ratings_tags = doc.find_all('div', class_='inline-block ratings-imdb-rating')
    Movie_ratings = []

    for tag in Movie_ratings_tags:
        # Remove the newlines that surround the rating text
        Movie_ratings.append(tag.text.replace('\n', ""))

    return Movie_ratings

In [31]:
movie_rating = Get_movie_rating(doc)


In [32]:
len(movie_rating)

50

In [33]:
movie_rating[:5]

['9.0', '9.0', '8.8', '8.8', '8.8']

## Let us import Pandas

[Pandas](https://pandas.pydata.org/docs/) is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

In [34]:
import pandas as pd

**Let us create a function that scrapes all the pages, stores the extracted information in a Python dictionary and saves it to a CSV file**

In [35]:
def Scrape_Top_Rated_Movies(Base_Url):
    urls = Get_imdb_Pages_Url(Base_Url)
    Movie_Dict = {'Movie_name':[],'Release_Year':[],'Rating':[]}
    for url in urls:
        doc =Parse_Url(url)
        Movie_Dict['Movie_name'] += Get_Movie_names(doc)
        Movie_Dict['Release_Year'] += Get_release_year(doc)
        Movie_Dict['Rating'] += Get_movie_rating(doc)
       
    Movies_df = pd.DataFrame(Movie_Dict)
    Movies_df.to_csv('IMDBtopaction.csv',index = False)
    return Movie_Dict
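The function above fills three independent lists, and `pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if a movie on some page had no rating block. One possible way to guard against this is to walk through each movie's container and read all three fields from it. The sketch below runs on a made-up fragment; the container class `lister-item-content` is an assumption about the page's markup, and `text_or_none` and `get_movie_rows` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second movie has no rating block.
sample_html = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">Movie A</a>
    <span class="lister-item-year text-muted unbold">(2001)</span></h3>
  <div class="inline-block ratings-imdb-rating"><strong>8.8</strong></div>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">Movie B</a>
    <span class="lister-item-year text-muted unbold">(2002)</span></h3>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def get_movie_rows(doc):
    """Return one dict per movie so that name, year and rating stay together."""
    rows = []
    # Assumes every container has a header; only year and rating may be missing
    for item in doc.find_all('div', class_='lister-item-content'):
        header = item.find('h3', class_='lister-item-header')
        rows.append({
            'Movie_name': text_or_none(header.find('a')),
            'Release_Year': text_or_none(item.find('span', class_='lister-item-year')),
            'Rating': text_or_none(item.find('div', class_='ratings-imdb-rating')),
        })
    return rows


rows = get_movie_rows(BeautifulSoup(sample_html, 'html.parser'))
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, and a missing rating then shows up as an empty value instead of shifting the other columns.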

### Now we write all the extracted information into a CSV file using a Pandas DataFrame

In [36]:
Movies_df = pd.DataFrame(Scrape_Top_Rated_Movies(Base_Url))
Movies_df.to_csv('IMDBtopaction.csv',index = False)
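To make the resulting file format concrete, here is a small sketch that writes the same kind of dictionary with Python's built-in `csv` module instead of Pandas. It is an illustration, not the notebook's method: `movie_dict` holds two rows taken from the results, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import csv
import io

movie_dict = {
    'Movie_name': ['The Dark Knight', 'Lock, Stock and Two Smoking Barrels'],
    'Release_Year': ['(2008)', '(1998)'],
    'Rating': ['9.0', '8.2'],
}

buffer = io.StringIO()  # stands in for a real file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(movie_dict.keys())            # header row
writer.writerows(zip(*movie_dict.values()))   # one row per movie

csv_text = buffer.getvalue()
print(csv_text)
```

Note that a title containing a comma is wrapped in double quotes, so the file still has exactly three columns per row.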

In [37]:
Movies_df

| | Movie_name | Release_Year | Rating |
|---|---|---|---|
| 0 | The Dark Knight | (2008) | 9.0 |
| 1 | The Lord of the Rings: The Return of the King | (2003) | 9.0 |
| 2 | Inception | (2010) | 8.8 |
| 3 | The Lord of the Rings: The Two Towers | (2002) | 8.8 |
| 4 | The Lord of the Rings: The Fellowship of the Ring | (2001) | 8.8 |
| ... | ... | ... | ... |
| 16670 | Batman Begins | (2005) | 8.2 |
| 16671 | Kill Bill: Vol. 1 | (2003) | 8.2 |
| 16672 | Lock, Stock and Two Smoking Barrels | (1998) | 8.2 |
| 16673 | Jurassic Park | (1993) | 8.2 |


## The Complete Code of the Project

In [38]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


Base_Url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=action&sort=user_rating,desc&start='

def Get_imdb_Pages_Url(Base_Url):
    # Build the URL of every results page (50 movies per page)
    URL_List = []
    for i in range(1, 25000, 50):
        urls = Base_Url + str(i) + "&ref_=adv_nxt%27"
        URL_List.append(urls)
    return URL_List


def Parse_Url(Base_Url):
    # Download a page and parse it into a BeautifulSoup object
    response = requests.get(Base_Url)
    if response.status_code != 200:
        raise Exception("error in getting the page {}".format(Base_Url))
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc


def Get_Movie_names(doc):
    Movie_names = []

    # Find all h3 tags with the class "lister-item-header"
    Movie_name_tags = doc.find_all('h3', class_='lister-item-header')

    for tag in Movie_name_tags:
        # The movie name is inside the 'a' tag
        titles = tag.find('a')
        Movie_names.append(titles.text)

    return Movie_names


def Get_release_year(doc):
    Movie_year_tags = doc.find_all('span', class_='lister-item-year text-muted unbold')
    Movie_year = []

    for tag in Movie_year_tags:
        Movie_year.append(tag.text)

    return Movie_year


def Get_movie_rating(doc):
    Movie_ratings_tags = doc.find_all('div', class_='inline-block ratings-imdb-rating')
    Movie_ratings = []

    for tag in Movie_ratings_tags:
        Movie_ratings.append(tag.text.replace('\n', ""))

    return Movie_ratings


def Scrape_Top_Rated_Movies(Base_Url):
    urls = Get_imdb_Pages_Url(Base_Url)
    Movie_Dict = {'Movie_name': [], 'Release_Year': [], 'Rating': []}

    # Collect the data of every page into one dictionary
    for url in urls:
        doc = Parse_Url(url)
        Movie_Dict['Movie_name'] += Get_Movie_names(doc)
        Movie_Dict['Release_Year'] += Get_release_year(doc)
        Movie_Dict['Rating'] += Get_movie_rating(doc)

    # Save the combined data to a CSV file
    Movies_df = pd.DataFrame(Movie_Dict)
    Movies_df.to_csv('IMDBtopaction.csv', index=False)
    return Movie_Dict

In [39]:
All_Movies = Scrape_Top_Rated_Movies(Base_Url)

Let's look at all the extracted information:

In [40]:
All_Movies

{'Movie_name': ['The Dark Knight',
  'The Lord of the Rings: The Return of the King',
  'Inception',
  'The Lord of the Rings: The Two Towers',
  'The Lord of the Rings: The Fellowship of the Ring',
  'The Matrix',
  'Star Wars: Episode V - The Empire Strikes Back',
  'Terminator 2: Judgment Day',
  'Star Wars',
  'Harakiri',
  'Seven Samurai',
  'Kaithi',
  'Asuran',
  'Sita Ramam',
  'Gladiator',
  'Léon: The Professional',
  'Vikram',
  'Spider-Man: Into the Spider-Verse',
  'Avengers: Endgame',
  'Avengers: Infinity War',
  'Top Gun: Maverick',
  'The Dark Knight Rises',
  'K.G.F: Chapter 2',
  'Shershaah',
  'Oldboy',
  'Princess Mononoke',
  'Aliens',
  'Raiders of the Lost Ark',
  'Vikram Vedha',
  'Dangal',
  'Spider-Man: No Way Home',
  'Heat',
  'Star Wars: Episode VI - Return of the Jedi',
  'North by Northwest',
  'Major',
  '1917',
  'Uri: The Surgical Strike',
  'K.G.F: Chapter 1',
  'The Mountain II',
  'Baahubali 2: The Conclusion',
  'Gangs of Wasseypur',
  'Paan Singh

## Summary

#### Here's what we have covered

1. Identified the web pages to scrape.
2. Downloaded the web pages using the `requests` library.
3. Used BeautifulSoup to parse the HTML source code.
4. Extracted the name, release year and rating of each movie.
5. Collected the extracted data into Python lists.
6. Extracted and combined data from multiple pages.
7. Created a CSV file with all the information extracted in the steps above.

### Future Work

We can now explore this data further to draw meaningful information out of it.

With further analysis of the data, we can look for answers to questions like:
* Which actor has worked in the most top-rated movies across the world?
* What is the URL of each movie's page?
* Which director has directed the most top-rated movies?
* Which year gave us the most top-rated movies to date?

And the list goes on.

> In the future, I would like to make this dataset even richer with data from other lists created by IMDb, like `Most Trending Movies`, `Top Rated Indian Movies`, `Lowest Rated Movies`, etc.
I would then like to analyse the entire data to learn a lot more about movies than I currently know.

