### ***Learn how to scrape a list of the top 300 movies from Rotten Tomatoes using BeautifulSoup and the Requests library. Extract the Title, Release_Year, Rating, Director, Synopsis, Critics_Consensus, Cast, and movie URL for each movie and create a CSV file for data analysis.***

# Scraping Rotten Tomatoes Top 300 Movies to Watch using BeautifulSoup and Requests
#### *DataSource: [Rotten Tomatoes](https://editorial.rottentomatoes.com/guide/essential-movies-to-watch-now/)*
![](https://imgur.com/zAlLO5T.png)

### *About Rotten Tomatoes:*
Rotten Tomatoes is an American review-aggregation website for film and television. The company was launched in August 1998 by three undergraduate students at the University of California, Berkeley: Senh Duong, Patrick Y. Lee, and Stephen Wang. Although the name "Rotten Tomatoes" connects to the practice of audiences throwing rotten tomatoes in disapproval of a poor stage performance, the direct inspiration for the name from Duong, Lee, and Wang came from an equivalent scene in the 1992 Canadian film Léolo.

### *What is HTML?*
The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.
![](https://imgur.com/sE5cxN0.png)                
### *What is Web Scraping?*
Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
![](https://imgur.com/Auc1eO6.png)
### *Why is the Python programming language used for web scraping?*
Python is one of the most popular programming languages for web scraping. Its ease of use, short development time, and large ecosystem of third-party libraries such as Requests and Beautiful Soup make it well suited to scraping the web.

### Objective
The objective of the project is to scrape the following data from the webpage mentioned above and create a CSV file from it: Title, Movie_Release, Movie_Rating, Director, Synopsis, Critics_Consensus, Cast, and Movie URL.
![](https://imgur.com/HBxRUjP.png)

### *This is an outline of the steps that we followed:*
1. Download the web page using "requests"
2. Parse the HTML code using "BeautifulSoup"
3. Extract data from the web page: 'Title', 'Release_year', 'Rating', 'Director', 'Synopsis', 'Critics_Consensus', 'Cast', 'url'
4. Compile the extracted information into Python dictionaries
5. Save the extracted information into a CSV file
6. The end result is a CSV file with one row per movie and the columns listed in step 3
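Steps 4 and 5 can be sketched with the standard library alone. This is only an illustration, not the code used in the notebook: the two records are typed in by hand from values that appear in the scraped output, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io

# Two hand-written records; in the project these values come from the scraper.
movies = [
    {"Title": "12 Angry Men", "Release_year": "(1957)", "Rating": "100%"},
    {"Title": "2001: A Space Odyssey", "Release_year": "(1968)", "Rating": "92%"},
]

# Write the dictionaries as CSV rows into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Release_year", "Rating"])
writer.writeheader()
writer.writerows(movies)

csv_text = buffer.getvalue()
print(csv_text)
```

Each dictionary becomes one row, and the `fieldnames` list fixes the column order.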

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
# Execute this to save versions of the notebook
import jovian
jovian.commit(project="web-scraping-project_Top300_Movies")

[jovian] Updating notebook "richardsamuelvincentpaul/web-scraping-project-top300-movies" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/richardsamuelvincentpaul/web-scraping-project-top300-movies


'https://jovian.com/richardsamuelvincentpaul/web-scraping-project-top300-movies'

# *Use the requests library to download web pages*  
### *Requests:* 
Requests is a Python library for making HTTP requests. It provides a simple, uniform interface for sending requests to websites and reading their responses.
![](https://imgur.com/rQnaZle.png)

In [3]:
!pip install requests --upgrade --quiet

In [4]:
#Import Requests Library
import requests
#Import Pandas Library to create dataframe
import pandas as pd

In [5]:
base_url="https://editorial.rottentomatoes.com/guide/essential-movies-to-watch-now/"

In [6]:
#Downloading the web page using requests.get function 
response = requests.get(base_url)

In [7]:
response.status_code #the response is successful

200
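A status code of 200 means the download succeeded, but here it is only checked by eye. A minimal sketch of a more defensive download follows; the helper name `download_page` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def download_page(url, timeout=10):
    """Return the HTML of `url` as text (hypothetical helper)."""
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, `page_contents = download_page(base_url)` would either return the page text or stop with an error instead of silently continuing with an error page.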

In [8]:
page_contents = response.text

In [9]:
len(response.text)

578744

In [10]:
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en-US" class="hitim">\n<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">\n    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n    \n    <!-- OneTrust Cookies Consent Notice start for rottentomatoes.com -->\n    <script src="https://cdn.cookielaw.org/consent/7e979733-6841-4fce-9182-515fac69187f/otSDKStub.js"\n        type="text/javascript"\n        charset="UTF-8"\n        data-domain-script="7e979733-6841-4fce-9182-515fac69187f"\n        integrity="sha384-WEHwEli88wqOiQd913F1utFZiwisa8XhCkbjLnbKEpFa/WbFcPKeGg7h4fdsv0Z/"\n        crossorigin="anonymous" >\n    </script>\n    <script type="text/javascript">\n        function OptanonWrapper() { }\n    </script>\n    <!-- OneTrust Cookies Consent Notice end for rottentomatoes.com -->\n    <!-- OneTrust IAB US Privacy (USP) -->\n    <script src="https://cdn.cookielaw.org/opt-out/otCCPAiab.js"\n        type="text/javascript"\n        charset="U

In [11]:
# Write the page contents to an HTML file (UTF-8, so titles such as 'Amélie' are saved correctly on any platform)
with open('top_150_movies_final.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)
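Saving the page makes it possible to work on the parsing code later without downloading the page again. The sketch below shows the write-then-read round trip on a small stand-in string; the file name `sample_page.html` and the temporary folder are assumptions used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

# Stand-in for page_contents; the accented title is the kind of text that needs UTF-8.
sample_html = "<html><body><h2>Amélie</h2></body></html>"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample_page.html"
    path.write_text(sample_html, encoding="utf-8")
    # Reading with the same encoding returns exactly what was written.
    cached_html = path.read_text(encoding="utf-8")

print(cached_html == sample_html)
```

Using the same explicit encoding for writing and reading avoids depending on the platform's default encoding.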

# *Use Beautiful Soup to parse and extract information*
### *Beautiful Soup :*
Beautiful Soup is a Python library for parsing HTML and XML pages. Once a page is parsed, you can search, navigate, and modify its tree of tags, which saves a lot of time compared with processing the raw text by hand. In this article we will learn how to scrape data using Beautiful Soup.
![](https://imgur.com/KwrNV3P.png)

In [12]:
# Installing the BeautifulSoup-4 library
!pip install beautifulsoup4 --upgrade --quiet

In [13]:
# Import the BeautifulSoup library
from bs4 import BeautifulSoup

In [14]:
# Converting the page to  Beautiful soup document using html.parser
doc = BeautifulSoup(page_contents, 'html.parser')

In [15]:
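doc = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed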

In [16]:
type(doc)

bs4.BeautifulSoup

### Webpage to scrape: Rotten Tomatoes
![](https://imgur.com/nMtq6ul.png)

### *Create functions to scrape the Title, Rating, Year of Release, Director, Cast, URL, Synopsis, and Critics Consensus from the webpage*

**1. Title from the webpage**

In [17]:
h2_tags = doc.find_all('h2')

In [18]:
h2_tags[:-1]

[<h2><a href="https://www.rottentomatoes.com/m/1000013_12_angry_men">12 Angry Men</a> <span class="subtle start-year">(1957)</span> <span class="icon tiny certified" title="Certified Fresh"></span> <span class="tMeterScore">100%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/2001_a_space_odyssey">2001: A Space Odyssey</a> <span class="subtle start-year">(1968)</span> <span class="icon tiny certified" title="Certified Fresh"></span> <span class="tMeterScore">92%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/400_blows">The 400 Blows</a> <span class="subtle start-year">(1959)</span> <span class="icon tiny certified" title="Certified Fresh"></span> <span class="tMeterScore">99%</span></h2>,
 <h2><a href="https://www.rottentomatoes.com/m/adventures_of_priscilla_queen_of_the_desert">The Adventures of Priscilla, Queen of the Desert</a> <span class="subtle start-year">(1994)</span> <span class="icon tiny certified" title="Certified Fresh"></span> <span class="tMeterS

![](https://imgur.com/cAfu9NU.png)
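Each `<h2>` in the output above already carries the title, link, year, and score of one movie. As a small illustration, the sketch below parses the first `<h2>` copied from that output and pulls the four values out of it; the variable names are chosen here for the example.

```python
from bs4 import BeautifulSoup

# The first <h2> from the output above, copied as a string.
snippet = (
    '<h2><a href="https://www.rottentomatoes.com/m/1000013_12_angry_men">12 Angry Men</a> '
    '<span class="subtle start-year">(1957)</span> '
    '<span class="icon tiny certified" title="Certified Fresh"></span> '
    '<span class="tMeterScore">100%</span></h2>'
)

h2 = BeautifulSoup(snippet, "html.parser").h2

title = h2.a.text                                    # text inside the <a> tag
url = h2.a["href"]                                   # value of the href attribute
year = h2.find("span", class_="start-year").text     # matches one of the span's classes
score = h2.find("span", class_="tMeterScore").text

print(title, year, score, url)
```

Searching by `class_` matches a tag when any one of its classes equals the given name, which is why `"start-year"` finds the span whose full class attribute is `"subtle start-year"`.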

In [19]:
a_tags = doc.find_all('a')

In [20]:
a_tags[45].text

'Sidney Lumet'

In [21]:
# Define a function to get movie titles from the webpage
def get_movie_titles(doc):
    movie_tag = doc.find_all('div', {'class' :'article_movie_title' })

    movie_titles = []
    for movie_title_tag in movie_tag:
        movie_titles.append(movie_title_tag('a')[0].text)

    return movie_titles

In [22]:
get_movie_titles(doc)

['12 Angry Men',
 '2001: A Space Odyssey',
 'The 400 Blows',
 'The Adventures of Priscilla, Queen of the Desert',
 'The Adventures of Robin Hood',
 'Aguirre: The Wrath of God',
 'Airplane!',
 'Akira',
 'Alien',
 'Aliens',
 'All About Eve',
 'All About My Mother',
 "All the President's Men",
 'Almost Famous',
 'Amadeus',
 'Amélie',
 'Amour',
 'An American in Paris',
 'Annie Hall',
 'The Apartment',
 'Apocalypse Now',
 'Avengers: Endgame',
 'Back to the Future',
 'Badlands',
 'Beauty and the Beast',
 'Being John Malkovich',
 'Being There',
 'The Best Years of Our Lives',
 'Better Luck Tomorrow',
 'Bicycle Thieves',
 'The Big Lebowski',
 'The Big Sick',
 'Birdman or (The Unexpected Virtue of Ignorance)',
 'Black Hawk Down',
 'Black Orpheus',
 'Black Panther',
 'Blade Runner',
 'Blazing Saddles',
 'Boogie Nights',
 "Boys Don't Cry",
 'Boyz N the Hood',
 'The Breakfast Club',
 'Breathless',
 'Bridesmaids',
 'The Bridge on the River Kwai',
 "Bridget Jones's Diary",
 'Broadcast News',
 'Broke

In [23]:
len(get_movie_titles(doc))

150

In [24]:
movie_titles = get_movie_titles(doc)

**2. Rating from the webpage**

In [25]:
rating_tags= doc.find_all('span' ,class_= "tMeterScore")

![](https://imgur.com/TgMorU6.png)

In [26]:
rating_tags[5].text

'96%'

In [27]:
# Define a function to get movie ratings from the webpage
def get_movie_rating(doc):
    rating_tags= doc.find_all('span' ,class_= "tMeterScore")
    movie_rating =[]
    for rating_tag in rating_tags:
        movie_rating.append(rating_tag.text)
    return movie_rating            

In [28]:
get_movie_rating(doc)

['100%',
 '92%',
 '99%',
 '94%',
 '100%',
 '96%',
 '97%',
 '91%',
 '98%',
 '98%',
 '99%',
 '98%',
 '94%',
 '89%',
 '89%',
 '89%',
 '93%',
 '95%',
 '97%',
 '93%',
 '98%',
 '94%',
 '93%',
 '97%',
 '93%',
 '94%',
 '95%',
 '97%',
 '81%',
 '99%',
 '80%',
 '98%',
 '91%',
 '77%',
 '87%',
 '96%',
 '89%',
 '90%',
 '94%',
 '90%',
 '96%',
 '89%',
 '96%',
 '90%',
 '96%',
 '80%',
 '98%',
 '88%',
 '90%',
 '92%',
 '96%',
 '94%',
 '94%',
 '99%',
 '94%',
 '92%',
 '99%',
 '91%',
 '90%',
 '99%',
 '95%',
 '90%',
 '88%',
 '81%',
 '97%',
 '91%',
 '95%',
 '98%',
 '94%',
 '93%',
 '95%',
 '92%',
 '84%',
 '94%',
 '92%',
 '84%',
 '96%',
 '94%',
 '97%',
 '98%',
 '94%',
 '93%',
 '84%',
 '91%',
 '99%',
 '84%',
 '89%',
 '92%',
 '92%',
 '88%',
 '92%',
 '88%',
 '84%',
 '97%',
 '94%',
 '78%',
 '78%',
 '79%',
 '91%',
 '96%',
 '96%',
 '94%',
 '96%',
 '76%',
 '91%',
 '89%',
 '82%',
 '92%',
 '98%',
 '95%',
 '95%',
 '92%',
 '97%',
 '96%',
 '93%',
 '99%',
 '97%',
 '96%',
 '92%',
 '97%',
 '100%',
 '77%',
 '94%',
 '94%',
 '92%

In [29]:
movie_ratings=get_movie_rating(doc)


**3. Year of Release from the webpage**

In [31]:
release_tags =doc.find_all('span', class_= "subtle start-year")

![](https://imgur.com/NwlgYh0.png)

In [32]:
release_tags[5].text

'(1972)'

In [33]:
# Define a function to get movie release years from the webpage
def get_movie_release(doc):
    release_tags =doc.find_all('span', class_= "subtle start-year")
    movie_release =[]
    for release_tag in release_tags:   
        movie_release.append(release_tag.text)
    return movie_release            

In [34]:
get_movie_release(doc)

['(1957)',
 '(1968)',
 '(1959)',
 '(1994)',
 '(1938)',
 '(1972)',
 '(1980)',
 '(1988)',
 '(1979)',
 '(1986)',
 '(1950)',
 '(1999)',
 '(1976)',
 '(2000)',
 '(1984)',
 '(2001)',
 '(2012)',
 '(1951)',
 '(1977)',
 '(1960)',
 '(1979)',
 '(2019)',
 '(1985)',
 '(1973)',
 '(1991)',
 '(1999)',
 '(1979)',
 '(1946)',
 '(2002)',
 '(1948)',
 '(1998)',
 '(2017)',
 '(2014)',
 '(2001)',
 '(1959)',
 '(2018)',
 '(1982)',
 '(1974)',
 '(1997)',
 '(1999)',
 '(1991)',
 '(1985)',
 '(1959)',
 '(2011)',
 '(1957)',
 '(2001)',
 '(1987)',
 '(2005)',
 '(1969)',
 '(1972)',
 '(1919)',
 '(2017)',
 '(2015)',
 '(1942)',
 '(2006)',
 '(2006)',
 '(1974)',
 '(2002)',
 '(1988)',
 '(1941)',
 '(1931)',
 '(1994)',
 '(1971)',
 '(1995)',
 '(2017)',
 '(2018)',
 '(2015)',
 '(2000)',
 '(2008)',
 '(1978)',
 '(1951)',
 '(1993)',
 '(1989)',
 '(1988)',
 '(1989)',
 '(1965)',
 '(1975)',
 '(1973)',
 '(1944)',
 '(1964)',
 '(1931)',
 '(2011)',
 '(1994)',
 '(1933)',
 '(1982)',
 '(1969)',
 '(1990)',
 '(1999)',
 '(1980)',
 '(1973)',
 '(2004)',

In [35]:
movie_releases=get_movie_release(doc)
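The release years come back as strings wrapped in parentheses, and the ratings from the previous section as strings ending in `%`. For analysis it is often easier to work with numbers. A minimal sketch of such a conversion follows; the helper names `parse_year` and `parse_rating` are hypothetical and are not used elsewhere in this notebook.

```python
import re


def parse_year(text):
    """Turn a string like '(1957)' into the integer 1957, or None if no year is found."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def parse_rating(text):
    """Turn a string like '96%' into the integer 96, or None if it is not a percentage."""
    text = text.strip()
    if text.endswith("%") and text[:-1].isdigit():
        return int(text[:-1])
    return None


# Sample values taken from the scraped output.
print([parse_year(y) for y in ["(1957)", "(1968)", "(1959)"]])
print([parse_rating(r) for r in ["100%", "92%", "99%"]])
```

Returning `None` for unexpected input keeps one odd value from stopping the whole conversion.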


**4. Director's Name from the webpage**

In [37]:
dir_names = doc.find_all('div', class_= "info director")

![](https://imgur.com/15hIoAl.png)

In [38]:
dir_names[0]

<div class="info director">
<span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/sidney_lumet">Sidney Lumet</a></div>

In [39]:
# Define a function to get movie directors from the webpage
def get_director_names(doc):
    dir_names = doc.find_all('div', class_= "info director")
    movie_director_names =[]
    for dir_name in dir_names:
        movie_director_names.append(dir_name.text.replace("\nDirected By:", ""))
    return movie_director_names           

In [40]:
get_director_names(doc)

[' Sidney Lumet',
 ' Stanley Kubrick',
 ' François Truffaut',
 ' Stephan Elliott',
 ' Michael Curtiz, William Keighley',
 ' Werner Herzog',
 ' Jim Abrahams, David Zucker, Jerry Zucker',
 ' Katsuhiro Ohtomo',
 ' Ridley Scott',
 ' James Cameron',
 ' Joseph L. Mankiewicz',
 ' Pedro Almodóvar',
 ' Alan J. Pakula',
 ' Cameron Crowe',
 ' Milos Forman',
 ' Jean-Pierre Jeunet',
 ' Michael Haneke',
 ' Vincente Minnelli',
 ' Woody Allen',
 ' Billy Wilder',
 ' Francis Ford Coppola',
 ' Anthony Russo, Joe Russo',
 ' Robert Zemeckis',
 ' Terrence Malick',
 ' Gary Trousdale, Kirk Wise',
 ' Spike Jonze',
 ' Hal Ashby',
 ' William Wyler',
 ' Justin Lin',
 ' Vittorio De Sica',
 ' Joel Coen',
 ' Michael Showalter',
 ' Alejandro González Iñárritu',
 ' Ridley Scott',
 ' Marcel Camus',
 ' Ryan Coogler',
 ' Ridley Scott',
 ' Mel Brooks',
 ' Paul Thomas Anderson',
 ' Kimberly Peirce',
 ' John Singleton',
 ' John Hughes',
 ' Jean-Luc Godard',
 ' Paul Feig',
 ' David Lean',
 ' Sharon Maguire',
 ' James L. Broo

In [41]:
movie_director_names=get_director_names(doc)
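The names returned by `get_director_names` keep a leading space, because only the label text is removed and the space after it stays. One possible alternative, sketched below on the director `<div>` shown earlier, removes whatever text the label `<span>` contains and then trims the whitespace. The helper name `text_without_label` is hypothetical, and it is an assumption that the other info blocks use the same `descriptor` span.

```python
from bs4 import BeautifulSoup

# The director block shown in the output of dir_names[0].
snippet = '''<div class="info director">
<span class="descriptor">Directed By:</span> <a class="" href="//www.rottentomatoes.com/celebrity/sidney_lumet">Sidney Lumet</a></div>'''


def text_without_label(div):
    """Return the div's text without its 'descriptor' label (hypothetical helper)."""
    full_text = div.get_text()
    label = div.find("span", class_="descriptor")
    if label is not None:
        # Remove only the first occurrence of the label text.
        full_text = full_text.replace(label.get_text(), "", 1)
    return full_text.strip()


div = BeautifulSoup(snippet, "html.parser").div
director = text_without_label(div)
print(repr(director))
```

Because the label is read from the page rather than hard-coded, the same helper does not rely on counting characters, unlike slices such as `text[11:]`.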


**5. Cast Name from the webpage**

In [43]:
cast_names = doc.find_all('div', class_= "info cast")

![](https://imgur.com/mkM6zo0.png)

In [44]:
cast_names[0].text[11:]

'Henry Fonda, Lee J. Cobb, Ed Begley, E.G. Marshall'

In [45]:
# Define a function to get movie cast from the webpage
def get_movie_cast(doc):
    cast_names = doc.find_all('div', class_= "info cast")
    stars = []
    for star in cast_names:
        stars.append(star.text[11:])

    return stars 

In [46]:
get_movie_cast(doc)[:5]

['Henry Fonda, Lee J. Cobb, Ed Begley, E.G. Marshall',
 'Keir Dullea, Gary Lockwood, William Sylvester, Daniel Richter',
 'Jean-Pierre Léaud, Claire Maurier, Albert Remy, Guy Decomble',
 'Terence Stamp, Hugo Weaving, Guy Pearce, Bill Hunter',
 'Errol Flynn, Olivia de Havilland, Basil Rathbone, Claude Rains']

In [47]:
stars=get_movie_cast(doc)
stars

['Henry Fonda, Lee J. Cobb, Ed Begley, E.G. Marshall',
 'Keir Dullea, Gary Lockwood, William Sylvester, Daniel Richter',
 'Jean-Pierre Léaud, Claire Maurier, Albert Remy, Guy Decomble',
 'Terence Stamp, Hugo Weaving, Guy Pearce, Bill Hunter',
 'Errol Flynn, Olivia de Havilland, Basil Rathbone, Claude Rains',
 'Klaus Kinski, Ruy Guerra, Helena Rojo, Del Negro',
 'Robert Hays, Julie Hagerty, Peter Graves, Robert Stack',
 'Mitsuo Iwata, Nozomu Sasaki, Mami Koyama, Tessho Genda',
 'Tom Skerritt, Sigourney Weaver, John Hurt, Veronica Cartwright',
 'Sigourney Weaver, Carrie Henn, Michael Biehn, Paul Reiser',
 'Bette Davis, Anne Baxter, Celeste Holm, George Sanders',
 'Cecilia Roth, Eloy Azorín, Marisa Paredes, Penélope Cruz',
 'Robert Redford, Dustin Hoffman, Jack Warden, Martin Balsam',
 'Billy Crudup, Frances McDormand, Kate Hudson, Jason Lee',
 'F. Murray Abraham, Tom Hulce, Jeffrey Jones, Elizabeth Berridge',
 'Audrey Tautou, Mathieu Kassovitz, Rufus, Yolande Moreau',
 'Jean-Louis Trinti

**6. Link of each movie from the webpage**

In [48]:
# Define a function to get each movie's URL from the webpage
def get_movie_url(doc):
    movie_href_tags = doc.find_all('div', class_="article_movie_title")
    movies_url = []
    for tag in movie_href_tags:
        # The link is the href of the first <a> inside the first <h2> of each title block
        movies_url.append(tag('h2')[0]('a')[0]['href'])

    return movies_url

In [49]:
get_movie_url(doc)[:5]

['https://www.rottentomatoes.com/m/1000013_12_angry_men',
 'https://www.rottentomatoes.com/m/2001_a_space_odyssey',
 'https://www.rottentomatoes.com/m/400_blows',
 'https://www.rottentomatoes.com/m/adventures_of_priscilla_queen_of_the_desert',
 'https://www.rottentomatoes.com/m/1000355-adventures_of_robin_hood']

In [50]:
movies_url = get_movie_url(doc)


**7. Synopsis for a movie from the webpage**

In [52]:
# Define a function to get movie synopses from the webpage
def get_synopsis(doc):
    syn_tags = doc.find_all('div', class_= "info synopsis")

    synopsis = []
    for syn_tag in syn_tags:
        synopsis.append(syn_tag.text[10:])

    return synopsis

![](https://imgur.com/soKbwJS.png)

In [53]:
get_synopsis(doc)[:5]

['Following the closing arguments in a murder trial, the 12 members of the jury must deliberate, with a guilty verdict... [More]',
 'An imposing black structure provides a connection between the past and the future in this enigmatic adaptation of a short... [More]',
 'For young Parisian boy Antoine Doinel (Jean-Pierre Léaud), life is one difficult situation after another. Surrounded by inconsiderate adults, including... [More]',
 'When drag queen Anthony (Hugo Weaving) agrees to take his act on the road, he invites fellow cross-dresser Adam (Guy... [More]',
 'When King Richard the Lionheart is captured, his scheming brother Prince John (Claude Rains) plots to reach the throne, to... [More]']

In [54]:
synopsis = get_synopsis(doc)

**8. Critics Consensus for a movie from the webpage**

In [55]:
# Define a function to get the critics consensus for each movie from the webpage
def get_critics_consensus(doc):
    cc_tags = doc.find_all('div', class_= "info critics-consensus")

    critics_consensus = []
    for cc_tag in cc_tags:
        critics_consensus.append(cc_tag.text[10:].replace("nsensus:", ""))

    return critics_consensus

![](https://imgur.com/hFHdTIM.png)

In [56]:
get_critics_consensus(doc)[:5]

[" Sidney Lumet's feature debut is a superbly written, dramatically effective courtroom thriller that rightfully stands as a modern classic.",
 " One of the most influential of all sci-fi films -- and one of the most controversial -- Stanley Kubrick's 2001 is a delicate, poetic meditation on the ingenuity -- and folly -- of mankind.",
 ' A seminal French New Wave film that offers an honest, sympathetic, and wholly heartbreaking observation of adolescence without trite nostalgia.',
 ' While its premise is ripe for comedy -- and it certainly delivers its fair share of laughs -- Priscilla is also a surprisingly tender and thoughtful road movie with some outstanding performances.',
 ' Errol Flynn thrills as the legendary title character, and the film embodies the type of imaginative family adventure tailor-made for the silver screen.']

In [57]:
critics_consensus = get_critics_consensus(doc)

**9. Creating a dataframe using the pandas library**

In [58]:
column_names = ['Title', 'Release_year', 'Rating', 'Director','Synopsis','Critics_Consensus','Cast','url']
Top_300_movie_df = pd.DataFrame(list(zip(movie_titles,movie_releases,movie_ratings,movie_director_names,synopsis,critics_consensus,stars,movies_url)), columns = column_names)
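`zip` stops at the shortest input, so if one of the lists were shorter than the others (for example, if a movie had no critics consensus), the table would be cut short and later rows could be misaligned without any error. A small check before building the dataframe can catch this. The function name `check_equal_lengths` is hypothetical, and the sample lists stand in for the scraped ones.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length; otherwise return that length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Stand-ins for the scraped lists.
titles = ["12 Angry Men", "2001: A Space Odyssey"]
ratings = ["100%", "92%"]

row_count = check_equal_lengths(Title=titles, Rating=ratings)
print(row_count)
```

In this notebook the call would take the eight scraped lists, for instance `check_equal_lengths(Title=movie_titles, Rating=movie_ratings, ...)`, before they are passed to `zip`.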

In [59]:
Top_300_movie_df

| | Title | Release_year | Rating | Director | Synopsis | Critics_Consensus | Cast | url |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 Angry Men | (1957) | 100% | Sidney Lumet | Following the closing arguments in a murder tr... | Sidney Lumet's feature debut is a superbly wr... | Henry Fonda, Lee J. Cobb, Ed Begley, E.G. Mars... | https://www.rottentomatoes.com/m/1000013_12_an... |
| 1 | 2001: A Space Odyssey | (1968) | 92% | Stanley Kubrick | An imposing black structure provides a connect... | One of the most influential of all sci-fi fil... | Keir Dullea, Gary Lockwood, William Sylvester,... | https://www.rottentomatoes.com/m/2001_a_space_... |
| 2 | The 400 Blows | (1959) | 99% | François Truffaut | For young Parisian boy Antoine Doinel (Jean-Pi... | A seminal French New Wave film that offers an... | Jean-Pierre Léaud, Claire Maurier, Albert Remy... | https://www.rottentomatoes.com/m/400_blows |
| 3 | The Adventures of Priscilla, Queen of the Desert | (1994) | 94% | Stephan Elliott | When drag queen Anthony (Hugo Weaving) agrees ... | While its premise is ripe for comedy -- and i... | Terence Stamp, Hugo Weaving, Guy Pearce, Bill ... | https://www.rottentomatoes.com/m/adventures_of... |
| 4 | The Adventures of Robin Hood | (1938) | 100% | Michael Curtiz, William Keighley | When King Richard the Lionheart is captured, h... | Errol Flynn thrills as the legendary title ch... | Errol Flynn, Olivia de Havilland, Basil Rathbo... | https://www.rottentomatoes.com/m/1000355-adven... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | Iron Man | (2008) | 94% | Jon Favreau | A billionaire industrialist and genius invento... | Powered by Robert Downey Jr.'s vibrant charm,... | Robert Downey Jr., Terrence Howard, Gwyneth Pa... | https://www.rottentomatoes.com/m/iron_man |
| 146 | It Happened One Night | (1934) | 98% | Frank Capra | In Frank Capra's acclaimed romantic comedy, sp... | Capturing its stars and director at their fin... | Claudette Colbert, Clark Gable, Walter Connoll... | https://www.rottentomatoes.com/m/it_happened_o... |
| 147 | It's a Wonderful Life | (1946) | 94% | Frank Capra | After George Bailey (James Stewart) wishes he ... | The holiday classic to define all holiday cla... | James Stewart, Donna Reed, Lionel Barrymore, T... | https://www.rottentomatoes.com/m/its_a_wonderf... |
| 148 | Jaws | (1975) | 97% | Steven Spielberg | When a young woman is killed by a shark while ... | Compelling, well-crafted storytelling and a j... | Roy Scheider, Robert Shaw, Richard Dreyfuss, L... | https://www.rottentomatoes.com/m/jaws |


# Scraping Second Page

### Creating Helper Function to scrape second page

In [60]:
def scrape_multiple_pages(n):
    final_dict = {'Title':[],'Rating':[],'Release_year':[],'Director':[],'Cast':[],
                  'Synopsis':[],'Critics_Consensus':[],'url':[] }
    for i in range(1,n+1):
        url = 'https://editorial.rottentomatoes.com/guide/essential-movies-to-watch-now/'+str(i)+'/'
        # Parse page i and run every extractor on it, instead of reusing the lists built from the first page
        doc = BeautifulSoup(requests.get(url).text, 'html.parser')
        final_dict['Title'].extend(get_movie_titles(doc))
        final_dict['Rating'].extend(get_movie_rating(doc))
        final_dict['Release_year'].extend(get_movie_release(doc))
        final_dict['Director'].extend(get_director_names(doc))
        final_dict['Cast'].extend(get_movie_cast(doc))
        final_dict['Synopsis'].extend(get_synopsis(doc))
        final_dict['Critics_Consensus'].extend(get_critics_consensus(doc))
        final_dict['url'].extend(get_movie_url(doc))
    return final_dict
dictionary = scrape_multiple_pages(2)
Top_300_movie_df = pd.DataFrame(dictionary)
Top_300_movie_df

| | Title | Rating | Release_year | Director | Cast | Synopsis | Critics_Consensus | url |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 Angry Men | 100% | (1957) | Sidney Lumet | Henry Fonda, Lee J. Cobb, Ed Begley, E.G. Mars... | Following the closing arguments in a murder tr... | Sidney Lumet's feature debut is a superbly wr... | https://www.rottentomatoes.com/m/1000013_12_an... |
| 1 | 2001: A Space Odyssey | 92% | (1968) | Stanley Kubrick | Keir Dullea, Gary Lockwood, William Sylvester,... | An imposing black structure provides a connect... | One of the most influential of all sci-fi fil... | https://www.rottentomatoes.com/m/2001_a_space_... |
| 2 | The 400 Blows | 99% | (1959) | François Truffaut | Jean-Pierre Léaud, Claire Maurier, Albert Remy... | For young Parisian boy Antoine Doinel (Jean-Pi... | A seminal French New Wave film that offers an... | https://www.rottentomatoes.com/m/400_blows |
| 3 | The Adventures of Priscilla, Queen of the Desert | 94% | (1994) | Stephan Elliott | Terence Stamp, Hugo Weaving, Guy Pearce, Bill ... | When drag queen Anthony (Hugo Weaving) agrees ... | While its premise is ripe for comedy -- and i... | https://www.rottentomatoes.com/m/adventures_of... |
| 4 | The Adventures of Robin Hood | 100% | (1938) | Michael Curtiz, William Keighley | Errol Flynn, Olivia de Havilland, Basil Rathbo... | When King Richard the Lionheart is captured, h... | Errol Flynn thrills as the legendary title ch... | https://www.rottentomatoes.com/m/1000355-adven... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |


# Create CSV file(s) with the extracted information

In [61]:
#Converting the Movies Dataframe to a CSV File
Top_300_movie_df.to_csv('Top300movies.csv', index=False)
print('file converted successfully')

file converted successfully
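A quick way to confirm that a CSV was written as intended is to read it back. The sketch below does this on a two-row stand-in dataframe and an in-memory buffer, both assumptions made to keep the example self-contained; with the real file, the read step would be `pd.read_csv('Top300movies.csv')`.

```python
import io

import pandas as pd

# Stand-in for Top_300_movie_df; one title contains a comma on purpose.
sample = pd.DataFrame({
    "Title": ["12 Angry Men", "The Adventures of Priscilla, Queen of the Desert"],
    "Rating": ["100%", "94%"],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

The title containing a comma is quoted automatically when written, so it comes back as a single value rather than being split across two columns.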


# Summary:
I have finished scraping a list of top movies with the help of Python's `BeautifulSoup` and `requests` libraries.
After scraping, the results were stored in a dataframe and saved in CSV format.


![](https://imgur.com/tWU5ngU.png)


### Successfully scraped the data such as
1. Title
2. Movie_Release
3. Movie_Ratings
4. Director
5. Synopsis
6. Critics_Consensus
7. Cast
8. Movie link
 




In [62]:
jovian.commit(files=['Top300movies.csv'])

[jovian] Updating notebook "richardsamuelvincentpaul/web-scraping-project-top300-movies" on https://jovian.com
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.com/richardsamuelvincentpaul/web-scraping-project-top300-movies


'https://jovian.com/richardsamuelvincentpaul/web-scraping-project-top300-movies'

In [63]:
jovian.commit()

[jovian] Updating notebook "richardsamuelvincentpaul/web-scraping-project-top300-movies" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/richardsamuelvincentpaul/web-scraping-project-top300-movies


'https://jovian.com/richardsamuelvincentpaul/web-scraping-project-top300-movies'