Author: Suvansh Vaid

Dated: 15/02/2020

# Project idea

Most movie lovers like to check the top-rated titles on IMDb or Rotten Tomatoes before picking a movie to watch.

This project scrapes the IMDb Top 250 page to extract the 250 movies with the highest IMDb ratings. We use the `beautifulsoup` library and the `requests` library to do this.


<img src="imdb.png">

First, we use the `get` function from the `requests` library to fetch the URL of the IMDb Top 250 list.

In [8]:
from requests import get
url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
response = get(url)
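The request above assumes the server answers with the expected page. As a minimal sketch, the helper below adds a status check before the HTML is used. The name `fetch_html`, the `http_get` parameter, the timeout, and the `User-Agent` value are all hypothetical and not part of the original notebook; some sites reject requests that lack a browser-like header, and whether IMDb does so is an assumption here.

```python
from requests import get


def fetch_html(url, http_get=get, timeout=10):
    """Fetch a page and return its HTML, raising if the request failed.

    `http_get` defaults to requests.get; it is a parameter only so the
    helper can be exercised without network access (hypothetical design).
    """
    # Assumed header value; adjust or drop it if the site does not need it.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; imdb-top250-notebook)"}
    response = http_get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(url)` would stand in for `get(url).text`, and a failed request would surface as an exception rather than as an empty list of movies later on.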

Next, we use `BeautifulSoup` to parse the page's HTML into a `bs4` object. 

In [9]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

Inspecting the page source shows that each movie's information sits inside a `tr` tag, so we use the `find_all` method to collect every `tr` element.

In [10]:
movie_containers = html_soup.find_all('tr')

In [11]:
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
251


The first `tr` element does not hold a movie (251 rows were found for 250 movies), so we remove the first container. 

In [12]:
movie_containers = movie_containers[1:]

In [13]:
print(len(movie_containers))

250
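To see why `find_all('tr')` can return one more row than there are movies, here is a small sketch on a hypothetical two-movie table. The HTML below is invented for illustration and only mimics the class names used in this notebook; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a chart table: one header row, two movie rows.
sample_html = """
<table>
  <thead>
    <tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr>
  </thead>
  <tbody>
    <tr>
      <td class="titleColumn">1. <a>Movie A</a> <span>(1994)</span></td>
      <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
    </tr>
    <tr>
      <td class="titleColumn">2. <a>Movie B</a> <span>(1972)</span></td>
      <td class="ratingColumn imdbRating"><strong>9.1</strong></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all('tr') matches every row, including the one inside <thead>.
all_rows = soup.find_all("tr")

# Option 1: drop the first row by position, as done in this notebook.
data_rows = all_rows[1:]

# Option 2: search only inside <tbody>, which skips the header by structure.
body_rows = soup.find("tbody").find_all("tr")

print(len(all_rows), len(data_rows), len(body_rows))
```

Slicing by position is shorter, while searching inside `tbody` does not depend on the header being exactly one row. Either works on this sample.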


Now, let's check that the movies were extracted correctly by inspecting the first container.

In [14]:
first_movie = movie_containers[0]

In [15]:
first_name = first_movie.find('td', class_ = "titleColumn").a.text
first_name

'The Shawshank Redemption'

In [16]:
first_rating = first_movie.find('td', class_ = "ratingColumn imdbRating").strong.text
first_rating

'9.2'

In [17]:
first_release_year = first_movie.find('td', class_ = "titleColumn").span.text
first_release_year

'(1994)'

Now comes the core of the project. We loop over the containers with basic Python operations, collect each field in a list, and then store all 250 movies in a dataframe. 

In [11]:
# Lists to store the scraped data in
names = []
release_years = []
imdb_ratings = []

In [12]:
# Extract data from individual movie container
for container in movie_containers:
    
    # Movie name
    name = container.find('td', class_ = "titleColumn").a.text
    names.append(name)
    
    # Release year
    year = str(container.find('td', class_ = "titleColumn").span.text)[1:5]
    release_years.append(year)
    
    # IMDB rating
    imdb = float(container.find('td', class_ = "ratingColumn imdbRating").strong.text)
    imdb_ratings.append(imdb)
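The loop above assumes every container has the expected tags and that the year always sits at characters 1 to 4 of the span text. As a hedged alternative, the sketch below wraps the same lookups in a helper that returns `None` for rows it cannot parse and pulls the year with a regular expression. The function name `parse_movie_row` and the sample HTML are hypothetical, not part of the original notebook.

```python
import re

from bs4 import BeautifulSoup

# Matches the first run of four digits, e.g. the 1994 in "(1994)".
YEAR_PATTERN = re.compile(r"\d{4}")


def parse_movie_row(row):
    """Return a dict for one table row, or None if an expected tag is missing."""
    title_cell = row.find("td", class_="titleColumn")
    rating_cell = row.find("td", class_="ratingColumn imdbRating")
    if title_cell is None or rating_cell is None:
        return None
    if title_cell.a is None or title_cell.span is None or rating_cell.strong is None:
        return None
    year_match = YEAR_PATTERN.search(title_cell.span.text)
    if year_match is None:
        return None
    return {
        "movie": title_cell.a.text.strip(),
        "year": int(year_match.group()),
        "imdb": float(rating_cell.strong.text),
    }


# Hypothetical sample: a header row followed by one movie row.
sample_html = """
<table>
  <tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr>
  <tr>
    <td class="titleColumn">1. <a>Movie A</a> <span>(1994)</span></td>
    <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
  </tr>
</table>
"""

rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")
parsed = [parse_movie_row(row) for row in rows]
movies = [movie for movie in parsed if movie is not None]
print(movies)
```

Because unparseable rows come back as `None`, the header row is filtered out without slicing, and a list of such dicts can be passed straight to `pd.DataFrame`.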

In [13]:
import pandas as pd
imdb_df = pd.DataFrame({
    'movie': names,
    'year': release_years,
    'imdb': imdb_ratings
})

The final dataframe looks like the following. It shows how a few lines of scraping code can turn a web page into structured data ready for analysis. 

In [14]:
imdb_df.head()

|   | movie | year | imdb |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 9.2 |
| 1 | The Godfather | 1972 | 9.1 |
| 2 | The Godfather: Part II | 1974 | 9.0 |
| 3 | The Dark Knight | 2008 | 9.0 |
| 4 | 12 Angry Men | 1957 | 8.9 |
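Since the loop stores each year as a string, a possible follow-up is to convert that column to integers and save the result. The sketch below uses a hypothetical two-row dataframe and an in-memory buffer in place of a real file; writing with `index=False` is a design choice that keeps the row index from coming back as an extra unnamed column when the CSV is reloaded.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the scraped lists.
sample_df = pd.DataFrame({
    "movie": ["Movie A", "Movie B"],
    "year": ["1994", "1972"],  # strings, as produced by slicing "(1994)"
    "imdb": [9.2, 9.1],
})

# Convert the year strings to integers so they sort and compare numerically.
sample_df["year"] = pd.to_numeric(sample_df["year"])

# Write to an in-memory buffer (a file path would work the same way).
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Reload to confirm that only the three data columns were written.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```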
