# Scraping IMDb Data using Python BeautifulSoup



![img](https://i.imgur.com/H7ixzu6.png)

### Outline of the project:

- Introduction about web scraping
- Introduction about IMDB and the purpose of it
- We're using (Python, requests, Beautiful Soup, Pandas)

### About web scraping
Web scraping consists in gathering data available on websites. This can be done manually by a human user or by a bot. The latter can of course gather data much faster than a human user and that is why we are going to focus on this. Is it therefore technically possible to collect all the data of a website in a matter of minutes this kind of bot. The legality of this practice is not well defined however. Websites usually describe in their terms of use and in their robots.txt file if they allow scrapers or not.

### Introduction to IMDb Website

IMDb (an acronym for Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and streaming content online - including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.

Here are steps we'll follow:

- We're going to scrape https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=000&ref_=adv_nxt 
- We'll get a list of movies.For each movie,we'll get movie title, Movie genre, Movie  duration, Movie rating, movie year,Movie certification,Movie url
- we'll create a csv file in the followig format:
``` 
Movie Name, Genre, Duration,Rating, Year, Certification, URL
JaiBhim, 'Crime, Drama',164, 9.5,2021, A, imdb.com/title/tt15097216/?ref_=adv_li_tt
The Shawshank Redemption,'Drama', 142,9.3, 1994,R,imdb.com/title/tt0111161/?ref_=adv_li_tt
```

## Scrape the list of titles from IMDb



- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe

Let's write a function to download the page.

In [24]:
import requests
from bs4 import BeautifulSoup

def get_topics_page(url):
    
    
    response=requests.get(url)
    # check successfull response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    # Parse using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc



In [25]:
topic_url = 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=000&ref_=adv_next'
doc = get_topics_page(topic_url)

doc contains code like below

![img](https://i.imgur.com/IMxQO7G.png)

In [26]:
doc.find('title')

<title>IMDb "Top 1000"
(Sorted by IMDb Rating Descending) - IMDb</title>

Let's create some helper functions to parse information from the page

### To get title of movie

In [3]:
def get_movie_titles(doc):
    
    selection_class="lister-item-header"
    movie_title_tags=doc.find_all('h3',{'class':selection_class})
    movie_titles=[]

    for tag in movie_title_tags:
        title = tag.find('a').text
        movie_titles.append(title)
        
        
    return movie_titles

In [4]:
titles = get_movie_titles(doc)

In [6]:
titles[:5]

['Jai Bhim',
 'The Shawshank Redemption',
 'The Godfather',
 'Soorarai Pottru',
 'The Dark Knight']

Similarly we defined functions for movie url, ratings, certification, duration and  year

### To get URL's of movie

In [5]:
def get_movie_url(doc):
    url_selector="lister-item-header"           
    movie_url_tags=doc.find_all('h3',{'class':url_selector})
    movie_url_tagss=[]
    base_url = 'https://www.imdb.com/'
    for tag in movie_url_tags:
        movie_url_tagss.append('https://www.imdb.com/' + tag.find('a')['href'])
    return movie_url_tagss

In [6]:
urls = get_movie_url(doc)

In [9]:
urls[:5]

['https://www.imdb.com//title/tt15097216/',
 'https://www.imdb.com//title/tt0111161/',
 'https://www.imdb.com//title/tt0068646/',
 'https://www.imdb.com//title/tt10189514/',
 'https://www.imdb.com//title/tt0468569/']

### To get movie duration

In [7]:
def get_movie_duration(doc):
    
    selection_class="runtime"
    movie_duration_tags=doc.find_all('span',{'class':selection_class})
    movie_duration=[]

    for tag in movie_duration_tags:
        duration = tag.text[:-4]
        movie_duration.append(duration)
        
        
    return movie_duration

In [8]:
durations = get_movie_duration(doc)

In [12]:
durations[:5]

['164', '142', '175', '153', '152']

### To get certification of movie

In [9]:
def get_movie_certification(doc):
    
    selection_class="lister-item-content"
    movie_details_tags = doc.find_all('div',{'class':selection_class})
    movie_certification=[]
    

    for detail_tag in movie_details_tags:
        
        certification_tag = detail_tag.find('span',{'class':'certificate'})
        if certification_tag:
            movie_certification.append(certification_tag.text)
        else:
            movie_certification.append('NA')                                                           
        
    return movie_certification

In [10]:
certifications = get_movie_certification(doc)

In [15]:
certifications[:5]

['NA', 'R', 'R', 'TV-MA', 'PG-13']

### To get year of movie

In [11]:
def get_movie_year(doc):
    year_selector = "lister-item-year text-muted unbold"           
    movie_year_tags=doc.find_all('span',{'class':year_selector})
    movie_year_tagss=[]
    for tag in movie_year_tags:
        movie_year_tagss.append(tag.get_text().strip()[1:5])
    return movie_year_tagss

In [12]:
years = get_movie_year(doc)

In [18]:
years[:5]

['2021', '1994', '1972', '2020', '2008']

### To get genre of movie

In [13]:
def get_movie_genre(doc):
    genre_selector="genre"            
    movie_genre_tags=doc.find_all('span',{'class':genre_selector})
    movie_genre_tagss=[]
    for tag in movie_genre_tags:
        movie_genre_tagss.append(tag.get_text().strip())
    return movie_genre_tagss

In [14]:
genres = get_movie_genre(doc)

In [21]:
genres[:5]

['Crime, Drama', 'Drama', 'Crime, Drama', 'Drama', 'Action, Crime, Drama']

### To get ratings of movie

In [15]:
def get_movie_rating(doc):
    rating_selector="inline-block ratings-imdb-rating"            
    movie_rating_tags=doc.find_all('div',{'class':rating_selector})
    movie_rating_tagss=[]
    for tag in movie_rating_tags:
        movie_rating_tagss.append(tag.get_text().strip())
    return movie_rating_tagss

In [16]:
ratings = get_movie_rating(doc)

In [24]:
ratings[:5]

['9.5', '9.3', '9.2', '9.1', '9.0']

Let's put all together into a single function

In [17]:
import pandas as pd

In [28]:
def all_pages(num=10):
# Let's we create a dictionary to store data of all movies
    movies_dict={
        'titles':[],
        'genre':[],
        'duration':[],
        'rating':[],
        'year':[],
        'certification':[],
        'url':[]
    }
  # We have to scrap more than one page so we want urls of all pages with the help of loop we can get all urls
    for i in range(1,num*110,100):
       
        url = 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start='+str(i)+'&ref_=adv_next'
        doc = get_topics_page(url)
   
        
        movies_dict['titles'] += get_movie_titles(doc)
        movies_dict['url'] += get_movie_url(doc)
        movies_dict['certification'] += get_movie_certification(doc)
        movies_dict['rating'] += get_movie_rating(doc)
        movies_dict['duration'] += get_movie_duration(doc)
        movies_dict['year'] += get_movie_year(doc)
        movies_dict['genre'] += get_movie_genre(doc)   
        
    return pd.DataFrame(movies_dict)


In [29]:
movies = all_pages()

In [30]:
movies.to_csv('movies.csv',index=None)

movies csv file has been created, which contains the data that we've scrapped

In [31]:
dataframe = pd.read_csv('movies.csv')

In [32]:
dataframe

Unnamed: 0,titles,genre,duration,rating,year,certification,url
0,Jai Bhim,"Crime, Drama",164,9.5,2021,,https://www.imdb.com//title/tt15097216/
1,The Shawshank Redemption,Drama,142,9.3,1994,R,https://www.imdb.com//title/tt0111161/
2,The Godfather,"Crime, Drama",175,9.2,1972,R,https://www.imdb.com//title/tt0068646/
3,Soorarai Pottru,Drama,153,9.1,2020,TV-MA,https://www.imdb.com//title/tt10189514/
4,The Dark Knight,"Action, Crime, Drama",152,9.0,2008,PG-13,https://www.imdb.com//title/tt0468569/
...,...,...,...,...,...,...,...
1995,Drishyam,"Crime, Drama, Thriller",160,8.3,2013,Not Rated,https://www.imdb.com//title/tt3417422/
1996,The Hunt,Drama,115,8.3,2012,R,https://www.imdb.com//title/tt2106476/
1997,A Separation,Drama,123,8.3,2011,PG-13,https://www.imdb.com//title/tt1832382/
1998,Incendies,"Drama, Mystery, War",131,8.3,2010,R,https://www.imdb.com//title/tt1255953/


### Summary

In this project, we scrapped the specific data, which we want using Python programming language, requests and BeautifulSoup library  have been used to download the pages and then exploring and getting the relevant data from the website. Further the work has been in a csv file to be used for further processing.We've created a dataframe which consists of 1000 rows and 7 columns fom 10 pages.

### Future Works

The current web scrappning project is a starting point for a bigger NLP project. The updates of which will be posted afterwards. In this project I started to look into some of the websites where I could get the data for movies in IMDm website.


## References

1. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
2. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. https://docs.python-requests.org/en/master/

## Commiting the work on Jovian platform along with the required file

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="scraping-IMDb-data-using-python-beautifulsoup",files =['movies.csv'])

<IPython.core.display.Javascript object>