# IMDB Top 50 Popular Movies by Genre

## Introduction

IMDb (an abbreviation of Internet Movie Database) is an online database of information related to films, television series, home videos, video games, and streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. IMDb began as a fan-operated movie database in 1990 and moved to the Web in 1993. It is now owned and operated by IMDb.com, Inc., a subsidiary of Amazon. As of March 2022, the database contained some 10.1 million titles (including television episodes).



In this project, I extract data on the top 50 popular movies in each genre from the IMDB website using web scraping techniques with Python and its libraries.

## Steps 

- Scrape the IMDB website and extract the list of all the movie genres and the URL to their top 50 popular movies
- Scrape each genre URL to obtain the data of the top 50 popular movies
- Extract the details like Movie Title, Description, Rating, Cast, etc.
- Create a CSV file containing the data

## Import all the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

## Scrape IMDB website 

Copy the link to the movie genres page from the IMDB website and store it in a variable

In [2]:
imdb = 'https://www.imdb.com/feature/genre'

Use the requests library to fetch the web page, then check the status code of the response

In [3]:
response = requests.get(imdb)
response.status_code

200

Inspect the first 1000 characters of the page contents

In [4]:
response.text[:1000]

'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n    \n    \n    \n\n    \n    \n    \n\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n            </style>\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Browse Movies and TV by Genre - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1})

In [5]:
page_contents = response.text

Create an HTML file with the page contents

In [6]:
with open('imdb.html','w',encoding='utf-8') as f:
    f.write(page_contents)

## Get Movies Genres 

Use the Beautiful Soup library to parse the contents of the response

In [7]:
data = BeautifulSoup(page_contents,'html.parser')

Inspect the web page to find the class name (or any other unique attribute) of the elements that hold the genres

In [8]:
genre_class = 'table-cell primary'

Extract the tags associated with the class using the `find_all` function. Only the first 24 tags are related to movies

In [9]:
genre_tags = data.find_all('div', {'class': genre_class})[:24]

Define a function to get the data of the genres

In [10]:
def get_genres_data(genre_tags):
    
    genres = []
    genre_urls = []
    base_url = 'https://www.imdb.com'
    
    for genre in genre_tags:
        
        genres.append(genre.find('a').text.strip())
        genre_urls.append(base_url + genre.find('a')['href'])
        
    return pd.DataFrame({'Genre':genres,'URL':genre_urls})

In [11]:
genre_df = get_genres_data(genre_tags)

In [12]:
genre_df

|  | Genre | URL |
|---|---|---|
| 0 | Action | https://www.imdb.com/search/title?genres=actio... |
| 1 | Adventure | https://www.imdb.com/search/title?genres=adven... |
| 2 | Animation | https://www.imdb.com/search/title?genres=anima... |
| 3 | Biography | https://www.imdb.com/search/title?genres=biogr... |
| 4 | Comedy | https://www.imdb.com/search/title?genres=comed... |
| 5 | Crime | https://www.imdb.com/search/title?genres=crime... |
| 6 | Documentary | https://www.imdb.com/search/title?genres=docum... |
| 7 | Drama | https://www.imdb.com/search/title?genres=drama... |
| 8 | Family | https://www.imdb.com/search/title?genres=famil... |
| 9 | Fantasy | https://www.imdb.com/search/title?genres=fanta... |
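
The class-based lookup used for the genres can be tried offline on a small piece of markup. The following is a minimal sketch, assuming the genre links sit inside `div` elements that carry both the `table-cell` and `primary` classes; the sample HTML and its `/example/...` paths are made up for illustration and are not copied from IMDB. It uses a CSS selector through `select`, which matches the two classes regardless of their order in the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure assumed in this section
sample_html = '''
<div class="table-cell primary"><a href="/example/action"> Action </a></div>
<div class="table-cell primary"><a href="/example/comedy"> Comedy </a></div>
<div class="table-cell secondary"><a href="/example/other">Other</a></div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
base_url = 'https://www.imdb.com'

# Select links inside div elements that carry both classes
links = soup.select('div.table-cell.primary a')

# Build (genre name, absolute URL) pairs
genre_pairs = [(a.text.strip(), base_url + a['href']) for a in links]
print(genre_pairs)
```

The `div` with the `secondary` class is skipped because it does not carry both classes.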


## Extract Movies data

From the obtained URLs, extract top 50 popular movies data by scraping each URL

### Obtain data for the first genre to understand the pattern

In [13]:
genre_url = genre_df['URL'][0]

Use requests to get the response for the URL

In [14]:
response = requests.get(genre_url)
response.status_code

200

Parse the response contents using Beautiful Soup 

In [15]:
data = BeautifulSoup(response.text,'html.parser')

Get the unique identifier (tags or classes or both) for the movies data from the obtained web page

In [16]:
movie_class = 'lister-item-content'

In [17]:
movies_tags = data.find_all('div', {'class': movie_class})
len(movies_tags)

50

Get all the details for one movie by identifying unique tags/classes

In [18]:
movie = movies_tags[0]

Title

In [19]:
movie.find('a').text

'Prey'

Adult rating

In [20]:
movie.find('span',{'class':"certificate"}).text

'R'

Year Released

In [21]:
re.findall('[0-9]+',movie.find('span',{'class':"lister-item-year text-muted unbold"}).text)[0]

'2022'
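
The pattern `[0-9]+` returns the first run of digits in the text. A slightly stricter variant is sketched below, assuming the year text may also contain a disambiguation marker or a range; the sample strings and the helper name `extract_year` are illustrative assumptions, not taken from the page.

```python
import re

def extract_year(year_text):
    """Return the first four-digit number in the text as a string, or None."""
    match = re.search(r'\d{4}', year_text)
    return match.group(0) if match else None

print(extract_year('(2022)'))
print(extract_year('(I) (2019)'))
print(extract_year(''))
```

Returning `None` for an empty string lets the caller decide how to represent a missing year.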

Duration in mins

In [22]:
movie.find('span',{ 'class':"runtime"}).text[:-4]

'99'
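
Slicing with `[:-4]` relies on the text ending in exactly `' min'`. As a hedged alternative, the sketch below reads the leading digits with a regular expression and converts them to an integer; `parse_runtime_minutes` is a hypothetical helper name.

```python
import re

def parse_runtime_minutes(runtime_text):
    """Return the leading number of minutes as an int, or None if there is none."""
    match = re.match(r'\s*(\d+)', runtime_text)
    return int(match.group(1)) if match else None

print(parse_runtime_minutes('99 min'))
print(parse_runtime_minutes(''))
```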

Genres tagged

In [23]:
movie.find('span',{ 'class':"genre"}).text.strip()

'Action, Drama, Horror'

IMDB Rating

In [24]:
movie.find('strong').text

'7.3'

Votes

In [25]:
movie.find('meta',{'itemprop':"ratingCount"}).attrs['content']

'53123'

In [26]:
movie.find('span',{'name':'nv'}).text

'53,123'
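
The displayed vote count contains a thousands separator, so it cannot be passed to `int` directly. A minimal sketch of the conversion, assuming the separator is always a comma (`parse_votes` is a hypothetical helper name):

```python
def parse_votes(votes_text):
    """Convert a vote count such as '53,123' to an integer."""
    return int(votes_text.replace(',', '').strip())

print(parse_votes('53,123'))
```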

Summary

In [27]:
movie.find_all('p')[1].text.strip()

'The origin story of the Predator in the world of the Comanche Nation 300 years ago. Naru, a skilled female warrior, fights to protect her tribe against one of the first highly-evolved Predators to land on Earth.'

Directors and Stars

In [28]:
director, star = movie.find_all('p')[2].text.strip().split('|')

Directors

In [29]:
', '.join(list(map(str.strip,director.split(':')[1].strip().split(','))))

'Dan Trachtenberg'

Stars

In [30]:
', '.join(list(map(str.strip,star.split(':')[1].strip().split(','))))

'Amber Midthunder, Dakota Beavers, Dane DiLiegro, Stormee Kipp'
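
The director and star steps above can be combined into one helper that also copes with a paragraph containing only one of the two parts. This is a sketch under the assumption that the paragraph text has the form `Director: ... | Stars: ...`; the helper name `split_credits` and the sample strings are illustrative.

```python
def split_credits(credits_text):
    """Split 'Director: ... | Stars: ...' text into (directors, stars).

    Each element is a comma-separated string, or None when that part is absent.
    """
    directors, stars = None, None
    for part in credits_text.split('|'):
        # Everything before the first colon is the label, the rest are names
        label, _, names = part.partition(':')
        cleaned = ', '.join(n.strip() for n in names.split(',') if n.strip())
        if not cleaned:
            continue
        if label.strip().startswith('Director'):
            directors = cleaned
        elif label.strip().startswith('Star'):
            stars = cleaned
    return directors, stars

sample = 'Director:\nDan Trachtenberg\n| \n    Stars:\nAmber Midthunder, \nDakota Beavers'
print(split_credits(sample))
print(split_credits('Stars: Amber Midthunder'))
```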

### Define a function to obtain the data for all popular movies in a genre

In [31]:
def get_movies_data(genre, genre_url):
    '''
    Get movies data: Movie Title, Release year, Adult rating, Duration, Genres tagged, IMDB rating, Summary, Directors, Stars
    
    Params:
        genre: genre to which the movies belong
        genre_url: URL to genre's top 50 popular movies 
    
    Returns:
    Dataframe containing the movies data 
    '''
    response = requests.get(genre_url)
    if response.status_code != 200:
        print('No response')
        return 
    
    data = BeautifulSoup(response.text, 'html.parser')
    
    movies_tags = data.find_all('div', {'class': movie_class})
    
    titles = []
    release_year = []
    adult_rating = []
    duration = []
    genres_tagged = []
    imdb_rating = []
    summary = []
    directors = []
    stars = []
    
    for movie in movies_tags:
        
        titles.append(movie.find('a').text)
        # Any field can be missing for a movie; store NaN in that case
        try:
            release_year.append(re.findall('[0-9]+',movie.find('span',{'class':"lister-item-year text-muted unbold"}).text)[0])
        except (AttributeError, IndexError): release_year.append(np.nan)
        try:
            adult_rating.append(movie.find('span',{'class':"certificate"}).text)
        except AttributeError: adult_rating.append(np.nan)
        try:
            duration.append(movie.find('span',{ 'class':"runtime"}).text[:-4])
        except AttributeError: duration.append(np.nan)
        try:
            genres_tagged.append(movie.find('span',{ 'class':"genre"}).text.strip())
        except AttributeError: genres_tagged.append(np.nan)
        try:
            imdb_rating.append(movie.find('strong').text)
        except AttributeError: imdb_rating.append(np.nan)
        try:
            summary.append(movie.find_all('p')[1].text.strip())
        except IndexError: summary.append(np.nan)
        # Reset for each movie so that values from the previous movie are never reused
        director, star = None, None
        try:
            crew = movie.find_all('p')[2].text.strip()
        except IndexError: crew = ''
        if crew.count('|') == 1:
            director, star = crew.split('|')
        elif crew.startswith('Star'):
            star = crew
        elif crew:
            director = crew
        try:
            directors.append(', '.join(map(str.strip,director.split(':')[1].split(','))))
        except (AttributeError, IndexError): directors.append(np.nan)
        try:
            stars.append(', '.join(map(str.strip,star.split(':')[1].split(','))))
        except (AttributeError, IndexError): stars.append(np.nan)
            
    movies_data = dict({'Genre': genre, 'Title': titles, 'Release Year': release_year, 'Adult Rating': adult_rating,
                       'Duration in mins': duration, 'Genres Tagged': genres_tagged, 'IMDB Rating': imdb_rating,
                       'Summary': summary, 'Directors': directors, 'Stars': stars})
    return pd.DataFrame(movies_data)

In [32]:
action_df = get_movies_data('Action',genre_url)

In [33]:
action_df.head()

|  | Genre | Title | Release Year | Adult Rating | Duration in mins | Genres Tagged | IMDB Rating | Summary | Directors | Stars |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Action | Prey | 2022.0 | R | 99.0 | Action, Drama, Horror | 7.3 | The origin story of the Predator in the world ... | Dan Trachtenberg | Amber Midthunder, Dakota Beavers, Dane DiLiegr... |
| 1 | Action | The Gray Man | 2022.0 | UA 16+ | 122.0 | Action, Thriller | 6.5 | When the CIA's most skilled operative-whose tr... | Anthony Russo, Joe Russo | Ryan Gosling, Chris Evans, Ana de Armas, Billy... |
| 2 | Action | Bullet Train | 2022.0 | A | 126.0 | Action, Comedy, Thriller | 7.5 | Five assassins aboard a fast moving bullet tra... | David Leitch | Brad Pitt, Joey King, Aaron Taylor-Johnson, Br... |
| 3 | Action | Batgirl |  |  |  | Action, Adventure, Crime |  | Based upon the popular DC character, Barbara G... | Adil El Arbi, Bilall Fallah | Leslie Grace, J.K. Simmons, Michael Keaton, Br... |
| 4 | Action | Thor: Love and Thunder | 2022.0 | UA | 118.0 | Action, Adventure, Comedy | 6.7 | Thor enlists the help of Valkyrie, Korg and ex... | Taika Waititi | Chris Hemsworth, Natalie Portman, Christian Ba... |


### Get movies data for all genres

Use the function get_movies_data to obtain the data for all the listed genres

In [34]:
genres = genre_df['Genre']
genre_urls = genre_df['URL']

Loop over genres to get data for each genre and concatenate the obtained dataframe to the main dataframe

In [35]:
movies_df = pd.DataFrame()

for i in range(len(genres)):
    
    print('Getting top 50 popular',genres[i],'movies')
    movies_df = pd.concat([movies_df,get_movies_data(genres[i],genre_urls[i])],ignore_index=True)

Getting top 50 popular Action movies
Getting top 50 popular Adventure movies
Getting top 50 popular Animation movies
Getting top 50 popular Biography movies
Getting top 50 popular Comedy movies
Getting top 50 popular Crime movies
Getting top 50 popular Documentary movies
Getting top 50 popular Drama movies
Getting top 50 popular Family movies
Getting top 50 popular Fantasy movies
Getting top 50 popular Film Noir movies
Getting top 50 popular History movies
Getting top 50 popular Horror movies
Getting top 50 popular Music movies
Getting top 50 popular Musical movies
Getting top 50 popular Mystery movies
Getting top 50 popular Romance movies
Getting top 50 popular Sci-Fi movies
Getting top 50 popular Short Film movies
Getting top 50 popular Sport movies
Getting top 50 popular Superhero movies
Getting top 50 popular Thriller movies
Getting top 50 popular War movies
Getting top 50 popular Western movies
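
Since `get_movies_data` returns nothing when a request fails, the loop can also be written to collect the per-genre dataframes in a list, skip the missing ones explicitly, and concatenate once at the end. The sketch below illustrates that pattern without network access; `fake_get_movies_data` and the placeholder URLs are hypothetical stand-ins, not part of the project.

```python
import pandas as pd

def fake_get_movies_data(genre, genre_url):
    """Hypothetical stand-in for get_movies_data that needs no network access."""
    if genre_url is None:
        return None  # mimics the early return on a failed request
    return pd.DataFrame({'Genre': genre,
                         'Title': [genre + ' movie 1', genre + ' movie 2']})

# Placeholder (genre, URL) pairs; None simulates a failed request
genre_pairs = [('Action', 'url-1'), ('Comedy', None), ('Drama', 'url-3')]

frames = []
for genre, url in genre_pairs:
    df = fake_get_movies_data(genre, url)
    if df is not None:  # skip genres whose request failed
        frames.append(df)

# One concatenation at the end instead of one per iteration
all_movies = pd.concat(frames, ignore_index=True)
print(all_movies)
```

Concatenating once avoids copying the growing dataframe on every iteration, which matters little for 24 genres but scales better.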


In [36]:
movies_df

|  | Genre | Title | Release Year | Adult Rating | Duration in mins | Genres Tagged | IMDB Rating | Summary | Directors | Stars |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Action | Prey | 2022 | R | 99 | Action, Drama, Horror | 7.3 | The origin story of the Predator in the world ... | Dan Trachtenberg | Amber Midthunder, Dakota Beavers, Dane DiLiegr... |
| 1 | Action | The Gray Man | 2022 | UA 16+ | 122 | Action, Thriller | 6.5 | When the CIA's most skilled operative-whose tr... | Anthony Russo, Joe Russo | Ryan Gosling, Chris Evans, Ana de Armas, Billy... |
| 2 | Action | Bullet Train | 2022 | A | 126 | Action, Comedy, Thriller | 7.5 | Five assassins aboard a fast moving bullet tra... | David Leitch | Brad Pitt, Joey King, Aaron Taylor-Johnson, Br... |
| 3 | Action | Batgirl |  |  |  | Action, Adventure, Crime |  | Based upon the popular DC character, Barbara G... | Adil El Arbi, Bilall Fallah | Leslie Grace, J.K. Simmons, Michael Keaton, Br... |
| 4 | Action | Thor: Love and Thunder | 2022 | UA | 118 | Action, Adventure, Comedy | 6.7 | Thor enlists the help of Valkyrie, Korg and ex... | Taika Waititi | Chris Hemsworth, Natalie Portman, Christian Ba... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1195 | Western | Wild Wild West | 1999 | A | 106 | Action, Comedy, Sci-Fi | 4.9 | The two best special agents in the Wild West m... | Barry Sonnenfeld | Will Smith, Kevin Kline, Kenneth Branagh, Salm... |
| 1196 | Western | Young Guns II | 1990 | A | 104 | Action, Western | 6.5 | In 1881, cattle baron John Chisum pays a bount... | Geoff Murphy | Emilio Estevez, Kiefer Sutherland, Lou Diamond... |
| 1197 | Western | The Devil's Rejects | 2005 | A | 107 | Crime, Horror, Western | 6.7 | The murderous, backwoods Firefly family take t... | Rob Zombie | Sid Haig, Sheri Moon Zombie, Bill Moseley, Wil... |
| 1198 | Western | Three Amigos! | 1986 | U | 104 | Comedy, Western | 6.5 | Three actors accept an invitation to a Mexican... | John Landis | Steve Martin, Chevy Chase, Martin Short, Alfon... |


## Create a CSV file

Create a CSV file containing the top 50 popular movies in each genre

In [37]:
movies_df.to_csv('Popular_IMDB_movies_by_genre.csv')
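
By default, `to_csv` also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below compares the default with `index=False` on a small illustrative dataframe, using in-memory buffers instead of files; whether to keep the index is a design choice for this project.

```python
import io
import pandas as pd

# Small illustrative dataframe standing in for movies_df
df = pd.DataFrame({'Genre': ['Action', 'Western'],
                   'Title': ['Prey', 'Three Amigos!']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```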