# IMDb-top-1000-movies-scraping-project



## Project Outline

- We're going to scrape https://www.imdb.com/search/title/?groups=top_1000&count=250&start=000&ref_=adv_nxt


- We'll get a list of the top 1000 movies. For each of them, we'll get:
    - movie title
    - movie duration
    - release year
    - genre
    - rating
    - movie page URL
    
    
- We'll create a CSV file in the following format:

        title,duration,release_year,genre,rating,url
        Top Gun: Maverick,130,2022,"Action, Drama",8.6,https://www.imdb.com/title/tt1745960/

## Install and import necessary modules and libraries

In [1]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

## Scrape list of top 1000 movies

- Use `requests` to download each page.
- Use Beautiful Soup (`bs4`) to parse the HTML and extract the fields.
- Convert the results to a pandas DataFrame.

1. Download page

In [2]:
def get_movies_page(url):
    """Download the page at `url` and return it as a parsed BeautifulSoup document.

    Raises an exception if the response status is not 200."""

    response = requests.get(url)

    # Anything other than 200 OK means the page was not loaded correctly
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))

    # Parse the HTML text so that tags can be searched
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [3]:
url = 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=000&ref_=adv_nxt'
doc = get_movies_page(url)
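
A request without a timeout can hang indefinitely, and a site may serve different content depending on the request headers. The sketch below shows one way the download step could set headers once and fail fast. The helper names `make_session` and `get_movies_page_with_session`, the User-Agent string, and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def make_session(user_agent="imdb-top-1000-scraper (educational project)"):
    """Hypothetical helper: a Session that sends the same headers on every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent,             # illustrative value, an assumption
        "Accept-Language": "en-US,en;q=0.8",  # ask for English-language content
    })
    return session


def get_movies_page_with_session(session, url, timeout=10):
    """Download `url` and return the parsed HTML, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


session = make_session()
print(session.headers["Accept-Language"])
```

Reusing one session also keeps the underlying connection open across the four result pages.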

2. Parse info from downloaded page

Find specific tags

<img src='https://i.imgur.com/3dUYQys.png' width = 2000 height = 100>

In [4]:
doc.find('title')

<title>IMDb "Top 1000"
(Sorted by IMDb Rating Descending) - IMDb</title>

In [5]:
doc.find('h3')

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>

## Title

In [6]:
title_selector = 'lister-item-header'
title_tags = doc.find_all('h3',{'class' : title_selector})
len(title_tags)

250

In [7]:
title_tags[0].find_all('a')

[<a href="/title/tt0111161/">The Shawshank Redemption</a>]

In [8]:
movie_titles = []

for tag in title_tags:
    title = tag.find('a').text
    # print (title)
    movie_titles.append(title)
#len(movie_titles)
movie_titles[:4]

['The Shawshank Redemption',
 'The Godfather',
 'Rocketry: The Nambi Effect',
 'The Dark Knight']

In [9]:
def get_movie_titles(doc):
    title_selector = 'lister-item-header'
    title_tags = doc.find_all('h3',{'class': title_selector})
    
    movie_titles = []
    for tag in title_tags:
        title = tag.find('a').text       
        movie_titles.append(title)
    return movie_titles

In [10]:
get_movie_titles(doc)[:4]

['The Shawshank Redemption',
 'The Godfather',
 'Rocketry: The Nambi Effect',
 'The Dark Knight']
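
The same titles can also be reached with a CSS selector instead of `find_all` plus a nested `find`. A minimal sketch on the `<h3>` snippet shown earlier (the variable names `sample_html` and `sample_doc` are illustrative):

```python
from bs4 import BeautifulSoup

# The header markup printed by doc.find('h3') above
sample_html = """
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# 'h3.lister-item-header > a' selects <a> tags that are direct children of the header
titles = [a.get_text(strip=True) for a in sample_doc.select("h3.lister-item-header > a")]
print(titles)
```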

## Duration tag

In [11]:
dur_selector = 'runtime'
duration_tags =  doc.find_all('span',{'class' : dur_selector})

In [12]:
len(duration_tags)

250

In [13]:
duration_tags[:4]

[<span class="runtime">142 min</span>,
 <span class="runtime">175 min</span>,
 <span class="runtime">157 min</span>,
 <span class="runtime">152 min</span>]

In [14]:
movie_duration = []
for time in duration_tags:
    runtime = time.text[:-4]
    movie_duration.append(runtime)
movie_duration[:4]

['142', '175', '157', '152']

In [15]:
def get_runtime(doc):
    
    dur_selector = 'runtime'
    duration_tags =  doc.find_all('span',{'class' : dur_selector})
    
    movie_duration = []
    for time in duration_tags:
        runtime = time.text[:-4]
        movie_duration.append(runtime)
    return movie_duration
    

In [16]:
get_runtime(doc)[:4]

['142', '175', '157', '152']
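
Slicing off the last four characters works as long as every runtime ends in `' min'` and leaves the value as a string. As a hedged alternative, a regular expression can pull out the number and convert it to an integer. `parse_runtime` is a hypothetical helper name:

```python
import re


def parse_runtime(text):
    """Return the runtime in minutes as an int, or None if no number is found."""
    match = re.search(r"(\d+)\s*min", text)
    return int(match.group(1)) if match else None


print([parse_runtime(t) for t in ["142 min", "175 min", ""]])
```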

## Genre tag

In [17]:
genre_selector = 'genre'
genre_tags = doc.find_all('span',{'class':genre_selector})
genre_tags[:4]

[<span class="genre">
 Drama            </span>,
 <span class="genre">
 Crime, Drama            </span>,
 <span class="genre">
 Biography, Drama            </span>,
 <span class="genre">
 Action, Crime, Drama            </span>]

In [18]:
def get_genre(doc):
    genre_selector = 'genre'
    genre_tags = doc.find_all('span',{'class':genre_selector})
    
    movie_genre = []
    for genre in genre_tags:   
        movie_genre.append(genre.text.strip())
    return movie_genre

In [19]:
get_genre(doc)[:4]

['Drama', 'Crime, Drama', 'Biography, Drama', 'Action, Crime, Drama']
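
Each genre value is a single comma-separated string. If the individual genres are needed later, the string can be split into a list. A small sketch, where `split_genres` is a hypothetical helper:

```python
def split_genres(raw_text):
    """Split a raw genre string into a list of clean genre names."""
    return [part.strip() for part in raw_text.split(",") if part.strip()]


# Raw text as it appears inside a <span class="genre"> tag, including padding
raw = "\nAction, Crime, Drama            "
print(split_genres(raw))
```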

## Rating 

In [20]:
rate_sel = 'inline-block ratings-imdb-rating'
rate_tags = doc.find_all('div',{'class':rate_sel})
rate_tags[:4]    

[<div class="inline-block ratings-imdb-rating" data-value="9.3" name="ir">
 <span class="global-sprite rating-star imdb-rating"></span>
 <strong>9.3</strong>
 </div>,
 <div class="inline-block ratings-imdb-rating" data-value="9.2" name="ir">
 <span class="global-sprite rating-star imdb-rating"></span>
 <strong>9.2</strong>
 </div>,
 <div class="inline-block ratings-imdb-rating" data-value="9.1" name="ir">
 <span class="global-sprite rating-star imdb-rating"></span>
 <strong>9.1</strong>
 </div>,
 <div class="inline-block ratings-imdb-rating" data-value="9" name="ir">
 <span class="global-sprite rating-star imdb-rating"></span>
 <strong>9.0</strong>
 </div>]

In [21]:
rates = []
for rate in rate_tags:
    rates.append(rate.text.strip())
rates[:4]

['9.3', '9.2', '9.1', '9.0']

In [22]:
def get_rating(doc):
    rate_sel = 'inline-block ratings-imdb-rating'
    rate_tags = doc.find_all('div',{'class':rate_sel})
    
    rates = []
    for rate in rate_tags:
        rates.append(rate.text.strip())
    return rates

In [23]:
get_rating(doc)[:4]

['9.3', '9.2', '9.1', '9.0']
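
The rating `<div>` also carries the score in its `data-value` attribute, so the number can be read from there and converted to `float` directly. A sketch on two of the tags printed above (variable names are illustrative):

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="inline-block ratings-imdb-rating" data-value="9.3" name="ir">
<span class="global-sprite rating-star imdb-rating"></span>
<strong>9.3</strong>
</div>
<div class="inline-block ratings-imdb-rating" data-value="9" name="ir">
<span class="global-sprite rating-star imdb-rating"></span>
<strong>9.0</strong>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Read the attribute instead of the tag text and convert it to a number
ratings = [float(div["data-value"]) for div in sample_doc.select("div.ratings-imdb-rating")]
print(ratings)
```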

## Url tag

    

In [24]:
url_selec = 'lister-item-header'
url_tags = doc.find_all('h3', {'class': url_selec})
url_tags[0]

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>

In [25]:
website = 'https://www.imdb.com'
urls = []
for tag in url_tags:
    url = tag.find('a')['href']
    urls.append(website + url)
urls[:2]

['https://www.imdb.com/title/tt0111161/',
 'https://www.imdb.com/title/tt0068646/']

In [26]:
def get_url(doc):    
    url_selec = 'lister-item-header'
    url_tags = doc.find_all('h3', {'class': url_selec})    

    website = 'https://www.imdb.com'
    urls = []
    for tag in url_tags:
        url = tag.find('a')['href']
        urls.append(website + url)
    return urls

In [27]:
get_url(doc)[:4]

['https://www.imdb.com/title/tt0111161/',
 'https://www.imdb.com/title/tt0068646/',
 'https://www.imdb.com/title/tt9263550/',
 'https://www.imdb.com/title/tt0468569/']
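
String concatenation works when every `href` is a clean relative path. If an `href` carries a tracking query string such as `?ref_=adv_li_tt` (as in the example row of the project outline), the standard library can join and clean it. `absolute_movie_url` is a hypothetical helper:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"


def absolute_movie_url(href):
    """Join a relative href with the site root and drop query string and fragment."""
    absolute = urljoin(BASE_URL, href)
    parts = urlsplit(absolute)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(absolute_movie_url("/title/tt1745960/?ref_=adv_li_tt"))
```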

## Year tag

In [28]:
year_selector = 'lister-item-year text-muted unbold'
year_tags = doc.find_all('span',{'class' : year_selector})
len(year_tags)

250

In [29]:
def get_year(doc):
    year_selector = 'lister-item-year text-muted unbold'
    year_tags = doc.find_all('span',{'class' : year_selector})
    
    years = []
    for tag in year_tags:
        years.append(tag.text.strip()[-5:-1])
    return years

In [30]:
get_year(doc)[:4]

['1994', '1972', '2022', '2008']
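
The `[-5:-1]` slice takes the four characters before the closing parenthesis. A regular expression expresses the same intent more explicitly and also copes with labels that carry an extra prefix, such as `(I) (2019)`. A sketch with a hypothetical `parse_year` helper:

```python
import re


def parse_year(text):
    """Return the four-digit year found in parentheses as an int, or None."""
    match = re.search(r"\((\d{4})\)", text)
    return int(match.group(1)) if match else None


print([parse_year(t) for t in ["(1994)", "(I) (2019)", ""]])
```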

## Go to next pages

In [31]:
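def visualize_250movies_per_page():
    """Build the URLs of the four result pages (250 movies each)."""
    urls=[]
    # the `start` parameter is 1-based: 1, 251, 501, 751
    for num in range(1,1000,250):
        urls.append('https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start='+ str(num)+ '&ref_=adv_nxt')
    return urls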

In [32]:
visualize_250movies_per_page()

['https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=1&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=251&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=501&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=751&ref_=adv_nxt']

In [33]:
url_250 = visualize_250movies_per_page()
len(url_250)

4

In [34]:
for i in range(len(url_250)):
    print(i)
    print(url_250[i])

0
https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=1&ref_=adv_nxt
1
https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=251&ref_=adv_nxt
2
https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=501&ref_=adv_nxt
3
https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=751&ref_=adv_nxt
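
The page URLs can also be assembled from a parameter dictionary rather than by string concatenation, which makes each query parameter easier to read and change. A sketch using only the standard library; `build_page_urls` and `SEARCH_URL` are hypothetical names:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.imdb.com/search/title/"


def build_page_urls(total=1000, per_page=250):
    """Return one search URL per result page; `start` is 1-based."""
    urls = []
    for start in range(1, total + 1, per_page):
        params = {
            "groups": "top_1000",
            "sort": "user_rating,desc",
            "count": per_page,
            "start": start,
            "ref_": "adv_nxt",
        }
        # safe="," keeps the comma in the sort value unescaped
        urls.append(SEARCH_URL + "?" + urlencode(params, safe=","))
    return urls


for page_url in build_page_urls():
    print(page_url)
```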


## Create dataframe

In [35]:
req = requests.get('https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=250&start=1&ref_=adv_nxt')
doc = BeautifulSoup(req.text, 'html.parser')

In [36]:
doc.find('h3')

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>

In [47]:
def movie_to_df():
    """Scrape every page in `url_250` and return all movies as one DataFrame."""
    # lists that accumulate the data of all movies, one list per column
    movies_dict = {
    'title': [],
    'duration': [],
    'release_year':[],
    'genre': [],
    'rating': [],
    'url':[]
    }

    for page_url in url_250:
        # reuse the download helper so a failed request raises an exception
        doc = get_movies_page(page_url)

        movies_dict['title'] += get_movie_titles(doc)
        movies_dict['duration'] += get_runtime(doc)
        movies_dict['release_year'] += get_year(doc)
        movies_dict['genre'] += get_genre(doc)
        movies_dict['rating'] += get_rating(doc)
        movies_dict['url'] += get_url(doc)

    # build the DataFrame once, after all pages have been collected
    movies_df = pd.DataFrame(movies_dict)

    return movies_df
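
Each column list is collected independently, so if a single movie on a page lacks a field (a runtime, for instance), the lists end up with different lengths and the DataFrame cannot be built. One way around this is to parse one movie container at a time and record `None` for missing fields. The sketch below assumes every movie sits inside a `div` with class `lister-item-content`. That container class, the second sample movie, and the helper names are assumptions for illustration and are not shown in the original notebook.

```python
from bs4 import BeautifulSoup

# Simplified sample markup; the container class is an assumption
sample_html = """
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="/title/tt0111161/">The Shawshank Redemption</a></h3>
  <span class="runtime">142 min</span>
  <span class="genre">Drama</span>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="/title/tt0000000/">Hypothetical Movie Without Runtime</a></h3>
  <span class="genre">Drama</span>
</div>
"""


def text_or_none(container, selector):
    """Return the stripped text of the first match inside `container`, or None."""
    tag = container.select_one(selector)
    return tag.get_text(strip=True) if tag else None


def parse_movies(doc):
    """Build one dict per movie so that fields from the same movie stay together."""
    rows = []
    for item in doc.select("div.lister-item-content"):
        rows.append({
            "title": text_or_none(item, "h3.lister-item-header a"),
            "duration": text_or_none(item, "span.runtime"),
            "genre": text_or_none(item, "span.genre"),
        })
    return rows


rows = parse_movies(BeautifulSoup(sample_html, "html.parser"))
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame`, and missing values then show up as empty cells instead of shifting the columns.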

In [48]:
movie_to_df().tail(4)

| | title | duration | release_year | genre | rating | url |
|---|---|---|---|---|---|---|
| 996 | From Here to Eternity | 118 | 1953 | Drama, Romance, War | 7.6 | https://www.imdb.com/title/tt0045793/ |
| 997 | Snow White and the Seven Dwarfs | 83 | 1937 | Animation, Adventure, Family | 7.6 | https://www.imdb.com/title/tt0029583/ |
| 998 | The 39 Steps | 86 | 1935 | Crime, Mystery, Thriller | 7.6 | https://www.imdb.com/title/tt0026029/ |
| 999 | The Invisible Man | 71 | 1933 | Horror, Sci-Fi | 7.6 | https://www.imdb.com/title/tt0024184/ |


## Create csv

In [49]:
movies = movie_to_df()
movies.to_csv('movies.csv', index=False)  # do not write the row index as a column

# read the file back to check what was saved
movies_df = pd.read_csv('movies.csv')
movies_df.head(4)

| | title | duration | release_year | genre | rating | url |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 142 | 1994 | Drama | 9.3 | https://www.imdb.com/title/tt0111161/ |
| 1 | The Godfather | 175 | 1972 | Crime, Drama | 9.2 | https://www.imdb.com/title/tt0068646/ |
| 2 | Rocketry: The Nambi Effect | 157 | 2022 | Biography, Drama | 9.1 | https://www.imdb.com/title/tt9263550/ |
| 3 | The Dark Knight | 152 | 2008 | Action, Crime, Drama | 9.0 | https://www.imdb.com/title/tt0468569/ |
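
Genre values contain commas, so they have to be quoted in the CSV file. pandas' `to_csv` does this by default. The sketch below shows the same behaviour with the standard `csv` module on one row from the table above, to make the quoting visible:

```python
import csv
import io

rows = [
    ["title", "duration", "release_year", "genre", "rating", "url"],
    ["The Godfather", 175, 1972, "Crime, Drama", 9.2, "https://www.imdb.com/title/tt0068646/"],
]

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back yields the genre as one field, not two
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][3])
```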


## Quick data analysis of the scraped data

In [50]:
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

os.getcwd()

'/home/jovyan'

In [51]:
movie_df = pd.read_csv('movies.csv')
movie_df.describe()

| | duration | release_year | rating |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 123.657 | 1991.274 | 7.9646 |
| std | 28.52506 | 23.955308 | 0.278504 |
| min | 45.0 | 1920.0 | 7.6 |
| 25% | 103.0 | 1975.0 | 7.7 |
| 50% | 120.0 | 1999.0 | 7.9 |
| 75% | 138.0 | 2010.0 | 8.1 |
| max | 321.0 | 2022.0 | 9.3 |


In [52]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         1000 non-null   object 
 1   duration      1000 non-null   int64  
 2   release_year  1000 non-null   int64  
 3   genre         1000 non-null   object 
 4   rating        1000 non-null   float64
 5   url           1000 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 47.0+ KB


### Did movie ratings increase over the years?

In [69]:
!pip install plotly --quiet
import plotly.express as px

fig = px.histogram(movie_df,                    
                   x = "release_year",
                   title = "Distribution of movies throughout the years") 
fig.update_layout(bargap=0.2)

In [66]:
fig = px.histogram(movie_df, 
                   y = "rating",
                   x = "release_year",
                   histfunc = "avg",  # average rating per bin; the default would sum the ratings
                   title = "Average movie rating throughout the years") 
fig.update_layout(bargap=0.2)
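
A plot answers the question visually; a grouped table gives the numbers behind it. A minimal sketch that averages ratings per decade, using six rows taken from the tables in this README as stand-in data (the `decade` column and the variable names are illustrative):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight",
              "The Lord of the Rings: The Return of the King", "Schindler's List",
              "From Here to Eternity"],
    "release_year": [1994, 1972, 2008, 2003, 1993, 1953],
    "rating": [9.3, 9.2, 9.0, 9.0, 9.0, 7.6],
})

# Integer division maps every year to the first year of its decade
sample_df["decade"] = (sample_df["release_year"] // 10) * 10

# Mean rating and number of movies per decade
rating_by_decade = sample_df.groupby("decade")["rating"].agg(["mean", "count"])
print(rating_by_decade)
```

The `count` column matters when reading the result: a decade with very few movies in the list gives a less reliable average.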

#### Movie with lowest and highest rating 

In [81]:
idx_min = movie_df.rating.idxmin()
idx_max = movie_df.rating.idxmax()

In [92]:
print(f"Movie with lowest rating amongst the 1000 top rated movies: {movie_df.title.loc[idx_min]} - with rating of {movie_df.rating.loc[idx_min]}")

Movie with lowest rating amongst the 1000 top rated movies: Dark Waters - with rating of 7.6


In [93]:
print(f"Movie with highest rating amongst the 1000 top rated movies: {movie_df.title.loc[idx_max]} - with rating of {movie_df.rating.loc[idx_max]}")

Movie with highest rating amongst the 1000 top rated movies: The Shawshank Redemption - with rating of 9.3
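
`idxmin` and `idxmax` return only the first matching row, and many movies share the lowest rating of 7.6. To list every movie tied at the minimum, the rating column can be compared against its own minimum. A sketch on a few stand-in rows (the sample data and variable names are illustrative):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "title": ["The Shawshank Redemption", "Dark Waters", "Searching", "The Godfather"],
    "rating": [9.3, 7.6, 7.6, 9.2],
})

# Keep every row whose rating equals the column minimum
lowest = sample_df[sample_df["rating"] == sample_df["rating"].min()]
print(lowest["title"].tolist())
```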


### What were the most frequent genre combinations?

In [54]:
movie_df['genre'].value_counts()

Drama                         83
Drama, Romance                36
Comedy, Drama                 35
Comedy, Drama, Romance        33
Action, Crime, Drama          31
                              ..
Drama, Fantasy, Mystery        1
Action, Sci-Fi, Thriller       1
Action, Adventure, Mystery     1
Mystery, Romance, Thriller     1
Action, Crime, Sci-Fi          1
Name: genre, Length: 202, dtype: int64

In [55]:
movie_df.genre.mode()
# same result with: movie_df['genre'].value_counts().idxmax()

0    Drama
dtype: object

In [52]:
from itertools import combinations
from collections import Counter

count = Counter()

for row in movie_df.genre:
    # split on ', ' so the genre names carry no leading space
    row_list = row.split(', ')
    # count every combination of three genres in this row
    count.update(combinations(row_list, 3))

for key,value in count.most_common(15):
    print(key, value)

('Comedy', 'Drama', 'Romance') 34
('Action', 'Crime', 'Drama') 31
('Animation', 'Adventure', 'Comedy') 30
('Crime', 'Drama', 'Mystery') 29
('Crime', 'Drama', 'Thriller') 29
('Biography', 'Drama', 'History') 25
('Action', 'Adventure', 'Sci-Fi') 20
('Action', 'Adventure', 'Drama') 15
('Biography', 'Crime', 'Drama') 14
('Animation', 'Action', 'Adventure') 14
('Comedy', 'Crime', 'Drama') 12
('Action', 'Crime', 'Thriller') 11
('Action', 'Adventure', 'Comedy') 11
('Drama', 'Mystery', 'Thriller') 9
('Action', 'Adventure', 'Fantasy') 8
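
The loop above counts only triples, so movies with one or two genres contribute nothing to it. Counting each genre on its own complements that view. A sketch on the four genre strings shown earlier (`sample_genres` and `genre_counts` are illustrative names):

```python
from collections import Counter

sample_genres = ["Drama", "Crime, Drama", "Biography, Drama", "Action, Crime, Drama"]

genre_counts = Counter()
for row in sample_genres:
    # every genre in the row counts once, however many genres the movie has
    genre_counts.update(row.split(", "))

print(genre_counts.most_common(2))
```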


### Movies with highest rating (above 8.5 and below 8.9)

In [53]:
# build the mask from the unsorted column so it stays aligned with the DataFrame index
movie_df[(movie_df.rating > 8.5) & (movie_df.rating < 8.9)][['title', 'genre', 'rating']]

| | title | genre | rating |
|---|---|---|---|
| 9 | Inception | Action, Adventure, Sci-Fi | 8.8 |
| 10 | The Lord of the Rings: The Two Towers | Action, Adventure, Drama | 8.8 |
| 11 | Fight Club | Drama | 8.8 |
| 12 | The Lord of the Rings: The Fellowship of the Ring | Action, Adventure, Drama | 8.8 |
| 13 | Forrest Gump | Drama, Romance | 8.8 |
| 14 | The Good, the Bad and the Ugly | Adventure, Western | 8.8 |
| 15 | Soorarai Pottru | Drama | 8.7 |
| 16 | The Matrix | Action, Sci-Fi | 8.7 |
| 17 | Goodfellas | Biography, Crime, Drama | 8.7 |
| 18 | Star Wars: Episode V - The Empire Strikes Back | Action, Adventure, Fantasy | 8.7 |


### A different way to filter pandas df: `nlargest` & `nsmallest` methods

In [55]:
movie_df.nlargest(5,'rating')

| | title | duration | release_year | genre | rating | url |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 142 | 1994 | Drama | 9.3 | https://www.imdb.com/title/tt0111161/ |
| 1 | The Godfather | 175 | 1972 | Crime, Drama | 9.2 | https://www.imdb.com/title/tt0068646/ |
| 2 | The Dark Knight | 152 | 2008 | Action, Crime, Drama | 9.0 | https://www.imdb.com/title/tt0468569/ |
| 3 | The Lord of the Rings: The Return of the King | 201 | 2003 | Action, Adventure, Drama | 9.0 | https://www.imdb.com/title/tt0167260/ |
| 4 | Schindler's List | 195 | 1993 | Biography, Drama, History | 9.0 | https://www.imdb.com/title/tt0108052/ |


In [56]:
movie_df.nsmallest(5,'rating')

| | title | duration | release_year | genre | rating | url |
|---|---|---|---|---|---|---|
| 910 | Dark Waters | 126 | 2019 | Biography, Drama, History | 7.6 | https://www.imdb.com/title/tt9071322/ |
| 911 | The Mitchells vs the Machines | 114 | 2021 | Animation, Adventure, Comedy | 7.6 | https://www.imdb.com/title/tt7979580/ |
| 912 | Searching | 102 | 2018 | Drama, Horror, Mystery | 7.6 | https://www.imdb.com/title/tt7668870/ |
| 913 | Once Upon a Time... In Hollywood | 161 | 2019 | Comedy, Drama | 7.6 | https://www.imdb.com/title/tt7131622/ |
| 914 | Guardians of the Galaxy Vol. 2 | 136 | 2017 | Action, Adventure, Comedy | 7.6 | https://www.imdb.com/title/tt3896198/ |


### Longest films, found by sorting values

In [59]:
movie_df.sort_values(by='duration', ascending = False)[:5]

| | title | duration | release_year | genre | rating | url |
|---|---|---|---|---|---|---|
| 149 | Gangs of Wasseypur | 321 | 2012 | Action, Comedy, Crime | 8.2 | https://www.imdb.com/title/tt1954470/ |
| 678 | Hamlet | 242 | 1996 | Drama | 7.8 | https://www.imdb.com/title/tt0116477/ |
| 360 | Zack Snyder's Justice League | 242 | 2021 | Action, Adventure, Fantasy | 8.0 | https://www.imdb.com/title/tt12361974/ |
| 202 | Gone with the Wind | 238 | 1939 | Drama, Romance, War | 8.2 | https://www.imdb.com/title/tt0031381/ |
| 115 | Once Upon a Time in America | 229 | 1984 | Crime, Drama | 8.3 | https://www.imdb.com/title/tt0087843/ |


In [100]:
# one bar per movie, so a bar chart fits better than a histogram
fig = px.bar(movie_df.sort_values(by='duration', ascending = False)[:10], 
             y = "duration",
             x = "title",
             title = "The 10 longest movies in the top 1000 (duration in minutes)",
             color = 'duration') 
fig.update_layout(bargap=0.2)

### Movies released after 2015 filtered by using `query()`

In [62]:
movie_df.query('release_year > 2015').sort_values(by = 'duration')[:10]

| | title | duration | release_year | genre | rating | url |
|---|---|---|---|---|---|---|
| 611 | Loving Vincent | 94 | 2017 | Animation, Biography, Crime | 7.8 | https://www.imdb.com/title/tt3262342/ |
| 748 | Perfect Strangers | 96 | 2016 | Comedy, Drama | 7.7 | https://www.imdb.com/title/tt4901306/ |
| 212 | Klaus | 96 | 2019 | Animation, Adventure, Comedy | 8.1 | https://www.imdb.com/title/tt4729430/ |
| 157 | The Father | 97 | 2020 | Drama, Mystery | 8.2 | https://www.imdb.com/title/tt10272386/ |
| 594 | Dragon Ball Super: Broly | 100 | 2018 | Animation, Action, Adventure | 7.8 | https://www.imdb.com/title/tt7961060/ |
| 344 | Soul | 100 | 2020 | Animation, Adventure, Comedy | 8.0 | https://www.imdb.com/title/tt2948372/ |
| 597 | I, Daniel Blake | 100 | 2016 | Drama | 7.8 | https://www.imdb.com/title/tt5168192/ |
| 763 | Toy Story 4 | 100 | 2019 | Animation, Adventure, Comedy | 7.7 | https://www.imdb.com/title/tt1979376/ |
| 335 | Quo Vadis, Aida? | 101 | 2020 | Drama, History, War | 8.0 | https://www.imdb.com/title/tt8633462/ |
| 598 | Isle of Dogs | 101 | 2018 | Animation, Adventure, Comedy | 7.8 | https://www.imdb.com/title/tt5104604/ |
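
`query()` can also reference Python variables with the `@` prefix, which avoids hard-coding thresholds in the query string. A sketch on a few stand-in rows from the tables above (`min_year`, `max_minutes` and `sample_df` are illustrative names):

```python
import pandas as pd

sample_df = pd.DataFrame({
    "title": ["Loving Vincent", "Klaus", "Hamlet", "Gone with the Wind"],
    "duration": [94, 96, 242, 238],
    "release_year": [2017, 2019, 1996, 1939],
})

min_year = 2015
max_minutes = 100

# @name looks up the Python variable with that name
recent_short = sample_df.query("release_year > @min_year and duration <= @max_minutes")
print(recent_short["title"].tolist())
```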
