## Scraping Popular Movies on themoviedb.org



#### Project Outline

- We're going to scrape https://www.themoviedb.org/movie
- First, we will get the names of the first 50 popular movies
- For each movie, we will get the following information:
  - Movie Name, User Score(%), Time, Date of Release, Genre, Director, Overview
- Lastly, we will create a CSV file in the following format:

```
Movie Name,User Score(%),Time,Date of Release,Director,Genre,Overview
Guardians of the Galaxy Vol. 3,81,2h 30m,06/09/2023,James Gunn,"Science Fiction, Adventure, Action","Peter Quill, still..."
```
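
The Genre and Overview fields contain commas, so they have to be quoted in the CSV. A minimal sketch of how that quoting works, using the standard `csv` module rather than pandas so it runs without extra packages; the row values are the illustrative ones from this notebook, and the variable names are assumptions.

```python
import csv
import io

# Column layout taken from the outline above.
fieldnames = ['Movie Name', 'User Score(%)', 'Time', 'Date of Release',
              'Director', 'Genre', 'Overview']

# One illustrative row; the overview is shortened on purpose.
row = {
    'Movie Name': 'Guardians of the Galaxy Vol. 3',
    'User Score(%)': 81,
    'Time': '2h 30m',
    'Date of Release': '05/05/2023',
    'Director': 'James Gunn',
    'Genre': 'Science Fiction, Adventure, Action',
    'Overview': 'Peter Quill, still...',
}

# Write to an in-memory buffer; fields containing commas get quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()

# Reading the text back recovers the original field, commas included.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```

`pandas.DataFrame.to_csv` applies the same minimal quoting by default, so the file written later in this notebook should follow the same pattern.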

### Use the requests library to download web pages

In [4]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import random

In [5]:
HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}

base_url = 'https://www.themoviedb.org/movie?page=1'

response = requests.get(base_url, headers=HEADERS)

In [6]:
response.status_code

200

In [7]:
len(response.text)

207780

### Use Beautiful Soup to parse and extract information

In [8]:
contents = BeautifulSoup(response.text,'html.parser')

In [9]:
type(contents)

bs4.BeautifulSoup

In [10]:
# 'all_movies_contents' holds the <a> tags that have the class 'image'
class_selection = 'image'

all_movies_contents = contents.find_all('a',class_=class_selection)
all_movies_contents[:3]

[<a class="image" href="/movie/447365" title="Guardians of the Galaxy Vol. 3">
 <img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/r2J02Z2OpNTctfOSN1Ydgii51I3.jpg" srcset="/t/p/w220_and_h330_face/r2J02Z2OpNTctfOSN1Ydgii51I3.jpg 1x, /t/p/w440_and_h660_face/r2J02Z2OpNTctfOSN1Ydgii51I3.jpg 2x"/>
 </a>,
 <a class="image" href="/movie/667538" title="Transformers: Rise of the Beasts">
 <img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/gPbM0MK8CP8A174rmUwGsADNYKD.jpg" srcset="/t/p/w220_and_h330_face/gPbM0MK8CP8A174rmUwGsADNYKD.jpg 1x, /t/p/w440_and_h660_face/gPbM0MK8CP8A174rmUwGsADNYKD.jpg 2x"/>
 </a>,
 <a class="image" href="/movie/385687" title="Fast X">
 <img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/fiVW06jE7z9YnO4trhaMEdclSiC.jpg" srcset="/t/p/w220_and_h330_face/fiVW06jE7z9YnO4trhaMEdclSiC.jpg 1x, /t/p/w440_and_h660_face/fiVW06jE7z9YnO4trhaMEdclSiC.jpg 2x"/>
 </a>]

In [11]:
len(all_movies_contents)

20
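
One listing page returned 20 entries, so covering the first 50 movies from the outline would take three pages. A minimal sketch of building those page URLs, assuming the `?page=N` query parameter keeps working past page 1 and that every page holds 20 movies; `popular_page_urls` is a hypothetical helper, not part of the notebook.

```python
import math

def popular_page_urls(n_movies, per_page=20):
    """Return the listing-page URLs needed to cover n_movies entries."""
    # Assumed URL pattern, extrapolated from the page=1 URL used above.
    base = 'https://www.themoviedb.org/movie?page={}'
    n_pages = math.ceil(n_movies / per_page)
    return [base.format(page) for page in range(1, n_pages + 1)]

urls = popular_page_urls(50)
print(urls)
```

The last page would yield more entries than needed, so the combined list of movies would still have to be cut to the first 50.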

In [12]:
all_movies_contents[0]['href']

'/movie/447365'

In [13]:
all_movies_contents[0]['title']

'Guardians of the Galaxy Vol. 3'
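
The `href` and `title` attributes are all we need from each `<a class="image">` tag. To make that selection logic testable without a network request, here is a rough sketch that runs it on a small hard-coded HTML string. It deliberately swaps Beautiful Soup for the standard library's `html.parser.HTMLParser` so it has no third-party dependency; `PosterLinkParser` and `sample_html` are assumptions made up for this example.

```python
from html.parser import HTMLParser

class PosterLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a class="image"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr.get('class') or '').split()
        if 'image' in classes and 'href' in attr and 'title' in attr:
            self.links.append((attr['title'], attr['href']))

# Hand-written snippet shaped like the tags printed above.
sample_html = '''
<a class="image" href="/movie/447365" title="Guardians of the Galaxy Vol. 3"><img alt=""/></a>
<a class="image" href="/movie/385687" title="Fast X"><img alt=""/></a>
<a class="other" href="/about">About</a>
'''

parser = PosterLinkParser()
parser.feed(sample_html)
poster_links = parser.links
print(poster_links)
```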

In [14]:
movies_name = []
movies_link = []
for i in range(len(all_movies_contents)):
    url = 'https://www.themoviedb.org'
    temp_url = (url + all_movies_contents[i]['href'])
    movies_link.append(temp_url)
    
    movies_name.append(all_movies_contents[i]['title'])
    


In [15]:
movies_link

['https://www.themoviedb.org/movie/447365',
 'https://www.themoviedb.org/movie/667538',
 'https://www.themoviedb.org/movie/385687',
 'https://www.themoviedb.org/movie/455476',
 'https://www.themoviedb.org/movie/445651',
 'https://www.themoviedb.org/movie/678512',
 'https://www.themoviedb.org/movie/254128',
 'https://www.themoviedb.org/movie/569094',
 'https://www.themoviedb.org/movie/603692',
 'https://www.themoviedb.org/movie/406563',
 'https://www.themoviedb.org/movie/1070802',
 'https://www.themoviedb.org/movie/1130818',
 'https://www.themoviedb.org/movie/502356',
 'https://www.themoviedb.org/movie/346698',
 'https://www.themoviedb.org/movie/976573',
 'https://www.themoviedb.org/movie/575264',
 'https://www.themoviedb.org/movie/47964',
 'https://www.themoviedb.org/movie/423108',
 'https://www.themoviedb.org/movie/614479',
 'https://www.themoviedb.org/movie/1103825']

In [16]:
movies_name

['Guardians of the Galaxy Vol. 3',
 'Transformers: Rise of the Beasts',
 'Fast X',
 'Knights of the Zodiac',
 'The Darkest Minds',
 'Sound of Freedom',
 'San Andreas',
 'Spider-Man: Across the Spider-Verse',
 'John Wick: Chapter 4',
 'Insidious: The Last Key',
 'Confidential Informant',
 'Sheroes',
 'The Super Mario Bros. Movie',
 'Barbie',
 'Elemental',
 'Mission: Impossible - Dead Reckoning Part One',
 'A Good Day to Die Hard',
 'The Conjuring: The Devil Made Me Do It',
 'Insidious: The Red Door',
 'War of the Worlds: The Attack']

In [17]:
len(movies_link) == len(movies_name) == 20

True
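
The loop above builds each link by concatenating the site root with the relative `href`. A small alternative sketch uses `urllib.parse.urljoin`, which also copes with an `href` that is already absolute; the names `BASE_URL` and `hrefs` are assumptions for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.themoviedb.org'

# Relative hrefs as they appear in the anchor tags above.
hrefs = ['/movie/447365', '/movie/667538']

# urljoin resolves each href against the site root.
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```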

### Scraping specific information for each movie

In [18]:
movie1 = movies_link[0]
movie1

'https://www.themoviedb.org/movie/447365'

In [19]:
header = { 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

response1 = requests.get(movie1, headers=header)
response1.status_code


200

In [20]:
movie1_contents = BeautifulSoup(response1.text, 'html.parser')

In [21]:
type(movie1_contents)

bs4.BeautifulSoup

In [22]:
class_selection1 = 'user_score_chart'
user_score_tag = movie1_contents.find_all('div',{'class':class_selection1})
user_score_tag[0]

<div class="user_score_chart" data-bar-color="#21d07a" data-percent="81.0" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r81"></span>
</div>
</div>

In [23]:
type(user_score_tag)

bs4.element.ResultSet

In [24]:
user_score1 = user_score_tag[0]['data-percent']
user_score1

'81.0'
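
The score comes back as the string `'81.0'`, while the CSV format in the outline shows a plain `81`. A minimal sketch of the conversion; `score_to_int` is a hypothetical helper, and returning `None` for values that are not numeric is an assumption, since this notebook does not show what an unrated movie looks like.

```python
def score_to_int(raw):
    """Convert a data-percent string such as '81.0' to an integer percentage."""
    try:
        return round(float(raw))
    except (TypeError, ValueError):
        # Assumed fallback for missing or non-numeric values.
        return None

score = score_to_int('81.0')
print(score)
```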

In [25]:
time_tag1 = movie1_contents.find_all('span',class_='runtime')
time_tag1[0]

<span class="runtime">
        2h 30m
    </span>

In [26]:
time1 = time_tag1[0].text.strip()
time1

'2h 30m'
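
A runtime stored as `'2h 30m'` is hard to sort or average. If numeric values are needed later, the text could be converted to minutes along these lines; `runtime_to_minutes` is a hypothetical helper, and the assumption is that runtimes always look like `'2h 30m'`, `'2h'` or `'45m'`.

```python
import re

def runtime_to_minutes(text):
    """Convert a runtime such as '2h 30m' to total minutes, or None if unparseable."""
    match = re.fullmatch(r'\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*', text)
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes

minutes = runtime_to_minutes('2h 30m')
print(minutes)
```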

In [27]:
release_tag1 = movie1_contents.find_all('span',class_='release')
release_tag1[0]

<span class="release">
        05/05/2023 (US)
      </span>

In [28]:
release1 = release_tag1[0].text.strip()
print(release1)
release1 = release1[:10]
release1

05/05/2023 (US)


'05/05/2023'
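
Slicing the first 10 characters works as long as the text always starts with a date of that exact width. A slightly more defensive sketch uses a regular expression and `datetime.strptime`; `parse_release` is a hypothetical helper, and reading the date as month/day/year is an assumption based on the `(US)` suffix.

```python
import re
from datetime import datetime

def parse_release(text):
    """Split '05/05/2023 (US)' into a date object and a region code."""
    match = re.search(r'(\d{2}/\d{2}/\d{4})(?:\s*\((\w+)\))?', text)
    if match is None:
        return None, None
    # Assumed month/day/year order.
    release_date = datetime.strptime(match.group(1), '%m/%d/%Y').date()
    return release_date, match.group(2)

release_date, region = parse_release('05/05/2023 (US)')
print(release_date.isoformat(), region)
```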

In [29]:
genres_tag1 = movie1_contents.find_all('span',class_='genres')
genres_tag1[0]

<span class="genres">
<a href="/genre/878-science-fiction/movie">Science Fiction</a>, <a href="/genre/12-adventure/movie">Adventure</a>, <a href="/genre/28-action/movie">Action</a>
</span>

In [30]:
genres_tag1[0].find_all('a')

[<a href="/genre/878-science-fiction/movie">Science Fiction</a>,
 <a href="/genre/12-adventure/movie">Adventure</a>,
 <a href="/genre/28-action/movie">Action</a>]

In [31]:
genres_tag1[0].text.strip()

'Science Fiction,\xa0Adventure,\xa0Action'

In [32]:
' , '.join( 'Science Fiction,\xa0Adventure,\xa0Action'.split(',\xa0'))

'Science Fiction , Adventure , Action'

In [33]:
genres1 = genres_tag1[0].text.strip()
genres1


'Science Fiction,\xa0Adventure,\xa0Action'

In [34]:
genres1  = ' , '.join(genres1.split(',\xa0'))
genres1

'Science Fiction , Adventure , Action'
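
Splitting on the exact sequence `',\xa0'` relies on the page always putting a non-breaking space after each comma. A sketch that is a little less sensitive to that detail replaces the non-breaking spaces first and then splits on commas; `split_genres` is a hypothetical helper.

```python
def split_genres(text):
    """Turn 'Science Fiction,\xa0Adventure,\xa0Action' into a clean list of genres."""
    cleaned = text.replace('\xa0', ' ')
    return [part.strip() for part in cleaned.split(',') if part.strip()]

genres = split_genres('Science Fiction,\xa0Adventure,\xa0Action')
genres_text = ', '.join(genres)
print(genres_text)
```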

In [35]:
overview_tag1 = movie1_contents.find_all('div',class_ = 'overview')
overview_tag1[0]

<div class="overview" dir="auto">
<p>Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.</p>
</div>

In [36]:
overview_tag1[0].text.strip()

'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.'

In [37]:
overview1 = overview_tag1[0].text.strip()
overview1

'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.'

In [38]:
directors_tag1 = movie1_contents.find_all('li',class_='profile')
directors_tag1

[<li class="profile">
 <p><a href="/person/15218-james-gunn">James Gunn</a></p>
 <p class="character">Director, Writer</p>
 </li>,
 <li class="profile">
 <p><a href="/person/7624-stan-lee">Stan Lee</a></p>
 <p class="character">Characters</p>
 </li>,
 <li class="profile">
 <p><a href="/person/1372782-steve-englehart">Steve Englehart</a></p>
 <p class="character">Characters</p>
 </li>,
 <li class="profile">
 <p><a href="/person/1222509-keith-giffen">Keith Giffen</a></p>
 <p class="character">Characters</p>
 </li>,
 <li class="profile">
 <p><a href="/person/1713975-jim-starlin">Jim Starlin</a></p>
 <p class="character">Characters</p>
 </li>,
 <li class="profile">
 <p><a href="/person/18866-jack-kirby">Jack Kirby</a></p>
 <p class="character">Characters</p>
 </li>,
 <li class="profile">
 <p><a href="/person/18876-larry-lieber">Larry Lieber</a></p>
 <p class="character">Characters</p>
 </li>,
 <li class="profile">
 <p><a href="/person/1768857-bill-mantlo">Bill Mantlo</a></p>
 <p class="cha

In [54]:
for i in directors_tag1:
    temp_p_tag = i.find('p',class_ = 'character')
    temp_a_tag = i.find('a')
    
    if ('Director' in temp_p_tag.text):
        director1 = temp_a_tag.text
        break

In [56]:
type(temp_p_tag)

bs4.element.Tag

In [57]:
temp_p_tag.text

'Director, Writer'

In [58]:
director1

'James Gunn'
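
The check `'Director' in temp_p_tag.text` is a substring test, so it would also match a role label such as "Director of Photography" if one ever appeared in this list. A sketch of a stricter check on plain (name, roles) pairs; `directors_from_crew` is a hypothetical helper, and the third crew entry is invented purely to show the difference.

```python
def directors_from_crew(crew):
    """Return the names whose comma-separated roles include exactly 'Director'."""
    names = []
    for name, roles in crew:
        role_list = [role.strip() for role in roles.split(',')]
        if 'Director' in role_list:
            names.append(name)
    return names

crew = [
    ('James Gunn', 'Director, Writer'),
    ('Stan Lee', 'Characters'),
    ('Example Person', 'Director of Photography'),  # hypothetical entry
]

directors = directors_from_crew(crew)
print(directors)
```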

In [59]:
overview1

'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.'

In [60]:
movies_name[0]

'Guardians of the Galaxy Vol. 3'

In [61]:
genres1

'Science Fiction , Adventure , Action'

In [62]:
user_score1

'81.0'

In [63]:
time1

'2h 30m'

In [64]:
release1

'05/05/2023'

In [70]:
# creating a dictionary for creating a DataFrame
dic = {
    'Movie_Name':movies_name[0],
    'User_Score(%)': user_score1,
    'Time':time1,
    'Date of Release':release1,
    'Director':director1,
    'Genres':genres1,
    'Overview':overview1
}

first_movie_df = pd.DataFrame(dic,index=list(range(0,1)))
first_movie_df

Unnamed: 0,Movie_Name,User_Score(%),Time,Date of Release,Director,Genres,Overview
0,Guardians of the Galaxy Vol. 3,81.0,2h 30m,05/05/2023,James Gunn,"Science Fiction , Adventure , Action","Peter Quill, still reeling from the loss of Ga..."


Now that we have experimented with how to extract the information, we will write the final code using classes.

### Final Code

In [133]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import random

# This Class takes care of the home page
class ExtractHomePage():
    
    def __init__(self,url):
        self.base_url = url
        
        HEADERS = {
            'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
        }
        
        response = requests.get(self.base_url, headers=HEADERS)
        
        # checking if the status code is 200
        self.check_status_code(response)
    
        
        self.contents = BeautifulSoup(response.text,'html.parser')
        
        
    def check_status_code(self,res):
        if res.status_code != 200:
            raise Exception(f'Unable to open {self.base_url}')
            
        else:
            print('Got NO ERROR while SCRAPING\n')
            
    def return_movies_name(self):
        class_selection = 'image'

        # use the soup stored on the instance, not the notebook-level 'contents' variable
        all_movies_contents = self.contents.find_all('a',class_=class_selection)
        
        
        movies_name = []
        for i in range(len(all_movies_contents)):
            movies_name.append(all_movies_contents[i]['title'])
            
        return movies_name
    
    def return_movies_link(self):
        class_selection = 'image'

        # use the soup stored on the instance, not the notebook-level 'contents' variable
        all_movies_contents = self.contents.find_all('a',class_=class_selection)
        
        
        movies_link = []
        for i in range(len(all_movies_contents)):
            url = 'https://www.themoviedb.org'
            temp_url = (url + all_movies_contents[i]['href'])
            movies_link.append(temp_url)
            
        return movies_link
    
        

        
        

In [82]:
home_page = ExtractHomePage('https://www.themoviedb.org/movie?page=1')

Got NO ERROR while SCRAPING


In [98]:
home_page.return_movies_link()

['https://www.themoviedb.org/movie/447365',
 'https://www.themoviedb.org/movie/667538',
 'https://www.themoviedb.org/movie/385687',
 'https://www.themoviedb.org/movie/455476',
 'https://www.themoviedb.org/movie/445651',
 'https://www.themoviedb.org/movie/678512',
 'https://www.themoviedb.org/movie/254128',
 'https://www.themoviedb.org/movie/569094',
 'https://www.themoviedb.org/movie/603692',
 'https://www.themoviedb.org/movie/406563',
 'https://www.themoviedb.org/movie/1070802',
 'https://www.themoviedb.org/movie/1130818',
 'https://www.themoviedb.org/movie/502356',
 'https://www.themoviedb.org/movie/346698',
 'https://www.themoviedb.org/movie/976573',
 'https://www.themoviedb.org/movie/575264',
 'https://www.themoviedb.org/movie/47964',
 'https://www.themoviedb.org/movie/423108',
 'https://www.themoviedb.org/movie/614479',
 'https://www.themoviedb.org/movie/1103825']

In [95]:
home_page.return_movies_name()

['Guardians of the Galaxy Vol. 3',
 'Transformers: Rise of the Beasts',
 'Fast X',
 'Knights of the Zodiac',
 'The Darkest Minds',
 'Sound of Freedom',
 'San Andreas',
 'Spider-Man: Across the Spider-Verse',
 'John Wick: Chapter 4',
 'Insidious: The Last Key',
 'Confidential Informant',
 'Sheroes',
 'The Super Mario Bros. Movie',
 'Barbie',
 'Elemental',
 'Mission: Impossible - Dead Reckoning Part One',
 'A Good Day to Die Hard',
 'The Conjuring: The Devil Made Me Do It',
 'Insidious: The Red Door',
 'War of the Worlds: The Attack']

In [134]:
# This class scrapes the information
# from each movie's URL


class EachMovieInformation():
    
    def __init__(self,url,name):
        self.base_url = url
        
        HEADERS = {
            'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
        }
        
        response = requests.get(self.base_url, headers=HEADERS)
        
        # checking if the status code is 200
        self.check_status_code(response)
    
        
        self.movie_contents = BeautifulSoup(response.text,'html.parser')
        
        print('Scrapping Information for {} \n'.format(name))
        
    def check_status_code(self,res):
        if res.status_code != 200:
            raise Exception(f'Unable to open {self.base_url}')
            
    
    def get_user_score(self):
        class_selection1 = 'user_score_chart'
        user_score_tag = self.movie_contents.find_all('div',{'class':class_selection1})
        
        user_score = user_score_tag[0]['data-percent']
        return user_score
                                                  
    
    def get_time(self):                         
        time_tag = self.movie_contents.find_all('span',class_='runtime')
        
        time = time_tag[0].text.strip()                                         
        return time
    
    
    def get_release_date(self):
        release_tag = self.movie_contents.find_all('span',class_='release')
        
        release = release_tag[0].text.strip()
                                               
        release = release[:10]
        return release
    
    
    def get_genres(self):
        genres_tag = self.movie_contents.find_all('span',class_='genres')
        
        genres = genres_tag[0].text.strip()
                                                  
        genres  = ' , '.join(genres.split(',\xa0'))
        
        return genres
    
    
    
    def get_overview(self):
        overview_tag = self.movie_contents.find_all('div',class_ = 'overview')
        
        overview = overview_tag[0].text.strip()
        return overview
        
        
    def get_directors(self):
        directors_tag = self.movie_contents.find_all('li',class_='profile')
        
        director = []
        for i in directors_tag:
            temp_p_tag = i.find('p',class_ = 'character')
            temp_a_tag = i.find('a')

            if ('Director' in temp_p_tag.text):
                director.append(temp_a_tag.text)
                
        director = ','.join(director)
            
        
        return director
    
    
    
    
    
    
    
    

In [112]:
movie_info = EachMovieInformation(home_page.return_movies_link()[0],home_page.return_movies_name()[0])

Got NO ERROR while SCRAPING
Scrapping Information for Guardians of the Galaxy Vol. 3


In [113]:
movie_info.get_directors()

'James Gunn'

In [114]:
movie_info.get_genres()

'Science Fiction , Adventure , Action'

In [115]:
movie_info.get_overview()

'Peter Quill, still reeling from the loss of Gamora, must rally his team around him to defend the universe along with protecting one of their own. A mission that, if not completed successfully, could quite possibly lead to the end of the Guardians as we know them.'

In [116]:
movie_info.get_release_date()

'05/05/2023'

In [117]:
movie_info.get_time()

'2h 30m'

In [118]:
movie_info.get_user_score()

'81.0'

In [120]:
movies_link = home_page.return_movies_link()
movies_name = home_page.return_movies_name()

In [136]:
def create_df_and_save_it(movies_name , movies_link , movies_directors,
                          movies_time , movies_score , movies_overview ,
                          movies_release , movies_genre , Name_of_df):
    
    dic = {
        'Movie_Name':movies_name,
        'User_Score(%)':movies_score,
        'Time':movies_time,
        'Date of Release':movies_release,
        'Genres':movies_genre,
        'Director':movies_directors,
        'Link':movies_link,
        'Overview':movies_overview
    }
    
    
    df = pd.DataFrame(dic,index=list(range(len(movies_genre))))
    
    df.to_csv(Name_of_df,index=False)
    
    
    print('DONE SUCCESSFULLY')
                          

In [135]:
movies_directors  = [] 
movies_time = []
movies_score = [] 
movies_overview = [] 
movies_release  = []
movies_genre = []


url = 'https://www.themoviedb.org/movie?page=1'
home_page = ExtractHomePage(url)

movies_link = home_page.return_movies_link()
movies_name = home_page.return_movies_name() 


for i in range(len(movies_link)):
    movie_info = EachMovieInformation(movies_link[i],movies_name[i])
    
    movies_directors.append(movie_info.get_directors())
    
    movies_genre.append(movie_info.get_genres())
    
    movies_time.append(movie_info.get_time())
    
    movies_score.append(movie_info.get_user_score())
    
    movies_release.append(movie_info.get_release_date())
    
    movies_overview.append(movie_info.get_overview())
    
    
    
    
create_df_and_save_it(movies_name=movies_name,Name_of_df='Movies.csv',movies_directors=movies_directors,movies_overview=movies_overview,
                     movies_genre=movies_genre,movies_release=movies_release,
                     movies_time=movies_time,movies_link=movies_link,movies_score=movies_score)

Got NO ERROR while SCRAPING

Scrapping Information for Guardians of the Galaxy Vol. 3 

Scrapping Information for Transformers: Rise of the Beasts 

Scrapping Information for Fast X 

Scrapping Information for Knights of the Zodiac 

Scrapping Information for The Darkest Minds 

Scrapping Information for Sound of Freedom 

Scrapping Information for San Andreas 

Scrapping Information for Spider-Man: Across the Spider-Verse 

Scrapping Information for John Wick: Chapter 4 

Scrapping Information for Insidious: The Last Key 

Scrapping Information for Confidential Informant 

Scrapping Information for Sheroes 

Scrapping Information for The Super Mario Bros. Movie 

Scrapping Information for Barbie 

Scrapping Information for Elemental 

Scrapping Information for Mission: Impossible - Dead Reckoning Part One 

Scrapping Information for A Good Day to Die Hard 

Scrapping Information for The Conjuring: The Devil Made Me Do It 

Scrapping Information for Insidious: The Red Door 

Scrapping 