# Scraping top movies and TV shows on TMDB and IMDB using Python

- Web scraping is the process of extracting data from a web page so that it can be analyzed and acted upon. You can click [here](https://en.wikipedia.org/wiki/Web_scraping) to learn more about scraping.
- Today we will be scraping [TMDB](https://www.themoviedb.org/?language=en-US) and [IMDB](https://www.imdb.com/) to get the top rated movies and TV shows of all time.
- We will use the `requests` library along with the `BeautifulSoup4` library for scraping IMDB.
- The `requests` library sends HTTP requests, such as `GET` and `POST`, to a web server and returns the response. You can learn more about the `requests` library [here](https://www.w3schools.com/python/module_requests.asp).
- `BeautifulSoup4` is a library that makes web scraping easier by turning an HTML document into a BeautifulSoup object that we can query to extract data. You can learn more about it [here](https://beautiful-soup-4.readthedocs.io/en/latest/).
- We will use the official *REST API* provided by the TMDB developers team to get this information from TMDB, as it does not allow manual scraping.
- **REST API** stands for **Representational State Transfer Application Programming Interface**. It is an interface used to transfer data between two programs or applications, whether of the same or different types. You can learn more about REST API [here](https://en.wikipedia.org/wiki/Representational_state_transfer).



## 1. Scraping IMDB for top rated TV shows of all time

Here are the steps we will follow
***
- We are going to scrape the [Top 250 TV shows](https://www.imdb.com/chart/toptv/) page.
- For each TV series, we will get the series name, release year, star rating and the URL of the series' IMDB page.
- We will write this data to a .csv file with one row per series, in the following format.

    ```
    name,release year,stars,url
    cosmos,1980,9.2,imdb.com/(url)
    ```

Let us begin by installing and importing the `requests` and `BeautifulSoup4` libraries.

In [28]:
!pip install requests --upgrade --quiet

In [29]:
!pip install BeautifulSoup4 --upgrade --quiet

In [30]:
import requests
from bs4 import BeautifulSoup

#### Now that our libraries are imported and ready to use, let's get the [Top 250 TV shows](https://www.imdb.com/chart/toptv/) page from IMDB.

- We will use the `get()` function to download the web page; it returns a response object.
- After calling `get()`, we can check whether our request was successful by checking the `status_code` of the response.

In [31]:
url = 'https://www.imdb.com/chart/toptv/'
response = requests.get(url)
response.status_code

200

As we can see above, the response code is `200`. Any response with a status code from `200` to `299` indicates a successful request, so we have the web page we wanted. You can learn more about the different HTTP response codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
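
As a small illustration of the status code ranges mentioned above, the sketch below labels a code by the range it falls in. The helper name `describe_status` is hypothetical and not part of `requests`. In practice, `response.raise_for_status()` can be called on a response to raise an exception for 4xx and 5xx codes.

```python
def describe_status(status_code):
    """Return a coarse label for an HTTP status code (hypothetical helper)."""
    if 200 <= status_code <= 299:
        return "success"
    if 300 <= status_code <= 399:
        return "redirect"
    if 400 <= status_code <= 499:
        return "client error"
    if 500 <= status_code <= 599:
        return "server error"
    return "other"


# Print the label for a few common codes.
for code in (200, 404, 429, 503):
    print(code, describe_status(code))
```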
 
Now we will get the actual contents of the web page from the response using its `text` attribute.

In [32]:
response.text

'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n    \n    \n    \n\n    \n    \n    \n\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n            </style>\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>IMDb Top 250 TV - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</sc

The raw text is hard to read as it is, so let's write it to an `.html` file to make sense of it. For this we will use `open()`, which is a built-in Python function.
- We create the file `imdb_top_250_tv.html` with the `open()` function, passing it the file name and the mode. Since we want to write to the file, we use `w` (write) mode.
- We use the `with` keyword so that the file is closed automatically once the write operation is complete.
- You can learn more about `open()` [here](https://www.w3schools.com/python/ref_func_open.asp).
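
Web pages often contain non-ASCII characters, so it can help to state the encoding explicitly when writing and reading a file. The sketch below assumes UTF-8 and uses a made-up file name and sample text; it writes into a temporary directory so nothing is left behind.

```python
import os
import tempfile

# Made-up sample page containing a non-ASCII character.
html_text = "<html><head><title>Café Top 250</title></head></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_page.html")  # hypothetical file name

    # "w" creates the file (or overwrites it); the encoding is stated explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)

    # "r" opens the same file for reading, with the same encoding.
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

print(read_back == html_text)
```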

In [33]:
with open("imdb_top_250_tv.html","w") as f:
    f.write(response.text)

- Now that we have created the file, let's open it and see what is in there. Click `File > Open` and select the file, in our case `imdb_top_250_tv.html`. The web page now opens locally. Note that relative links on the page will not work, since the page is no longer served from IMDB's website.
- To see the HTML source of the page, tick the checkbox beside the file name and click `Edit`.

You can also read the file here using the `open()` function in `r` (read) mode.

In [34]:
with open("imdb_top_250_tv.html","r") as d:
    print(d.read())




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         

        <meta charset="utf-8">

    
    
    

    
    
    

            <style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
        <title>IMDb Top 250 TV - IMDb</title>
  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') 

We have now retrieved the web page we wanted. Next, we will convert it into a BeautifulSoup document.
- A BeautifulSoup document is very powerful and makes extracting data really easy.
- We can extract whatever data we need from the HTML page through its tags.
- Let's create a function to convert the HTML code into a BeautifulSoup object.

In [35]:
def get_series_page(url):
    # Download the page at the URL we want to scrape data from.
    response = requests.get(url)
    # Raise an exception if the response status code is not 200.
    if response.status_code != 200:
        raise Exception("Unable to retrieve the web page from {}".format(url))
    # Convert the page text into a BeautifulSoup document.
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
                          

In the above function we used `BeautifulSoup()` to convert the page contents from plain text into a bs4 document.
- `BeautifulSoup()` takes two arguments:
    - The content you want to convert into a bs4 document.
    - An optional argument naming the parser to use, here `'html.parser'`.
- Now let's call the function and see how a bs4 document works.

In [36]:
url = "https://www.imdb.com/chart/toptv/"
doc=get_series_page(url)

We now have the whole page parsed into a bs4 document, which we can query to get data. You can check this using the `type()` function.

In [37]:
type(doc)

bs4.BeautifulSoup

#### Now let's see which tag we need to access to get the data we require

![image](https://i.imgur.com/6fP5cPl.png)

From the above snip we can see that the `<td>` tag with `class="titleColumn"` has all the information we need except for the stars. Let's try to grab it and see what it returns.

To extract data from a bs4 document we use the `.find()` and `.find_all()` methods.
- `doc.find()`:
  - This method returns the first instance of the given tag. We can pass the `class_="class_name"` parameter along with the tag name to get the first instance of that tag with the given class.
- `doc.find_all()`:
  - This method returns a list of all instances of the given tag. It can also take identifiers such as `class_="class_name"` to select specific tags.

>Note that `class` is a keyword in Python, which is why we use `class_="class_name"`. We can also use other identifiers such as `id="id"` to get a specific tag.
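
To make the difference between `find()` and `find_all()` concrete, here is a minimal sketch run against a small, made-up HTML fragment rather than the live IMDB page. The titles and URLs in it are placeholders.

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like the table cells discussed in this section.
sample_html = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt0000001/">Show A</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/">Show B</a></td></tr>
  <tr><td class="ratingColumn">9.0</td></tr>
</table>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

first_cell = sample_doc.find("td", class_="titleColumn")     # first match only
all_cells = sample_doc.find_all("td", class_="titleColumn")  # list of every match
missing = sample_doc.find("td", class_="doesNotExist")       # None when nothing matches

print(first_cell.find("a").text)
print(len(all_cells))
print(missing)
```

Because `find()` returns `None` when nothing matches, chaining another call onto a missing tag raises an `AttributeError`, which is worth keeping in mind if the page layout changes.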

In [38]:
tags = doc.find('td',class_="titleColumn")
tags

<td class="titleColumn">
      1.
      <a href="/title/tt5491994/" title="David Attenborough, Chadden Hunter">Planet Earth II</a>
<span class="secondaryInfo">(2016)</span>
</td>

Now we have the tag that holds all the information we need, except for the stars.
We can also call `find()` and `find_all()` on a tag to search for tags nested inside it.
Let's try to get the title of this TV series.

Since the title is inside the `<a>` tag, we can do the following:

In [39]:
title=tags.find('a').text #We use the .text attribute to get the text out.
title

'Planet Earth II'

Similarly, let's get the series release year, which is inside the `<span>` tag.

In [40]:
year=tags.find('span').text.strip("()") #We use the '.strip()' method to remove `()` from both ends
year

'2016'
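
The behaviour of `strip()` with an argument is easy to misread, so here is a short sketch using sample strings. The argument is treated as a set of characters to remove from both ends, not as a literal prefix or suffix.

```python
raw_year = "(2016)"  # text as it appears inside the <span> tag

# strip("()") removes any "(" or ")" characters from both ends;
# characters in the middle of the string are left alone.
year_clean = raw_year.strip("()")
range_clean = "(2016-2019)".strip("()")

# With no argument, strip() removes leading and trailing whitespace.
rating_clean = "  9.4\n".strip()

print(year_clean)
print(range_clean)
print(rating_clean)
```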

Now, let's get the URL of the show's IMDB page. We can see that the URL is stored in the `href` attribute of the `<a>` tag. Let us see how to access the attributes of a tag.

In [41]:
imdb_url = tags.find('a')["href"] #We can use the [] dictionary notation to access a tag's attribute
imdb_url

'/title/tt5491994/'

We can see that we only got a relative URL and not the full URL, which on its own is of little use to us. Since we know the URL will always begin with https://www.imdb.com, we can add this to the start to get the full URL.

In [42]:
base_url = "https://www.imdb.com"
imdb_url = base_url+imdb_url
imdb_url

'https://www.imdb.com/title/tt5491994/'
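
String concatenation works here because every link on the page is relative and starts with `/`. As an alternative sketch, the standard library's `urljoin` resolves a relative URL against a base URL and also copes with links that are already absolute.

```python
from urllib.parse import urljoin

base_url = "https://www.imdb.com"
relative_url = "/title/tt5491994/"

# urljoin resolves a relative URL against a base URL, the way a browser would.
full_url = urljoin(base_url, relative_url)
print(full_url)

# An already absolute URL is returned unchanged, which plain
# concatenation would get wrong.
already_absolute = urljoin(base_url, "https://www.imdb.com/chart/toptv/")
print(already_absolute)
```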

Now let's work on getting the rating, or stars, of the TV show.

- We can see from the snip below that the star rating is in another `<td>` tag, with `class="ratingColumn imdbRating"`.

![img](https://i.imgur.com/lkJ78wA.png)

Let's try to get the rating.

As the rating is outside the current `<td>` tag and enclosed by a different `<td>` tag with `class_="ratingColumn imdbRating"`, we will again extract the new `<td>` from the bs4 doc.
Later we will see how we can wrap all of this in a single function and work from a single tag.

In [43]:
ratings_tag = doc.find('td',class_="ratingColumn imdbRating") #Get the <td> tag
ratings = ratings_tag.text.strip() #The rating has some whitespace at the start and end that we will remove
ratings

'9.4'

Now we have all the information we wanted, let's put it all in a dictionary.

In [44]:
{
    'title':title,
    'year':year,
    'ratings':ratings,
    'Series home':imdb_url
}

{'title': 'Planet Earth II',
 'year': '2016',
 'ratings': '9.4',
 'Series home': 'https://www.imdb.com/title/tt5491994/'}

**GREAT!** Now we know how to extract information for a series from the top 250 page. So far we have only extracted information about a single series. Let's see how to get information about all 250 shows as a list of dictionaries.
************************************************************************************************

![img](https://i.imgur.com/lIoXPeM.png)

- From the above snip we can see that both `<td>` tags we accessed, i.e. the one with `class_="titleColumn"` and the one with `class_="ratingColumn imdbRating"`, are enclosed in a `<tr>` tag.
- Further, the `<tr>` tags for all 250 shows are enclosed in a `<tbody>` tag.
- So we can iterate over each `<tr>` tag in the `<tbody>` tag and apply the above logic in each iteration to get information for all 250 shows.
- Let's code this out.
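
Before working on the real page, the row-by-row idea can be sketched on a tiny, made-up table with the same structure. The two shows and their values below are placeholders, not IMDB data.

```python
from bs4 import BeautifulSoup

# Made-up rows that mimic the <tbody>/<tr>/<td> structure described above.
sample_html = """
<table><tbody class="lister-list">
  <tr>
    <td class="titleColumn"><a href="/title/tt0000001/">Show A</a>
      <span class="secondaryInfo">(2001)</span></td>
    <td class="ratingColumn imdbRating"><strong>9.1</strong></td>
  </tr>
  <tr>
    <td class="titleColumn"><a href="/title/tt0000002/">Show B</a>
      <span class="secondaryInfo">(2012)</span></td>
    <td class="ratingColumn imdbRating"><strong>8.9</strong></td>
  </tr>
</tbody></table>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")
rows = sample_doc.find("tbody", class_="lister-list").find_all("tr")

# Apply the same per-row extraction to every <tr> in turn.
sample_shows = []
for row in rows:
    title_cell = row.find("td", class_="titleColumn")
    rating_cell = row.find("td", class_="ratingColumn imdbRating")
    sample_shows.append({
        "title": title_cell.find("a").text,
        "year": title_cell.find("span").text.strip("()"),
        "rating": rating_cell.text.strip(),
    })

print(sample_shows)
```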

Let us first get the `<tbody>` tag that holds the `<tr>` tags.

In [45]:
tbody_tag = doc.find('tbody',class_="lister-list") # Accessing the main <tbody> tag
tbody_tag

<tbody class="lister-list">
<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.430334652471991" name="ir"></span>
<span data-value="1.4783904E12" name="us"></span>
<span data-value="150341" name="nv"></span>
<span data-value="-1.5696653475280087" name="ur"></span>
<a href="/title/tt5491994/"> <img alt="Planet Earth II" height="67" src="https://m.media-amazon.com/images/M/MV5BMGZmYmQ5NGQtNWQ1MC00NWZlLTg0MjYtYjJjMzQ5ODgxYzRkXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UY67_CR1,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
      1.
      <a href="/title/tt5491994/" title="David Attenborough, Chadden Hunter">Planet Earth II</a>
<span class="secondaryInfo">(2016)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.4 based on 150,341 user ratings">9.4</strong>
</td>
<td class="ratingColumn">
<div class="seen-widget seen-widget-tt5491994 pending" data-titleid="tt5491994">
<div class="boundary">
<div class="popover">
<span class="de

Now that we have the `<tbody>` tag, let's get the `<tr>` tags inside it.

In [46]:
tr_tag_list = tbody_tag.find_all('tr')
tr_tag_list

[<tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.430334652471991" name="ir"></span>
 <span data-value="1.4783904E12" name="us"></span>
 <span data-value="150341" name="nv"></span>
 <span data-value="-1.5696653475280087" name="ur"></span>
 <a href="/title/tt5491994/"> <img alt="Planet Earth II" height="67" src="https://m.media-amazon.com/images/M/MV5BMGZmYmQ5NGQtNWQ1MC00NWZlLTg0MjYtYjJjMzQ5ODgxYzRkXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UY67_CR1,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
       1.
       <a href="/title/tt5491994/" title="David Attenborough, Chadden Hunter">Planet Earth II</a>
 <span class="secondaryInfo">(2016)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.4 based on 150,341 user ratings">9.4</strong>
 </td>
 <td class="ratingColumn">
 <div class="seen-widget seen-widget-tt5491994 pending" data-titleid="tt5491994">
 <div class="boundary">
 <div class="popover">
 <span class="delete">

Now that we have a list of all the `<tr>` tags, let's try to get just the title of the first show.

- We first take the first element, i.e. the first `<tr>` tag, from the list.
- From our findings above, we know that the show information is inside the `<td>` tag with `class_="titleColumn"`, and the stars are inside the `<td>` tag with `class_="ratingColumn imdbRating"`.
- We combine all this below and get the title.

In [47]:
tr = tr_tag_list
title_test=tr[0].find('td',class_="titleColumn").find('a').text
title_test

'Planet Earth II'

Now, let's create a function that takes a `<tr>` tag and returns the show information in the form of a dictionary.

In [48]:
def get_show_info(tr): # We pass in a <tr> tag to extract the show information from
    base_url = "https://www.imdb.com"
    td = tr.find('td',class_='titleColumn')

    # Get the title from the <a> tag inside the <td> tag
    title = td.find('a').text

    # Get the release year from the <span> inside the <td> tag
    year = td.find('span',class_="secondaryInfo").text.strip('()')

    # Get the show's IMDB URL from the href attribute of the <a> tag inside the <td> tag
    series_home_url = base_url+td.find('a')['href']

    # Get the star rating from the <td> tag with class "ratingColumn imdbRating"
    ratings = tr.find('td',class_="ratingColumn imdbRating").text.strip()
    return {
        'Title':title,
        'Year':year,
        'IMDB page':series_home_url,
        'Rating':ratings
    }

In [49]:
show_info1 = get_show_info(tr[0])
show_info1 

{'Title': 'Planet Earth II',
 'Year': '2016',
 'IMDB page': 'https://www.imdb.com/title/tt5491994/',
 'Rating': '9.4'}

Great! We are getting the information for the first show. Let's see if we can get information for other shows by changing the input.

In [50]:
show_info1 = get_show_info(tr[1])
show_info1 

{'Title': 'Breaking Bad',
 'Year': '2008',
 'IMDB page': 'https://www.imdb.com/title/tt0903747/',
 'Rating': '9.4'}

Similarly, for the 3rd show which is enclosed in the 3rd `<tr>` tag in the list we can give `tr[2]` as input. 

In [51]:
show_info2 = get_show_info(tr[2])
show_info2 

{'Title': 'Planet Earth',
 'Year': '2006',
 'IMDB page': 'https://www.imdb.com/title/tt0795176/',
 'Rating': '9.4'}

Since we have 250 shows to go through, we can iterate through the list of `<tr>` tags using a list comprehension and get the output as a list of dictionaries.

In [52]:
top_shows = [get_show_info(show) for show in tr] #passing all the 250 <tr> in tbody to get all shows
top_shows

[{'Title': 'Planet Earth II',
  'Year': '2016',
  'IMDB page': 'https://www.imdb.com/title/tt5491994/',
  'Rating': '9.4'},
 {'Title': 'Breaking Bad',
  'Year': '2008',
  'IMDB page': 'https://www.imdb.com/title/tt0903747/',
  'Rating': '9.4'},
 {'Title': 'Planet Earth',
  'Year': '2006',
  'IMDB page': 'https://www.imdb.com/title/tt0795176/',
  'Rating': '9.4'},
 {'Title': 'Band of Brothers',
  'Year': '2001',
  'IMDB page': 'https://www.imdb.com/title/tt0185906/',
  'Rating': '9.4'},
 {'Title': 'Chernobyl',
  'Year': '2019',
  'IMDB page': 'https://www.imdb.com/title/tt7366338/',
  'Rating': '9.3'},
 {'Title': 'The Wire',
  'Year': '2002',
  'IMDB page': 'https://www.imdb.com/title/tt0306414/',
  'Rating': '9.3'},
 {'Title': 'Avatar: The Last Airbender',
  'Year': '2005',
  'IMDB page': 'https://www.imdb.com/title/tt0417299/',
  'Rating': '9.2'},
 {'Title': 'Blue Planet II',
  'Year': '2017',
  'IMDB page': 'https://www.imdb.com/title/tt6769208/',
  'Rating': '9.2'},
 {'Title': 'The 

#### Now we have the information for all the top 250 shows in the form of a list of dictionaries. Let's write this information to a .csv file

The input to our function will be a list of dictionaries of the following form, along with the path where we want to create the .csv file.

```
[
  {'title': 'abc', 'year': 'def', 'imdb_page': 'ghi', 'rating': 'grt'},
  {'title': 'jkl', 'year': 'mno', 'imdb_page': 'pqr', 'rating': 'eqw'},
  {'title': 'stu', 'year': 'vwx', 'imdb_page': 'yza', 'rating': 'xfv'},
  ...
]
```

The function will create a file with a given name containing the following data:

```
title,year,imdb_page,rating
abc,def,ghi,grt
jkl,mno,pqr,eqw
stu,vwx,yza,xfv

```

In [53]:
def write_to_csv(list_of_shows,name):
    if len(list_of_shows)==0:
        raise Exception("The input list is empty")
    with open(name,'w') as d:
        headers = list(list_of_shows[0].keys())
        d.write(','.join(headers)+'\n')
        for show in list_of_shows:
            values=[]
            for header in headers:
                values.append(str(show.get(header, "")))
            d.write(','.join(values)+'\n')
    
    

In [54]:
write_to_csv(top_shows,'imdb_top_250_show.csv')
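
The `write_to_csv` function above joins values with commas by hand, which produces a broken row if a title itself contains a comma. As a hedged alternative, the sketch below uses the standard library's `csv` module, which quotes such values automatically. The sample rows are made up, and the data is written to an in-memory buffer so the example does not touch any files.

```python
import csv
import io

# Made-up rows; the first title deliberately contains a comma.
sample_shows = [
    {"Title": "Show, With Comma", "Year": "2001", "Rating": "9.1"},
    {"Title": "Show B", "Year": "2012", "Rating": "8.9"},
]

# A real file would be opened with
# open(name, "w", newline="", encoding="utf-8") instead of a buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(sample_shows[0].keys()))
writer.writeheader()
writer.writerows(sample_shows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back shows the comma inside the title survived the round trip.
rows_back = list(csv.DictReader(io.StringIO(csv_text)))
print(rows_back[0]["Title"])
```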

In [55]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Updating notebook "hussainmansuri12345/imdb-tmdb-top-content-scraping" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/hussainmansuri12345/imdb-tmdb-top-content-scraping


'https://jovian.com/hussainmansuri12345/imdb-tmdb-top-content-scraping'

Now let's view our .csv file.

In [56]:
with open('imdb_top_250_show.csv',"r") as g:
    csv_file=g.read()
print(csv_file)

Title,Year,IMDB page,Rating
Planet Earth II,2016,https://www.imdb.com/title/tt5491994/,9.4
Breaking Bad,2008,https://www.imdb.com/title/tt0903747/,9.4
Planet Earth,2006,https://www.imdb.com/title/tt0795176/,9.4
Band of Brothers,2001,https://www.imdb.com/title/tt0185906/,9.4
Chernobyl,2019,https://www.imdb.com/title/tt7366338/,9.3
The Wire,2002,https://www.imdb.com/title/tt0306414/,9.3
Avatar: The Last Airbender,2005,https://www.imdb.com/title/tt0417299/,9.2
Blue Planet II,2017,https://www.imdb.com/title/tt6769208/,9.2
The Sopranos,1999,https://www.imdb.com/title/tt0141842/,9.2
Cosmos: A Spacetime Odyssey,2014,https://www.imdb.com/title/tt2395695/,9.2
Cosmos,1980,https://www.imdb.com/title/tt0081846/,9.2
Our Planet,2019,https://www.imdb.com/title/tt9253866/,9.2
Game of Thrones,2011,https://www.imdb.com/title/tt0944947/,9.1
The World at War,1973,https://www.imdb.com/title/tt0071075/,9.1
Rick and Morty,2013,https://www.imdb.com/title/tt2861424/,9.1
Fullmetal Alchemist: Brotherhood,2009,ht

Let's bring all of the above together in a single function that takes the IMDB URL and a file name and writes the data to a csv file.

In [57]:
def imdb_to_csv(url,filename):
    doc = get_series_page(url) # Create a BeautifulSoup document from the given URL

    tbody_tag = doc.find('tbody',class_="lister-list") # Access the main <tbody> tag
    tr = tbody_tag.find_all('tr')  # Get the list of all the <tr> tags holding the series data

    top_shows = [get_show_info(show) for show in tr] # Parse all 250 <tr> tags

    write_to_csv(top_shows,filename) # Write the data to a csv file

    print(f"Data scraped and written successfully to {filename}")
    

In [61]:
url = 'https://www.imdb.com/chart/toptv/'
imdb_to_csv(url,'imdb_top_250_show.csv')

Data scraped and written successfully to imdb_top_250_show.csv


Let's check the csv file.

In [59]:
with open('imdb_top_250_show.csv','r') as y:
    print(y.read())

Title,Year,IMDB page,Rating
Planet Earth II,2016,https://www.imdb.com/title/tt5491994/,9.4
Breaking Bad,2008,https://www.imdb.com/title/tt0903747/,9.4
Planet Earth,2006,https://www.imdb.com/title/tt0795176/,9.4
Band of Brothers,2001,https://www.imdb.com/title/tt0185906/,9.4
Chernobyl,2019,https://www.imdb.com/title/tt7366338/,9.3
The Wire,2002,https://www.imdb.com/title/tt0306414/,9.3
Avatar: The Last Airbender,2005,https://www.imdb.com/title/tt0417299/,9.2
Blue Planet II,2017,https://www.imdb.com/title/tt6769208/,9.2
The Sopranos,1999,https://www.imdb.com/title/tt0141842/,9.2
Cosmos: A Spacetime Odyssey,2014,https://www.imdb.com/title/tt2395695/,9.2
Cosmos,1980,https://www.imdb.com/title/tt0081846/,9.2
Our Planet,2019,https://www.imdb.com/title/tt9253866/,9.2
Game of Thrones,2011,https://www.imdb.com/title/tt0944947/,9.1
The World at War,1973,https://www.imdb.com/title/tt0071075/,9.1
Rick and Morty,2013,https://www.imdb.com/title/tt2861424/,9.1
Fullmetal Alchemist: Brotherhood,2009,ht

#### With that, we have finished scraping the IMDB top 250 TV shows page using the BeautifulSoup4 library and saved the data to a .csv file

## 2. Scraping data from TMDB using REST API to get the 100 top rated movies of all time.

- TMDB does not allow manual scraping. Instead, it gives us a free API with various endpoints that we can connect to and get the required data.
- We will need an `API KEY` or an `API read access token` to get data from TMDB.
- You can get your own keys by creating an account [here](https://www.themoviedb.org/signup).
- You can learn more about the APIs provided by TMDB [here](https://developer.themoviedb.org/reference/intro/getting-started).

#### These are the steps we will follow to scrape TMDB 100 all time highest rated movies using REST API


- First we will use our API key to get the `response` from TMDB using the `requests.get()` function.
- We will get the output as a JSON object, which we can easily convert into a Python dictionary.
- After converting it into a dictionary, we will extract the information we need and store it in a list of dictionaries.
- Finally, we will write this list of dictionaries to a .csv file.
- The final csv file should look something like this:
    ```
    title,release_year,rating
    The Godfather,1972,8.7
    The Shawshank Redemption,1994,8.7
    The Godfather Part II,1974,8.6
    ```

Let's first enter our key securely using `getpass`, so that it is not displayed in the notebook.

In [1]:
from getpass import getpass

In [2]:
key= getpass()

········


Now that we have the API key, let's get the API endpoint for the top rated movies.
- An API endpoint is a point of communication between our application and the service.
- An API endpoint looks like a URL, but requests to it are rejected unless they carry a valid key.
- TMDB provides various API endpoints. Here we are using the "Top Rated" endpoint from the "Movie List" category.
- You can check the API endpoints in the TMDB developer documentation [here](https://developer.themoviedb.org/reference/movie-top-rated-list).
- Using an API lets the provider keep track of the information being pulled and limit any unusually high volume of requests. This restriction is known as `rate limiting`.
- TMDB has a limit of around `50` requests per second, beyond which you will be rate limited and get a `response.status_code` of `429`.
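
One common way to cope with a `429` response is to wait and try again, waiting longer each time. The sketch below illustrates that idea offline. The helper `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands in for a real HTTP call so that nothing is sent to TMDB.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning 429 or attempts run out.

    `fetch` is any zero-argument callable returning (status_code, body);
    it stands in for a real HTTP call so the sketch runs offline.
    """
    for attempt in range(max_attempts):
        status_code, body = fetch()
        if status_code != 429:
            return status_code, body
        if attempt < max_attempts - 1:
            # Wait longer after each rate-limited attempt: 1x, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    return status_code, body


# Fake endpoint: rate limited twice, then succeeds.
fake_responses = iter([(429, ""), (429, ""), (200, '{"page": 1}')])
delays = []  # record the waits instead of actually sleeping
result = fetch_with_retry(lambda: next(fake_responses), sleep=delays.append)

print(result)
print(delays)
```

Passing `sleep` as a parameter is only a convenience for trying the logic out without real delays; with the default, the function pauses using `time.sleep`.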

In [5]:
import requests

In [6]:
#URL will be the API endpoint
tmdb_url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page=1"

We will again use the `requests` library to get the response, but this time we will pass in the API endpoint instead of a web page URL.

- A request to an API usually consists of 2 parts:
    - a. `url` - This is the actual endpoint we got from the API documentation.
    - b. `headers` - This is extra data passed to the API endpoint in the HTTP headers. This is where we pass our API access `key`. TMDB expects the `Authorization` value in the form `Bearer <API read access token>`, so `key` must hold that whole string.
- Different APIs expect different headers, and you can usually find them in the API documentation itself. In our case we got the headers below from the TMDB website.
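
Since TMDB's documentation shows the read access token being sent as `Authorization: Bearer <token>`, it can be convenient to build the headers in one place. The helper below is a hypothetical sketch, not something provided by TMDB or `requests`, and the token shown is a placeholder.

```python
def build_tmdb_headers(token):
    """Build request headers for TMDB (hypothetical helper, not from TMDB)."""
    token = token.strip()
    # Add the "Bearer " prefix only if the caller has not already included it.
    if not token.startswith("Bearer "):
        token = "Bearer " + token
    return {"accept": "application/json", "Authorization": token}


plain = build_tmdb_headers("my-read-access-token")            # placeholder token
prefixed = build_tmdb_headers("Bearer my-read-access-token")  # prefix already there

print(plain)
print(prefixed)
```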


In [7]:
##Getting the Top rated Movies information using restAPI access token

url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page=1"

headers = {"accept": "application/json","Authorization": key}

response = requests.get(url, headers=headers)
print(response.text)

{"page":1,"results":[{"adult":false,"backdrop_path":"/tmU7GeKVybMWFButWEGl2M4GeiP.jpg","genre_ids":[18,80],"id":238,"original_language":"en","original_title":"The Godfather","overview":"Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge.","popularity":121.499,"poster_path":"/3bhkrj58Vtu7enYsRolD1fZdja1.jpg","release_date":"1972-03-14","title":"The Godfather","video":false,"vote_average":8.7,"vote_count":17989},{"adult":false,"backdrop_path":"/kXfqcdQKsToO0OUXHcrrNCHDBzO.jpg","genre_ids":[18,80],"id":278,"original_language":"en","original_title":"The Shawshank Redemption","overview":"Framed in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he p

Now that we have the response, we can see that it is in `JSON` format. We can easily convert it into a Python dictionary using the built-in `json` library, by calling the `json.loads()` function.
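
As a quick offline illustration of this conversion, the sketch below parses a shortened, made-up response body with the same shape as the API output. With `requests`, calling `response.json()` does the same parsing in one step.

```python
import json

# A shortened, made-up response body in the same shape as the API output.
sample_text = '{"page": 1, "results": [{"title": "Movie A", "adult": false, "vote_average": 8.7}]}'

sample_data = json.loads(sample_text)

print(type(sample_data))
print(sample_data["results"][0]["title"])
print(sample_data["results"][0]["adult"])  # JSON false becomes Python False
```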

In [8]:
import json

In [9]:

top_movies = json.loads(response.text)
top_movies

{'page': 1,
 'results': [{'adult': False,
   'backdrop_path': '/tmU7GeKVybMWFButWEGl2M4GeiP.jpg',
   'genre_ids': [18, 80],
   'id': 238,
   'original_language': 'en',
   'original_title': 'The Godfather',
   'overview': 'Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge.',
   'popularity': 121.499,
   'poster_path': '/3bhkrj58Vtu7enYsRolD1fZdja1.jpg',
   'release_date': '1972-03-14',
   'title': 'The Godfather',
   'video': False,
   'vote_average': 8.7,
   'vote_count': 17989},
  {'adult': False,
   'backdrop_path': '/kXfqcdQKsToO0OUXHcrrNCHDBzO.jpg',
   'genre_ids': [18, 80],
   'id': 278,
   'original_language': 'en',
   'original_title': 'The Shawshank Redemption',
   'overview': 'Framed in the 1940s for the double murder of his

Now let's try to get the information for the first movie from the dictionary.
We can see that the data we require is stored under the `results` key.

In [10]:
results=top_movies['results']

Now that we have `results`, we can see that it is a list of dictionaries, similar to what we ended up with using BeautifulSoup4. APIs make extracting data much easier and quicker than manual extraction using BeautifulSoup4.

Let's grab the first dictionary from the `results` list and see which movie is No. 1 in ratings.

In [11]:
results[0]

{'adult': False,
 'backdrop_path': '/tmU7GeKVybMWFButWEGl2M4GeiP.jpg',
 'genre_ids': [18, 80],
 'id': 238,
 'original_language': 'en',
 'original_title': 'The Godfather',
 'overview': 'Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. When organized crime family patriarch, Vito Corleone barely survives an attempt on his life, his youngest son, Michael steps in to take care of the would-be killers, launching a campaign of bloody revenge.',
 'popularity': 121.499,
 'poster_path': '/3bhkrj58Vtu7enYsRolD1fZdja1.jpg',
 'release_date': '1972-03-14',
 'title': 'The Godfather',
 'video': False,
 'vote_average': 8.7,
 'vote_count': 17989}

Let's try to extract the information we need and put it in a dictionary.

In [12]:
movie_dict_1={
    'title':results[0]['title'],
    'release date': results[0]['release_date'],
    'Ratings':results[0]['vote_average']
}
movie_dict_1

{'title': 'The Godfather', 'release date': '1972-03-14', 'Ratings': 8.7}

Similarly, we can get data for other movies by changing the index we use on the `results` list.

We can see that the `results` list only holds the 20 highest rated movies, while our requirement is to get 100. The reason is that TMDB returns only 20 movies per page.
We can choose which page we want by changing the `page=` value in the API endpoint. Let's try to get the movies ranked 21-40 from the second page.

In [13]:
# Changed the page value from `1` to `2`
url_2 = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page=2"
headers = {"accept": "application/json","Authorization": key}

response2 = requests.get(url_2, headers=headers)
print(response2.text)

{"page":2,"results":[{"adult":false,"backdrop_path":"/qvZ91FwMq6O47VViAr8vZNQz3WI.jpg","genre_ids":[28,18],"id":346,"original_language":"ja","original_title":"七人の侍","overview":"A samurai answers a village's request for protection after he falls on hard times. The town needs protection from bandits, so the samurai gathers six others to help him teach the people how to defend themselves, and the villagers provide the soldiers with food.","popularity":27.108,"poster_path":"/8OKmBV5BUFzmozIC3pPWKHy17kx.jpg","release_date":"1954-04-26","title":"Seven Samurai","video":false,"vote_average":8.5,"vote_count":3063},{"adult":false,"backdrop_path":"/w2uGvCpMtvRqZg6waC1hvLyZoJa.jpg","genre_ids":[10749],"id":696374,"original_language":"en","original_title":"Gabriel's Inferno","overview":"An intriguing and sinful exploration of seduction, forbidden love, and redemption, Gabriel's Inferno is a captivating and wildly passionate tale of one man's escape from his own personal hell as he tries to earn the

Since we want 100 results and each page has 20 results, we need a loop that goes through 5 pages and collects 20 movies from each page. Let's code this out.

In [14]:
#Base url to interact with all the pages of top rated movies
tmdb_base_url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page={}"

# Initialize an empty list to store the movie information
movie_info_list = []

# Loop through the first 5 pages of the top-rated movies API
for page in range(1, 6):
    # Make a request to the top-rated movies API for the current page
    url = tmdb_base_url.format(page)
    headers = {"accept": "application/json","Authorization": key}
    response = requests.get(url, headers=headers)
    
    # Extract the movie information from the response
    top_movies = json.loads(response.text)
    list_of_movies = top_movies['results']
    for movie in list_of_movies:
        movie_info = {
            'title': movie['title'],
            'release_year': movie['release_date'][:4],
            'rating_average': movie['vote_average']
        }
        movie_info_list.append(movie_info)

# Print the number of movies collected
print(len(movie_info_list))

100
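
The number of pages in the loop above (5) comes from dividing the number of movies we want by the page size. A small sketch of that arithmetic follows, assuming the page size of 20 observed in this notebook; `pages_needed` and `page_numbers` are hypothetical helper names.

```python
import math

RESULTS_PER_PAGE = 20  # page size observed in the responses above

def pages_needed(wanted, per_page=RESULTS_PER_PAGE):
    """Number of pages required to collect at least `wanted` results."""
    return math.ceil(wanted / per_page)

def page_numbers(wanted, per_page=RESULTS_PER_PAGE):
    """Page numbers to request, starting from page 1."""
    return range(1, pages_needed(wanted, per_page) + 1)

print(list(page_numbers(100)))
```

If the wanted count is not a multiple of the page size, the last page returns more entries than needed, so the collected list would have to be sliced down to the wanted length.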


Now let's put this logic in a function.

#### Things to keep in mind while creating a function for the above
- We might want to extract information from other endpoints on the website, such as `top rated shows` instead of `top rated movies`.
- Down the road we might change our API access token, and if we hardcode `key` in the function we would have to change it everywhere.
- Keeping these points in mind, let's create a function that takes 2 arguments: the endpoint as `url`, and `key`.

In [17]:
def get_top_movies(url,key):
    movie_info_list = []

    # Loop through the first 5 pages of the top-rated movies API
    for page in range(1, 6):
        # Fill the page number into the URL template passed in as `url`
        page_url = url.format(page)
        headers = {"accept": "application/json","Authorization": key}
        response = requests.get(page_url, headers=headers)

        # Extract the movie information from the response
        top_movies = json.loads(response.text)
        list_of_movies = top_movies['results']
        for movie in list_of_movies:
            movie_info = {
                'title': movie['title'],
                'release_year': movie['release_date'][:4],
                'rating': movie['vote_average']
            }
            movie_info_list.append(movie_info)
    return movie_info_list

Let's test this out.

In [18]:
tmdb_base_url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page={}"
top_movies=get_top_movies(tmdb_base_url,key)
top_movies


[{'title': 'The Godfather', 'release_year': '1972', 'rating': 8.7},
 {'title': 'The Shawshank Redemption', 'release_year': '1994', 'rating': 8.7},
 {'title': 'The Godfather Part II', 'release_year': '1974', 'rating': 8.6},
 {'title': 'Dilwale Dulhania Le Jayenge',
  'release_year': '1995',
  'rating': 8.6},
 {'title': "Schindler's List", 'release_year': '1993', 'rating': 8.6},
 {'title': 'Cuando Sea Joven', 'release_year': '2022', 'rating': 8.5},
 {'title': 'Spirited Away', 'release_year': '2001', 'rating': 8.5},
 {'title': '12 Angry Men', 'release_year': '1957', 'rating': 8.5},
 {'title': 'Your Name.', 'release_year': '2016', 'rating': 8.5},
 {'title': 'Parasite', 'release_year': '2019', 'rating': 8.5},
 {'title': 'The Dark Knight', 'release_year': '2008', 'rating': 8.5},
 {'title': 'The Green Mile', 'release_year': '1999', 'rating': 8.5},
 {'title': 'Pulp Fiction', 'release_year': '1994', 'rating': 8.5},
 {'title': 'The Good, the Bad and the Ugly',
  'release_year': '1966',
  'rating

In [19]:
len(top_movies)

100
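
On the earlier point about the API access token changing over time: one option is to keep the token out of the notebook entirely and read it from an environment variable. The sketch below assumes the token is sent in the `Bearer <token>` form shown in TMDB's documentation; the variable name `TMDB_TOKEN` and the helper `build_headers` are hypothetical.

```python
import os

def build_headers(token=None, env_var="TMDB_TOKEN"):
    """Return request headers, reading the token from the environment if none is passed."""
    if token is None:
        token = os.environ.get(env_var)  # `TMDB_TOKEN` is a made-up variable name
    if not token:
        raise ValueError(f"No API token given and {env_var} is not set")
    # Add the 'Bearer ' prefix only if the token does not already carry it
    if not token.startswith("Bearer "):
        token = f"Bearer {token}"
    return {"accept": "application/json", "Authorization": token}

# Placeholder token, for illustration only
headers = build_headers("example-token")
print(headers["Authorization"])
```

The resulting dictionary could be passed as `headers=` to `requests.get`, so a changed token only needs updating in one place.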

Now that we have the list of top movies, let's write it to a CSV file.

In [20]:
def write_movies_to_csv(list_of_movies,filename):
    if len(list_of_movies)==0:
        raise Exception("The input list is empty")
    with open(filename,'w') as d:
        headers = list(list_of_movies[0].keys())
        d.write(','.join(headers)+'\n')
        for movie in list_of_movies:
            values=[]
            for header in headers:
                values.append(str(movie.get(header, "")))
            d.write(','.join(values)+'\n')
    
    

In [21]:
write_movies_to_csv(top_movies,'tmdb_top_movies.csv')

Let's check out the CSV file we created.

In [23]:
with open('tmdb_top_movies.csv','r') as t:
    print(t.read())

title,release_year,rating
The Godfather,1972,8.7
The Shawshank Redemption,1994,8.7
The Godfather Part II,1974,8.6
Dilwale Dulhania Le Jayenge,1995,8.6
Schindler's List,1993,8.6
Cuando Sea Joven,2022,8.5
Spirited Away,2001,8.5
12 Angry Men,1957,8.5
Your Name.,2016,8.5
Parasite,2019,8.5
The Dark Knight,2008,8.5
The Green Mile,1999,8.5
Pulp Fiction,1994,8.5
The Good, the Bad and the Ugly,1966,8.5
Forrest Gump,1994,8.5
The Lord of the Rings: The Return of the King,2003,8.5
Dou kyu sei – Classmates,2016,8.5
GoodFellas,1990,8.5
The Boy, the Mole, the Fox and the Horse,2022,8.5
Primal: Tales of Savagery,2019,8.5
Seven Samurai,1954,8.5
Gabriel's Inferno,2020,8.5
Cinema Paradiso,1988,8.5
Life Is Beautiful,1997,8.5
Grave of the Fireflies,1988,8.4
Psycho,1960,8.4
Once Upon a Time in America,1984,8.4
Gabriel's Inferno: Part II,2020,8.4
Fight Club,1999,8.4
Impossible Things,2021,8.4
The Legend of Hei,2019,8.4
One Flew Over the Cuckoo's Nest,1975,8.4
City of God,2002,8.4
The Quintessential Quintuple
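
Note that titles containing commas, such as `The Good, the Bad and the Ugly`, are written without quoting in the output above, so a CSV reader would split them into extra columns. A possible alternative is sketched below using Python's built-in `csv` module, which quotes such values automatically. The function name `write_movies_to_csv_quoted` and the sample file path are hypothetical; the two sample rows are taken from the output above.

```python
import csv
import os
import tempfile

def write_movies_to_csv_quoted(list_of_movies, filename):
    """Write a list of movie dictionaries to CSV, quoting values that contain commas."""
    if not list_of_movies:
        raise ValueError("The input list is empty")
    headers = list(list_of_movies[0].keys())
    # newline='' lets the csv module control line endings; utf-8 keeps non-ASCII titles intact
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=headers, restval='', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(list_of_movies)

sample = [
    {'title': 'The Godfather', 'release_year': '1972', 'rating': 8.7},
    {'title': 'The Good, the Bad and the Ugly', 'release_year': '1966', 'rating': 8.5},
]
sample_path = os.path.join(tempfile.gettempdir(), 'tmdb_sample.csv')
write_movies_to_csv_quoted(sample, sample_path)

# Read the file back to confirm the title with commas stays in one column
with open(sample_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows[2])
```

Reading the file back with `csv.reader` keeps the comma-containing title in a single column.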

Let's combine both steps in a single function.

#### This function will take in 3 arguments:
- `url`: the endpoint we are getting data from.
- `key`: the API access token.
- `filename`: the path where we want to save the CSV file.

In [24]:
def scrape_top_movie_to_csv(url,key,filename):
    top_movies=get_top_movies(url,key)
    write_movies_to_csv(top_movies,filename)
    print(f'Data scraped and saved successfully in {filename}')

In [25]:
scrape_top_movie_to_csv(tmdb_base_url,key,"tmdb_top_movies.csv")

Data scraped and saved successfully in tmdb_top_movies.csv


Let's check the file we created.

In [26]:
with open('tmdb_top_movies.csv','r') as t:
    print(t.read())

title,release_year,rating
The Godfather,1972,8.7
The Shawshank Redemption,1994,8.7
The Godfather Part II,1974,8.6
Dilwale Dulhania Le Jayenge,1995,8.6
Schindler's List,1993,8.6
Cuando Sea Joven,2022,8.5
Spirited Away,2001,8.5
12 Angry Men,1957,8.5
Your Name.,2016,8.5
Parasite,2019,8.5
The Dark Knight,2008,8.5
The Green Mile,1999,8.5
Pulp Fiction,1994,8.5
The Good, the Bad and the Ugly,1966,8.5
Forrest Gump,1994,8.5
The Lord of the Rings: The Return of the King,2003,8.5
Dou kyu sei – Classmates,2016,8.5
GoodFellas,1990,8.5
The Boy, the Mole, the Fox and the Horse,2022,8.5
Primal: Tales of Savagery,2019,8.5
Seven Samurai,1954,8.5
Gabriel's Inferno,2020,8.5
Cinema Paradiso,1988,8.5
Life Is Beautiful,1997,8.5
Grave of the Fireflies,1988,8.4
Psycho,1960,8.4
Once Upon a Time in America,1984,8.4
Gabriel's Inferno: Part II,2020,8.4
Fight Club,1999,8.4
Impossible Things,2021,8.4
The Legend of Hei,2019,8.4
One Flew Over the Cuckoo's Nest,1975,8.4
City of God,2002,8.4
The Quintessential Quintuple
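
As noted earlier, we might want to reuse this logic for other endpoints such as top rated shows. The field names can differ between endpoints, so the extraction step would need a mapping per media type. The sketch below is an assumption-heavy illustration: the TV field names `name` and `first_air_date` should be verified against the TMDB API documentation, `FIELD_MAP` and `extract_info` are hypothetical names, and the `show` entry is a made-up placeholder rather than real data.

```python
# Field names per media type. The movie names come from this notebook;
# the TV names are assumptions to check against the TMDB documentation.
FIELD_MAP = {
    'movie': {'title': 'title', 'date': 'release_date'},
    'tv': {'title': 'name', 'date': 'first_air_date'},
}

def extract_info(item, media_type='movie'):
    """Extract title, release year and rating from one result entry."""
    fields = FIELD_MAP[media_type]
    date = item.get(fields['date']) or ''  # tolerate a missing or empty date
    return {
        'title': item.get(fields['title'], ''),
        'release_year': date[:4],
        'rating': item.get('vote_average'),
    }

movie = {'title': 'Seven Samurai', 'release_date': '1954-04-26', 'vote_average': 8.5}
show = {'name': 'Example Show', 'first_air_date': '2000-01-01', 'vote_average': 8.0}  # placeholder entry

print(extract_info(movie))
print(extract_info(show, media_type='tv'))
```

Using `.get` with a fallback also avoids a `KeyError` when an entry has no date.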

# Summary
- In this project, we used the `requests` and `BeautifulSoup4` libraries to scrape data from [Top 250 TV shows](https://www.imdb.com/chart/toptv/) by creating functions such as `get_show_info()`, `write_to_cs()` and `imdb_to_csv()`.

- We used the **official REST API** provided by the [TMDB developers portal](https://developer.themoviedb.org/docs) to get data from the TMDB [top rated movies](https://www.themoviedb.org/movie/top-rated?language=en-US) page, using functions such as `get_top_movies()`, `write_movies_to_csv()` and `scrape_top_movie_to_csv()`.

- We stored 350 rows of data across the 2 generated CSV files: 250 TV shows from IMDB and 100 movies from TMDB.

### Resources to learn more about Web Scraping:
- [Web Scraping wiki](https://en.wikipedia.org/wiki/Web_scraping)
- [BeautifulSoup4 Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)
- [Geeks For Geeks](https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/)

### Some famous websites that provide official API support:
- [Meta for Developers](https://developers.facebook.com/docs/pages/overview#overview)
- [reddit](https://www.reddit.com/dev/api/)
- [GitHub](https://docs.github.com/en/rest?apiVersion=2022-11-28)

### Future ideas:

- Create a script that tracks stock prices by scraping a page repeatedly at a fixed interval

- Keep an eye on the cost of a product on an e-commerce website

- Get daily news headlines with a script that scrapes news websites every day

In [2]:
jovian.commit(project="IMDB-TMDB-TOP-CONTENT-SCRAPING")

[jovian] Updating notebook "hussainmansuri12345/imdb-tmdb-top-content-scraping" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/hussainmansuri12345/imdb-tmdb-top-content-scraping


'https://jovian.com/hussainmansuri12345/imdb-tmdb-top-content-scraping'