# Web Scraping for Famous TV Shows on themoviedb.org Using Python


![](https://camo.githubusercontent.com/951ab659ae6cff9e6560c00155feaf3a225aaf83f76111a068ff61b30cc6e02c/68747470733a2f2f63646e2e6c796e64612e636f6d2f636f757273652f323834383333312f323834383333312d313630373639383038373633392d313678392e6a7067)


## About Project
**WEB SCRAPING** is the process of extracting useful data from a website. The collected information is then exported into a more useful format, such as a CSV file or spreadsheet. Web scraping can be done manually in most cases, but that is rarely a simple task. Websites come in many shapes and forms; as a result, web scrapers vary in functionality and features. Data scraping tools or software can collect the data and import it into a program to integrate it with your business workflow.


**Python** offers a variety of libraries to scrape and extract information from the web, such as *BeautifulSoup, Requests, Scrapy, Selenium*. Python enables smooth, automated data scraping across the different stages. This process includes interacting with the target website to parse, extract, import, append and harvest data. Python allows you to automate the scraping, parsing, and storage of data in one system.


In this **Project** we will scrape the TMDB website (https://www.themoviedb.org/tv) and extract a list of popular TV shows along with some related information about each show.
> It involves the following main steps.

* Making an HTTP request to the server and loading the web page https://www.themoviedb.org/tv using the requests library
* Inspecting the HTML code of the page
* Identifying the data we want to extract
* Parsing the website's code using the BeautifulSoup library
* Extracting the list of shows from the main page. For each show, we'll get the\
    **show name, user rating, show's URL and release date**
* And finally transferring the extracted data into a CSV file using the pandas library.




![](https://i.imgur.com/tLtNldX.jpg)


## To Run Code

You can execute the code by clicking the "Run" button or by selecting the "Run on Binder" option. 

## Installing required Libraries
Let’s start by installing the required packages.

In [1]:
# Install the libraries
!pip install beautifulsoup4 --upgrade --quiet

!pip install requests --upgrade --quiet

!pip install pandas --upgrade --quiet


## Importing required packages
Next, let's import the packages we just installed.

In [2]:
# Let's import necessary packages 
import requests
import pandas as pd
from bs4 import BeautifulSoup


###  Making an HTTP request and Extracting information from HTML

When we access the URL https://www.themoviedb.org/tv using a web browser, it downloads the contents of the web page the URL points to and displays the output on the screen. To extract information from the HTML source code of a webpage programmatically, we can use the BeautifulSoup class from the bs4 module of the Beautiful Soup library.

The steps involved in this process are:

* Downloading the page using requests.get
* Getting a response object with the page contents and using its status code to check whether the request was successful
* Getting the contents of the web page, which can be accessed using the .text property of the response
* Creating an HTML file and saving the contents in that file
* Creating a BeautifulSoup object to parse the content




In [3]:
# Defining a function to download a web page using `requests`
tvShow_url = 'https://www.themoviedb.org/tv'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48'}
def get_show_page(tvShow_url):
    # Access the webpage using `requests`
    response = requests.get(tvShow_url, headers=headers)
    # Check if the request was successful
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(tvShow_url))
    # Get the page contents
    page_contents = response.text
    # Create an HTML file and write the page contents to it
    with open('famous-tv-shows.html', 'w', encoding="utf-8") as file:
        file.write(page_contents)
    # Parse the response text using BeautifulSoup; this gives us a BeautifulSoup object
    tv_doc = BeautifulSoup(page_contents, 'html.parser')

    return tv_doc

In [9]:
doc=get_show_page(tvShow_url)
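
Since `get_show_page` also writes the page to `famous-tv-shows.html`, the saved copy can be parsed again without sending another request. The following is a minimal sketch, assuming the file was written by the cell above; `load_saved_page` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_saved_page(path):
    """Parse a previously saved HTML file (hypothetical helper, for illustration only)."""
    # Read the file with the same encoding that was used when saving it
    html = Path(path).read_text(encoding="utf-8")
    # Build a BeautifulSoup object with the same parser used for the live page
    return BeautifulSoup(html, "html.parser")
```

For example, `doc = load_saved_page('famous-tv-shows.html')` should give an object that can be passed to the extraction functions below, which is handy while experimenting with selectors.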


## "Inspect" a website’s source code to extract information

Most browsers offer a web inspector in their developer tools; in Chrome, simply right-click on the page and click Inspect.\
With this, we'll be able to find any hyperlinks and the source of any other materials embedded on the web page. We'll also be able to read alt text (used to describe the function or content of an image or element on a page) and captions of images, which could include the information we are looking for.\
As we see in the image below, the TV show names are embedded in the a tags under h2 tags.\
![](https://i.imgur.com/GadZKHN.jpg)

### Getting Show Names
The shows' titles are embedded in the a tag inside each h2 tag, as the link text and the title attribute.
For each h2 tag, we can use tag.a.text to retrieve the name of the TV show.
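
To see this navigation in isolation, here is a minimal sketch on a hypothetical HTML snippet that mimics the structure described above (the markup of the real TMDB page may differ). It keeps only the h2 tags that contain a link, which is one way to skip headings that are not show titles.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from the TMDB page
sample_html = """
<h2>Popular TV Shows</h2>
<div class="card">
  <h2><a href="/tv/1" title="Show One">Show One</a></h2>
</div>
<div class="card">
  <h2><a href="/tv/2" title="Show Two"> Show Two </a></h2>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# tag.a is None for an h2 without a link, so those headings are filtered out
show_names = [h2.a.text.strip() for h2 in soup.find_all("h2") if h2.a is not None]
print(show_names)
```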
   

In [30]:
def get_show_names(tv_doc):

    " This Function takes BeautifulSoup object and extracts the show's names from HTML source code."
    show_names_tags = tv_doc.find_all('h2')[4:]  #Excluding initial h2 tags that dont contain shows' names
    show_names = []
    # get list of all the show names from the page 
    for tag in show_names_tags:
        show_names.append(tag.a.text.strip())
    return show_names

In [25]:
a= get_show_names(doc)
a

['El Kebeer Awi',
 'My Heart Relieved',
 'منهو ولدنا؟',
 'Melur Untuk Firdaus',
 'Aashiqana',
 'The Path',
 'Amor Perfeito',
 'London Class',
 'Till Death',
 'Ramadan Kareem',
 'Gaafar El Omda',
 'Teri Meri Doriyaann',
 'Al Maddah',
 'Al rojo vivo',
 'His All-Knowing Secret',
 'Al Aghar',
 'Baba Almgal',
 'Settohom',
 'gold house',
 "Al Zind: Thi'b Al Assi"]

### Getting Show's URL
The shows' URLs are embedded in the href attribute of the a tag inside each h2 tag.
We can use tag.a['href'] to retrieve the relative URL of each TV show and get the full URL by simply concatenating the base URL with it.

In [6]:
def get_show_urls(tv_doc):
    " This Function takes BeautifulSoup object and extracts the show's URLs from HTML source code."
    show_urls = []
    base_url = 'https://www.themoviedb.org'
    show_names_tags = tv_doc.find_all('h2')[4:]   #Excluding initial h2 tags that dont contain show's URLs
     # get list of all the show URLs from the page 
    for tag in show_names_tags:
        show_urls.append(base_url + tag.a['href'])
    return show_urls

In [8]:
urls=get_show_urls(doc)
urls

['https://www.themoviedb.org/tv/52698',
 'https://www.themoviedb.org/tv/101604',
 'https://www.themoviedb.org/tv/196080',
 'https://www.themoviedb.org/tv/203057',
 'https://www.themoviedb.org/tv/203504',
 'https://www.themoviedb.org/tv/204370',
 'https://www.themoviedb.org/tv/209085',
 'https://www.themoviedb.org/tv/211352',
 'https://www.themoviedb.org/tv/121745',
 'https://www.themoviedb.org/tv/72205',
 'https://www.themoviedb.org/tv/218739',
 'https://www.themoviedb.org/tv/215103',
 'https://www.themoviedb.org/tv/122543',
 'https://www.themoviedb.org/tv/101463',
 'https://www.themoviedb.org/tv/218313',
 'https://www.themoviedb.org/tv/218886',
 'https://www.themoviedb.org/tv/218323',
 'https://www.themoviedb.org/tv/218320',
 'https://www.themoviedb.org/tv/220962',
 'https://www.themoviedb.org/tv/221018']
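
String concatenation works here because every href starts with a slash. As an alternative sketch, `urljoin` from the standard library handles both relative and already-absolute links; the `relative_links` list below is illustrative data, not output from the page.

```python
from urllib.parse import urljoin

base_url = "https://www.themoviedb.org"

# Illustrative hrefs: two relative paths and one link that is already absolute
relative_links = ["/tv/52698", "/tv/101604", "https://www.themoviedb.org/tv/196080"]

# urljoin resolves relative paths against the base and leaves absolute URLs unchanged
full_urls = [urljoin(base_url, link) for link in relative_links]
print(full_urls)
```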

### Getting Show's Rating
The user ratings are stored in the data-percent attribute of the div tags with the user_score_chart class.

In [12]:
def get_show_rating(tv_doc):
    
    " This Function takes BeautifulSoup object and extracts the show's rating from HTML source code."
    show_rating_tags = tv_doc.find_all('div', class_= 'user_score_chart')
    show_rating = []
    # get the list of ratings of all the shows in the page
    for tag in show_rating_tags:
        show_rating.append(tag.attrs['data-percent'])
    return show_rating


In [26]:
rating=get_show_rating(doc)
rating

['66.0',
 '58.0',
 '10',
 '51.0',
 '76.0',
 '54.0',
 '60',
 '10',
 '63.0',
 '40',
 '56.0',
 '49.0',
 '33.0',
 '17.0',
 '55.0',
 '100',
 '55.0',
 '10',
 '100',
 '90']
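
The ratings come back as strings in mixed formats ('66.0' next to '10'). If numeric values are needed later, for sorting or averaging, they could be converted as in the sketch below. `to_score` is a hypothetical helper, and returning `None` for non-numeric values is an assumption about how missing ratings should be handled.

```python
def to_score(value):
    """Convert a data-percent string such as '66.0' or '10' to a float.

    Returns None when the value cannot be read as a number (an assumption
    made for this sketch, not behaviour taken from the original notebook).
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


raw_ratings = ['66.0', '58.0', '10', '100']
scores = [to_score(r) for r in raw_ratings]
print(scores)
```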

### Getting Show's Release Date
The show's release dates are embedded in p tags.


In [20]:

def get_release_dates(tv_doc):
    " This Function takes BeautifulSoup object and extracts the shows' release dates from HTML source code."
    show_date_tags = tv_doc.find_all('p')[1:21]
    show_dates = []
    # get list of all the show's release dates from the page 
    for tag in show_date_tags:
        show_dates.append(tag.text.strip())
    return show_dates




In [21]:
dates=get_release_dates(doc)
dates

['Aug 11, 2010',
 'May 17, 2018',
 'Apr 02, 2022',
 'Jun 06, 2022',
 'May 27, 2022',
 'Apr 13, 2021',
 'May 27, 2017',
 'Apr 13, 2021',
 'Mar 23, 2023',
 'Mar 23, 2023',
 'Mar 20, 2023',
 'Mar 23, 2023',
 'Jan 04, 2023',
 'Mar 23, 2023',
 'Mar 23, 2023',
 'Mar 23, 2023',
 'Mar 23, 2023',
 'Mar 23, 2023',
 'Mar 23, 2023',
 'Mar 23, 2023']
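
The dates are plain strings such as 'Aug 11, 2010'. A small sketch of turning them into date objects with the standard library follows. `parse_release_date` is a hypothetical helper, and the `%b` directive assumes English month abbreviations (the default C locale).

```python
from datetime import datetime


def parse_release_date(text):
    """Parse a string like 'Aug 11, 2010' into a datetime.date (hypothetical helper)."""
    # %b = abbreviated month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(text, "%b %d, %Y").date()


sample_dates = ['Aug 11, 2010', 'May 17, 2018']
iso_dates = [parse_release_date(d).isoformat() for d in sample_dates]
print(iso_dates)
```

Storing dates in ISO format makes them sort correctly as text in the CSV file.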

### Putting All Functions Together and Creating the Database
Let's combine all the functions defined above into a single script that builds a dictionary from the extracted lists and writes it to a CSV file.

In [3]:
import requests
from bs4 import BeautifulSoup
tvShow_url = "https://www.themoviedb.org/tv"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48'}
import pandas as pd


# Defining a function to download a web page using `requests`

def get_show_page(tvShow_url):
    # Access the webpage using `requests`
    response = requests.get(tvShow_url, headers=headers)
    # Check if the request was successful
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(tvShow_url))
    # Get the page contents
    page_contents = response.text
    # Create an HTML file and write the page contents to it
    with open('famous-tv-shows.html', 'w', encoding="utf-8") as file:
        file.write(page_contents)
    # Parse the response text using BeautifulSoup; this gives us a BeautifulSoup object
    tv_doc = BeautifulSoup(page_contents, 'html.parser')

    return tv_doc

# getting shows' names
def get_show_names(tv_doc):

    " This Function takes BeautifulSoup object and extracts the show's names from HTML source code."
    show_names_tags = tv_doc.find_all('h2')[4:]  #Excluding initial h2 tags that dont contain show names
    show_names = []
    # get list of all the show names from the page 
    for tag in show_names_tags:
        show_names.append(tag.a.text.strip())
    return show_names

#getting shows' urls
def get_show_urls(tv_doc):
    " This Function takes BeautifulSoup object and extracts the show's URLs from HTML source code."
    show_urls = []
    base_url = 'https://www.themoviedb.org'
    show_names_tags = tv_doc.find_all('h2')[4:]   #Excluding initial h2 tags that dont contain shows' URLs
     # get list of all the show URLs from the page 
    for tag in show_names_tags:
        show_urls.append(base_url + tag.a['href'])
    return show_urls

# getting shows' ratings
def get_show_rating(tv_doc):
    
    " This Function takes BeautifulSoup object and extracts the show's rating from HTML source code."
    show_rating_tags = tv_doc.find_all('div', class_= 'user_score_chart')
    show_rating = []
    # get the list of ratings of all the shows in the page
    for tag in show_rating_tags:
        show_rating.append(tag.attrs['data-percent'])
    return show_rating

#Getting release dates
def get_release_dates(tv_doc):
    " This Function takes BeautifulSoup object and extracts the shows' release dates from HTML source code."
    show_date_tags = tv_doc.find_all('p')[1:21]  #Excluding some "p" tags that dont contain shows' release dates
    show_dates = []
    # get list of all the show's release dates from the page 
    for tag in show_date_tags:
        show_dates.append(tag.text.strip())
      
    return show_dates

def get_show_details(tv_doc):
    
    "Function to get the show's information: genre and creator"
    div1_tags = tv_doc.find('div', class_ = 'facts')
    genre = div1_tags.get_text(strip=True, separator=' ')
        
    div2_tags = tv_doc.find_all('div', class_= 'scroller_wrap should_fade is_fading')
    creator = div2_tags[0].text.strip().partition("\n")[0]
    
    return  genre, creator

def popular_tvShows():
    
    "Function to scrape the first 20 pages of popular TV shows from TMDB and save the results to a CSV file."
    
    # Getting list of popular shows from the TMDB website
    page_count = 1 # Initializing the show page count to 1
    
    # Declaring lists for all the shows' elements
    names, ratings, release_dates, urls = [],[],[],[]
        
    while page_count <= 20: # Looping over 20 pages of the TMDB website
        
        tvShow_url = "https://www.themoviedb.org/tv?page=%d" %(page_count)
        
        doc= get_show_page(tvShow_url)
      
                      
        # Append each show attribute to respective lists
        names += get_show_names(doc)
        ratings += get_show_rating(doc)
        release_dates += get_release_dates(doc)
        urls += get_show_urls(doc) 
        page_count += 1

    # Defining a dictionary to store the show information
    shows_dict = {
        'Name': names,
        'Rating': ratings,
        'Release_date': release_dates,
        'URL': urls
    }
    
    database = pd.DataFrame(shows_dict)
    database.to_csv('popular_shows.csv', index=None)
    
    return database

In [4]:
df=popular_tvShows()
df

| | Name | Rating | Release_date | URL |
|---|---|---|---|---|
| 0 | Teri Meri Doriyaann | 70 | Jan 04, 2023 | https://www.themoviedb.org/tv/215103 |
| 1 | Imlie | 73.0 | Nov 16, 2020 | https://www.themoviedb.org/tv/113218 |
| 2 | Titlie | 80 | Jun 06, 2023 | https://www.themoviedb.org/tv/228093 |
| 3 | Aashiqana | 67.0 | Jun 06, 2022 | https://www.themoviedb.org/tv/203504 |
| 4 | Do Dil Mil Rahe Hai | 77.0 | Jun 12, 2023 | https://www.themoviedb.org/tv/228627 |
| ... | ... | ... | ... | ... |
| 395 | Sense8 | 78.0 | Jun 05, 2015 | https://www.themoviedb.org/tv/61664 |
| 396 | The Misfit of Demon King Academy | 85.0 | Jul 04, 2020 | https://www.themoviedb.org/tv/97617 |
| 397 | Animal Kingdom | 77.0 | Jun 14, 2016 | https://www.themoviedb.org/tv/66025 |
| 398 | XO, Kitty | 82.0 | May 18, 2023 | https://www.themoviedb.org/tv/195670 |
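
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the extractors picks up an extra or a missing tag on some page. Below is a small sketch of a check that could run before building the DataFrame. `check_equal_lengths` is a hypothetical helper, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length (hypothetical helper)."""
    # Map each column name to the number of values collected for it
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Illustrative data shaped like shows_dict
sample = {
    'Name': ['Sense8', 'Animal Kingdom'],
    'Rating': ['78.0', '77.0'],
}
print(check_equal_lengths(sample))
```

The error message lists the length of each column, which makes it easier to see which extractor went out of step.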


## Summary

We followed these steps in this project:

* Downloaded web pages using the requests library
* Inspected the HTML source code of a web page
* Parsed parts of a website using Beautiful Soup
* Created a dictionary with lists of information
* Compiled the shows' information into a pandas DataFrame
* Wrote the DataFrame to a CSV file

### References

1. https://www.youtube.com/watch?v=RKsLLG-bzEY
2. https://medium.com/geekculture/web-scraping-cheat-sheet-2021-python-for-web-scraping-cad1540ce21c
3. https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
4. https://realpython.com/beautiful-soup-web-scraper-python/


