# Scraping TV Shows on TMDB Using Python


- The Movie Database (TMDB) is a community-built movie and TV database. It also provides an API portal for researchers who want access to movie data.
- The page https://www.themoviedb.org/tv lists the popular TV shows on TMDB. Let's retrieve information from this page using _web scraping_. We are going to use Requests and Beautiful Soup to scrape data from this page.

![web1.png](https://i.imgur.com/HQBMrPp.png)

- After opening the website, we navigate to the TV Shows tab at the top left and click the Popular option to get the page of popular TV shows.



### Here are the steps we'll follow:

- We're going to scrape https://www.themoviedb.org/tv
- The first step is to download the web page using `requests`.
- Parse the HTML source code using `BeautifulSoup`.
- We'll check out the page that has the list of TV shows. For each show, we'll extract the title, user score, the show's individual page URL, and the premiere date.
- From each individual page URL, we'll extract more information about the show: Current_season, Current_season_Episodes, Tagline, Genre, and Cast.
- Compile the extracted information into Python lists and dictionaries.
- Extract and combine data from multiple pages.
- Finally, we are going to save the extracted information to a CSV file.


The extracted data will look like this in tabular form:


Title | User_rating | Release_date | Current_season | Current_season_Episodes | Tagline | Genre | Cast
:------ | :---------- | :----------- | :------------ | :---------------------- | :------- | :----- | :---
The Snitch Cartel: Origins | 81.0 | Jul 28, 2021 | Season 1 | 60 Episodes | No Tagline | ['Crime', 'Soap'] | ['Juan Pablo Urrego']
Noovo Le Fil Québec | Not rated yet | Mar 29, 2021 | Season 1 | 110 Episodes | No Tagline | ['News'] | ['Lisa-Marie Blais']

- We'll save the shows to a CSV file in the following format:

   1. Title, User_rating, Release_date, Current_season, Current_season_Episodes, Tagline, Genre, Cast
   2. The Snitch Cartel: Origins, 81.0, "Jul 28, 2021", Season 1, 60 Episodes, No Tagline, "['Crime', 'Soap']", ['Juan Pablo Urrego']
   3. Noovo Le Fil Québec, Not rated yet, "Mar 29, 2021", Season 1, 110 Episodes, No Tagline, ['News'], ['Lisa-Marie Blais']
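The quotation marks around some of the fields above come from CSV quoting: a field that contains a comma has to be quoted so that it is still read back as one value. Here is a minimal sketch of that behaviour using Python's standard `csv` module (the tutorial itself writes its CSV files with pandas later on). The in-memory buffer and the variable names are only illustrative.

```python
import csv
import io

header = ["Title", "User_rating", "Release_date", "Current_season",
          "Current_season_Episodes", "Tagline", "Genre", "Cast"]
row = ["The Snitch Cartel: Origins", "81.0", "Jul 28, 2021", "Season 1",
       "60 Episodes", "No Tagline", "['Crime', 'Soap']", "['Juan Pablo Urrego']"]

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))
print(parsed_rows[1][2])
```

Only the fields that contain a comma are wrapped in quotes, which is why the cast list with a single name stays unquoted.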






## Download the webpage using `requests` 
Let's visit the website first and examine the information we need. The steps below fetch that information and put it into a proper format.

- Use the `requests` library to download the web page. The library can be installed using `pip`.

To download a page, we can use the `get` function from `requests`, which returns a response object.


In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

# The library is now installed and imported.

In [3]:
# Some websites reject requests that do not look like they come from a browser.
# Sending a browser-like User-Agent header avoids a 403 status code here.

needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
response = requests.get("https://www.themoviedb.org/tv", headers = needed_headers )

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` attribute can be used to check whether the response was successful. A successful response has an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.

In [4]:
response.status_code

200
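The 200 to 299 rule can be written as a small check. The sketch below uses only the standard library; `is_success` and `describe` are hypothetical helper names introduced for illustration, not part of `requests`.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for any status code in the 200-299 range."""
    return 200 <= status_code <= 299

def describe(status_code):
    """Build a short, readable description such as '200 OK'."""
    return "{} {}".format(status_code, HTTPStatus(status_code).phrase)

for code in (200, 403, 404):
    outcome = "success" if is_success(code) else "failure"
    print(describe(code), "->", outcome)
```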

The request was successful. We can get the contents of the page using `response.text`.

In [5]:
dwn_content = response.text
len(dwn_content)

189176

As we can see, the page has 189176 characters in total.

Let's take a look at the first 500 characters.

In [6]:
dwn_content[:500]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular TV Shows &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    \n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="'

What we are looking at above is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page in the form of text.

## Parse the HTML source code using Beautiful Soup

- Import the `BeautifulSoup` class from the `bs4` package.


In [7]:
!pip install beautifulsoup4 --upgrade --quiet

In [8]:
from bs4 import BeautifulSoup

 To parse the HTML source code we use the `BeautifulSoup()` constructor, which here takes two arguments:
 - the content of the page
 - the name of the parser, `'html.parser'`

In [9]:
test_doc = BeautifulSoup(dwn_content, 'html.parser')

In [10]:
type(test_doc)

bs4.BeautifulSoup

In [11]:
test_doc.find('title')

<title>Popular TV Shows — The Movie Database (TMDB)</title>

In [12]:
test_doc.find('img')

<img alt="The Movie Database (TMDB)" height="20" src="/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.svg" width="154"/>

 With what we have so far, let's create a function that downloads a web page using `requests` and parses it with `BeautifulSoup`. It returns a `BeautifulSoup` object for any given link.

In [13]:

def get_page_content(url):
    # Send a browser-like User-Agent header to avoid a 403 status code.
    get_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response_page = requests.get(url, headers=get_headers)
    # `.ok` is False for 4xx and 5xx status codes, so raise an exception in that case.
    if not response_page.ok:
        raise Exception("Failed to request the data. Status Code:- {}".format(response_page.status_code))
    # Parse the downloaded HTML and return the BeautifulSoup object.
    return BeautifulSoup(response_page.text, "html.parser")

In [14]:
popular_shows_url = "https://www.themoviedb.org/tv"
doc = get_page_content(popular_shows_url)

In [15]:
#let's try to get the title of the page to check if our function works. 

doc.title.text

'Popular TV Shows — The Movie Database (TMDB)'

Let's create some helper functions to parse information from the page.

To get each show's title and premiere date, we can pick the 'h2' and 'p' tags, respectively, inside the 'div' with the class 'card style_1'.

![Web 2.png](https://i.imgur.com/JFyb88G.png)

In [16]:
# Now that we know the class, let's try to get the title of the first show.

doc.find_all('div', {'class': 'card style_1'})[0].h2.text


'Chucky'

- For the premiere date we only need to change the tag to 'p', because it also sits inside the div with the class 'card style_1'.
- Similarly, we can get the user rating of the show from the div with the class "user_score_chart" by reading the value of its "data-percent" attribute.
 
 Below is an example that gets the user rating of the first show, the one whose title we just extracted.

In [17]:
doc.find_all('div', {'class': 'user_score_chart'})[0]['data-percent']


'80'

We are going to create a dictionary whose keys are the column names of our CSV file and whose values are the data that we _scrape/extract_ from the web page. It is simply a dict-type variable in Python that stores all of our data, which we will use to create the _Dataframe_ and the _CSV_ later on.


In [18]:
def empty_dict():
    scraped_dict = {  
                    'Title': [],
                    'User_rating': [], 
                    'Release_date':[], 
                    'Current_season': [],
                    'Current_season_Episodes': [], 
                    'Tagline': [],
                    'Genre': [],
                    'Cast': []   
                    }
    return scraped_dict
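To see how a dictionary of lists maps onto table rows, here is a small sketch that fills two of the columns by hand and then pairs the values up with `zip`. The two-column dictionary is a shortened stand-in for the one returned by `empty_dict()`, and the sample values are typed in rather than scraped.

```python
# A shortened stand-in for the dictionary returned by empty_dict().
scraped = {"Title": [], "User_rating": []}

# Each scraped show appends exactly one value to every column list.
for title, rating in [("Chucky", "80"), ("The Price Is Right", "67.0")]:
    scraped["Title"].append(title)
    scraped["User_rating"].append(rating)

# Position i in every list belongs to the same show, so zip rebuilds the rows.
rows = list(zip(*scraped.values()))
print(rows)
```

This is also why every helper has to append something for every show, even a placeholder: if one list ends up shorter than the others, the columns no longer line up.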

- If we look at several shows, we will notice that not every show has been rated yet.
- Let's create a function to deal with this. If a show is not rated yet, we can skip it or store a message instead.
- The function appends the user score to the dictionary. It will come in handy later when we create the final function.

In [19]:
def user_score_info(tag_user_score, i, scraped_dict):
    if tag_user_score[i]['data-percent'] == '0':
        scraped_dict['User_rating'].append('Not rated yet')
    else:
        scraped_dict['User_rating'].append(tag_user_score[i]['data-percent'])
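The same decision can be tried out without any HTML by working on the raw attribute string. This is only a sketch: `format_user_score` is a hypothetical helper, and comparing the value as a number (so that `'0.0'` also counts as unrated) is an assumption that goes slightly beyond the string comparison used above.

```python
def format_user_score(data_percent):
    """Map the raw 'data-percent' string to the value stored under User_rating."""
    # A score of zero is treated as "no rating yet", as in user_score_info.
    if float(data_percent) == 0:
        return "Not rated yet"
    return data_percent

for raw in ("0", "0.0", "80"):
    print(repr(raw), "->", format_user_score(raw))
```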

#### One more piece of information we need from this page is the URL of each show's individual page, so that we can get the rest of the information.

This also comes from the div with the class "card style_1": we just need the value of the "href" attribute of the 'a' tag inside the 'h2' tag.

In [19]:
doc.find_all('div', {'class': 'card style_1'})[0].h2.a['href']

'/tv/90462'
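The value we get back is a relative link, so it has to be combined with the site address before it can be requested. The tutorial does this with plain string concatenation; as an alternative sketch, the standard library's `urljoin` does the same job and also copes with a trailing slash on the base address.

```python
from urllib.parse import urljoin

base_url = "https://www.themoviedb.org"
relative_link = "/tv/90462"

# Combine the site address with the relative link taken from the 'href' attribute.
show_url = urljoin(base_url, relative_link)
print(show_url)
```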

### Now we have all the information that we need from this page. Let's put it all together in one function.

In [20]:
def get_show_info(doc_page):
    base_link_1 = "https://www.themoviedb.org"
    # Each 'card style_1' div holds the title, the premiere date, and the link to the show's page.
    tag_shows_page = doc_page.find_all('div', {'class': 'card style_1'})
    tag_user_score = doc_page.find_all('div', {'class': 'user_score_chart'})

    doc_2_list = []
    for link in tag_shows_page:
        # Download and parse each show's individual page; the other functions use this list.
        doc_2_list.append(get_page_content(base_link_1 + link.h2.a['href']))
    # Return the card tags, the user score tags, and the list of parsed show pages.
    return tag_shows_page, tag_user_score, doc_2_list

In [21]:
# Let's check that the function returns the three pieces of information we need.
len(get_show_info(doc))

3

### Let's scrape Current_season, Current_season_Episodes, Tagline, Genre, and Cast.

 - From the list of the shows' individual web pages we will get Current_season, Current_season_Episodes, Tagline, Genre, and Cast.
  - Let's start with the genre and tagline of the show. The classes we are going to use are "genres" and "tagline". Take a look at the image below for a better understanding.
![web 3.png](https://i.imgur.com/PL4xjM9.png)

In [22]:
# Let's download and parse the individual page of the show 'What If...?' with the function get_page_content().
doc_2 = get_page_content("https://www.themoviedb.org/tv/91363")

- As we can see, the 'a' tags under the class 'genres' contain the genres of the show. Let's create a list of genres.


In [23]:
tag_genre = doc_2.find('span', {"class": "genres"})
tag_genre_list = tag_genre.find_all('a')

check_genre =[]
for tag in tag_genre_list:
    check_genre.append(tag.text)

check_genre


['Animation', 'Action & Adventure', 'Sci-Fi & Fantasy']

In [24]:
# Let's create a function to get the genres of a show.
# `i` is the index into the list `doc2_page`, which contains the parsed show pages. It will come in handy later on.
def get_genres(doc2_page, i):
    genres_tags = doc2_page[i].find('span', {"class": "genres"}).find_all('a')
    check_genre =[]
    
    for tag in genres_tags:
        check_genre.append(tag.text)
    return check_genre

- The next piece of information we need is the tagline of the show, but for some shows it is not available yet.
- Let's create a function for this and handle the missing case with a simple if/else statement.
- The function appends the value to the dictionary.

In [25]:
tag_tagline = doc_2.find('h3',{"class": 'tagline'})

def tagline_info(doc_2_list, i, scraped_dict):
    if doc_2_list[i].find('h3',{"class": 'tagline'}):
        scraped_dict['Tagline'].append(doc_2_list[i].find('h3',{"class": 'tagline'}).text)
    else:
        scraped_dict['Tagline'].append("No Tagline")

- Just as we did for the genres, if we inspect the HTML of a TV show's web page, we can get the cast list from the 'li' tags with the class 'card'.

![image5.png](https://i.imgur.com/vx1BxX2.png)



Let's create a function to get the cast of the show. 


In [27]:
# `i` is the index into the list `doc2_page`, which contains the parsed show pages.

def get_show_cast(doc2_page, i):
    cast_tags = doc2_page[i].find_all('li', {'class': 'card'})
    cast_lis = []
    
    # The first 'p' tag of each cast card holds the cast member's name.
    for t in cast_tags:
        cast_lis.append(t.p.text)
    
    return cast_lis

 - For the last few pieces of information, we are going to scrape the current season and its episodes. We can access them under the class "flex".  
 ![web 5.png](https://i.imgur.com/PQGeBQk.png)
 
 - The "current_season" is simple: we just take the text of the "h2" tag.
 - But if we look closely at the image above, the current season's episodes are a little trickier, because the year appears in front of the number of episodes.
 - We are going to cut that part off so that we are left with just the number of episodes. Check out the cells below for a better understanding.

In [31]:
tag_episodes = doc_2.find_all('div' , {'class': 'flex'})
# Extract the current season from the h2 tag under the class flex.
tag_episodes[1].h2.text 

'Season 1'

- As the code cell below shows, the "h4" tag returns not just the number of episodes but also the year.
- The show pages we looked at on the website all use this format.

In [32]:
tag_episodes[1].h4.text

'2021 | 9 Episodes'

- Let's take the year part out. String slicing makes this easy.
- In the string returned by the "h4" tag under the class "flex", index 7 marks the beginning of the number of episodes.
- This holds as long as the text starts with a four-digit year followed by " | ", which is the format we saw on the show pages.


In [33]:
print('2021 | 9 Episodes'[7:])

tag_episodes[1].h4.text[7:]

9 Episodes


'9 Episodes'
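The fixed slice relies on the year always taking up the first seven characters. As a hedged alternative, the sketch below splits on the `|` separator instead, so it would still return the episode count if the year were missing. `episodes_from_heading` is a hypothetical helper and is not used in the rest of this notebook.

```python
def episodes_from_heading(heading_text):
    """Return the text after the last '|', or the whole text if there is no '|'."""
    return heading_text.split("|")[-1].strip()

print(episodes_from_heading("2021 | 9 Episodes"))
print(episodes_from_heading("9 Episodes"))
```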

### Now we know how to get all the information we need about a TV show. Let's create a function that extracts all the information and returns a Dataframe.


In [34]:
import pandas as pd

def get_show_details(t_title, t_user_score, docs_2_list):
    # Start from a fresh, empty dictionary every time the function is called.
    scraped_dict =  empty_dict()
    for i in range (0, len(t_title)):
        scraped_dict['Title'].append(t_title[i].h2.text)
        user_score_info(t_user_score, i, scraped_dict)    
        scraped_dict['Release_date'].append(t_title[i].p.text)
        scraped_dict['Current_season'].append(docs_2_list[i].find_all('div' , {'class': 'flex'})[1].h2.text)
        scraped_dict['Current_season_Episodes'].append(docs_2_list[i].find_all('div' , {'class': 'flex'})[1].h4.text[7:])
        tagline_info(docs_2_list, i, scraped_dict)  
        scraped_dict['Genre'].append(get_genres(docs_2_list, i))        
        scraped_dict['Cast'].append(get_show_cast(docs_2_list, i))
        
    return pd.DataFrame(scraped_dict)

- We are now going to execute our function.
- Before that, we need the values that the function takes as arguments, which is the information from the first page.
- `get_show_info()` is the function we created earlier to get the information from the first page.

In [35]:
tag_title_, tag_user_score_, doc_2_list_ = get_show_info(doc)

Before we execute our function `get_show_details()`, we need to know how to create a _CSV_ file. Fortunately, this is simple in Python: we just call the method `.to_csv('path/csv_file_name')` on a dataframe.

Every file is stored in a directory that has a path. If we pass only a file name to `.to_csv()`, the file is saved in the current working directory. In this Jupyter notebook, we can click the File option in the top left corner and open the file system to view or edit our files. Look at the picture below for a better understanding.


In our case, the function `get_show_details()` returns the dataframe that we are going to convert into a _CSV_ file.

In [36]:
# Let's execute our function to check that it works, then take a look at the data in the dataframe.

x = get_show_details(tag_title_, tag_user_score_, doc_2_list_)
x.to_csv('check.csv')
pd.read_csv('check.csv',index_col=[0])

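NameError: name 'user_score_info' is not defined

This `NameError` means the cell that defines `user_score_info` had not been run in this session. Run that cell first, then execute this cell again.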

- To work with directories and files, we import the `os` module, which lets us create or remove directories on our system. We can create a directory for our CSV files with `os.makedirs('directory_name', exist_ok=True)`.
- Next, we create a function that takes a page number and a list, builds a dataframe for that page, converts it into a CSV file, and saves it in a folder.
- The function also appends the dataframe to the list.
- The function uses the functions we created earlier to scrape pages of shows that are not visible on the site until we click the button that loads the next page of the list.
- To do so, we add " _**?page=page_number**_ " to the end of the URL.
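As a quick illustration of the paging scheme, the sketch below builds the URLs of the first three pages. It uses `urlencode` from the standard library to produce the `page=...` part; `page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

base_link = "https://www.themoviedb.org/tv"

def page_url(page_number):
    """Return the URL of the given page of the popular TV shows list."""
    return "{}?{}".format(base_link, urlencode({"page": page_number}))

page_urls = [page_url(n) for n in range(1, 4)]
for url in page_urls:
    print(url)
```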



In [None]:
import os
base_link = "https://www.themoviedb.org/tv"

# 'i' here means the number of page we want to extract
def create_page_df( i, dataframe_list):
    os.makedirs('shows-data', exist_ok = True)
    next_url = base_link + '?page={}'.format(i)
    doc_top = get_page_content(next_url)
    name_tag, viewer_score_tag, doc_2_lis = get_show_info(doc_top)
    print('scraping page {} :- {}'.format(i, next_url))
    dataframe_data = get_show_details(name_tag, viewer_score_tag, doc_2_lis)
    dataframe_data.to_csv("shows-data/shows-page-{}.csv".format(i) , index = None)
    print(" ---> a CSV file with name shows-page-{}.csv has been created".format(i))
    dataframe_list.append(dataframe_data)

In [None]:
test_list = []
create_page_df(9 , test_list)
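The directory handling inside `create_page_df` can be tried in isolation. The sketch below runs in a temporary directory so that it leaves nothing behind; `page_csv_path` is a hypothetical helper that only builds the file path and does not write any data.

```python
import os
import tempfile

def page_csv_path(output_dir, page_number):
    """Make sure the output directory exists and return the CSV path for a page."""
    # exist_ok=True means a second call does not fail when the directory is already there.
    os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, "shows-page-{}.csv".format(page_number))

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "shows-data")
    first_path = page_csv_path(out_dir, 1)
    second_path = page_csv_path(out_dir, 2)  # the directory already exists here
    dir_exists = os.path.isdir(out_dir)

print(os.path.basename(first_path), os.path.basename(second_path), dir_exists)
```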

## So far, we are able to extract the information and put it in a CSV file. Let's look at the last few steps we need.

The last step is a final function that creates one CSV file containing all 200 shows.

- In the final function, we combine the list of dataframes with the **_`concat()`_** function and convert the result into a _CSV_ file.
- `concat()` takes a list of dataframes and combines them into one big dataframe, which can then be converted into a _CSV_ file.
- We could scrape hundreds of shows, but we are going to scrape the top 200 just to keep it clean and simple.


In [None]:
import pandas as pd
base_link = "https://www.themoviedb.org/tv"

def scrape_top_200_shows(base_link):
    dataframe_list = []
    # range(1, 11) covers pages 1 to 10, which is enough for the top 200 TV shows.
    for i in range(1,11):
        create_page_df(i, dataframe_list)
    # concat merges the dataframes from each page into one dataframe.
    total_dataframe = pd.concat(dataframe_list, ignore_index = True)
    
    # to_csv() writes all the scraped pages to a single CSV file.
    total_dataframe.to_csv('shows-data/Total-dataframe.csv', index= None)
    print(" \n a CSV file named Total-dataframe.csv with all the scraped shows has been created")


### Now that we are done with all the functions, let's put our final function to the test.

In [None]:
scrape_top_200_shows(base_link)

### We successfully created the function and all the CSV files we needed. Let's take a final look at our data with the help of pandas `.read_csv()`.

In [37]:
pd.read_csv('shows-data/Total-dataframe.csv')[0:50]

Unnamed: 0,Title,User_rating,Release_date,Current_season,Current_season_Episodes,Tagline,Genre,Cast
0,Chucky,80,"Oct 12, 2021",Season 1,10 Episodes,A classic coming of rage story.,"['Sci-Fi & Fantasy', 'Comedy', 'Crime']","['Brad Dourif', 'Zackary Arthur', 'Teo Briones..."
1,The Price Is Right,67.0,"Sep 04, 1972",Season 50,50 Episodes,No Tagline,[],"['Bob Barker', 'Johnny Olson', 'Drew Carey', '..."
2,Rachael Ray,53.0,"Sep 18, 2006",Season 16,45 Episodes,No Tagline,['Talk'],"['Rachael Ray', 'Ian Smith', 'Gretta Monahan',..."
3,Wheel of Fortune,71.0,"Sep 19, 1983",Season 39,50 Episodes,No Tagline,['Family'],"['Pat Sajak', 'Vanna White', 'Bob Goen', 'Chuc..."
4,Squid Game,78.0,"Sep 17, 2021",Season 1,9 Episodes,45.6 billion won is child's play.,"['Action & Adventure', 'Mystery', 'Drama']","['Lee Jung-jae', 'Park Hae-soo', 'Jung Ho-yeon..."
5,Days of Our Lives,63.0,"Nov 08, 1965",Season 57,45 Episodes,No Tagline,"['Soap', 'Drama']","['Deidre Hall', 'Bryan Dattilo', 'Alison Sween..."
6,Wer weiß denn sowas?,76.0,"Jul 06, 2015",Season 7,35 Episodes,No Tagline,"['Reality', 'Family']","['Kai Pflaume', 'Bernhard Hoëcker', 'Elton', '..."
7,Maradona: Blessed Dream,78.0,"Oct 29, 2021",Season 1,10 Episodes,No Tagline,"['Drama', 'Documentary']","['Julieta Cardinali', 'Juan Palomino', 'Merced..."
8,Alix and the Marvelous,Not rated yet,"Sep 09, 2019",Season 3,54 Episodes,No Tagline,"['Family', 'Sci-Fi & Fantasy', 'Drama']","['Rosalie Daoust', 'Jean-Philippe Lehoux', 'Al..."
9,Jeopardy!,70,"Sep 10, 1984",Season 38,50 Episodes,No Tagline,[],"['Alex Trebek', 'Ken Jennings', 'Brad Rutter',..."
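Each Genre and Cast value is a Python list when it is scraped, so it is written to the CSV file as the text form of that list, and reading the file back gives a plain string. If the lists are needed again, one option is `ast.literal_eval` from the standard library. The sketch below shows it on a single cell value typed in by hand; applying it to a whole column is left out.

```python
import ast

# A Genre cell exactly as it is stored in the CSV file: a string, not a list.
genre_cell = "['Action & Adventure', 'Mystery', 'Drama']"

# literal_eval safely evaluates the text of a Python literal back into an object.
genres = ast.literal_eval(genre_cell)

print(type(genre_cell).__name__, "->", type(genres).__name__)
print(genres[0])
```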


## Summary of what we did

- We scraped https://www.themoviedb.org/tv.
- We downloaded the web page using `requests`.
- We parsed the HTML source code using `BeautifulSoup`.
- We checked out the page that has the list of TV shows. For each show, we extracted the title, user score, the show's individual page URL, and the premiere date.
- From each individual page URL, we extracted more information about the show: Current_season, Current_season_Episodes, Tagline, Genre, and Cast.
- We compiled the extracted information into Python lists and dictionaries.
- We extracted and combined data from multiple pages.
- We saved the extracted information to a CSV file named Total-dataframe.csv.
