# Course 1 : Foundation of Information

**Project Part B** - Using web scraping to build a database of movie-related information from The Movie Database (TMDB)

**Problem statement:**

A common business requirement in information gathering is to extract and filter relevant data from the web pages that host it. However, when that information is spread over several web pages, potentially hosted on multiple websites, collecting it manually is cumbersome and does not scale. In this project, you will employ a programmatic approach to access, parse, and extract relevant information from a website of interest.

**Objective:**

The project's goal is to extract data (from a chosen number of pages) from The Movie Database website (https://www.themoviedb.org/) into a tabular format so that further analysis (e.g., of a movie's genre, cast, and user rating) becomes possible.

To execute this project, read the documentation links provided for each task in the assignment and adapt the documented code examples to the task at hand.

**Prerequisites:**

Tools: Jupyter Notebook or Google Colab or Microsoft Visual Studio IDE

Languages: Python, HTML

Libraries: requests, BeautifulSoup (bs4), pandas


**1. Establish a connection to the webpage - "https://www.themoviedb.org/movie" - and provide the following details.**

a. Import the requests library and formulate a GET request to download the contents of the webpage.

In [1]:
import requests

needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
response = requests.get("https://www.themoviedb.org/movie", headers=needed_headers)
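
A request can also be built and inspected before anything is sent, which makes it easy to confirm the final URL and headers. The following is a minimal sketch, not part of the assignment solution; the User-Agent string is shortened for illustration and the `page` query parameter is included only to show how `params` is encoded into the URL.

```python
import requests

# Shortened User-Agent for illustration; the full string is used the same way.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64)"}

# Build the request without sending it, so it can be inspected offline.
request = requests.Request(
    "GET",
    "https://www.themoviedb.org/movie",
    headers=headers,
    params={"page": 2},
)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])

# Sending it would look like this (commented out to avoid a network call);
# the timeout keeps a stalled connection from hanging the notebook.
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
```

Passing `timeout` is optional, but without it `requests` waits indefinitely for a server that never answers.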

b. Verify the status code of the request and confirm that the request was executed appropriately.

In [2]:
if response.status_code == 200:
    print("Request Executed \nStatus Code =", response.status_code)
else: 
    print("Request Error \n Status Code =", response.status_code)

Request Executed 
Status Code = 200
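
Status codes other than 200 and 404 will show up sooner or later, so it can help to translate any code into a readable label. The sketch below uses only the standard library; `describe_status` is a hypothetical helper name, and the category boundaries follow the usual 2xx/3xx/4xx/5xx grouping.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code not defined in the standard library enum
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With a real response object, `describe_status(response.status_code)` would produce the same kind of label.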


c. Print the contents of the page obtained from the response and save it in a variable.

In [3]:
content = response.content
print(content)

b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" content="#032541">\n<meta name="theme-color" content="#032541">\n<link rel="apple-touch-icon" sizes="180x180" href

d. Infer the type of the variable created in part 1c and 
   display the first 200 characters of the content from the server’s response.

In [4]:
print("Type of the variable - content :", type(content),"\n")
print("First 200 characters:\n", content[:200])

Type of the variable - content : <class 'bytes'> 

First 200 characters:
 b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n  '
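
Because `response.content` is a `bytes` object, slicing it returns bytes rather than characters, and HTML character references such as `&#8212;` are still literal text. The sketch below illustrates the difference on a short excerpt of the output above; it uses only the standard library and assumes the page is UTF-8 encoded, as its `<meta charset="utf-8">` tag states.

```python
import html

# A short excerpt of the bytes shown above, used as stand-in data.
raw = b"<title>Popular Movies &#8212; The Movie Database (TMDB)</title>"

decoded = raw.decode("utf-8")  # bytes -> str
print(type(raw).__name__, "->", type(decoded).__name__)

# Slicing bytes yields bytes; indexing a single position yields an int.
print(raw[:7], raw[0])

# Character references are resolved by html.unescape (an HTML parser
# performs the same step automatically while parsing).
print(html.unescape(decoded))
```

In `requests`, `response.text` performs the decoding step and returns a `str` directly.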



**2. Parse the content of HTML response using the BeautifulSoup library and execute the tasks specified in the guidelines mentioned below.**

a. From the BeautifulSoup library (bs4) import the BeautifulSoup class. 
   Pass the contents of the webpage obtained from step 1c as an argument to create an instance of the BeautifulSoup class.

In [5]:
from bs4 import BeautifulSoup

BeautifulSoup_Instance = BeautifulSoup(content,"html.parser")
print(BeautifulSoup_Instance)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Popular Movies — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta content="#032541" name="msapplication-TileColor"/>
<meta content="#032541" name="theme-color"/>
<link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b82adaa1085c79bdde2f00ca8787b63d

b. Extract the title of the parsed web page content using an appropriate method or attribute of the document object created in part 2a.

In [6]:
print(BeautifulSoup_Instance.title.text)

Popular Movies — The Movie Database (TMDB)
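
The same attribute-style access works for any tag name, and it returns `None` when the tag is missing. A minimal sketch on a hand-written HTML snippet (the snippet is illustrative, not the real page):

```python
from bs4 import BeautifulSoup

sample_html = """
<html>
  <head><title>Popular Movies &#8212; The Movie Database (TMDB)</title></head>
  <body><p>placeholder body</p></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.title                   # first <title> tag, or None if absent
print(title_tag.name)                    # the tag's name
print(title_tag.string)                  # the tag's single text child
print(title_tag.get_text(strip=True))    # text with surrounding whitespace removed

# Asking for a tag that does not exist returns None instead of raising.
print(soup.footer)
```

Checking for `None` before reading `.text` avoids an `AttributeError` on pages that lack the expected tag.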


c. Write a user-defined function that generalizes the task in Q2a to any URL: it should take a URL string as input and return a correctly constructed BeautifulSoup instance as output. In your function definition, ensure that appropriate exceptions are raised to the user (through status codes) if they pass in malformed/incorrect URLs. Write two test cases for your function: one with a working URL and another with a URL that gets a 404 response.

In [7]:
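def fetch_webpage(url):
    """Return a BeautifulSoup object for url, or an error message string on failure."""
    try:
        response = requests.get(url, headers=needed_headers)
        # Turn 4xx/5xx status codes into a requests.exceptions.HTTPError.
        response.raise_for_status()
        html_content = BeautifulSoup(response.text, "html.parser")
        return html_content

    except requests.exceptions.RequestException as e:
        # Covers HTTP errors as well as malformed URLs and connection failures.
        return f"Error: {e}"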

working_url = "https://www.themoviedb.org/movie"
malformed_url = "https://www.themoviedb.org/movieees"

In [8]:
# Testing working url.

print(fetch_webpage(working_url))

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Popular Movies — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta content="#032541" name="msapplication-TileColor"/>
<meta content="#032541" name="theme-color"/>
<link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b82adaa1085c79bdde2f00ca8787b63d

In [9]:
# Testing malformed url.

print(fetch_webpage(malformed_url))

Error: 404 Client Error: Not Found for url: https://www.themoviedb.org/movieees
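
The function above reports failures by returning a string, which means callers have to check the return type before using it. An alternative is to let the exception propagate. The following is a sketch of that variant under stated assumptions: `fetch_soup`, `StubHTTP`, and `outcome` are hypothetical names, the User-Agent is shortened, and the stub replaces the network so the example runs offline.

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened for illustration


def fetch_soup(url, http=requests, timeout=10):
    """Return a BeautifulSoup object for url, raising on bad input or status.

    `http` is any object with a requests-style get(); it defaults to the
    requests module and can be swapped for a stub in tests.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Malformed URL: {url!r}")
    response = http.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")


class StubHTTP:
    """Hypothetical offline stand-in that returns a canned response."""

    def __init__(self, status_code, body=b""):
        self.status_code = status_code
        self.body = body

    def get(self, url, **kwargs):
        response = requests.Response()
        response.status_code = self.status_code
        response.url = url
        response.encoding = "utf-8"
        response._content = self.body  # private attribute, set only for the stub
        return response


def outcome(url, http):
    """Run fetch_soup and report either the page title or the exception type."""
    try:
        return fetch_soup(url, http=http).title.text
    except (ValueError, requests.HTTPError) as error:
        return type(error).__name__


print(outcome("https://example.org/ok", StubHTTP(200, b"<title>ok</title>")))
print(outcome("https://example.org/missing", StubHTTP(404)))
print(outcome("not-a-url", StubHTTP(200)))
```

Raising keeps the return type consistent: the caller either gets a parsed page or an exception, never a string that looks like a page.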



**3. Extract the content of the webpage - https://www.themoviedb.org/movie - that hosts an up-to-date listing of popular movies.**

a. Call the user-defined function created in 2c with the URL https://www.themoviedb.org/movie as input and store the result in a variable.

In [10]:
url = "https://www.themoviedb.org/movie"
moviedb = fetch_webpage(url)
print(moviedb)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Popular Movies — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta content="#032541" name="msapplication-TileColor"/>
<meta content="#032541" name="theme-color"/>
<link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b82adaa1085c79bdde2f00ca8787b63d

b. Print the HTML content associated with the first movie displayed on the web page using appropriate HTML tags to access this listing on the object created in part 3a.

In [11]:
firstmovie_info = moviedb.find("div", class_="card style_1")
print(f"The HTML content associated with the first movie displayed on the {url}:\n{firstmovie_info}")

The HTML content associated with the first movie displayed on the https://www.themoviedb.org/movie:
<div class="card style_1">
<div class="image">
<div class="wrapper">
<a class="image" href="/movie/901362" title="Trolls Band Together">
<img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/sEaLO9s7CIN3fjz8R3Qksum44en.jpg" srcset="/t/p/w220_and_h330_face/sEaLO9s7CIN3fjz8R3Qksum44en.jpg 1x, /t/p/w440_and_h660_face/sEaLO9s7CIN3fjz8R3Qksum44en.jpg 2x"/>
</a>
</div>
<div class="options" data-id="901362" data-media-type="movie" data-object-id="619bea97c0ae360089136cff">
<a class="no_click" href="#"><div class="glyphicons_v2 circle-more white"></div></a>
</div>
</div>
<div class="content">
<div class="consensus tight">
<div class="outer_ring">
<div class="user_score_chart 619bea97c0ae360089136cff" data-bar-color="#21d07a" data-percent="71.27" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r71"></span>
</div>
</div>
</div>
</div>
<h2><a href="/movie/9

c. Display the name of the first movie using appropriate HTML tags to access this listing on the object created in part 3a.

In [12]:
firstmovie_name_title = firstmovie_info.select_one(".media a")
firstmovie_name = firstmovie_name_title["title"] if firstmovie_name_title else "Not available"

print("The name of the first movie :", firstmovie_name)

The name of the first movie : Trolls Band Together


d. Display the user rating of the first movie by using appropriate HTML tags to access this listing on the object created in part 3a.

In [13]:
rating_element = firstmovie_info.select_one(".media .user_score_chart")
firstmovie_rating = rating_element["data-percent"] if rating_element else "Not available"

print("The user rating of the first movie :", firstmovie_rating)

The user rating of the first movie : 71.27
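
Tag attributes behave like dictionary entries, which is what makes `["title"]` and `["data-percent"]` work. The sketch below replays this on an abbreviated copy of the card markup printed in part 3b; the `data-votes` attribute is hypothetical and is only there to show the `.get()` fallback.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the card markup printed in part 3b.
card_html = """
<div class="card style_1">
  <div class="image"><div class="wrapper">
    <a class="image" href="/movie/901362" title="Trolls Band Together"></a>
  </div></div>
  <div class="content"><div class="consensus tight"><div class="outer_ring">
    <div class="user_score_chart 619bea97c0ae360089136cff" data-percent="71.27"></div>
  </div></div></div>
</div>
"""

card = BeautifulSoup(card_html, "html.parser").find("div", class_="card")

link = card.find("a", class_="image")
print(link["title"])

chart = card.find("div", class_="user_score_chart")
print(chart["data-percent"])            # direct indexing raises KeyError if absent
print(chart.get("data-votes", "n/a"))   # .get returns a default instead
print(chart["class"])                   # class is multi-valued, so it is a list
print(float(chart["data-percent"]))     # attribute values are always strings
```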


e. For the first movie, extract the part of the url following the string “https://www.themoviedb.org/” using the appropriate HTML tags to extract this portion on the object created in part 3a (do not use built-in string methods).

For example, if the first movie on the web page had the URL "https://www.themoviedb.org/movie/779782", your output should be movie/779782.

In [14]:
base_url = "https://www.themoviedb.org/"

movie_url_element = firstmovie_info.select_one(".media a")
firstmovie_url = movie_url_element["href"] if movie_url_element else "Not available"

print(f"The part of the URL following the string {base_url}: {firstmovie_url}")

The part of the URL following the string https://www.themoviedb.org/: /movie/901362
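
The `href` value is a site-relative path, so it has to be combined with the base URL before it can be requested. One way to do this without manual string handling is the standard library's `urllib.parse`; the sketch below is illustrative and uses the `href` shown in the output above.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.themoviedb.org/"
href = "/movie/901362"  # value of the href attribute shown above

# urljoin resolves the relative path against the base URL.
full_url = urljoin(base_url, href)
print(full_url)

# urlparse splits a full URL back into its components.
path = urlparse(full_url).path
print(path)

# A relative href without the leading slash resolves to the same URL here,
# because the base URL ends with "/".
print(urljoin(base_url, "movie/901362") == full_url)
```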



**4. Write a user-defined function for each subsection below (i.e., Q4a, Q4b, Q4c, Q4d, and Q4e) that returns the following.**

a. Titles of all the movies on the page as a Python list.

In [15]:
def extract_titles(moviedb):
    titles_list = []
    movie_info = moviedb.find_all("div", class_="card style_1") 
    for movie_title in movie_info:
        title_element = movie_title.select_one(".media a")
        if title_element and "title" in title_element.attrs:
            titles_list.append(title_element["title"])
    return titles_list

display(extract_titles(moviedb))

['Trolls Band Together',
 'Oppenheimer',
 'The Creator',
 'Leo',
 "Five Nights at Freddy's",
 'Expend4bles',
 'Jawan',
 'Fast X',
 'Mission: Impossible - Dead Reckoning Part One',
 'Napoleon',
 'The Mercenary',
 'Believer 2',
 'The Equalizer 3',
 'The Hunger Games: The Ballad of Songbirds & Snakes',
 'The Survivor',
 'Meg 2: The Trench',
 'Blue Beetle',
 'The Marvels',
 'Dragon Ball: Mystical Adventure',
 'The Super Mario Bros. Movie']
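
The loop-and-check pattern above can also be written as a single CSS selector that only matches anchors carrying a `title` attribute. This is a sketch on hand-written markup: the `a.image` structure follows the card printed in part 3b, while the second and third cards and their `/movie/...` paths are placeholders.

```python
from bs4 import BeautifulSoup

# Three simplified cards; the last one has no title attribute on purpose.
listing_html = """
<div class="card style_1"><a class="image" href="/movie/901362" title="Trolls Band Together"></a></div>
<div class="card style_1"><a class="image" href="/movie/placeholder-2" title="Oppenheimer"></a></div>
<div class="card style_1"><a class="image" href="/movie/placeholder-3"></a></div>
"""

soup = BeautifulSoup(listing_html, "html.parser")

# "a.image[title]" only matches anchors that actually carry a title attribute,
# so the third card is skipped instead of raising a KeyError.
titles = [a["title"] for a in soup.select("div.card.style_1 a.image[title]")]
print(titles)
```

One trade-off: skipping cards without a title shortens the list, so it may no longer line up one-to-one with lists built from the same cards.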

b. User ratings of all the movies on the page as a Python list.

In [16]:
def extract_ratings(moviedb):
    ratings_list = []
    movie_info = moviedb.find_all("div", class_="card style_1")
    
    for card in movie_info:
        user_score_chart = card.find(class_="user_score_chart")
        
        if user_score_chart and "data-percent" in user_score_chart.attrs:
            data_percent = float(user_score_chart["data-percent"])
            
            if data_percent != 0:
                formatted_rating = "{:.2f}".format(data_percent)
                ratings_list.append(formatted_rating)
            else:
                ratings_list.append("Ratings not available")
        else:
            ratings_list.append("Ratings not available")
    
    return ratings_list

display(extract_ratings(moviedb))

['71.27',
 '81.56',
 '71.49',
 '79.00',
 '78.57',
 '64.32',
 '71.46',
 '72.03',
 '75.86',
 '64.86',
 '62.50',
 '71.22',
 '74.18',
 '72.93',
 '72.15',
 '67.42',
 '69.41',
 '65.69',
 '68.03',
 '77.48']
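
Storing ratings as formatted strings makes later arithmetic awkward. A possible alternative, sketched below with the hypothetical helper `parse_rating`, is to keep numeric values and use `None` for missing ratings. It treats a value of 0 as "not rated", mirroring the function above.

```python
def parse_rating(raw):
    """Convert a data-percent string to a float, or None when unavailable."""
    if raw is None:
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    # A value of 0 is treated as "not rated yet".
    return round(value, 2) if value != 0 else None


samples = ["71.27", "79.0", "0", None, "n/a"]
ratings = [parse_rating(sample) for sample in samples]
print(ratings)

# Numeric values can be aggregated directly once the gaps are filtered out.
rated = [rating for rating in ratings if rating is not None]
print(sum(rated) / len(rated))
```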

c. HTML content of all the individual pages of movies collected into a Python list.

In [17]:
def html_content_for_movies(moviedb):
    html = []
    movie_info = moviedb.find_all("div", class_="card style_1")
    
    for movie in movie_info:
        link = "https://www.themoviedb.org" + movie.select_one(".media a")["href"]
        html_content = fetch_webpage(link)
        html.append(html_content)
    
    return html

html_list = html_content_for_movies(moviedb)
print(html_list)

[<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Trolls Band Together (2023) — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="When Branch’s brother, Floyd, is kidnapped for his musical talents by a pair of nefarious pop-star villains, Branch and Poppy embark on a harrowing and emotional journey to reunite the other brothers and rescue Floyd from a fate even worse than pop-culture obscurity." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta co
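
Fetching one page per movie means many requests in quick succession, and a single failure would otherwise abort the whole loop. The sketch below shows one way to pause between requests and keep going after an error. `fetch_all` and `fake_fetch` are hypothetical names; the real download function would be passed in place of `fake_fetch`, and the one-second default delay is an arbitrary choice.

```python
import time


def fetch_all(links, fetch, delay_seconds=1.0):
    """Fetch each link in order, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page; failures
    are recorded as None so the result stays aligned with `links`.
    """
    pages = []
    for position, link in enumerate(links):
        if position > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        try:
            pages.append(fetch(link))
        except Exception as error:  # broad on purpose: keep going on any failure
            print(f"Skipping {link}: {error}")
            pages.append(None)
    return pages


def fake_fetch(url):
    """Hypothetical stand-in for a real download, so the sketch runs offline."""
    if url.endswith("/broken"):
        raise RuntimeError("simulated failure")
    return f"<html>{url}</html>"


links = [
    "https://www.themoviedb.org/movie/1",
    "https://www.themoviedb.org/movie/broken",
]
pages = fetch_all(links, fake_fetch, delay_seconds=0)
print(pages)
```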

d. Genres of all the movies on the page as a Python list.

In [18]:
def genres_from_html_list(html_list):
    genres_list = []
    for html_content in html_list:
        genres_text = html_content.find("span", "genres").text.replace("\xa0", "").strip()
        genres_list.append(genres_text)
    return genres_list

display(genres_from_html_list(html_list))

['Animation,Family,Music,Fantasy,Comedy',
 'Drama,History',
 'Science Fiction,Action,Thriller',
 'Animation,Comedy,Family',
 'Horror,Mystery',
 'Action,Adventure,Thriller',
 'Action,Adventure,Thriller',
 'Action,Crime,Thriller',
 'Action,Thriller',
 'Drama,History,War',
 'Action',
 'Crime,Action,Thriller',
 'Action,Thriller,Crime',
 'Action,Romance,Drama',
 'History,Action,Drama',
 'Action,Science Fiction,Horror',
 'Action,Science Fiction,Adventure',
 'Science Fiction,Adventure,Action',
 'Action,Animation',
 'Animation,Family,Adventure,Fantasy,Comedy']
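
If each genre is needed as a separate value, the links inside the span can be read one by one instead of cleaning up the combined text. The markup below is an assumption inferred from the `\xa0` clean-up in the function above (one `<a>` per genre, separated by a comma and a non-breaking space); the `href` values are placeholders.

```python
from bs4 import BeautifulSoup

# Assumed markup: each genre is a link inside <span class="genres">.
detail_html = (
    '<span class="genres">'
    '<a href="#">Drama</a>,&nbsp;<a href="#">History</a>'
    "</span>"
)

span = BeautifulSoup(detail_html, "html.parser").find("span", class_="genres")

# One list entry per <a> tag, so no string clean-up is needed; a missing
# span yields an empty list instead of an AttributeError.
genres = [a.get_text(strip=True) for a in span.find_all("a")] if span is not None else []
print(genres)

# The joined form matches the comma-separated strings shown above.
print(",".join(genres))
```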

e. Cast of all the movies on the page as a Python list.

In [19]:
def cast_from_html_list(html_list):
    cast_list = []
    for html_content in html_list:
        # Flatten the cast scroller to text and drop the blank lines between cards.
        raw_cast = html_content.find("ol", "people scroller").text.strip().replace("\n\n\n\n\n", "")
        cast_items = [item.strip() for item in raw_cast.split("\n") if item.strip()]
        cast_dict = {}
        # Lines alternate actor, character; a trailing unpaired line is ignored.
        for i in range(0, len(cast_items)-1, 2):
            cast_dict[cast_items[i]] = cast_items[i+1]
        cast_list.append(cast_dict)
    return cast_list

display(cast_from_html_list(html_list))

[{'Anna Kendrick': 'Poppy (voice)',
  'Justin Timberlake': 'Branch (voice)',
  'Camila Cabello': 'Viva (voice)',
  'Eric André': 'John Dory (voice)',
  'Amy Schumer': 'Velvet (voice)',
  'Andrew Rannells': 'Veneer (voice)',
  'Daveed Diggs': 'Spruce (voice)',
  'Troye Sivan': 'Floyd (voice)',
  'Kid Cudi': 'Clay (voice)'},
 {'Cillian Murphy': 'J. Robert Oppenheimer',
  'Emily Blunt': 'Kitty Oppenheimer',
  'Matt Damon': 'Leslie Groves',
  'Robert Downey Jr.': 'Lewis Strauss',
  'Florence Pugh': 'Jean Tatlock',
  'Josh Hartnett': 'Ernest Lawrence',
  'Casey Affleck': 'Boris Pash',
  'Rami Malek': 'David Hill',
  'Kenneth Branagh': 'Niels Bohr'},
 {'John David Washington': 'Joshua',
  'Madeleine Yuna Voyles': 'Alphie',
  'Gemma Chan': 'Maya',
  'Allison Janney': 'Colonel Howell',
  'Ken Watanabe': 'Harun',
  'Sturgill Simpson': 'Drew',
  'Amar Chadha-Patel': 'Omni / SEK-ON / Sergeant Bui',
  'Marc Menchaca': 'McBride',
  'Robbie Tann': 'Shipley'},
 {'Adam Sandler': 'Leo (voice)',
  'Bill
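
Pairing alternating text lines works as long as every card has exactly one name and one character line. A sketch of a more structural approach is to read each list item on its own. The markup here is an assumption (name in the first `<p>`, role in `<p class="character">`, plus a final item without a role); the real page should be inspected before relying on these selectors, and `parse_cast` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Assumed markup for the cast scroller; href values are placeholders.
cast_html = """
<ol class="people scroller">
  <li class="card"><p><a href="#">Cillian Murphy</a></p><p class="character">J. Robert Oppenheimer</p></li>
  <li class="card"><p><a href="#">Emily Blunt</a></p><p class="character">Kitty Oppenheimer</p></li>
  <li class="filler view_more"><p><a href="#">View More</a></p></li>
</ol>
"""


def parse_cast(soup):
    """Return {actor: character} for every list item that has both parts."""
    cast = {}
    for item in soup.select("ol.people li"):
        name_tag = item.select_one("p a")
        role_tag = item.select_one("p.character")
        if name_tag is not None and role_tag is not None:
            cast[name_tag.get_text(strip=True)] = role_tag.get_text(strip=True)
    return cast


cast = parse_cast(BeautifulSoup(cast_html, "html.parser"))
print(cast)
```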


**5. Write a user-defined function that returns a pandas DataFrame with the following data:**

a. Titles of the movies listed on the page.

b. User ratings of the movies listed on the page.

c. Genres of the movies listed on the page.

d. Cast of the movies listed on the page.

In [20]:
import pandas as pd

def create_movie_dataframe(moviedb, html_list):
    titles_var = extract_titles(moviedb)
    ratings_var = extract_ratings(moviedb)
    genres_var = genres_from_html_list(html_list)
    cast_var = cast_from_html_list(html_list)
    
    all_elements = {"Titles": titles_var, "User Ratings": ratings_var, "Genres": genres_var, "Casts": cast_var}
    df = pd.DataFrame(all_elements)
    return df

display(create_movie_dataframe(moviedb, html_list))

|   | Titles | User Ratings | Genres | Casts |
|---|---|---|---|---|
| 0 | Trolls Band Together | 71.27 | Animation,Family,Music,Fantasy,Comedy | {'Anna Kendrick': 'Poppy (voice)', 'Justin Tim... |
| 1 | Oppenheimer | 81.56 | Drama,History | {'Cillian Murphy': 'J. Robert Oppenheimer', 'E... |
| 2 | The Creator | 71.49 | Science Fiction,Action,Thriller | {'John David Washington': 'Joshua', 'Madeleine... |
| 3 | Leo | 79.00 | Animation,Comedy,Family | {'Adam Sandler': 'Leo (voice)', 'Bill Burr': '... |
| 4 | Five Nights at Freddy's | 78.57 | Horror,Mystery | {'Josh Hutcherson': 'Mike', 'Piper Rubio': 'Ab... |
| 5 | Expend4bles | 64.32 | Action,Adventure,Thriller | {'Sylvester Stallone': 'Barney Ross', 'Jason S... |
| 6 | Jawan | 71.46 | Action,Adventure,Thriller | {'Shah Rukh Khan': 'Vikram Rathore / Azaad Rat... |
| 7 | Fast X | 72.03 | Action,Crime,Thriller | {'Vin Diesel': 'Dominic Toretto', 'Michelle Ro... |
| 8 | Mission: Impossible - Dead Reckoning Part One | 75.86 | Action,Thriller | {'Tom Cruise': 'Ethan Hunt', 'Hayley Atwell': ... |
| 9 | Napoleon | 64.86 | Drama,History,War | {'Joaquin Phoenix': 'Napoleon Bonaparte', 'Van... |
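
The "User Ratings" column holds strings, so it cannot be averaged or sorted numerically as it stands. A minimal sketch of converting it with `pd.to_numeric` follows; the small frame is hand-made, and "Placeholder Movie" is a made-up row standing in for a movie without ratings.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Titles": ["Trolls Band Together", "Oppenheimer", "Placeholder Movie"],
        "User Ratings": ["71.27", "81.56", "Ratings not available"],
    }
)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
df["User Ratings"] = pd.to_numeric(df["User Ratings"], errors="coerce")

print(df.dtypes)
print(df["User Ratings"].mean())  # NaN values are skipped by default
print(df.sort_values("User Ratings", ascending=False).head(1))
```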



**6. Scraping the data and combining the dataframes.**

(a) Write a function that scrapes the data listed in Q5 from pages 1 through 5 of https://www.themoviedb.org/movie by calling the functions defined in Q3a, Q4c, and Q5, and returns five DataFrames that can be exported to CSV files.

In [21]:
import os

def create_and_save_dataframes(pages):
    # Build the list locally so that repeated calls do not accumulate duplicates.
    df_list = []
    for page_no in range(1, pages + 1):
        webpages = fetch_webpage(f"https://www.themoviedb.org/movie?page={page_no}")
        data_frame = create_movie_dataframe(webpages, html_content_for_movies(webpages))
        df_list.append(data_frame)
        file_location = os.path.join("TMDB_Database", f"TMDB_page_{page_no}.csv")
        data_frame.to_csv(file_location, index=False)
    return df_list

os.makedirs("TMDB_Database", exist_ok=True)

num_pages = 5
df_list = create_and_save_dataframes(num_pages)

print("Created and saved dataframes to CSV files in TMDB_Database folder.")

Created and saved dataframes to CSV files in TMDB_Database folder.
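
One consequence of writing the frames to CSV is that the cast dictionaries are stored as text and come back as plain strings. The sketch below shows the round trip on a one-row frame and one possible way to restore the dictionaries with `ast.literal_eval`; an in-memory buffer stands in for a file on disk.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame(
    {
        "Titles": ["Oppenheimer"],
        "Casts": [{"Cillian Murphy": "J. Robert Oppenheimer"}],
    }
)

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read back without any conversion: the dictionary is now a string.
as_text = pd.read_csv(buffer)
print(type(as_text.loc[0, "Casts"]).__name__)

# Read back with a converter that safely evaluates the Python literal.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"Casts": ast.literal_eval})
print(type(restored.loc[0, "Casts"]).__name__)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text read from a file.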


(b) Combine the DataFrames obtained in Q6(a) into a single DataFrame.

In [24]:
import pandas as pd

def concatenate_dataframes(dataframes):
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating 0..19 per page.
    return pd.concat(dataframes, ignore_index=True)

display(concatenate_dataframes(df_list))

|   | Titles | User Ratings | Genres | Casts |
|---|---|---|---|---|
| 0 | Trolls Band Together | 71.27 | Animation,Family,Music,Fantasy,Comedy | {'Anna Kendrick': 'Poppy (voice)', 'Justin Tim... |
| 1 | Oppenheimer | 81.56 | Drama,History | {'Cillian Murphy': 'J. Robert Oppenheimer', 'E... |
| 2 | The Creator | 71.49 | Science Fiction,Action,Thriller | {'John David Washington': 'Joshua', 'Madeleine... |
| 3 | Leo | 79.00 | Animation,Comedy,Family | {'Adam Sandler': 'Leo (voice)', 'Bill Burr': '... |
| 4 | Five Nights at Freddy's | 78.57 | Horror,Mystery | {'Josh Hutcherson': 'Mike', 'Piper Rubio': 'Ab... |
| ... | ... | ... | ... | ... |
| 95 | The Batman | 77.01 | Crime,Mystery,Thriller | {'Robert Pattinson': 'Bruce Wayne / The Batman... |
| 96 | Genie | 78.24 | Fantasy,Comedy | {'Melissa McCarthy': 'Flora', 'Paapa Essiedu':... |
| 97 | Interstellar | 84.21 | Adventure,Drama,Science Fiction | {'Matthew McConaughey': 'Joseph "Coop" Cooper'... |
| 98 | When Evil Lurks | 73.00 | Horror,Thriller | {'Ezequiel Rodríguez': 'Pedro', 'Demián Salomó... |
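
Because the popularity ranking can shift between requests, the same movie may in principle appear on two consecutive pages. The sketch below shows how such repeats could be removed after concatenation; the two small frames are hand-made and the overlap between them is contrived for illustration.

```python
import pandas as pd

page_1 = pd.DataFrame({"Titles": ["Oppenheimer", "Leo"], "User Ratings": ["81.56", "79.00"]})
page_2 = pd.DataFrame({"Titles": ["Leo", "Napoleon"], "User Ratings": ["79.00", "64.86"]})

# ignore_index=True gives the combined frame one continuous index.
combined = pd.concat([page_1, page_2], ignore_index=True)
print(combined.index.tolist())

# drop_duplicates keeps the first occurrence of each title by default.
unique = combined.drop_duplicates(subset="Titles").reset_index(drop=True)
print(unique["Titles"].tolist())
```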
