**Course1: Foundation of Information**

**Part – B [Marks 35]**

**Project-1:**  
Using web scraping to build a database of movie-related
information from The Movie Database (TMDB)

**Problem statement:**  
A common business requirement in the context of information gathering is to extract and filter relevant data from web pages that host this information. However, access to information spread over several web pages, hosted potentially on multiple websites is a cumbersome process and we cannot rely on manual procedures to execute this task. In this project, you will employ a programmatic approach to access, parse and extract relevant information from a website of interest.

**Objective:**  
The project's goal is to extract data (from a chosen number of pages) from The Movie Database website (https://www.themoviedb.org/) into a tabular data format so that further analysis (e.g., details about a movie's genre, cast, and user rating) can be facilitated. To execute this project, you will have to read the documentation links provided against each task in the assignment and adapt the code examples provided in the documentation to the task at hand.

**Pre-requisites :**  
*Tools:* Jupyter Notebook, Google Colab, or Microsoft Visual Studio IDE  
*Languages:* Python, HTML  
*Libraries:* requests, BeautifulSoup (bs4), pandas




**Task Questions (35 Marks)**

**1. Establish a connection to the webpage "https://www.themoviedb.org/movie" and provide the following details (4 marks)**

> a. Import the requests library (https://requests.readthedocs.io/en/latest/ ) and formulate a get request to download the contents of the webpage ( "https://www.themoviedb.org/movie" ) ( 1 mark )




In [33]:
import requests
from bs4 import BeautifulSoup

# Headers for the GET request (a browser-like User-Agent string)
needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}

# Make a GET request to the server
response = requests.get("https://www.themoviedb.org/movie", headers=needed_headers)

> b. Verify the status code of the request and confirm that the request was executed appropriately (https://requests.readthedocs.io/en/latest/user/quickstart/#responsestatus-code) ( 1 mark )



In [34]:
# Verify status code
print(response.status_code)

# Confirmation of Execution
if response.status_code == 200:
  print(f"Request is executed appropriately with a status code of '{response.status_code}'")

200
Request is executed appropriately with a status code of '200'
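
The check above only covers the 200 case. As a minimal sketch of how other status codes could be reported, the helper below uses the standard library's `http.HTTPStatus` to name a code and classify it by range. The function name `describe_status` is hypothetical and not part of the assignment. In requests itself, `response.ok` and `response.raise_for_status()` cover similar ground.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short summary for an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about
        phrase = "Unknown"

    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

With a real response, `describe_status(response.status_code)` would give the same kind of summary.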


> c. Print the contents of the page obtained from the response and save it in a variable (https://requests.readthedocs.io/en/latest/user/quickstart/#response-content) ( 1 mark )



In [35]:
# Print the contents of the page obtained from the response
print(response.text)

# Save page content to variable
page_content = response.text

<!DOCTYPE html>
<html lang="en" class="no-js">
  <head>
    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>
    <meta http-equiv="cleartype" content="on">
    <meta charset="utf-8">
    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">
    <meta name="mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="viewport" content="width=device-width,initial-scale=1">
      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">
    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">
<meta name="msapplication-TileColor" content="#032541">
<meta name="theme-color" content="#032541">
<link rel="apple-touch-icon" sizes="180x180" href="/assets/2/appl



> d. Infer the type of the variable created in part 1c and display the first 200 characters of the content from the server’s response ( 1 Mark )



In [38]:
# Display variable type
print(type(page_content))

# Display first 200 characters of the content
print(page_content[:200])

<class 'str'>
<!DOCTYPE html>
<html lang="en" class="no-js">
  <head>
    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>
    <meta http-equiv="cleartype" content="on">
    <meta charset="utf-8">
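
The type is `str` because `response.text` is the decoded body, whereas `response.content` holds the raw bytes. The sketch below shows the same distinction with plain Python objects. The sample markup is an assumption that stands in for a server response.

```python
# Stand-in for the raw bytes a server would send (assumed sample markup)
raw_bytes = '<!DOCTYPE html>\n<html lang="en">'.encode("utf-8")

# Decoding the bytes gives a str, which supports slicing by character
decoded = raw_bytes.decode("utf-8")

print(type(raw_bytes))
print(type(decoded))
print(decoded[:15])
```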
  


**2. Parse the content of the HTML response using the BeautifulSoup library and execute the tasks specified in the guidelines mentioned below (6 marks)**

> a. From the BeautifulSoup library (bs4) import the BeautifulSoup class. Pass the contents of the webpage obtained from step 1c as an argument to create an instance of the BeautifulSoup class ( 2 Marks )




In [39]:
# From the BeautifulSoup library (bs4) import the BeautifulSoup class
from bs4 import BeautifulSoup

# Create an instance of the BeautifulSoup class by passing the contents of the webpage obtained from step 1c
BS_Obj = BeautifulSoup(page_content, 'html.parser')



> b. Extract the title of the parsed web page content using an appropriate method or
attribute of the document object created in part 2a
(https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tagnames) ( 1 Mark )




In [40]:
# Extract title of parsed web page content using an attribute of the document object created in part 2a
BS_Obj.title.text

'Popular Movies — The Movie Database (TMDB)'



> c. Write a user-defined function that generalizes the task in Q2a to any URL: it should
retrieve the content of the webpage, taking a URL string as input and returning a
correctly formulated BeautifulSoup instance as output. In your function definition,
ensure that appropriate exceptions are raised to the user (through status codes) if they
pass in malformed/incorrect URLs. Write two test cases for your function: one with a
working URL and another with a URL that gets a 404 response. (3 marks)



In [41]:
# User-defined function that returns a BeautifulSoup instance for a URL
def retrieve_content(url):

  try:
    # Send a GET request to the desired URL (inside the try block so that
    # connection errors and malformed URLs are caught as well)
    response = requests.get(url, headers=needed_headers)

    # Raise an HTTPError if the status code signals a failed request
    response.raise_for_status()

    # Correctly formulated BeautifulSoup instance
    BS_Obj = BeautifulSoup(response.text, 'html.parser')
    return BS_Obj
  except requests.exceptions.RequestException as e:
    # Report the error; the function then returns None
    print(f"TEST CASE-2 : Error fetching content ({str(e)})")

# Test case 1: Working URL
url1 = "https://www.example-1.com"
soup = retrieve_content(url1)
print(f"TEST CASE-1 : Response of working url - {soup.title}")

# Test case 2: URL with 404 response
url2 = "https://www.example.com/nonexistent-page"
soup = retrieve_content(url2)
print(f"TEST CASE-2 : Response of invalid url - {soup}")





TEST CASE-1 : Response of working url - <title>Example-1 IT and Business Blog</title>
TEST CASE-2 : Error fetching content (404 Client Error: Not Found for url: https://www.example.com/nonexistent-page)
TEST CASE-2 : Response of invalid url - None
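
The task also mentions malformed URLs, which can be rejected before any request is sent. The sketch below is one possible pre-check built on the standard library's `urllib.parse`. The helper name `validate_url` and its error messages are assumptions, not part of the solution above.

```python
from urllib.parse import urlparse


def validate_url(url):
    """Return the URL unchanged if it looks well-formed, else raise an error."""
    if not isinstance(url, str):
        raise TypeError(f"URL must be a string, got {type(url).__name__}")

    parts = urlparse(url)

    # Only http and https URLs make sense for fetching a web page
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported or missing scheme in URL: {url!r}")

    # A URL without a host cannot be requested
    if not parts.netloc:
        raise ValueError(f"Missing host in URL: {url!r}")

    return url


print(validate_url("https://www.themoviedb.org/movie"))

try:
    validate_url("www.themoviedb.org/movie")
except ValueError as error:
    print(f"Rejected: {error}")
```

Calling such a check at the top of `retrieve_content` would surface malformed input as an exception, while HTTP-level failures such as 404 would still be handled by `raise_for_status()`.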


**3. Extract the content of the webpage - https://www.themoviedb.org/movie - that hosts a current dated listing of popular movies. ( 5 Marks )**



> a. Write a function call to the user defined function created in 2c with the url https://www.themoviedb.org/movie as an input and store the response in a variable ( 1 mark )



In [42]:
url = "https://www.themoviedb.org/movie"

# Make call to function with url and store the response
BS_response = retrieve_content(url)



> b. Print the HTML content associated with the first movie displayed on the web page using appropriate HTML tags to access this listing on the object created in part 3a ( 1 mark )




In [43]:
# Access HTML content associated with first movie
first_movie = BS_response.find('div', class_= 'card style_1')

# Print Accessed content
print(first_movie.prettify())

<div class="card style_1">
 <div class="image">
  <div class="wrapper">
   <a class="image" href="/movie/901362" title="Trolls Band Together">
    <img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/qV4fdXXUm5xNlEJ2jw7af3XxuQB.jpg" srcset="/t/p/w220_and_h330_face/qV4fdXXUm5xNlEJ2jw7af3XxuQB.jpg 1x, /t/p/w440_and_h660_face/qV4fdXXUm5xNlEJ2jw7af3XxuQB.jpg 2x"/>
   </a>
  </div>
  <div class="options" data-id="901362" data-media-type="movie" data-object-id="619bea97c0ae360089136cff">
   <a class="no_click" href="#">
    <div class="glyphicons_v2 circle-more white">
    </div>
   </a>
  </div>
 </div>
 <div class="content">
  <div class="consensus tight">
   <div class="outer_ring">
    <div class="user_score_chart 619bea97c0ae360089136cff" data-bar-color="#21d07a" data-percent="71.0" data-track-color="#204529">
     <div class="percent">
      <span class="icon icon-r71">
      </span>
     </div>
    </div>
   </div>
  </div>
  <h2>
   <a href="/movie/901362" title="Tr



> c. Display the name of the first movie using appropriate HTML tags to access this listing on the object created in part 3a (1 mark)



In [44]:
# Access name of first movie
name = BS_response.find('div', class_= 'card style_1').h2.a['title']

# Print accessed name
print(name)

Trolls Band Together




> d. Display the user rating of the first movie by using appropriate HTML tags to access this listing on the object created in part 3a (1 mark )



In [45]:
# Access user rating of the first movie
rating = BS_response.find('div', class_= 'user_score_chart')['data-percent']

# Print Accessed user rating
print(rating)

71.0




> e. For the first movie, extract the part of the URL following the string
"https://www.themoviedb.org/" using the appropriate HTML tags to extract this portion on the object created in part 3a (do not use built-in string methods). (1 mark)
For example, if the first movie on the web page had the URL
"https://www.themoviedb.org/movie/779782", your output should be movie/779782



In [46]:
# Extract part of url
movie_url = BS_response.find('div', class_= 'wrapper').a['href']

# Print extracted url
print(movie_url)


/movie/901362
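
The extracted `href` is a site-relative path, so it has to be combined with the site root before it can be requested. Plain string concatenation of a base URL ending in `/` with an href starting with `/` yields a double slash. As a small sketch, `urllib.parse.urljoin` from the standard library resolves the path against the base instead. The variable names here are illustrative.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.themoviedb.org/"
href = "/movie/901362"  # value extracted from the first movie card above

# Resolve the relative href against the site root
full_url = urljoin(base_url, href)
print(full_url)

# The path component can be read back without manual string slicing
print(urlparse(full_url).path)
```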


**4. Write user-defined functions for each subsection below (i.e., Q4a, Q4b, Q4c, Q4d, and Q4e) to
return (10 marks)**

**Note :**  
A ) Input to the user defined functions :
*   For Q4a, Q4b, Q4c: the response object created in Q3a
*   For Q4d, Q4e: the list output from Q4c

B ) Note that some movies might not have a user rating. Your function should be able to parse
the numeric value of rating when it exists. When the rating is not available, the value “not rated” should
be appended to the list created in Q4b



> a. Titles of all the movies on the page as a Python list (2 marks )




In [60]:
# Function to extract all the titles of movies on the page
def movie_titles(soup_obj):
  title_list = []

  # Extract title containing HTML part of all movies
  movie_list = soup_obj.find_all('div', class_='card style_1')

  # Loop to iterate over movie list
  for movie in movie_list:

    # Make a list of extracted title from each movie
    title_list.append(movie.h2.a['title'])

  # Return the list
  return title_list


# Check the titles returned by movie_titles for the response object created in Q3a (BS_response)
print(movie_titles(BS_response))

['Trolls Band Together', 'Leo', 'Freelance', 'Oppenheimer', 'Wonka', 'The Bad Guys: A Very Bad Holiday', "Five Nights at Freddy's", 'Reign of Chaos', 'The Creator', 'Come Out Fighting', 'Fast X', 'Expend4bles', 'Mission: Impossible - Dead Reckoning Part One', 'Inferno', 'Godzilla Minus One', 'Mousa', 'Candy Cane Lane', 'Killers of the Flower Moon', 'The Immortal Wars: Rebirth', 'Agent Jade Black']




> b. User ratings of all the movies on the page as a Python list (2 marks )



In [62]:
# Function to extract all the ratings of movies on the page
def user_ratings(soup_obj):
  rating_list = []

  # Extract rating containing HTML part of all movies
  movie_list = soup_obj.find_all('div', class_='user_score_chart')

  # Loop to iterate over movie list
  for movie in movie_list:

    # Append the rating when the attribute has a value, otherwise "not rated"
    percent = movie.get('data-percent')
    if percent:
      rating_list.append(percent)
    else:
      rating_list.append("not rated")

  # Return the list
  return rating_list

# Check the ratings returned by user_ratings for the response object created in Q3a (BS_response)
print(user_ratings(BS_response))

['71.0', '75.03', '64.14', '81.44', '67.0', '58.75', '78.4', '52.0', '71.23', '47.0', '71.98', '64.32000000000001', '75.92', '61.0', '84.0', '66.0', '63.3', '77.0', '35.0', '13.0']
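
The ratings come back as strings, and some carry floating-point noise such as `'64.32000000000001'`. A possible clean-up step, sketched below under the assumption that two decimal places are enough, converts each value to a rounded float and leaves the "not rated" marker untouched. The helper name `normalize_rating` is hypothetical.

```python
def normalize_rating(value, missing="not rated"):
    """Convert a rating string to a float rounded to 2 decimals.

    Values that cannot be parsed as numbers (including None) are
    replaced by the `missing` marker.
    """
    try:
        return round(float(value), 2)
    except (TypeError, ValueError):
        return missing


raw_ratings = ['71.0', '64.32000000000001', 'not rated', None]
cleaned = [normalize_rating(rating) for rating in raw_ratings]
print(cleaned)
```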




> c. HTML content of all the individual pages of movies collected into a Python list (2 marks)




In [64]:
# Function to extract HTML content of all the individual pages of movies on page
def HTML_Content(soup_obj):
  content_list = []
  # Base URL without a trailing slash, because each href already starts with '/'
  main_url = "https://www.themoviedb.org"

  # Extract href containing HTML part of all movies
  movie_list = soup_obj.find_all('div', class_= 'wrapper')

  # Loop to iterate over movie list
  for movie in movie_list:

    # Extract href from each part
    end_url = movie.a['href']

    # Get the response of individual movie page
    content_response = requests.get(main_url + end_url, headers=needed_headers)

    # Store the response in list
    content_list.append(content_response.text)

  # Return the list
  return content_list

# Check the HTML content returned by HTML_Content for the response object created in Q3a (BS_response)
print(HTML_Content(BS_response))

['<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Trolls Band Together (2023) &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="When Branch&#39;s brother, Floyd, is kidnapped for his musical talents by a pair of nefarious pop-star villains, Branch and Poppy embark on a harrowing and emotional journey to reunite the other brothers and rescue Floyd from a fate even worse than pop-culture obscurity.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb7



> d. Genres of all the movies on the page as a Python list ( 2 marks )



In [67]:
# Get contents of individual movie pages by calling HTML_Content function
contents = HTML_Content(BS_response)

# Function to extract genres of all movies from their individual page
def movie_genres(contents):
  final_genre = []

  # Iterate over contents
  for content in contents:
    individual_genre = []

    # Create a BeautifulSoup object for each movie page content
    content_obj = BeautifulSoup(content, 'html.parser')

    # Extract genre from content
    genres = content_obj.find('span', class_='genres').find_all('a')

    # Add all extracted genre of each movie to list
    for genre in genres:
      individual_genre.append(genre.text)

    # Add all lists of genre for each movie content to final genre list
    final_genre.append(individual_genre)

  # Return the list
  return final_genre


# Check the genres returned by movie_genres for the list output from Q4c (contents)
print(movie_genres(contents))

[['Animation', 'Family', 'Music', 'Fantasy', 'Comedy'], ['Animation', 'Comedy', 'Family'], ['Action', 'Comedy'], ['Drama', 'History'], ['Comedy', 'Family', 'Fantasy'], ['Animation', 'Comedy', 'Family'], ['Horror', 'Mystery'], ['Action', 'Horror', 'Fantasy'], ['Science Fiction', 'Action', 'Thriller'], ['War', 'Drama', 'Action'], ['Action', 'Crime', 'Thriller'], ['Action', 'Adventure', 'Thriller'], ['Action', 'Thriller'], ['Action', 'Drama', 'Romance'], ['Drama', 'Science Fiction', 'Action', 'Horror'], ['Action', 'Adventure', 'Science Fiction'], ['Comedy', 'Fantasy', 'Family'], ['Crime', 'Drama', 'History'], ['Science Fiction'], ['Action']]
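
To make the extraction logic easier to test without fetching pages, the sketch below reproduces the idea on a tiny hand-written snippet. It deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it is not the method used above, and the sample markup is an assumption modelled on the `span` with class `genres` targeted by `movie_genres`. It also returns an empty list when a page has no genre block.

```python
from html.parser import HTMLParser


class GenreParser(HTMLParser):
    """Collect the text of <a> tags found inside <span class="genres">."""

    def __init__(self):
        super().__init__()
        self.in_genres = False
        self.in_link = False
        self.genres = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "genres" in classes:
            self.in_genres = True
        elif tag == "a" and self.in_genres:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "span":
            self.in_genres = False

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a genre link
        if self.in_link and data.strip():
            self.genres.append(data.strip())


def extract_genres(html):
    parser = GenreParser()
    parser.feed(html)
    return parser.genres


# Assumed sample markup, not copied from the live site
sample = ('<span class="genres"><a href="/genre/16">Animation</a>,&nbsp;'
          '<a href="/genre/35">Comedy</a></span>')
print(extract_genres(sample))
print(extract_genres("<p>No genre information</p>"))
```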




> e. Cast of all the movies on the page as a Python list ( 2 marks )



In [66]:
# Get contents of individual movie pages by calling HTML_Content function
contents = HTML_Content(BS_response)

# Function to extract cast of all movies from their individual page
def movie_cast(contents):
  all_cast = []

  # Iterate over contents
  for content in contents:
    cast_list = []

    # Create a BeautifulSoup object for each movie page content
    content_obj = BeautifulSoup(content, 'html.parser')

    # Extract all cast from content
    cast_names = content_obj.find_all('li', class_='card')

    # Add all extracted cast of each movie to list
    for name in cast_names:
      cast_list.append(name.p.a.text)

    # Add the cast list of each movie to the all_cast list
    all_cast.append(cast_list)

  # Return the list
  return all_cast


# Check the cast returned by movie_cast for the list output from Q4c (contents)
print(movie_cast(contents))

[['Anna Kendrick', 'Justin Timberlake', 'Camila Cabello', 'Eric André', 'Amy Schumer', 'Andrew Rannells', 'Daveed Diggs', 'Troye Sivan', 'Kid Cudi'], ['Adam Sandler', 'Bill Burr', 'Cecily Strong', 'Jason Alexander', 'Rob Schneider', 'Allison Strong', 'Jo Koy', 'Sadie Sandler', 'Sunny Sandler'], ['John Cena', 'Alison Brie', 'Juan Pablo Raba', 'Alice Eve', 'Marton Csokas', 'Christian Slater', 'Julianne Arrieta', 'Molly McCann', 'Daniel Toro'], ['Cillian Murphy', 'Emily Blunt', 'Matt Damon', 'Robert Downey Jr.', 'Florence Pugh', 'Josh Hartnett', 'Casey Affleck', 'Rami Malek', 'Kenneth Branagh'], ['Timothée Chalamet', 'Calah Lane', 'Keegan-Michael Key', 'Olivia Colman', 'Rowan Atkinson', 'Hugh Grant', 'Sally Hawkins', 'Matt Lucas', 'Mathew Baynton'], ['Michael Godere', 'Ezekiel Ajeigbe', 'Raul Ceballos', 'Chris Diamantopoulos', 'Mallory Low', 'Zehra Fazal', 'Keith Silverstein', 'Kari Wahlgren'], ['Josh Hutcherson', 'Piper Rubio', 'Elizabeth Lail', 'Matthew Lillard', 'Mary Stuart Masterson'

**5. Write a user-defined function that returns a pandas DataFrame with the following data: (5 marks)**


> a. Titles of the movies listed on the page


> b. User ratings of the movies listed on the page


> c. Genres of the movies listed on the page


> d. Cast of the movies listed on the page



**Note:**

Input to the user defined function :

*  The response object created in Q3a
*  The list output from Q4c














In [74]:
import pandas as pd

# Function to return a pandas data frame of Titles, Ratings, Genres, and Cast
def convret_to_df(BS_response, contents):

  # Create a data object with keys as column names and values as lists returned from each function
  data = {'Titles' : movie_titles(BS_response), 'Ratings' : user_ratings(BS_response), 'Genres' : movie_genres(contents), 'Cast' : movie_cast(contents)}

  # Convert data to DataFrame
  df = pd.DataFrame(data)

  # Remove array brackets for Genres and Cast columns
  df['Genres'] = df['Genres'].apply(lambda x: ', '.join(x))
  df['Cast'] = df['Cast'].apply(lambda x: ', '.join(x))

  # Return DataFrame
  return df

# Print to check data frame obtained by calling convret_to_df with The response object created in Q3a i.e BS_response and The list output from Q4c i.e contents as inputs
print(convret_to_df(BS_response, contents))

                                           Titles            Ratings  \
0                            Trolls Band Together               71.0   
1                                             Leo              75.03   
2                                       Freelance              64.14   
3                                     Oppenheimer              81.44   
4                                           Wonka               67.0   
5                The Bad Guys: A Very Bad Holiday              58.75   
6                         Five Nights at Freddy's               78.4   
7                                  Reign of Chaos               52.0   
8                                     The Creator              71.23   
9                               Come Out Fighting               47.0   
10                                         Fast X              71.98   
11                                    Expend4bles  64.32000000000001   
12  Mission: Impossible - Dead Reckoning Part One              7
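
The four lists are collected independently, and building a DataFrame from a dictionary of lists fails with a `ValueError` when their lengths differ. The sketch below shows one way to check this up front and to join the nested genre and cast lists into strings. It uses plain Python rows instead of pandas, and the helper name `build_rows` and the sample values are assumptions.

```python
def build_rows(titles, ratings, genres, cast):
    """Combine four parallel lists into a list of row dictionaries."""
    columns = {"Titles": titles, "Ratings": ratings, "Genres": genres, "Cast": cast}

    # All columns must describe the same number of movies
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")

    return [
        {
            "Titles": title,
            "Ratings": rating,
            "Genres": ", ".join(genre_list),
            "Cast": ", ".join(cast_list),
        }
        for title, rating, genre_list, cast_list in zip(titles, ratings, genres, cast)
    ]


# Small illustrative input based on the first two movies listed above
rows = build_rows(
    ["Trolls Band Together", "Leo"],
    ["71.0", "75.03"],
    [["Animation", "Family"], ["Animation", "Comedy"]],
    [["Anna Kendrick", "Justin Timberlake"], ["Adam Sandler", "Bill Burr"]],
)
print(rows[0])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.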

**6. Scraping the data and combining the dataframes ( 5 marks )**



> (a) Write a function that scrapes data (mentioned in Q5) from page number 1, 2, 3, 4 and 5 on the URL https://www.themoviedb.org/movie and returns 5 data frames which can be exported to csv file by calling the functions defined in Q3a, Q4c and Q5 (3 marks)



In [72]:
# Function to scrape data from 5 pages
def scrape_data():
  url = "https://www.themoviedb.org/movie"
  page_url = "?page="
  list_of_frames = []

  # Loop to extract data from five pages(each page containing 20 movies)
  for i in range(1,6):

    # Append necessary strings to main url
    url_to_extract = url + page_url + str(i)

    # Call retrieve_content to obtain a BeautifulSoup instance of the retrieved content of the given URL
    BS_response = retrieve_content(url_to_extract)

    # Call to function HTML_Content to obtain content list of movies of particular page
    list_of_content = HTML_Content(BS_response)

    # Call to function convret_to_df to obtain DataFrames
    data_frame = convret_to_df(BS_response, list_of_content)

    # Make a list of five DataFrames
    list_of_frames.append(data_frame)

  # Return the list of DataFrames
  return list_of_frames


# Print to check List of data frames obtained by calling scrape_data function
print(scrape_data())

[                                           Titles            Ratings  \
0                            Trolls Band Together               71.0   
1                                             Leo              74.99   
2                                       Freelance              64.14   
3                                     Oppenheimer              81.44   
4                                           Wonka              67.14   
5                The Bad Guys: A Very Bad Holiday              58.75   
6                         Five Nights at Freddy's               78.4   
7                                  Reign of Chaos               52.0   
8                                     The Creator              71.23   
9                               Come Out Fighting               47.0   
10                                         Fast X              71.98   
11                                    Expend4bles  64.32000000000001   
12  Mission: Impossible - Dead Reckoning Part One              
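
The page URLs in `scrape_data` are built by string concatenation. An alternative sketch, using `urllib.parse.urlencode` so that the query string is encoded by the library, is shown below. The helper name `page_urls` and its parameters are hypothetical.

```python
from urllib.parse import urlencode


def page_urls(base_url, first_page=1, last_page=5):
    """Return the listing URLs for an inclusive range of page numbers."""
    return [
        f"{base_url}?{urlencode({'page': number})}"
        for number in range(first_page, last_page + 1)
    ]


for page_url in page_urls("https://www.themoviedb.org/movie"):
    print(page_url)
```

Each of these URLs could then be passed to `retrieve_content` inside the loop.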



> (b) Combine the data obtained from dataframes in Q6(a) (2 marks)



In [75]:
# Call to scrape_data function to obtain list of DataFrames
list_of_df = scrape_data()

# Concatenate all DataFrames along axis 0 (rows) and renumber the index
combined_data = pd.concat(list_of_df, axis=0, ignore_index=True)

# Export combined data to CSV file
combined_data.to_csv('combine_data.csv', index=False)

print('combined data is exported to combine_data.csv file')

combined data is exported to combine_data.csv file
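
Concatenating along axis 0 stacks the rows of each page under one shared header, and `index=False` keeps the row index out of the file. The sketch below mimics that result with the standard library's `csv` module instead of pandas, writing to an in-memory buffer. The per-page rows are placeholders, not scraped data.

```python
import csv
import io
from itertools import chain

# Placeholder rows standing in for two scraped pages
page_1 = [{"Titles": "Movie A", "Ratings": "71.0"},
          {"Titles": "Movie B", "Ratings": "64.14"}]
page_2 = [{"Titles": "Movie C", "Ratings": "not rated"}]

# Stack the pages row-wise, as concatenation along axis 0 does
combined_rows = list(chain(page_1, page_2))

# Write one header followed by all rows, with no index column
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Titles", "Ratings"])
writer.writeheader()
writer.writerows(combined_rows)

csv_text = buffer.getvalue()
print(csv_text)
```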
