# Course1 : Foundation of information

## Project-1b: Using web scraping to build a database of movie related information from: The Movie Database (TMDB) movie data

### Steps to be performed

### 1. Establish a connection to the webpage- "https://www.themoviedb.org/movie" and provide the following details

#### 1a.  Import the requests library and formulate a get request to download the contents of the webpage.

In [1]:
import requests
url = "https://www.themoviedb.org/movie"
needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
response = requests.get(("https://www.themoviedb.org/movie"),headers = needed_headers)

#### 1b. Verify the status code of the request and confirm that the request was executed appropriately

In [2]:
if response.status_code == 200:
    print("Request executed. Status code:", response.status_code)
else:
    print("Failed to retrieve. Status code:", response.status_code)

Request executed. Status code: 200


#### 1c. Print the contents of the page obtained from the response and save it in a variable

In [3]:
if response.status_code == 200:
    webpage_content = response.text
    print(webpage_content)

<!DOCTYPE html>
<html lang="en" class="no-js">
  <head>
    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>
    <meta http-equiv="cleartype" content="on">
    <meta charset="utf-8">
    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">
    <meta name="mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="viewport" content="width=device-width,initial-scale=1">
      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">
    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">
<meta name="msapplication-TileColor" content="#032541">
<meta name="theme-color" content="#032541">
<link rel="apple-touch-icon" sizes="180x180" href="/assets/2/appl

#### 1d. Infer the type of the variable created in part 1c and display the first 200 characters of the content from the server’s response 

In [4]:
print("First 200 characters of the content:")
print(webpage_content[:200])

First 200 characters of the content:
<!DOCTYPE html>
<html lang="en" class="no-js">
  <head>
    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>
    <meta http-equiv="cleartype" content="on">
    <meta charset="utf-8">
  


### 2. Parse the content of HTML response using the BeautifulSoup library and execute the tasks specified in the guidelines mentioned below 

#### 2a. From the BeautifulSoup library (bs4) import the BeautifulSoup class. Pass the contents of the webpage obtained from step 1c as an argument to create an instance of the BeautifulSoup class

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(webpage_content, "html.parser")

#### 2b. Extract the title of the parsed web page content using an appropriate method or attribute of the document object created

In [6]:
title = soup.title
print("Title of the webpage:", title.string)

Title of the webpage: Popular Movies — The Movie Database (TMDB)


#### 2c. Write a user defined function to generalize the task presented in 2a, to any URL that retrieves the content of the webpage. Your function should take a URL string as an input and return a correctly formulated BeautifulSoup instance as the output. In your function definition, ensure that appropriate exceptions are raised to the user (through status codes) if they pass in malformed/incorrect URLs. Write two test cases for your function- one with a working URL and another with an URL that gets a 404 response.

In [7]:
import requests
from bs4 import BeautifulSoup

def get_soup_from_url(url):
    try:
        # Send GET request
        response = requests.get(url,headers = needed_headers)
        # Check the status code
        response.raise_for_status()
        # Create BeautifulSoup instance
        soup = BeautifulSoup(response.content, "html.parser")
        return soup
    
    except requests.exceptions.HTTPError as e:
        print("HTTP error occurred:", e)
        return None
    except requests.exceptions.RequestException as e:
        print("Request exception occurred:", e)
        return None

# Test cases
# 1: Working URL
working_url = "https://books.toscrape.com"
soup_working = get_soup_from_url(working_url)
if soup_working:
    print("Title of the webpage:", soup_working.title.string)
else:
    print("Failed to retrieve webpage from", working_url)

# 2: URL with 404 response
bad_url = "https://books.toscrape.com/nonexistentpage"
soup_malformed = get_soup_from_url(bad_url)
if soup_malformed:
    print("Title of the webpage:", soup_malformed.title.string)
else:
    print("Failed to retrieve webpage from", bad_url)


Title of the webpage: 
    All products | Books to Scrape - Sandbox

HTTP error occurred: 404 Client Error: Not Found for url: https://books.toscrape.com/nonexistentpage
Failed to retrieve webpage from https://books.toscrape.com/nonexistentpage


### 3. Extract the content of the webpage- https://www.themoviedb.org/movie- that hosts a current dated listing of popular movies

#### 3a. Write a function call to the user defined function created in 2c with the url https://www.themoviedb.org/movie as an input and store the response in a variable

In [8]:
url = "https://www.themoviedb.org/movie"
soup = get_soup_from_url(url)

In [9]:
print(soup)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Popular Movies — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta content="#032541" name="msapplication-TileColor"/>
<meta content="#032541" name="theme-color"/>
<link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b82adaa1085c79bdde2f00ca8787b63d

#### 3b. Print the HTML content associated with the first movie displayed on the web page using appropriate HTML tags to access this listing on the object created in part 3a

In [10]:
first_movie_div = soup.find("div", class_="card style_1")

if first_movie_div:
    # Print the HTML content associated with the first movie
    print("HTML content associated with the first movie:")
    print(first_movie_div.prettify())  # prettify() method is used for better readability
else:
    print("First movie listing not found on the webpage.")

HTML content associated with the first movie:
<div class="card style_1">
 <div class="image">
  <div class="wrapper glyphicons_v2 picture grey no_image_holder">
   <a class="image" href="/movie/969492" title="Land of Bad">
    <img alt="Land of Bad" class="poster" loading="lazy" src="https://media.themoviedb.org/t/p/w220_and_h330_face/h3jYanWMEJq6JJsCopy1h7cT2Hs.jpg" srcset="https://media.themoviedb.org/t/p/w220_and_h330_face/h3jYanWMEJq6JJsCopy1h7cT2Hs.jpg 1x, https://media.themoviedb.org/t/p/w440_and_h660_face/h3jYanWMEJq6JJsCopy1h7cT2Hs.jpg 2x"/>
   </a>
  </div>
  <div class="options" data-id="969492" data-media-type="movie" data-object-id="626d6b49c56d2d11b1380252">
   <a class="no_click" href="#">
    <div class="glyphicons_v2 circle-more white">
    </div>
   </a>
  </div>
 </div>
 <div class="content">
  <div class="consensus tight">
   <div class="outer_ring">
    <div class="user_score_chart 626d6b49c56d2d11b1380252" data-bar-color="#21d07a" data-percent="71.0" data-track-col

#### 3c. Display the name of the first movie using appropriate HTML tags to access this listing on the object created in part 3a

In [11]:

movie_name_tag = soup.find("div", class_="card style_1").find("h2").find("a")
if movie_name_tag:
    movie_name = movie_name_tag.text.strip()
    print("Name of the first movie:", movie_name)
else:
    print("Movie name not found.")

Name of the first movie: Land of Bad


#### 3d. Display the user rating of the first movie by using appropriate HTML tags to access this listing on the object created in part 3a

In [12]:

user_rating_div = soup.find("div", class_="card style_1").find("div", class_="user_score_chart")
if user_rating_div:
    user_rating = user_rating_div["data-percent"]
    print("User rating of the first movie:", user_rating)
else:
    print("User rating not found within the first movie listing.")

User rating of the first movie: 71.0


#### 3e. For the first movie, extract the part of the url following the string “https://www.themoviedb.org/” using the appropriate HTML tags to extract this portion on the object created in part 3a (do not use built-in string methods). 
#### For example, if the first movie on the web page had the URL https://www.themoviedb.org/movie/779782 “ your output should be movie/779782

In [13]:
movie_url_tag = soup.find("div", class_="card style_1").find("div", class_="content").find("h2").find("a")
if movie_url_tag:
    movie_url = movie_url_tag["href"] 
    # Extract the part of the URL followed
    relative_url = movie_url.replace("https://www.themoviedb.org/", "", 1)
    print("Part of the URL following 'https://www.themoviedb.org/':", relative_url)
else:
    print("URL not found within the first movie listing.")

Part of the URL following 'https://www.themoviedb.org/': /movie/969492


### 4. Write user defined functions for each subsection below (i.e., Q4 a, Q4b, Q4c, Q4d, and Q4e) to return. 

#### 4a. Titles of all the movies on the page as a Python list

In [14]:
def get_movie_titles(soup):
    # Find all div elements containing movie listings
    movie_divs = soup.find_all("div", class_="card style_1")
    movie_titles = []
    for movie_div in movie_divs:
        title_tag = movie_div.find("div", class_="content").find("h2").find("a")
        if title_tag:
            movie_title = title_tag.text.strip()
            movie_titles.append(movie_title)
    return movie_titles
    
print("Titles of all movies on the page:\n", get_movie_titles(soup))

Titles of all movies on the page:
 ['Land of Bad', 'No Way Up', 'Anyone But You', 'Skal - Fight for Survival', 'Migration', 'Lights Out', "Miller's Girl", 'Wonka', 'The Marvels', 'Dune: Part Two', 'Badland Hunters', 'Dad or Mom', 'Wish', 'Trunk - Locked In', 'Dune', 'Aquaman and the Lost Kingdom', 'Poor Things', 'Through My Window 3: Looking at You', 'Mean Girls', 'Orion and the Dark']


#### 4b. User ratings of all the movies on the page as a Python list

In [15]:
def get_user_ratings(soup):
    # Extract ratings of all movies
    movie_divs = soup.find_all("div", class_="card style_1")
    movie_ratings = []
    for movie_div in movie_divs:
        rating_div = movie_div.find("div", class_="user_score_chart")
        if rating_div:
            movie_rating = rating_div["data-percent"]
            movie_ratings.append(movie_rating)
    return movie_ratings
    
print("Ratings of all movies on the page:",get_user_ratings(soup))

Ratings of all movies on the page: ['71.0', '57.12', '69.0', '53.550000000000004', '76.36999999999999', '69.14', '66.34', '72.07', '62.55', '87.0', '67.5', '73.68', '66.3', '57.13', '77.88', '68.7', '81.28', '65.36', '65.46000000000001', '66.65']


#### 4c.  HTML content of all the individual pages of movies collected into a Python list.

In [16]:
def get_movie_pages_html(soup):
    base_url = "https://www.themoviedb.org"
    movie_pages_html = []

    # Find all movie cards on the page
    movie_cards = soup.find_all("div", class_="card style_1")
    
    # Extract the URL of each movie and fetch its HTML content
    for card in movie_cards:
        movie_url_tag = card.find("div", class_="content").find("h2").find("a")
        if movie_url_tag:
            movie_url = movie_url_tag["href"]  
            full_movie_url = base_url + movie_url
            response = requests.get(full_movie_url,headers=needed_headers)
            movie_html = response.content
            movie_pages_html.append(movie_html)
    
    return movie_pages_html


movie_pages_html = get_movie_pages_html(soup)
print(movie_pages_html) 

[b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Land of Bad (2024) &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="When a Delta Force special ops mission goes terribly wrong, Air Force drone pilot Reaper has 48 hours to remedy what has devolved into a wild rescue operation. With no weapons and no communication other than the drone above, the ground mission suddenly becomes a full-scale battle when the team is discovered by the enemy.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icon

#### 4d.  Genres of all the movies on the page as a Python list 

In [17]:
def get_genres_from_html_contents(movie_html_contents):
    all_genres = []
    for html_content in movie_html_contents:
        try:
            soup = BeautifulSoup(html_content, "html.parser")
            genre_spans = soup.find_all("span", class_="genres")
            movie_genres = []
            for genre_span in genre_spans:
                genre_links = genre_span.find_all("a")
                genres = [genre.text.strip() for genre in genre_links]
                movie_genres.extend(genres)
            all_genres.append(movie_genres)
        except Exception as e:
            print("An error occurred:", e)
            all_genres.append(None)
    return all_genres


print(get_genres_from_html_contents(movie_pages_html))


[['Action', 'Thriller', 'War'], ['Action', 'Horror', 'Thriller'], ['Comedy', 'Romance'], ['Action', 'Horror', 'Comedy', 'Thriller'], ['Animation', 'Action', 'Adventure', 'Comedy', 'Family'], ['Action', 'Thriller'], ['Drama', 'Thriller'], ['Comedy', 'Family', 'Fantasy'], ['Science Fiction', 'Adventure', 'Action'], ['Science Fiction', 'Adventure'], ['Science Fiction', 'Action', 'Drama'], ['Comedy'], ['Animation', 'Family', 'Fantasy', 'Adventure'], ['Thriller', 'Action', 'Drama'], ['Science Fiction', 'Adventure'], ['Action', 'Adventure', 'Fantasy'], ['Science Fiction', 'Romance', 'Comedy'], ['Romance', 'Drama', 'Comedy'], ['Comedy'], ['Animation', 'Family', 'Comedy', 'Fantasy']]


#### 4e. Cast of all the movies on the page as a Python list 

In [19]:
from bs4 import BeautifulSoup

def get_cast_from_html_contents(movie_html_contents):
    all_cast = []
    for html_content in movie_html_contents:
        try:
            soup = BeautifulSoup(html_content, "html.parser")
            cast_items = soup.find_all("li", class_="card")
            movie_cast = []
            for item in cast_items:
                actor_name_tag = item.find("p").find("a")
                if actor_name_tag:
                    actor_name = actor_name_tag.text.strip()
                    movie_cast.append(actor_name)
            all_cast.append(movie_cast)
        except Exception as e:
            print("An error occurred:", e)
            all_cast.append(None)
    return all_cast


print(get_cast_from_html_contents(movie_pages_html))

[['Liam Hemsworth', 'Russell Crowe', 'Luke Hemsworth', 'Ricky Whittle', 'Milo Ventimiglia', 'Chika Ikogwe', 'Daniel MacPherson', 'Robert Rabiah', 'Jack Finsterer'], ['Sophie McIntosh', 'Will Attenborough', 'Jeremias Amoore', 'Manuel Pacific', 'Grace Nettle', 'Phyllis Logan', 'Colm Meaney', 'James Carroll Jordan', 'David J Biscoe'], ['Sydney Sweeney', 'Glen Powell', 'Alexandra Shipp', 'Michelle Hurd', 'Bryan Brown', 'Darren Barnet', 'Hadley Robinson', 'Dermot Mulroney', 'Rachel Griffiths'], ['Evan Marsh', 'Chris Sandiford', 'Mariah Inger', 'Darren Eisenhauer', 'Olivia Scriven', 'Trevor Hayes'], ['Kumail Nanjiani', 'Elizabeth Banks', 'Caspar Jennings', 'Tresi Gazal', 'Awkwafina', 'Carol Kane', 'Keegan-Michael Key', 'Danny DeVito', 'David Mitchell'], ['Frank Grillo', 'Mekhi Phifer', 'Scott Adkins', 'Dermot Mulroney', 'Jaime King', 'Kevin Gage', 'Amaury Nolasco', 'Jessica Medina', 'JuJu Chan'], ['Martin Freeman', 'Jenna Ortega', 'Bashir Salahuddin', 'Gideon Adlon', 'Dagmara Domińczyk', 'Ch

### 5 Write an user defined function that returns a pandas data frame with following data: 

#### 5a. Titles of the movies listed on the page
#### 5b. User ratings of the movies listed on the page
#### 5c. Genres of the movies listed on the page
#### 5d. Cast of the movies listed on the page

In [20]:
import pandas as pd

def create_movie_dataframe(soup, movie_html_contents):
    # Get titles of the movies
    movie_titles = get_movie_titles(soup)

    # Get user ratings of the movies
    user_ratings = get_user_ratings(soup)

    # Get genres of the movies
    genres = get_genres_from_html_contents(movie_html_contents)

    # Get cast of the movies
    cast = get_cast_from_html_contents(movie_html_contents)

    data = []
    for i in range(len(movie_titles)):
        data.append({
            "Title": movie_titles[i],
            "User Rating": user_ratings[i] if user_ratings else None,
            "Genres": genres[i] if genres else None,
            "Cast": cast[i] if cast else None
        })

    movie_df = pd.DataFrame(data)
    return movie_df


print(create_movie_dataframe(soup, movie_pages_html))


                                  Title         User Rating  \
0                           Land of Bad                71.0   
1                             No Way Up               57.12   
2                        Anyone But You                69.0   
3             Skal - Fight for Survival  53.550000000000004   
4                             Migration   76.36999999999999   
5                            Lights Out               69.14   
6                         Miller's Girl               66.34   
7                                 Wonka               72.07   
8                           The Marvels               62.55   
9                        Dune: Part Two                87.0   
10                      Badland Hunters                67.5   
11                           Dad or Mom               73.68   
12                                 Wish                66.3   
13                    Trunk - Locked In               57.13   
14                                 Dune               7

### 6 Scraping the data and combining the dataframes

#### 6a. Write a function that scrapes data (mentioned in Q5) from page number 1, 2, 3, 4 and 5 on the URL https://www.themoviedb.org/movie and returns 5 data frames which can be exported to csv file by calling the functions defined in Q3a, Q4c and Q5

In [21]:
def scrape_and_export_movie_data(base_url):
    
    dataframes = []

    for page_number in range(1, 6):

        page_url = f"{base_url}?page={page_number}"
        
        # Get HTML content of the current page
        movie_html_contents = get_movie_pages_html(soup)

        # Create DataFrame for the current page
        page_dataframe = create_movie_dataframe(soup, movie_html_contents)
        
        # Export DataFrame to CSV file
        page_dataframe.to_csv(f"movies_page_{page_number}.csv", index=False)

        # Append DataFrame to the list
        dataframes.append(page_dataframe)

    # Return the list of DataFrames
    return tuple(dataframes)


base_url = "https://www.themoviedb.org/movie"

df1, df2, df3, df4, df5 = scrape_and_export_movie_data(base_url)


#### 6b. Combine the data obtained from dataframes in Q6(a)

In [22]:
pd.concat([df1,df2,df3,df4,df5])

Unnamed: 0,Title,User Rating,Genres,Cast
0,Land of Bad,71.0,"[Action, Thriller, War]","[Liam Hemsworth, Russell Crowe, Luke Hemsworth..."
1,No Way Up,57.12,"[Action, Horror, Thriller]","[Sophie McIntosh, Will Attenborough, Jeremias ..."
2,Anyone But You,69.0,"[Comedy, Romance]","[Sydney Sweeney, Glen Powell, Alexandra Shipp,..."
3,Skal - Fight for Survival,53.550000000000004,"[Action, Horror, Comedy, Thriller]","[Evan Marsh, Chris Sandiford, Mariah Inger, Da..."
4,Migration,76.36999999999999,"[Animation, Action, Adventure, Comedy, Family]","[Kumail Nanjiani, Elizabeth Banks, Caspar Jenn..."
...,...,...,...,...
15,Aquaman and the Lost Kingdom,68.7,"[Action, Adventure, Fantasy]","[Jason Momoa, Patrick Wilson, Yahya Abdul-Mate..."
16,Poor Things,81.28,"[Science Fiction, Romance, Comedy]","[Emma Stone, Mark Ruffalo, Willem Dafoe, Ramy ..."
17,Through My Window 3: Looking at You,65.36,"[Romance, Drama, Comedy]","[Clara Galle, Julio Peña, Natalia Azahara, Eri..."
18,Mean Girls,65.46000000000001,[Comedy],"[Angourie Rice, Reneé Rapp, Auli'i Cravalho, J..."


## Project Summary: Web Scraping for Movie Data

### Objective:

The objective of this project is to programmatically extract data from The Movie Database website (https://www.themoviedb.org/) and format it into a tabular structure for analysis. The extracted data includes details such as Title of the movie, genres, members, and user ratings.


### Overview

#### Business Requirement:
The business requirement is to gather relevant data from web pages hosting movie information.
Manual procedures are not feasible due to the vast amount of information spread across multiple websites.

#### Approach:
The project employs a programmatic approach to access, parse, and extract information from the website of interest.
It utilizes web scraping techniques to automate the extraction process.
Implementation:

#### Data Source: 
The Movie Database website (https://www.themoviedb.org/)

#### Tasks: 
Each task involves accessing specific types of information from the website.

#### Documentation: 
Code examples provided in the documentation are adapted for each task.


### Potential Extensions:

The project can be extended to gather data from additional websites or to include more detailed information about movies.
Automation scripts can be developed to periodically update the dataset with the latest movie releases and information.


### Conclusion:

This project demonstrates the effectiveness of a programmatic approach to extract and format data from web pages, fulfilling the business requirement of gathering movie-related information for analysis. By leveraging web scraping techniques, it helps efficient access to vast amounts of data spread across multiple web pages, paving the way for significant insights.