**Objective**

The project's goal is to extract data (from a chosen number of pages) from The Movie Database website (https://www.themoviedb.org/) into a tabular data format so that further analysis (e.g., details about a movie's genre, cast, and user rating) can be facilitated. To execute this project, you will have to read the documentation links provided against each task in the assignment and adapt the code examples provided in the documentation for the task at hand

1.
Establish a connection to the webpage - "https://www.themoviedb.org/movie" - and provide the following details ( 4 marks ) bold text

a. Import the requests library (https://requests.readthedocs.io/en/latest/ ) and formulate a get request to download the contents of the webpage ( "https://www.themoviedb.org/movie" ) ( 1 mark )

In [1]:
#Installing the requests package
!pip install requests --upgrade



In [2]:
#Importing the requests package into this code file

import requests


In [3]:
#We are adding the headers to establish a secure connection request to the TMDB Website.

needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36  (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}

#Initating the request to the TMDB Website

response = requests.get(('https://www.themoviedb.org/movie'),headers= needed_headers,allow_redirects=False)

b. Verify the status code of the request and confirm that the request was executed appropriately (https://requests.readthedocs.io/en/latest/user/quickstart/#responsestatus-code) ( 1 mark )

In [4]:
# The status code shows the status of the request website connection
#403 - represents the request connect is not establish between TMDB Website and user
#200 - represents successful connection between TMDB Webite and User

response.status_code

200

c. Print the contents of the page obtained from the response and save it in a variable (https://requests.readthedocs.io/en/latest/user/quickstart/#response-content) ( 1 mark )

In [5]:
#Extracting the Contents of the websites

extract_content = response.text

extract_content

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" content="#032541">\n<meta name="theme-color" content="#032541">\n<link rel="apple-touch-icon" sizes="180x180" href=

d. Infer the type of the variable created in part 1c and display the first 200 characters of the content from the server’s response ( 1 Mark )

In [6]:
# To print the data type of the extract_content variable
print(type(extract_content))

<class 'str'>


In [7]:
#To display the first 200 characters of the content from the server
extract_content[:200]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n  '

2. Parse the content of HTML response using the BeautifulSoup library and execute the tasks specified in the guidelines mentioned below ( 6 marks )

a. From the BeautifulSoup library (bs4) import the BeautifulSoup class. Pass the contents of the webpage obtained from step 1c as an argument to create an instance of the BeautifulSoup class ( 2 Marks )

In [8]:
#Install the BeautifulSoup library
!pip install beautifulsoup4 --upgrade

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.0/143.0 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: beautifulsoup4
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.11.2
    Uninstalling beautifulsoup4-4.11.2:
      Successfully uninstalled beautifulsoup4-4.11.2
Successfully installed beautifulsoup4-4.12.2


In [9]:
#Import the BeautifulSoup pack into the code

from bs4 import BeautifulSoup
import re
#passing the extract_content into the beautifulSoup in order to access the HTML Elements

movies = BeautifulSoup(extract_content, 'html.parser')

b. Extract the title of the parsed web page content using an appropriate method or attribute of the document object created in part 2a

In [10]:
#To extract the title of TMDB Website by using the BeautifulSoup(Bs) element find

movies.find('title')

<title>Popular Movies — The Movie Database (TMDB)</title>

c. Write a user defined function to generalize the task presented in Q2a to any URL that retrieves the content of the webpage. Your function should take a URL string as an input and return a correctly formulated BeautifulSoup instance as the output. In your function definition, ensure that appropriate exceptions are raised to the user (through status codes) if they pass in malformed/incorrect URLs. Write two test cases for your function - one with a working URL and another with an URL that gets a 404 response. ( 3 marks )

In [11]:
#User defined function get_url to get the contents of the page

def get_url(url):


    needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36  (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers = needed_headers,allow_redirects=False)


    if not response.ok:

        raise Exception ("Failed to request the data. Status Code:- {}".format(response.status_code))

    else:

        page_content = response.text
        movies_page = BeautifulSoup(page_content, "html.parser")

        return movies_page

In [12]:
#First test case-------> Writing correct url
url = "https://www.themoviedb.org/movie"

get_url(url)

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<title>Popular Movies — The Movie Database (TMDB)</title>
<meta content="on" http-equiv="cleartype"/>
<meta charset="utf-8"/>
<meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
<meta content="yes" name="mobile-web-app-capable"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
<meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
<meta content="#032541" name="msapplication-TileColor"/>
<meta content="#032541" name="theme-color"/>
<link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b82adaa1085c79bdde2f00ca8787b63d

In [13]:
#Second test case------> Writing incorrect url("missed 'e' in the movie")
url = "https://www.themoviedb.org/movi"

get_url(url)

Exception: ignored

3. Extract the content of the webpage - https://www.themoviedb.org/movie - that hosts a current dated listing of popular movies. ( 5 Marks )

a. Write a function call to the user defined function created in 2c with the url https://www.themoviedb.org/movie as an input and store the response in a variable ( 1 mark )

In [14]:
# calling get_url with the above mentioned url and storing it in movie_content
url = "https://www.themoviedb.org/movie"

movie_content = get_url(url)

b. Print the HTML content associated with the first movie displayed on the web page using appropriate HTML tags to access this listing on the object created in part 3a ( 1 mark )

In [15]:
#Dispalying the HTML Tag associated with the first movie (i.e: Leo)
movie_content.find_all('div', class_='card style_1')[0]

<div class="card style_1">
<div class="image">
<div class="wrapper">
<a class="image" href="/movie/901362" title="Trolls Band Together">
<img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/bkpPTZUdq31UGDovmszsg2CchiI.jpg" srcset="/t/p/w220_and_h330_face/bkpPTZUdq31UGDovmszsg2CchiI.jpg 1x, /t/p/w440_and_h660_face/bkpPTZUdq31UGDovmszsg2CchiI.jpg 2x"/>
</a>
</div>
<div class="options" data-id="901362" data-media-type="movie" data-object-id="619bea97c0ae360089136cff">
<a class="no_click" href="#"><div class="glyphicons_v2 circle-more white"></div></a>
</div>
</div>
<div class="content">
<div class="consensus tight">
<div class="outer_ring">
<div class="user_score_chart 619bea97c0ae360089136cff" data-bar-color="#21d07a" data-percent="71.75" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r72"></span>
</div>
</div>
</div>
</div>
<h2><a href="/movie/901362" title="Trolls Band Together">Trolls Band Together</a></h2>
<p>Oct 12, 2023</p>
</div>
<div cl

c. Display the name of the first movie using appropriate HTML tags to access this listing on the object created in part 3a ( 1 mark )

In [16]:
#Displaying the title of the first movie
firstmovie_title = movie_content.find_all('div',{'class':'card style_1'})[0].h2.text
print(firstmovie_title)

Trolls Band Together


d. Display the user rating of the first movie by using appropriate HTML tags to access this listing on the object created in part 3a (1 mark )

In [17]:
#Displaying the rating of the first movie
firstmovie_rating = movie_content.find_all('div',{'class':'user_score_chart'})[0]['data-percent']
print(firstmovie_rating)

71.75


e. For the first movie, extract the part of the url following the string “https://www.themoviedb.org/” using the appropriate HTML tags to extract this portion on the object created in part 3a (do not use built-in string methods). (1 mark ) For example, if the first movie on the web page had the URL https://www.themoviedb.org/movie/779782 “ your output should be movie/779782

In [18]:
#displaying the Webpage URL of the first movie
firstmovie_url = movie_content.find_all('div',{'class':'card style_1'})[0].h2.a['href']
print(firstmovie_url)

/movie/901362


4. Write user defined functions for each subsection below (i.e., Q4 a, Q4b, Q4c, Q4d, and Q4e) to return ( 10 marks )

a. Titles of all the movies on the page as a Python list (2 marks )

In [19]:
#User defined function for pulling all the title of the movies
def get_titles(movie_content):
    content_class = movie_content.find_all('div', class_="content")
    #create an empty list
    movie_title = []

    for content in content_class:
        a_element = content.find('a')
        #If the particular a_elemnet is not having any value
        if a_element is not None and a_element.text.strip():
            movie_title.append(a_element.text.strip())
#get_titles returns back the list
    return movie_title

#assigning the value to the variable to validate the output
titles_list = get_titles(movie_content)
print("Movie Titles are :")
titles_list

Movie Titles are :


['Trolls Band Together',
 'Leo',
 'Freelance',
 'Oppenheimer',
 'Reign of Chaos',
 'The Bad Guys: A Very Bad Holiday',
 "Five Nights at Freddy's",
 'The Creator',
 'Come Out Fighting',
 'Fast X',
 'Expend4bles',
 'Mission: Impossible - Dead Reckoning Part One',
 'Wonka',
 'Candy Cane Lane',
 'Inferno',
 'Godzilla Minus One',
 'Mousa',
 'Family Switch',
 'Killers of the Flower Moon',
 'The Immortal Wars: Rebirth']

b. User ratings of all the movies on the page as a Python list (2 marks )

In [20]:
#User defined function for pulling all the user rating of the movies
def get_user_rating(movie_content):
    content_class = movie_content.find_all('div', class_="content")
    #create an empty list
    user_rating = []

    for content in content_class:
        span = content.find('span')
        #to check if span has class are present in the attributes
        if span and 'class' in span.attrs:
            span_class = span['class']
            #For example:if the displayed span class="icon icon-r72"
            #In the span_class[1] we are extracting the second element in the class(i.e icon-r72)
            #[-2:]here we do slicing operation to extract the last two digits(i.e 72)
            user_score = span_class[1][-2:]
            if user_score.isdigit() or user_score == 'NR':
                user_rating.append(user_score)
#user_rating returns back the list
    return user_rating

#assigning the value to the variable to validate the output
scores = get_user_rating(movie_content)
print("User Rating are: ")
scores


User Rating are: 


['72',
 '75',
 '64',
 '81',
 '55',
 '57',
 '78',
 '71',
 '47',
 '72',
 '64',
 '76',
 '68',
 '64',
 '61',
 '84',
 '66',
 '63',
 '77',
 '35']

c. HTML content of all the individual pages of movies collected into a Python list. ( 2 marks )

In [21]:
#User defined function for pulling all the HTML Content (i.e URL) of the individual movies
def get_detail_pages(movie_content):
    base_url = 'http://themoviedb.org'
    content_class = movie_content.find_all('div', class_='content')
    #create an empty list
    detail_pages = []
    for content in content_class:
        a_element = content.find('a')
        #To verify if the a_element returns non null value and href present in the attributes
        if a_element is not None and 'href' in a_element.attrs:
            href = a_element['href']
            #to check if the href starts with this particular string
            if href.startswith('/movie/'):
                detail_pages.append(base_url + href)
#get_detail_pages returns a list
    return detail_pages

#assigning the value to the variable to validate the output
result_detail_pages = get_detail_pages(movie_content)
print("Individual movies pages are: ")
result_detail_pages

Individual movies pages are: 


['http://themoviedb.org/movie/901362',
 'http://themoviedb.org/movie/1075794',
 'http://themoviedb.org/movie/897087',
 'http://themoviedb.org/movie/872585',
 'http://themoviedb.org/movie/951546',
 'http://themoviedb.org/movie/1046032',
 'http://themoviedb.org/movie/507089',
 'http://themoviedb.org/movie/670292',
 'http://themoviedb.org/movie/1047925',
 'http://themoviedb.org/movie/385687',
 'http://themoviedb.org/movie/299054',
 'http://themoviedb.org/movie/575264',
 'http://themoviedb.org/movie/787699',
 'http://themoviedb.org/movie/1022964',
 'http://themoviedb.org/movie/10908',
 'http://themoviedb.org/movie/940721',
 'http://themoviedb.org/movie/775244',
 'http://themoviedb.org/movie/798021',
 'http://themoviedb.org/movie/466420',
 'http://themoviedb.org/movie/920258']

d. Genres of all the movies on the page as a Python list ( 2 marks )

In [30]:
import requests
from bs4 import BeautifulSoup
import time
#User defined function for pulling all the Genres of the movies
def get_genres(result_detail_pages):
    #create a empty list
    all_genres = []

    for i in range(len(result_detail_pages)):
        #using the try except method to identify any request exception error
        try:
            #since the toomanyredirects error rarely pops-up while excecution. These header basically explains who we are to the client site.
            headers = {
                          'Accept-Encoding': 'gzip, deflate, sdch',
                          'Accept-Language': 'en-US,en;q=0.8',
                          'Upgrade-Insecure-Requests': '1',
                          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
                          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                          'Cache-Control': 'max-age=0',
                          'Connection': 'keep-alive',
                      }
            #request the URL of the individual page
            response = requests.get(result_detail_pages[i], headers=headers)

            #reponse is acknowledged by the page
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                genres_find = soup.find_all('span', {'class': 'genres'})
                genres = []

                for span in genres_find:
                    genres.extend(a.text.strip() for a in span.find_all('a'))

                #appending the genre to the empty list
                all_genres.append(genres)
            #reponse is not acknowledged by the page
            else:
                print(f"Failed to retrieve genres from {result_detail_pages[i]}. Status code: {response.status_code}")

            #sleep between network calls for upto 1 second
            time.sleep(1)

        except requests.exceptions.RequestException as e:
            print(f"Failed to retrieve genres from {result_detail_pages[i]}. Error: {e}")
#get_genres return a list
    return all_genres

# Assigning the value to the variable to validate the output
all_genres = get_genres(result_detail_pages)
print("All Genres Lists:")
all_genres


All Genres Lists:


[['Animation', 'Family', 'Music', 'Fantasy', 'Comedy'],
 ['Animation', 'Comedy', 'Family'],
 ['Action', 'Comedy'],
 ['Drama', 'History'],
 ['Action', 'Horror', 'Fantasy'],
 ['Animation', 'Comedy', 'Family'],
 ['Horror', 'Mystery'],
 ['Science Fiction', 'Action', 'Thriller'],
 ['War', 'Drama', 'Action'],
 ['Action', 'Crime', 'Thriller'],
 ['Action', 'Adventure', 'Thriller'],
 ['Action', 'Thriller'],
 ['Comedy', 'Family', 'Fantasy'],
 ['Comedy', 'Fantasy', 'Family'],
 ['Action', 'Drama', 'Romance'],
 ['Drama', 'Science Fiction', 'Action', 'Horror'],
 ['Action', 'Adventure', 'Science Fiction'],
 ['Comedy', 'Fantasy', 'Family', 'Crime'],
 ['Crime', 'Drama', 'History'],
 ['Science Fiction']]

In [25]:
#to verify the length of all_genres list
print(len(all_genres))

20


e. Cast of all the movies on the page as a Python list ( 2 marks )

In [32]:
import requests
from bs4 import BeautifulSoup
import time
#User defined function for pulling all the casts of the movies
def get_cast(result_detail_pages):
    #create a empty list
    all_casts = []

    for i in range(len(result_detail_pages)):
        #using the try except method to identify any request exception error
        try:
            #since the toomanyredirects error rarely pops-up while excecution. These header basically explains who we are to the client site.
            headers = {
                          'Accept-Encoding': 'gzip, deflate, sdch',
                          'Accept-Language': 'en-US,en;q=0.8',
                          'Upgrade-Insecure-Requests': '1',
                          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
                          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                          'Cache-Control': 'max-age=0',
                          'Connection': 'keep-alive',
                      }
            #request the URL of the individual page
            response = requests.get(result_detail_pages[i], headers=headers)

            #reponse is acknowledged by the page
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                cast_find = soup.find_all('li', {'class': 'card'})
                cast = []

                for li_find in cast_find:
                    cast.extend(a.text.strip() for a in li_find.find_all('a'))

                cast = list(filter(None, cast))

                #appending the genre to the empty list
                all_casts.append(cast)
            ##reponse is not acknowledged by the page
            else:
                print(f"Failed to retrieve cast from {result_detail_pages[i]}. Status code: {response.status_code}")

            #sleep between network calls for upto 1 second
            time.sleep(1)

        except requests.exceptions.RequestException as e:
            print(f"Failed to retrieve cast from {result_detail_pages[i]}. Error: {e}")

#get_Cast returns a list
    return all_casts

# Assigning the value to the variable to validate the output
all_casts = get_cast(result_detail_pages)
print("All Casts Lists:")
all_casts

All Casts Lists:


[['Anna Kendrick',
  'Justin Timberlake',
  'Camila Cabello',
  'Eric André',
  'Amy Schumer',
  'Andrew Rannells',
  'Daveed Diggs',
  'Troye Sivan',
  'Kid Cudi'],
 ['Adam Sandler',
  'Bill Burr',
  'Cecily Strong',
  'Jason Alexander',
  'Rob Schneider',
  'Allison Strong',
  'Jo Koy',
  'Sadie Sandler',
  'Sunny Sandler'],
 ['John Cena',
  'Alison Brie',
  'Juan Pablo Raba',
  'Alice Eve',
  'Marton Csokas',
  'Christian Slater',
  'Julianne Arrieta',
  'Molly McCann',
  'Daniel Toro'],
 ['Cillian Murphy',
  'Emily Blunt',
  'Matt Damon',
  'Robert Downey Jr.',
  'Florence Pugh',
  'Josh Hartnett',
  'Casey Affleck',
  'Rami Malek',
  'Kenneth Branagh'],
 ['Rebecca Finch',
  'Ray Whelan',
  'Kate Milner Evans',
  'Mark Sears',
  'Peter Cosgrove',
  'Carmina Cordelia',
  'Mike Kelson',
  'Kate Lush',
  'Georgia Woodthorpe'],
 ['Michael Godere',
  'Ezekiel Ajeigbe',
  'Raul Ceballos',
  'Chris Diamantopoulos',
  'Mallory Low',
  'Zehra Fazal',
  'Keith Silverstein',
  'Kari Wahlgren'

In [33]:
#To verify the length of all_casts list
len(all_casts)

20

5. Write an user defined function that returns a pandas data frame with following data: ( 5 marks ) a. Titles of the movies listed on the page b. User ratings of the movies listed on the page c. Genres of the movies listed on the page d. Cast of the movies listed on the page

In [34]:
import pandas as pd
#User defined function for pulling the titles, user rating, Genres and casts
def get_dataframe(movie_content, result_detail_pages):
      #assigning the title, user rating, Genres and Casts user defined function into the variables(lists) inorder for us to access the dictionary
      titles_list = get_titles(movie_content)
      scores = get_user_rating(movie_content)
      movie_genres = get_genres(result_detail_pages)
      movie_cast = get_cast(result_detail_pages)
      # #the length of all the list must be equal in order to convert to the pandas dataframe
      # #commenting out this line(only used for verification)
      # print(len(titles_list), len(scores), len(movie_genres), len(movie_cast))

      #when the all the size are equal it can be converted into dictionary
      if len(titles_list) == len(scores) == len(movie_genres) == len(movie_cast):
        data = {'Title': titles_list,'User Rating': scores,'Genres': movie_genres,'Cast': movie_cast}
        #dictionary is called in pandas dataframe
        df = pd.DataFrame(data)
        return df

# Assigning the value to the variable to validate the output
pandf = get_dataframe(movie_content, result_detail_pages)
print(pandf)

                                            Title User Rating  \
0                            Trolls Band Together          72   
1                                             Leo          75   
2                                       Freelance          64   
3                                     Oppenheimer          81   
4                                  Reign of Chaos          55   
5                The Bad Guys: A Very Bad Holiday          57   
6                         Five Nights at Freddy's          78   
7                                     The Creator          71   
8                               Come Out Fighting          47   
9                                          Fast X          72   
10                                    Expend4bles          64   
11  Mission: Impossible - Dead Reckoning Part One          76   
12                                          Wonka          68   
13                                Candy Cane Lane          64   
14                       

In [36]:
#Now let us store the dataframe values into csv file
store_page_1 = get_dataframe(movie_content, result_detail_pages)
store_page_1.to_csv('TMDB_Page1.csv')
pd.read_csv('TMDB_Page1.csv',index_col=[0])

Unnamed: 0,Title,User Rating,Genres,Cast
0,Trolls Band Together,72,"['Animation', 'Family', 'Music', 'Fantasy', 'C...","['Anna Kendrick', 'Justin Timberlake', 'Camila..."
1,Leo,75,"['Animation', 'Comedy', 'Family']","['Adam Sandler', 'Bill Burr', 'Cecily Strong',..."
2,Freelance,64,"['Action', 'Comedy']","['John Cena', 'Alison Brie', 'Juan Pablo Raba'..."
3,Oppenheimer,81,"['Drama', 'History']","['Cillian Murphy', 'Emily Blunt', 'Matt Damon'..."
4,Reign of Chaos,55,[],[]
5,The Bad Guys: A Very Bad Holiday,57,"['Animation', 'Comedy', 'Family']","['Michael Godere', 'Ezekiel Ajeigbe', 'Raul Ce..."
6,Five Nights at Freddy's,78,"['Horror', 'Mystery']","['Josh Hutcherson', 'Piper Rubio', 'Elizabeth ..."
7,The Creator,71,"['Science Fiction', 'Action', 'Thriller']","['John David Washington', 'Madeleine Yuna Voyl..."
8,Come Out Fighting,47,"['War', 'Drama', 'Action']","['Dolph Lundgren', 'Tyrese Gibson', 'Michael J..."
9,Fast X,72,"['Action', 'Crime', 'Thriller']","['Vin Diesel', 'Michelle Rodriguez', 'Tyrese G..."


6. Scraping the data and combining the dataframes ( 5 marks )

(a) Write a function that scrapes data (mentioned in Q5) from page number 1, 2, 3, 4 and 5 on
the URL https://www.themoviedb.org/movie and returns 5 data frames which can be exported
to csv file by calling the functions defined in Q3a, Q4c and Q5 (3 marks)

In [43]:
#defining the user defined function to get all the individual dataframes that we have defined previously
def get_all_dataframes(movie_content, result_detail_pages):
      #assigning the title, user rating, Genres ,Casts and movie link user defined function into the variables(lists) inorder for us to access the dictionary
      titles_list = get_titles(movie_content)
      scores = get_user_rating(movie_content)
      movie_genres = get_genres(result_detail_pages)
      movie_cast = get_cast(result_detail_pages)
      movie_link = get_detail_pages(movie_content)
      #when the all the size are equal it can be converted into dictionary
      if len(titles_list) == len(scores) == len(movie_genres) == len(movie_cast) == len(movie_link):
        data = {'Title': titles_list,'User Rating': scores,'Genres': movie_genres,'Cast': movie_cast,'Links': movie_link }
        df = pd.DataFrame(data)
        return df

In [48]:
import os
#defining the base url
base_url = "https://www.themoviedb.org/movie"
#create an empty list
all_dataframe = []
#creating a user defined for next pages
def next_page( i, all_dataframe):
    #Create a local directory
    os.makedirs('Web-scraping', exist_ok = True)
    #logic to navigate the next page as given in the question
    next_page_url = base_url + '?page={}'.format(i)
    #parsing Bs elements into next_page_url
    page_response = get_url(next_page_url)
    #calling all the dataframes
    dataframe_pooling = get_all_dataframes(page_response, result_detail_pages)
    #storing the dataframes into csv files for seperate pages
    dataframe_pooling.to_csv("Web-scraping/TMDB-page-{}.csv".format(i) , index = None)
    #appending the pooled dataframes into an empty list
    all_dataframe.append(dataframe_pooling)

In [49]:
#calling 4th page dataframes
next_page( 4, all_dataframe)

In [50]:
#Verifying the dataframes by calling 4th page csv files
pd.read_csv('Web-scraping/TMDB-page-4.csv',index_col=[0])

Unnamed: 0_level_0,User Rating,Genres,Cast,Links
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mega Lightning,45,"['Animation', 'Family', 'Music', 'Fantasy', 'C...","['Anna Kendrick', 'Justin Timberlake', 'Camila...",http://themoviedb.org/movie/988591
The Flash,68,"['Animation', 'Comedy', 'Family']","['Adam Sandler', 'Bill Burr', 'Cecily Strong',...",http://themoviedb.org/movie/298618
Puss in Boots: The Last Wish,83,"['Action', 'Comedy']","['John Cena', 'Alison Brie', 'Juan Pablo Raba'...",http://themoviedb.org/movie/315162
Believer 2,71,"['Drama', 'History']","['Cillian Murphy', 'Emily Blunt', 'Matt Damon'...",http://themoviedb.org/movie/958263
The Velveteen Rabbit,76,[],[],http://themoviedb.org/movie/1186957
The Grinch,68,"['Animation', 'Comedy', 'Family']","['Michael Godere', 'Ezekiel Ajeigbe', 'Raul Ce...",http://themoviedb.org/movie/360920
The Mercenary,63,"['Horror', 'Mystery']","['Josh Hutcherson', 'Piper Rubio', 'Elizabeth ...",http://themoviedb.org/movie/660521
Godzilla: King of the Monsters,67,"['Science Fiction', 'Action', 'Thriller']","['John David Washington', 'Madeleine Yuna Voyl...",http://themoviedb.org/movie/373571
Retribution,70,"['War', 'Drama', 'Action']","['Dolph Lundgren', 'Tyrese Gibson', 'Michael J...",http://themoviedb.org/movie/762430
Home Alone 2: Lost in New York,67,"['Action', 'Crime', 'Thriller']","['Vin Diesel', 'Michelle Rodriguez', 'Tyrese G...",http://themoviedb.org/movie/772


(b) Combine the data obtained from dataframes in Q6(a) (2 marks)

In [52]:
#define the base url
base_url = "https://www.themoviedb.org/movie"
#create user defined function for all the movies in 5 pages
def all_movies(base_url):
    #create an empty list
    all_dataframe = []
    #scraping the movies from 5 pages
    for i in range(1,6):
        next_page( i, all_dataframe)
    #concatenating all the dataframes
    merge_all_dataframe = pd.concat(all_dataframe, ignore_index = True)
    #converting all concatenated dataframes into csv
    csv_complete =  merge_all_dataframe.to_csv('Web-scraping/TMDB_all_movies.csv', index= None)

In [53]:
#calling the all movies
all_movies(base_url)

In [54]:
#displaying all movies by calling the csv file
pd.read_csv('Web-scraping/TMDB_all_movies.csv')[0:]

Unnamed: 0,Title,User Rating,Genres,Cast,Links
0,Trolls Band Together,72,"['Animation', 'Family', 'Music', 'Fantasy', 'C...","['Anna Kendrick', 'Justin Timberlake', 'Camila...",http://themoviedb.org/movie/901362
1,Leo,75,"['Animation', 'Comedy', 'Family']","['Adam Sandler', 'Bill Burr', 'Cecily Strong',...",http://themoviedb.org/movie/1075794
2,Freelance,64,"['Action', 'Comedy']","['John Cena', 'Alison Brie', 'Juan Pablo Raba'...",http://themoviedb.org/movie/897087
3,Oppenheimer,81,"['Drama', 'History']","['Cillian Murphy', 'Emily Blunt', 'Matt Damon'...",http://themoviedb.org/movie/872585
4,Reign of Chaos,55,[],[],http://themoviedb.org/movie/951546
...,...,...,...,...,...
95,Animal,70,"['Drama', 'Science Fiction', 'Action', 'Horror']","['Ryunosuke Kamiki', 'Hidetaka Yoshioka', 'Min...",http://themoviedb.org/movie/781732
96,One Piece: Episode of Skypiea,69,"['Action', 'Adventure', 'Science Fiction']","['Karim Mahmoud Abdel Aziz', 'Eyad Nassar', 'A...",http://themoviedb.org/movie/545742
97,Thriller Night,69,"['Comedy', 'Fantasy', 'Family']","['Jennifer Garner', 'Ed Helms', 'Emma Myers', ...",http://themoviedb.org/movie/118249
98,The School of Magical Animals 2,60,"['Crime', 'Drama', 'History']","['Leonardo DiCaprio', 'Lily Gladstone', 'Rober...",http://themoviedb.org/movie/926946


Citation

For toomany redirects error: exceeds more than 30 https://stackoverflow.com/questions/42237672/python-toomanyredirects-exceeded-30-redirects