In [None]:
# Jovian Commit Essentials
# Please retain and execute this cell without modifying the contents for `jovian.commit` to work
!pip install jovian --upgrade -q
import jovian
jovian.set_project('web-scraping-tmdb')
jovian.set_colab_id('1M40pNjjU7MgWi55Yqk8PcGyGZdVid0h7')

# Web Scraping Popular Television Shows on TMDB ([themoviedb.org](https://www.themoviedb.org))

Data source: TMDb website ([https://www.themoviedb.org](https://www.themoviedb.org))

> A disclaimer before we begin: many websites restrict or outright bar scraping of data from their pages. Users may be subject to legal ramifications depending on where and how they attempt to scrape information. Many sites list their restrictions for automated access at **www.[site].com/robots.txt**. Be extremely careful with sites that house user data: places like Facebook, LinkedIn, and even Craigslist do not take kindly to data being scraped from their pages. When in doubt, please contact the team that runs the site.

![](https://i.imgur.com/qi5y4LB.png)
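
The `robots.txt` check mentioned in the disclaimer can also be done in code. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helper name `is_allowed` are made up for illustration; they are not TMDb's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/tv"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

For a real site, you would point the parser at the live file with `set_url()` and `read()` instead of parsing a string. Keep in mind that `robots.txt` is only one signal; a site's terms of use still apply.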

The Movie Database (TMDb) is a community-driven database of movies and television shows. The community has added every piece of data since 2008. Users can search for topics they are interested in, browse a large amount of data, and discover what they like. Users can also contribute to the TMDb community by reviewing and scoring shows for the benefit of others. In summary, TMDb is an excellent website for someone like me who wants to practice web scraping skills.

### Project motivation
For this project, we will retrieve information from the **'Popular TV Shows'** page using _web scraping_: the process of extracting information from a website programmatically. Web scraping isn't magic; many readers already gather information from websites by hand every day. For example, a recent graduate may copy and paste information about the companies they have applied to into a spreadsheet to manage their job applications. A scraper simply automates that kind of work.

#### Project goals

The goal of the project is to build a web scraper that extracts all the information we want and assembles it into a single CSV file. The format of the output CSV file is shown below:

|#|movie_title|released_date|score|image_link|detail_page|
|-|-----------|-------------|-----|----------|-----------|
|1|The Falcon and the Winter Soldier|19 Mar 2021|78%|...|...|
|2|The Good Doctor|25 Sep 2017|86%|...|...|
|...|...|...|...|...|...|




## Project steps
Here is an outline of the steps we'll follow.

1. Download the webpage using `requests`
2. Parse the HTML source code using the `BeautifulSoup` library
3. Build the scraper components
4. Compile the extracted information into Python lists and dictionaries
5. Write information to CSV files
6. Extract and combine data from multiple pages
7. Future work and references


### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

In [None]:
!pip install jovian --upgrade --quiet
import jovian
jovian.commit(project="210424-project001-web-scraping-tmbd")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/shenghongzhong/210424-project001-web-scraping-tmbd


'https://jovian.ai/shenghongzhong/210424-project001-web-scraping-tmbd'

# Web Scraping TV shows from TMDb (themoviedb.org)


## Download the webpage using `requests`


### **Requests**

### **World Wide Web**

Before we explain what the `requests` library is, we should ask **WHY we need it**. The answer takes us back to the origin of the World Wide Web.

In 1989, [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee) proposed the concept of the [World Wide Web](http://info.cern.ch/hypertext/WWW/TheProject.html) as an open platform where users can share information quickly and locate it from anywhere in the world. At the time, this let scientists keep working with shared information without having to travel back to their home countries.

### **Three key components**

![](https://i.imgur.com/eAmOdX0.png)

### **HTML**

We can break the web down into three key components. The first is the **HyperText Markup Language**, or HTML for short. It's the standard markup language for documents designed to be displayed in web browsers. HTML describes content, much like a Word document describes paragraphs of text, images, and tables of data.

### **URL**

The second is the URL, which stands for **Uniform Resource Locator**. It is what you enter into the address bar of your browser every day. A URL takes you to the same page every single time. It works roughly like a phone number: anyone who dials your number always reaches you.
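
To make the idea of a URL more concrete, here is a small sketch that splits one into its parts with Python's standard `urllib.parse` module. The URL is the TV listing address used later in this tutorial; the variable names are just for illustration.

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.themoviedb.org/tv?page=5"
parts = urlparse(url)

print(parts.scheme)           # the protocol used to talk to the server
print(parts.netloc)           # the host name of the server
print(parts.path)             # the resource on that server
print(parse_qs(parts.query))  # the query parameters as a dictionary
```

Every part plays a role: the scheme and host say where to send the request, while the path and the query say what to ask for.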

### **HTTP**

Last but not least, HTTP (HyperText Transfer Protocol) is the third part of the web. It's an invisible layer underneath the surface that handles the communication between your browser and a server. For example, when you log into Twitter, you type in your username and password. Then you hit the "Submit" button, and those details are sent to the Twitter servers in **an HTTP request**. After checking whether the username and password are correct, the servers send back **an HTTP response**.

In a nutshell, **HTTP** is the fundamental way that clients (such as your browser) **communicate** with **servers**, which are just giant computers.


### **What is `requests`**


Requests is a Python HTTP library that allows us to send HTTP requests to web servers from code, instead of using a browser to communicate with the web.

We use `pip`, a package-management system, to install and manage Python packages. Since the platform we selected is **Google Colab**, we type `!pip install` in a code cell to install `requests`. You will see `!pip` many more times when we install other packages.




To use prewritten functions from a library, we use the `import` statement. For example, once we run `import requests` after installation, we can use any function from the `requests` library.

In [None]:
!pip install requests --upgrade --quiet
import requests

ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

### **URL structure**

Since we are focusing on popular TV shows, the URL we land on is `https://themoviedb.org/tv`. After analyzing the URL structure, I found a trick: adding `?page=` followed by a number to the URL takes you to a specific page of the listing.

![](https://i.imgur.com/N6BpDf7.png)

![](https://i.imgur.com/PuSDSHx.png)

#### **requests.get()**

In order to **download a web page**, we use `requests.get()` to **send an HTTP request** to the **TMDb server**. The function returns a **response object**, which represents **the HTTP response**.





Since **a URL** always leads us to the same page and we know how TMDb structures its website, I set the variable `base_url` to `https://themoviedb.org` and `tmbd_url` to `https://themoviedb.org/tv?page=5`. To be explicit, I named the variable that holds the HTTP response, with the page contents and other information, `response`.



Later on, I intend to **design a function** that lets people input any page number they want and **passes the input value** in place of the **number `5`**. With this design, people can get either one page of data or **X pages of data** from TMDb.

The reason I say **"design thinking"** is that we should have **a product mindset**. Technology is made for people!

In [None]:
base_url = 'https://themoviedb.org'
tmbd_url = base_url + '/tv?page=' + '5'
print(tmbd_url)
response = requests.get(tmbd_url)


https://themoviedb.org/tv?page=5
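
The idea of letting people choose the page number can be sketched as a small function. This is only one possible design: the name `build_page_url` and the rule that page numbers start at 1 are my assumptions, not something TMDb documents.

```python
from urllib.parse import urlencode

BASE_URL = "https://themoviedb.org"


def build_page_url(page_number, base_url=BASE_URL):
    """Build the URL of one 'Popular TV Shows' listing page (hypothetical helper)."""
    # int() accepts 5 or "5" and raises ValueError for non-numeric input
    page = int(page_number)
    if page < 1:
        raise ValueError("page_number must be 1 or greater")
    return f"{base_url}/tv?{urlencode({'page': page})}"


print(build_page_url(5))
print(build_page_url("12"))
```

Validating the input before building the URL means a typo fails early with a clear error instead of producing a request for a page that does not exist.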


### **Status code**

Another thing we have to do is **check** whether the HTTP request succeeded and we got the HTTP response we expected. Since we're NOT using a browser, we don't get **immediate visual feedback** when a request fails.

In general, the way to check whether the server sent back a successful HTTP response is the **status code**. In the `requests` library, `requests.get` returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.


If the request was successful, `response.status_code` is set to a value between **200 and 299**.

The variable is named `response`, but what Python stores in it is **a response object**. Don't be confused!

In [None]:
response

<Response [200]>

In [None]:
type(response)

requests.models.Response

If `status_code` is 200, it means `requests.get(tmbd_url)` was successful.

In [None]:
response.status_code

200
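
To get a feel for what different status codes mean, here is a small sketch built on Python's standard `http.HTTPStatus` enum. The helper name `describe_status` is made up for this example; the success rule is the 200 to 299 range described above.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return (is_success, phrase) for an HTTP status code (hypothetical helper)."""
    is_success = 200 <= status_code <= 299
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not one of the standard codes known to Python
        phrase = "Unknown"
    return is_success, phrase


for code in (200, 404, 503):
    print(code, describe_status(code))
```

The `requests` library also has shortcuts for this check: `response.ok` is true for status codes below 400, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.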

The HTTP response contains HTML that is ready to be displayed in a browser. Here we can use `response.text` to retrieve the HTML document.

As a result, we have more than 186,000 characters! Let's use `page_contents[:1000]` to preview what we have.

In [None]:
page_contents = response.text
len(page_contents)

186253

Here is what the HTML looks like if we preview the first 1000 characters:

In [None]:
page_contents[:1000]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular TV Shows &#8212; The Movie Database (TMDb)</title>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    \n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="HandheldFriendly" content="True">\n    <meta name="MobileOptimized" content="320">\n    \n    <meta name="viewport" content="width=1120">\n    \n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" content="#032541">\n<meta name="theme-color" content="#032541">\n<link rel="apple-touch-i

- What you see above is the source code of the web page. It is written in a language called HTML.
- It defines the content and structure of the web page and how it is displayed.

Let's save the text to a file using the `open` function and a `with` statement.

In [None]:
# The with statement closes the file automatically, so no explicit close() is needed
with open('tmbd_popular_tv.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

In case you would like to see what the HTML looks like, you can now open the file by clicking the folder icon in the sidebar on your left.

![](https://i.imgur.com/pCzoncM.png)

Here's what you'll see when you open the HTML file in a browser:

![](https://i.imgur.com/nNyUaMx.png)


Here's what you'll see when you open the HTML file in a text editor:

![](https://i.imgur.com/XcKcSAA.png)

### Summary

- We know the origin of the World Wide Web and the basics of its three key components (HTML, URL and HTTP).

- We know how to use `requests.get` to request the page at a URL and get back a response object.

- We know how to check if a request is successful by using `response.status_code`.

#### Let's wrap the code up into a helper function

In [None]:
# Note: BeautifulSoup is imported in the next section; run that import before calling get_page()
def get_page(page_number):
    """Download the given page of popular TV shows and return it as a BeautifulSoup document"""
    page_url = 'https://www.themoviedb.org/tv' + '?page=' + str(page_number)
    response = requests.get(page_url)
    # Check the status code
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + page_url)
    # Return a BeautifulSoup object, naming the parser explicitly
    return BeautifulSoup(response.text, 'html.parser')

# Parse the HTML source code using the Beautiful Soup library

## What is Beautiful Soup?

You might wonder what the `BeautifulSoup(...)` call is as you look at the last line of my helper function `get_page()`. It was a hint for this section.

Beautiful Soup is **a Python package** for **parsing HTML and XML documents**. Beautiful Soup enables us to get data out of sequences of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML. It's a handy tool when it comes to web scraping. You can read more on their documentation site. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let's install the library and import **the BeautifulSoup class** from **the bs4 module.**

In [None]:
#Install Beautiful Soup package
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
# Using from ... import means you don't need to type bs4.BeautifulSoup every time
from bs4 import BeautifulSoup

In [None]:
# Open the file we saved earlier; the with statement closes it automatically
with open('tmbd_popular_tv.html', 'r', encoding='utf-8') as f:
    tmbd = f.read()

You can either use `open` to read the HTML file we just saved or use `response.text` to get the same data.

In [None]:
# Check if it's the same
tmbd == response.text

True

### Inspecting the HTML source code of a web page

![](https://i.imgur.com/zF02MhD.png)

#### HTML basics

Before we dive into how to inspect HTML, we should cover some HTML basics.

In the late 1980s, the British scientist Tim Berners-Lee invented HTML, which stands for HyperText Markup Language, while working at CERN in Switzerland. He didn't want the page contents displayed on the web to be just regular text files. To make communication more efficient and give authors room for creativity, he let authors define what each part of the text is. Hence, the content displayed on web pages is written in HTML.

With the Beautiful Soup library, we can specify `html.parser` so that Python reads the page as a tree of components instead of one long string.

We can use the `<title>` tag as an example to demonstrate what a **tag** is in HTML.

In [None]:
document = BeautifulSoup(tmbd,'html.parser')

#### The `<title>` tag

In [None]:
document.title

<title>Popular TV Shows — The Movie Database (TMDb)</title>

![](https://i.imgur.com/5Lx3dbu.png)

With **a BeautifulSoup object**, we can get the first tag of **a specific type in the HTML** by using the name of the tag as an attribute, as shown in the code cell below.

To be explicit, let's call the variable `title_tag`. We can then get the text inside the title tag.

In [None]:
title_tag = document.title
title_tag.text

'Popular TV Shows — The Movie Database (TMDb)'




#### **An HTML tag comprises three parts:**

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
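
The three parts listed above can be seen directly on a Beautiful Soup tag object. Here is a minimal sketch; the HTML snippet is a shortened copy of one title link from the page we downloaded, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

snippet = (
    '<h2><a href="/tv/79242" title="Chilling Adventures of Sabrina">'
    'Chilling Adventures of Sabrina</a></h2>'
)
soup = BeautifulSoup(snippet, "html.parser")
a_tag = soup.find("a")

print(a_tag.name)            # 1. the name of the tag
print(a_tag.attrs)           # 2. the attributes as a dictionary
print(a_tag["href"])         #    one attribute, looked up by its name
print(list(a_tag.children))  # 3. the children (here a single piece of text)
```

The `<h2>` tag in the snippet has one child as well, namely the `<a>` tag itself, which is how tags nest to form a tree.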


### Common tags and attributes

#### **Tags in HTML**

There are around 100 types of HTML tags, but on a day-to-day basis only around 15 to 20 of them are in common use, such as the `<div>`, `<p>`, `<section>`, `<img>`, and `<a>` tags.

![](https://i.imgur.com/sL4gp0l.png)


Of the many tags, I want to highlight the **`<a>` tag**, which can contain attributes such as `href` (hyperlink reference). Clicking an `<a>` tag takes users to another page or site, which is why it is called an **anchor** tag.

#### **Attributes**

Each tag supports several attributes. The following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)

## Building the scraper components







In this section, we start building the components of our scraper to extract the movie title, released date, user score, image link, and detail page URL. As mentioned earlier, the outcome we want is a CSV file with the following contents:

|#|movie_title|released_date|score|image_link|detail_page|
|-|-----------|-------------|-----|----------|-----------|
|1|The Falcon and the Winter Soldier|19 Mar 2021|78%|...|...|
|2|The Good Doctor|25 Sep 2017|86%|...|...|
|...|...|...|...|...|...|


### Inspecting HTML in the Browser

To view the **source code** of any webpage right within **your browser**, you can **right click** anywhere on a page and **select** the **"Inspect"** option. This opens the **"Developer Tools"**, where you can see the source code as **a tree**. You can expand and collapse various nodes and find the source code for a specific portion of the page.

![](https://i.imgur.com/goG29IX.png)


As shown in the screenshot above, I hovered over one of the TV programs to see how its content is laid out. I found that the data on each page *is* held within a `<div>` tag with the attribute `class="page_wrapper"`. Among its descendants are `<div>` tags with `class="content"`, one per TV program. That's a good sign. We do not need to know every attribute of every tag to extract our information, but it is helpful to analyze the structure of the HTML source code.

Since I've pulled a single page and turned it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to extract the pieces of information we want.

In [None]:
# Find the <div class="page_wrapper"> tag
page_wrapper = document.find('div',class_='page_wrapper')

In [None]:
#Pull all <div class="content"> under the parent tag <div class="page_wrapper">
content_tags = page_wrapper.find_all('div',class_='content')

Looking at the page, there should be 20 TV programs on one page. Therefore, each piece of code I write to extract a piece of information should yield 20 items.

If my output has **fewer** than **20** items, then something went **WRONG** and it is time to go back to the page itself and debug the code.

Also, at the end of each subsection, I loop over all the tags. This helps reveal hidden problems, such as a TV program with no image or no released date. This happens in real life: a TV program might announce "We're going to make it happen!" and then not arrive as expected.

On the page, the total number of items is 20, and I stored them in a variable called `content_tags`.

In [None]:
# Check if the total number is 20
len(content_tags)

20
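
The "20 items or something went wrong" rule can be turned into a tiny check that fails loudly. This is just a sketch: the helper name `check_item_count` and its default of 20 are my own choices, based on what the page showed at the time of writing.

```python
def check_item_count(items, expected=20, label="items"):
    """Raise ValueError if a scraped list does not have the expected length."""
    actual = len(items)
    if actual != expected:
        raise ValueError(f"Expected {expected} {label}, but found {actual}")
    return actual


# Example with placeholder data standing in for the scraped tags
placeholder_tags = ["tag"] * 20
print(check_item_count(placeholder_tags, label="content tags"))
```

Calling a check like this after each extraction step means a change in the page layout shows up as a clear error instead of a silently shorter list.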

In [None]:
# Preview the first 3 content tags
content_tags[:3]

[<div class="content">
 <div class="consensus tight">
 <div class="outer_ring">
 <div class="user_score_chart 5aea223992514172d9001199" data-bar-color="#21d07a" data-percent="84.0" data-track-color="#204529">
 <div class="percent">
 <span class="icon icon-r84"></span>
 </div>
 </div>
 </div>
 </div>
 <h2><a href="/tv/79242" title="Chilling Adventures of Sabrina">Chilling Adventures of Sabrina</a></h2>
 <p>Oct 26, 2018</p>
 </div>, <div class="content">
 <div class="consensus tight">
 <div class="outer_ring">
 <div class="user_score_chart 5256cfba19c2956ff60a01e4" data-bar-color="#21d07a" data-percent="77.0" data-track-color="#204529">
 <div class="percent">
 <span class="icon icon-r77"></span>
 </div>
 </div>
 </div>
 </div>
 <h2><a href="/tv/1418" title="The Big Bang Theory">The Big Bang Theory</a></h2>
 <p>Sep 24, 2007</p>
 </div>, <div class="content">
 <div class="consensus tight">
 <div class="outer_ring">
 <div class="user_score_chart 52589cbd760ee3466161068e" data-bar-color="#21

### 1. Movie titles

As noted above, each TV program is nested under a `<div class="content">` tag. Looking into the details, I could see that movie titles are listed in `<a>` tags with the attribute `title="[THE MOVIE TITLE]"`. The text of each `<a>` tag is the same as the value of its `title` attribute.

Let's preview the first TV program we captured.

In [None]:
# View the first content tag we have under the id page_1
first_content_tag = content_tags[0]
first_content_tag

<div class="content">
<div class="consensus tight">
<div class="outer_ring">
<div class="user_score_chart 5aea223992514172d9001199" data-bar-color="#21d07a" data-percent="84.0" data-track-color="#204529">
<div class="percent">
<span class="icon icon-r84"></span>
</div>
</div>
</div>
</div>
<h2><a href="/tv/79242" title="Chilling Adventures of Sabrina">Chilling Adventures of Sabrina</a></h2>
<p>Oct 26, 2018</p>
</div>

Having checked the first TV program in our collected data, we found that the movie title is housed in the `<a>` tag. So we can use Beautiful Soup's `find()` method to find that `<a>` tag.

In [None]:
#To get the first title via a tag
a_tag = first_content_tag.find('a')
a_tag

<a href="/tv/79242" title="Chilling Adventures of Sabrina">Chilling Adventures of Sabrina</a>

In [None]:
# To get text inside a tag
a_tag.text

'Chilling Adventures of Sabrina'

In [None]:
# Use a list comprehension to collect every title on the page
content_tags = page_wrapper.find_all('div',class_="content")
movie_title_list = [content_tag.find('a').text for content_tag in content_tags ]
print("the total number of tv program on the page is {}".format(len(movie_title_list)))
movie_title_list

the total number of tv program on the page is 20


['Chilling Adventures of Sabrina',
 'The Big Bang Theory',
 'Regular Show',
 'Lady, la vendedora de rosas',
 'Arrow',
 'Tokyo Revengers',
 'Once',
 'Rebelde',
 'Money Heist',
 'New Amsterdam',
 'Young Sheldon',
 'Teresa',
 'The Nevers',
 'Batwoman',
 'Friends',
 'Rebelde Way',
 'SEAL Team',
 'American Gods',
 'Endless Love',
 'Chicago P.D.']
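
One thing to keep in mind is that `find()` returns `None` when a tag is missing, and calling `.text` on `None` raises an error. Here is a hedged sketch of a more defensive title extractor. The helper name `extract_title`, the `"NotFound"` default, and the second `<div>` without a link are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_title(content_tag, default="NotFound"):
    """Return the title of one content tag, or a default if no <a> tag exists."""
    a_tag = content_tag.find("a")
    if a_tag is None:
        return default
    # Prefer the title attribute, fall back to the visible text
    title = a_tag.get("title") or a_tag.get_text(strip=True)
    return title or default


html = """
<div class="content"><h2><a href="/tv/79242" title="Chilling Adventures of Sabrina">Chilling Adventures of Sabrina</a></h2></div>
<div class="content"><p>Oct 26, 2018</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
titles = [extract_title(tag) for tag in soup.find_all("div", class_="content")]
print(titles)
```

With this approach the list always has one entry per content tag, so the count check still works even when a title is missing.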

### 2. Released date

The release date lives in the `<p>` tag of the same `<div class="content">` tag.

In [None]:
#To get the first released date via p tag
p_tag = content_tags[0].find('p')
p_tag.text

'Oct 26, 2018'

You might not have noticed a small **issue** with the date. Our output is a CSV file, and CSV stands for comma-separated values. It is a delimited text file that uses **a comma** to **separate values**, and each line of the file is a data record. The date `Oct 26, 2018` itself contains a comma, so writing it to the file unchanged could split it into two values.

In this case, we will mess up the file if we don't clean up the data. Fortunately, Python has **a module** called `datetime` to help us. I also wrote an if/else statement to handle TV programs with NO released date, that is, with no value to be found.

In [None]:
import datetime as dt
#To get a datetime object
date = dt.datetime.strptime(content_tags[0].find('p').text, "%b %d, %Y")
#To format the string as year-month-day (month and day are not zero-padded)
rel_date = '{}-{}-{}'.format(date.year,date.month,date.day)
print(rel_date)



2018-10-26
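
To see the comma issue in action, here is a small sketch using Python's standard `csv` module. It is not the approach taken in this tutorial, which reformats the date instead; it only shows why a naive split on commas breaks and how the `csv` module quotes such values.

```python
import csv
import io

row = ["Chilling Adventures of Sabrina", "Oct 26, 2018"]

# Write the row to an in-memory text buffer instead of a real file
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()
print(csv_text)

# Splitting on every comma breaks the date into two pieces
naive_split = csv_text.strip().split(",")
print(len(naive_split))

# Reading the text back with csv.reader recovers the original two values
parsed_row = next(csv.reader(io.StringIO(csv_text)))
print(parsed_row)
```

So a date with a comma is only a problem when the file is written or read by hand; either cleaning the value or using the `csv` module avoids it.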


In [None]:
# Helper function to clean up a date string such as 'Oct 26, 2018'
def convert_date(date_text):
    """Return the date as 'year-month-day', or 'NotFound' if the text is empty or missing"""
    if not date_text:
        return "NotFound"
    date = dt.datetime.strptime(date_text, "%b %d, %Y")
    return '{}-{}-{}'.format(date.year, date.month, date.day)

Let's loop over all the tags to see if we can capture the same type of data for every TV program.

In [None]:
import datetime as dt
#Write a for loop to see all of dates on the same page
content_tags = page_wrapper.find_all('div',class_="content")
p_tags = [content_tag.find('p') for content_tag in content_tags]
released_date = [convert_date(p_tag.text) for p_tag in p_tags ]
print("the total number of released dates on the page is {}".format(len(released_date)))
released_date

the total number of released dates on the page is 20


['2018-10-26',
 '2007-9-24',
 '2010-9-6',
 'NotFound',
 '2012-10-10',
 '2021-4-11',
 '2017-6-19',
 '2004-10-4',
 '2017-5-2',
 '2018-9-25',
 '2017-9-25',
 '2010-8-2',
 '2021-4-11',
 '2019-10-6',
 '1994-9-22',
 '2002-5-27',
 '2017-9-27',
 '2017-4-30',
 '2015-10-14',
 '2014-1-8']

Did you notice that one of the values is **NotFound**? That TV program has no released date on the page.

**Missing data is worse than no data!**
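
The dates in the list above, such as `2007-9-24`, are not zero-padded, which makes them harder to sort as text. As an alternative sketch (not the function used in this tutorial), the date can be converted to the ISO format `YYYY-MM-DD`. The helper name `to_iso_date` and the choice of `None` for missing values are my assumptions, and `%b` assumes English month abbreviations.

```python
import datetime as dt


def to_iso_date(date_text, missing=None):
    """Convert text like 'Sep 24, 2007' to '2007-09-24' (hypothetical helper)."""
    if date_text is None or not date_text.strip():
        return missing
    parsed = dt.datetime.strptime(date_text.strip(), "%b %d, %Y")
    return parsed.date().isoformat()


print(to_iso_date("Sep 24, 2007"))
print(to_iso_date(""))
```

Returning `None` for a missing date instead of a text marker leaves the decision of how to display the gap to the code that writes the file.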

### 3. User score

User scores are located in the **`<span>` tags**. Initially, I tried getting the score from the `<div>` tag using its `data-percent` attribute (for example, `data-percent="84.0"`); you can read the value of a tag's attribute with `tag["attribute"]`. But I decided to go with the `<span>` tag simply because it made my job easier.

Be smart with your tasks!

In [None]:
#To get the span tag 
span_tag = content_tags[0].find('span')
span_tag

<span class="icon icon-r84"></span>

In [None]:
#To find the value of the attribute name "class"
user_score = span_tag['class'][1][-2:]
user_score

'84'

In [None]:
#To get a list of user score
content_tags = page_wrapper.find_all('div',class_="content")
span_tags = [content_tag.find('span') for content_tag in content_tags]
user_scores = [span_tag['class'][1][-2:] for span_tag in span_tags ]
print("the total number of user_score on the page is {}".format(len(user_scores)))
user_scores

the total number of user_score on the page is 20


['84',
 '77',
 '87',
 '74',
 '66',
 '90',
 '88',
 '85',
 '83',
 '84',
 '80',
 '75',
 '87',
 '73',
 '84',
 '84',
 '78',
 '71',
 '77',
 '84']
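
Taking the last two characters of the class name works for the scores shown here, but it would misread a score of 100 or a single-digit score. As a hedged alternative, here is a sketch that reads the `data-percent` attribute instead. The helper name `extract_score` is made up, and the second `<div>` with a score of 100 is an invented example, not data from the page.

```python
from bs4 import BeautifulSoup


def extract_score(content_tag):
    """Return the user score as a float, or None if it cannot be found."""
    chart = content_tag.find("div", class_="user_score_chart")
    if chart is None or not chart.has_attr("data-percent"):
        return None
    return float(chart["data-percent"])


html = """
<div class="content">
  <div class="user_score_chart 5aea223992514172d9001199" data-percent="84.0">
    <div class="percent"><span class="icon icon-r84"></span></div>
  </div>
</div>
<div class="content">
  <div class="user_score_chart hypothetical-id" data-percent="100.0">
    <div class="percent"><span class="icon icon-r100"></span></div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
scores = [extract_score(tag) for tag in soup.find_all("div", class_="content")]
print(scores)

# The slicing approach would cut a three-digit score down to two characters
print("icon-r100"[-2:])
```

Which approach is better depends on how stable each attribute is on the real page, so it is worth checking both against the current HTML.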

### 4. Image link

Images were a bit tricky. Since they are housed in a different type of tag under the `<div class="page_wrapper">`, we have to change our approach. So we create a variable `a_tags_for_img_tags` to capture all the `<a>` tags containing the images.

In [None]:
#To get the first image for the website
a_tags_for_img_tags = page_wrapper.find_all('a', class_='image')
a_tags_for_img_tags[0]

<a class="image" href="/tv/79242" title="Chilling Adventures of Sabrina">
<img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg" srcset="/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg 1x, /t/p/w440_and_h660_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg 2x"/>
</a>

Once we have the list of `<a>` tags with `class="image"`, we can see that each one contains an `<img>` tag with `class="poster"`.

Let's get one image to see if we are successful.

In [None]:
image = a_tags_for_img_tags[0].find('img',class_='poster')
image

<img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg" srcset="/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg 1x, /t/p/w440_and_h660_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg 2x"/>

As said earlier, our **intention** is to get the **URL** of the **image**. Having examined the tag structure, we find that **the value of the `src` attribute** gives us the path to the image. But we have to concatenate it with our `base_url`, which is `https://themoviedb.org`.

In [None]:
imageURL = base_url + a_tags_for_img_tags[0].find('img',class_='poster')['src']
print(imageURL)

https://themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg
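
String concatenation works here because `src` always starts with a slash. As a sketch of a slightly more forgiving alternative, `urljoin` from Python's standard library combines a base URL with a relative path and leaves already-absolute URLs untouched. The `image.example.org` address is invented for the example.

```python
from urllib.parse import urljoin

BASE_URL = "https://themoviedb.org"

src = "/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg"
image_url = urljoin(BASE_URL, src)
print(image_url)

# An absolute URL (hypothetical) is returned unchanged
absolute_src = "https://image.example.org/poster.jpg"
print(urljoin(BASE_URL, absolute_src))
```

The same call can be used for the `href` values of the detail pages, since they are relative paths as well.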


So far, we have captured data from one tag at a time, whereas we have 20 tags. We also have to consider the situation where the site does not have an image for a TV program. I wrote an if/else statement to handle this situation. Let's wrap this up into a function.


In [None]:
# To get the image URL from one img tag
def parse_image(img_tag):
    """Return a dictionary with the image URL of one <img> tag, or 'No photo' if the tag is missing"""
    if img_tag is not None:
        img_url = {'image_url': 'https://www.themoviedb.org' + img_tag['src']}
    else:
        img_url = {'image_url': "No photo"}
    return img_url

Let's apply the same idea to all the tags to see if we can get a list of 20 image URLs.

In [None]:
# To get all a tags within the class='image'
base_url = 'https://themoviedb.org'
a_tags_for_img_tags = page_wrapper.find_all('a', class_='image')
img_tags = [a_tag_for_img.find('img',class_='poster') for a_tag_for_img in a_tags_for_img_tags]
images_list = [base_url+img_tag['src'] for img_tag in img_tags ]
print("the total number of images on the page is {}".format(len(images_list)))
images_list

the total number of images on the page is 20


['https://themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/ooBGRQBdbGzBxAVfExiO8r7kloA.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/mS5SLxMYcKfUxA0utBSR5MOAWWr.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/hkiNJTUqgltJvqEpFP1QpzuujO2.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/gKG5QGz5Ngf8fgWpBsWtlg5L2SF.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/qSEKyf0fWhrCEQ3LTwLqe41eSvR.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/d4vPg3QsTJJh6C5MHARTb5CyqOu.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/iehrrb9CmYCiV1hXp5pdGQQmGNe.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/MoEKaPFHABtA1xKoOteirGaHl1.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/wKTAz8fkoXJoHqPpi4ArAUGDtco.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/aESxB2HblKlDzma39xVefa20pbW.jpg',
 'https://themoviedb.org/t/p/w220_and_h330_face/17QQBTahkRE23bxJ4PmQ7tjMjyX.jpg',
 'https://themovi

### 5. Detailed page

For the detail page, I tried many methods to figure out how to get the URL. It turns out each TV program has its own **unique identifier**, and each identifier lives within the `href` attribute of an **`<a>` tag**. This makes sense, because visitors usually click on an image or title to find out more.

The attribute `href` stands for hyperlink reference and usually comes with `<a>` tags.

In [None]:
#To get the value of <a href=XXX>
one_a_href= a_tags_for_img_tags[0]['href']
one_a_href

'/tv/79242'

You can concatenate it with `base_url`. The resulting link takes you to the specific detail page of the TV program.

In [None]:
#To concatenate the base url with the href attribute of the a tag
details_page = base_url + content_tags[0].find('a')['href']
print(details_page)

https://themoviedb.org/tv/79242


In [None]:
# To get a list of detail pages on the same page
base_url = 'http://themoviedb.org'
content_tags = page_wrapper.find_all('div',class_="content")
a_tags_for_details = [content_tag.find('a') for content_tag in content_tags]
detail_pages = [base_url+a_tag_for_details['href'] for a_tag_for_details in a_tags_for_details ]
print("the total number of detail pages on the page is {}".format(len(detail_pages)))
detail_pages

the total number of detail pages on the page is 20


['http://themoviedb.org/tv/79242',
 'http://themoviedb.org/tv/1418',
 'http://themoviedb.org/tv/31132',
 'http://themoviedb.org/tv/92396',
 'http://themoviedb.org/tv/1412',
 'http://themoviedb.org/tv/105009',
 'http://themoviedb.org/tv/72637',
 'http://themoviedb.org/tv/12637',
 'http://themoviedb.org/tv/71446',
 'http://themoviedb.org/tv/80350',
 'http://themoviedb.org/tv/71728',
 'http://themoviedb.org/tv/12926',
 'http://themoviedb.org/tv/80828',
 'http://themoviedb.org/tv/89247',
 'http://themoviedb.org/tv/1668',
 'http://themoviedb.org/tv/9027',
 'http://themoviedb.org/tv/71789',
 'http://themoviedb.org/tv/46639',
 'http://themoviedb.org/tv/65555',
 'http://themoviedb.org/tv/58841']

### Summary

In this section, we learned how to extract information from an HTML document using the Beautiful Soup library.

- We have learned HTML basics

- We have analyzed the HTML structure and contents

- We have successfully extracted the title, release date, image URL, detail page URL and user score of each show.

- We have written some helper functions such as `parse_image()` and `convert_date()`

### Let's wrap them up into a function

In the next code cell, I combine the code we have written so far into a few functions. It helps to build things up step by step!

In [None]:
def parse_content(content_tag):
    """Parse one content <div> into a dictionary"""
    # <a> tag contains the title and the link to the detail page
    a_tag = content_tag.find('a')
    movie_title = a_tag.text
    # <p> tag contains the release date
    p_tag = content_tag.find('p')
    rel_date = convert_date(p_tag.text)
    # URL of the detail page
    det_url = 'https://www.themoviedb.org' + a_tag['href']
    # <span> tag whose last class name ends with the user score
    span_tag = content_tag.find('span')
    user_score = span_tag['class'][-1][-2:] + '%'

    return {
        'movie_title': movie_title,
        'released_date': rel_date,
        'user_score': user_score,
        'detail_url': det_url
    }

def parse_image(img_tag):
    """Return a dictionary with the image URL of one <img> tag"""
    if img_tag is not None:
        img_url = {'image_url': 'https://www.themoviedb.org' + img_tag['src']}
    else:
        img_url = {'image_url': "Not_found"}
    return img_url

# Helper function to clean the date text
def convert_date(p_tag):
    """Convert text such as 'Oct 26, 2018' to '2018-10-26'"""
    # Missing or empty text means the page shows no release date
    if p_tag is None or p_tag == '':
        return "Not Found"
    date = dt.datetime.strptime(p_tag, "%b %d, %Y")
    return '{}-{}-{}'.format(date.year, date.month, date.day)
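
`convert_date()` builds the string by hand, so months and days are not zero-padded (for example `2007-9-24`). If you want dates that also sort correctly as text, one option is `strftime('%Y-%m-%d')`. The sketch below uses a hypothetical variant named `convert_date_iso()`, which is not part of the notebook; as an extra assumption, it returns `"Not Found"` when the text does not match the expected pattern.

```python
import datetime as dt

def convert_date_iso(date_text):
    """Hypothetical variant of convert_date() that zero-pads month and day."""
    if not date_text or not date_text.strip():
        return "Not Found"
    try:
        date = dt.datetime.strptime(date_text.strip(), "%b %d, %Y")
    except ValueError:
        # The text did not match the expected "Sep 24, 2007" pattern
        return "Not Found"
    return date.strftime("%Y-%m-%d")

print(convert_date_iso("Sep 24, 2007"))
print(convert_date_iso(""))
```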


# Compile extracted information into Python lists and dictionaries

## Dictionary Concatenation
So far, we have successfully captured information from a single tag, and what we get back is a dictionary.

In [None]:
content_tag = content_tags[0]
parse_content(content_tag)

{'detail_url': 'https://www.themoviedb.org/tv/79242',
 'movie_title': 'Chilling Adventures of Sabrina',
 'released_date': '2018-10-26',
 'user_score': '84%'}

In [None]:
img_tag = img_tags[0]
parse_image(img_tag)

{'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg'}

**How can we combine these two dictionaries?**


Right, we can use `dict.update()` to achieve what we want.

In [None]:
# Store the output of each function in a variable
d1 = parse_content(content_tag)
d2 = parse_image(img_tag)
# Copy d1 so that update() does not modify it, then add the keys from d2
d3 = dict(d1)
d3.update(d2)
d3

{'detail_url': 'https://www.themoviedb.org/tv/79242',
 'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg',
 'movie_title': 'Chilling Adventures of Sabrina',
 'released_date': '2018-10-26',
 'user_score': '84%'}
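
As an aside, the copy-then-`update()` pattern can also be written in one expression with dictionary unpacking (Python 3.5 and later). This is only an alternative sketch; the two small dictionaries below are shortened, made-up stand-ins for the outputs of `parse_content()` and `parse_image()`.

```python
# Shortened example values, made up for illustration
d1 = {'movie_title': 'Chilling Adventures of Sabrina', 'user_score': '84%'}
d2 = {'image_url': 'Not_found'}

# Build a new dictionary from both; d1 and d2 stay unchanged.
# If a key appears in both, the value from d2 wins.
merged = {**d1, **d2}
print(merged)
```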

Let's wrap this up into a helper function. When we pass in `content_tags` and `img_tags`, the lists of tags that `parse_content()` and `parse_image()` take as input one at a time, we get back a list of dictionaries.

In [None]:
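def all_content(content_tags, img_tags):
    """Merge the parsed content and image of each show into one dictionary"""
    contents = []
    # zip() pairs each content tag with its image tag, whatever the page length
    for content_tag, img_tag in zip(content_tags, img_tags):
        d1 = parse_content(content_tag)
        d2 = parse_image(img_tag)
        d4 = dict(d1)
        d4.update(d2)
        contents.append(d4)
    return contents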

Let's test this function out!

In [None]:
all_content_list = all_content(content_tags,img_tags)
print(type(all_content_list))
all_content_list[:5]

<class 'list'>


[{'detail_url': 'https://www.themoviedb.org/tv/79242',
  'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg',
  'movie_title': 'Chilling Adventures of Sabrina',
  'released_date': '2018-10-26',
  'user_score': '84%'},
 {'detail_url': 'https://www.themoviedb.org/tv/1418',
  'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/ooBGRQBdbGzBxAVfExiO8r7kloA.jpg',
  'movie_title': 'The Big Bang Theory',
  'released_date': '2007-9-24',
  'user_score': '77%'},
 {'detail_url': 'https://www.themoviedb.org/tv/31132',
  'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/mS5SLxMYcKfUxA0utBSR5MOAWWr.jpg',
  'movie_title': 'Regular Show',
  'released_date': '2010-9-6',
  'user_score': '87%'},
 {'detail_url': 'https://www.themoviedb.org/tv/92396',
  'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/hkiNJTUqgltJvqEpFP1QpzuujO2.jpg',
  'movie_title': 'Lady, la vendedora de rosas',
  'released_date': 'Not Found',
  'user_s

Great, we're getting close to what we want!

Now we can write another function that takes the output of `get_page()` and passes the tags it finds to `all_content()`.

In [None]:
def get_one_page(doc):
    """Parse all content on one page given a BeautifulSoup object"""
    # Get the div containing all contents
    page_wrappers = doc.find_all('div', class_='page_wrapper')
    # Get the content tags containing title, release date, user score
    content_tags = page_wrappers[0].find_all('div', class_="content")
    # Get the <a> tags containing an <img>
    a_tags = page_wrappers[0].find_all('a', class_='image')
    # Get a list of all <img> tags
    img_tags = [tag.find('img', class_='poster') for tag in a_tags]
    # Put them all together into a list of dictionaries
    all_page_contents = all_content(content_tags, img_tags)
    return all_page_contents

# Write information to CSV files

Our goal is a CSV file, so let's write some simple code to save the results to one.

Before we do that, let's test how to write a single item to a file.

As you can see, `d3` holds just one item.

In [None]:
d3

{'detail_url': 'https://www.themoviedb.org/tv/79242',
 'image_url': 'https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg',
 'movie_title': 'Chilling Adventures of Sabrina',
 'released_date': '2018-10-26',
 'user_score': '84%'}

Let's use the `open()` function in a `with` statement to create a text file called `test.csv`, which is saved in the Google Colab session.

In [None]:
with open("test.csv", 'w') as f:
    # Write the headers (field names) in the first line
    headers = list(d3.keys())
    f.write(','.join(headers) + '\n')

    # Write the values of the item on the next line
    values = []
    for header in headers:
        values.append(str(d3.get(header, "")))
    f.write(','.join(values) + "\n")

We can use `open()` again to check that `test.csv` was created, and print its contents.

In [None]:
with open('test.csv', 'r') as f:
    content = f.read()
    print(content)

movie_title,released_date,user_score,detail_url,image_url
Chilling Adventures of Sabrina,2018-10-26,84%,https://www.themoviedb.org/tv/79242,https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg



Let's write helper functions for writing and reading a CSV file.

In [None]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

def read_csv(path):
    """Return the whole file as one string"""
    with open(path, 'r') as f:
        content = f.read()
    return content

Now we can write `all_content_list` to the file `another_test.csv` using the function we created above; the loop over the items happens inside `write_csv()`.

In [None]:
write_csv(all_content_list,'another_test.csv')

In [None]:
read_csv('another_test.csv')

'movie_title,released_date,user_score,detail_url,image_url\nChilling Adventures of Sabrina,2018-10-26,84%,https://www.themoviedb.org/tv/79242,https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg\nThe Big Bang Theory,2007-9-24,77%,https://www.themoviedb.org/tv/1418,https://www.themoviedb.org/t/p/w220_and_h330_face/ooBGRQBdbGzBxAVfExiO8r7kloA.jpg\nRegular Show,2010-9-6,87%,https://www.themoviedb.org/tv/31132,https://www.themoviedb.org/t/p/w220_and_h330_face/mS5SLxMYcKfUxA0utBSR5MOAWWr.jpg\nLady, la vendedora de rosas,Not Found,74%,https://www.themoviedb.org/tv/92396,https://www.themoviedb.org/t/p/w220_and_h330_face/hkiNJTUqgltJvqEpFP1QpzuujO2.jpg\nArrow,2012-10-10,66%,https://www.themoviedb.org/tv/1412,https://www.themoviedb.org/t/p/w220_and_h330_face/gKG5QGz5Ngf8fgWpBsWtlg5L2SF.jpg\nTokyo Revengers,2021-4-11,90%,https://www.themoviedb.org/tv/105009,https://www.themoviedb.org/t/p/w220_and_h330_face/qSEKyf0fWhrCEQ3LTwLqe41eSvR.jpg\nOnce,2017-6-19,88%,https://w
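
One thing to watch for in the output above: the title `Lady, la vendedora de rosas` contains a comma, so joining values with `','` gives that row one column too many. A possible fix is the standard library's `csv` module, which quotes such values. The function name `write_csv_quoted()` and the file name `quoted_test.csv` are hypothetical; this is a sketch of the idea, not the notebook's `write_csv()`.

```python
import csv

def write_csv_quoted(items, path):
    """Write a list of dictionaries to a CSV file, quoting values that contain commas."""
    if len(items) == 0:
        return
    headers = list(items[0].keys())
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerows(items)

# A one-row sample based on the output above
sample = [{'movie_title': 'Lady, la vendedora de rosas',
           'released_date': 'Not Found',
           'user_score': '74%'}]
write_csv_quoted(sample, 'quoted_test.csv')

# Read the file back to confirm that the title survived as one field
with open('quoted_test.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['movie_title'])
```

Reading the file back with `csv.DictReader` returns the title as a single field, which suggests the row would also load cleanly in a spreadsheet or in pandas.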

In [None]:
jovian.commit()

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/shenghongzhong/210424-project001-web-scraping-tmbd


'https://jovian.ai/shenghongzhong/210424-project001-web-scraping-tmbd'

## Summary

In this section, we have written the following functions:

- `get_page()` for making an HTTP request and returning the page as a Beautiful Soup object

- `parse_content()` for parsing the title, release date, detail page URL and user score of one show on the web page

- `convert_date()` for cleaning date data into a nice format

- `parse_image()` for parsing one image tag into a dictionary

- `all_content()` for merging the outputs of `parse_content()` and `parse_image()` into a list of dictionaries

- `get_one_page()` for combining `all_content()` with the output of `get_page()`; it returns a list of dictionaries

- `write_csv()` and `read_csv()` for writing a list of dictionaries, such as the output of `all_content()`, to a CSV file and reading it back



Now every piece of the scraper is in place. We need one more function that gets a single page of data and outputs a CSV file.

# One page web scraper

We can now combine all the scraper components into a single function that gets one page of data for a given page number.

In [None]:
# To get one page of data
def web_scraper(base_url=None, page_number=None, path=None, get_content=False):
    """Get the content of one page and write it to a CSV file.

    If get_content is True, return the content instead of writing a file.
    """
    # If path isn't specified, name the file after the current time
    if path is None:
        from datetime import datetime
        path = datetime.now().strftime("%Y-%b-%d %H:%M:%S") + '.csv'
    # base_url is kept for reference; get_page() builds the URL itself
    if base_url is None:
        base_url = 'https://www.themoviedb.org'
    # If page_number isn't specified, default to page 1
    if page_number is None:
        page_number = 1
    page_doc = get_page(page_number)
    page_content = get_one_page(page_doc)
    if get_content:
        return page_content

    write_csv(page_content, path)
    print('You have successfully scraped data at the page {}, written to file {}'.format(page_number, path))
    return path


Import all libraries and modules


In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from datetime import datetime
import datetime as dt
import time

In [None]:
csv_file = web_scraper(page_number=5)

You have successfully scraped data at the page 5, written to file 2021-Apr-28 17:01:02.csv


Use a `with` statement to check the data we captured from page 5.

In [None]:
with open(csv_file, 'r') as f:
    content = f.read()
    print(content)

movie_title,released_date,user_score,detail_url,image_url
Chilling Adventures of Sabrina,2018-10-26,84%,https://www.themoviedb.org/tv/79242,https://www.themoviedb.org/t/p/w220_and_h330_face/yxMpoHO0CXP5o9gB7IfsciilQS4.jpg
The Big Bang Theory,2007-9-24,77%,https://www.themoviedb.org/tv/1418,https://www.themoviedb.org/t/p/w220_and_h330_face/ooBGRQBdbGzBxAVfExiO8r7kloA.jpg
Regular Show,2010-9-6,87%,https://www.themoviedb.org/tv/31132,https://www.themoviedb.org/t/p/w220_and_h330_face/mS5SLxMYcKfUxA0utBSR5MOAWWr.jpg
Lady, la vendedora de rosas,Not Found,74%,https://www.themoviedb.org/tv/92396,https://www.themoviedb.org/t/p/w220_and_h330_face/hkiNJTUqgltJvqEpFP1QpzuujO2.jpg
Arrow,2012-10-10,66%,https://www.themoviedb.org/tv/1412,https://www.themoviedb.org/t/p/w220_and_h330_face/gKG5QGz5Ngf8fgWpBsWtlg5L2SF.jpg
Tokyo Revengers,2021-4-11,90%,https://www.themoviedb.org/tv/105009,https://www.themoviedb.org/t/p/w220_and_h330_face/qSEKyf0fWhrCEQ3LTwLqe41eSvR.jpg
Once,2017-6-19,88%,https://www.themo

# Extract and combine data from multiple pages 

Since we can get one page of data, we can write another function on top of that helper function to collect several pages. I named it `ultra_scraper()`.

In [None]:
def ultra_scraper(base_url=None, page_number=None, path=None, get_content=False):
    """Get the content of pages 1 to page_number and write it all to one CSV file"""
    # If path isn't specified, name the file after the current time
    if path is None:
        from datetime import datetime
        path = datetime.now().strftime("%Y-%b-%d %H:%M:%S") + '.csv'
    if base_url is None:
        base_url = 'https://www.themoviedb.org'
    # If page_number isn't specified, scrape only page 1
    if page_number is None:
        page_number = 1
    page_contents = []
    for i in range(1, page_number + 1):
        # Pause for a second between requests to go easy on the server
        time.sleep(1)
        print('Downloading page {}...Please patiently wait...'.format(str(i)))
        page_contents += web_scraper(base_url=base_url, page_number=i, path=path, get_content=True)
    time.sleep(1)
    print('Downloading is done. Thank you for your patience!')
    import pandas
    dataframe = pandas.DataFrame(page_contents)
    dataframe.to_csv(path, index=None)
    print('You have successfully scraped data at the {} pages, written to file {}'.format(page_number, path))
    return path
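
The loop in `ultra_scraper()` follows a common pattern: fetch pages one at a time, pause between requests, and extend a single list. The sketch below isolates that pattern with a stand-in fetch function, so it runs without network access. `collect_pages()` and `fake_fetch()` are hypothetical names used only for this illustration.

```python
import time

def collect_pages(fetch_page, page_count, delay=1.0):
    """Call fetch_page(1..page_count) and combine the returned lists."""
    rows = []
    for page_number in range(1, page_count + 1):
        rows += fetch_page(page_number)
        if page_number < page_count:
            time.sleep(delay)  # pause between requests
    return rows

def fake_fetch(page_number):
    """Stand-in for web_scraper(..., get_content=True): two fake rows per page."""
    return [{'page': page_number, 'row': i} for i in range(2)]

all_rows = collect_pages(fake_fetch, page_count=3, delay=0)
print(len(all_rows))
```

Passing the fetch function in as an argument keeps the looping logic separate from the scraping logic, so the loop can be tried out without contacting the website.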

Use `ultra_scraper()` to extract data from the first 20 pages.

In [None]:
data = ultra_scraper(page_number=20)


Downloading page 1...Please patiently wait...
Downloading page 2...Please patiently wait...
Downloading page 3...Please patiently wait...
Downloading page 4...Please patiently wait...
Downloading page 5...Please patiently wait...
Downloading page 6...Please patiently wait...
Downloading page 7...Please patiently wait...
Downloading page 8...Please patiently wait...
Downloading page 9...Please patiently wait...
Downloading page 10...Please patiently wait...
Downloading page 11...Please patiently wait...
Downloading page 12...Please patiently wait...
Downloading page 13...Please patiently wait...
Downloading page 14...Please patiently wait...
Downloading page 15...Please patiently wait...
Downloading page 16...Please patiently wait...
Downloading page 17...Please patiently wait...
Downloading page 18...Please patiently wait...
Downloading page 19...Please patiently wait...
Downloading page 20...Please patiently wait...
...
Downloading is done. Thank you for your patience!
You have succes

In [None]:
data

'2021-Apr-28 17:03:34.csv'

Check that the data was extracted successfully.

In [None]:
pd.read_csv(data)

Unnamed: 0,movie_title,released_date,user_score,detail_url,image_url
0,The Falcon and the Winter Soldier,2021-3-19,79%,https://www.themoviedb.org/tv/88396,https://www.themoviedb.org/t/p/w220_and_h330_f...
1,The Good Doctor,2017-9-25,86%,https://www.themoviedb.org/tv/71712,https://www.themoviedb.org/t/p/w220_and_h330_f...
2,Luis Miguel: The Series,2018-4-22,81%,https://www.themoviedb.org/tv/79008,https://www.themoviedb.org/t/p/w220_and_h330_f...
3,The Flash,2014-10-7,77%,https://www.themoviedb.org/tv/60735,https://www.themoviedb.org/t/p/w220_and_h330_f...
4,Van Helsing,2016-9-23,69%,https://www.themoviedb.org/tv/65820,https://www.themoviedb.org/t/p/w220_and_h330_f...
...,...,...,...,...,...
395,NOVA,1974-3-3,71%,https://www.themoviedb.org/tv/3562,https://www.themoviedb.org/t/p/w220_and_h330_f...
396,Demon Slayer Academy Valentine Chapter,2021-2-14,NR%,https://www.themoviedb.org/tv/118405,https://www.themoviedb.org/t/p/w220_and_h330_f...
397,ZDF-Mittagsmagazin,1989-10-2,NR%,https://www.themoviedb.org/tv/105002,Not_found
398,Batman: The Animated Series,1992-9-5,83%,https://www.themoviedb.org/tv/2098,https://www.themoviedb.org/t/p/w220_and_h330_f...
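
Note the `NR%` values in the last rows: shows without a rating still get a `%` appended by `parse_content()`. Before any analysis, the score column would need cleaning. A minimal sketch, assuming scores look like `'84%'` or `'NR%'`; `score_to_int()` is a hypothetical helper, not part of the scraper.

```python
def score_to_int(user_score):
    """Turn '84%' into 84; return None for unrated values such as 'NR%'."""
    digits = user_score.rstrip('%')
    return int(digits) if digits.isdigit() else None

scores = ['79%', '86%', 'NR%']
cleaned = [score_to_int(s) for s in scores]
print(cleaned)
```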


# Put everything into one code cell

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from datetime import datetime
import datetime as dt
import time


def ultra_scraper(base_url=None, page_number=None, path=None, get_content=False):
    """Get the content of pages 1 to page_number and write it all to one CSV file"""
    # If path isn't specified, name the file after the current time
    if path is None:
        from datetime import datetime
        path = datetime.now().strftime("%Y-%b-%d %H:%M:%S") + '.csv'
    if base_url is None:
        base_url = 'https://www.themoviedb.org'
    # If page_number isn't specified, scrape only page 1
    if page_number is None:
        page_number = 1
    page_contents = []
    for i in range(1, page_number + 1):
        # Pause for a second between requests to go easy on the server
        time.sleep(1)
        print('Downloading page {}...Please patiently wait...'.format(str(i)))
        page_contents += web_scraper(base_url=base_url, page_number=i, path=path, get_content=True)
    time.sleep(1)
    print('Downloading is done. Thank you for your patience!')
    import pandas
    dataframe = pandas.DataFrame(page_contents)
    dataframe.to_csv(path, index=None)
    print('You have successfully scraped data at the {} pages, written to file {}'.format(page_number, path))
    return path




# To get one page of data
def web_scraper(base_url=None, page_number=None, path=None, get_content=False):
    """Get the content of one page and write it to a CSV file.

    If get_content is True, return the content instead of writing a file.
    """
    # If path isn't specified, name the file after the current time
    if path is None:
        from datetime import datetime
        path = datetime.now().strftime("%Y-%b-%d %H:%M:%S") + '.csv'
    # base_url is kept for reference; get_page() builds the URL itself
    if base_url is None:
        base_url = 'https://www.themoviedb.org'
    # If page_number isn't specified, default to page 1
    if page_number is None:
        page_number = 1
    page_doc = get_page(page_number)
    page_content = get_one_page(page_doc)
    if get_content:
        return page_content

    write_csv(page_content, path)
    print('You have successfully scraped data at the page {}, written to file {}'.format(page_number, path))
    return path

# Request a page and store it in a BeautifulSoup object
def get_page(page_number):
    """Fetch the given page of the TV shows listing and return a BeautifulSoup document"""
    page_url = 'https://www.themoviedb.org/tv' + '?page=' + str(page_number)
    response = requests.get(page_url)
    # Check the status
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + page_url)
    # Name the parser explicitly so the result doesn't depend on which parsers are installed
    return BeautifulSoup(response.text, 'html.parser')

def get_one_page(doc):
    """Parse all content on one page given a BeautifulSoup object"""
    # Get the div containing all contents
    page_wrappers = doc.find_all('div', class_='page_wrapper')
    # Get the content tags containing title, release date, user score
    content_tags = page_wrappers[0].find_all('div', class_="content")
    # Get the <a> tags containing an <img>
    a_tags = page_wrappers[0].find_all('a', class_='image')
    # Get a list of all <img> tags
    img_tags = [tag.find('img', class_='poster') for tag in a_tags]
    # Put them all together into a list of dictionaries
    all_page_contents = all_content(content_tags, img_tags)
    return all_page_contents
      

def parse_content(content_tag):
    """Parse one content <div> into a dictionary"""
    # <a> tag contains the title and the link to the detail page
    a_tag = content_tag.find('a')
    movie_title = a_tag.text
    # Text of the <p> tag contains the release date
    p_tag = content_tag.find('p').text
    rel_date = convert_date(p_tag)
    # URL of the detail page
    det_url = 'https://www.themoviedb.org' + a_tag['href']
    # <span> tag whose last class name ends with the user score
    span_tag = content_tag.find('span')
    user_score = span_tag['class'][-1][-2:] + '%'

    return {
        'movie_title': movie_title,
        'released_date': rel_date,
        'user_score': user_score,
        'detail_url': det_url
    }

def parse_image(img_tag):
    """Return a dictionary with the image URL of one <img> tag"""
    if img_tag is not None:
        img_url = {'image_url': 'https://www.themoviedb.org' + img_tag['src']}
    else:
        img_url = {'image_url': "Not_found"}
    return img_url

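def all_content(content_tags, img_tags):
    """Merge the parsed content and image of each show into one dictionary"""
    contents = []
    # zip() pairs each content tag with its image tag, whatever the page length
    for content_tag, img_tag in zip(content_tags, img_tags):
        d1 = parse_content(content_tag)
        d2 = parse_image(img_tag)
        d4 = dict(d1)
        d4.update(d2)
        contents.append(d4)
    return contents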

def write_csv(items, path):
    """Write items, a list of dictionaries, to the file at path"""
    # Open the file in write mode - 'w'
    with open(path, 'w') as f:
        # Return if there is nothing to write
        if len(items) == 0:
            return None

        # Write the headers (field names) in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')

        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

# Helper function to clean the date text
def convert_date(p_tag):
    """Convert text such as 'Oct 26, 2018' to '2018-10-26'"""
    # Missing or empty text means the page shows no release date
    if p_tag is None or p_tag == '':
        return "Not Found"
    date = dt.datetime.strptime(p_tag, "%b %d, %Y")
    return '{}-{}-{}'.format(date.year, date.month, date.day)

# The end

It was an interesting project. It reminded me of my previous job as a research associate, where I was involved in an innovation project to build a data pipeline from Weibo, a Twitter-like Chinese social media platform. Collecting all kinds of data from the Internet was also one of my regular activities there. I remember my boss and the CTO calculating the rate limit and how quickly we could get all the data ready. The CTO often mentioned "farming". I love this word as a description of how we collect data from the digital world.

Web scraping is the first step to getting real data from the real world, and it's absolutely exciting. Because of limited time, I wasn't able to do any analysis. However, good questions are better than meaningless action. I remember the day I showed my data visualization work to my boss. He messaged me back: "That's cool, but what is the insight for our clients?"



# Future work
As for future work, if any of you are interested, you can extend my code to get more data, such as cast, crew and comments, to answer the following questions:

1. Which year has the most TV shows?

2. Which TV shows do users at TMDb comment on the most? If there aren't enough comments, we can collect comments from another site such as Rotten Tomatoes (https://www.rottentomatoes.com/), Reddit or Twitter, using an API if necessary.

3. Which factors (the number of black actors, the number of female actors, genres) drive users to comment?
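
Question 1 can already be approached with the columns scraped here. A rough sketch, assuming `released_date` values in the year-month-day form produced by `convert_date()`; the short list below is a made-up sample based on rows shown earlier, not the full dataset.

```python
from collections import Counter

# Sample values in the format produced by convert_date()
released_dates = ['2018-10-26', '2007-9-24', '2010-9-6', 'Not Found', '2018-4-22']

# Keep only real dates and count the year part
years = [d.split('-')[0] for d in released_dates if d != 'Not Found']
year_counts = Counter(years)
print(year_counts.most_common(1))
```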

# Some ideas about new projects



## Project 1 - Looking for aspiring artists on foundation.app

As for my future projects, I'm interested in doing something with Bitcoin, as I'm a big fan of cryptocurrency. At the time of writing, NFTs (non-fungible tokens) are a hot topic. An artist called Beeple sold a digital artwork for 6.6 million dollars, and people started to realize this is going to be a big thing. In short, NFTs could be the Internet of Intellectual Property. However, some have started to question whether NFTs actually solve the problem. What if a person screenshots someone's digital artwork? It's also interesting to think about the value of an original work compared with a fake. What's the real difference between the Mona Lisa and a fake Mona Lisa?

My opinion is quite simple: it comes down to answering key questions like

### 1. Who created it?

The original Mona Lisa was created by Leonardo da Vinci.

### 2. How long have they been active in the market?
Do you know how long Leonardo da Vinci spent painting the Mona Lisa? It took him 16 years to finish.

NFTs provide us with a new way to support artists. But wait, how on earth can I know which upcoming artists have potential?
So my idea is to get data from the platforms where artists hang out and sell their artwork. You can then build profiles of those aspiring artists.



## Project 2 - Correlation between inflation and corruption rates and volumes on localbitcoins.com

I went to a small gathering for bitcoiners in London 2 weeks ago. The weather was nice in Hyde Park. One of the people I met was interested in my data science skills, and we are thinking of doing a project together. It could be my next project.

It'd be interesting to know which factors drive people to trade bitcoins.
However, the first step is always to collect data.


## Project 3 - Frontline reports for e-commerce business

You might know Kickstarter or Indiegogo. What if we provided some sort of service for e-commerce business owners? For example, once we find something similar to their products, they get a notification. It could help them plan their next product.


Yet, the first step is to get (scrape) the data!

# References


[1] Python official documentation. https://docs.python.org/3/

[2] Requests library. https://pypi.org/project/requests/

[3] Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[4] Aakash N S (2021), Introduction to Web Scraping. https://jovian.ai/aakashns/python-web-scraping-and-rest-api

[5] Salmon, M. (2017), Web Scraping Job Postings from Indeed. https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b

[6] Lazar, D. (2020), Scraping Medium with Python & Beautiful Soup. https://medium.com/the-innovation/scraping-medium-with-python-beautiful-soup-3314f898bbf5

[7] Arif Ul Islam (Ron), How to Become a Pro with Scraping Youtube Videos in 3 minutes. https://medium.com/brainstation23/how-to-become-a-pro-with-scraping-youtube-videos-in-3-minutes-a6ac56021961

[8] Hoekstra, D. (2020), How to Scrape Wikipedia Articles with Python. https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/

[9] Macaraeg, R. (2020), Web Scraping Yahoo Finance. https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852

[10] Mohan, M. (2020), Web Scraping Python Tutorial – How to Scrape Data From A Website. https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/

[11] Pandas library documentation. https://pandas.pydata.org/docs/


In [None]:
jovian.commit(files=data)

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Capturing environment..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/shenghongzhong/210424-project001-web-scraping-tmbd


'https://jovian.ai/shenghongzhong/210424-project001-web-scraping-tmbd'