# Web Scraping 101

<img src="image/header.jpg" width="500" height="300">

## Background
Over the last decade, data has become central to the modern world. From making business decisions to building groundbreaking AI systems, massive datasets provide valuable insights in fields around the world. However, gathering data is often a challenging task, as it is time-consuming and expensive.

The web is full of free data, so using it to gather data has become common practice. A powerful technique for this task is web scraping, which automates the otherwise manual extraction of data and information from websites. Web scraping is usually a two-step process: collect the data in raw form, then parse it for further use.


This guide walks through an example that uses Python's HTTPX to send client requests and BeautifulSoup to parse the HTML of a basic, static, non-blocking webpage. It covers the following topics:

1. HTTP requests and web fundamentals
2. HTML and JavaScript
3. Python libraries
4. Sending requests and parsing static web pages, including regular expressions
5. Cleaning and storing information
6. Dealing with popups
7. Other things to consider

## Web requests

Before diving into web scraping, it is important to understand HTTP fundamentals. Modern websites
are served over HTTP (Hypertext Transfer Protocol), which transmits requests and resources
between browsers and servers. We, as clients, send requests to a website (the server) for resources. The server
processes each request and replies with a response containing the corresponding web data (or an error message in the event of a failure).

![exchange](./image/exchange.png)

In the process of web scraping, we primarily deal with HTTP requests and responses. These are the fundamental components of data communication on the web. Let's look at their structure and their relevance to web scraping.

### HTTP Requests
An HTTP request is composed of three main parts:

1. *Method: one of several types that define the kind of action to be performed, e.g.
    - GET: This method requests a document.
    - POST: This method sends additional data to the server (such as a submitted form) and requests a document in return.
    - HEAD: This method requests meta information about a document, such as its last update time.
2. Headers: provide metadata about our request.
3. Location: specifies the resource we aim to retrieve. It is defined by a URL (Uniform Resource Locator).

![url](./image/urlstructure.png)
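
To make the parts of a URL concrete, here is a minimal sketch that splits one into its components with Python's standard `urllib.parse` module. The URL is a made-up example, and the component names are the ones `urllib.parse` uses, which may differ slightly from the labels in the diagram above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL, used only for illustration.
url = "https://www.example.com:443/products/books?category=python&page=2#reviews"

parts = urlsplit(url)

print(parts.scheme)    # protocol used to reach the server
print(parts.hostname)  # server that hosts the resource
print(parts.port)      # port number (often omitted in real URLs)
print(parts.path)      # location of the resource on the server
print(parts.query)     # raw query string with extra parameters
print(parts.fragment)  # section within the page

# The query string can be decoded into a dictionary of parameter lists.
params = parse_qs(parts.query)
print(params)
```

When scraping paginated or filtered pages, the query parameters are usually the part of the URL that changes from one request to the next.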

*In web scraping, GET requests are predominantly used as we aim to retrieve documents. POST requests are also common when interacting with web page elements like forms, search bars, or pagination. HEAD requests can be used for optimization, allowing scrapers to request meta information and then decide whether downloading the entire page is worthwhile.
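
As a rough illustration of how the three parts of a request fit together, the sketch below builds GET, POST, and HEAD requests without sending them. It uses the standard library's `urllib.request.Request` so that it runs without extra installs; it is not the HTTPX code used later in this guide. The endpoint and header values are assumptions chosen for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and header values, used only for illustration.
BASE_URL = "https://www.example.com/search"
HEADERS = {"User-Agent": "my-scraper/0.1", "Accept": "text/html"}

# GET: ask for a document; parameters travel in the URL's query string.
get_request = Request(
    f"{BASE_URL}?{urlencode({'q': 'python'})}",
    headers=HEADERS,
    method="GET",
)

# POST: ask for a document while sending additional data in the request body.
post_request = Request(
    BASE_URL,
    data=urlencode({"q": "python"}).encode("utf-8"),
    headers=HEADERS,
    method="POST",
)

# HEAD: ask only for the metadata (headers) that a GET would return.
head_request = Request(BASE_URL, headers=HEADERS, method="HEAD")

# Nothing has been sent yet; we are only inspecting the request objects.
for request in (get_request, post_request, head_request):
    print(request.get_method(), request.full_url, request.data)
```

Each object carries a method, headers, and a location, which are the same three parts listed above.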

Other methods used extensively in web communication are PATCH, PUT, and DELETE, though they are less common in web scraping:

- PATCH: This method updates an existing document.
- PUT: This method either creates a new document or updates an existing one.
- DELETE: This method deletes a document.

### HTTP Responses
The server responds with an HTTP response, which includes:

1. Status code: a three-digit number indicating the status of a web request, e.g.
    - 200 OK: The request succeeded and the expected content was returned.
    - 400 Bad Request: The server could not understand the request (for example, invalid syntax).
    - 401 Unauthorized: The server understood the request, but user authentication is required.
    - 404 Not Found: The requested resource or path was not found.
2. Headers: provide metadata about the response.
3. Content: the actual data of the page, such as HTML or JSON. It is this data that we will be parsing and collecting as we scrape the website. 
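
To see the three parts of a response without depending on an external site, the sketch below starts a throwaway local server and sends it two GET requests. It relies only on the standard library (`http.server` and `http.client`) rather than HTTPX, and `DemoHandler`, `PAGE`, and `fetch` are hypothetical names invented for this example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content served by the demo server.
PAGE = b"<html><body><h1>Hello, scraper</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Serve PAGE at "/" and answer every other path with 404."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()


def fetch(path):
    """Send a GET request and return (status, reason, headers, body)."""
    connection = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return (
            response.status,
            response.reason,
            dict(response.getheaders()),
            response.read(),
        )
    finally:
        connection.close()


status, reason, headers, body = fetch("/")
missing_status, missing_reason, _, _ = fetch("/missing")

server.shutdown()
server.server_close()

print(status, reason)            # status code and its short description
print(headers["Content-Type"])   # one of the response headers
print(body.decode("utf-8"))      # the content we would go on to parse
print(missing_status, missing_reason)
```

A scraper typically checks the status code first and only parses the content when the request succeeded.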


![exchange](image/http-exchange.svg)




In a nutshell, we send web requests to web pages, retrieve their contents in the form of HTML and CSS, and parse those contents into usable data. Now that we understand web requests, let's take a quick look at HTML and CSS.

## HTML CSS