There are two main ways to extract data from a website:
- Use the website's API, if one exists. For example, Facebook provides the Graph API, which allows retrieval of data posted on Facebook.
- Access the HTML of the web page and extract the useful data from it. This technique is called web scraping, web harvesting, or web data extraction.


# BeautifulSoup

`BeautifulSoup` is a Python library used in web scraping to pull data out of HTML and XML files. It builds a parse tree from the page source, which makes the data easy to extract. BeautifulSoup provides Pythonic idioms for iterating over, searching, and modifying the parse tree, which makes it easier to work with web data.

## Key Features:
- **HTML Parsing**: Handles broken HTML and creates a parse tree.
- **Navigation**: Navigate the parse tree using tag names, attributes, and text.
- **Search**: Find specific elements using various search methods.
- **Modification**: Modify the parse tree by adding, removing, or changing tags.
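
The features listed above can be tried on a small HTML string without any network access. The snippet below is a minimal sketch: the sample markup, the `intro` class, and the variable names are made up for illustration, and the unclosed `<b>` tag is included on purpose to show that broken HTML still yields a usable tree.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for this sketch); note the <b> tag is never closed.
html = (
    "<html><body><h1 id='top'>Old title</h1>"
    "<p class='intro'>Hello <b>world</p>"
    "<ul><li>One</li><li>Two</li></ul></body></html>"
)

# HTML parsing: the broken markup is still turned into a parse tree.
soup = BeautifulSoup(html, 'html.parser')

# Navigation: reach a tag by name, then read its attributes and text.
heading = soup.h1
print(heading.name, heading['id'], heading.get_text())

# Search: find one element by class, or every element with a given tag name.
intro_text = soup.find('p', class_='intro').get_text()
item_texts = [li.get_text() for li in soup.find_all('li')]
print(intro_text, item_texts)

# Modification: change text, add a new tag, and remove an existing one.
heading.string = 'New title'
link = soup.new_tag('a', href='https://example.com')
link.string = 'Example link'
soup.body.append(link)
soup.ul.decompose()

print(soup.prettify())
```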


## Steps involved in using BeautifulSoup to scrape a web page
1. Send an HTTP request to the URL of the web page you want to access. The server responds to the request by returning the HTML content of the page. For this task, we will use `requests`, a third-party HTTP library for Python.

2. Once we have the HTML content, the next task is parsing it. Because HTML is nested, data cannot be extracted reliably through simple string processing; a parser is needed to build a nested tree structure from the markup. Several HTML parsers are available, such as Python's built-in `html.parser`, `lxml`, and `html5lib`. Of these, `html5lib` is the most lenient because it parses pages the way a web browser does, while the examples below use `html.parser`, which needs no extra installation.

3. Finally, we navigate and search the parse tree that was created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup.


**Step 1: Install BeautifulSoup and requests**
```bash
pip install beautifulsoup4
pip install requests
```

**Step 2: Import Libraries**
```python
import requests
from bs4 import BeautifulSoup
```

**Step 3: Send a Request to a Web Page**
Send a request to the web page you want to scrape:
```python
url = 'https://example.com'
response = requests.get(url)
```

**Step 4: Parse the HTML Content**
Parse the HTML content using BeautifulSoup:
```python
soup = BeautifulSoup(response.content, 'html.parser')
```

**Step 5: Extract Data**
Extract specific data from the web page. For example, let's extract all the titles from the page:
```python
titles = soup.find_all('h1') # Assuming titles are within <h1> tags
for title in titles:
    print(title.get_text())
```

## Explanation:
- **requests.get(url)**: Sends a GET request to the specified URL and retrieves the content.
- **BeautifulSoup(response.content, 'html.parser')**: Parses the HTML content using the 'html.parser' parser.
- **soup.find_all('h1')**: Finds all the `<h1>` tags in the HTML and returns them as a list.
- **title.get_text()**: Extracts the text content from each `<h1>` tag.
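
The parsing and extraction calls explained above can also be wrapped in one small helper. This is only a sketch: `extract_headings` and `sample_html` are hypothetical names, and the helper takes an HTML string so it can be tried without sending a request.

```python
from bs4 import BeautifulSoup

def extract_headings(html, tag='h1'):
    """Return the text of every `tag` element found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    # A separator plus strip=True keeps the text of nested tags readable.
    return [element.get_text(' ', strip=True) for element in soup.find_all(tag)]

# Sample page (an assumption for this sketch).
sample_html = """
<html><body>
  <h1> First title </h1>
  <p>Some text</p>
  <h1>Second <em>title</em></h1>
</body></html>
"""

headings = extract_headings(sample_html)
print(headings)
```

With a live page, the same helper could be called as `extract_headings(response.text)` after `requests.get(url)` has returned.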

<br>

# Responses 
## **1. `<Response [200]>`**:
A `200 OK` response means that the request was successful and the server has returned the requested resource. This is the standard response for successful HTTP requests.

## **2. `<Response [403]>`**:
A `403 Forbidden` response status code indicates that the server understood the request but refuses to authorize it. This can happen for several reasons, such as insufficient permissions, IP blocking, or restrictions on automated access (common with web scraping).
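
The two status codes above can be told apart in code through `response.status_code`. The sketch below builds `requests.Response` objects by hand so that it runs offline; in real use the objects would come from `requests.get(url)`. The helper name `describe_response` is hypothetical.

```python
import requests

def describe_response(response):
    """Return a short, human-readable summary of a response's status."""
    if response.status_code == 200:
        return 'OK: the page was returned'
    if response.status_code == 403:
        return 'Forbidden: the server refused the request'
    return f'Unexpected status code: {response.status_code}'

# Hand-built responses stand in for real ones (an assumption of this sketch).
ok = requests.Response()
ok.status_code = 200
forbidden = requests.Response()
forbidden.status_code = 403

print(ok, '->', describe_response(ok))
print(forbidden, '->', describe_response(forbidden))

# raise_for_status() turns 4xx and 5xx responses into exceptions.
try:
    forbidden.raise_for_status()
    failed = False
except requests.exceptions.HTTPError as error:
    failed = True
    print('Request failed:', error)
```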

### Handling HTTP 403 Forbidden Error in Web Scraping with BeautifulSoup
#### 1. **Check the URL**:
Make sure the URL you are trying to access is correct and does not require additional parameters or headers.
   
#### 2. **User-Agent Header**:
Some websites block requests that do not have a proper User-Agent header. Adding a User-Agent header can help bypass this restriction:
```python
import requests
from bs4 import BeautifulSoup

# URL of the page we want to scrape
url = 'https://example.com'

# Set the headers to make the request look like it is coming from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Send a GET request to the URL with headers
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all <h1> tags and extract their text
    titles = soup.find_all('h1')
    for title in titles:
        print(title.get_text())
else:
    print(f'Failed to retrieve the webpage. Status code: {response.status_code}')
```

#### 3. Verify Access Restrictions:
Some websites restrict access based on geographic location or IP addresses. You might try using a proxy or VPN to see if it resolves the issue.
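
With `requests`, a proxy can be configured through the `proxies` mapping of a session. The following is a minimal sketch: the proxy address is a placeholder, `build_proxied_session` is a hypothetical helper, and the actual request is left commented out because it needs a working proxy.

```python
import requests

# Placeholder proxy address: replace it with a proxy you are allowed to use.
PROXY_URL = 'http://127.0.0.1:8080'

def build_proxied_session(proxy_url):
    """Return a session that routes HTTP and HTTPS traffic through proxy_url."""
    session = requests.Session()
    session.proxies.update({'http': proxy_url, 'https': proxy_url})
    return session

session = build_proxied_session(PROXY_URL)
print(session.proxies)

# With a working proxy, the request itself would look like this:
# response = session.get('https://example.com', timeout=10)
```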


#### 4. Check for Additional Headers or Cookies:
Some websites require additional headers or cookies to be set. Inspect the network requests in your browser's developer tools to see if there are any such requirements.
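
Once the required headers and cookies are known, a `requests.Session` can carry them on every request. In the sketch below the header values, the cookie name `sessionid`, and its value are all made up for illustration; the request is only prepared, not sent, so the merged headers can be inspected offline.

```python
import requests

session = requests.Session()

# Extra headers copied from the browser's network tab (values are assumptions).
session.headers.update({
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://example.com/',
})

# A cookie the site is assumed to expect (hypothetical name and value).
session.cookies.set('sessionid', 'abc123', domain='example.com')

# Prepare the request without sending it to see what would go over the wire.
request = requests.Request('GET', 'https://example.com/page')
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```

Calling `session.send(prepared)` or simply `session.get(...)` would then send the request with these headers and cookies attached.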

#### 5. Respect Robots.txt:
Ensure that the website allows web scraping by checking its `robots.txt` file. Accessing restricted parts of the website might cause a `403 Forbidden` response.
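
Python's standard library includes `urllib.robotparser` for this check. The sketch below parses a made-up `robots.txt` from a string so that it runs offline; for a real site, `set_url()` followed by `read()` would download the file instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content (an assumption for this sketch).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the URL.
public_allowed = parser.can_fetch('*', 'https://example.com/index.html')
private_allowed = parser.can_fetch('*', 'https://example.com/private/data.html')

print('index.html allowed:', public_allowed)
print('private/data.html allowed:', private_allowed)

# For a live site:
# parser.set_url('https://example.com/robots.txt')
# parser.read()
```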

#### 6. Use a Web Scraping API
If manual scraping is difficult, you might consider using a web scraping API such as ScraperAPI or ScrapingBee, which handle these issues for you.

#### 7. Retry with Authentication
If the website requires login credentials, you may need to handle authentication:
```python
# Example for HTTP basic authentication (url and headers as defined earlier)
response = requests.get(url, headers=headers, auth=('username', 'password'))
```
