## Getting Data from the Web

In this section, we will cover the basics of web scraping, the tools we'll use, and the ethical considerations involved. We will also discuss using APIs when available.

### Web Scraping

Web scraping is the process of extracting data from websites. It involves parsing the HTML structure of a webpage and extracting the desired information. Python provides several libraries for web scraping, such as BeautifulSoup and Scrapy.

### Tools for Web Scraping

1. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): A Python library for parsing HTML and XML documents. It provides a simple and intuitive API for navigating and searching the parsed data.
2. [Scrapy](https://scrapy.org/): A powerful and flexible web scraping framework written in Python. It allows you to define the structure of the website you want to scrape and provides tools for extracting data efficiently.

### Ethical Considerations

When web scraping, it is important to consider the ethical implications and legal restrictions. Here are some key points to keep in mind:

- Respect website terms of service: Make sure to review the terms of service of the website you are scraping and comply with any restrictions or guidelines.

- Don't overload the server: Avoid sending too many requests to a website in a short period of time, as it can put a strain on the server and disrupt the website's normal operation.

- Respect privacy: Be mindful of the data you are scraping and ensure that you are not violating any privacy laws or collecting sensitive information without consent.
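
One simple way to avoid overloading a server is to pause between requests. The sketch below is a minimal illustration of that idea, not a complete rate limiter. The function name `fetch_politely`, its `fetch` parameter, and the one-second default delay are all assumptions made for this example.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request after the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Usage with a stand-in fetch function, so no network access is needed
pages = fetch_politely(
    ['page-1.html', 'page-2.html'],
    fetch=lambda url: f'<html>{url}</html>',
    delay_seconds=0.1,
)
print(pages)
```

In real use, `fetch` would be a function that downloads the page, and the delay would be chosen to suit the site being scraped.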

### Using APIs

In some cases, websites provide APIs (Application Programming Interfaces) that allow developers to access and retrieve data in a structured and controlled manner. APIs provide a more reliable and efficient way to obtain data compared to web scraping. When available, it is recommended to use APIs instead of web scraping, as it ensures that you are accessing the data in a legitimate and authorized manner. In the next section, we will explore how to use APIs to retrieve data from the web.
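
To illustrate why structured API responses tend to be easier to work with than HTML, here is a small sketch that parses a JSON payload with Python's standard `json` module. The payload shape and its field names (`title`, `price`) are hypothetical and are not taken from any real API.

```python
import json

# Hypothetical API response body: a JSON array of book records
payload = (
    '[{"title": "A Light in the Attic", "price": 51.77},'
    ' {"title": "Tipping the Velvet", "price": 53.74}]'
)

# json.loads turns the text into ordinary Python lists and dicts
books = json.loads(payload)
for book in books:
    print(f"{book['title']} - {book['price']:.2f}")
```

No HTML parsing is needed here: each field is already named and typed.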

### Other tools
Just grabbing data from the web is not enough. We need to store it, process it, and analyze it. For this, we will use the following tools:
* [pandas](https://pandas.pydata.org/): A powerful data manipulation and analysis library for Python. It provides data structures and functions for working with structured data, making it easy to clean, transform, and analyze data.
* [numpy](https://numpy.org/doc/): A fundamental package for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
* [matplotlib](https://matplotlib.org/): A plotting library for the Python programming language and its numerical mathematics extension, NumPy. It provides a MATLAB-like interface for creating visualizations and plots. 
* [seaborn](https://seaborn.pydata.org/): A data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.


## Inspecting the HTML Structure of a Webpage
Before scraping a webpage, we need to understand its structure. We can do this by opening the page in a web browser and using the browser's developer tools to inspect the HTML.

Here are the steps to inspect the HTML structure of a webpage using Google Chrome:

1. Open the webpage in Google Chrome.
2. Right-click on the element you want to inspect and select "Inspect" from the context menu.
3. The developer tools panel will open, showing the HTML structure of the webpage and allowing you to inspect the elements and their attributes.

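In the example that follows, which scrapes the Books to Scrape sandbox site (https://books.toscrape.com/), we'll be interested primarily in each book's title, availability, and price.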

## Installing the pre-requisites
We will use the following libraries for web scraping:
* requests
* BeautifulSoup
* pandas

We can install these libraries using pip, the Python package manager. Run the following commands in your terminal or command prompt to install the required libraries:

```bash
pip install requests
pip install beautifulsoup4
pip install pandas
```

### Scraping the data
Let's start by getting the data from the website. We will use the requests library to send an HTTP request to the website and retrieve the HTML content of the page. We will then use BeautifulSoup to parse the HTML and extract the desired information.

In effect, our code will act like a web browser: it sends an HTTP request to the website, and the website responds with the HTML content of the page. The server can return a variety of status codes, but we are interested in 200, which means the request was successful. If the request was not successful, we will need to handle the error.
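
As a rough illustration of the kinds of responses a server can send, the sketch below groups status codes by their leading digit using Python's standard `http.HTTPStatus` enum. The helper name `describe_status` and the category labels are made up for this example; they are not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirect'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'informational'
    return f'{code} {status.phrase}: {category}'


for code in (200, 404, 503):
    print(describe_status(code))
```

The requests library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.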

```python
import requests
from bs4 import BeautifulSoup

base_url = 'https://books.toscrape.com/'
path = 'catalogue/category/books_1/'
page = 'index.html'

# We separate out the components of the full URL to allow us to adjust the page number
#  if we decide to scroll through the follow-on pages or categories
full_url = base_url + path + page

# Now let's go get the page
response = requests.get(full_url)

# Check to ensure that we got a good response
if response.status_code == 200:
    print(response.text)
else:
    # Anything other than 200 means we did not get the page we asked for
    print(f'Request failed with status code {response.status_code}')
```

```html
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    Books | 
     Books to Scrape - Sandbox

</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="
    
" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="../../../stat
```

We are going to repeat the request here just for clarity. In reality, we would likely skip the part where we print the response and move straight on to parsing it.

```python
# Now let's go get the page
response = requests.get(full_url)

# Check to ensure that we got a good response
if response.status_code == 200:
    # The page declares UTF-8 in its <meta> tag; setting it explicitly stops
    # requests from guessing a different encoding and garbling the £ sign
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    # We can now use the soup object to extract the data we need
    # Let's start by finding all the article tags
    articles = soup.find_all('article')
    for article in articles:
        # We can now extract the data we need from each article
        # We'll start by getting the title and price
        title = article.h3.a['title']
        price = article.find('p', class_='price_color').text
        print(f'{title} - {price}')

    # Now let's find the next page link
    next_page = soup.find('li', class_='next')
    if next_page:
        # We have a next page
        next_page_url = next_page.a['href']
        print(f'Next page URL: {next_page_url}')
    else:
        print('No next page')
```

```text
A Light in the Attic - £51.77
Tipping the Velvet - £53.74
Soumission - £50.10
Sharp Objects - £47.82
Sapiens: A Brief History of Humankind - £54.23
The Requiem Red - £22.65
The Dirty Little Secrets of Getting Your Dream Job - £33.34
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull - £17.93
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics - £22.60
The Black Maria - £52.15
Starving Hearts (Triangular Trade Trilogy, #1) - £13.99
Shakespeare's Sonnets - £20.66
Set Me Free - £17.46
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1) - £52.29
Rip it Up and Start Again - £35.02
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991 - £57.25
Olio - £23.88
Mesaerion: The Best Science Fiction Stories 1800-1849 - £37.59
Libertarianism for Beginners - £51.33
It's Only the Himalayas - £45.17
Next page URL: page-2.html
```


**TRY IT**
As an exercise, read the next page URL and repeat the process for the next page. Following a "next" link until there isn't one is a common pattern in web scraping.
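
One detail worth noting for this exercise: the link printed above (`page-2.html`) is relative to the page it appears on. Here is a minimal sketch, assuming the standard library's `urllib.parse.urljoin` is an acceptable way to resolve it. The variable names are illustrative.

```python
from urllib.parse import urljoin

# The page we just scraped and the relative link found in its "next" button
current_url = 'https://books.toscrape.com/catalogue/category/books_1/index.html'
next_href = 'page-2.html'

# urljoin resolves the relative link against the current page's URL
next_url = urljoin(current_url, next_href)
print(next_url)
```

This only builds the URL; fetching and parsing it is the exercise itself.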

# Box Office Data
Let's take a look at another example. In this case, we are going to scrape box office data from boxofficemojo.com. Box Office Mojo provides box office data for movies, including gross revenue, budget, and release date. In this example, we'll see an even simpler approach for gathering data from a page.

## BoxOfficeMojo
Let's start by looking at the website as a whole. There is a lot of data here, but we want to focus on the movies with the highest lifetime gross. The URL for this page is https://www.boxofficemojo.com/chart/top_lifetime_gross/, and we can use it to get the data we want.

When we inspect this page, we see the data is in a very simple form: a table with each row as a movie. Since [pandas](https://pandas.pydata.org/) is a great tool for working with tabular data, we can use its [read_html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) function to read the table directly into a dataframe. Note that `read_html` relies on an HTML parser such as lxml (or BeautifulSoup together with html5lib) being installed alongside pandas.

The read_html function returns a list of dataframes, one for each table on the page. In this case, the first table is the one we want. We can use the [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method to see the first few rows of the dataframe, which is a quick way to check that we have the right data.

```python
import pandas as pd

# read_html returns a list of dataframes, one per table found on the page
movies_df = pd.read_html('https://www.boxofficemojo.com/chart/top_lifetime_gross/')

# Get the first 5 rows of the first table on the page
print(movies_df[0].head())
```

```text
   Rank                                       Title Lifetime Gross  Year
0     1  Star Wars: Episode VII - The Force Awakens   $936,662,225  2015
1     2                           Avengers: Endgame   $858,373,000  2019
2     3                     Spider-Man: No Way Home   $814,115,070  2021
3     4                                      Avatar   $785,221,649  2009
4     5                           Top Gun: Maverick   $718,732,821  2022
```


Brilliant! It looks like the very first table on the page is the data we are after. If it weren't, we could cycle through the tables in the returned list and look at the head of each one to see which one we want. From here, we can slice and dice the data as we see fit.

```python
# Since we know the first table is the right one, let's focus on it
all_time_movies_df = movies_df[0]

# Optionally, we could use Rank as our index; it is left disabled here,
# so the default 0-based index is kept
#all_time_movies_df.set_index('Rank', inplace=True)

# Find out which year had the biggest $ in blockbuster movies
# First, we need to remove the dollar sign and commas from the 'Lifetime Gross' column
#   and convert it to integer
# (the raw string r'\$' avoids an invalid-escape warning for the backslash)
all_time_movies_df['Lifetime Gross'] = all_time_movies_df['Lifetime Gross'].replace({r'\$': '', ',': ''}, regex=True).astype(int)

# Now, we can group by 'Year' and sum the 'Lifetime Gross'
yearly_gross = all_time_movies_df.groupby('Year')['Lifetime Gross'].sum()
# print in descending order
print(yearly_gross.sort_values(ascending=False))
```


```text
Year
2019    4657338928
2016    4506359053
2017    4255929449
2018    3990957778
2022    3727693513
2015    3605206784
2013    3397089092
2012    3313754690
2009    3091155547
2014    2737183276
2010    2659124856
2023    2578952746
2007    2501454761
2004    1976603465
2008    1838648350
2002    1792698590
2003    1590090193
2011    1510661370
2001    1420158755
2005    1414811841
2006    1370160662
1997    1154069826
2021    1038658362
1999    1013903148
1994     753239047
1993     626380318
1996     547999883
1990     503392549
2000     495047942
1984     478339275
1977     460998507
1982     437345144
1983     316566101
1980     292753960
1975     266567580
1989     251409241
1981     248159971
1973     233005644
1995     223225679
1991     218967620
1992     217350219
1998     217049603
Name: Lifetime Gross, dtype: int64
```
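
To see what the cleaning step does to an individual value, here is a rough plain-Python equivalent applied to two gross figures from the table above. The helper name `parse_gross` is made up for this illustration, and it uses the standard `re` module rather than pandas.

```python
import re


def parse_gross(text):
    """Convert a string such as '$936,662,225' to the integer 936662225."""
    # Strip every dollar sign and comma, then convert what is left
    return int(re.sub(r'[$,]', '', text))


for raw in ('$936,662,225', '$858,373,000'):
    print(raw, '->', parse_gross(raw))
```

The pandas version applies the same substitution to every value in the column at once.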


This is interesting: it looks like 2019 had some pretty big movies. Let's take a look at what they were.

```python
# Get the movies from 2019 that made the list
movies_2019 = all_time_movies_df[all_time_movies_df['Year'] == 2019][['Title', 'Lifetime Gross']]
print('Top Movies of 2019 by domestic box office')
print('_'*40)
print(movies_2019)
```

```text
Top Movies of 2019 by domestic box office
________________________________________
                                            Title  Lifetime Gross
1                               Avengers: Endgame       858373000
15                                  The Lion King       543638043
18  Star Wars: Episode IX - The Rise of Skywalker       515202542
21                                      Frozen II       477373578
29                                    Toy Story 4       434038008
30                                 Captain Marvel       426829839
46                      Spider-Man: Far from Home       390532085
64                                        Aladdin       355559216
74                                          Joker       335477657
86                        Jumanji: The Next Level       320314960
```


## Conclusion
While it is only useful when the data is in a table, the pandas `read_html` function is a very simple way to get tabular data from a webpage. It's a great tool to have in your toolbox.

### A couple things to try
We only read the first 200 records of the top 1000 movies. If you click the link to see the next page, you'll see the URL changes to include a query parameter, `offset` (https://www.boxofficemojo.com/chart/top_lifetime_gross/?offset=200). See if you can modify the query to get all 5 pages of results showing the top 1000 lifetime grossing movies.
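
As a starting point for this exercise, the sketch below only builds the five page URLs; it does not download anything. It assumes each page holds 200 records, that the remaining pages follow the same `offset` pattern as the second page, and that `offset=0` is accepted for the first page. The constant names are made up for this illustration.

```python
from urllib.parse import urlencode

CHART_URL = 'https://www.boxofficemojo.com/chart/top_lifetime_gross/'
PAGE_SIZE = 200       # records per page (assumed from the first page)
TOTAL_RECORDS = 1000  # size of the full chart

# One URL per page, using offsets 0, 200, 400, 600, 800
page_urls = [
    CHART_URL + '?' + urlencode({'offset': offset})
    for offset in range(0, TOTAL_RECORDS, PAGE_SIZE)
]

for url in page_urls:
    print(url)
```

Each URL could then be passed to `pd.read_html`, and the resulting tables combined with `pd.concat`.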