## DSCI 510 (Abramson)
## Lab 9
### Wednesday, March 20th, 2024 (12:00pm - 1:50pm)


---
Hello All,
     Welcome to the lab!


Rules for *this* lab (**Note:** different labs may have different submission procedures, due dates, etc.):
- You will be given the lab assignment at the start of the lab.

- You must complete the assignments *individually*.  If you are having trouble completing the assignment, ask the TA for help.  The TA *will not write your code for you*, nor should anyone else (e.g. other students, ChatGPT, etc.).  The only way you will learn is if you try this yourself!
  
- DUE DATE: The lab will be submitted **via Blackboard by 12:00pm (noon), on Wednesday, March 27th** (i.e. a week from today)

- There will be **no** late submissions

- Name this file '[FirstName]_[LastName]\_lab[Lab Number]'.  For example, Jeremy Abramson's submission would be `jeremy_abramson_lab9.ipynb` for this lab.

- You are encouraged to consult online resources such as the Python docs and Stack Overflow, but look up the topics, not the questions themselves.

In this lab, we are going to
1. Use the `requests` library to access some web endpoints
2. Use `BeautifulSoup` to do some web scraping.

### Don't worry if you don't know how to do everything just yet.  You have a week to complete this particular lab.  You'll get there!


### Q1. API Access [5 points]

##### Note: There is no sample code for this problem.  Consult the lecture slides for examples.  

For this problem, we're going to use the Star Wars API.  The documentation (which explains what *endpoints* are available, and what each endpoint returns) is here: https://swapi.dev/documentation.

Download *all* the data from the `starships`, `vehicles` and `species` endpoints.  Build a dictionary, where each initial key is one of `starships`, `vehicles` or `species`, and each initial value is a list.  Each element in each list should be a dictionary containing the contents of each record returned from the API endpoint.

The API is paginated, so your crawler must request each page individually, continuing while the API reports more data and stopping as soon as it does not.  Your crawler should not visit pages needlessly!

You should structure your program modularly; for example, you might write functions that access an API endpoint (with the endpoint being a parameter), one that tests whether there is more data or not (and if so, calls the previous function to get it) and one that writes the resultant data to the data structure defined above.  
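
One way to picture the modular structure described above is the sketch below. It is only an illustration, not the required design: the names `crawl_endpoint`, `build_dataset`, `fake_get_page` and the `example.test` URLs are hypothetical, and the fake pages merely imitate a paginated response with a `results` list and a `next` URL (or `None`). Passing the page-fetching function in as a parameter lets the pagination logic be checked without touching the network.

```python
from typing import Callable, Optional


def crawl_endpoint(first_url: str, get_page: Callable[[str], dict]) -> list:
    """Collect every record by following 'next' links until there are none."""
    records = []
    url: Optional[str] = first_url
    while url:  # stops as soon as 'next' is None, so no extra page is requested
        page = get_page(url)
        records.extend(page["results"])
        url = page.get("next")
    return records


def build_dataset(endpoints, base_url, get_page):
    """Map each endpoint name to the list of records it returned."""
    return {name: crawl_endpoint(f"{base_url}{name}/", get_page) for name in endpoints}


# Offline stand-in for the network (hypothetical URLs and records).
FAKE_PAGES = {
    "https://example.test/api/starships/": {
        "results": [{"name": "ship A"}, {"name": "ship B"}],
        "next": "https://example.test/api/starships/?page=2",
    },
    "https://example.test/api/starships/?page=2": {
        "results": [{"name": "ship C"}],
        "next": None,
    },
}

visited = []  # records which URLs were actually requested


def fake_get_page(url):
    visited.append(url)
    return FAKE_PAGES[url]


dataset = build_dataset(["starships"], "https://example.test/api/", fake_get_page)
print(dataset)
print(f"pages requested: {len(visited)}")
```

Against the real API, the same functions could be driven by something like `lambda url: requests.get(url, timeout=10).json()` in place of `fake_get_page`, assuming the responses carry `results` and `next` keys as the documentation describes.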


In [2]:
# Did you remember to install requests? :-) 
import requests

In [3]:

# Function to fetch data from a specific endpoint
def fetch_data(endpoint):
    url = f"https://swapi.dev/api/{endpoint}/"
    data = []

    # Fetch data from all pages
    while url:
        response = requests.get(url, timeout=10)  # give up rather than hang on a stalled connection
        if response.status_code == 200:
            result = response.json()
            data.extend(result['results'])
            url = result['next']  # URL of the next page, or None if no more pages
        else:
            print(f"Error fetching data from {url}. Status code: {response.status_code}")
            break

    return data

# Function to fetch data from all endpoints and organize it into a dictionary
def fetch_all_data():
    endpoints = ['starships', 'vehicles', 'species']
    data_dict = {}

    for endpoint in endpoints:
        data_dict[endpoint] = fetch_data(endpoint)

    return data_dict

# Main function
def main():
    data_dict = fetch_all_data()
    print(data_dict)

if __name__ == "__main__":
    main()


{'starships': [{'name': 'CR90 corvette', 'model': 'CR90 corvette', 'manufacturer': 'Corellian Engineering Corporation', 'cost_in_credits': '3500000', 'length': '150', 'max_atmosphering_speed': '950', 'crew': '30-165', 'passengers': '600', 'cargo_capacity': '3000000', 'consumables': '1 year', 'hyperdrive_rating': '2.0', 'MGLT': '60', 'starship_class': 'corvette', 'pilots': [], 'films': ['https://swapi.dev/api/films/1/', 'https://swapi.dev/api/films/3/', 'https://swapi.dev/api/films/6/'], 'created': '2014-12-10T14:20:33.369000Z', 'edited': '2014-12-20T21:23:49.867000Z', 'url': 'https://swapi.dev/api/starships/2/'}, {'name': 'Star Destroyer', 'model': 'Imperial I-class Star Destroyer', 'manufacturer': 'Kuat Drive Yards', 'cost_in_credits': '150000000', 'length': '1,600', 'max_atmosphering_speed': '975', 'crew': '47,060', 'passengers': 'n/a', 'cargo_capacity': '36000000', 'consumables': '2 years', 'hyperdrive_rating': '2.0', 'MGLT': '60', 'starship_class': 'Star Destroyer', 'pilots': [], '

### Q2. Web Scraping  [5 points]

##### Note: There is no sample code for this problem.  Consult the lecture slides for examples.  

Scrape book titles, number of "stars", and prices from http://books.toscrape.com/catalogue/category/books/fiction_10/index.html. You should store this in a list of dictionaries, where each dictionary corresponds to a book, and each key is one of "title", "stars" and "price".  

Note that this is also paginated (there are 4 pages).  Your scraper should grab the URL of the "next" page from the current page (starting with the URL above), and detect when there are no more pages to scrape.
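
The "next" link on a page like this is typically a relative `href`, so it has to be resolved against the URL of the page it came from. A minimal sketch using the standard library's `urljoin` follows; the helper name `resolve_next` and the example `href` value are assumptions for illustration, and extracting the `href` from the HTML is left to your scraper.

```python
from urllib.parse import urljoin

current_url = "http://books.toscrape.com/catalogue/category/books/fiction_10/index.html"
next_href = "page-2.html"  # assumed example of a relative href taken from the "next" link


def resolve_next(current_url, href):
    """Return the absolute URL for href, or None when the page has no next link."""
    if href is None:
        return None
    return urljoin(current_url, href)


print(resolve_next(current_url, next_href))
# http://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html
print(resolve_next(current_url, None))
# None
```

Returning `None` when there is no link gives the scraping loop a natural stopping condition.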

This code should also be modular (you might even use the same "go get a webpage" function from Q1!).  You might consider functions for each thing you want to scrape.  
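
As a rough illustration of "a function for each thing you want to scrape", the sketch below turns already-extracted pieces into one book dictionary. It assumes the star rating arrives as a list of CSS class names such as `['star-rating', 'Three']` and the price as text such as `'£50.10'`; the function names are hypothetical, and converting the price to a `float` is a design choice rather than a requirement of the assignment.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def stars_from_classes(classes):
    """Return the numeric rating named in a class list, or 0 if none is found."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return 0


def price_to_float(text):
    """Strip whitespace and a leading pound sign, then convert to a float."""
    return float(text.strip().lstrip("£"))


def make_book(title, rating_classes, price_text):
    """Assemble one record with the keys the assignment asks for."""
    return {
        "title": title,
        "stars": stars_from_classes(rating_classes),
        "price": price_to_float(price_text),
    }


book = make_book("Soumission", ["star-rating", "One"], "£50.10")
print(book)
```

Keeping each conversion in its own small function makes it easy to test the parsing separately from the HTML extraction.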

In [3]:
# Example for .prettify().  Applicable to full HTML text or parts thereof.

# Did you remember to install Beautiful Soup?
from bs4 import BeautifulSoup
s = '<tr><td><a href="https://www.france.fr">France</a></td><td cat="capital city" pop="2,102,650">Paris</td></tr>'
table_row = BeautifulSoup(s, 'html.parser')
print(table_row.prettify())

<tr>
 <td>
  <a href="https://www.france.fr">
   France
  </a>
 </td>
 <td cat="capital city" pop="2,102,650">
  Paris
 </td>
</tr>


In [4]:
# Code goes here

In [5]:

import requests
from bs4 import BeautifulSoup

def parse_stars(star_class):
    """Function to parse star rating class."""
    ratings = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    return ratings.get(star_class, 0) 

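def get_page_content(url):
    """Function to download a page and return its raw bytes."""
    response = requests.get(url, timeout=10)  # give up rather than hang on a stalled connection
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.content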

def extract_books_data(page_content):
    """Function to extract book data from a page."""
    books_data = []
    soup = BeautifulSoup(page_content, 'html.parser')
    for book in soup.select('.product_pod'):
        title = book.select_one('h3 a')['title']
        stars_class = book.select_one('p.star-rating')['class'][1]
        stars = parse_stars(stars_class)
        price = book.select_one('.price_color').get_text().strip('£')
        books_data.append({'title': title, 'stars': stars, 'price': price})
    return books_data

def get_next_page_url(page_content, base_url):
    """Function to extract the URL of the next page."""
    soup = BeautifulSoup(page_content, 'html.parser')
    next_button = soup.select_one('.pager .next a')
    if next_button:
        return base_url + next_button['href']
    return None

def scrape_all_books(start_url):
    """Function to scrape book data from all pages."""
    base_url = 'http://books.toscrape.com/catalogue/category/books/fiction_10/'
    url = start_url
    all_books_data = []

    while url:
        print(f'Scraping {url}')
        page_content = get_page_content(url)
        books_data = extract_books_data(page_content)
        all_books_data.extend(books_data)
        url = get_next_page_url(page_content, base_url)

    return all_books_data

# Starting URL
start_url = 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html'

# Scrape book data
books_data = scrape_all_books(start_url)

# Print scraped data
for book in books_data:
    print(book)


Scraping http://books.toscrape.com/catalogue/category/books/fiction_10/index.html
Scraping http://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html
Scraping http://books.toscrape.com/catalogue/category/books/fiction_10/page-3.html
Scraping http://books.toscrape.com/catalogue/category/books/fiction_10/page-4.html
{'title': 'Soumission', 'stars': 1, 'price': '50.10'}
{'title': 'Private Paris (Private #10)', 'stars': 5, 'price': '47.61'}
{'title': 'We Love You, Charlie Freeman', 'stars': 5, 'price': '50.27'}
{'title': 'Thirst', 'stars': 5, 'price': '17.27'}
{'title': 'The Murder That Never Was (Forensic Instincts #5)', 'stars': 3, 'price': '54.11'}
{'title': 'Tuesday Nights in 1980', 'stars': 2, 'price': '21.04'}
{'title': 'The Vacationers', 'stars': 4, 'price': '42.15'}
{'title': 'The Regional Office Is Under Attack!', 'stars': 5, 'price': '51.36'}
{'title': 'Finders Keepers (Bill Hodges Trilogy #2)', 'stars': 5, 'price': '53.53'}
{'title': 'The Time Keeper', 'stars': 5,