# Choose a Data Set

Create your own dataset by scraping one of the following websites *(level 5)*:
- [Wikipedia](https://www.wikipedia.org/)
- [OpenLibrary](https://openlibrary.org/)

**OR** 

Use data gathered from one of the following APIs *(level 4)*: 
- [TMDB](https://developer.themoviedb.org/reference/intro/getting-started)
- [College Scorecard](https://collegescorecard.ed.gov/data/api-documentation/)

**OR** 

Pick a JSON dataset *(level 3)*:
- [Food/Restaurant Data](https://drive.google.com/drive/folders/1V94S6WpclvQmbnW88KVMD4EruryA1oma?usp=drive_link)
- [Fashion Data](https://drive.google.com/drive/folders/1V8SbFjtRRW8WVf3xBzg0gzLjOtMhHea_?usp=drive_link)

**OR** 

Pick a CSV dataset *(level 2)*:
- [LA Parking Tickets](https://drive.google.com/drive/folders/1vaOfwMi6QmZEGsXr8VM0ulPGzvTTBCgm?usp=drive_link)
- [Hotels](https://drive.google.com/drive/folders/1IpVFxgwBJvJHKoOuBsk6WK2qYqFYP4hi?usp=drive_link)

# My Question
### Suppose you search Open Library for books using the term "Japan".
### What is the expected number of ratings for a book in the "Top Rated" results, and what is its expected rating? Also, what is the probability that a book has a rating of 4 stars or higher?
*To get here, click the search box, press Enter, and then, on the results page, change the sort type to "Top Rated".*
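
One way to make the question precise, assuming the scraped results are treated as a sample of $n$ books where book $i$ has $c_i$ ratings and an average rating $r_i$ (symbols introduced here for illustration only), is to estimate each quantity with a simple sample statistic:

$$
\hat{E}[C] = \frac{1}{n}\sum_{i=1}^{n} c_i, \qquad
\hat{E}[R] = \frac{1}{n}\sum_{i=1}^{n} r_i, \qquad
\hat{P}(R \ge 4) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[r_i \ge 4]
$$

Here $\mathbf{1}[\cdot]$ is 1 when the condition holds and 0 otherwise, so the last estimate is the fraction of sampled books rated 4 stars or higher.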

# My Answer

In [17]:
# Import the libraries we need
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re


In [20]:
url = "https://openlibrary.org/search?q=Japan&mode=everything&sort=rating&page=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
book_ratings = {"Title": [], "Author": [], "Amount_Ratings": [], "Rating": []}



In [36]:
titles = soup.find_all("h3", {"class": "booktitle"})
for title in titles:
    title = str(title.get_text().strip())
    book_ratings["Title"].append(title)

    
span_rate = soup.find_all('span', itemprop='ratingValue')
for span in span_rate:
    text = span.text.strip()
    # Number of ratings, expected in parentheses, e.g. "(12)"
    match = re.search(r'\((\d+)\)', text)

    if match:
        num_ratings = int(match.group(1))
        book_ratings["Amount_Ratings"].append(num_ratings)


just_rate = soup.find_all('span', itemprop='ratingValue')
for span in just_rate:
    another_text = span.text.strip()
    # Average rating, expected as a decimal, e.g. "4.5"
    match = re.search(r'(\d+\.\d+)', another_text)

    if match:
        numy_ratings = float(match.group(1))
        book_ratings["Rating"].append(numy_ratings)
            
authors = soup.find_all("a")  
            
for author in authors:
    if "/authors/" in author.get("href", ""): 
        book_ratings["Author"].append(author.get_text(strip=True))

min_length = min(len(book_ratings[key]) for key in book_ratings)

# Trim all lists in book_ratings to the minimum length so the columns line up
for key in book_ratings:
    book_ratings[key] = book_ratings[key][:min_length]

In [37]:
df = pd.DataFrame(book_ratings)

In [38]:
df

Unnamed: 0,Title,Author,Amount_Ratings,Rating
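
The table above has a header but no rows, which is what the trimming step produces when at least one of the four lists is empty. One possible cause, and this is an assumption because the page markup is not shown here, is that the rating text does not look like `(12)`, so the `\((\d+)\)` pattern never matches. The sketch below uses a hypothetical helper, `parse_rating_text`, and assumed sample strings such as `"4.5 (12 ratings)"` to show a more tolerant pattern that captures the rating and the count in one pass.

```python
import re

# Assumed format: "<rating> (<count> ratings)", e.g. "4.5 (12 ratings)".
# The real Open Library text may differ; print a few spans to confirm.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*\((\d+)(?:\s+ratings?)?\)")


def parse_rating_text(text):
    """Return (rating, count) parsed from text, or None if it does not match."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Hypothetical sample strings, not scraped data
samples = ["4.5 (12 ratings)", "3 (1 rating)", "4.0 (7)", "no rating yet"]
for sample in samples:
    print(repr(sample), "->", parse_rating_text(sample))
```

Parsing the rating and the count from the same span in one step keeps the two values paired, so the columns stay aligned without trimming.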


In [9]:
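def scrape_page(page_number):
    url = f'https://openlibrary.org/search?q=Japan&mode=everything&sort=rating&page={page_number}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    data = []
    # Each search result is expected to sit in a <div class="item">
    rows = soup.find_all('div', class_='item')

    for row in rows:
        title_tag = row.find('h2')
        description_tag = row.find('p')
        # Skip results that lack either element instead of raising AttributeError
        if title_tag is None or description_tag is None:
            continue
        data.append({'Title': title_tag.text.strip(),
                     'Description': description_tag.text.strip()})

    return data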



In [11]:
scrape_page(1)

[]

In [10]:
def scrape_all_pages(total_pages):
    all_data = []

    for page_number in range(1, total_pages + 1):
        page_data = scrape_page(page_number)
        all_data.extend(page_data)

    df = pd.DataFrame(all_data)
    return df


df = scrape_all_pages(5)
print(df)


Empty DataFrame
Columns: []
Index: []


In [None]:
# Check if the request was successful; call this right after requests.get()
def check_response(response, page_number):
    if response.status_code != 200:
        print(f"Failed to retrieve page {page_number}")
        return False
    return True


***Describe analysis here.***

In [2]:
# Alternate between code for analysis and markdown descriptions of your analysis
# Add more code or markdown cells if needed to fully explain analysis

***Describe analysis here.***

In [3]:
# Add more code/markdown cells here if you need them.
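
Once the table contains rows, the three quantities in the question reduce to two averages and a proportion. The sketch below uses a small made-up sample in place of scraped data (the numbers are illustrative, not results) and a hypothetical helper, `summarize_ratings`; the keys match the `Amount_Ratings` and `Rating` columns used in the scraping cells.

```python
from statistics import fmean


def summarize_ratings(rows, threshold=4.0):
    """Estimate expected rating count, expected rating, and P(rating >= threshold)."""
    if not rows:
        raise ValueError("no rows to summarize")
    # Values may arrive as strings from the scraper, so convert them first
    counts = [int(row["Amount_Ratings"]) for row in rows]
    ratings = [float(row["Rating"]) for row in rows]
    return {
        "expected_num_ratings": fmean(counts),
        "expected_rating": fmean(ratings),
        "prob_at_least_threshold": sum(r >= threshold for r in ratings) / len(ratings),
    }


# Made-up rows for illustration only; they are not real Open Library data
sample_rows = [
    {"Title": "Book A", "Amount_Ratings": "2", "Rating": "5.0"},
    {"Title": "Book B", "Amount_Ratings": "4", "Rating": "4.5"},
    {"Title": "Book C", "Amount_Ratings": "1", "Rating": "3.5"},
    {"Title": "Book D", "Amount_Ratings": "5", "Rating": "4.0"},
]

summary = summarize_ratings(sample_rows)
print(summary)
```

With a pandas DataFrame, `df.to_dict("records")` produces rows in this shape. These are sample estimates, so they describe only the pages that were scraped.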