# Choose a Data Set

Create your own dataset by scraping one of the following websites *(level 5)*:
- [Wikipedia](https://www.wikipedia.org/)
- [OpenLibrary](https://openlibrary.org/)

**OR** 

Use data gathered from one of the following APIs *(level 4)*: 
- [TMDB](https://developer.themoviedb.org/reference/intro/getting-started)
- [College Scorecard](https://collegescorecard.ed.gov/data/api-documentation/)

**OR** 

Pick a JSON dataset *(level 3)*:
- [Food/Restaurant Data](https://drive.google.com/drive/folders/1V94S6WpclvQmbnW88KVMD4EruryA1oma?usp=drive_link)
- [Fashion Data](https://drive.google.com/drive/folders/1V8SbFjtRRW8WVf3xBzg0gzLjOtMhHea_?usp=drive_link)

**OR** 

Pick a CSV dataset *(level 2)*:
- [LA Parking Tickets](https://drive.google.com/drive/folders/1vaOfwMi6QmZEGsXr8VM0ulPGzvTTBCgm?usp=drive_link)
- [Hotels](https://drive.google.com/drive/folders/1IpVFxgwBJvJHKoOuBsk6WK2qYqFYP4hi?usp=drive_link)

# My Question
### Across all authors on the OpenLibrary trending page, what is the average number of books each author has on that page?

# My Answer

<h1 style="text-align: center;">Analysis Techniques</h1>
<p style="text-align: center;">Simple theoretical probability<br>
The Complement Rule<br>
Mutual Exclusivity vs. Independent Events<br>
Theoretical Probability with the Addition Rule<br>
Theoretical Probability with the Multiplication Rule<br>
Empirical Probability<br>
Drawing without Replacement<br>
Bayes' Theorem<br>
Expected value<br>
Standard deviation of expected values<br>
Probability distributions (histogram of all possible values of a random variable)<br>
</p>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [8]:
# Column-oriented storage: one list per DataFrame column.
books = {'Title': [], 'Author': [], 'Year Published': [], 'Number of Logs': []}

# Scrape pages 1 through 10 of the trending list.
for page in range(1, 11):
    url = f'https://openlibrary.org/trending/forever?page={page}'
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each result is selected by its "sri__main" div.
    for book in soup.find_all('div', class_='sri__main'):
        title = book.find('div', class_='resultTitle').text.strip()
        author = book.find('span', class_='bookauthor').text.strip()

        detailed = book.find('span', class_='resultStats')
        pubdate = detailed.find('span', class_='resultDetails').text.strip()
        pubdate_cleaned = pubdate.replace("—", "").replace(" editions", "").strip()

        # Exact match on the string "Logged"; in the run shown below nothing
        # matched, so the 'Number of Logs' column is empty.
        log_info = book.find(string="Logged")
        logged_text = log_info.strip() if log_info else None

        books['Title'].append(title)
        books['Author'].append(author)
        books['Year Published'].append(pubdate_cleaned)
        books['Number of Logs'].append(logged_text)

df = pd.DataFrame(books)

df

| | Title | Author | Year Published | Number of Logs |
|---|---|---|---|---|
| 0 | Atomic Habits | by James Clear | First published in 2016\n \n ... | |
| 1 | It Ends With Us | by Colleen Hoover | First published in 2012\n \n ... | |
| 2 | The 48 Laws of Power | by Robert Greene and Joost Elffers | First published in 1998\n \n ... | |
| 3 | The Subtle Art of Not Giving a F*ck | by Mark Manson | First published in 2016\n \n ... | |
| 4 | Um casamento arranjado | by Zana Kheiron | First published in 2019\n \n ... | |
| ... | ... | ... | ... | ... |
| 194 | Things Fall Apart | by Chinua Achebe | First published in 1958\n \n ... | |
| 195 | A child called "it" | by David J. Pelzer | First published in 1987\n \n ... | |
| 196 | The Titan's Curse | by Rick Riordan | First published in 2007\n \n ... | |
| 197 | A Wrinkle in Time | by Madeleine L'Engle | First published in 1962\n \n ... | |
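The Author and Year Published columns above still carry scraping residue: a leading "by " on each author and extra text and whitespace around the year. Below is a minimal sketch of one way to tidy them, assuming the values look like the rows shown. The `clean_books` helper and the two-row `sample` frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def clean_books(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the 'by ' prefix removed and the year as an integer."""
    out = df.copy()
    # Drop a leading "by " from the author string, if present.
    out['Author'] = out['Author'].str.replace(r'^by\s+', '', regex=True).str.strip()
    # Take the first four-digit number in the publication text as the year.
    year = out['Year Published'].str.extract(r'(\d{4})', expand=False)
    # Rows without a four-digit number become <NA> instead of raising.
    out['Year Published'] = pd.to_numeric(year, errors='coerce').astype('Int64')
    return out

# Hypothetical rows shaped like the scraped output.
sample = pd.DataFrame({
    'Title': ['Atomic Habits', 'Things Fall Apart'],
    'Author': ['by James Clear', 'by Chinua Achebe'],
    'Year Published': ['First published in 2016\n \n', 'First published in 1958'],
})
cleaned = clean_books(sample)
print(cleaned)
```

Stripping the prefix does not change the per-author counts, since every author string carries the same "by " prefix. Co-author strings such as "Robert Greene and Joost Elffers" would still count as a single author unless they are split separately.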


In [9]:
author_counts = df.groupby('Author').size()
average_books = author_counts.mean()
average_books

1.251572327044025

<p style="color: #635b9b; text-align: center;">On average, each author on the trending page has about 1.25 books there, which rounds to one book per author.</p>
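The mean computed above can also be read as the expected value of the number of trending books for a randomly chosen author. The sketch below shows how the same `groupby` counts give an empirical probability distribution, its expected value, and its standard deviation. It uses a small made-up frame: the titles and authors in `toy` are placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical sample: author A has two books; B, C, and D have one each.
toy = pd.DataFrame({
    'Title': ['t1', 't2', 't3', 't4', 't5'],
    'Author': ['A', 'A', 'B', 'C', 'D'],
})

# Books per author, the same step as in the notebook.
author_counts = toy.groupby('Author').size()

# Empirical probability of each possible count (share of authors with that count).
distribution = author_counts.value_counts(normalize=True).sort_index()
counts = distribution.index.to_numpy()
probs = distribution.to_numpy()

# Expected value: sum of (count * probability).
expected_value = (counts * probs).sum()

# Standard deviation: square root of the probability-weighted squared deviations.
variance = ((counts - expected_value) ** 2 * probs).sum()
std_dev = variance ** 0.5

print(distribution)
print(expected_value, std_dev)
```

Here the expected value equals `author_counts.mean()`, and `std_dev` equals the population standard deviation `author_counts.std(ddof=0)`. Plotting `distribution` as a bar chart would give the histogram of the random variable "books per author".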