# Choose a Data Set

Create your own dataset by scraping one of the following websites *(level 5)*:
- [Wikipedia](https://www.wikipedia.org/)
- [OpenLibrary](https://openlibrary.org/)

**OR** 

Use data gathered from one of the following APIs *(level 4)*: 
- [TMDB](https://developer.themoviedb.org/reference/intro/getting-started)
- [College Scorecard](https://collegescorecard.ed.gov/data/api-documentation/)

**OR** 

Pick a JSON dataset *(level 3)*:
- [Food/Restaurant Data](https://drive.google.com/drive/folders/1V94S6WpclvQmbnW88KVMD4EruryA1oma?usp=drive_link)
- [Fashion Data](https://drive.google.com/drive/folders/1V8SbFjtRRW8WVf3xBzg0gzLjOtMhHea_?usp=drive_link)

**OR** 

Pick a CSV dataset *(level 2)*:
- [LA Parking Tickets](https://drive.google.com/drive/folders/1vaOfwMi6QmZEGsXr8VM0ulPGzvTTBCgm?usp=drive_link)
- [Hotels](https://drive.google.com/drive/folders/1IpVFxgwBJvJHKoOuBsk6WK2qYqFYP4hi?usp=drive_link)

# My Question
### What is the probability that five books drawn from the trending page all have the letter "a" in their titles?

# My Answer

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
books = {'Title':[], 'Author':[], 'Year Published':[], 'Number of Logs':[]}

for i in range(1, 11):
    url = f'https://openlibrary.org/trending/forever?page={i}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    trends = soup.find_all('div', class_="sri__main")
    
    for book in trends:
        Title = book.find('div', class_='resultTitle').text.strip()
        Author = book.find('span', class_='bookauthor').text.strip()
        
        detailed = book.find('span', class_='resultStats')
        pubdate = detailed.find('span', class_='resultDetails').text.strip()
        
        pubdate_cleaned = pubdate.replace("—", "").replace(" editions", "").strip()

        # `string=` replaces the deprecated `text=` argument of find()
        logged_text = None
        log_info = book.find(string="Logged")
        if log_info:
            logged_text = log_info.strip()
            
        books['Title'].append(Title)
        books['Author'].append(Author)
        books['Year Published'].append(pubdate_cleaned)
        books['Number of Logs'].append(logged_text)

df = pd.DataFrame(books)

df



| | Title | Author | Year Published | Number of Logs |
|---|---|---|---|---|
| 0 | Atomic Habits | by James Clear | First published in 2016\n \n ... | |
| 1 | It Ends With Us | by Colleen Hoover | First published in 2012\n \n ... | |
| 2 | The 48 Laws of Power | by Robert Greene and Joost Elffers | First published in 1998\n \n ... | |
| 3 | The Subtle Art of Not Giving a F*ck | by Mark Manson | First published in 2016\n \n ... | |
| 4 | Um casamento arranjado | by Zana Kheiron | First published in 2019\n \n ... | |
| ... | ... | ... | ... | ... |
| 194 | Things Fall Apart | by Chinua Achebe | First published in 1958\n \n ... | |
| 195 | A child called "it" | by David J. Pelzer | First published in 1987\n \n ... | |
| 196 | The Titan's Curse | by Rick Riordan | First published in 2007\n \n ... | |
| 197 | A Wrinkle in Time | by Madeleine L'Engle | First published in 1962\n \n ... | |
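
The `Year Published` column above still holds the raw scraped text rather than a year. Here is a minimal sketch of one way to pull out the four-digit year, assuming each value contains a phrase like "First published in 2016". `extract_year` is a hypothetical helper, not part of the original notebook, and the sample strings are only modeled on the output shown.

```python
import re

def extract_year(raw_text):
    """Return the first four-digit year (1000-2099) in raw_text, or None."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", raw_text)
    return int(match.group(1)) if match else None

# Sample strings modeled on the scraped output (the exact page text is an assumption)
print(extract_year("First published in 2016\n \n 12 editions"))  # 2016
print(extract_year("no date listed"))                            # None
```

If this fits the real data, it could be applied with `df['Year Published'].apply(extract_year)`.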


In [56]:
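def calc_probability(n_books_with_a, total_books, draws=5):
    """Probability that `draws` books drawn without replacement all contain 'a'."""
    probability = 1
    for i in range(draws):
        # After i successful draws, i fewer 'a' titles and i fewer books remain
        probability *= (n_books_with_a - i) / (total_books - i)
    return probability

# Count the titles containing 'a' (case-insensitive) and the total number of books
books_with_a = int(df['Title'].str.lower().str.contains('a').sum())
total_books = len(df)

probability = calc_probability(books_with_a, total_books)
probability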

0.1686577189472376

<p style="color: #FFBB00; text-align: center;"><b>The probability that all five books I pull, without replacement, contain the letter "a" in the title is about 17%.
This is surprising to me, since "a" is one of the most common letters in English and in many other languages. However, my code only matches the plain letter "a", not accented variants, and judging by the dataframe, there seem to be books in other languages on the trending page.</b></p>
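
As a cross-check on the without-replacement result, the same probability can be written as a ratio of combinations (the hypergeometric form). This is a sketch under assumed counts: 199 books in total, as stated later in the notebook, and 140 titles containing "a", which is inferred from the empirical probability of 0.7035 rather than read from the data.

```python
from math import comb

# Assumed counts: 199 books in total, 140 of them with "a" in the title
# (140 is inferred from the empirical probability 0.7035... for 199 books)
total_books = 199
books_with_a = 140
draws = 5

# Ways to choose 5 books that all contain "a", divided by all ways to choose 5 books
p_all_a = comb(books_with_a, draws) / comb(total_books, draws)
print(f"{p_all_a:.4f}")  # 0.1687
```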

In [76]:
contains_a = df['Title'].apply(lambda x: 'a' in x.lower()).sum()

total_books = len(df)

p_empirical = contains_a / total_books

p_5_books = p_empirical ** 5

print(f"Empirical Probability — {p_empirical}")
print(f"5 books with 'a' in a row with replacement — {p_5_books}")

Empirical Probability — 0.7035175879396985
5 books with 'a' in a row with replacement — 0.17233551897604596


<p style="color: #BB88ED; text-align: center;"><b>There is about a 70% chance that a single book pulled from the trending page has an "a" in its title. This is expected, as "a" is a very common letter in English. However, the probability of pulling five such books in a row <i>with replacement</i> is much lower, about 17%. Even though most of the 199 books in the dataframe contain "a", all five pulls must succeed, so the single-pull chances multiply: 0.7035<sup>5</sup> ≈ 0.172.</b></p>
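
To see why the two sampling schemes give slightly different answers, it can help to print the chance of success at each individual draw. This is a sketch using assumed counts (140 of 199 titles containing "a", inferred from the outputs above), not values read from the dataframe.

```python
# Assumed counts, inferred from the notebook output: 140 of 199 titles contain "a"
books_with_a, total_books, draws = 140, 199, 5

p_with = 1.0
p_without = 1.0
for i in range(draws):
    factor_with = books_with_a / total_books                 # pool never changes
    factor_without = (books_with_a - i) / (total_books - i)  # pool shrinks each draw
    p_with *= factor_with
    p_without *= factor_without
    print(f"draw {i + 1}: with={factor_with:.4f}  without={factor_without:.4f}")

print(f"all five: with={p_with:.4f}  without={p_without:.4f}")
```

Without replacement, each successful draw removes an "a" title from the pool, so every later factor is a little smaller and the final product ends up slightly lower.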

In [73]:
import numpy as np

n_draws = 5

p_empirical = contains_a / total_books

std_dev_with_replacement = np.sqrt(n_draws * p_empirical * (1 - p_empirical))

std_dev_without_replacement = np.sqrt(n_draws * p_empirical * (1 - p_empirical) * (total_books - n_draws) / (total_books - 1))

print(f"Standard Deviation with Replacement — {std_dev_with_replacement}")
print(f"Standard Deviation without Replacement — {std_dev_without_replacement}")

Standard Deviation with Replacement — 1.0212262026583705
Standard Deviation without Replacement — 1.0108581554254064


<p style="color: #FF7721; text-align: center;"><b>The standard deviation of the number of books with "a" in the title, out of five pulls <i>with replacement</i>, is about 1.02 books. The expected number is 5 × 0.7035 ≈ 3.5 books, so a typical set of five pulls contains roughly 2.5 to 4.5 such books. This measures the spread of the count, not the 17% probability that all five contain "a".
<i>Without replacement</i>, the standard deviation is about 1.01 books. It is slightly smaller because the finite population correction, (199 - 5) / (199 - 1), shrinks the variance when books are not put back. In both cases, the chance that all five books contain "a" stays around 17%.</b></p>
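
The formulas above can be checked with a small Monte Carlo simulation of drawing five books without replacement. This is only a sketch: the population of 140 "a" titles out of 199 is inferred from the notebook outputs, and the seed and trial count are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population, inferred from the notebook output: 140 of 199 titles contain "a"
population = [True] * 140 + [False] * 59
n_draws, n_trials = 5, 100_000

# Number of "a" titles in each simulated set of five draws without replacement
counts = [sum(random.sample(population, n_draws)) for _ in range(n_trials)]

mean_count = statistics.mean(counts)
std_count = statistics.pstdev(counts)
p_all_five = counts.count(n_draws) / n_trials

print(f"mean count: {mean_count:.2f}")
print(f"std dev of count: {std_count:.2f}")
print(f"share of trials with all five: {p_all_five:.3f}")
```

The mean and standard deviation describe how many of the five books contain "a", while the last line estimates the probability that all five do.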

<p style="color: #635b9b; text-align: center;"><b>The probability of getting five books on the trending page with the letter "a" in the title is about 17%, both with replacement (17.2%) and without (16.9%). While there is a 70% chance that a single book drawn has "a" in its title, the chance that all five contain "a" is much lower, because every one of the five pulls has to succeed.
This analysis, however, only counts the plain English letter "a"; other forms such as á à â ã ä å ă ą are not included.
Note that I am unsure whether these all count as the same letter, but I do mean all variations of it.</b></p>
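
One possible way to count the accented forms as well is to decompose each title with Unicode normalization before searching. This is a sketch, not what the notebook does: `contains_any_a` is a hypothetical helper and the sample titles are made up for illustration.

```python
import unicodedata

def contains_any_a(title):
    """True if the title contains 'a' in plain or accented form."""
    # NFD splits a character like "á" into a plain "a" plus a combining accent mark
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "a" in decomposed

# Made-up sample titles, not taken from the scraped data
for sample in ["Atomic Habits", "Olá", "Über uns"]:
    print(sample, contains_any_a(sample))
```

If this fits the data, it could replace the lambda in the counting step, for example `df['Title'].apply(contains_any_a).sum()`. Ligatures such as æ do not decompose this way and would still be missed.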