# Choose a Data Set

Create your own dataset by scraping one of the following websites *(level 5)*:
- [Wikipedia](https://www.wikipedia.org/)
- [OpenLibrary](https://openlibrary.org/)

**OR** 

Use data gathered from one of the following APIs *(level 4)*: 
- [TMDB](https://developer.themoviedb.org/reference/intro/getting-started)
- [College Scorecard](https://collegescorecard.ed.gov/data/api-documentation/)

**OR** 

Pick a JSON dataset *(level 3)*:
- [Food/Restaurant Data](https://drive.google.com/drive/folders/1V94S6WpclvQmbnW88KVMD4EruryA1oma?usp=drive_link)
- [Fashion Data](https://drive.google.com/drive/folders/1V8SbFjtRRW8WVf3xBzg0gzLjOtMhHea_?usp=drive_link)

**OR** 

Pick a CSV dataset *(level 2)*:
- [LA Parking Tickets](https://drive.google.com/drive/folders/1vaOfwMi6QmZEGsXr8VM0ulPGzvTTBCgm?usp=drive_link)
- [Hotels](https://drive.google.com/drive/folders/1IpVFxgwBJvJHKoOuBsk6WK2qYqFYP4hi?usp=drive_link)

# My Question
### What are the chances of pulling a book from the 1900s from the yearly trending section on OpenLibrary?

### Imagine you and your friend are gambling. Your friend bets that the next person they talk to has read a 1900s book this year. If your friend loses, he pays you 5 bucks. If the person has read a 1900s book this year, you pay him 6 bucks. What is the expected value of your profit?

# My Answer

In [16]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import requests
import random
from enum import Enum

In [2]:
pages = []
url = 'https://openlibrary.org/trending/yearly'
response = requests.get(url)
soup2 = BeautifulSoup(response.content, "html.parser")
pages.append(soup2)

In [3]:
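# Fetch additional pages. The page number comes from the loop variable, so each
# request targets a different page. soup2 is not reassigned here, so the cells
# below keep parsing the page fetched in the previous cell.
for page in range(1, 3):
    url = 'https://openlibrary.org/trending/yearly?page=' + str(page)
    response = requests.get(url)
    pages.append(BeautifulSoup(response.content, "html.parser"))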

In [4]:
book_info = {"Book Title": []}
titles = soup2.find_all('a', class_='results')

In [5]:
for title in titles:
    # .contents returns a list of child nodes, so each title is stored as a one-element list
    book_info["Book Title"].append(title.contents)

In [6]:
books = pd.DataFrame(book_info)

In [8]:
years = []
for element in soup2.find_all("span", class_="resultDetails"):
    # Keep the full text: "First published in <year>" plus the edition count
    year = element.get_text(strip=True)
    years.append(year)

# Store the publication details in a DataFrame
df1 = pd.DataFrame({"Publication Year": years})

In [9]:
authors = []
for element in soup2.find_all("span", class_="bookauthor"):
    raw_text = element.get_text(strip=True)
    # Assuming the text looks like "by John Doe"
    # Split on the first "by" only, so names that contain "by" stay intact
    author = raw_text.split("by", 1)[1].strip()
    authors.append(author)
# Store the authors in a DataFrame
df2 = pd.DataFrame({"Author": authors})

In [10]:
elements = soup2.find_all("div", class_="details")

# Initialize the list to store numbers
numbers = []

# Loop through all elements to extract numbers
for element in elements:
    text = element.get_text(strip=True)  # Get the text content from each element
    words = text.split()  # Split the text into a list of words

    # Keep every purely numeric word except "48", which appears to come from
    # the title "The 48 Laws of Power" rather than from a count
    for word in words:
        if word.isdigit() and word != "48":
            numbers.append(word)

# Print the list of numbers, excluding 48
print(numbers)

['11572', '11027', '9244', '5875', '5563', '5131', '4424', '3739', '3684', '3279', '3209', '3177', '3049', '3028', '3009', '2507', '2456', '2287', '1830', '1774']


In [11]:
df = pd.DataFrame(numbers, columns=["Times Logged In"])

In [12]:
trendingYearly = pd.concat([books, df1, df2, df], axis=1)

In [13]:
trendingYearly

| | Book Title | Publication Year | Author | Times Logged In |
|---|---|---|---|---|
| 0 | [Atomic Habits] | First published in 2016—41 editions | James Clear | 11572 |
| 1 | [Control Your Mind and Master Your Feelings] | First published in 2019—3 editions | Eric Robertson - undifferentiated | 11027 |
| 2 | [The 48 Laws of Power] | First published in 1998—52 editions | Robert GreeneandJoost Elffers | 9244 |
| 3 | [It Ends With Us] | First published in 2012—34 editions | Colleen Hoover | 5875 |
| 4 | [I Don't Love You Anymore] | First published in 2020—2 editions | Rithvik Singh | 5563 |
| 5 | [Um casamento arranjado] | First published in 2019—15 editions | Zana Kheiron | 5131 |
| 6 | [Twisted Love] | First published in 2021—13 editions | Ana Huang | 4424 |
| 7 | [Rich Dad, Poor Dad] | First published in 1990—79 editions | Robert T. KiyosakiandSharon L. Lechter | 3739 |
| 8 | [The Psychology of Money] | First published in 2020—9 editions | Morgan Housel | 3684 |
| 9 | [Haunting Adeline] | First published in 2021—9 editions | H. D. Carlton | 3279 |


In [53]:
trendingYearly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Book Title        20 non-null     object
 1   Publication Year  20 non-null     object
 2   Author            20 non-null     object
 3   Times Logged In   20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


#### My Theoretical Value for pulling a book from the 1900s is 14%, because...
##### 20-12=7 (12 comes from the amount of books that are in the 2000s)
##### 1/7 = .142857
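
A theoretical probability of this kind can also be read straight off the scraped table as (books first published in 1900-1999) / (all books). The sketch below shows one way to do that. It is only an illustration: the helper names `first_published_year` and `share_from_1900s` are hypothetical, and the list holds only the ten "Publication Year" strings displayed in the table above, so its result describes that sample rather than all 20 books.

```python
import re

# The ten "Publication Year" strings displayed in the table above (sample only).
publication_details = [
    "First published in 2016—41 editions",
    "First published in 2019—3 editions",
    "First published in 1998—52 editions",
    "First published in 2012—34 editions",
    "First published in 2020—2 editions",
    "First published in 2019—15 editions",
    "First published in 2021—13 editions",
    "First published in 1990—79 editions",
    "First published in 2020—9 editions",
    "First published in 2021—9 editions",
]


def first_published_year(detail):
    """Return the four-digit year from a 'First published in YYYY' string, or None."""
    match = re.search(r"First published in (\d{4})", detail)
    return int(match.group(1)) if match else None


def share_from_1900s(details):
    """Fraction of entries whose first-publication year falls in 1900-1999."""
    years = [first_published_year(d) for d in details]
    years = [y for y in years if y is not None]  # skip rows without a year
    if not years:
        raise ValueError("no publication years found")
    hits = sum(1 for y in years if 1900 <= y <= 1999)
    return hits / len(years)


print(share_from_1900s(publication_details))  # 0.2 for these ten rows
```

Because the function only iterates over strings, the full column could be passed in as `trendingYearly["Publication Year"]` to get the figure for all 20 books.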

In [52]:
rounds = 0
trials = 1000
publicationYear = 0
while rounds < trials:
    num = random.randint(1, 7)
    if num == 1:
        publicationYear += 1
    rounds += 1

print("Empirical Probability:", str((publicationYear/trials)))

Empirical Probability: 0.15


#### The empirical probability of pulling a book from the 1900s is around 15%.
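
An empirical estimate can also be made by drawing books at random from the scraped years instead of from a fixed range of integers. This is a minimal sketch under stated assumptions: `simulate_draws` and `sample_years` are hypothetical names, the years are the ten shown in the table above (not all 20 rows), and the fixed seed is there only to make the run repeatable.

```python
import random


def simulate_draws(years, trials=10_000, seed=0):
    """Estimate P(book is from the 1900s) by drawing books at random with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        year = rng.choice(years)  # pull one book uniformly at random
        if 1900 <= year <= 1999:
            hits += 1
    return hits / trials


# Years parsed from the ten rows displayed in the table above (sample only).
sample_years = [2016, 2019, 1998, 2012, 2020, 2019, 2021, 1990, 2020, 2021]

estimate = simulate_draws(sample_years)
print("Empirical probability:", estimate)
```

With more trials the estimate settles closer to the exact share of 1900s books in whatever list is passed in.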

In [57]:
def calculate_expected_value(outcomes, probabilities):
    """
    Calculate the expected value.

    Parameters:
    - outcomes (list): A list of numerical outcomes.
    - probabilities (list): A list of probabilities corresponding to the outcomes.

    Returns:
    - float: The expected value.
    """
    if len(outcomes) != len(probabilities):
        raise ValueError("The lengths of outcomes and probabilities must be the same.")
    
    if not abs(sum(probabilities) - 1) < 1e-6:  
        raise ValueError("Probabilities must sum to 1.")
    
    # Calculate the expected value
    expected_value = sum(outcome * probability for outcome, probability in zip(outcomes, probabilities))
    
    return expected_value

# Profit from my point of view: +5 if my friend loses the bet, -6 if he wins it
outcomes = [5, -6]
probabilities = [.85, .15]

In [59]:
expected_value = calculate_expected_value(outcomes, probabilities)
print(f"Expected Value: {expected_value:.2f}")

Expected Value: 3.35


***Describe analysis here.***

### My Theoretical Value for pulling a book from the 1900s is 14%, because...
#### 20-12=7 (12 comes from the amount of books that are in the 2000s)
#### 1/7 = .142857

### The empirical probability of pulling a book from the 1900s is around 15%.

***Describe analysis here.***

### In this little gambling session, I am not guaranteed 5 dollars, but the bet favors me, since my friend only has around a 15% chance of winning it.
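
Written out, and assuming the profit is counted from my side (+5 when my friend loses, -6 when he wins) with the roughly 15% empirical probability from above, the expected value is:

$$
E[\text{profit}] = 0.85 \cdot 5 + 0.15 \cdot (-6) = 4.25 - 0.90 = 3.35
$$

Under these assumptions the bet is worth about 3.35 dollars per round on average, even though any single round can still cost me 6 dollars.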