# Choose a Data Set

Create your own dataset by scraping one of the following websites *(level 5)*:
- [Wikipedia](https://www.wikipedia.org/)
- [OpenLibrary](https://openlibrary.org/)

**OR** 

Use data gathered from one of the following APIs *(level 4)*: 
- [TMDB](https://developer.themoviedb.org/reference/intro/getting-started)
- [College Scorecard](https://collegescorecard.ed.gov/data/api-documentation/)

**OR** 

Pick a JSON dataset *(level 3)*:
- [Food/Restaurant Data](https://drive.google.com/drive/folders/1V94S6WpclvQmbnW88KVMD4EruryA1oma?usp=drive_link)
- [Fashion Data](https://drive.google.com/drive/folders/1V8SbFjtRRW8WVf3xBzg0gzLjOtMhHea_?usp=drive_link)

**OR** 

Pick a CSV dataset *(level 2)*:
- [LA Parking Tickets](https://drive.google.com/drive/folders/1vaOfwMi6QmZEGsXr8VM0ulPGzvTTBCgm?usp=drive_link)
- [Hotels](https://drive.google.com/drive/folders/1IpVFxgwBJvJHKoOuBsk6WK2qYqFYP4hi?usp=drive_link)

# My Question
While exploring Europe, two friends make a bet. Considering only people from the 15 most populous cities in Europe, if the next 2 people that Person A talks to are both from Madrid, then Person B will pay Person A 1000 dollars; otherwise, Person A will have to pay Person B 20 dollars. What is Person A's expected profit from the bet?

# My Answer

***Imports***

In [73]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import random
import seaborn as sns
import numpy as np

In [2]:
def getScraped(url : str, htmlIdentifier : str):
    # Download the page and return every element with the given tag name.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    return soup.find_all(htmlIdentifier)

def getScrappedClass(url : str, htmlIdentifier : str, className : str):
    # Same as getScraped, but keep only elements with the given CSS class.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    return soup.find_all(htmlIdentifier, class_=className)

In [3]:
EuropeCities = getScraped("https://en.wikipedia.org/wiki/List_of_European_cities_by_population_within_city_limits", "tbody")

This scrapes the Wikipedia page that lists European cities by population within city limits.

In [34]:
CityNames = []
CityCountry = []
CityPopulation = []
CityDate = []

for city in EuropeCities:
    for name in city.find_all("td"):
        # Links with exactly two attributes are treated as city or country names.
        cityName = name.find("a")
        if cityName is not None and len(cityName.attrs) == 2:
            # The links alternate: city name first, then its country.
            if len(CityNames) == len(CityCountry) and len(CityNames) < 15:
                CityNames.append(cityName.text)
            elif len(CityCountry) < 15:
                CityCountry.append(cityName.text)
        # Spans hold the population and the date; skip empty and "[" spans.
        CityData = name.find("span")
        if CityData is not None:
            text = CityData.text.strip()
            if text not in ("", "["):
                if len(CityPopulation) == len(CityDate) and len(CityPopulation) < 15:
                    CityPopulation.append(text.replace(",", ""))
                elif len(CityDate) < 15:
                    CityDate.append(text)

This organizes the scraped data, albeit crudely, into four lists: each city's name, its country, its population, and the date the population figure was last checked.
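The population cleanup inside the loop can also be looked at in isolation. The sketch below is not part of the notebook: `parse_population` is a hypothetical helper, and it assumes the scraped text is either a comma-grouped number such as `15,655,924` or a stray fragment such as `[`.

```python
from typing import Optional

def parse_population(text: str) -> Optional[int]:
    """Convert scraped text such as '15,655,924' to an int, or return None."""
    cleaned = text.strip().replace(",", "")
    # isdecimal() rejects empty strings and stray fragments such as "[".
    return int(cleaned) if cleaned.isdecimal() else None

print(parse_population(" 15,655,924 "))
print(parse_population("["))
```

Returning `None` for unusable text plays the same role as `errors="coerce"` in `pd.to_numeric`: bad values are flagged instead of raising an error.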

In [36]:
cities = {
    "Name" : CityNames,
    "Country" : CityCountry,
    "Population" : CityPopulation,
    "Checked On" : CityDate
}

df = pd.DataFrame(cities)
df["Population"] = pd.to_numeric(df["Population"], errors="coerce")
df


| | Name | Country | Population | Checked On |
|---|---|---|---|---|
| 0 | Istanbul | Turkey | 15655924 | 31 Dec 2023 |
| 1 | Moscow | Russia | 13149803 | 1 Jan 2024 |
| 2 | London | United Kingdom | 8866180 | 30 Jun 2022 |
| 3 | Saint Petersburg | Russia | 5597763 | 1 Jan 2024 |
| 4 | Berlin | Germany | 3755251 | 31 Dec 2022 |
| 5 | Madrid | Spain | 3332035 | 1 Jan 2023 |
| 6 | Kyiv | Ukraine | 2952301 | 1 Jan 2022 |
| 7 | Rome | Italy | 2754719 | 1 Jan 2024 |
| 8 | Baku | Azerbaijan | 2344900 | 1 Jan 2024 |
| 9 | Paris | France | 2087577 | 1 Jan 2024 |


This gives us a data set we can work with. We'll focus on Madrid for the first part of the question.

In [39]:
overallPopulation = sum(df["Population"])
overallPopulation

69981692

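This gives us the overall population, which lets us calculate the theoretical probability that a randomly chosen person is from Madrid. We assume the two people are chosen independently, so the probability for two people is the square of the probability for one.
1 person: 3332035/69981692 ≈ 0.0476. 2 people: 3332035^2 / 69981692^2 ≈ 0.0023, which rounds to 0.002.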
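The arithmetic above can be checked in a few lines. This is a minimal sketch using the Madrid population and the overall total from the notebook; the variable names are chosen here for illustration, and the two draws are assumed to be independent.

```python
madrid_population = 3332035
overall_population = 69981692

# Probability that one randomly chosen person is from Madrid.
p_one = madrid_population / overall_population
# Probability that two independently chosen people are both from Madrid.
p_two = p_one ** 2

print(f"One person from Madrid: {p_one:.4f}")
print(f"Two people from Madrid: {p_two:.4f}")
```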
Now let's put this in a simulation

In [71]:
trials = 1000
both = 0
for _ in range(trials):
    madrid = 0
    for i in range(2):
        # Pick one person uniformly at random from the combined population.
        num = random.randint(1, overallPopulation)
        # Numbers 1 through 3332035 stand for Madrid's residents.
        if num <= 3332035:
            madrid += 1
    if madrid == 2:
        both += 1

EmpiricalProb = both/trials
print("Empirical Probability:", str(EmpiricalProb))

Empirical Probability: 0.002


The empirical probability is consistent with our theoretical probability: there is only about a 0.2% chance that both people are from Madrid. Now let's calculate the probability that Person A loses.
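With only 1000 trials, the winning outcome is expected to show up just two or three times, so the estimate can move noticeably between runs. A rough sketch of the same experiment with more trials and a fixed seed follows. The trial count, the seed, the `simulate` helper, and the use of `random.random()` in place of `random.randint` are all assumptions made for illustration, not the notebook's method.

```python
import random

P_MADRID = 3332035 / 69981692  # chance that one random person is from Madrid
TRIALS = 200_000               # assumed trial count for this sketch

def simulate(trials: int, seed: int = 0) -> float:
    """Estimate the chance that two independent draws are both from Madrid."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    both = 0
    for _ in range(trials):
        # random() is uniform on [0, 1), so each comparison
        # succeeds with probability P_MADRID.
        if rng.random() < P_MADRID and rng.random() < P_MADRID:
            both += 1
    return both / trials

estimate = simulate(TRIALS)
p_two = P_MADRID ** 2
# Standard error of an estimated proportion: sqrt(p * (1 - p) / n).
std_error = (p_two * (1 - p_two) / TRIALS) ** 0.5
print(f"estimate={estimate:.5f}, theory={p_two:.5f}, std_error={std_error:.5f}")
```

The standard error gives a sense of how far the estimate is likely to sit from the theoretical value at a given number of trials.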

In [82]:
def lossChance(theoreticalProb : float) -> float:
    return 1 - theoreticalProb

loss = lossChance(EmpiricalProb)
loss

0.998

Person B has a 99.8% chance to win this bet.

In [81]:
(EmpiricalProb * 1000) - (loss * 20)

-17.96

This yields an expected value of -17.96 dollars for Person A. Person A is pretty much going to lose.
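The expected value calculation can be wrapped in a small function so it is easy to rerun with a different probability. This is a sketch, not the notebook's code: `expected_profit` and its parameter names are hypothetical. It compares the rounded empirical probability (0.002) with the unrounded theoretical one.

```python
def expected_profit(p_win: float, payout: float = 1000, stake: float = 20) -> float:
    """Expected profit for Person A: gain `payout` with probability p_win, else lose `stake`."""
    return p_win * payout - (1 - p_win) * stake

p_theory = (3332035 / 69981692) ** 2

ev_empirical = expected_profit(0.002)   # probability from the 1000-trial simulation
ev_theory = expected_profit(p_theory)   # unrounded theoretical probability
print(round(ev_empirical, 2))
print(round(ev_theory, 2))
```

With the unrounded theoretical probability the expected value comes out near -17.69 dollars, slightly better for Person A than -17.96 but still clearly negative. The bet would only break even if the winning probability reached 20/1020, about 0.0196.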

## Analysis

This bet between two friends gives Person A practically no chance to win. Person A has a theoretical probability of about 0.2% of winning, which matched the result of the empirical simulation, while Person B has a theoretical probability of about 99.8% of winning. The odds heavily favor Person B, but because of the high payout if Person A wins, the expected value for Person A is -17.96 dollars rather than the full -20. Person A is almost certainly going to have to pay 20 dollars.