Creating Data Frame
1. Collect the names of all countries per continent from English Wikipedia.

2. Create a countries-continents pandas DataFrame with two columns: country, continent.

3. Collect the happiness score, GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption per country in 2019 from English Wikipedia and put all collected information in a DataFrame.

4. Create a new DataFrame with all the information that you collected and save it as a CSV file.

Links:
https://en.wikipedia.org/wiki/World_Happiness_Report#2019_report
https://simple.wikipedia.org/wiki/List_of_countries_by_continents

Recommended libraries to use:
BeautifulSoup - https://www.crummy.com/software/BeautifulSoup/bs4/doc/  # for HTML parsing
requests - https://pypi.org/project/requests/  # for downloading the HTML of the Wikipedia pages

In [2]:
import re  # used by normalize_country below

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
r = requests.get("https://simple.wikipedia.org/wiki/List_of_countries_by_continents", headers={"User-Agent": "Mozilla/5.0"})

In [4]:
r.status_code

200

In [5]:
soup = BeautifulSoup(r.text, "html.parser")
soup.title  # show only the title; printing the whole soup dumps the full page HTML

<title>List of countries by continents - Simple English Wikipedia, the free encyclopedia</title>

In [None]:
# Optional: small mapping to improve joins (add more if you see mismatches).
# Target names use a straight apostrophe so they match the scraped
# continent table, which lists "Côte d'Ivoire".
COUNTRY_FIX = {
    "United States": "United States",
    "Russia": "Russia",
    "South Korea": "South Korea",
    "North Korea": "North Korea",
    "Czech Republic": "Czech Republic",
    "DR Congo": "Democratic Republic of the Congo",
    "Republic of the Congo": "Republic of the Congo",
    "Ivory Coast": "Côte d'Ivoire",
    "Cote d'Ivoire": "Côte d'Ivoire",
    "United Kingdom": "United Kingdom",
    "Great Britain": "United Kingdom",
    "Vatican City": "Vatican City",
}

In [7]:
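def fetch_soup(url: str) -> BeautifulSoup:
    """Download a page and return it parsed; raises on HTTP errors."""
    # requests has no default timeout, so set one to avoid hanging forever
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")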

In [8]:
def normalize_country(name: str) -> str:
    """Basic normalization for joining (keeps it simple)."""
    name = name.strip()
    name = re.sub(r"\s+", " ", name)
    return name

In [23]:
# -------------------------
# Task 1: Countries per continent (Simple Wikipedia)
# -------------------------
def scrape_countries_by_continent() -> dict:
    url = "https://simple.wikipedia.org/wiki/List_of_countries_by_continents"
    soup = fetch_soup(url)

    continent_ids = {
        "Africa": "Africa",
        "Asia": "Asia",
        "Europe": "Europe",
        "North America": "North_America",
        "South America": "South_America",
        "Oceania": "Oceania",
    }

    countries_by_continent = {c: [] for c in continent_ids.keys()}

    for cont_name, cont_id in continent_ids.items():
        h2 = soup.find("h2", id=cont_id)
        if not h2:
            continue
    
        table = h2.find_next("table", class_="wikitable")
        if not table:
            continue
    
        countries = []
        for r in table.find_all("tr")[1:]:
            tds = r.find_all("td")
            if len(tds) >= 3:
                name = tds[2].get_text(strip=True)
                if name and name not in countries:
                    countries.append(name)

        countries_by_continent[cont_name] = countries

    return countries_by_continent



In [24]:
countries_by_continent = scrape_countries_by_continent()
countries_by_continent

{'Africa': ['Algeria',
  'Angola',
  'Benin',
  'Botswana',
  'Burkina Faso',
  'Burundi',
  'Cameroon',
  'Cape Verde',
  'Central African Republic',
  'Chad',
  'Comoros',
  'Democratic Congo[n 1]',
  'Congo[n 2]',
  'Djibouti',
  'Egypt',
  'Equatorial Guinea',
  'Eritrea',
  'Eswatini',
  'Ethiopia',
  'Gabon',
  'Gambia',
  'Ghana',
  'Guinea[n 4]',
  'Guinea-Bissau',
  "Côte d'Ivoire",
  'Kenya',
  'Lesotho',
  'Liberia',
  'Libya',
  'Madagascar',
  'Malawi',
  'Mali',
  'Mauritania',
  'Mauritius',
  'Morocco',
  'Mozambique',
  'Namibia',
  'Niger',
  'Nigeria',
  'Rwanda',
  'São Tomé and Príncipe',
  'Senegal',
  'Seychelles',
  'Sierra Leone',
  'Somalia',
  'South Africa',
  'South Sudan',
  'Sudan',
  'Tanzania',
  'Togo',
  'Tunisia',
  'Uganda',
  'Zambia',
  'Zimbabwe'],
 'Asia': ['Afghanistan',
  'Armenia',
  'Azerbaijan[a][b]',
  'Bahrain',
  'Bangladesh',
  'Bhutan',
  'Brunei',
  'Cambodia',
  'China[b]',
  'Cyprus',
  'India',
  'Indonesia[a]',
  'Iran',
  'Iraq',
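
The scraped names above still carry Wikipedia footnote markers such as `[n 1]` or `[a][b]`, which prevents an exact-match join on the country name. A minimal sketch of a cleanup step, assuming the markers are always short bracketed tokens at the end of the name. `strip_footnotes` is a hypothetical helper, not part of the original notebook:

```python
import re

# Matches one or more trailing bracketed markers such as "[n 1]" or "[a][b]".
_FOOTNOTE_RE = re.compile(r"(?:\s*\[[^\[\]]{1,5}\])+\s*$")


def strip_footnotes(name: str) -> str:
    """Remove trailing footnote markers and collapse whitespace (hypothetical helper)."""
    name = _FOOTNOTE_RE.sub("", name)
    return re.sub(r"\s+", " ", name).strip()


# Sample names copied from the scraped output above
sample = ["Democratic Congo[n 1]", "Azerbaijan[a][b]", "China[b]", "Guinea-Bissau"]
cleaned = [strip_footnotes(n) for n in sample]
print(cleaned)
```

Applying a function like this inside the scraping loop, before appending each name, would keep the footnote text out of the country column.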

In [27]:
# -------------------------
# Task 2: DataFrame country, continent
# -------------------------
rows = []

for continent, countries in countries_by_continent.items():
    for country in countries:
        rows.append({
            "country": country,
            "continent": continent
        })

df_countries_continents = pd.DataFrame(rows)

df_countries_continents.head()

        country continent
0       Algeria    Africa
1        Angola    Africa
2         Benin    Africa
3      Botswana    Africa
4  Burkina Faso    Africa


In [29]:
# -------------------------
# Task 3: Happiness 2019 data from Wikipedia
# -------------------------
# requests does not send the "#2019_report" fragment to the server, so the
# whole page is downloaded and the table has to be located in the HTML.
url = "https://en.wikipedia.org/wiki/World_Happiness_Report#2019_report"
soup = fetch_soup(url)

# Take the first wikitable whose headers contain the expected columns.
# Caveat: this is the first match on the page, not necessarily the 2019 table.
# The results in this notebook (153 rows, Finland at 7.809) appear to match
# the 2020 report, so verify the year before relying on the data.
tables = soup.find_all("table", class_="wikitable")

target_table = None
for table in tables:
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    if (
        "Country or region" in headers
        and "Score" in headers
        and "GDP per capita" in headers
    ):
        target_table = table
        break

if target_table is None:
    raise ValueError("2019 Happiness table not found")

# Extract data rows
rows = []
for tr in target_table.find_all("tr")[1:]:  # skip header
    tds = tr.find_all("td")
    if len(tds) < 9:  # rank, country and seven metrics (indices 0-8)
        continue

    rows.append({
        "country": tds[1].get_text(strip=True),
        "happiness_score": float(tds[2].get_text(strip=True)),
        "gdp_per_capita": float(tds[3].get_text(strip=True)),
        "social_support": float(tds[4].get_text(strip=True)),
        "healthy_life_expectancy": float(tds[5].get_text(strip=True)),
        "freedom_life_choices": float(tds[6].get_text(strip=True)),
        "generosity": float(tds[7].get_text(strip=True)),
        "perceptions_corruption": float(tds[8].get_text(strip=True)),
    })

# Create DataFrame
df_happiness_2019 = pd.DataFrame(rows)

df_happiness_2019.head()

       country  happiness_score  gdp_per_capita  social_support  healthy_life_expectancy  freedom_life_choices  generosity  perceptions_corruption
0      Finland            7.809           1.285             1.5                    0.961                 0.662        0.16                   0.478
1      Denmark            7.646           1.327           1.503                    0.979                 0.665       0.243                   0.495
2  Switzerland             7.56           1.391           1.472                    1.041                 0.629       0.269                   0.408
3      Iceland            7.504           1.327           1.548                    1.001                 0.662       0.362                   0.145
4       Norway            7.488           1.424           1.495                    1.008                  0.67       0.288                   0.434
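
The header-matching loop above takes the first table on the page with the expected columns, which may belong to a different report year. A minimal sketch of anchoring the search on the section instead, assuming the 2019 section heading carries the id `2019_report` (taken from the URL fragment in the links above) and that the first `wikitable` after it is the 2019 table. `find_table_after_anchor` is a hypothetical helper, and the HTML below is a made-up stand-in for the real page:

```python
from bs4 import BeautifulSoup


def find_table_after_anchor(soup, anchor_id):
    """Return the first wikitable that follows the element with the given id."""
    anchor = soup.find(id=anchor_id)
    if anchor is None:
        raise ValueError(f"No element with id={anchor_id!r}")
    table = anchor.find_next("table", class_="wikitable")
    if table is None:
        raise ValueError(f"No wikitable found after id={anchor_id!r}")
    return table


# Made-up HTML with two similar tables; only the heading tells them apart.
demo_html = """
<h3 id="2020_report">2020 report</h3>
<table class="wikitable"><tr><th>Country or region</th><th>Score</th></tr>
<tr><td>A</td><td>1.0</td></tr></table>
<h3 id="2019_report">2019 report</h3>
<table class="wikitable"><tr><th>Country or region</th><th>Score</th></tr>
<tr><td>B</td><td>2.0</td></tr></table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_table = find_table_after_anchor(demo_soup, "2019_report")
first_country = demo_table.find("td").get_text(strip=True)
print(first_country)
```

On the real page the id and layout should be checked in the browser first, since Wikipedia section ids and table order can change between revisions.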


In [31]:
# -------------------------
# Task 4: Merge all info + save CSV
# -------------------------
df_final = pd.merge(
    df_happiness_2019,
    df_countries_continents,
    on="country",
    how="left"
)
df_final.head()

       country  happiness_score  gdp_per_capita  social_support  healthy_life_expectancy  freedom_life_choices  generosity  perceptions_corruption  continent
0      Finland            7.809           1.285             1.5                    0.961                 0.662        0.16                   0.478     Europe
1      Denmark            7.646           1.327           1.503                    0.979                 0.665       0.243                   0.495     Europe
2  Switzerland             7.56           1.391           1.472                    1.041                 0.629       0.269                   0.408     Europe
3      Iceland            7.504           1.327           1.548                    1.001                 0.662       0.362                   0.145     Europe
4       Norway            7.488           1.424           1.495                    1.008                  0.67       0.288                   0.434     Europe


In [32]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   country                  153 non-null    object 
 1   happiness_score          153 non-null    float64
 2   gdp_per_capita           153 non-null    float64
 3   social_support           153 non-null    float64
 4   healthy_life_expectancy  153 non-null    float64
 5   freedom_life_choices     153 non-null    float64
 6   generosity               153 non-null    float64
 7   perceptions_corruption   153 non-null    float64
 8   continent                125 non-null    object 
dtypes: float64(7), object(2)
memory usage: 10.9+ KB
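
The `info()` output above shows 125 non-null continent values out of 153 rows, so 28 countries found no match in the continent table. One way to list them is the `indicator=True` option of `merge`. The sketch below uses made-up toy frames (the scores are placeholders, not report values), and `name_fix` is a hypothetical mapping in the spirit of `COUNTRY_FIX`:

```python
import pandas as pd

# Toy stand-ins for the two scraped frames; values are placeholders.
left = pd.DataFrame({
    "country": ["Finland", "Ivory Coast", "China"],
    "score": [0.0, 0.0, 0.0],
})
right = pd.DataFrame({
    "country": ["Finland", "Côte d'Ivoire", "China"],
    "continent": ["Europe", "Africa", "Asia"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join.
check = left.merge(right, on="country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "country"].tolist()
print(unmatched)

# Rename the mismatched spellings on one side, then merge again.
name_fix = {"Ivory Coast": "Côte d'Ivoire"}
left["country"] = left["country"].replace(name_fix)
recheck = left.merge(right, on="country", how="left", indicator=True)
still_unmatched = recheck.loc[recheck["_merge"] == "left_only", "country"].tolist()
print(still_unmatched)
```

Running the same check on `df_happiness_2019` and `df_countries_continents` would show which names need an entry in the mapping.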


In [33]:
df_final.to_csv("./assignment_support/happiness_2019_full_dataset.csv", index=False)
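
`to_csv` raises an error if the target directory does not exist, so it may help to create `assignment_support` before saving. A minimal sketch using toy data and a temporary directory so that no real files are touched. The file name `demo.csv` is an assumption for illustration only:

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({"country": ["Finland"], "continent": ["Europe"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "assignment_support" / "demo.csv"
    # Create the folder (and any missing parents) before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(out_path, index=False)
    # Read the file back to confirm the columns survived the round trip.
    round_trip = pd.read_csv(out_path)

print(list(round_trip.columns))
```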