
# STOR 320: Introduction to Data Science
## Lab 6

**Name:**

**PID:**

In [None]:
from datetime import datetime
from bs4 import BeautifulSoup
from io import StringIO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [None]:
%pip install html5lib

# Scraping, Merging, and Analyzing Datasets for Countries (25 points)

**Background:** In data science, your data will often be split across several sources, some of which may be online. In this assignment, we will scrape country-level data from multiple websites, clean each dataset individually, and merge them. The website [Worldometers](https://www.worldometers.info/) contains country-level data that, once combined, may teach us interesting things about the world in which we live.

## 0. GDP by Country (7 Points)
The [Worldometer GDP](https://www.worldometers.info/gdp/gdp-by-country/) page contains 2022 GDP data published by the World Bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

### 0.0 Scraping the Data
We will walk through scraping the data from https://www.worldometers.info/gdp/gdp-by-country/ into a pandas DataFrame called `GDP`. You should end up with a DataFrame with 177 observations and 8 variables.

In [None]:
URL_GDP = "https://www.worldometers.info/gdp/gdp-by-country/"

# Send a GET request to the URL
response = requests.get(URL_GDP)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all tables and read into pandas DataFrame
tables = soup.find_all('table')

table_IO = StringIO(str(tables))
GDP = pd.read_html(table_IO, flavor='bs4', header=0)[0]  # Read the first table

GDP.shape

In [None]:
GDP.head(5)

### 0.1 Cleaning the Data (7 points)

Now that we scraped our data into a DataFrame, we need to clean it up. Perform the following tasks:

1. Remove the first ('#') and fourth ('GDP (abbrev.)') columns from the DataFrame.
2. Rename the columns 'GDP  (nominal, 2022)', 'GDP growth', 'Population  (2022)', 'GDP per capita', and 'Share of  World GDP' to 'GDP', 'Growth', 'Population', 'PerCapita', and 'Share', respectively.
3. Remove all dollar signs, percent signs, and commas from 'GDP', 'Growth', 'PerCapita', and 'Share'.
4. Convert the 'Country' column to a string dtype and the remaining columns to numeric dtypes. Hint: use `pd.to_numeric`.
5. Overwrite the original 'GDP' variable with a new variable of the same name that is in trillions of dollars rather than in dollars. Overwrite the original 'Population' variable with a new variable of the same name that is in millions of people rather than in individual people. You are rescaling the original variables to change their units without changing their names.

Be careful of the formatting and spacing in the original column names! Display the first five rows of the cleaned `GDP` DataFrame and the dtype info for `GDP`.
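The cleaning steps above can be sketched on a small toy table. This is only an illustration under stated assumptions: the country names and values are made up, only a subset of the real columns is shown, and the column labels merely mimic the scraped ones (check the actual labels with `GDP.columns` before relying on them).

```python
import pandas as pd

# Toy data with made-up values; labels only mimic the scraped table.
toy = pd.DataFrame({
    "#": [1, 2],
    "Country": ["Aland", "Borduria"],
    "GDP  (nominal, 2022)": ["$2,000,000,000,000", "$500,000,000,000"],
    "GDP growth": ["2.5%", "-1.0%"],
    "Population  (2022)": [50_000_000, 10_000_000],
})

# Drop a column by position, then rename (note the double spaces in the keys).
toy = toy.drop(columns=toy.columns[0])
toy = toy.rename(columns={
    "GDP  (nominal, 2022)": "GDP",
    "GDP growth": "Growth",
    "Population  (2022)": "Population",
})

# Strip '$', '%', and ',' with one regex, then convert the text to numbers.
for col in ["GDP", "Growth"]:
    cleaned = toy[col].str.replace(r"[$,%]", "", regex=True)
    toy[col] = pd.to_numeric(cleaned)

toy["Country"] = toy["Country"].astype("string")

# Rescale units in place: dollars -> trillions, people -> millions.
toy["GDP"] = toy["GDP"] / 1e12
toy["Population"] = toy["Population"] / 1e6

print(toy)
print(toy.dtypes)
```

Assigning back to the same column name changes the units without adding a new variable, which is what step 5 asks for.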



In [None]:
# Code Solution Here

## 1. Education Index Data by Country (3 Points)

Check out the [Wikipedia page](https://en.wikipedia.org/wiki/Education_Index), which contains the education index for all countries from 1990 to 2019.

### 1.0 Scraping the Education Index Data
The code below scrapes the data from https://en.wikipedia.org/wiki/Education_Index into a DataFrame called `EDU`.

In [None]:
# URL to fetch data from
URL_EDU = "https://en.wikipedia.org/wiki/Education_Index"

# Fetch the HTML content
response = requests.get(URL_EDU)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table and read it into a DataFrame
table = soup.find_all('table')[0]  # Assuming the first table is the one we want
table_IO = StringIO(str(table))
EDU = pd.read_html(table_IO, flavor='bs4', header=0)[0]

EDU.head(5)

### 1.1 Cleaning the Education Data (3 points)
Perform the following tasks to clean the `EDU` DataFrame:

1. Modify the resulting DataFrame `EDU` to only keep 2 variables: 1) the country’s name and 2) its education index from 2019.
2. Rename the variable named “2019” to “EDIndex”.
3. Update the dtype of 'Country' to a string.  

Display the first 5 rows of `EDU` and the info of `EDU` after making these changes.
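As a rough illustration of selecting and renaming columns, here is a sketch on a toy table with made-up values. It assumes the year columns are labeled with strings such as `"2019"`; in the scraped table the labels could be strings or integers, so inspect `EDU.columns` first.

```python
import pandas as pd

# Toy education table; names and values are hypothetical.
toy_edu = pd.DataFrame({
    "Country": ["Aland", "Borduria"],
    "1990": [0.50, 0.40],
    "2019": [0.80, 0.65],
})

# Keep two columns with a list of labels, then rename one of them.
toy_edu = toy_edu[["Country", "2019"]].rename(columns={"2019": "EDIndex"})
toy_edu["Country"] = toy_edu["Country"].astype("string")

print(toy_edu.head(5))
toy_edu.info()
```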

In [None]:
# Code Solution Here

## 2. Merging the Datasets (8 points)

Now we are going to merge the datasets. Read the instructions for each question carefully, and take particular care in this part of the assignment.

### 2.0 Joining GDP and EDU (2 Points)
The dataset `GDP` is our primary dataset. Create a new DataFrame `GDP_EDU` that brings the education data from `EDU` into the dataset `GDP` using a left join only. Display the first 12 rows of `GDP_EDU`.
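To see what a left join keeps and drops, consider this minimal sketch on two toy tables (all names and values are made up). Every row of the left table survives; rows that exist only in the right table are discarded, and left rows with no match get `NaN`.

```python
import pandas as pd

left = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascara"],
    "GDP": [2.0, 0.5, 0.1],
})
right = pd.DataFrame({
    "Country": ["Aland", "Cascara", "Dunwall"],
    "EDIndex": [0.80, 0.60, 0.70],
})

# how="left" keeps all rows of `left` in their original order.
merged = left.merge(right, on="Country", how="left")
print(merged)
```

Here "Borduria" has no partner in `right`, so its `EDIndex` is missing, while "Dunwall" does not appear in the result at all.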

In [None]:
# Code Solution Here

### 2.1 Missing Education Index (2 Points)

How many countries in `GDP_EDU` have missing values for Education Index? Show code that can be used to answer this question and then write your answer in complete sentences.
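One common way to count missing values, shown here on a toy table with made-up values, is to combine `isna()` with `sum()`. The same boolean mask can also list the affected rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascara", "Dunwall"],
    "EDIndex": [0.80, None, 0.60, None],
})

# isna() gives a boolean mask; summing it counts the True values.
n_missing = toy["EDIndex"].isna().sum()
missing_names = toy.loc[toy["EDIndex"].isna(), "Country"].tolist()

print(n_missing)
print(missing_names)
```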

In [None]:
# Code Solution Here

Answer:

### 2.2 Data Inspection (3 Points)
Closely inspect the original datasets and answer the following questions about `GDP_EDU` in complete sentences. You may use code if needed, but it is not required. Please show all work. If you don’t reference the appropriate dataset or you are not specific in your answers, you will get 0 points.
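If you choose to use code for the inspection, one possible approach is an outer merge with `indicator=True`, which labels each row by the table it came from. The sketch below uses toy tables with made-up names; it shows the general technique only and says nothing about the actual causes in the real data.

```python
import pandas as pd

left = pd.DataFrame({"Country": ["Aland", "Borduria", "Cote Azul"]})
right = pd.DataFrame({
    "Country": ["Aland", "Borduria (Republic of)", "Côte Azul"],
    "EDIndex": [0.80, 0.65, 0.70],
})

# The _merge column is 'both', 'left_only', or 'right_only'.
check = left.merge(right, on="Country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

Reading the `left_only` and `right_only` rows side by side makes spelling or naming differences between the two key columns easy to spot.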

#### 2.2.0 Why is there no education index for Iran in the dataset `GDP_EDU`? (1 Point)

Answer:

#### 2.2.1 Why is there no education index for State of Palestine in the dataset `GDP_EDU`? (1 Point)

Answer:

#### 2.2.2 Why is there no education index for Laos in the dataset `GDP_EDU`? (1 point)

Answer:

### 2.3 Removing NA Values (1 point)

Instead of replacing or dropping all the countries with missing values by hand, we will just drop all rows that are missing the Education Index to move forward with the analysis portion. Drop all rows from `GDP_EDU` that are null for `EDIndex`.
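A minimal sketch of dropping rows based on one column only, using a toy table with made-up values. The `subset` argument restricts the check to the named column, so missing values elsewhere do not cause a row to be dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascara"],
    "Growth": [2.5, None, 1.0],
    "EDIndex": [0.80, 0.65, None],
})

# Only rows with a missing EDIndex are removed; the missing Growth stays.
toy = toy.dropna(subset=["EDIndex"]).reset_index(drop=True)
print(toy)
```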



In [None]:
# Code Solution Here

## 3. Analyzing the Merged Dataset (12 points)

In these questions, find the answer using code, and then answer the question using complete sentences below the code.

### 3.0 Above Average GDP PerCapita (2 Points)
How many countries have a GDP per capita above the global average?
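The comparison can be sketched on toy data as follows. One assumption to note: this treats the "average" as the unweighted mean of the per-country values, which is not the same as a population-weighted world figure. The names and numbers are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascara", "Dunwall"],
    "PerCapita": [10000, 20000, 60000, 30000],
})

avg = toy["PerCapita"].mean()

# A strict comparison excludes values exactly equal to the mean.
above = toy[toy["PerCapita"] > avg]
print(avg, len(above))
```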

In [None]:
# Code Solution Here

Answer:

### 3.1 Highest GDP Growth Rate (4 Points)

*   Of the countries that have above average GDP PerCapita, what country has the highest GDP growth rate?
*   Of the countries that have below average GDP PerCapita, what country has the highest GDP growth rate?
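One way to find the top row within each half, sketched on toy data with made-up values, is to build a boolean mask and call `idxmax()` on each subset. `idxmax()` returns the index label of the largest value, which can then be used with `.loc` to look up the country.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascara", "Dunwall"],
    "PerCapita": [10000, 20000, 60000, 50000],
    "Growth": [7.0, 3.0, 1.5, 2.5],
})

is_above = toy["PerCapita"] > toy["PerCapita"].mean()

# idxmax() gives the row label of the maximum within each subset.
top_above = toy.loc[toy.loc[is_above, "Growth"].idxmax(), "Country"]
top_below = toy.loc[toy.loc[~is_above, "Growth"].idxmax(), "Country"]
print(top_above, top_below)
```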

In [None]:
# Code Solution Here

Answer:

### 3.2 Lowest Education Index (4 Points)

*   Of the countries that have above average GDP PerCapita, what country has the lowest education index?
*   Of the countries that have below average GDP PerCapita, what country has the lowest education index?
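An alternative to filtering each half separately is to group by a boolean flag and take `idxmin()` per group. The sketch below uses toy data with made-up values, and the `AboveAvg` column is a hypothetical helper, not part of the assignment's datasets.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascara", "Dunwall"],
    "PerCapita": [10000, 20000, 60000, 50000],
    "EDIndex": [0.45, 0.60, 0.90, 0.75],
})

# Hypothetical helper flag marking the above-average group.
toy["AboveAvg"] = toy["PerCapita"] > toy["PerCapita"].mean()

# idxmin() per group returns one row label for each flag value.
lowest_idx = toy.groupby("AboveAvg")["EDIndex"].idxmin()
lowest = toy.loc[lowest_idx, ["AboveAvg", "Country", "EDIndex"]]
print(lowest)
```

The grouped form answers both bullets in a single table, at the cost of adding a helper column.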

In [None]:
# Code Solution Here

Answer:

### 3.3 Critical Thinking (2 points)

State two additional questions you could answer with the merged dataset. Be creative. You do not need to find the answer, but are welcome to if you are curious.

Answer: