
# STOR 320: Introduction to Data Science
## Lab 6

**Name: Conor Jones**

**PID: 730665579**

In [2]:
from datetime import datetime
from bs4 import BeautifulSoup
from io import StringIO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [94]:
%pip install html5lib

Note: you may need to restart the kernel to use updated packages.




# Scraping, Merging, and Analyzing Datasets for Countries (25 points)

**Background:** In data science, data is often split across several sources, some of which are online. In this assignment, we will web-scrape country-level data from multiple websites, clean each dataset individually, and merge them. The website [Worldometers](https://www.worldometers.info/) hosts country-level data that, once connected, can tell us interesting things about the world in which we live.

## 0. GDP by Country (7 Points)
The [Worldometer GDP](https://www.worldometers.info/gdp/gdp-by-country/) page contains GDP data from 2022 published by the World Bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

### 0.0 Scraping the Data
We will walk through webscraping the data from https://www.worldometers.info/gdp/gdp-by-country/ using Pandas into a DataFrame called GDP. You should end up with a new object called GDP which is a DataFrame with 177 observations and 8 variables.

In [24]:
URL_GDP = "https://www.worldometers.info/gdp/gdp-by-country/"

# Send a GET request to the URL
response = requests.get(URL_GDP)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all tables and read into pandas DataFrame
tables = soup.find_all('table')

table_IO = StringIO(str(tables))
GDP = pd.read_html(table_IO, flavor='bs4', header=0)[0]  # Read the first table

GDP.shape

(177, 8)

In [4]:
GDP.head(5)

Unnamed: 0,#,Country,"GDP (nominal, 2022)",GDP (abbrev.),GDP growth,Population (2022),GDP per capita,Share of World GDP
0,1,United States,"$25,462,700,000,000",$25.463 trillion,2.06%,341534046,"$74,554",25.32%
1,2,China,"$17,963,200,000,000",$17.963 trillion,2.99%,1425179569,"$12,604",17.86%
2,3,Japan,"$4,231,140,000,000",$4.231 trillion,1.03%,124997578,"$33,850",4.21%
3,4,Germany,"$4,072,190,000,000",$4.072 trillion,1.79%,84086227,"$48,429",4.05%
4,5,India,"$3,385,090,000,000",$3.385 trillion,7.00%,1425423212,"$2,375",3.37%


### 0.1 Cleaning the Data (7 points)

Now that we scraped our data into a DataFrame, we need to clean it up. Perform the following tasks:

1.   Remove the first ('#') and fourth ('GDP (abbrev.)') columns from the DataFrame.
2.   Rename the columns 'GDP  (nominal, 2022)', 'GDP growth', 'Population  (2022)', 'GDP per capita', and 'Share of  World GDP' to 'GDP', 'Growth', 'Population', 'PerCapita', and 'Share', respectively.
3.   Remove all dollar signs, percent signs, and commas from 'GDP', 'Growth', 'PerCapita', and 'Share'.
4.  Update column data type of "Country" to be a string dtype and the remaining columns to be numeric. Hint: use `pd.to_numeric`
5. Rewrite over the original 'GDP' variable with a new variable called 'GDP' that is in trillions of dollars rather than in actual dollars. Rewrite over the original 'Population' variable with a new variable of the same name that is in millions of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names.

Be careful of the formatting and spacing in the original column names! Display the first five rows of the cleaned `GDP` DataFrame and the dtype info for `GDP`.



In [25]:
# Remove the first ('#') and fourth ('GDP (abbrev.)') columns from the DataFrame.
GDP.drop(['#', 'GDP  (abbrev.)'], axis=1, inplace=True)

In [26]:
# Rename the columns 'GDP  (nominal, 2022)', 'GDP growth', 'Population  (2022)', 'GDP per capita', 
# and 'Share of  World GDP' to 'GDP', 'Growth', 'Population', 'PerCapita', and 'Share', respectively.
GDP.columns = ['Country', 'GDP', 'Growth', 'Population', 'PerCapita', 'Share']


In [27]:
#Remove all dollar signs, percent signs, and commas from 'GDP', 'Growth', 'PerCapita', and 'Share'.
GDP['GDP'] = GDP['GDP'].str.replace('[^0-9]', '', regex=True)
GDP['Growth'] = GDP['Growth'].str.replace('[^0-9]', '', regex=True)
GDP['PerCapita'] = GDP['PerCapita'].str.replace('[^0-9]', '', regex=True)
GDP['Share'] = GDP['Share'].str.replace('[^0-9]', '', regex=True)
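
The `[^0-9]` pattern used above deletes every character that is not a digit, so it also drops decimal points and minus signs: `2.06%` in the scraped table shows up as `206` after cleaning, and any negative growth rate would lose its sign. A minimal sketch of a narrower pattern follows, assuming only `$`, `,`, and `%` need to be removed. The sample values are hypothetical stand-ins in the same format as the scraped columns.

```python
import pandas as pd

# Hypothetical sample values in the same format as the scraped columns
sample = pd.DataFrame({
    "GDP": ["$25,462,700,000,000", "$160,503,000,000"],
    "Growth": ["2.06%", "-29.10%"],
    "Share": ["25.32%", "0.16%"],
})

# Strip only '$', ',' and '%'; '.' and '-' are left in place
for col in ["GDP", "Growth", "Share"]:
    stripped = sample[col].str.replace(r"[$,%]", "", regex=True)
    sample[col] = pd.to_numeric(stripped)

print(sample)
```

With this pattern, `Growth` and `Share` stay in percent units (2.06 rather than 206), which matters for any later comparison of growth rates.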

In [28]:
# Update column data type of "Country" to be a string dtype and the remaining columns to be numeric. Hint: use `pd.to_numeric`
GDP["Country"] = GDP["Country"].astype(str)
GDP['GDP'] = pd.to_numeric(GDP['GDP'])
GDP['Growth'] = pd.to_numeric(GDP['Growth'])
GDP['Population'] = pd.to_numeric(GDP['Population'])
GDP['PerCapita'] = pd.to_numeric(GDP['PerCapita'])
GDP['Share'] = pd.to_numeric(GDP['Share'])
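
Two optional refinements to the type conversion may be worth knowing. The sketch below uses hypothetical values and is not part of the graded solution: `errors="coerce"` makes `pd.to_numeric` return `NaN` for entries it cannot parse instead of raising, and `astype("string")` requests pandas' dedicated string dtype.

```python
import pandas as pd

# Hypothetical column with one entry that cannot be parsed as a number
raw = pd.Series(["74554", "12604", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# "string" requests pandas' dedicated string dtype
countries = pd.Series(["United States", "China"]).astype("string")

print(numeric)
print(countries.dtype)
```

Checking `numeric.isna().sum()` after a coerced conversion is a quick way to see how many entries failed to parse.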

In [29]:
#Rewrite over the original 'GDP' variable with a new variable called 'GDP' that is in trillions of dollars rather than
#in actual dollars. Rewrite over the original 'Population' variable with a new variable of the same name that is in millions 
# of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names.

GDP['GDP'] = GDP['GDP']/1_000_000_000_000
GDP['Population'] = GDP['Population']/1_000_000

In [30]:
GDP.head()

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share
0,United States,25.4627,206,341.534046,74554,2532
1,China,17.9632,299,1425.179569,12604,1786
2,Japan,4.23114,103,124.997578,33850,421
3,Germany,4.07219,179,84.086227,48429,405
4,India,3.38509,700,1425.423212,2375,337


## 1. Education Index Data by Country (3 Points)

Check out the [Wikipedia page](https://en.wikipedia.org/wiki/Education_Index), which contains the education index for all countries from 1990 to 2019.

### 1.0 Scraping the Education Index Data
The provided code scrapes the data from https://en.wikipedia.org/wiki/Education_Index into a DataFrame called `EDU`.

In [11]:
# URL to fetch data from
URL_EDU = "https://en.wikipedia.org/wiki/Education_Index"

# Fetch the HTML content
response = requests.get(URL_EDU)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table and read it into a DataFrame
table = soup.find_all('table')[0]  # Assuming the first table is the one we want
table_IO = StringIO(str(table))
EDU = pd.read_html(table_IO, flavor='bs4', header=0)[0]

EDU.head(5)

Unnamed: 0,Country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Afghanistan,0.122,0.133,0.145,0.156,0.168,0.179,0.19,0.202,0.213,...,0.372,0.374,0.39,0.398,0.403,0.405,0.406,0.408,0.413,0.414
1,Albania,0.583,0.588,0.557,0.542,0.528,0.55,0.557,0.569,0.579,...,0.671,0.714,0.739,0.749,0.758,0.753,0.745,0.747,0.743,0.746
2,Algeria,0.385,0.395,0.405,0.414,0.424,0.431,0.443,0.458,0.473,...,0.626,0.644,0.639,0.639,0.652,0.659,0.66,0.665,0.668,0.672
3,Andorra,,,,,,,,,,...,0.67,0.671,0.724,0.714,0.725,0.718,0.722,0.713,0.72,0.72
4,Angola,,,,,,,,,,...,0.398,0.423,0.435,0.447,0.46,0.472,0.487,0.498,0.5,0.5



### 1.1 Cleaning the Education Data (3 points)
Perform the following tasks to clean the `EDU` DataFrame:

1. Modify the resulting DataFrame `EDU` to only keep 2 variables: 1) the country’s name and 2) its education index from 2019.
2. Rename the variable named “2019” to “EDIndex”.
3. Update the dtype of 'Country' to a string.  

Display the first 5 rows of `EDU` and the info of `EDU` after making these changes.

In [19]:
# Take an explicit copy so later assignments do not trigger SettingWithCopyWarning
EDU_clean = EDU[['Country', '2019']].copy()

# rename returns a new DataFrame; reassign instead of using inplace=True
EDU_clean = EDU_clean.rename(columns={'2019': 'EDIndex'})

EDU_clean['Country'] = EDU_clean['Country'].astype(str)
EDU_clean.info()
EDU_clean.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Country  189 non-null    object 
 1   EDIndex  189 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.1+ KB

       Country  EDIndex
0  Afghanistan    0.414
1      Albania    0.746
2      Algeria    0.672
3      Andorra    0.720
4       Angola    0.500

## 2. Merging the Datasets (8 points)

Now, we are going to merge the datasets for maximum gains. Make sure you carefully read the instructions for each question. Be very careful in this part of the assignment.

### 2.0 Joining GDP and EDU (2 Points)
The dataset `GDP` is our primary dataset. Create a new DataFrame `GDP_EDU` that brings the education data from `EDU` into the dataset `GDP` using a left join only. Display the first 12 rows of `GDP_EDU`.

In [31]:
GDP_EDU = pd.merge(GDP, EDU_clean, how='left', on='Country')

GDP_EDU.head(12)


Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share,EDIndex
0,United States,25.4627,206,341.534046,74554,2532,0.9
1,China,17.9632,299,1425.179569,12604,1786,0.862
2,Japan,4.23114,103,124.997578,33850,421,0.851
3,Germany,4.07219,179,84.086227,48429,405,0.943
4,India,3.38509,700,1425.423212,2375,337,0.555
5,United Kingdom,3.07067,410,68.179315,45038,305,0.948
6,France,2.78291,256,66.277409,41989,277,0.817
7,Russia,2.24042,207,145.579899,15390,223,
8,Canada,2.13984,340,38.821259,55120,213,0.894
9,Italy,2.01043,367,59.619115,33721,200,0.793
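
To see which rows of the left table found no partner, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The following is a minimal sketch on hypothetical miniature tables; the country names `Freedonia` and `Republic of Freedonia` and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
gdp = pd.DataFrame({
    "Country": ["Japan", "Freedonia", "Italy"],
    "GDP": [4.2, 0.5, 2.0],
})
edu = pd.DataFrame({
    "Country": ["Japan", "Republic of Freedonia", "Italy"],
    "EDIndex": [0.85, 0.70, 0.79],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = pd.merge(gdp, edu, how="left", on="Country", indicator=True)

# Rows marked 'left_only' had no exact match on 'Country' in the right table
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)  # ['Freedonia']
```

Because the join key is compared as an exact string, a country spelled differently in the two sources is reported as `left_only` even though it appears in both.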


### 2.1 Missing Education Index (2 Points)

How many countries in `GDP_EDU` have missing values for Education Index? Show code that can be used to answer this question and then write your answer in complete sentences.

In [32]:
missing_EDU = GDP_EDU['EDIndex'].isna().sum()
missing_EDU

19

There are 19 countries in GDP_EDU that have missing values for the Education Index.

### 2.2 Data Inspection (3 Points)
Closely inspect the original datasets and answer the following questions about GDP_EDU in complete sentences. You can use the code if needed, but it is not required. Please show all work. If you don’t reference the appropriate dataset or you are not specific in your answers, you will get 0 points.

#### 2.2.0 Why is there no education index for Iran in the dataset `GDP_EDU`? (1 Point)

Iran's education index is likely missing from `GDP_EDU` because the Wikipedia source table (`EDU`) may not contain a 2019 education index for Iran.

#### 2.2.1 Why is there no education index for State of Palestine in the dataset `GDP_EDU`? (1 Point)

The State of Palestine might not have a 2019 education index reported in the Wikipedia table (`EDU`), whether for political reasons or for lack of available data.

#### 2.2.2 Why is there no education index for Laos in the dataset `GDP_EDU`? (1 point)

Laos's education index is likely missing from `GDP_EDU` because the Wikipedia table (`EDU`) may not report a 2019 education index for Laos.
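
The three answers above are hypotheses that can be checked against `EDU` directly. A left join on `Country` matches only exact spellings, so a country can end up without an `EDIndex` either because it is absent from the education table or because it is listed there under a different name. The sketch below shows one way to tell the two cases apart; the helper `find_candidates` and the table contents are hypothetical.

```python
import pandas as pd

# Hypothetical education table; the names below are invented for illustration
edu = pd.DataFrame({
    "Country": ["Freedonia (Republic of)", "Sylvania", "Osterlich"],
    "EDIndex": [0.70, 0.65, 0.80],
})


def find_candidates(table: pd.DataFrame, keyword: str) -> list:
    """Return country names in `table` that contain `keyword`, ignoring case."""
    mask = table["Country"].str.contains(keyword, case=False, regex=False)
    return table.loc[mask, "Country"].tolist()


print(find_candidates(edu, "freedonia"))  # ['Freedonia (Republic of)']
print(find_candidates(edu, "ruritania"))  # []
```

A non-empty result points to a naming mismatch between the two sources, while an empty result suggests the country is not in the education table at all.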

### 2.3 Removing NA Values (1 point)

Instead of replacing or dropping all the countries with missing values by hand, we will just drop all rows that are missing the Education Index to move forward with the analysis portion. Drop all rows from `GDP_EDU` that are null for `EDIndex`.



In [33]:
GDP_EDU_clean = GDP_EDU.dropna(subset=['EDIndex'])
GDP_EDU_clean.head(5)

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share,EDIndex
0,United States,25.4627,206,341.534046,74554,2532,0.9
1,China,17.9632,299,1425.179569,12604,1786,0.862
2,Japan,4.23114,103,124.997578,33850,421,0.851
3,Germany,4.07219,179,84.086227,48429,405,0.943
4,India,3.38509,700,1425.423212,2375,337,0.555


## 3. Analyzing the Merged Dataset (12 points)

In these questions, find the answer using code, and then answer the question using complete sentences below the code.

### 3.0 Above Average GDP PerCapita (2 Points)
How many countries have a GDP per capita above the global average?

In [34]:
global_avg_gdp_percapita = GDP_EDU_clean['PerCapita'].mean()
above_avg_gdp_countries = GDP_EDU_clean[GDP_EDU_clean['PerCapita'] > global_avg_gdp_percapita].shape[0]
above_avg_gdp_countries


49

There are 49 countries with a GDP per capita above the global average.
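
The average computed above is the unweighted mean of the per-country values, which is not the same as total GDP divided by total population. The sketch below contrasts the two on hypothetical numbers; the countries `A`, `B`, `C` and their figures are invented, with units matching the cleaned table (trillions of dollars, millions of people).

```python
import pandas as pd

# Hypothetical figures: GDP in trillions of dollars, Population in millions
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP": [2.0, 0.5, 0.03],
    "Population": [40.0, 100.0, 1.0],
})
df["PerCapita"] = df["GDP"] * 1e12 / (df["Population"] * 1e6)

# Unweighted mean: every country counts equally
unweighted_mean = df["PerCapita"].mean()

# Population-weighted mean: total GDP over total population
weighted_mean = df["GDP"].sum() * 1e12 / (df["Population"].sum() * 1e6)

# Summing a boolean mask counts the rows above the threshold
n_above = int((df["PerCapita"] > unweighted_mean).sum())
print(round(unweighted_mean, 2), round(weighted_mean, 2), n_above)
```

Which definition is appropriate depends on the question being asked; the count can differ noticeably between the two when populous countries have low per-capita values.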

### 3.1 Highest GDP Growth Rate (4 Points)

*   Of the countries that have above average GDP PerCapita, what country has the highest GDP growth rate?
*   Of the countries that have below average GDP PerCapita, what country has the highest GDP growth rate?

In [35]:
above_avg_gdp = GDP_EDU_clean[GDP_EDU_clean['PerCapita'] > global_avg_gdp_percapita]
highest_growth_above = above_avg_gdp.loc[above_avg_gdp['Growth'].idxmax()]

below_avg_gdp = GDP_EDU_clean[GDP_EDU_clean['PerCapita'] <= global_avg_gdp_percapita]
highest_growth_below = below_avg_gdp.loc[below_avg_gdp['Growth'].idxmax()]

highest_growth_above, highest_growth_below


(Country         Guyana
 GDP           0.015358
 Growth            5780
 Population    0.821637
 PerCapita        18691
 Share                2
 EDIndex          0.601
 Name: 127, dtype: object,
 Country         Ukraine
 GDP            0.160503
 Growth             2910
 Population    41.048766
 PerCapita          3910
 Share                16
 EDIndex           0.799
 Name: 57, dtype: object)

The country with the highest GDP growth rate among those with above-average GDP per capita is Guyana.
The country with the highest GDP growth rate among those with below-average GDP per capita is Ukraine.
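
The same selection can be written in one pass with `groupby`, which may be easier to extend to other columns. This is a sketch on hypothetical data; the column `AboveAvg` is a helper introduced here for illustration and is not part of the lab's dataset.

```python
import pandas as pd

# Hypothetical data with signed growth rates in percent
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "PerCapita": [50000, 40000, 3000, 1000],
    "Growth": [1.5, 4.0, -2.0, 6.5],
})

# Flag each country as above or not above the mean GDP per capita
df["AboveAvg"] = df["PerCapita"] > df["PerCapita"].mean()

# idxmax per group returns the row label of the largest Growth in each group
top_rows = df.loc[df.groupby("AboveAvg")["Growth"].idxmax()]
result = top_rows.set_index("AboveAvg")["Country"].to_dict()
print(result)
```

Here `result[True]` is the fastest-growing country in the above-average group and `result[False]` the fastest-growing one in the other group.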

### 3.2 Lowest Education Index (4 Points)

*   Of the countries that have above average GDP PerCapita, what country has the lowest education index?
*   Of the countries that have below average GDP PerCapita, what country has the lowest education index?

In [36]:
lowest_edu_above = above_avg_gdp.loc[above_avg_gdp['EDIndex'].idxmin()]
lowest_edu_below = below_avg_gdp.loc[below_avg_gdp['EDIndex'].idxmin()]

lowest_edu_above, lowest_edu_below


(Country         Guyana
 GDP           0.015358
 Growth            5780
 Population    0.821637
 PerCapita        18691
 Share                2
 EDIndex          0.601
 Name: 127, dtype: object,
 Country           Niger
 GDP             0.01397
 Growth             1150
 Population    25.311973
 PerCapita           552
 Share                 1
 EDIndex           0.249
 Name: 131, dtype: object)

The country with the lowest education index among those with above-average GDP per capita is Guyana.
The country with the lowest education index among those with below-average GDP per capita is Niger.

### 3.3 Critical Thinking (2 points)

State two additional questions you could answer with the merged dataset. Be creative. You do not need to find the answer, but are welcome to if you are curious.

1. Is there a correlation between a country’s GDP growth rate and its education index?
2. How does population size relate to GDP growth or education index across different countries?
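
As a starting point for both questions, pairwise Pearson correlations can be computed with `DataFrame.corr`. The sketch below uses hypothetical values rather than the merged dataset, so the numbers it prints say nothing about real countries.

```python
import pandas as pd

# Hypothetical values standing in for columns of the merged dataset
df = pd.DataFrame({
    "Growth": [1.0, 2.0, 3.0, 4.0],
    "EDIndex": [0.9, 0.7, 0.6, 0.4],
    "Population": [10.0, 50.0, 5.0, 200.0],
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = df.corr(method="pearson")

# Correlation between growth rate and education index
r = corr_matrix.loc["Growth", "EDIndex"]
print(corr_matrix.round(3))
```

A correlation coefficient only measures linear association, so a scatter plot of the two variables is a useful companion before drawing conclusions.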