## Overview
Gapminder has collected extensive data on how people live in different countries, tracked over time and across many indicators.

### Business Goal
We will use metrics from this data to help stakeholders identify the best countries for opening a retail business targeting middle-class consumers.

### Metrics to investigate:
1. **Income per capita (GDP per capita):** Gross domestic product per person adjusted for differences in purchasing power (in international dollars, fixed 2017 prices, PPP based on 2017 ICP). We will use it to determine which countries have high purchasing power and market potential.
    - File: `gdp_pcap_21.csv`
2. **Population Size:** Total population counts the number of inhabitants in the territory. We will use this to identify which countries have a large customer base.
    - File: `pop.csv`
3. **Population Growth (annual %):** The annual population growth rate for year t is the exponential rate of growth of the midyear population from year t-1 to t, expressed as a percentage. Population is based on the de facto definition, which counts all residents regardless of legal status or citizenship. We will use this metric to identify which countries are likely to have a growing market.
    - File: `population_growth_annual_percent.csv`
4. **Urban population (% of total):** Urban population refers to people living in urban areas as defined by national statistical offices. The data are collected and smoothed by the United Nations Population Division. We will use it to identify countries that are more likely to have better infrastructure and where marketing and distribution are easier to target.
    - File: `urban_population_percent_of_total.csv`
5. **Human Development Index (HDI):** The Human Development Index ranks countries by level of "human development". It combines three dimensions: health, education, and standard of living. We will use it to identify countries that may have a skilled workforce.
    - File: `hdi_human_development_index.csv`
6. **Ease of doing business score (0 = lowest performance to 100 = best performance):** The score evaluates the regulatory environment and the ease of starting and operating a business in a country.
    - File: `ic_bus_dfrn_xq.csv`
7. **Cost of Business Start-Up Procedures:** The cost to register a business, normalized as a percentage of gross national income (GNI) per capita. It will help identify the countries with low start-up costs.
    - File: `ic_reg_cost_pc_zs.csv`

We will also use the `ddf--entities--geo--country.csv` file, which contains information about each country, such as the continent it belongs to.

### Questions to Answer:
1. What are the top 10 countries with the highest GDP per capita?  
2. What are the top 10 countries with the largest population sizes?  
3. What are the top 10 countries with the highest population growth rates?  
4. What are the top 10 countries with the highest Human Development Index (HDI)?  
5. What are the top 10 countries with the best Ease of Doing Business scores?  
6. What are the top 10 countries with the lowest costs for business start-up procedures?
7. What are the top 10 countries with the highest urban population share (% of total)?
8. What is the relationship between the Ease of Doing Business score and the cost of business start-up procedures by country?  
9. What is the relationship between GDP per capita and the Human Development Index by country?  
10. What is the relationship between population size and GDP per capita by country?  


In [507]:
# import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('dark')

### Data Gathering

In [509]:
df_gdp = pd.read_csv('gdp_pcap_21.csv')
df_pop_size = pd.read_csv('pop.csv')
df_pop_growth = pd.read_csv('population_growth_annual_percent.csv')
df_urban_pop = pd.read_csv('urban_population_percent_of_total.csv')
df_human_dev_idx = pd.read_csv('hdi_human_development_index.csv')
df_ease_business = pd.read_csv('ic_bus_dfrn_xq.csv')
df_startup_cost = pd.read_csv('ic_reg_cost_pc_zs.csv')
df_countries_info = pd.read_csv('ddf--entities--geo--country.csv')

In [510]:
df_list = {'df_gdp': df_gdp,
           'df_pop_size': df_pop_size,
           'df_pop_growth': df_pop_growth,
           'df_urban_pop': df_urban_pop,
           'df_human_dev_idx': df_human_dev_idx, 
           'df_ease_business': df_ease_business, 
           'df_startup_cost': df_startup_cost}

### Data Assessing

In [512]:
df_gdp.head(5)

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,481,481,481,481,481,481,481,481,481,...,7320,7500,7680,7870,8060,8260,8460,8670,8880,9100
1,Angola,373,374,376,378,379,381,383,385,386,...,29.6k,30.2k,30.7k,31.3k,31.9k,32.5k,33k,33.6k,34.2k,34.8k
2,Albania,469,471,472,473,475,476,477,479,480,...,57.5k,58.1k,58.7k,59.2k,59.8k,60.4k,60.9k,61.5k,62.1k,62.6k
3,Andorra,1370,1370,1370,1380,1380,1380,1390,1390,1390,...,86.5k,86.8k,87k,87.3k,87.5k,87.7k,88k,88.2k,88.4k,88.6k
4,UAE,1140,1150,1150,1150,1160,1160,1170,1170,1180,...,92.3k,92.4k,92.4k,92.4k,92.5k,92.5k,92.5k,92.6k,92.6k,92.6k


In [513]:
df_pop_size.head(5)

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,...,124M,125M,126M,126M,127M,128M,128M,129M,130M,130M
1,Angola,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,...,139M,140M,142M,143M,144M,145M,147M,148M,149M,150M
2,Albania,400k,402k,404k,405k,407k,409k,411k,413k,414k,...,1.34M,1.32M,1.3M,1.29M,1.27M,1.25M,1.23M,1.22M,1.2M,1.18M
3,Andorra,2650,2650,2650,2650,2650,2650,2650,2650,2650,...,52.8k,52.1k,51.5k,50.8k,50.2k,49.6k,49k,48.4k,47.8k,47.2k
4,UAE,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,...,24.1M,24.3M,24.5M,24.7M,25M,25.2M,25.4M,25.7M,25.9M,26.1M


In [514]:
df_pop_growth.head(5)

Unnamed: 0,country,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Aruba,2.18,1.55,1.39,1.22,1.03,0.862,0.389,−0.0857,−0.237,...,0.692,0.638,0.59,0.537,0.495,0.452,0.134,−0.045,−0.0864,−0.158
1,Afghanistan,1.93,2.01,2.08,2.14,2.22,2.25,2.29,2.35,2.38,...,3.66,3.12,2.58,2.87,2.89,2.91,3.13,2.85,2.53,2.67
2,Angola,1.56,1.46,1.41,1.3,1.11,0.876,0.697,0.696,1.02,...,3.68,3.62,3.59,3.55,3.46,3.4,3.27,3.17,3.1,3.03
3,Albania,3.12,3.06,2.95,2.88,2.75,2.63,2.63,2.84,2.9,...,−0.207,−0.291,−0.16,−0.092,−0.247,−0.426,−0.574,−0.927,−1.22,−1.15
4,Andorra,7.87,7.52,7.22,6.94,6.65,7.0,7.92,8.13,7.72,...,0.355,0.174,1.1,1.77,1.58,1.76,1.76,1.7,0.995,0.33


In [515]:
df_urban_pop.head(5)

Unnamed: 0,country,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Aruba,50.8,50.8,50.7,50.7,50.7,50.7,50.7,50.7,50.7,...,43.0,43.1,43.2,43.3,43.4,43.5,43.7,43.9,44.1,44.3
1,Afghanistan,8.4,8.68,8.98,9.28,9.59,9.9,10.2,10.6,10.9,...,24.6,24.8,25.0,25.3,25.5,25.8,26.0,26.3,26.6,26.9
2,Angola,10.4,10.8,11.2,11.6,12.1,12.5,13.0,13.4,13.9,...,62.7,63.4,64.1,64.8,65.5,66.2,66.8,67.5,68.1,68.7
3,Albania,30.7,30.9,31.0,31.1,31.2,31.2,31.3,31.4,31.4,...,56.4,57.4,58.4,59.4,60.3,61.2,62.1,63.0,63.8,64.6
4,Andorra,58.5,61.0,63.5,65.9,68.2,70.4,72.6,74.6,76.6,...,88.4,88.3,88.2,88.2,88.1,88.0,87.9,87.9,87.8,87.8


In [516]:
df_human_dev_idx.head(5)

Unnamed: 0,country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Afghanistan,0.273,0.279,0.287,0.297,0.292,0.31,0.319,0.323,0.324,...,0.466,0.474,0.479,0.478,0.481,0.482,0.483,0.488,0.483,0.478
1,Angola,,,,,,,,,,...,0.541,0.552,0.563,0.582,0.596,0.597,0.595,0.595,0.59,0.586
2,Albania,0.647,0.629,0.614,0.617,0.624,0.634,0.645,0.642,0.657,...,0.778,0.785,0.792,0.795,0.798,0.802,0.806,0.81,0.794,0.796
3,Andorra,,,,,,,,,,...,0.869,0.864,0.871,0.867,0.871,0.868,0.872,0.873,0.848,0.858
4,UAE,0.728,0.739,0.742,0.748,0.755,0.762,0.767,0.773,0.779,...,0.846,0.852,0.859,0.865,0.87,0.897,0.909,0.92,0.912,0.911


In [517]:
df_ease_business.head(5)

Unnamed: 0,country,2015,2016,2017,2018,2019
0,Afghanistan,39.3,38.9,37.1,44.2,44.1
1,Angola,37.6,37.7,39.0,41.2,41.3
2,Albania,58.1,64.2,66.8,67.0,67.7
3,UAE,76.3,77.4,79.3,81.6,80.8
4,Argentina,56.7,57.2,57.3,58.2,59.0


In [518]:
df_startup_cost.head(5)

Unnamed: 0,country,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Afghanistan,,72.0,75.2,67.4,84.6,59.5,30.2,26.7,25.8,22.5,14.4,15.1,19.0,19.9,82.3,6.4,6.8
1,Angola,1320.0,910.0,654.0,498.0,344.0,197.0,151.0,227.0,163.0,143.0,130.0,119.0,17.0,20.7,17.4,13.9,11.1
2,Albania,57.1,32.3,31.3,22.5,46.1,42.5,32.2,31.8,29.3,22.4,21.2,10.1,10.3,12.5,12.0,11.3,10.8
3,UAE,18.8,17.5,15.8,13.0,13.9,11.6,11.1,12.6,13.4,11.0,11.4,11.3,11.2,13.0,13.4,22.8,17.2
4,Argentina,13.5,17.3,15.9,15.4,13.6,12.8,16.1,20.7,17.7,15.4,23.2,17.7,11.4,10.8,10.4,5.3,5.0


In [519]:
df_countries_info.head(5)

Unnamed: 0,country,g77_and_oecd_countries,income_3groups,income_groups,is--country,iso3166_1_alpha2,iso3166_1_alpha3,iso3166_1_numeric,iso3166_2,landlocked,...,name,un_sdg_ldc,un_sdg_region,un_state,unhcr_region,unicef_region,unicode_region_subtag,west_and_rest,world_4region,world_6region
0,abkh,others,,,True,,,,,,...,Abkhazia,,,False,,,,,europe,europe_central_asia
1,abw,others,high_income,high_income,True,AW,ABW,533.0,,coastline,...,Aruba,un_not_least_developed,un_latin_america_and_the_caribbean,False,unhcr_americas,,AW,,americas,america
2,afg,g77,low_income,low_income,True,AF,AFG,4.0,,landlocked,...,Afghanistan,un_least_developed,un_central_and_southern_asia,True,unhcr_asia_pacific,sa,AF,rest,asia,south_asia
3,ago,g77,middle_income,lower_middle_income,True,AO,AGO,24.0,,coastline,...,Angola,un_least_developed,un_sub_saharan_africa,True,unhcr_southern_africa,ssa,AO,rest,africa,sub_saharan_africa
4,aia,others,,,True,AI,AIA,660.0,,coastline,...,Anguilla,un_not_least_developed,un_latin_america_and_the_caribbean,False,unhcr_americas,,AI,,americas,america


In [520]:
# check for missing values
print("Number of missing values in each dataset")
for name, df in df_list.items():
    num_missing_values = df.isnull().sum().sum()
    
    print(f'{name}: counts: {num_missing_values}, props: {(num_missing_values/df.size):.3%}')

Number of missing values in each dataset
df_gdp: counts: 320, props: 0.543%
df_pop_size: counts: 100, props: 0.168%
df_pop_growth: counts: 32, props: 0.230%
df_urban_pop: counts: 0, props: 0.000%
df_human_dev_idx: counts: 541, props: 8.583%
df_ease_business: counts: 3, props: 0.263%
df_startup_cost: counts: 272, props: 7.953%


In [522]:
df_human_dev_idx[df_human_dev_idx.isnull().any(axis=1)]

Unnamed: 0,country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
1,Angola,,,,,,,,,,...,0.541,0.552,0.563,0.582,0.596,0.597,0.595,0.595,0.59,0.586
3,Andorra,,,,,,,,,,...,0.869,0.864,0.871,0.867,0.871,0.868,0.872,0.873,0.848,0.858
7,Antigua and Barbuda,,,,,,,,,,...,0.787,0.787,0.789,0.791,0.794,0.795,0.798,0.8,0.788,0.788
10,Azerbaijan,,,,,,0.59,0.59,0.594,0.604,...,0.734,0.741,0.745,0.748,0.75,0.753,0.757,0.761,0.73,0.745
14,Burkina Faso,,,,,,,,,,...,0.395,0.402,0.408,0.418,0.427,0.438,0.449,0.452,0.449,0.449
18,Bahamas,,,,,,0.781,0.787,0.784,0.785,...,0.815,0.816,0.82,0.82,0.823,0.825,0.827,0.816,0.815,0.812
19,Bosnia and Herzegovina,,,,,,,,,,...,0.745,0.751,0.756,0.761,0.77,0.772,0.776,0.783,0.781,0.78
20,Belarus,,,,,,0.679,0.686,0.692,0.697,...,0.806,0.808,0.812,0.812,0.813,0.817,0.818,0.817,0.807,0.808
26,Bhutan,,,,,,,,,,...,0.598,0.606,0.617,0.627,0.638,0.647,0.658,0.671,0.668,0.666
38,Comoros,,,,,,,,,,...,0.533,0.539,0.54,0.544,0.548,0.553,0.557,0.56,0.562,0.558


#### Quality Issues  
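- Inconsistent time frames across datasets. We will use the time frame spanning from 2015 to 2019.
- Missing values in every dataset except `df_urban_pop`, most notably `df_human_dev_idx` (8.6%) and `df_startup_cost` (8.0%).
- Values in `df_gdp` and `df_pop_size` use the suffixes 'k' and 'M' for thousands and millions.
- Incorrect datatypes: columns containing suffixed values are stored as strings instead of numbers.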
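
The thousands and millions suffixes noted above could be converted to plain numbers with a small helper. The sketch below is one possible approach, not the notebook's own code: `parse_suffixed` is a hypothetical name, the `B` multiplier is an assumption (only `k` and `M` appear in the previews), and the helper also normalizes the Unicode minus sign that shows up in the `df_pop_growth` preview.

```python
import pandas as pd

# Multipliers for the suffixes seen in the previews; 'B' is an assumption.
SUFFIXES = {'k': 1e3, 'M': 1e6, 'B': 1e9}

def parse_suffixed(value):
    """Convert strings such as '29.6k' or '3.28M' to floats (hypothetical helper)."""
    if pd.isna(value):
        return float('nan')
    # Replace the Unicode minus sign with an ASCII hyphen so float() accepts it.
    text = str(value).strip().replace('\u2212', '-')
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)

# Toy frame shaped like the previews (values copied from df_pop_size.head()).
toy = pd.DataFrame({'country': ['Albania', 'Andorra'],
                    '2099': ['1.2M', '47.8k'],
                    '2100': ['1.18M', '47.2k']})

# Apply the helper to every year column, leaving 'country' untouched.
year_cols = toy.columns.drop('country')
toy[year_cols] = toy[year_cols].apply(lambda col: col.map(parse_suffixed))
print(toy.dtypes)
```

After the conversion the year columns hold floats, so sorting and plotting behave numerically instead of alphabetically.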
#### Tidiness Issues  
- Year values are used as column names instead of a single "Year" variable.
- The data need to be combined into a single dataframe with columns for country, year, region, indicator 1 value, indicator 2 value, and so on.
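
The reshaping described above can be sketched with `DataFrame.melt` and `DataFrame.merge`. This is a minimal illustration on toy data, not the notebook's final code: the helper name `to_long` and the column names `year`, `ease_business`, and `startup_cost` are assumptions, and the values are copied from the previews above.

```python
import pandas as pd

def to_long(df, value_name):
    """Turn year columns into rows: one (country, year, value) row per cell."""
    long_df = df.melt(id_vars='country', var_name='year', value_name=value_name)
    long_df['year'] = long_df['year'].astype(int)
    return long_df

ease = pd.DataFrame({'country': ['Angola', 'Albania'],
                     '2018': [41.2, 67.0], '2019': [41.3, 67.7]})
cost = pd.DataFrame({'country': ['Angola', 'Albania'],
                     '2018': [13.9, 11.3], '2019': [11.1, 10.8]})

# An inner join on (country, year) keeps only pairs present in both indicators.
combined = to_long(ease, 'ease_business').merge(
    to_long(cost, 'startup_cost'), on=['country', 'year'], how='inner')
print(combined)
```

Adding the region would need one more merge against `df_countries_info`, whose `name` column appears to hold the country names. Whether those names match the indicator files exactly is not shown here.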

### Cleaning data

#### Missing data

##### Define
Drop rows that contain missing values, since most of the gaps fall outside the 2015 to 2019 time frame used in our analysis.

##### Code

In [528]:
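# Drop every row (country) that has at least one missing value.
# inplace=True mutates the DataFrame objects that df_list refers to.
for name, df in df_list.items():
    if df.isnull().sum().sum() != 0:
        df.dropna(axis=0, inplace=True)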

##### Test

In [481]:
for name, df in df_list.items():
    print(f'{name}: ', df.isnull().sum().sum())

df_gdp:  320
df_pop_size:  100
df_pop_growth:  32
df_urban_pop:  0
df_human_dev_idx:  541
df_ease_business:  3
df_startup_cost:  272
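
Dropping every row with a missing value also discards countries whose gaps fall entirely outside the analysis window. In the HDI preview, for example, Angola has no values for the early 1990s but is complete from 2012 onward. A possible alternative, sketched here on toy data as an assumption rather than the notebook's method, is to select the columns of interest first and then drop incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy HDI-like frame: Angola is missing 1990 only; 'Nowhere' (a made-up
# country with made-up values) is missing 2019.
hdi = pd.DataFrame({'country': ['Afghanistan', 'Angola', 'Nowhere'],
                    '1990': [0.273, np.nan, 0.5],
                    '2018': [0.483, 0.595, 0.6],
                    '2019': [0.488, 0.595, np.nan]})

window = ['country', '2018', '2019']  # stand-in for the 2015-2019 columns

dropped_first = hdi.dropna()[window]    # drop rows, then select columns
selected_first = hdi[window].dropna()   # select columns, then drop rows

# Compare how many countries survive under each ordering.
print(len(dropped_first), len(selected_first))

# Confirm that the cleaned frame really has no missing values left.
assert selected_first.isnull().sum().sum() == 0
```

The same kind of assertion can confirm that the cleaning step took effect, which is useful here because the counts printed above are identical to the pre-cleaning counts.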


#### Inconsistent time frames across datasets.
##### Define
- We will analyze the data over the years 2015 to 2019, the only period covered by every dataset; `df_ease_business` has the narrowest range and contains only these five years.

In [483]:
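# Keep only 'country' plus the 2015-2019 year columns found in df_ease_business.
# Note: these assignments rebind the variable names to new DataFrames;
# the entries in df_list still refer to the unfiltered frames.
df_gdp = df_gdp[df_ease_business.columns]
df_pop_size = df_pop_size[df_ease_business.columns]
df_pop_growth = df_pop_growth[df_ease_business.columns]
df_urban_pop = df_urban_pop[df_ease_business.columns]
df_human_dev_idx = df_human_dev_idx[df_ease_business.columns]
df_startup_cost = df_startup_cost[df_ease_business.columns]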

In [484]:
(df_ease_business.columns == df_gdp.columns).all()

True

In [485]:
df_gdp.shape

(195, 6)
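
The repeated column selections above could also be written as a loop over the dictionary, which keeps the dictionary entries pointing at the filtered frames. This is a minimal sketch on toy data, assuming every frame contains all of the selected columns; the names `frames` and `keep` are illustrative, and the GDP figures are invented placeholders.

```python
import pandas as pd

frames = {
    # Invented placeholder values, used only to show the column selection.
    'df_gdp': pd.DataFrame({'country': ['Albania'], '2014': [11000],
                            '2015': [11500], '2016': [12000]}),
    'df_ease_business': pd.DataFrame({'country': ['Albania'],
                                      '2015': [58.1], '2016': [64.2]}),
}

keep = frames['df_ease_business'].columns  # 'country' plus the shared years

# Rebuild the dictionary so each entry holds only the shared columns.
frames = {name: df[keep] for name, df in frames.items()}

# Every frame now has identical column labels, in the same order.
assert all(df.columns.equals(keep) for df in frames.values())
print({name: df.shape for name, df in frames.items()})
```

Checking every frame in one assertion also extends the single `df_gdp` comparison to all datasets.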