# A look at World Happiness from 2015 to 2019

![](image1.png)

## Introduction
As my personal project for the Data Science Bootcamp at The Bridge, I have chosen to dive deeper into a subject of great importance, especially given the current situation worldwide: happiness. I want to investigate which countries in the world are the happiest and, more importantly, which factors actually influence this.

I have chosen to use data from the World Happiness Report, obtained from Kaggle. The World Happiness Report is a landmark survey of the state of global happiness that ranks around 156 countries by how happy their citizens perceive themselves to be.

The Happiness Score is a national average of the responses to the main life evaluation question asked in the Gallup World Poll (GWP), which uses the Cantril Ladder. The Happiness Score is explained by the following factors:

- GDP per capita
- Social support
- Healthy life expectancy
- Freedom to make life choices
- Generosity
- Perceptions of corruption

To get more practice with obtaining, wrangling and mining data, I have decided not to use only the dataframes found on Kaggle, but to search for additional data from different sources. This will be information on:

- The Global Peace Index (GPI)
- Unemployment Rate

Sources: https://www.kaggle.com/mathurinache/world-happiness-report, http://visionofhumanity.org/indexes/global-peace-index/, https://worldhappiness.report/


Definitions of the factors used in the analysis:

- GDP per capita: GDP per capita is a measure of a country's economic output that accounts for its number of people.

- Social support: Social support means having friends and other people, including family, to turn to in times of need or crisis to give you a broader focus and positive self-image. Social support enhances quality of life and provides a buffer against adverse life events.

- Healthy life expectancy: Healthy Life Expectancy is the average number of years that a newborn can expect to live in "full health"—in other words, not hampered by disabling illnesses or injuries.

- Freedom to make life choices: Freedom of choice describes an individual's opportunity and autonomy to perform an action selected from at least two available options, unconstrained by external parties.

- Generosity: The quality of being kind and generous.

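- Perceptions of corruption: In the World Happiness Report this factor reflects how widespread survey respondents perceive corruption to be in government and business. A related but separate measure is the Corruption Perceptions Index (CPI), published annually by Transparency International since 1995, which ranks countries "by their perceived levels of public sector corruption, as determined by expert assessments and opinion surveys".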

- The Global Peace Index (GPI): The GPI ranks 163 independent states and territories (99.7 per cent of the world's population) according to their levels of peacefulness. The lower the score, the more peaceful the country is.

- Unemployment Rate: The unemployment rate is calculated by expressing the number of unemployed persons as a
  percentage of the total number of persons in the labour force (all people between the age of 15 and 64).

Sources: https://www.kaggle.com/mathurinache/world-happiness-report, http://visionofhumanity.org/indexes/global-peace-index/ https://worldhappiness.report/, https://www.ilo.org/ilostat-files/Documents/description_UR_EN.pdf, https://en.wikipedia.org/wiki/Global_Peace_Index.

## Hypothesis

During the analysis of the data regarding World Happiness, I would like to find out if the following statements are correct:
- The happiest country in the world does not change over the years.
- The most important factor influencing world happiness is GDP, i.e. the famous quote "Money makes happiness" is true.
- (Europe is the happiest continent on the planet)

## Method of working

1. Collecting the necessary data
2. Cleaning and filtering datasets
3. Studying and visualizing the data.
4. Drawing conclusions on the hypotheses


Importing the necessary libraries:

In [1]:
# os and sys must be imported before they are used to extend the import path.
import os, sys

# Add the project root to the import path so the local utils package can be found.
root_path = os.path.dirname(os.path.abspath(''))
sys.path.append(root_path)

import pandas as pd
import numpy as np
from flask import Flask, render_template, redirect, request, jsonify 
import missingno
import time
import random
import json
import seaborn as sns
import matplotlib.pyplot as plt
#from utils.apis_tb import
import xlrd
# Project helper functions used in the cleaning steps below.
from utils.mining_data_tb import *

### Collecting Data

In [3]:
#Importing the datasets of the yearly World Happiness Report from Kaggle: https://www.kaggle.com/mathurinache/world-happiness-report.
data_2015 = pd.read_csv("datasets_2015.csv")
data_2016 = pd.read_csv("datasets_2016.csv")
data_2017 = pd.read_csv("datasets_2017.csv")
data_2018 = pd.read_csv("datasets_2018.csv")
data_2019 = pd.read_csv("datasets_2019.csv")
data_2017.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
3,Switzerland,4,7.494,7.561772,7.426227,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007,2.276716
4,Finland,5,7.469,7.527542,7.410458,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612,2.430182


In [5]:
#Importing the table from Wikipedia with the Global Peace Index per country from 2008 to 2019
url = "https://en.wikipedia.org/wiki/Global_Peace_Index"
df_world_peace = pd.read_html(url)[2]
df_world_peace.head()

Unnamed: 0,Country,2019 rank,2019 score[12],2018 rank,2018 score[13],2017 rank,2017 score[2],2016 rank,2016 score[14],2015 rank,...,2012 rank,"2012 score[18],[19]",2011 rank,2011 score[20],2010 rank,2010 score[21],2009 rank,2009 score,2008 rank,2008 score
0,Iceland,1,1.072,1,1.096,1,1.084,1,1.138,1,...,1,1.159,1,1.099,1,1.143,1,1.16,1,1.111
1,New Zealand,2,1.221,2,1.188,2,1.216,3,1.238,3,...,3,1.276,2,1.255,2,1.251,2,1.26,2,1.261
2,Portugal,3,1.274,5,1.315,3,1.273,5,1.324,12,...,16,1.52,16,1.485,14,1.472,13,1.437,10,1.387
3,Austria,4,1.291,3,1.273,4,1.292,4,1.249,4,...,7,1.407,9,1.416,5,1.383,5,1.369,5,1.337
4,Denmark,5,1.316,4,1.313,5,1.299,2,1.201,2,...,2,1.235,3,1.29,4,1.334,3,1.269,3,1.272


In [6]:
#Importing the content of the JSON file with the unemployment rate per country from 2015 to 2019. Source: https://www.ilo.org/shinyapps/bulkexplorer15/?lang=en&segment=indicator&id=SDG_0852_SEX_AGE_RT_A
df_unemployment = pd.read_json("percentage_unemployment.json")
df_unemployment.head()

Unnamed: 0,ref_area.label,indicator.label,source.label,sex.label,classif1.label,time,obs_value,obs_status.label,note_classif.label,note_indicator.label,note_source.label
0,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2015,17.5806,,,,Repository: ILO-STATISTICS - Micro data proces...
1,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2016,15.8059,,,,Repository: ILO-STATISTICS - Micro data proces...
2,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2017,13.9826,,,,Repository: ILO-STATISTICS - Micro data proces...
3,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2018,12.7767,,,,Repository: ILO-STATISTICS - Micro data proces...
4,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2019,11.9473,,,,Repository: ILO-STATISTICS - Micro data proces...


### Cleaning Dataframes

Cleaning and manipulating the datasets is a crucial step for this analysis. I am working with dataframes from 3 different sources: the World Happiness Reports from Kaggle, the Global Peace Index table from Wikipedia and the unemployment rates from the International Labour Organization. I will first reduce each dataset to the necessary columns and then merge the dataframes from the 3 sources into one dataframe per year.

Below I describe the cleaning process per dataset. The functions used in the cleaning process are imported from the utils folder (mining_data_tb file).

1) Datasets from the world happiness reports.

As the dataframes of the different years have slightly different column names and a different number of columns, I will first clean each dataframe by applying the following steps:
- changing the column names so they are the same in every year.
- filtering the dataframe down to the columns that are useful for the analysis and that are available in all the dataframes.
- setting the country column as the index.
- adding a column with the year for reference. This will come in handy when adding the dataframes from the different years together.
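The four steps above can be sketched as a single function. This is only an illustration under assumptions: the real `Change_columns`, `Filter_dataframe` and `Add_year` helpers live in the utils folder and are not shown here, so `COLUMN_MAP`, `KEEP` and `standardize` are hypothetical names, and the mapping only covers the 2017 column names visible in the output above. The sketch stores the year as an integer, whereas the notebook passes it as a string.

```python
import pandas as pd

# Hypothetical mapping from the 2017 column names to one shared schema.
COLUMN_MAP = {
    "Happiness.Rank": "Overall rank",
    "Happiness.Score": "Score",
    "Economy..GDP.per.Capita.": "GDP per capita",
    "Family": "Social support",
    "Health..Life.Expectancy.": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust..Government.Corruption.": "Perceptions of corruption",
}

# Columns that are kept because they exist in every yearly report.
KEEP = [
    "Overall rank", "Score", "GDP per capita", "Social support",
    "Healthy life expectancy", "Freedom to make life choices",
    "Generosity", "Perceptions of corruption",
]


def standardize(df, year):
    """Rename, filter, index by country and tag one yearly report (hypothetical helper)."""
    out = df.rename(columns=COLUMN_MAP).set_index("Country")[KEEP].copy()
    out.insert(0, "Year", year)  # reference column for stacking the years later
    return out


# Two rows taken from the 2017 output shown earlier in the notebook.
raw_2017 = pd.DataFrame({
    "Country": ["Norway", "Denmark"],
    "Happiness.Rank": [1, 2],
    "Happiness.Score": [7.537, 7.522],
    "Whisker.high": [7.594445, 7.581728],
    "Economy..GDP.per.Capita.": [1.616463, 1.482383],
    "Family": [1.533524, 1.551122],
    "Health..Life.Expectancy.": [0.796667, 0.792566],
    "Freedom": [0.635423, 0.626007],
    "Generosity": [0.362012, 0.355280],
    "Trust..Government.Corruption.": [0.315964, 0.400770],
})
clean_2017 = standardize(raw_2017, 2017)
print(clean_2017)
```

Returning a new dataframe instead of modifying the input in place makes it safe to run the cell more than once.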


In [7]:
Change_columns(df=data_2015)
data_2015 = Filter_dataframe(df=data_2015)
Add_year(df=data_2015, year="2015")

done


Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
Switzerland,2015,1,7.587,1.39651,1.34951,0.94143,0.66557,0.29678,0.41978
Iceland,2015,2,7.561,1.30232,1.40223,0.94784,0.62877,0.43630,0.14145
Denmark,2015,3,7.527,1.32548,1.36058,0.87464,0.64938,0.34139,0.48357
Norway,2015,4,7.522,1.45900,1.33095,0.88521,0.66973,0.34699,0.36503
Canada,2015,5,7.427,1.32629,1.32261,0.90563,0.63297,0.45811,0.32957
...,...,...,...,...,...,...,...,...,...
Rwanda,2015,154,3.465,0.22208,0.77370,0.42864,0.59201,0.22628,0.55191
Benin,2015,155,3.340,0.28665,0.35386,0.31910,0.48450,0.18260,0.08010
Syria,2015,156,3.006,0.66320,0.47489,0.72193,0.15684,0.47179,0.18906
Burundi,2015,157,2.905,0.01530,0.41587,0.22396,0.11850,0.19727,0.10062


In [8]:
Change_columns(df=data_2016)
data_2016 = Filter_dataframe(df=data_2016)
Add_year(df=data_2016, year="2016")

done


Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
Denmark,2016,1,7.526,1.44178,1.16374,0.79504,0.57941,0.36171,0.44453
Switzerland,2016,2,7.509,1.52733,1.14524,0.86303,0.58557,0.28083,0.41203
Iceland,2016,3,7.501,1.42666,1.18326,0.86733,0.56624,0.47678,0.14975
Norway,2016,4,7.498,1.57744,1.12690,0.79579,0.59609,0.37895,0.35776
Finland,2016,5,7.413,1.40598,1.13464,0.81091,0.57104,0.25492,0.41004
...,...,...,...,...,...,...,...,...,...
Benin,2016,153,3.484,0.39499,0.10419,0.21028,0.39747,0.20180,0.06681
Afghanistan,2016,154,3.360,0.38227,0.11037,0.17344,0.16430,0.31268,0.07112
Togo,2016,155,3.303,0.28123,0.00000,0.24811,0.34678,0.17517,0.11587
Syria,2016,156,3.069,0.74719,0.14866,0.62994,0.06912,0.48397,0.17233


In [9]:
Change_columns(df=data_2017)
data_2017 = Filter_dataframe(df=data_2017)
Add_year(df=data_2017, year="2017")

done


Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
Norway,2017,1,7.537,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964
Denmark,2017,2,7.522,1.482383,1.551122,0.792566,0.626007,0.355280,0.400770
Iceland,2017,3,7.504,1.480633,1.610574,0.833552,0.627163,0.475540,0.153527
Switzerland,2017,4,7.494,1.564980,1.516912,0.858131,0.620071,0.290549,0.367007
Finland,2017,5,7.469,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612
...,...,...,...,...,...,...,...,...,...
Rwanda,2017,151,3.471,0.368746,0.945707,0.326425,0.581844,0.252756,0.455220
Syria,2017,152,3.462,0.777153,0.396103,0.500533,0.081539,0.493664,0.151347
Tanzania,2017,153,3.349,0.511136,1.041990,0.364509,0.390018,0.354256,0.066035
Burundi,2017,154,2.905,0.091623,0.629794,0.151611,0.059901,0.204435,0.084148


In [10]:
Change_columns(df=data_2018)
data_2018 = Filter_dataframe(df=data_2018)
Add_year(df=data_2018, year="2018")

done


Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
Finland,2018,1,7.632,1.305,1.592,0.874,0.681,0.202,0.393
Norway,2018,2,7.594,1.456,1.582,0.861,0.686,0.286,0.340
Denmark,2018,3,7.555,1.351,1.590,0.868,0.683,0.284,0.408
Iceland,2018,4,7.495,1.343,1.644,0.914,0.677,0.353,0.138
Switzerland,2018,5,7.487,1.420,1.549,0.927,0.660,0.256,0.357
...,...,...,...,...,...,...,...,...,...
Yemen,2018,152,3.355,0.442,1.073,0.343,0.244,0.083,0.064
Tanzania,2018,153,3.303,0.455,0.991,0.381,0.481,0.270,0.097
South Sudan,2018,154,3.254,0.337,0.608,0.177,0.112,0.224,0.106
Central African Republic,2018,155,3.083,0.024,0.000,0.010,0.305,0.218,0.038


In [11]:
Change_columns(df=data_2019)
data_2019 = Filter_dataframe(df=data_2019)
Add_year(df=data_2019, year="2019")

done


Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
Finland,2019,1,7.769,1.340,1.587,0.986,0.596,0.153,0.393
Denmark,2019,2,7.600,1.383,1.573,0.996,0.592,0.252,0.410
Norway,2019,3,7.554,1.488,1.582,1.028,0.603,0.271,0.341
Iceland,2019,4,7.494,1.380,1.624,1.026,0.591,0.354,0.118
Netherlands,2019,5,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
Rwanda,2019,152,3.334,0.359,0.711,0.614,0.555,0.217,0.411
Tanzania,2019,153,3.231,0.476,0.885,0.499,0.417,0.276,0.147
Afghanistan,2019,154,3.203,0.350,0.517,0.361,0.000,0.158,0.025
Central African Republic,2019,155,3.083,0.026,0.000,0.105,0.225,0.235,0.035


2) Dataset for the Global Peace Index.

As the dataset contains columns for years that are not needed and some column names with extra characters in them, I will apply the following steps:
- filtering the dataframe for the correct years.
- cleaning the names of the columns.
- setting the country column as the index.
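A minimal sketch of these three steps is given below. It is an assumption about what `Clean_data_peace_index` does, since that function is defined in the utils folder; `clean_peace_index` and its parameters are hypothetical names. The footnote markers such as `[12]` or `[18],[19]` are removed by cutting each column name at the first opening bracket.

```python
import pandas as pd


def clean_peace_index(df, first_year=2015, last_year=2019):
    """Strip footnote markers, keep the requested years, index by country (hypothetical helper)."""
    out = df.copy()
    # "2019 score[12]" becomes "2019 score"; "2012 score[18],[19]" becomes "2012 score".
    out.columns = out.columns.str.replace(r"\[.*$", "", regex=True).str.strip()
    wanted = [
        f"{year} {kind}"
        for year in range(last_year, first_year - 1, -1)
        for kind in ("rank", "score")
    ]
    return out.set_index("Country")[wanted]


# Two rows and four columns taken from the Wikipedia table shown earlier.
raw_peace = pd.DataFrame({
    "Country": ["Iceland", "New Zealand"],
    "2019 rank": [1, 2],
    "2019 score[12]": [1.072, 1.221],
    "2012 rank": [1, 3],
    "2012 score[18],[19]": [1.159, 1.276],
})
# Only 2019 is requested here because the sample frame has no 2015 to 2018 columns.
clean_peace = clean_peace_index(raw_peace, first_year=2019, last_year=2019)
print(clean_peace)
```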

In [12]:
df_world_peace = Clean_data_peace_index(df=df_world_peace)
df_world_peace

done


Country,2019 rank,2019 score,2018 rank,2018 score,2017 rank,2017 score,2016 rank,2016 score,2015 rank,2015 score
Iceland,1,1.072,1,1.096,1,1.084,1,1.138,1,1.142
New Zealand,2,1.221,2,1.188,2,1.216,3,1.238,3,1.263
Portugal,3,1.274,5,1.315,3,1.273,5,1.324,12,1.418
Austria,4,1.291,3,1.273,4,1.292,4,1.249,4,1.264
Denmark,5,1.316,4,1.313,5,1.299,2,1.201,2,1.179
...,...,...,...,...,...,...,...,...,...,...
Yemen,159,3.369,160,3.436,161,3.516,162,3.530,160,3.481
Syria,160,3.412,158,3.308,157,3.316,157,3.287,148,2.840
South Sudan,161,3.526,161,3.525,160,3.462,161,3.524,161,3.491
Iraq,162,3.573,163,3.599,163,3.660,163,3.653,162,3.628


3) Dataset for unemployment rate.

As the dataframe contains many columns that are not needed, unclear column names and an inconvenient structure, I will pass it through a cleaning function that does the following:
- filtering only the necessary columns.
- changing the column names.
- reshaping the data so that each country is a row and each year is a column.
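The reshaping can be sketched as follows. The column selection, the new names and the `pivot_table` call follow the commented-out fallback code at the end of this notebook; the function name `pivot_unemployment` is hypothetical and the actual `clean_data_unemployment_rate` in the utils folder may differ.

```python
import pandas as pd


def pivot_unemployment(df):
    """Keep country, year and value, then turn the years into columns (hypothetical helper)."""
    slim = df[["ref_area.label", "time", "obs_value"]].rename(
        columns={
            "ref_area.label": "Country",
            "time": "Year",
            "obs_value": "Unemployment_rate",
        }
    )
    # One row per country, one column per year; missing years become NaN.
    return slim.pivot_table(
        values="Unemployment_rate", index="Country", columns="Year", aggfunc="first"
    )


# Three rows taken from the ILO output shown earlier.
raw_unemployment = pd.DataFrame({
    "ref_area.label": ["Albania", "Albania", "Argentina"],
    "sex.label": ["Sex: Total", "Sex: Total", "Sex: Total"],
    "time": [2015, 2016, 2017],
    "obs_value": [17.5806, 15.8059, 8.5335],
})
wide_unemployment = pivot_unemployment(raw_unemployment)
print(wide_unemployment)
```

The wide layout makes it easy to pick the column of a single year when joining with the happiness data of that year.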

In [13]:
df_unemployment = clean_data_unemployment_rate(x=df_unemployment)
df_unemployment

Country,2015,2016,2017,2018,2019
Albania,17.5806,15.8059,13.9826,12.7767,11.9473
Argentina,,,8.5335,9.3666,10.0923
Armenia,18.8186,18.1652,18.2708,19.3374,
Australia,6.2178,5.8676,5.7566,5.4573,5.3115
Austria,5.8226,6.1113,5.5785,4.9212,4.5524
...,...,...,...,...,...
United Kingdom,5.4372,4.9368,4.4372,4.0994,3.8272
United States,5.3661,4.9337,4.4040,3.9376,3.7180
Uruguay,7.7062,8.0649,8.1399,8.5756,9.5715
Viet Nam,1.9191,1.9149,1.9440,1.2083,2.0831


#### Merging dataframes

For further use in this project, I will join the column of each separate year from the Global Peace Index dataframe and from the Unemployment Rate dataframe to the main dataframe of the corresponding year.
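As an illustration of such a join, the sketch below attaches the peace score and the unemployment rate of one year to the happiness data through a left join on the country index. The real `join_df` is defined in the utils folder and is not shown, so `join_year` is a hypothetical stand-in; the output column names "Peace index" and "Unemployment rate" are the ones visible in the merged dataframes of this notebook.

```python
import pandas as pd


def join_year(main, peace, unemployment, year):
    """Left-join one year of peace scores and unemployment rates (hypothetical helper)."""
    extra = pd.concat(
        [
            peace[f"{year} score"].rename("Peace index"),
            unemployment[year].rename("Unemployment rate"),
        ],
        axis=1,
    )
    # A left join keeps every country of the happiness report, even without a match.
    return main.join(extra, how="left")


# Small frames with values taken from the 2015 outputs in this notebook.
main_2015 = pd.DataFrame(
    {"Year": ["2015", "2015"], "Score": [7.587, 3.465]},
    index=pd.Index(["Switzerland", "Rwanda"], name="Country"),
)
peace = pd.DataFrame(
    {"2015 score": [1.443, 2.228, 1.142]},
    index=pd.Index(["Switzerland", "Rwanda", "Iceland"], name="Country"),
)
unemployment = pd.DataFrame(
    {2015: [4.92]},
    index=pd.Index(["Switzerland"], name="Country"),
)
complete_2015 = join_year(main_2015, peace, unemployment, 2015)
print(complete_2015)
```

Countries that are missing from one of the extra sources, such as Rwanda in the unemployment data, end up with NaN in that column rather than being dropped.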

In [14]:
#First we select, from the dataframes that will be joined (world peace and unemployment), the column of the correct year. Then all the dataframes are passed to a function that joins them on the index of the main dataframe.
world_peace_2015 = df_world_peace.loc[:,"2015 score":"2015 score"]
unemployment_2015 = df_unemployment.loc[:, 2015:2015]
complete_data_2015 = join_df(df=data_2015, df1=world_peace_2015, df2=unemployment_2015)

Dataframes are joined


In [15]:
complete_data_2015

Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Peace index,Unemployment rate
Switzerland,2015,1,7.587,1.39651,1.34951,0.94143,0.66557,0.29678,0.41978,1.443,4.9200
Iceland,2015,2,7.561,1.30232,1.40223,0.94784,0.62877,0.43630,0.14145,1.142,4.1591
Denmark,2015,3,7.527,1.32548,1.36058,0.87464,0.64938,0.34139,0.48357,1.179,6.4513
Norway,2015,4,7.522,1.45900,1.33095,0.88521,0.66973,0.34699,0.36503,1.483,4.4206
Canada,2015,5,7.427,1.32629,1.32261,0.90563,0.63297,0.45811,0.32957,1.337,6.9993
...,...,...,...,...,...,...,...,...,...,...,...
Rwanda,2015,154,3.465,0.22208,0.77370,0.42864,0.59201,0.22628,0.55191,2.228,
Benin,2015,155,3.340,0.28665,0.35386,0.31910,0.48450,0.18260,0.08010,1.975,
Syria,2015,156,3.006,0.66320,0.47489,0.72193,0.15684,0.47179,0.18906,2.840,
Burundi,2015,157,2.905,0.01530,0.41587,0.22396,0.11850,0.19727,0.10062,2.327,


In [16]:
world_peace_2016 = df_world_peace.loc[:,"2016 score":"2016 score"]
unemployment_2016 = df_unemployment.loc[:, 2016:2016]
complete_data_2016 = join_df(df=data_2016, df1=world_peace_2016, df2=unemployment_2016)

Dataframes are joined


In [17]:
world_peace_2017 = df_world_peace.loc[:,"2017 score":"2017 score"]
unemployment_2017 = df_unemployment.loc[:, 2017:2017]
complete_data_2017 = join_df(df=data_2017, df1=world_peace_2017, df2=unemployment_2017)

Dataframes are joined


In [18]:
world_peace_2018 = df_world_peace.loc[:,"2018 score":"2018 score"]
unemployment_2018 = df_unemployment.loc[:, 2018:2018]
complete_data_2018 = join_df(df=data_2018, df1=world_peace_2018, df2=unemployment_2018)

Dataframes are joined


In [19]:
world_peace_2019 = df_world_peace.loc[:,"2019 score":"2019 score"]
unemployment_2019 = df_unemployment.loc[:, 2019:2019]
complete_data_2019 = join_df(df=data_2019, df1=world_peace_2019, df2=unemployment_2019)

Dataframes are joined


In [20]:
complete_data_2019

Country,Year,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Peace index,Unemployment rate
Finland,2019,1,7.769,1.340,1.587,0.986,0.596,0.153,0.393,1.488,6.8335
Denmark,2019,2,7.600,1.383,1.573,0.996,0.592,0.252,0.410,1.316,5.1341
Norway,2019,3,7.554,1.488,1.582,1.028,0.603,0.271,0.341,1.536,3.7848
Iceland,2019,4,7.494,1.380,1.624,1.026,0.591,0.354,0.118,1.072,3.6336
Netherlands,2019,5,7.488,1.396,1.522,0.999,0.557,0.322,0.298,1.530,3.3744
...,...,...,...,...,...,...,...,...,...,...,...
Rwanda,2019,152,3.334,0.359,0.711,0.614,0.555,0.217,0.411,2.014,
Tanzania,2019,153,3.231,0.476,0.885,0.499,0.417,0.276,0.147,1.860,
Afghanistan,2019,154,3.203,0.350,0.517,0.361,0.000,0.158,0.025,3.300,
Central African Republic,2019,155,3.083,0.026,0.000,0.105,0.225,0.235,0.035,3.296,


Now that we have the complete dataframes that we will use to test the hypotheses, we will continue by taking a thorough look at the data and cleaning it further if necessary, through the following steps:

- Checking the types of the columns, changing the date column to a datetime64 type and setting it as our index.

- Checking the data for any NaN values.

- Checking for duplicates in the data.
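These checks can be sketched with standard pandas calls. The helper name `quality_report` is hypothetical and the sample frame only holds two rows taken from the 2019 output above; the same calls would apply to any of the `complete_data_*` dataframes.

```python
import pandas as pd


def quality_report(df):
    """Summarize column types, missing values and duplicates (hypothetical helper)."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicated_index": int(df.index.duplicated().sum()),
        "duplicated_rows": int(df.duplicated().sum()),
    }


sample = pd.DataFrame(
    {
        "Year": ["2019", "2019"],
        "Score": [7.769, 3.334],
        "Unemployment rate": [6.8335, None],
    },
    index=pd.Index(["Finland", "Rwanda"], name="Country"),
)
report = quality_report(sample)
print(report)

# Converting the year strings to a datetime64 column.
years = pd.to_datetime(sample["Year"], format="%Y")
print(years.dtype)
```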

-----------------------------------------------------------------------------------------------------

In [14]:
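#Merging the dataframes of all years together. Country is the index, so it is moved back to a column before the rows are renumbered; ignore_index=True would discard the country names.
data_all_years = pd.concat([data_2015, data_2016, data_2017, data_2018, data_2019]).reset_index()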

In [15]:
# https://en.wikipedia.org/wiki/Global_Peace_Index web scraping for the peace index per country, 2015 to 2020
# http://visionofhumanity.org/app/uploads/2020/06/GPI_2020_web.pdf peace index 2020 (Global Peace Index)

In [16]:
# Unemployment, percentage per country: https://www.ilo.org/shinyapps/bulkexplorer15/?lang=en&segment=indicator&id=SDG_0852_SEX_AGE_RT_A (available as CSV and JSON file)
df_unemployment = pd.read_json("percentage_unemployment.json")

In [17]:
df_unemployment = pd.read_json("percentage_unemployment.json") ## the data is also available as a JSON file


In [18]:
df_unemployment.head(10)

Unnamed: 0,ref_area.label,indicator.label,source.label,sex.label,classif1.label,time,obs_value,obs_status.label,note_classif.label,note_indicator.label,note_source.label
0,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2015,17.5806,,,,Repository: ILO-STATISTICS - Micro data proces...
1,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2016,15.8059,,,,Repository: ILO-STATISTICS - Micro data proces...
2,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2017,13.9826,,,,Repository: ILO-STATISTICS - Micro data proces...
3,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2018,12.7767,,,,Repository: ILO-STATISTICS - Micro data proces...
4,Albania,SDG indicator 8.5.2 - Unemployment rate (%),ALB - LFS - Labour Force Survey,Sex: Total,"Age (Youth, adults): 15-64",2019,11.9473,,,,Repository: ILO-STATISTICS - Micro data proces...
5,United Arab Emirates,SDG indicator 8.5.2 - Unemployment rate (%),ARE - LFS - Labour force survey,Sex: Total,"Age (Youth, adults): 15-64",2017,2.469,,,,Repository: ILO-STATISTICS - Micro data proces...
6,United Arab Emirates,SDG indicator 8.5.2 - Unemployment rate (%),ARE - LFS - Labour force survey,Sex: Total,"Age (Youth, adults): 15-64",2019,2.238,,,,Repository: ILO-STATISTICS - Micro data proces...
7,Argentina,SDG indicator 8.5.2 - Unemployment rate (%),ARG - LFS - Encuesta Permanente de Hogares (Ur...,Sex: Total,"Age (Youth, adults): 15-64",2017,8.5335,,,,Repository: ILO-STATISTICS - Micro data proces...
8,Argentina,SDG indicator 8.5.2 - Unemployment rate (%),ARG - LFS - Encuesta Permanente de Hogares (Ur...,Sex: Total,"Age (Youth, adults): 15-64",2018,9.3666,,,,Repository: ILO-STATISTICS - Micro data proces...
9,Argentina,SDG indicator 8.5.2 - Unemployment rate (%),ARG - LFS - Encuesta Permanente de Hogares (Ur...,Sex: Total,"Age (Youth, adults): 15-64",2019,10.0923,,,,Repository: ILO-STATISTICS - Micro data proces...


In [19]:
df_household = pd.read_csv("GDL-Average-household-size-data.csv")

FileNotFoundError: [Errno 2] File GDL-Average-household-size-data.csv does not exist: 'GDL-Average-household-size-data.csv'

In [20]:
df_household.head(500)

NameError: name 'df_household' is not defined

Webscraping the wikipedia page for global peace index

In [21]:
## works!
url = "https://en.wikipedia.org/wiki/Global_Peace_Index"
df_world_peace = pd.read_html(url)[2]
df_world_peace

Unnamed: 0,Country,2019 rank,2019 score[12],2018 rank,2018 score[13],2017 rank,2017 score[2],2016 rank,2016 score[14],2015 rank,...,2012 rank,"2012 score[18],[19]",2011 rank,2011 score[20],2010 rank,2010 score[21],2009 rank,2009 score,2008 rank,2008 score
0,Iceland,1,1.072,1,1.096,1,1.084,1,1.138,1,...,1,1.159,1,1.099,1,1.143,1,1.160,1,1.111
1,New Zealand,2,1.221,2,1.188,2,1.216,3,1.238,3,...,3,1.276,2,1.255,2,1.251,2,1.260,2,1.261
2,Portugal,3,1.274,5,1.315,3,1.273,5,1.324,12,...,16,1.520,16,1.485,14,1.472,13,1.437,10,1.387
3,Austria,4,1.291,3,1.273,4,1.292,4,1.249,4,...,7,1.407,9,1.416,5,1.383,5,1.369,5,1.337
4,Denmark,5,1.316,4,1.313,5,1.299,2,1.201,2,...,2,1.235,3,1.290,4,1.334,3,1.269,3,1.272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158,Yemen,159,3.369,160,3.436,161,3.516,162,3.530,160,...,159,3.234,162,3.411,161,3.600,161,3.616,161,3.681
159,Syria,160,3.412,158,3.308,157,3.316,157,3.287,148,...,150,2.790,148,2.735,145,2.686,138,2.484,129,2.350
160,South Sudan,161,3.526,161,3.525,160,3.462,161,3.524,161,...,141,2.561,143,2.591,163,,163,,163,
161,Iraq,162,3.573,163,3.599,163,3.660,163,3.653,162,...,154,2.935,115,2.276,115,2.232,107,2.170,96,2.083


In [22]:
from bs4 import BeautifulSoup
import requests
from flask import Flask, request
import urllib.request
from urllib.request import urlopen

In [23]:
wiki_url = "https://en.wikipedia.org/wiki/Global_Peace_Index"
class_id = "wikitable sortable jquery-tablesorter"
#request.get("https://en.wikipedia.org/wiki/Peru").text

In [24]:
url = 'https://en.wikipedia.org/wiki/Global_Peace_Index'
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')
#peace_index = soup.find('table', attrs={'class': class_id})
#print(peace_index)
soup.title.string
# Locate the table first so that its rows exist before the header is read.
my_table = soup.find('table',{'class':'wikitable sortable'})
rows = my_table.find_all('tr')
header = [th.text.rstrip() for th in rows[0].find_all('th')]
print(header)

In [25]:
wikiurl='https://en.wikipedia.org/wiki/Global_Peace_Index'
table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wikiurl)
print(response.status_code)

200


In [26]:
url = "https://en.wikipedia.org/wiki/Global_Peace_Index"
table = pd.read_html(url)[2]
type(table)


pandas.core.frame.DataFrame

In [27]:
soup = BeautifulSoup(response.text, 'html.parser')
#soup.find("div", {"class":"mw-parser-output"})
#soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
#table = soup.find("table", {"class":"wikitable"})
ta=soup.find("table", {"class":"wikitable sortable"}).tbody


In [28]:
# Column names for the first eleven cells of each table row.
columns = ["Country", "2019 Rank", "2019 score", "2018 Rank", "2018 score", "2017 Rank", "2017 score", "2016 Rank", "2016 score", "2015 Rank", "2015 score"]

my_table = soup.find('table', {'class': 'wikitable sortable'})
records = []
for row in my_table.find_all("tr"):
    cells = row.find_all('td')
    if len(cells) < len(columns):
        continue  # skip rows without enough <td> cells, such as the header row
    # Keep the text of the first eleven cells: country, then rank and score per year.
    records.append([cell.get_text(strip=True) for cell in cells[:len(columns)]])

In [29]:
# Build the dataframe directly from the collected rows.
df_table = pd.DataFrame(records, columns=columns)
df_table

In [34]:
my_table = soup.find('table',{'class':'wikitable sortable'})
#my_table

In [35]:
### playing around with the stringency dataframe, may not be useful.

In [36]:
stringency = pd.read_csv("datasets_stringency.csv")

#Stringency index 0 to 100 of each country. 100 is strictest.

FileNotFoundError: [Errno 2] File datasets_stringency.csv does not exist: 'datasets_stringency.csv'

In [37]:
stringency.rename(columns={"Entity;Code;Date;Government Response Stringency Index ((0 to 100" : "Country", " 100 = strictest))" : "StringencyIndex"}, inplace=True)


NameError: name 'stringency' is not defined

In [38]:
stringency["Country"] = stringency["Country"].str.split(';', n=2, expand=True)[0]  # keep only the entity name, the first ';'-separated part
stringency["StringencyIndex"] = stringency["StringencyIndex"].str.split(';').str[1]

NameError: name 'stringency' is not defined

In [39]:
stringency["StringencyIndex"] = stringency["StringencyIndex"].astype('float')
stringency_mean_countries = stringency.groupby("Country").mean()

NameError: name 'stringency' is not defined

In [40]:
import tabula
from tabula import read_pdf

In [41]:
data= read_pdf("GPI_2020_web.pdf", pages=10, encoding = 'latin1', guess = False)

FileNotFoundError: [Errno 2] No such file or directory: 'GPI_2020_web.pdf'

In [None]:
#!!! in case the function does not work:
## function somehow not working, returning none value. follow next steps:
#df_unemployment = df_unemployment[['ref_area.label', 'time', 'obs_value']]
#df_unemployment.rename(columns={'ref_area.label':'Country', 'time':'Year', 'obs_value':'Unemployment_rate'}, inplace=True)
#df_unemployment.set_index("Country", inplace=True) 
#Changing to pivot table to have clearer overview. 
#df_unemployment = df_unemployment.pivot_table(values='Unemployment_rate', index=df_unemployment.index, columns='Year', aggfunc='first')
#df_unemployment