Social Data Science WS19/20

# Home Assignment 1



Submit your solution via Moodle by 23:59 on Wednesday, November 6th. Late submissions are accepted for 12 hours after the deadline, with 1/3 of the total possible points deducted from the score.

You can (and should!) submit solutions in teams of 2-3 members.
Please list all members of the team with their student ID and full name in the notebook. Please submit only one notebook per team. Only submit a notebook; do not submit the dataset(s) you used.

Cite ALL your sources for coding this home assignment. In case of plagiarism (copying solutions from other teams or from the internet) ALL team members will be expelled from the course without warning.

##### List team members, including all student IDs here:
1. Student 1 (123456)
2. Student 2 (123457)
3. (optional) Student 3 (123458)

## Exploring the Quality of Government Dataset

In this home assignment, we are going to explore the 2019 Quality of Government dataset(s), which have been assembled by the QOG Institute at the University of Gothenburg.
All data as well as documentation can be found here: https://www.qogdata.pol.gu.se/data/

Note that we only consider the data published in January 2019, i.e., the data files whose names contain the suffix "jan19". Do NOT use any dataset other than those found in this online repository, except for the data file that we refer to in Task 1.

#### Coding guidelines:
* Make sure that your code is executable. Any task whose code does not run directly on our machine will be graded with 0 points.
* In that regard, do not rename the dataset you use, and load it from the directory in which your ipynb notebook is located, i.e., your working directory. In particular, when loading your file via a pandas or numpy command, make sure that the call has the form `pd.read_csv("qog_file.csv")` instead of `pd.read_csv("C:/User/Path/to/your/Homework/qog_file.csv")` so that the code runs directly on our machines.
* Make sure you clean up your code before submission, e.g., properly align your code, and delete every line of code that you no longer need, even if you experimented with it.
* Feel free to use comments in the code. While we do not require them for full marks, they may help us in case your code has minor errors.
* You may create as many additional cells as you want, just make sure that the solutions to the individual tasks can be found near the corresponding assignment.

#### Plotting guidelines:
* For each visualization task, you may create only ONE graphic. Thus, if you want to convey a lot of information, think carefully about how to approach this. If you have more than one visualization per task, we will only count the LAST one. You may, however, use auxiliary plots or textual outputs to illustrate how you arrived at your final plot.
* To get full marks for your plots, we require that you consider the principles taught in the lecture. In particular, your plots and the message they convey should be easy to understand. No chart junk. Optimize the data-ink ratio!
* Write a brief summary (<=5 sentences) of your plot in a markdown cell directly below it. 
* Make sure you also pay attention to details such as properly calibrated axes, understandable labels, properly placed legends, etc.

In [27]:
# General preprocessing may go here
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

### Task 1: Analyzing Life Expectancy (2 pts)

It is widely known that in most countries in Africa, the life expectancy is much lower than in well-developed countries such as Germany. In this first part of the homework, we are looking into this issue.

We are going to investigate differences with respect to continents, and we consider five continents: Africa, Americas, Asia, Europe, Oceania. We use the following reference mapping from countries to continents:  
https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv

Note that loading this file into your notebook will also yield the easiest way to add continent information to the QOG data. Again, keep the existing file name "all.csv".

In [5]:
# General preprocessing may go here
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
df_qog = pd.read_csv("qog_std_cs_jan19.csv")
df_continent = pd.read_csv("all.csv")
# Attach continent information by joining on the numeric country code
df_all = pd.merge(df_qog, df_continent, left_on='ccode', right_on='country-code')
df_all

Unnamed: 0,ccode,cname,ccodealp,ccodecow,ccodewb,version,aid_cpnc,aid_cpsc,aid_crnc,aid_crnio,...,alpha-2,alpha-3,country-code,iso_3166-2,region,sub-region,intermediate-region,region-code,sub-region-code,intermediate-region-code
0,4,Afghanistan,AFG,700.0,4.0,QoGStdCSJan19,,,29.0,13.0,...,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142.0,34.0,
1,8,Albania,ALB,339.0,8.0,QoGStdCSJan19,,,26.0,13.0,...,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150.0,39.0,
2,12,Algeria,DZA,615.0,12.0,QoGStdCSJan19,,,21.0,6.0,...,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,,2.0,15.0,
3,20,Andorra,AND,232.0,20.0,QoGStdCSJan19,,,,,...,AD,AND,20,ISO 3166-2:AD,Europe,Southern Europe,,150.0,39.0,
4,24,Angola,AGO,540.0,24.0,QoGStdCSJan19,,,22.0,13.0,...,AO,AGO,24,ISO 3166-2:AO,Africa,Sub-Saharan Africa,Middle Africa,2.0,202.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,860,Uzbekistan,UZB,704.0,860.0,QoGStdCSJan19,,,21.0,16.0,...,UZ,UZB,860,ISO 3166-2:UZ,Asia,Central Asia,,142.0,143.0,
190,862,Venezuela,VEN,101.0,862.0,QoGStdCSJan19,,,20.0,6.0,...,VE,VEN,862,ISO 3166-2:VE,Americas,Latin America and the Caribbean,South America,19.0,419.0,5.0
191,882,Samoa,WSM,990.0,882.0,QoGStdCSJan19,,,10.0,7.0,...,WS,WSM,882,ISO 3166-2:WS,Oceania,Polynesia,,9.0,61.0,
192,887,Yemen,YEM,679.0,887.0,QoGStdCSJan19,,,26.0,16.0,...,YE,YEM,887,ISO 3166-2:YE,Asia,Western Asia,,142.0,145.0,
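
`pd.merge` performs an inner join by default, so any QOG row whose `ccode` has no counterpart in `all.csv` is dropped without warning. The following is a minimal sketch of how one might check for such rows. It uses two tiny stand-in tables rather than the real files, and the row with code 999 ("Nowhere") is invented purely for illustration.

```python
import pandas as pd

# Tiny synthetic stand-ins for the QOG table and all.csv (illustrative values only)
qog = pd.DataFrame({"ccode": [4, 8, 999],
                    "cname": ["Afghanistan", "Albania", "Nowhere"]})
continents = pd.DataFrame({"country-code": [4, 8, 12],
                           "region": ["Asia", "Europe", "Africa"]})

# A left join keeps every QOG row; indicator=True adds a "_merge" column
# that records whether a match was found in the continent table
checked = pd.merge(qog, continents, left_on="ccode", right_on="country-code",
                   how="left", indicator=True)

# Rows marked "left_only" have no continent and would vanish in an inner join
unmatched = checked.loc[checked["_merge"] == "left_only", ["ccode", "cname"]]
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join loses no countries.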


#### a) Hypothesis testing (1 pt)
Based on the QOG 2019 data, apply a hypothesis test to show that in African countries, the life expectancy is significantly lower than in European countries. Explicitly state the null hypothesis that you choose, including the test statistic you are using, and explain your approach.

_Use this Markdown cell to formulate your hypothesis and explain your testing procedure._

In [46]:
# Code to a) goes here
# H0: the mean life expectancy in African countries equals that in European countries.
# H1: the mean life expectancy in African countries is lower than in European countries.
# We use the variable "ihme_lifexp_allt" to do the hypothesis testing,
# because this variable is the life expectancy of males and females over all ages.
# Test statistic: Welch's two-sample t statistic (no equal-variance assumption).

Africa = df_all.loc[df_all.loc[:,"region"] == "Africa"] # rows for African countries
Europe = df_all.loc[df_all.loc[:,"region"] == "Europe"] # rows for European countries
mean_Africanlife = Africa.ihme_lifexp_allt.mean()
mean_Europeanlife = Europe.ihme_lifexp_allt.mean()

print(mean_Africanlife)
print(mean_Europeanlife)

mean_diff = mean_Africanlife-mean_Europeanlife
print(mean_diff)

# Drop missing values; otherwise ttest_ind returns NaN
col_1 = Africa.loc[:,"ihme_lifexp_allt"].dropna().to_numpy()
col_2 = Europe.loc[:,"ihme_lifexp_allt"].dropna().to_numpy()

# ttest_ind reports a two-sided p-value; convert it to the one-sided
# p-value for H1: mean(Africa) < mean(Europe)
t_stat, p_two_sided = ttest_ind(col_1, col_2, equal_var=False)
p_value = p_two_sided/2 if t_stat < 0 else 1 - p_two_sided/2
print("t statistic: " + str(t_stat))
print("One-sided p-value: " + str(p_value))

if p_value < 0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

63.906111944444426
78.73235559999998
-14.826243655555551
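
A permutation test is one distribution-free way to obtain an empirical p-value for a comparison like the one in a). The sketch below is an illustration under stated assumptions, not the required solution: the helper name `permutation_p_value` is hypothetical, and the two samples are synthetic numbers drawn from normal distributions, not values from the QOG files.

```python
import numpy as np

def permutation_p_value(sample_a, sample_b, n_permutations=10_000, seed=0):
    """One-sided permutation p-value for H1: mean(sample_a) < mean(sample_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    a = a[~np.isnan(a)]  # ignore missing values
    b = b[~np.isnan(b)]
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so reshuffle them
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if diff <= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive
    return (count + 1) / (n_permutations + 1)

# Synthetic example data (NOT taken from the QOG files)
rng = np.random.default_rng(1)
group_low = rng.normal(loc=64, scale=6, size=50)
group_high = rng.normal(loc=79, scale=4, size=40)

p_low = permutation_p_value(group_low, group_high)
p_reversed = permutation_p_value(group_high, group_low)
print(p_low, p_reversed)
```

The p-value is the share of label shufflings whose mean difference is at least as extreme as the observed one. Swapping the two samples tests the opposite direction and should give a p-value close to 1 here.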


#### b) Visualization  (1 pt):
We now consider the life expectancies over all continents. Create an informative visualization that points out how the life expectancies over all five continents differ.

50% baseline: Plotting life expectancy means by continent against each other.
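
Plotting full distributions rather than means alone is one possible way to go beyond the baseline. The following is a rough sketch of a box plot per continent, ordered by median. It assumes matplotlib is available, and the data frame `demo` and its column `life_expectancy` are synthetic placeholders with randomly chosen group centers, so the picture says nothing about the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic placeholder data: 20 made-up values per continent with
# randomly chosen centers (NOT taken from the QOG files)
rng = np.random.default_rng(0)
regions = ["Africa", "Americas", "Asia", "Europe", "Oceania"]
centers = rng.uniform(60, 80, size=len(regions))
demo = pd.DataFrame({
    "region": np.repeat(regions, 20),
    "life_expectancy": np.concatenate([rng.normal(c, 4, 20) for c in centers]),
})

# Sort continents by median so the ordering itself carries information
order = (demo.groupby("region")["life_expectancy"]
             .median().sort_values().index.tolist())
groups = [demo.loc[demo["region"] == name, "life_expectancy"].dropna().to_numpy()
          for name in order]

fig, ax = plt.subplots(figsize=(7, 4))
ax.boxplot(groups)  # boxes are drawn at positions 1..n by default
ax.set_xticks(range(1, len(order) + 1))
ax.set_xticklabels(order)
ax.set_ylabel("Life expectancy (years)")
ax.set_title("Life expectancy by continent (synthetic data)")

# Remove non-data ink
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
fig.tight_layout()
```

With the real data, `demo` would be replaced by the merged frame and `life_expectancy` by the chosen QOG column.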

In [4]:
# Code to b) goes here

_Description of your plot, up to 5 sentences may go here_

### Task 2: Investigating Corruption (3 pts)

In this second part, we focus on corruption as measured by the _Bayesian Corruption Indicator (BCI)_ (column `bci_bci`).
Explore the data for factors that correlate with corruption, and visualize your findings.  
__Note__: you may NOT consider correlations with other columns that explicitly measure corruption.

50% Baseline: Plotting values from one other column in the data against the BCI in a properly designed plot, where some correlation becomes apparent.
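
One way to start the exploration is to rank all numeric columns by their rank correlation with `bci_bci` and then inspect the top candidates by hand. The sketch below is only a starting point under stated assumptions. The helper `rank_correlations` and its parameters are hypothetical, the demo columns `income_like`, `noise`, and `other_corruption_index` are synthetic, and the list of corruption-related columns to exclude has to be compiled from the QOG codebook.

```python
import numpy as np
import pandas as pd

def rank_correlations(df, target="bci_bci", exclude=(), min_obs=30):
    """Rank numeric columns by absolute Spearman correlation with `target`."""
    numeric = df.select_dtypes(include="number")
    rows = []
    for col in numeric.columns:
        if col == target or col in exclude:
            continue
        pair = numeric[[target, col]].dropna()
        # Skip columns with too few complete pairs or no variation
        if len(pair) < min_obs or pair[col].nunique() < 2:
            continue
        # Spearman correlation = Pearson correlation of the ranks
        rho = pair[target].rank().corr(pair[col].rank())
        rows.append((col, rho, len(pair)))
    result = pd.DataFrame(rows, columns=["column", "spearman_rho", "n"])
    return (result.sort_values("spearman_rho", key=lambda s: s.abs(),
                               ascending=False)
                  .reset_index(drop=True))

# Synthetic demo data (NOT taken from the QOG files)
rng = np.random.default_rng(0)
n = 100
latent = rng.normal(size=n)
demo = pd.DataFrame({
    "bci_bci": 50 + 10 * latent + rng.normal(scale=2, size=n),
    "income_like": -latent + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
    "other_corruption_index": latent + rng.normal(scale=0.1, size=n),
})

ranking = rank_correlations(demo, exclude=["other_corruption_index"])
print(ranking)
```

A high rank correlation only flags a candidate. The final plot still needs a scatter of the chosen column against the BCI to show the shape of the relationship.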

_Description of your plot, consisting of up to 5 sentences may go here_