# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2
from pandas_datareader import wb

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


# Read and clean data

In [13]:
wb_gini = wb.download(indicator='SI.POV.GINI', start=1990, end=2017)
wb_gini = wb_gini.rename(columns = {'SI.POV.GINI':'GINI'})
wb_gini = wb_gini.reset_index()
wb_gini.year = wb_gini.year.astype(int) # convert year from string to int
wb_gini.country = wb_gini.country.astype('string') # convert country to the special pandas string type
wb_gini.info()

wb_gdp = wb.download(indicator='NY.GDP.PCAP.KD', start=1990, end=2017)
wb_gdp = wb_gdp.rename(columns = {'NY.GDP.PCAP.KD':'GDP'})
wb_gdp.reset_index(inplace = True)
wb_gdp.year = wb_gdp.year.astype(int)
wb_gdp.country = wb_gdp.country.astype('string') # convert country to the special pandas string type
wb_gdp.info()
wb_gdp.head()

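# merge on country and year; the result is not named `wb`, which would shadow the pandas_datareader module imported above
wb_merged = pd.merge(wb_gini, wb_gdp, how='outer', on=['country', 'year'])
wb_merged.head(5)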

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  84 non-null     string 
 1   year     84 non-null     int32  
 2   GINI     66 non-null     float64
dtypes: float64(1), int32(1), string(1)
memory usage: 1.8 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  84 non-null     string 
 1   year     84 non-null     int32  
 2   GDP      77 non-null     float64
dtypes: float64(1), int32(1), string(1)
memory usage: 1.8 KB


Unnamed: 0,country,year,GINI,GDP
0,Canada,2017,33.3,44325.488337
1,Canada,2016,32.7,43536.913403
2,Canada,2015,33.7,43596.135537
3,Canada,2014,33.2,43635.095481
4,Canada,2013,33.8,42846.284196


In [48]:
wb_wgi=pd.read_csv("WB WGI.csv")

# These identifier columns are not needed: 'Country Code', 'Series Code'
drop_these = ['Country Code', 'Series Code'] # list of columns to drop
wb_wgi.drop(drop_these, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made
wb_wgi.head(5)

col_dict = {}
for i in range(1996, 2022): # range goes from 1996 to 2021
    col_dict[f'{i} [YR{i}]'] = f'wgi{i}'
col_dict

wb_wgi.rename(columns = col_dict, inplace=True)
wb_wgi.sample(10)

['Country Code', 'Series Code']


Unnamed: 0,Country Name,Series Name,wgi1996,wgi1998,wgi2000,wgi2002,wgi2003,wgi2004,wgi2005,wgi2006,...,wgi2012,wgi2013,wgi2014,wgi2015,wgi2016,wgi2017,wgi2018,wgi2019,wgi2020,wgi2021
393,Finland,Regulatory Quality: Percentile Rank,94.5652160644531,97.2826080322266,97.2826080322266,98.3783798217773,99.4594573974609,99.0049743652344,99.0196075439453,94.6078414916992,...,98.1042633056641,98.5782012939453,98.5576934814453,98.0769195556641,96.6346130371094,96.6346130371094,96.1538467407227,97.5961532592773,99.038459777832,99.038459777832
24,Andorra,Control of Corruption: Percentile Rank,87.0967712402344,89.3048095703125,88.2978744506836,87.8306884765625,88.3597869873047,84.2364501953125,88.2926864624023,88.2926864624023,...,87.2037887573242,86.7298583984375,87.0192337036133,87.0192337036133,87.5,86.538459777832,86.538459777832,88.461540222168,88.461540222168,88.9423065185547
823,Netherlands,Government Effectiveness: Percentile Rank,98.3606567382813,99.4535522460938,98.9071044921875,96.7567596435547,97.8378372192383,98.0099487304688,98.0392150878906,94.634147644043,...,97.1563949584961,97.6303329467773,98.0769195556641,97.1153869628906,97.1153869628906,96.6346130371094,96.6346130371094,97.1153869628906,97.5961532592773,97.1153869628906
203,Cambodia,Voice and Accountability: Percentile Rank,25.0,24.8756217956543,25.3731346130371,26.8656711578369,23.3830852508545,22.115385055542,18.75,21.634614944458,...,20.1877937316895,19.7183094024658,18.7192115783691,18.2266006469727,18.2266006469727,15.7635469436646,13.5265703201294,15.4589376449585,12.5603866577148,10.6280193328857
901,Papua New Guinea,Government Effectiveness: Percentile Rank,43.7158470153809,38.2513656616211,44.2622947692871,39.4594612121582,37.8378372192383,29.8507461547852,21.0784320831299,27.3170738220215,...,26.0663509368896,30.3317527770996,26.9230766296387,31.730770111084,23.5576915740967,26.4423084259033,25.0,20.1923084259033,18.269229888916,16.8269233703613
940,Puerto Rico,Rule of Law: Percentile Rank,81.4070358276367,81.5,82.5870666503906,81.592041015625,70.1492538452148,71.1538467407227,68.4210510253906,65.0717697143555,...,72.7699508666992,70.8920211791992,76.4423065185547,74.5192337036133,74.038459777832,71.6346130371094,72.1153869628906,72.1153869628906,72.1153869628906,70.6730804443359
788,Morocco,Political Stability and Absence of Violence/Te...,39.8936157226563,57.4468078613281,42.3280410766602,37.037036895752,32.1608047485352,35.4368934631348,29.1262130737305,30.917875289917,...,32.7014236450195,28.9099521636963,29.5238094329834,33.8095245361328,34.2857131958008,31.4285717010498,34.9056587219238,33.0188674926758,34.9056587219238,32.5471687316895
81,Azerbaijan,Regulatory Quality: Percentile Rank,11.4130430221558,15.7608699798584,18.4782600402832,24.8648643493652,30.2702693939209,27.8606967926025,29.9019603729248,32.8431358337402,...,33.1753540039063,36.4928894042969,43.75,40.8653831481934,39.4230766296387,38.9423065185547,39.9038467407227,44.230770111084,39.9038467407227,50.480770111084
914,Peru,Political Stability and Absence of Violence/Te...,15.4255323410034,29.7872333526611,17.9894180297852,20.1058197021484,15.5778894424438,17.9611644744873,18.9320392608643,21.7391300201416,...,19.4312801361084,20.3791465759277,27.1428565979004,32.3809509277344,38.0952377319336,36.1904754638672,37.2641525268555,42.4528312683105,34.4339637756348,32.0754699707031
1263,West Bank and Gaza,Regulatory Quality: Percentile Rank,21.1956520080566,20.1086959838867,17.3913040161133,17.8378372192383,17.2972965240479,26.3681583404541,15.6862745285034,16.1764698028564,...,54.502368927002,55.4502372741699,60.5769233703613,55.769229888916,55.769229888916,54.3269233703613,54.8076934814453,56.25,56.25,50.961540222168


In [51]:
# ERROR IN THE TOP LINE.
wb_wgi = pd.wide_to_long(wb_wgi, stubnames='wgi', i=['Country Name','Series Name'], j='year')
wb_wgi.sample(10)


ValueError: the id variables need to uniquely identify each row
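
`pd.wide_to_long` raises this error when the same combination of `i` columns occurs in more than one row. A common cause in downloaded files is a handful of empty or footer rows without a country or series name; whether that is what happens in `WB WGI.csv` is an assumption that should be checked. The sketch below uses a small synthetic table (the values are made up) to show one way to count the offending rows and clean them before reshaping:

```python
import pandas as pd

# Synthetic stand-in for the wide WGI table (values are made up for illustration)
wide = pd.DataFrame({
    'Country Name': ['Finland', 'Finland', 'Peru', None, None],
    'Series Name': ['Rule of Law', 'Regulatory Quality', 'Rule of Law', None, None],
    'wgi1996': [98.0, 94.5, 30.1, None, None],
    'wgi1998': [99.0, 97.2, 31.4, None, None],
})

id_cols = ['Country Name', 'Series Name']

# Count rows whose id combination occurs more than once (here: the two empty rows)
n_dup = wide.duplicated(subset=id_cols, keep=False).sum()
print(f'rows with non-unique ids: {n_dup}')

# Drop rows without ids, then any remaining repeats of the same id pair
clean = wide.dropna(subset=id_cols).drop_duplicates(subset=id_cols)

# With unique ids the reshape goes through: one row per country, series and year
long_df = pd.wide_to_long(clean, stubnames='wgi', i=id_cols, j='year').reset_index()
print(long_df)
```

Inspecting the rows flagged by `duplicated` before dropping them makes it clear which observations are discarded.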

In [None]:
# Build up a logical index I that flags rows where any year column holds '..'
# (assumed here to be the marker for missing values in the downloaded file)
I = pd.Series(False, index=wb_wgi.index)
for i in range(1996, 2022):
    col = f'wgi{i}'
    if col in wb_wgi.columns: # not every year is present
        I |= wb_wgi[col].astype(str).str.contains('..', regex=False)
wb_wgi.loc[I, :]

Import your data, either through an API or manually, and load it. 

## Explore each data set

To **explore the raw data**, you may provide **static** and **interactive plots** that show important developments.

**Interactive plot**:

In [None]:
def plot_func():
    # Function that operates on data set
    pass

widgets.interact(plot_func, 
    # Let the widget interact with data through plot_func()    
); 


Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped')
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
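
One way to make the discarded observations explicit is the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording whether each row came from the left table, the right table, or both. The following is a minimal sketch on two small illustrative tables; the Denmark values are made up, and the names `gini`, `gdp` and `merged` are placeholders rather than objects from this notebook:

```python
import pandas as pd

# Two small illustrative tables (Denmark values are made up)
gini = pd.DataFrame({'country': ['Canada', 'Canada', 'Denmark'],
                     'year': [2016, 2017, 2017],
                     'GINI': [32.7, 33.3, 28.7]})
gdp = pd.DataFrame({'country': ['Canada', 'Denmark', 'Denmark'],
                    'year': [2017, 2016, 2017],
                    'GDP': [44325.5, 55000.0, 56000.0]})

# Outer merge keeps everything; _merge records where each row came from
merged = pd.merge(gini, gdp, how='outer', on=['country', 'year'], indicator=True)
counts = merged['_merge'].value_counts()
print(counts)

# Rows marked 'both' are exactly what an inner merge would keep
inner = merged.loc[merged['_merge'] == 'both'].drop(columns='_merge')
print(inner.shape)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would throw away, so the counts document what is lost.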

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
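
As a rough illustration of summary statistics on an aggregation, the sketch below groups a small made-up panel by country and computes a few statistics per group. The table `panel` and its Denmark values are assumptions for the example, not results from the data above:

```python
import pandas as pd

# Small illustrative panel (Denmark values are made up)
panel = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Denmark', 'Denmark'],
    'year': [2016, 2017, 2016, 2017],
    'GINI': [32.7, 33.3, 28.2, 28.7],
})

# Aggregate to the country level: one row per country, one column per statistic
summary = panel.groupby('country')['GINI'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

Grouping by `year` instead would give a cross-country summary for each year; which aggregation is meaningful depends on the question being asked.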

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.