# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from pandas_datareader import wb

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


In [None]:
# dobbelt kode??

final_data = final.dropna(subset=['GINI', 'GDP', 'COC'], how='any')

# Marker size in units of points^2
quality = final_data.COC

# Get minimum and maximum values of COC for colorbar range
cmin = final_data.COC.min()
cmax = final_data.COC.max()

def plot_scatter(year):
    fig, ax = plt.subplots()
    scatter = ax.scatter(final_data[final_data['year'] == year]['GINI'], final_data[final_data['year'] == year]['GDP'],
                         c=final_data[final_data['year'] == year]['COC'], alpha=0.5, vmin=cmin, vmax=cmax)
    ax.set_xlabel(r'$GINI$', fontsize=15)
    ax.set_ylabel(r'$GDP$', fontsize=15)
    ax.set_title('GINI and GDP (Year: {})'.format(year))
    ax.set_xlim(final_data['GINI'].min(), final_data['GINI'].max())  # Set x-axis limits
    ax.set_ylim(final_data['GDP'].min(), final_data['GDP'].max())  # Set y-axis limits
    ax.grid(True)
    fig.tight_layout()
    # Add color bar
    cbar = plt.colorbar(scatter)
    cbar.set_label('COC', fontsize=12)  # Set color bar label
    plt.show()

years = sorted(final_data['year'].unique())  # Get unique years in ascending order

# Create interactive slider for year variable
interact(plot_scatter, year=years)

# Read and clean data

In [3]:
wb_gini = wb.download(country='all', indicator='SI.POV.GINI', start=1990, end=2017)
wb_gini = wb_gini.rename(columns = {'SI.POV.GINI':'GINI'})
wb_gini = wb_gini.reset_index()
wb_gini.year = wb_gini.year.astype(int) # convert year
wb_gini.country = wb_gini.country.astype('string') # convert country to the special pandas string type
wb_gini.info()

wb_gdp = wb.download(country='all', indicator='NY.GDP.PCAP.KD', start=1990, end=2017)
wb_gdp = wb_gdp.rename(columns = {'NY.GDP.PCAP.KD':'GDP'})
wb_gdp.reset_index(inplace = True)
wb_gdp.year = wb_gdp.year.astype(int)
wb_gdp.country = wb_gdp.country.astype('string') # convert country to the special pandas string type
wb_gdp.info()
wb_gdp.head()

wb = pd.merge(wb_gini, wb_gdp, how = 'outer', on = ['country','year']);
wb.head(5)

AttributeError: 'DataFrame' object has no attribute 'download'

In [4]:
wb_wgi=pd.read_csv("WB WGI.csv")

# These columns have to go: 'Unnamed: 0' 'Unnamed: 1' 'Unnamed: 2' 'Unnamed: 3'
drop_these = ['Country Code','Series Code'] # use list comprehension to create list of columns
wb_wgi.drop(drop_these, axis=1, inplace=True) # axis = 1 -> columns, inplace=True -> changed, no copy made
wb_wgi.head(5)

col_dict = {}
for i in range(1996, 2022): # range goes from 1996 to 2021
    col_dict[str(i)+" [YR"+str(i)+"]"] = f'wgi{i}' 
col_dict

wb_wgi.rename(columns = col_dict, inplace=True)
wb_wgi.rename (columns ={'Country Name': 'country'}, inplace=True)
wb_wgi.sample(10)

Unnamed: 0,country,Series Name,wgi1996,wgi1998,wgi2000,wgi2002,wgi2003,wgi2004,wgi2005,wgi2006,...,wgi2012,wgi2013,wgi2014,wgi2015,wgi2016,wgi2017,wgi2018,wgi2019,wgi2020,wgi2021
1173,Tunisia,Regulatory Quality: Percentile Rank,60.8695640563965,57.0652160644531,57.0652160644531,51.8918914794922,55.6756744384766,49.7512435913086,52.9411773681641,56.8627433776855,...,45.0236968994141,43.6018943786621,40.8653831481934,40.3846168518066,35.0961532592773,39.9038467407227,34.1346168518066,38.461540222168,40.8653831481934,38.461540222168
726,Marshall Islands,Control of Corruption: Percentile Rank,..,..,..,..,..,..,47.8048782348633,31.2195129394531,...,51.6587677001953,58.2938385009766,56.25,55.769229888916,58.1730766296387,58.1730766296387,58.1730766296387,59.1346168518066,59.6153831481934,57.211540222168
809,Namibia,Voice and Accountability: Percentile Rank,64.5,58.2089538574219,58.7064666748047,55.7213935852051,54.2288551330566,57.6923065185547,59.6153831481934,60.0961532592773,...,60.0938949584961,59.6244125366211,66.0098495483398,67.4876861572266,66.9950714111328,65.5172424316406,61.835750579834,63.7681159973145,63.7681159973145,63.2850227355957
220,Cayman Islands,Rule of Law: Percentile Rank,90.9547729492188,91.5,91.5422897338867,90.5472640991211,86.5671615600586,86.0576934814453,77.9904327392578,85.1674652099609,...,76.0563354492188,76.5258178710938,73.0769195556641,76.9230804443359,77.4038467407227,76.9230804443359,76.9230804443359,76.9230804443359,75.961540222168,76.4423065185547
298,Cuba,Rule of Law: Percentile Rank,16.0804023742676,16.5,18.4079608917236,11.9402980804443,11.4427862167358,15.384614944458,14.8325357437134,26.3157901763916,...,29.577465057373,30.5164318084717,29.8076915740967,28.365385055542,35.0961532592773,35.5769233703613,35.0961532592773,42.788459777832,44.230770111084,37.980770111084
467,Guatemala,Voice and Accountability: Percentile Rank,42.5,43.7810935974121,40.7960205078125,40.2985076904297,34.3283576965332,38.461540222168,37.5,41.3461532592773,...,35.6807518005371,36.1502342224121,36.9458122253418,35.9605903625488,37.4384231567383,37.4384231567383,34.2995185852051,34.7826080322266,35.2657012939453,34.2995185852051
542,Ireland,Political Stability and Absence of Violence/Te...,95.2127685546875,99.4680862426758,97.8835983276367,96.2962951660156,96.9849243164063,91.2621383666992,94.1747589111328,91.7874374389648,...,77.7251205444336,75.3554534912109,87.1428604125977,77.6190490722656,74.2857131958008,85.2380981445313,86.3207550048828,82.0754699707031,82.0754699707031,75.4716949462891
455,Grenada,Voice and Accountability: Percentile Rank,66,68.6567153930664,69.1542282104492,70.6467666625977,71.1442794799805,65.8653869628906,64.4230804443359,70.6730804443359,...,70.4225387573242,70.8920211791992,73.3990173339844,72.4137954711914,72.4137954711914,71.4285736083984,63.7681159973145,70.0483093261719,69.5652160644531,70.0483093261719
1002,Serbia,Control of Corruption: Percentile Rank,11.8279571533203,6.41711235046387,9.04255294799805,23.280424118042,39.1534385681152,35.4679794311523,42.4390258789063,46.8292694091797,...,45.497631072998,46.445499420166,50.480770111084,47.5961532592773,43.75,43.269229888916,41.8269233703613,38.9423065185547,38.461540222168,36.0576934814453
1061,South Sudan,Voice and Accountability: Percentile Rank,..,..,..,..,..,..,..,..,...,10.3286380767822,8.92018795013428,4.92610836029053,5.41871929168701,3.94088673591614,2.95566511154175,1.44927537441254,1.93236720561981,2.4154589176178,2.4154589176178


In [5]:
wb_wgi = pd.wide_to_long(wb_wgi, stubnames='wgi', i=['country','Series Name'], j='year')
wb_wgi = wb_wgi.reset_index()
# Rename colum with series names to "ser"
wb_wgi.rename(columns = {'Series Name':'ser'}, inplace=True)

#Get a list of all the series in the data
namelist = wb_wgi.ser.unique()

newnames = {'Control of Corruption: Percentile Rank':'COC',
            'Government Effectiveness: Percentile Rank':'GOV', 
            'Political Stability and Absence of Violence/Terrorism: Percentile Rank':'RSA', 
            'Regulatory Quality: Percentile Rank':'REQ', 
            'Rule of Law: Percentile Rank':'ROL', 
            'Voice and Accountability: Percentile Rank':'VOA'}

# Renames series
for index, name in enumerate(namelist):
    print(index, "Now rename Series", name, "=", newnames[name])
    
    wb_wgi.loc[wb_wgi.ser == name, 'ser'] = newnames[name]

newnamelist = wb_wgi.ser.unique()

wb_wgi.sample(5)

0 Now rename Series Control of Corruption: Percentile Rank = COC
1 Now rename Series Government Effectiveness: Percentile Rank = GOV
2 Now rename Series Political Stability and Absence of Violence/Terrorism: Percentile Rank = RSA
3 Now rename Series Regulatory Quality: Percentile Rank = REQ
4 Now rename Series Rule of Law: Percentile Rank = ROL
5 Now rename Series Voice and Accountability: Percentile Rank = VOA


Unnamed: 0,country,ser,year,wgi
26360,Timor-Leste,COC,2000,..
5946,"Congo, Dem. Rep.",COC,2011,1.42180097103119
13122,"Jersey, Channel Islands",COC,2011,84.3601913452148
12842,Jamaica,COC,2007,46.6019401550293
23120,Serbia,REQ,2004,34.825870513916


In [6]:
wb_new=pd.pivot(wb_wgi, index=['country','year'], columns = 'ser', values= 'wgi')
wb_new=wb_new.reset_index()
final = pd.merge(wb, wb_new, on=['year', 'country'], how = 'outer')

In [7]:
col_list = ['COC', 'GOV', 'REQ', 'ROL', 'RSA', 'VOA']
for i in col_list:
    final.loc[final[i]==".."] = np.nan
    final[i]=final[i].astype(float)
final.info()
final

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8494 entries, 0 to 8493
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  8191 non-null   object 
 1   year     8191 non-null   float64
 2   GINI     1577 non-null   float64
 3   GDP      6624 non-null   float64
 4   COC      4619 non-null   float64
 5   GOV      4619 non-null   float64
 6   REQ      4619 non-null   float64
 7   ROL      4619 non-null   float64
 8   RSA      4619 non-null   float64
 9   VOA      4619 non-null   float64
dtypes: float64(9), object(1)
memory usage: 730.0+ KB


Unnamed: 0,country,year,GINI,GDP,COC,GOV,REQ,ROL,RSA,VOA
0,Africa Eastern and Southern,2017.0,,1534.683482,,,,,,
1,Africa Eastern and Southern,2016.0,,1534.924746,,,,,,
2,Africa Eastern and Southern,2015.0,,1538.552268,,,,,,
3,Africa Eastern and Southern,2014.0,,1537.159571,,,,,,
4,Africa Eastern and Southern,2013.0,,1518.482850,,,,,,
...,...,...,...,...,...,...,...,...,...,...
8489,Zambia,2021.0,,,25.480770,19.230770,32.211540,30.288462,48.113209,35.265701
8490,Zimbabwe,2018.0,,,10.096154,9.134615,5.288462,8.173077,20.283018,16.908213
8491,Zimbabwe,2019.0,,,10.096154,10.096154,6.250000,7.211538,14.622642,16.425121
8492,Zimbabwe,2020.0,,,9.615385,9.134615,7.692307,7.692307,13.207547,17.874395


In [27]:
final = final.dropna(subset=['country','year'], how='any')
final.year = final.year.astype(int)
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8191 entries, 0 to 8493
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  8191 non-null   object 
 1   year     8191 non-null   int32  
 2   GINI     1577 non-null   float64
 3   GDP      6624 non-null   float64
 4   COC      4619 non-null   float64
 5   GOV      4619 non-null   float64
 6   REQ      4619 non-null   float64
 7   ROL      4619 non-null   float64
 8   RSA      4619 non-null   float64
 9   VOA      4619 non-null   float64
dtypes: float64(8), int32(1), object(1)
memory usage: 671.9+ KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final.year = final.year.astype(int)


In [28]:
#forklar BNP og ulighed på baggrund af governance. 
    #- lag governance evt. 10 år.

import statsmodels.api as sm 

X = final.loc[:,['GINI','COC']] #COC skal ændres til governance indikator
X['COC_lag'] = X['COC'].shift(-10) # COC skal ændres...
X = sm.add_constant(X)

y = final.loc[:,'GDP']


model = sm.OLS(y,X, missing='drop')
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,GDP,R-squared:,0.63
Model:,OLS,Adj. R-squared:,0.629
Method:,Least Squares,F-statistic:,427.7
Date:,"Thu, 13 Apr 2023",Prob (F-statistic):,4.01e-162
Time:,17:11:03,Log-Likelihood:,-8232.1
No. Observations:,757,AIC:,16470.0
Df Residuals:,753,BIC:,16490.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,490.4908,2855.586,0.172,0.864,-5115.366,6096.348
GINI,-375.2488,61.439,-6.108,0.000,-495.860,-254.637
COC,349.3438,33.554,10.411,0.000,283.474,415.214
COC_lag,208.3893,32.174,6.477,0.000,145.228,271.550

0,1,2,3
Omnibus:,334.892,Durbin-Watson:,0.381
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2123.599
Skew:,1.894,Prob(JB):,0.0
Kurtosis:,10.279,Cond. No.,580.0


In [59]:
# sort dataset to contain years 2000-2020
years = range(2000, 2021)
final1 = final.loc[final['year'].isin(years)]


In [73]:
#detrend BNP
MANGLER

# Calculate std. of GDP for each country
final1['std'] = final1.groupby('country')['GDP'].transform('std')

#select only one year (=2017)
final2 = final1.loc[final['year'] == 2000]

#regress std. på lagged governance 
X = final2.loc[:,['COC']] #COC skal ændres til governance indikator
#X['COC_lag'] = X['COC'].shift(-10) # COC skal ændres...
X = sm.add_constant(X)

y = final2.loc[:,'std']

model = sm.OLS(y,X, missing='drop')
result = model.fit()
result.summary()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final1['std'] = final1.groupby('country')['GDP'].transform('std')


0,1,2,3
Dep. Variable:,std,R-squared:,0.184
Model:,OLS,Adj. R-squared:,0.179
Method:,Least Squares,F-statistic:,39.35
Date:,"Thu, 13 Apr 2023",Prob (F-statistic):,2.69e-09
Time:,17:52:09,Log-Likelihood:,-1600.5
No. Observations:,177,AIC:,3205.0
Df Residuals:,175,BIC:,3211.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-211.0021,302.641,-0.697,0.487,-808.298,386.294
COC,33.1545,5.285,6.273,0.000,22.723,43.586

0,1,2,3
Omnibus:,245.712,Durbin-Watson:,1.919
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18735.249
Skew:,5.825,Prob(JB):,0.0
Kurtosis:,52.037,Cond. No.,112.0


In [None]:
#diverse..

#Detrender
from scipy import signal
y1 = signal.detrend(y).max()


y = final.loc[:,'GDP']
y1 = signal.detrend(y).max() #hvad med noise? 

model = sm.OLS(y1,X, missing='drop')
result = model.fit()
result.summary()


In [18]:
# Find varians i BNP og forklar på baggrund af (lagged værdi af) governance
    # detrend BNP... stationarity...
    # Std. af BNP 

# find bestemte lande.. 
# beregn samme std. fr lande 
# drop for undtagen et land.. 
# se variaion mellem




TypeError: 'method' object is not subscriptable

#given variabel: træk gennemsnittet fra og divider med standardafvigelsen

## Explore each data set

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

**Interactive plot** :

In [None]:
def plot_func():
    # Function that operates on data set
    pass

widgets.interact(plot_func, 
    # Let the widget interact with data through plot_func()    
); 


Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of a (inner) **merge**:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped' )
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONLUSION.