In [2]:
import pandas as pd
import plotly.express as px

# 1. EDB Score 

In `data/DBData.csv`, you have the full "ease of doing business" dataset from the World Bank. Reformat it into the **Tidy Data** format, so that each row holds one country in one year.

The result should look like:

![](EDB_unstack.png)

In [16]:
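df = pd.read_csv('data/DBData.csv')
# Step 1 (melt): collapse the per-year columns into long (Year, Value) pairs.
# Step 2 (pivot_table): spread the indicators back out so each row is one
# (Year, Country Name) pair and each indicator gets its own column.
pt = df.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    var_name="Year",
    value_name="Value"
).pivot_table(
    index=['Year', 'Country Name'],
    columns='Indicator Name'
).reset_index()
pt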

Unnamed: 0_level_0,Year,Country Name,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value
Indicator Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Dealing with construction permits (DB06-15 methodology) - Score,Dealing with construction permits (DB16-19 methodology) - Score,Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology),Dealing with construction permits: Building quality control index (0-15) (DB16-19 methodology) - Score,Dealing with construction permits: Cost (% of Warehouse value),Dealing with construction permits: Cost (% of Warehouse value) - Score,Dealing with construction permits: Liability and insurance regimes index (0-2) (DB16-19 methodology),Dealing with construction permits: Procedures (number),...,Trading across borders: Documents to export (number) (DB06-15 methodology),Trading across borders: Documents to export (number) (DB06-15 methodology) - Score,Trading across borders: Documents to import (number) (DB06-15 methodology),Trading across borders: Documents to import (number) (DB06-15 methodology) - Score,Trading across borders: Time to export (days) (DB06-15 methodology) - Score,Trading across borders: Time to export: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to export: Documentary compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import (days) (DB06-15 methodology) - Score,Trading across borders: Time to import: Border compliance (hours) (DB16-19 methodology) - Score,Trading across borders: Time to import: Documentary compliance (hours) (DB16-19 methodology) - Score
0,2004,Albania,,,,,,,,,...,,,,,,,,,,
1,2004,Algeria,,,,,,,,,...,,,,,,,,,,
2,2004,Angola,,,,,,,,,...,,,,,,,,,,
3,2004,Argentina,,,,,,,,,...,,,,,,,,,,
4,2004,Armenia,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3020,2019,Vietnam,,79.05,12.0,80.00,0.7,96.54,0.0,10.0,...,,,,,,66.04,71.01,,80.29,68.62
3021,2019,West Bank and Gaza,,56.15,12.0,80.00,14.4,28.24,0.0,20.0,...,,,,,,96.86,57.99,,98.21,81.45
3022,2019,"Yemen, Rep.",,0.00,,0.00,,0.00,,,...,,,,,,0.00,0.00,,0.00,0.00
3023,2019,Zambia,,71.65,10.0,66.67,2.6,86.92,0.0,10.0,...,,,,,,25.16,43.79,,57.35,70.29
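
The same melt-then-pivot reshape can be sketched on a toy frame. This is only an illustration under stated assumptions: the `wide` frame and its `Score` and `Cost` indicators are made up and merely mimic the column layout of `data/DBData.csv`. Passing `values="Value"` is an optional variation on the cell above that yields flat column names instead of the two-level `Value` header seen in the output.

```python
import pandas as pd

# Hypothetical stand-in for data/DBData.csv: one row per country and
# indicator, one column per year (all values are invented).
wide = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Indicator Name": ["Score", "Cost", "Score", "Cost"],
    "Indicator Code": ["S", "C", "S", "C"],
    "2018": [70.0, 1.5, 55.0, 3.0],
    "2019": [72.0, 1.4, None, 2.8],
})

# Wide -> long: each year column becomes a (Year, Value) pair.
long = wide.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    var_name="Year",
    value_name="Value",
)

# Long -> tidy: one row per (Year, Country Name), one column per indicator.
# Naming the values column keeps the resulting column index single-level.
tidy = long.pivot_table(
    index=["Year", "Country Name"],
    columns="Indicator Name",
    values="Value",
).reset_index()
tidy.columns.name = None  # drop the leftover "Indicator Name" axis label

print(tidy)
```

If the flat layout were used in the notebook itself, later cells would select indicator columns directly rather than through `business['Value']`.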


# 2. GDP and ease of doing business

Using the additional data in `data/GDPpc.csv`, join the clean dataset in **1** to the GDP data.

**What are the 3 Ease of Doing Business variables most closely linked to GDP?**

Answer by giving their correlation coefficients, a possible explanation, and a data visualization.

**hint:** trying to do `df.corr()` or `sns.pairplot()` on the whole dataset will crash most computers. Be smart about the number of columns you're testing at once.

In [21]:
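gdp = pd.read_csv('data/GDPpc.csv')
# Reshape the GDP data the same way as the EDB data: one row per (Year, Country Name)
pt2 = gdp.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    var_name="Year",
    value_name="Value"
).pivot_table(
    index=['Year', 'Country Name'],
    columns='Indicator Name'
).reset_index()
# Inner join: keep only the country-years present in both datasets
business = pd.merge(pt, pt2, on=['Year', 'Country Name'])
# Caution: this replaces missing values with 0, so "not measured" is treated
# as a real 0 and influences the correlations computed below
business.fillna(value = 0, axis = 'columns', inplace = True)
# Keep only the indicator columns (the 'Value' level of the column MultiIndex)
vf = business['Value']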
fig = px.scatter_matrix(vf[['GDP per capita (current US$)',
                            'Resolving insolvency: Recovery rate (cents on the dollar)',
                            'Resolving insolvency: Recovery rate (cents on the dollar) - Score',
                            'Resolving insolvency: Outcome (0 as piecemeal sale and 1 as going concern)'
                           ]],
                       labels = {'GDP per capita (current US$)':'GDP',
                                 'Resolving insolvency: Recovery rate (cents on the dollar)':'Recovery',
                                 'Resolving insolvency: Recovery rate (cents on the dollar) - Score':'Recovery - Score',
                                 'Resolving insolvency: Outcome (0 as piecemeal sale and 1 as going concern)':'Outcome'
                       })
fig.show()
print("Insolvency relates directly to debt and a lack of funds, so being able to resolve insolvency implies access to long-term funding.\n"
      "This suggests that insolvency resolution and GDP per capita are positively correlated."
     )
# Position 0 is GDP's correlation with itself, so keep positions 1 to 3
vf.corr()['GDP per capita (current US$)'].sort_values(ascending=False)[1:4]

Insolvency relates directly to debt and a lack of funds, so being able to resolve insolvency implies access to long-term funding.
This suggests that insolvency resolution and GDP per capita are positively correlated.


Indicator Name
Resolving insolvency: Recovery rate (cents on the dollar)                     0.644232
Resolving insolvency: Recovery rate (cents on the dollar) - Score             0.644209
Resolving insolvency: Outcome (0 as piecemeal sale and 1 as going concern)    0.514251
Name: GDP per capita (current US$), dtype: float64
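
The ranking above was computed after filling missing values with 0. As a hedged sketch of an alternative, `DataFrame.corrwith` correlates each column against a single target and skips missing pairs, which also avoids building the full correlation matrix. The frame below is synthetic (`indicator_a` and `indicator_b` are hypothetical names, not World Bank indicators) and only shows how zero-filling can distort a coefficient. It makes no claim about how the real ranking would change.

```python
import numpy as np
import pandas as pd

target = "GDP per capita (current US$)"

# Synthetic data: indicator_a tracks GDP perfectly wherever it was measured,
# but it is missing for the four richest observations.
frame = pd.DataFrame({
    target: np.arange(1, 11, dtype=float) * 1000.0,
    "indicator_a": [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan, np.nan],
    "indicator_b": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

features = frame.drop(columns=target)

# Missing pairs are excluded, so only the six observed rows of indicator_a
# enter its correlation with GDP.
corr_pairwise = features.corrwith(frame[target])

# Zero-filling pretends the four missing rows were measured as 0.
corr_zero_filled = features.fillna(0).corrwith(frame[target])

print(pd.DataFrame({"pairwise": corr_pairwise, "zero_filled": corr_zero_filled}))
```

In this toy case the zero-filled coefficient for `indicator_a` even changes sign. Note also that the first two variables in the ranking above, the recovery rate and its score, have nearly identical coefficients and appear to measure the same thing.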

# 3. Chocolate Nobel question

In this repository is the academic paper `chocolate_nobel.pdf`. 

Explain in 3 paragraphs why this paper's conclusions are bad statistics.

### Review:
#### Aside
First off, I'd just like to ask (rhetorically): how in God's name did this paper get accepted when it opens by stating that it uses Wikipedia as a source? That is especially puzzling because one of the links in that very Wikipedia article leads to a site that provides the appropriate information.

#### Actual answer
The paper makes only vague points when discussing the correlation between chocolate and Nobel Prize winners. It compares the number of Nobel laureates per 10 million population with annual chocolate consumption. Its own limitations section states exactly why it cannot reach any valid conclusion. One issue is that the paper cannot determine how much chocolate the Nobel laureates themselves consumed, so the data tell us nothing about whether the hypotheses, specifically the second one, are correct.

Another issue is that the first hypothesis treats dose-dependent chocolate consumption as important, and the authors back it up only by mentioning a future test. While a future test is a good idea, nothing in their arguments or results supports the hypothesis yet. Sweden cannot be considered a good example because it could very well be an outlier, like one of the examples Matt brought up in class. The data presented are too limited to be noteworthy.

The paper also lacks a large amount of information. It considers only a very small sample of countries, which does not make for a statistically convincing result. The sources were also restricted in time and did not give a broad enough picture of chocolate consumption. Simply put, the paper tries to suggest a correlation between chocolate consumption and cognitive function, but uses Nobel laureate counts as a stand-in because of a lack of information. This is also bad statistics, because it assumes that Nobel laureates per capita directly reflects a country's cognitive function.
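
The concern about small samples and outliers can be illustrated with a short sketch. The numbers below are synthetic and are not taken from the paper. They only show that, with about ten points, one extreme observation can turn a weak Pearson correlation into a very strong one.

```python
import numpy as np

# Synthetic illustration only: ten points with almost no linear relationship.
x = np.arange(1, 11, dtype=float)
y = np.array([5, 3, 6, 4, 5, 6, 3, 5, 4, 6], dtype=float)
r_without = np.corrcoef(x, y)[0, 1]

# Add one extreme observation that is far from the rest on both axes.
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

Here the coefficient rises from a weak value to above 0.9 on the strength of a single point, which is why a high correlation over a handful of countries says little on its own.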