# Home task: pandas 

## Question 1

- Load the energy data from the file [Energy Indicators.xls](http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls).
It is a list of indicators of energy supply and renewable electricity production from the United Nations for the year 2013.


- It should be put into a DataFrame with the variable name of "energy"


- Make sure to exclude the footer and header information from the datafile.


- The first two columns are unnecessary, so you should get rid of them, and you should change the column labels so that the columns are:<br>
`['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']`


- Convert `Energy Supply` to gigajoules (there are 1,000,000 gigajoules in a petajoule).


- For all countries which have missing data (e.g. data with `...`) make sure this is reflected as `np.NaN` values.


- Rename the following list of countries (for use in later questions):
    - `Republic of Korea`: `South Korea`,
    - `United States of America`: `United States`,
    - `United Kingdom of Great Britain and Northern Ireland`: `United Kingdom`,
    - `China, Hong Kong Special Administrative Region`: `Hong Kong`


- There are also several countries with numbers and/or parentheses in their names. Be sure to remove these, e.g.:
    - `Bolivia (Plurinational State of)` should be `Bolivia`,
    - `Switzerland17` should be `Switzerland`.


- Next, load the GDP data from the file ["world_bank.csv"](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD).
It is a CSV file containing countries' GDP from 1960 to 2015 from the World Bank. Call this DataFrame "GDP".


- Make sure to skip the header, and rename the following list of countries:
    - `Korea, Rep.`: `South Korea`,
    - `Iran, Islamic Rep.`: `Iran`,
    - `Hong Kong SAR, China`: `Hong Kong`


- Finally, load the [Scimago Journal and Country Rank data for Energy Engineering and Power Technology](http://www.scimagojr.com/countryrank.php?category=2102). It ranks countries based on their journal contributions in the aforementioned area. Call this DataFrame "ScimEn".


- Join the three datasets: Energy, GDP, and ScimEn into a new dataset (using the intersection of country names). Use only the 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).


- The index of this DataFrame should be the name of the country, and the columns should be<br>
`['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']`

The function `answer_one` should return the resulting DataFrame (20 columns and 15 entries).

## Import all necessary packages

In [197]:
import pandas as pd
import numpy as np

## Modifying data

### Drop useless stuff, rename column names

In [198]:
# read Excel with all data and drop some useless data
df = pd.read_excel("Energy Indicators.xls", skiprows=16, skipfooter=38).drop(['Unnamed: 0', 'Unnamed: 1'], axis=1).drop(0)

# rename [Unnamed: 2] -> [Country], [Renewable Electricity Production] -> [% Renewable]
df = df.rename(columns = {'Unnamed: 2': 'Country', 'Renewable Electricity Production': '% Renewable'})

df.head(5)

|   | Country | Energy Supply | Energy Supply per capita | % Renewable |
|---|---|---|---|---|
| 1 | Afghanistan | 321 | 10 | 78.66928 |
| 2 | Albania | 102 | 35 | 100.0 |
| 3 | Algeria | 1959 | 51 | 0.55101 |
| 4 | American Samoa | ... | ... | 0.641026 |
| 5 | Andorra | 9 | 121 | 88.69565 |


### Replace missing data (e.g. data with `...`) with `np.NaN` values.

In [199]:
cols = list(df.columns)[1:]

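# First method: keep numeric cells and replace everything else (e.g. '...') with NaN.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df[cols] = df[cols].apply(lambda x: [item if isinstance(item, (int, float)) else np.nan for item in x.values])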

# Second method
# mask = df[cols].applymap(lambda x: isinstance(x, (int, float)))
# df[cols] = df[cols].where(mask)

print(df.head(5))

df.dtypes

          Country  Energy Supply  Energy Supply per capita  % Renewable
1     Afghanistan          321.0                      10.0    78.669280
2         Albania          102.0                      35.0   100.000000
3         Algeria         1959.0                      51.0     0.551010
4  American Samoa            NaN                       NaN     0.641026
5         Andorra            9.0                     121.0    88.695650


Country                      object
Energy Supply               float64
Energy Supply per capita    float64
% Renewable                 float64
dtype: object
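
As an alternative to the two methods above, the same cleanup can be sketched with `pd.to_numeric` and `errors="coerce"`, which turns any value that cannot be parsed as a number into `NaN`. The frame `toy` below is a hypothetical stand-in for the energy table, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the energy table; '...' marks missing data.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "American Samoa", "Andorra"],
    "Energy Supply": [321, "...", 9],
    "Energy Supply per capita": [10, "...", 121],
})

numeric_cols = ["Energy Supply", "Energy Supply per capita"]

# errors="coerce" converts unparseable values such as '...' to NaN
# and leaves each column with a float dtype.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy)
print(toy.dtypes)
```

This keeps the `Country` column untouched because only the listed numeric columns are converted.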

### Convert `Energy Supply` to gigajoules (there are 1,000,000 gigajoules in a petajoule).

In [200]:
# Converting
df['Energy Supply'] = df['Energy Supply']*1_000_000

df.head(5)

|   | Country | Energy Supply | Energy Supply per capita | % Renewable |
|---|---|---|---|---|
| 1 | Afghanistan | 321000000.0 | 10.0 | 78.66928 |
| 2 | Albania | 102000000.0 | 35.0 | 100.0 |
| 3 | Algeria | 1959000000.0 | 51.0 | 0.55101 |
| 4 | American Samoa | NaN | NaN | 0.641026 |
| 5 | Andorra | 9000000.0 | 121.0 | 88.69565 |


### Rename the following list of countries (for use in later questions):
    - `Republic of Korea`: `South Korea`,
    - `United States of America`: `United States`,
    - `United Kingdom of Great Britain and Northern Ireland`: `United Kingdom`,
    - `China, Hong Kong Special Administrative Region`: `Hong Kong`

In [201]:
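# With regex=True any matching substring is replaced, so each pattern is anchored
# with ^...$; otherwise a longer name that merely contains "Republic of Korea"
# would be rewritten as well. Raw strings avoid invalid-escape warnings for \d.
df = df.replace({ 'Country': {r'^Republic of Korea$': 'South Korea',
                         r'^United States of America\d+$': 'United States',
                         r'^United Kingdom of Great Britain and Northern Ireland\d+$': 'United Kingdom',
                         r'^China, Hong Kong Special Administrative Region\d+$': 'Hong Kong'}}, regex=True)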

# check if replace was successful
[df.loc[df['Country'] == country] for country in ['Hong Kong', 'South Korea', 'United States', 'United Kingdom']]


    

[      Country  Energy Supply  Energy Supply per capita  % Renewable
 44  Hong Kong    585000000.0                      82.0          0.0,
          Country  Energy Supply  Energy Supply per capita  % Renewable
 165  South Korea   1.100700e+10                     221.0     2.279353,
            Country  Energy Supply  Energy Supply per capita  % Renewable
 217  United States   9.083800e+10                     286.0     11.57098,
             Country  Energy Supply  Energy Supply per capita  % Renewable
 215  United Kingdom   7.920000e+09                     124.0     10.60047]

### There are also several countries with numbers and/or parentheses in their names. Be sure to remove these, e.g.:
    - `Bolivia (Plurinational State of)` should be `Bolivia`,
    - `Switzerland17` should be `Switzerland`.

In [202]:
# replace:
# Switzerland17 -> Switzerland
# Bolivia (Plurinational State of) -> Bolivia
# [A-Za-z] is used instead of [A-z], which would also match the characters [ \ ] ^ _ `
df = df.replace({ 'Country': { r'([A-Za-z]+)\d+': r'\1', r'(\w+) \(.*\)': r'\1' } }, regex=True)

[df.loc[df['Country'] == country] for country in ['Switzerland', 'Bolivia']]

[         Country  Energy Supply  Energy Supply per capita  % Renewable
 198  Switzerland   1.113000e+09                     136.0     57.74548,
     Country  Energy Supply  Energy Supply per capita  % Renewable
 25  Bolivia    336000000.0                      32.0     31.47712]
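
The name cleanup can also be sketched with the `.str` accessor, stripping footnote digits and parenthesised suffixes first and applying the exact-match renames afterwards. The Series `names` below is a hypothetical sample, and the order of the steps is one possible choice rather than the only one.

```python
import pandas as pd

# Hypothetical sample of raw country names.
names = pd.Series([
    "Switzerland17",
    "Bolivia (Plurinational State of)",
    "United States of America20",
    "Japan",
])

cleaned = (
    names
    .str.replace(r"\d+", "", regex=True)        # drop footnote digits
    .str.replace(r"\s*\(.*\)", "", regex=True)  # drop a parenthesised suffix
    .str.strip()
)

# Without regex=True, Series.replace matches whole values only.
cleaned = cleaned.replace({"United States of America": "United States"})

print(cleaned.tolist())
```

Removing the digits before renaming means the rename dictionary does not need a `\d+` suffix on every key.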

### Next, load the GDP data from the file ["world_bank.csv"](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD).

It is a CSV file containing countries' GDP from 1960 to 2015 from the World Bank. Call this DataFrame "GDP".

Make sure to skip the header, and rename the following list of countries:
- `Korea, Rep.`: `South Korea`,
- `Iran, Islamic Rep.`: `Iran`,
- `Hong Kong SAR, China`: `Hong Kong`

In [215]:
df_gdp = pd.read_excel('world_bank.xls', skiprows=3)

df_gdp.head(2)

|   | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | GDP (current US$) | NY.GDP.MKTP.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2549721000.0 | 2534637000.0 | 2727850000.0 | 2790849000.0 | 2962905000.0 | 2983637000.0 | 3092430000.0 | 3202189000.0 | NaN | NaN |
| 1 | Africa Eastern and Southern | AFE | GDP (current US$) | NY.GDP.MKTP.CD | 19291930000.0 | 19701860000.0 | 21470350000.0 | 25705000000.0 | 23501650000.0 | 26781170000.0 | ... | 896256100000.0 | 913197400000.0 | 927655500000.0 | 956318700000.0 | 893099700000.0 | 854751900000.0 | 962269000000.0 | 984032000000.0 | 977809200000.0 | 898474100000.0 |


In [216]:
df_gdp = df_gdp.rename(columns={ 'Country Name': 'Country' })

df_gdp = df_gdp.replace({ 'Country': { 'Korea, Rep.': 'South Korea', 'Iran, Islamic Rep.': 'Iran', 'Hong Kong SAR, China': 'Hong Kong' } })

[df_gdp.loc[df_gdp['Country'] == country] for country in ['South Korea', 'Iran', 'Hong Kong']]

KeyError: 'Country'

## Read the last dataset

In [213]:
df_sci = pd.read_excel('scimagojr country rank 1996-2020.xlsx')

df_sci

|   | Rank | Country | Region | Documents | Citable documents | Citations | Self-citations | Citations per document | H index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | China | Asiatic Region | 273437 | 272374 | 2336764 | 1615239 | 8.55 | 245 |
| 1 | 2 | United States | Northern America | 175891 | 172431 | 2230544 | 724472 | 12.68 | 363 |
| 2 | 3 | India | Asiatic Region | 55082 | 53775 | 463165 | 162944 | 8.41 | 181 |
| 3 | 4 | Japan | Asiatic Region | 50523 | 50065 | 488062 | 119930 | 9.66 | 193 |
| 4 | 5 | United Kingdom | Western Europe | 43389 | 42284 | 615670 | 111290 | 14.19 | 226 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 203 | 204 | Comoros | Africa | 1 | 1 | 0 | 0 | 0.00 | 0 |
| 204 | 205 | Svalbard and Jan Mayen | Western Europe | 1 | 1 | 0 | 0 | 0.00 | 0 |
| 205 | 206 | Palau | Pacific Region | 1 | 1 | 0 | 0 | 0.00 | 0 |
| 206 | 207 | Bahamas | Latin America | 1 | 1 | 0 | 0 | 0.00 | 0 |
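
The join required by Question 1 could be sketched as two inner merges on `Country`, after restricting the ranking table to ranks 1 through 15. The three frames below are tiny hypothetical stand-ins with made-up numbers, and the year columns are assumed to be strings; with the real files, the labels may need converting first.

```python
import pandas as pd

# Hypothetical miniature versions of the three datasets (made-up values).
energy = pd.DataFrame({"Country": ["China", "Japan", "Andorra"],
                       "% Renewable": [20.0, 10.0, 90.0]})
gdp = pd.DataFrame({"Country": ["China", "Japan", "Aruba"],
                    "2014": [100.0, 50.0, 1.0],
                    "2015": [110.0, 45.0, 1.5]})
sci = pd.DataFrame({"Country": ["China", "Japan", "Comoros"],
                    "Rank": [1, 4, 204]})

# Keep only the top-ranked countries before joining.
top = sci[sci["Rank"] <= 15]

# how="inner" keeps only countries present in every table (the intersection).
merged = (
    top.merge(energy, on="Country", how="inner")
       .merge(gdp, on="Country", how="inner")
       .set_index("Country")
)

print(merged)
```

On the real data, the GDP frame would first be narrowed to `Country` plus the 2006-2015 columns so that the result has exactly the 20 columns listed in the task.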


## Answer the following questions in the context of only the top 15 countries by Scimagojr Rank (aka the DataFrame returned by `answer_one()`)

### Question 2
What is the average GDP over the last 10 years for each country? (exclude missing values from this calculation.)

*This function should return a Series named `avgGDP` with 15 countries and their average GDP sorted in descending order.*

In [1]:
def answer_two():
    Top15 = answer_one()
    return "ANSWER"
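
One way to approach this, shown here on a hypothetical toy frame rather than the real `Top15`, is a row-wise mean over the year columns. `DataFrame.mean` skips `NaN` by default, which matches the instruction to exclude missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical GDP columns for three made-up countries.
toy = pd.DataFrame(
    {"2014": [10.0, 4.0, np.nan], "2015": [12.0, np.nan, 3.0]},
    index=["A", "B", "C"],
)

year_cols = ["2014", "2015"]  # the real task would use '2006' through '2015'

avgGDP = (
    toy[year_cols]
    .mean(axis=1)                  # row-wise mean; NaN values are skipped
    .sort_values(ascending=False)
    .rename("avgGDP")
)

print(avgGDP)
```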

### Question 3
By how much did the GDP change over the 10-year span for the country with the 6th largest average GDP?

*This function should return a single number.*

In [2]:
def answer_three():
    Top15 = answer_one()
    return "ANSWER"

### Question 4

Create a new column that is the ratio of Self-Citations to Total Citations. 
What is the maximum value for this new column, and what country has the highest ratio?

*This function should return a tuple with the name of the country and the ratio.*

In [3]:
def answer_four():
    Top15 = answer_one()
    return "ANSWER"
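
A minimal sketch of the ratio calculation, using a hypothetical toy frame with made-up citation counts: divide the two columns, then use `idxmax` for the country label and `max` for the value.

```python
import pandas as pd

# Hypothetical citation counts for three made-up countries.
toy = pd.DataFrame(
    {"Citations": [100, 200, 50], "Self-citations": [50, 20, 10]},
    index=["A", "B", "C"],
)

ratio = toy["Self-citations"] / toy["Citations"]

# idxmax returns the index label of the largest value.
result = (ratio.idxmax(), ratio.max())

print(result)
```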


### Question 5

Create a column that estimates the population using Energy Supply and Energy Supply per capita. 
What is the third most populous country according to this estimate?

*This function should return a single string value.*

In [4]:
def answer_five():
    Top15 = answer_one()
    return "ANSWER"
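
Since `Energy Supply` is a total and `Energy Supply per capita` is the same quantity per person, their quotient gives a population estimate. The sketch below uses a hypothetical toy frame with made-up numbers; the column name `PopEst` is an assumption, not part of the assignment.

```python
import pandas as pd

# Hypothetical energy figures for four made-up countries.
toy = pd.DataFrame(
    {"Energy Supply": [1000.0, 600.0, 900.0, 100.0],
     "Energy Supply per capita": [10.0, 20.0, 15.0, 4.0]},
    index=["A", "B", "C", "D"],
)

# total / (total per person) = number of people
toy["PopEst"] = toy["Energy Supply"] / toy["Energy Supply per capita"]

# Position 2 of the descending sort is the third most populous entry.
third = toy["PopEst"].sort_values(ascending=False).index[2]

print(third)
```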

### Question 6
Create a column that estimates the number of citable documents per person. 
What is the correlation between the number of citable documents per capita and the energy supply per capita? Use the `.corr()` method (Pearson's correlation).

*This function should return a single number.*


In [5]:
def answer_six():
    Top15 = answer_one()
    return "ANSWER"
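
The per-capita column and the correlation can be sketched as follows on a hypothetical toy frame. `Series.corr` computes Pearson's correlation by default; the names `PopEst` and `Citable docs per capita` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical figures for three made-up countries.
toy = pd.DataFrame(
    {"Energy Supply": [1000.0, 600.0, 900.0],
     "Energy Supply per capita": [10.0, 20.0, 30.0],
     "Citable documents": [200, 60, 90]},
    index=["A", "B", "C"],
)

toy["PopEst"] = toy["Energy Supply"] / toy["Energy Supply per capita"]
toy["Citable docs per capita"] = toy["Citable documents"] / toy["PopEst"]

# method="pearson" is the default for Series.corr
corr = toy["Citable docs per capita"].corr(toy["Energy Supply per capita"])

print(corr)
```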

### Question 7
Use the following dictionary to group the countries by continent, then create a DataFrame that displays the sample size (the number of countries in each continent bin), and the sum, mean, and standard deviation for the estimated population of each country.

```python
ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}
```

*This function should return a DataFrame with index named Continent `['Asia', 'Australia', 'Europe', 'North America', 'South America']` and columns `['size', 'sum', 'mean', 'std']`*

In [6]:
def answer_seven():
    Top15 = answer_one()
    return "ANSWER"
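
A grouping like this can be sketched by passing the dictionary directly to `groupby`, which maps each index label to its group. The population figures below are made-up round numbers on a five-country subset, and the Series name `PopEst` is an assumption.

```python
import pandas as pd

# Made-up population estimates indexed by country.
pop = pd.Series(
    {"China": 1400.0, "Japan": 130.0, "Germany": 80.0,
     "France": 60.0, "Brazil": 200.0},
    name="PopEst",
)

# Subset of the continent mapping, enough for the toy index above.
continents = {"China": "Asia", "Japan": "Asia", "Germany": "Europe",
              "France": "Europe", "Brazil": "South America"}

# A dict passed to groupby is applied to the index labels.
summary = pop.groupby(continents).agg(["size", "sum", "mean", "std"])
summary.index.name = "Continent"

print(summary)
```

A continent with a single country has an undefined sample standard deviation, so its `std` entry is `NaN`.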