## Additional material

### Using `size()` to summarize categorical data 

When working with data,
we commonly want to know the number of observations in each category of a categorical variable.
For this,
pandas provides the `size()` method.
For example,
to find the number of observations per region
(in this case, unique countries in 2018):

In [65]:
world_data_2018.groupby('region').size()

region
Africa      52
Americas    31
Asia        47
Europe      39
Oceania     10
dtype: int64
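
It is easy to confuse `size()` with `count()`.
The sketch below uses a small made-up dataframe (`toy` is hypothetical, not the lesson's dataset)
to illustrate the difference:
`size()` counts rows per group,
while `count()` counts only the non-missing values in a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lesson's dataset
toy = pd.DataFrame({
    'region': ['Asia', 'Asia', 'Europe', 'Europe', 'Europe'],
    'income': [603.0, np.nan, 1200.0, 1500.0, np.nan],
})

# Number of rows in each group, missing values included
rows_per_region = toy.groupby('region').size()

# Number of non-missing income values in each group
non_missing_income = toy.groupby('region')['income'].count()

print(rows_per_region)
print(non_missing_income)
```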

`size()` can also be used when grouping on multiple variables.

In [66]:
world_data_2018.groupby(['region', 'income_group']).size()

region    income_group
Africa    Low             27
          Lower middle    16
          Upper middle     8
          High             1
Americas  Low              1
          Lower middle     4
          Upper middle    16
          High            10
Asia      Low              6
          Lower middle    17
          Upper middle    13
          High            11
Europe    Low              0
          Lower middle     2
          Upper middle     9
          High            28
Oceania   Low              0
          Lower middle     4
          Upper middle     3
          High             3
dtype: int64
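
A long two-level result like this can be easier to scan as a table.
One possible approach, sketched here on made-up data (`toy` is hypothetical),
is to move the inner grouping level into columns with `unstack()`.
With plain string columns,
combinations that never occur are absent from the `size()` output,
so `fill_value=0` is used to show them as zero.

```python
import pandas as pd

# Hypothetical toy data, not the lesson's dataset
toy = pd.DataFrame({
    'region': ['Africa', 'Africa', 'Africa', 'Europe', 'Europe'],
    'income_group': ['Low', 'Low', 'High', 'High', 'High'],
})

# One count per (region, income_group) combination that occurs in the data
counts = toy.groupby(['region', 'income_group']).size()

# Regions as rows, income groups as columns; absent combinations become 0
table = counts.unstack(fill_value=0)

print(table)
```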

If there are many groups,
`size()` is not that useful on its own.
For example,
it is difficult to quickly find the five sub-regions with the most countries.

In [67]:
world_data_2018.groupby('sub_region').size()

sub_region
Australia and New Zealand           2
Central Asia                        5
Eastern Asia                        5
Eastern Europe                     10
Latin America and the Caribbean    29
Melanesia                           4
Micronesia                          2
Northern Africa                     6
Northern America                    2
Northern Europe                    10
Polynesia                           2
South-eastern Asia                 10
Southern Asia                       9
Southern Europe                    12
Sub-Saharan Africa                 46
Western Asia                       18
Western Europe                      7
dtype: int64

Since there are many rows in this output,
it would be beneficial to sort the table by its values so that the largest sub-regions are easy to find.
This is easy to do with the `sort_values()` method.

In [68]:
world_data_2018.groupby('sub_region').size().sort_values()

sub_region
Australia and New Zealand           2
Polynesia                           2
Micronesia                          2
Northern America                    2
Melanesia                           4
Eastern Asia                        5
Central Asia                        5
Northern Africa                     6
Western Europe                      7
Southern Asia                       9
Northern Europe                    10
South-eastern Asia                 10
Eastern Europe                     10
Southern Europe                    12
Western Asia                       18
Latin America and the Caribbean    29
Sub-Saharan Africa                 46
dtype: int64

That's better,
but it would be more helpful to display the largest sub-regions at the top.
In other words,
the output should be arranged in descending order,
and `head(5)` can then keep only the first five rows.

In [69]:
world_data_2018.groupby('sub_region').size().sort_values(ascending=False).head(5)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

A shortcut for sorting and returning the top values is to use `nlargest()`.

In [71]:
world_data_2018.groupby('sub_region').size().nlargest(5)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

Looks good!
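
When several groups have the same size,
which of them makes it into the top rows depends on how ties are handled.
The sketch below uses made-up data (`toy` is hypothetical) to compare the approaches;
`keep='all'` is an optional argument of `nlargest()` that returns every tied group,
and `value_counts()` is an alternative that counts and sorts a single column in one step.

```python
import pandas as pd

# Hypothetical toy data: groups B and C are tied with two rows each
toy = pd.DataFrame({'sub_region': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']})
sizes = toy.groupby('sub_region').size()

# Sort in descending order, then keep the first two rows
top_sorted = sizes.sort_values(ascending=False).head(2)

# Same idea in one call; ties are resolved by first occurrence
top_nlargest = sizes.nlargest(2)

# Keep every group tied with the last selected value (may return more than 2 rows)
top_with_ties = sizes.nlargest(2, keep='all')

# Counting and sorting a single column directly
top_value_counts = toy['sub_region'].value_counts().head(2)

print(top_with_ties)
```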

### Method chaining

By now,
the code statement has grown quite long because many methods have been *chained* together.
It can be tricky to keep track of what is going on in long method chains.
To make the code more readable,
it can be broken up over multiple lines by wrapping it in parentheses.

In [72]:
(world_data_2018
     .groupby('sub_region')
     .size()
     .sort_values(ascending=False)
     .head(5)
)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

This looks neater and makes long method chains easier to read.
There is no absolute rule for when to break code into multiple lines,
but always try to write code that is easy for collaborators to understand.
Remember that your most common collaborator is a future version of yourself!
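
To see that a parenthesized chain is only a different layout of the same operations,
the sketch below compares it with a step-by-step version that stores each intermediate result.
The data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the lesson's dataset
toy = pd.DataFrame({'sub_region': ['A', 'A', 'A', 'B', 'B', 'C']})

# Step-by-step version with intermediate variables
grouped = toy.groupby('sub_region')
sizes = grouped.size()
top_stepwise = sizes.nlargest(2)

# Equivalent chained version, one method per line
top_chained = (
    toy
    .groupby('sub_region')
    .size()
    .nlargest(2)
)

print(top_chained.equals(top_stepwise))
```

The chained version avoids naming intermediate results that are never used again,
while the step-by-step version can be handy when inspecting each stage during debugging.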

The `nlargest()` method fits naturally into such a chain.
It returns the five largest values by default,
so the values don't need to be sorted explicitly.

In [73]:
(world_data_2018
     .groupby(['sub_region'])
     .size()
     .nlargest()  # the default is 5
)

sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

To include more attributes of these groups in the output,
add the corresponding columns to `groupby()`.

In [74]:
(world_data_2018
     .groupby(['region', 'sub_region'])
     .size()
     .nlargest()  # the default is 5
)

region    sub_region                     
Africa    Sub-Saharan Africa                 46
Americas  Latin America and the Caribbean    29
Asia      Western Asia                       18
Europe    Southern Europe                    12
Asia      South-eastern Asia                 10
dtype: int64
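
The result of grouping on several columns is a Series with a multi-level index.
If an ordinary dataframe is more convenient for further work,
one option is to finish the chain with `reset_index()`.
This is a sketch on made-up data;
the column name `n_countries` is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the lesson's dataset
toy = pd.DataFrame({
    'region': ['Asia', 'Asia', 'Asia', 'Europe', 'Europe'],
    'sub_region': ['Western Asia', 'Western Asia', 'Eastern Asia',
                   'Southern Europe', 'Southern Europe'],
})

top = (
    toy
    .groupby(['region', 'sub_region'])
    .size()
    .nlargest(2)
    # Turn the index levels into columns and name the counts column
    .reset_index(name='n_countries')
)

print(top)
```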

In [75]:
world_data.head()

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_emissions,years_in_school_men,years_in_school_women,population_income
0,Afghanistan,1800,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
1,Afghanistan,1801,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
2,Afghanistan,1802,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
3,Afghanistan,1803,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000
4,Afghanistan,1804,3280000,Asia,Southern Asia,Low,28.2,603,7.0,469.0,,,,,1977840000


>#### Challenge
>
> 1. How many countries are there in each income group worldwide?
> 2. Assign the variable name `world_data_2015` to a dataframe containing only the values from year 2015
>    (e.g. the same way as `world_data_2018` was created)
> 3.
>    a. For those countries where women went to school longer than men,
>       how many are there in each income group?
>    b. Do the same as above but for countries where men went to school longer than women.
>       What does this distribution tell you?

### Data cleaning tips

`dropna()` removes rows containing explicit `NaN` values
as well as values that pandas has interpreted as `NaN`,
such as the non-numeric values in the `life_expectancy` column.


In [80]:
world_data_2018.dropna()

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_emissions,years_in_school_men,years_in_school_women,population_income


The result above is empty because every 2018 observation has a `NaN` value in at least one column.
Instead of dropping observations that have `NaN` values in any column,
a subset of columns can be considered.

In [83]:
world_data_2018.dropna(subset=['life_expectancy'])

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_emissions,years_in_school_men,years_in_school_women,population_income
218,Afghanistan,2018,36400000,Asia,Southern Asia,Low,58.7,1870,4.33,65.90,55.7,,,,68068000000
437,Albania,2018,2930000,Europe,Southern Europe,Upper middle,78.0,12400,1.71,12.90,107.0,,,,36332000000
656,Algeria,2018,42000000,Africa,Northern Africa,Upper middle,77.9,13700,2.64,23.10,17.6,,,,575400000000
875,Angola,2018,30800000,Africa,Sub-Saharan Africa,Lower middle,65.2,5850,5.55,81.60,24.7,,,,180180000000
1094,Antigua and Barbuda,2018,103000,Americas,Latin America and the Caribbean,High,77.6,21000,2.03,7.89,234.0,,,,2163000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38324,Venezuela,2018,32400000,Americas,Latin America and the Caribbean,Upper middle,75.9,14200,2.27,15.40,36.7,,,,460080000000
38543,Vietnam,2018,96500000,Asia,South-eastern Asia,Lower middle,74.9,6550,1.95,20.20,311.0,,,,632075000000
38762,Yemen,2018,28900000,Asia,Western Asia,Low,67.1,2430,3.79,51.90,54.8,,,,70227000000
38981,Zambia,2018,17600000,Africa,Sub-Saharan Africa,Lower middle,59.5,3870,4.87,59.50,23.7,,,,68112000000
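
The difference between these two calls can be reproduced on a small made-up dataframe
(`toy` is hypothetical; its column names only mimic the lesson's dataset).
The sketch also shows `how='all'` with `axis='columns'`,
which is one way to discard columns that contain nothing but missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one column that is entirely missing
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'life_expectancy': [58.7, np.nan, 77.9],
    'years_in_school_men': [np.nan, np.nan, np.nan],
})

# Drops every row, because each row has at least one NaN
any_missing = toy.dropna()

# Only rows with a missing life_expectancy are dropped
subset_only = toy.dropna(subset=['life_expectancy'])

# Drops columns in which all values are missing
without_empty_columns = toy.dropna(axis='columns', how='all')

print(len(any_missing), len(subset_only), list(without_empty_columns.columns))
```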


Non-numeric values can also be coerced into explicit `NaN` values
via the top-level `to_numeric()` function.

In [84]:
pd.to_numeric(world_data_2018['life_expectancy'], errors='coerce')

218      58.7
437      78.0
656      77.9
875      65.2
1094     77.6
         ... 
38324    75.9
38543    74.9
38762    67.1
38981    59.5
39200    60.2
Name: life_expectancy, Length: 179, dtype: float64
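
The effect of `errors='coerce'` is easier to see on a column that visibly contains non-numeric entries.
The values below are made up for illustration and are not taken from the lesson's dataset.

```python
import pandas as pd

# Hypothetical raw column read in as text, with two non-numeric entries
raw = pd.Series(['58.7', '78.0', 'unknown', '-'])

# Entries that cannot be parsed as numbers become NaN instead of raising an error
coerced = pd.to_numeric(raw, errors='coerce')

# Number of values that were turned into NaN
n_missing = coerced.isna().sum()

print(coerced)
print(n_missing)
```

Counting the missing values before and after coercion is a quick check of how many entries were affected.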