# Aggregation 📚

### After this encounter we will have covered 
- how to apply different aggregation methods to your datasets
- what .groupby() does and some options for applying it to your datasets

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

## 1. Applying aggregation methods: 

In [2]:
df = pd.read_csv("large_countries_2015.csv", sep = ",")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     12 non-null     object 
 1   population  12 non-null     float64
 2   fertility   12 non-null     float64
 3   continent   12 non-null     object 
dtypes: float64(2), object(2)
memory usage: 512.0+ bytes


In [3]:
df["population"] = (df["population"]/1000).astype(int)

In [4]:
df.head(2)

| | country | population | fertility | continent |
|---|---|---|---|---|
| 0 | Bangladesh | 160995 | 2.12 | Asia |
| 1 | Brazil | 207847 | 1.78 | South America |


Let's apply some aggregation methods!
Intuition of "aggregation": take some rows, apply an operation to them, and return a summarized version of those rows.
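
To make the intuition concrete, here is a minimal sketch on a small hand-built frame. The three rows are taken from the dataset used in this notebook; the variable name `small` is only for illustration. Several rows go in, and a single summary value per column comes out.

```python
import pandas as pd

# Three rows from the dataset (population in thousands)
small = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995, 207847, 1376048],
    "fertility": [2.12, 1.78, 1.57],
})

# An aggregation reduces many rows to one summary value
total_population = small["population"].sum()
mean_fertility = small["fertility"].mean()

print(total_population)
print(round(mean_fertility, 2))
```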

In [5]:
df.sum()

country       BangladeshBrazilChinaIndiaIndonesiaJapanMexico...
population                                              4504146
fertility                                                 29.25
continent     AsiaSouth AmericaAsiaAsiaAsiaAsiaNorth America...
dtype: object

If we apply .sum() to the complete dataframe, strings will be concatenated. 
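
If only the numeric totals are wanted, one option is the `numeric_only` parameter of `DataFrame.sum()`. The sketch below uses a hand-built frame with rows from this dataset (the name `small` is illustrative); selecting the numeric columns first works just as well.

```python
import pandas as pd

small = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995, 207847, 1376048],
    "continent": ["Asia", "South America", "Asia"],
})

# numeric_only=True restricts the sum to the numeric columns,
# so the string columns are skipped instead of concatenated
numeric_totals = small.sum(numeric_only=True)
print(numeric_totals)
```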

In [6]:
df["population"].sum()

4504146

In [7]:
df[["population", "fertility"]].sum()

population    4504146.00
fertility          29.25
dtype: float64

In [8]:
df["country"].count()

12

In [9]:
df["country"].value_counts()

Indonesia        1
United States    1
Nigeria          1
India            1
Pakistan         1
Russia           1
China            1
Philippines      1
Mexico           1
Brazil           1
Bangladesh       1
Japan            1
Name: country, dtype: int64

In [10]:
df["country"].unique()

array(['Bangladesh', 'Brazil', 'China', 'India', 'Indonesia', 'Japan',
       'Mexico', 'Nigeria', 'Pakistan', 'Philippines', 'Russia',
       'United States'], dtype=object)

In [11]:
df.describe()

| | population | fertility |
|---|---|---|
| count | 12.0 | 12.0 |
| mean | 375345.5 | 2.4375 |
| std | 456519.3 | 1.200781 |
| min | 100699.0 | 1.45 |
| 25% | 139346.2 | 1.7375 |
| 50% | 185562.5 | 2.125 |
| 75% | 273615.5 | 2.5675 |
| max | 1376048.0 | 5.89 |


.agg() can be used to aggregate more "modularly":

In [12]:
df.agg(
    {"population":"mean",
    "fertility":"median"
    }
)

population    375345.500
fertility          2.125
dtype: float64

In [13]:
df.agg(
    ["median","mean","std"]
)

| | population | fertility |
|---|---|---|
| median | 185562.5 | 2.125 |
| mean | 375345.5 | 2.4375 |
| std | 456519.344768 | 1.200781 |


In [23]:
def double(x):
    return 2*x

In [24]:
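# Note: "double" is passed as a string here, so pandas does not call the
# double() function defined above; the output below shows the fertility
# values unchanged. To apply the custom function, pass the function object
# itself: "fertility": double
df.agg(
    {"population":"mean",
    "fertility":"double"
    }
)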

| | population | fertility |
|---|---|---|
| 0 | 375345.5 | 2.12 |
| 1 | 375345.5 | 1.78 |
| 2 | 375345.5 | 1.57 |
| 3 | 375345.5 | 2.43 |
| 4 | 375345.5 | 2.28 |
| 5 | 375345.5 | 1.45 |
| 6 | 375345.5 | 2.13 |
| 7 | 375345.5 | 5.89 |
| 8 | 375345.5 | 3.04 |
| 9 | 375345.5 | 2.98 |


In [25]:
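# Note: the string "double" again does not refer to the custom function;
# the "double" row below holds the original values, not doubled ones.
df[["population", "fertility"]].agg(
    ["median","mean","double"]
)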

| | population | fertility |
|---|---|---|
| median | 185562 | 2.125 |
| mean | 375346 | 2.4375 |
| double | [160995.0, 207847.0, 1376048.0, 1311050.0, 257... | [2.12, 1.78, 1.57, 2.43, 2.28, 1.45, 2.13, 5.8... |
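
String names passed to `.agg()` are resolved against functions pandas already knows, not against your own definitions, so a user-defined function is normally passed as the function object itself. Here is a minimal sketch, assuming the `double` function defined above and a hand-built Series of three fertility values from this dataset. Note that `double` returns one value per row, so it behaves as a transformation rather than an aggregation.

```python
import pandas as pd

def double(x):
    return 2 * x

fertility = pd.Series([2.12, 1.78, 1.57], name="fertility")

# Pass the function object (no quotes), not the string "double"
doubled = fertility.agg(double)

# A true aggregation returns a single value instead
mean_value = fertility.agg("mean")

print(doubled.tolist())
print(mean_value)
```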


## 2. .groupby()

What does .groupby() actually do?
1. it **splits** the data into groups, based on the values of one or more columns
2. it **applies** an operation to each group
3. it **combines** the results into a new pandas object (a Series or DataFrame)
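
The three steps can be sketched by hand to see what `.groupby()` automates. This is only an illustration on a hand-built frame with rows from this dataset; the names `pieces`, `means` and `manual` are made up for the example and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "Mexico", "United States"],
    "population": [160995, 207847, 1376048, 127017, 321773],
    "continent": ["Asia", "South America", "Asia", "North America", "North America"],
})

# 1. split: one sub-frame per continent
pieces = {name: df[df["continent"] == name] for name in df["continent"].unique()}

# 2. apply: compute the mean population of each piece
means = {name: piece["population"].mean() for name, piece in pieces.items()}

# 3. combine: collect the results in a new pandas object
manual = pd.Series(means).sort_index()

# The same numbers in one step with .groupby()
grouped = df.groupby("continent")["population"].mean()
print(manual)
```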

In [32]:
g = df.groupby('continent')

In [33]:
# Iterate over the GroupBy object to look inside each group
for index, elements in g: 
    print(index)
    print(elements)
    print('\n')

Africa
   country  population  fertility continent
7  Nigeria      182201       5.89    Africa


Asia
       country  population  fertility continent
0   Bangladesh      160995       2.12      Asia
2        China     1376048       1.57      Asia
3        India     1311050       2.43      Asia
4    Indonesia      257563       2.28      Asia
5        Japan      126573       1.45      Asia
8     Pakistan      188924       3.04      Asia
9  Philippines      100699       2.98      Asia


Europe
   country  population  fertility continent
10  Russia      143456       1.61    Europe


North America
          country  population  fertility      continent
6          Mexico      127017       2.13  North America
11  United States      321773       1.97  North America


South America
  country  population  fertility      continent
1  Brazil      207847       1.78  South America


In [34]:
g.get_group('Asia')

| | country | population | fertility | continent |
|---|---|---|---|---|
| 0 | Bangladesh | 160995 | 2.12 | Asia |
| 2 | China | 1376048 | 1.57 | Asia |
| 3 | India | 1311050 | 2.43 | Asia |
| 4 | Indonesia | 257563 | 2.28 | Asia |
| 5 | Japan | 126573 | 1.45 | Asia |
| 8 | Pakistan | 188924 | 3.04 | Asia |
| 9 | Philippines | 100699 | 2.98 | Asia |


In [35]:
g2 = df.groupby('continent')['population'].mean()
g2

continent
Africa           182201.000000
Asia             503121.714286
Europe           143456.000000
North America    224395.000000
South America    207847.000000
Name: population, dtype: float64

In [37]:
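# Select several columns with a list (double brackets); indexing with a bare tuple is no longer supported in recent pandas
g2 = df.groupby('continent')[['population','fertility']].mean()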
g2

| continent | population | fertility |
|---|---|---|
| Africa | 182201.0 | 5.89 |
| Asia | 503121.714286 | 2.267143 |
| Europe | 143456.0 | 1.61 |
| North America | 224395.0 | 2.05 |
| South America | 207847.0 | 1.78 |


In [39]:
# Group by two keys; the column selection again uses a list
g2 = df.groupby(['continent','country'])[['population','fertility']].mean()
g2

| continent | country | population | fertility |
|---|---|---|---|
| Africa | Nigeria | 182201 | 5.89 |
| Asia | Bangladesh | 160995 | 2.12 |
| Asia | China | 1376048 | 1.57 |
| Asia | India | 1311050 | 2.43 |
| Asia | Indonesia | 257563 | 2.28 |
| Asia | Japan | 126573 | 1.45 |
| Asia | Pakistan | 188924 | 3.04 |
| Asia | Philippines | 100699 | 2.98 |
| Europe | Russia | 143456 | 1.61 |
| North America | Mexico | 127017 | 2.13 |


In [41]:
g3 = df.groupby(['continent','country'])
g3

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa6f8ed0950>

In [42]:
# With two grouping keys, each group is labelled by a (continent, country) tuple
for index, elements in g3: 
    print(index)
    print(elements)
    print('\n')

('Africa', 'Nigeria')
   country  population  fertility continent
7  Nigeria      182201       5.89    Africa


('Asia', 'Bangladesh')
      country  population  fertility continent
0  Bangladesh      160995       2.12      Asia


('Asia', 'China')
  country  population  fertility continent
2   China     1376048       1.57      Asia


('Asia', 'India')
  country  population  fertility continent
3   India     1311050       2.43      Asia


('Asia', 'Indonesia')
     country  population  fertility continent
4  Indonesia      257563       2.28      Asia


('Asia', 'Japan')
  country  population  fertility continent
5   Japan      126573       1.45      Asia


('Asia', 'Pakistan')
    country  population  fertility continent
8  Pakistan      188924       3.04      Asia


('Asia', 'Philippines')
       country  population  fertility continent
9  Philippines      100699       2.98      Asia


('Europe', 'Russia')
   country  population  fertility continent
10  Russia      143456       1.61    Europe

After grouping, we can "mix" aggregations and transformations by combining .groupby() with .agg() and a custom function like the one from above.
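
One way to combine `.groupby()` with `.agg()` is named aggregation, where each output column is given as a `(column, function)` pair. The sketch below runs on a hand-built frame with rows from this dataset. `value_range` is a hypothetical custom function standing in for the one defined above, since `double` does not reduce a group to a single value.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Bangladesh", "China", "Mexico", "United States"],
    "population": [160995, 1376048, 127017, 321773],
    "fertility": [2.12, 1.57, 2.13, 1.97],
    "continent": ["Asia", "Asia", "North America", "North America"],
})

def value_range(series):
    # Hypothetical custom aggregation: largest minus smallest value in the group
    return series.max() - series.min()

# Built-in aggregations are given by name, the custom one as a function object
summary = df.groupby("continent").agg(
    total_population=("population", "sum"),
    mean_fertility=("fertility", "mean"),
    population_range=("population", value_range),
)
print(summary)
```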

Applying transformations to selected columns:
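
A possible example, sketched on a hand-built frame with rows from this dataset: `.transform()` on a grouped column returns one value per original row, which makes it easy to relate each row to its group. The column name `population_share` is an assumption made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Bangladesh", "China", "Mexico", "United States"],
    "population": [160995, 1376048, 127017, 321773],
    "continent": ["Asia", "Asia", "North America", "North America"],
})

# transform() keeps the original row count: every row gets its group's total
continent_total = df.groupby("continent")["population"].transform("sum")

# Share of each country in its continent's population (within this small sample)
df["population_share"] = df["population"] / continent_total
print(df[["country", "population_share"]])
```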

Plotting examples:

## Comments and questions during the encounter