# Aggregation 📚

### After this encounter we will have covered:
- how to apply different aggregation methods to your datasets
- what .groupby() does and some options for applying it to your datasets

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

## 1. Applying aggregation methods: 

In [2]:
df = pd.read_csv("large_countries_2015.csv", sep = ",")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     12 non-null     object 
 1   population  12 non-null     float64
 2   fertility   12 non-null     float64
 3   continent   12 non-null     object 
dtypes: float64(2), object(2)
memory usage: 512.0+ bytes


In [3]:
df.head(2)

,country,population,fertility,continent
0,Bangladesh,160995642.0,2.12,Asia
1,Brazil,207847528.0,1.78,South America


In [4]:
# Express the population in thousands and store it as an integer
df["population"] = (df["population"] / 1000).astype(int)

In [5]:
df.head(2)

,country,population,fertility,continent
0,Bangladesh,160995,2.12,Asia
1,Brazil,207847,1.78,South America


Let's apply some aggregation methods!
Intuition of "aggregation": take some rows, apply an operation to them, and return a summarized version of those rows (typically a single value per column).

In [6]:
df.sum()

country       BangladeshBrazilChinaIndiaIndonesiaJapanMexico...
population                                              4504146
fertility                                                 29.25
continent     AsiaSouth AmericaAsiaAsiaAsiaAsiaNorth America...
dtype: object

If we apply .sum() to the complete DataFrame, the string columns are concatenated instead of added, so it usually makes more sense to aggregate only the numeric columns.
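To avoid the concatenated strings, the aggregation can be restricted to numeric columns. The following is a minimal sketch, assuming a small hand-built DataFrame with four rows copied from the tables above (the file itself is not read here); it shows two common ways of doing this.

```python
import pandas as pd

# Four rows copied from the tables above (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "India"],
    "population": [160995, 207847, 1376048, 1311050],
    "fertility": [2.12, 1.78, 1.57, 2.43],
    "continent": ["Asia", "South America", "Asia", "Asia"],
})

# Option 1: ask pandas to skip the non-numeric columns.
numeric_totals = df.sum(numeric_only=True)

# Option 2: select the numeric columns by dtype before aggregating.
selected_totals = df.select_dtypes(include="number").sum()

print(numeric_totals)
print(selected_totals)
```

Both variants return a Series that contains only `population` and `fertility`.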

In [7]:
df["population"].sum()

4504146

In [8]:
df[["population", "fertility"]].sum()

population    4504146.00
fertility          29.25
dtype: float64

In [9]:
df["country"].count()

12

In [10]:
df["country"].value_counts()

United States    1
Mexico           1
Bangladesh       1
Philippines      1
China            1
India            1
Japan            1
Nigeria          1
Brazil           1
Indonesia        1
Russia           1
Pakistan         1
Name: country, dtype: int64

In [12]:
df["fertility"].nunique()

12

In [13]:
df["fertility"].unique()

array([2.12, 1.78, 1.57, 2.43, 2.28, 1.45, 2.13, 5.89, 3.04, 2.98, 1.61,
       1.97])

In [18]:
df.describe()

,population,fertility
count,12.0,12.0
mean,375345.5,2.4375
std,456519.3,1.200781
min,100699.0,1.45
25%,139346.2,1.7375
50%,185562.5,2.125
75%,273615.5,2.5675
max,1376048.0,5.89


.agg() lets us choose aggregations more flexibly, for example a different method per column or several methods at once:

In [15]:
df.agg(
    {"population":"mean",
    "fertility":"median"
    }
)

population    375345.500
fertility          2.125
dtype: float64

In [16]:
df.agg(
    ["median","mean","std"]
)

,population,fertility
median,185562.5,2.125
mean,375345.5,2.4375
std,456519.344768,1.200781
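The two styles shown above (a dictionary per column, a list of methods) can also be combined by mapping each column to its own list of methods. This is a small sketch on a hand-built four-row DataFrame taken from the tables above, not on the full file.

```python
import pandas as pd

# Four rows copied from the tables above (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "India"],
    "population": [160995, 207847, 1376048, 1311050],
    "fertility": [2.12, 1.78, 1.57, 2.43],
})

# Each column gets its own list of aggregation methods.
summary = df.agg({
    "population": ["mean", "max"],
    "fertility": ["median", "max"],
})

print(summary)
```

Combinations that were not requested, such as the mean of `fertility` here, appear as NaN in the resulting DataFrame.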


In [19]:
def double(x):
    return 2*x

In [25]:
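# Caution: pandas resolves a string by name on its own, so "double" does not
# call the double() defined above; note that the output below is not doubled.
d = df.agg(
    {"population":"double",
    "fertility":"double"
    }
)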

In [26]:
d

population    [160995.0, 207847.0, 1376048.0, 1311050.0, 257...
fertility     [2.12, 1.78, 1.57, 2.43, 2.28, 1.45, 2.13, 5.8...
dtype: object
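The output above contains the original values rather than doubled ones, which suggests that the string "double" never reached the function defined earlier. One plausible explanation, not confirmed by the source, is that the name was resolved to NumPy's `double` type, which only casts to float. Passing the function object itself avoids the ambiguity. The sketch below assumes a hand-built four-row DataFrame from the tables above; `value_range` is a hypothetical helper added for illustration. Because `double` returns one value per row, it is a transformation rather than an aggregation, so `.transform()` is used for it.

```python
import pandas as pd

# Four rows copied from the tables above (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "India"],
    "population": [160995, 207847, 1376048, 1311050],
    "fertility": [2.12, 1.78, 1.57, 2.43],
})


def double(x):
    return 2 * x


def value_range(column):
    # Hypothetical helper: reduces a column to a single number.
    return column.max() - column.min()


numeric = df[["population", "fertility"]]

# Pass the function object (no quotes): one output value per input row.
doubled = numeric.transform(double)

# A custom function that returns one value per column is a true aggregation.
ranges = numeric.agg(value_range)

print(doubled)
print(ranges)
```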

In [21]:
df[["population", "fertility"]]

,population,fertility
0,160995,2.12
1,207847,1.78
2,1376048,1.57
3,1311050,2.43
4,257563,2.28
5,126573,1.45
6,127017,2.13
7,182201,5.89
8,188924,3.04
9,100699,2.98


In [22]:
df[["population", "fertility"]].agg(
    ["median","mean","double"]
)

,population,fertility
median,185562.5,2.125
mean,375345.5,2.4375
double,"[160995.0, 207847.0, 1376048.0, 1311050.0, 257...","[2.12, 1.78, 1.57, 2.43, 2.28, 1.45, 2.13, 5.8..."


## 2. .groupby()

What DOES .groupby() actually do?
1. it **splits** the data
2. it **applies** some kind of operation to EACH GROUP of the data
3. it **combines** the data back into a new (pandas) object (i.e. series or dataframe)
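The three steps above can be sketched by hand and compared with the .groupby() shortcut. This is only an illustration of the idea, not how pandas implements it internally; the five rows are copied from the tables in this notebook.

```python
import pandas as pd

# Five rows copied from the tables in this notebook (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "Mexico", "United States"],
    "population": [160995, 207847, 1376048, 127017, 321773],
    "continent": ["Asia", "South America", "Asia", "North America", "North America"],
})

# 1. split: one sub-DataFrame per continent
pieces = {
    continent: df[df["continent"] == continent]
    for continent in df["continent"].unique()
}

# 2. apply: compute the mean population of each piece
means = {continent: part["population"].mean() for continent, part in pieces.items()}

# 3. combine: collect the results in a new pandas object
by_hand = pd.Series(means, name="population").sort_index()

# The same result in one line with .groupby()
shortcut = df.groupby("continent")["population"].mean()

print(by_hand)
print(shortcut)
```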

In [28]:
g = df.groupby("continent")
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9d66307ca0>

In [29]:
for index, elements in g:
    print(index)
    print(elements)
    print("\n")

Africa
   country  population  fertility continent
7  Nigeria      182201       5.89    Africa


Asia
       country  population  fertility continent
0   Bangladesh      160995       2.12      Asia
2        China     1376048       1.57      Asia
3        India     1311050       2.43      Asia
4    Indonesia      257563       2.28      Asia
5        Japan      126573       1.45      Asia
8     Pakistan      188924       3.04      Asia
9  Philippines      100699       2.98      Asia


Europe
   country  population  fertility continent
10  Russia      143456       1.61    Europe


North America
          country  population  fertility      continent
6          Mexico      127017       2.13  North America
11  United States      321773       1.97  North America


South America
  country  population  fertility      continent
1  Brazil      207847       1.78  South America




In [30]:
g.get_group("Asia")

,country,population,fertility,continent
0,Bangladesh,160995,2.12,Asia
2,China,1376048,1.57,Asia
3,India,1311050,2.43,Asia
4,Indonesia,257563,2.28,Asia
5,Japan,126573,1.45,Asia
8,Pakistan,188924,3.04,Asia
9,Philippines,100699,2.98,Asia


In [34]:
g2 = df.groupby("continent")[["population", "fertility"]].mean()
g2

continent,population,fertility
Africa,182201.0,5.89
Asia,503121.714286,2.267143
Europe,143456.0,1.61
North America,224395.0,2.05
South America,207847.0,1.78


In [35]:
g3 = df.groupby(["continent", "country"])[["population", "fertility"]].mean()
g3

continent,country,population,fertility
Africa,Nigeria,182201,5.89
Asia,Bangladesh,160995,2.12
Asia,China,1376048,1.57
Asia,India,1311050,2.43
Asia,Indonesia,257563,2.28
Asia,Japan,126573,1.45
Asia,Pakistan,188924,3.04
Asia,Philippines,100699,2.98
Europe,Russia,143456,1.61
North America,Mexico,127017,2.13


In [38]:
g4 = df.groupby(["continent", "country"])
g4


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9d6631d460>

In [39]:
g4.get_group(["Asia","Bangladesh"])

ValueError: must supply a tuple to get_group with multiple grouping keys
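As the error message says, a group that was built from several keys has to be requested with a tuple rather than a list. A minimal sketch, assuming a hand-built three-row DataFrame from the tables above:

```python
import pandas as pd

# Three rows copied from the tables above (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "China", "Nigeria"],
    "population": [160995, 1376048, 182201],
    "fertility": [2.12, 1.57, 5.89],
    "continent": ["Asia", "Asia", "Africa"],
})

g4 = df.groupby(["continent", "country"])

# One tuple entry per grouping key, in the same order as in .groupby()
bangladesh = g4.get_group(("Asia", "Bangladesh"))

print(bangladesh)
```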

In [37]:
for index, elements in g4:
    print(index)
    print(elements)
    print("\n")

('Africa', 'Nigeria')
   country  population  fertility continent
7  Nigeria      182201       5.89    Africa


('Asia', 'Bangladesh')
      country  population  fertility continent
0  Bangladesh      160995       2.12      Asia


('Asia', 'China')
  country  population  fertility continent
2   China     1376048       1.57      Asia


('Asia', 'India')
  country  population  fertility continent
3   India     1311050       2.43      Asia


('Asia', 'Indonesia')
     country  population  fertility continent
4  Indonesia      257563       2.28      Asia


('Asia', 'Japan')
  country  population  fertility continent
5   Japan      126573       1.45      Asia


('Asia', 'Pakistan')
    country  population  fertility continent
8  Pakistan      188924       3.04      Asia


('Asia', 'Philippines')
       country  population  fertility continent
9  Philippines      100699       2.98      Asia


('Europe', 'Russia')
   country  population  fertility continent
10  Russia      143456       1.61    Europe

After grouping, we can mix aggregations and transformations by combining .groupby() with .agg() and custom functions such as the one defined above.
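A possible way to combine .groupby() with .agg() is named aggregation, where each output column is defined by a (column, function) pair. The sketch below assumes a hand-built five-row DataFrame from the tables above. Since .agg() expects one value per group, it uses a hypothetical reducing helper, `value_range`, instead of `double`, which returns one value per row.

```python
import pandas as pd

# Five rows copied from the tables above (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "China", "India", "Mexico", "United States"],
    "population": [160995, 1376048, 1311050, 127017, 321773],
    "fertility": [2.12, 1.57, 2.43, 2.13, 1.97],
    "continent": ["Asia", "Asia", "Asia", "North America", "North America"],
})


def value_range(column):
    # Hypothetical helper: difference between largest and smallest value.
    return column.max() - column.min()


# Each keyword becomes an output column: name=(input column, function)
summary = df.groupby("continent").agg(
    total_population=("population", "sum"),
    median_fertility=("fertility", "median"),
    fertility_range=("fertility", value_range),
)

print(summary)
```

Built-in methods are passed as strings, while the custom function is passed as an object without quotes.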

Applying transformations to selected columns:
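The original cell for this step is not included here, so the following is only one possible illustration. Unlike an aggregation, a group-wise transformation returns one value per original row, which makes it easy to attach group totals to each country. The new column names `continent_population` and `population_share` are assumptions made for this sketch.

```python
import pandas as pd

# Five rows copied from the tables above (population in thousands).
df = pd.DataFrame({
    "country": ["Bangladesh", "China", "India", "Mexico", "United States"],
    "population": [160995, 1376048, 1311050, 127017, 321773],
    "continent": ["Asia", "Asia", "Asia", "North America", "North America"],
})

# Select one column after grouping, then broadcast the group sum to every row.
df["continent_population"] = df.groupby("continent")["population"].transform("sum")

# Share of each country within its continent (for the rows in this sketch).
df["population_share"] = df["population"] / df["continent_population"]

print(df)
```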

Plotting examples:

## Comments and questions during the encounter