# 1. DataFrame GroupBy
## Keypoint!
- You can build groups from categorical data and use them to organize a dataset.
- Aggregation functions such as `sum()` and `mean()` then extract the values you want from each group.

---

### **0. Preparing the Data**

`pd.read_csv()` reads a comma-separated values (CSV) file into a DataFrame.

In [2]:
# Assumes avocado.csv is in the same directory as this notebook:

import pandas as pd

avocado = pd.read_csv("./avocado.csv")

avocado.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
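
The `Unnamed: 0` column in the output above is the name pandas gives to a column whose header is empty. Assuming that column is a row index that was saved along with the data, the sketch below shows how `index_col=0` would read it as the index instead. The three-row `csv_text` sample is hypothetical and only stands in for `avocado.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for avocado.csv;
# the leading header-less column plays the role of a saved row index.
csv_text = """,Date,AveragePrice,type
0,2015-12-27,1.33,conventional
1,2015-12-20,1.35,conventional
2,2015-12-13,0.93,conventional
"""

# Default: the header-less first column is kept as a data column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Whether to do this on the real file depends on whether that column carries any meaning for the analysis.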


In [3]:
# Look at the first 5 rows

avocado.head(5)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [4]:
# Look at the last 5 rows

avocado.tail(5)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
18244,7,2018-02-04,1.63,17074.83,2046.96,1529.2,0.0,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,8,2018-01-28,1.71,13888.04,1191.7,3431.5,0.0,9264.84,8940.04,324.8,0.0,organic,2018,WestTexNewMexico
18246,9,2018-01-21,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.8,42.31,0.0,organic,2018,WestTexNewMexico
18247,10,2018-01-14,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.0,0.0,organic,2018,WestTexNewMexico
18248,11,2018-01-07,1.62,17489.58,2894.77,2356.13,224.53,12014.15,11988.14,26.01,0.0,organic,2018,WestTexNewMexico


### **1. Applying DataFrame GroupBy**

**What is GroupBy?**

- A way to summarize data by a specific category.
- After grouping, aggregation functions such as `sum()` summarize the data within each group.
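
To make the idea concrete before touching the real data, here is a minimal sketch on a hypothetical four-row DataFrame named `toy`. Its column names mirror the avocado dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the avocado dataset.
toy = pd.DataFrame({
    "type": ["conventional", "organic", "conventional", "organic"],
    "Total Volume": [100.0, 10.0, 200.0, 30.0],
})

# Split the rows by 'type', apply sum() to each group,
# and combine the per-group results into one Series.
volume_by_type = toy.groupby("type")["Total Volume"].sum()
print(volume_by_type)
```

The result is indexed by the group key, so `volume_by_type["organic"]` looks up one group's total.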

In [8]:
# Apply GroupBy on the type column.

avo_type = avocado.groupby('type')
avo_type

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12d44f4c0>

In [12]:
# Get the rows of one group: pick one category (organic) of the grouping variable (type)
avo_type.get_group('organic')

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
9126,0,2015-12-27,1.83,989.55,8.16,88.59,0.00,892.80,892.80,0.00,0.0,organic,2015,Albany
9127,1,2015-12-20,1.89,1163.03,30.24,172.14,0.00,960.65,960.65,0.00,0.0,organic,2015,Albany
9128,2,2015-12-13,1.85,995.96,10.44,178.70,0.00,806.82,806.82,0.00,0.0,organic,2015,Albany
9129,3,2015-12-06,1.84,1158.42,90.29,104.18,0.00,963.95,948.52,15.43,0.0,organic,2015,Albany
9130,4,2015-11-29,1.94,831.69,0.00,94.73,0.00,736.96,736.96,0.00,0.0,organic,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,7,2018-02-04,1.63,17074.83,2046.96,1529.20,0.00,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,8,2018-01-28,1.71,13888.04,1191.70,3431.50,0.00,9264.84,8940.04,324.80,0.0,organic,2018,WestTexNewMexico
18246,9,2018-01-21,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.80,42.31,0.0,organic,2018,WestTexNewMexico
18247,10,2018-01-14,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.00,0.0,organic,2018,WestTexNewMexico


`groupby()` gave us a DataFrameGroupBy object.

We can now **apply functions** to it to get the data we want.
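
Because printing a DataFrameGroupBy object only shows its memory address, it can help to inspect it first. The sketch below uses a hypothetical `toy` DataFrame to show the number of groups, the group keys, and that `get_group()` picks the same rows as an ordinary boolean filter.

```python
import pandas as pd

# Hypothetical toy data; the prices are taken from rows shown above.
toy = pd.DataFrame({
    "type": ["conventional", "organic", "conventional", "organic"],
    "AveragePrice": [1.33, 1.83, 1.35, 1.89],
})
grouped = toy.groupby("type")

# How many groups there are, and what their keys are.
print(grouped.ngroups)
print(list(grouped.groups))

# get_group('organic') selects the same rows as a boolean filter.
via_get_group = grouped.get_group("organic")
via_filter = toy[toy["type"] == "organic"]
print(via_get_group["AveragePrice"].equals(via_filter["AveragePrice"]))
```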

In [13]:
# Count how many rows each type contains.
avo_type['AveragePrice'].count()

type
conventional    9126
organic         9123
Name: AveragePrice, dtype: int64
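
One caveat worth knowing: `count()` counts non-null values only, while `size()` counts every row in the group. The two agree when a column has no missing values and differ otherwise. The sketch below shows the difference on a hypothetical `toy` DataFrame with one missing price.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing AveragePrice.
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "AveragePrice": [1.33, np.nan, 1.83, 1.89],
})
grouped = toy.groupby("type")

non_null = grouped["AveragePrice"].count()  # skips the NaN
rows = grouped.size()                       # counts every row

print(non_null)
print(rows)
```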

In [14]:
# Compute the mean of AveragePrice for each type.
avo_type['AveragePrice'].mean()

type
conventional    1.158040
organic         1.653999
Name: AveragePrice, dtype: float64

So **GroupBy** makes it easy to see how other columns relate to the categories of a particular column.

#### cf) Commonly used aggregation functions
- `count()` : **number** of non-null values
- `sum()` : **sum**
- `mean()` : **mean**
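
These functions can also be computed in one pass with `agg()`, which accepts a list of function names. A minimal sketch on a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for easy arithmetic.
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "AveragePrice": [1.0, 2.0, 1.5, 2.5],
})

# One column per aggregation function, one row per group.
summary = toy.groupby("type")["AveragePrice"].agg(["count", "sum", "mean"])
print(summary)
```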

### **2. Applying DataFrame GroupBy Further**

- You can also group by two or more keys.

In [24]:
# Apply GroupBy on year and type.
avo_type_year = avocado.groupby(['year','type']) # grouping by multiple keys yields a MultiIndex
avo_type_year

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12d48a9e0>

In [22]:
# Q. How many rows are in each group?

avo_type_year['AveragePrice'].count()

year  type        
2015  conventional    2808
      organic         2807
2016  conventional    2808
      organic         2808
2017  conventional    2862
      organic         2860
2018  conventional     648
      organic          648
Name: AveragePrice, dtype: int64
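
The result above is a Series with a two-level MultiIndex (`year`, `type`). As a rough sketch of how to look values up in such a result, the example below rebuilds a smaller one from a hypothetical `toy` DataFrame and selects from it with `.loc`.

```python
import pandas as pd

# Hypothetical toy data with the same two grouping columns.
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016],
    "type": ["conventional", "organic", "organic", "conventional", "organic"],
    "AveragePrice": [1.0, 1.5, 1.7, 1.1, 1.6],
})
counts = toy.groupby(["year", "type"])["AveragePrice"].count()

# A tuple of keys picks a single (year, type) group.
print(counts.loc[(2015, "organic")])

# A single outer key returns every type for that year.
print(counts.loc[2015])
```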

In [27]:
# Q. What is the sum of Total Volume for each group?
# cf) e+09 means (...) * 10^9

avo_type_year['Total Volume'].sum()

year  type        
2015  conventional    4.296599e+09
      organic         8.886943e+07
2016  conventional    4.690250e+09
      organic         1.306401e+08
2017  conventional    4.766166e+09
      organic         1.681399e+08
2018  conventional    1.334206e+09
      organic         4.853227e+07
Name: Total Volume, dtype: float64
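
If the scientific notation is hard to read, the values can be formatted as plain numbers. The sketch below rebuilds the two 2015 values exactly as displayed above (they are already rounded for display, so the formatted numbers are rounded too) and applies a thousands-separator format.

```python
import pandas as pd

# The two 2015 totals as displayed above (rounded by the display).
totals = pd.Series(
    [4.296599e+09, 8.886943e+07],
    index=["conventional", "organic"],
    name="Total Volume",
)

# Format each value with thousands separators and no decimals.
readable = totals.map("{:,.0f}".format)
print(readable)
```

Note that `readable` holds strings, so it is meant for display only, not for further arithmetic.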

In [28]:
# Q. What is the mean of AveragePrice for each group?

avo_type_year['AveragePrice'].mean()


year  type        
2015  conventional    1.077963
      organic         1.673324
2016  conventional    1.105595
      organic         1.571684
2017  conventional    1.294888
      organic         1.735521
2018  conventional    1.127886
      organic         1.567176
Name: AveragePrice, dtype: float64
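
A two-level result like this is often easier to compare as a table. One option is `unstack()`, which moves an index level into the columns. The sketch below rebuilds the 2015 and 2016 rows from the output above by hand, since the real `avocado` DataFrame is not available here.

```python
import pandas as pd

# Mean prices copied from the output above (2015 and 2016 only).
mean_price = pd.Series(
    [1.077963, 1.673324, 1.105595, 1.571684],
    index=pd.MultiIndex.from_tuples(
        [
            (2015, "conventional"),
            (2015, "organic"),
            (2016, "conventional"),
            (2016, "organic"),
        ],
        names=["year", "type"],
    ),
    name="AveragePrice",
)

# Move the 'type' level into columns: one row per year, one column per type.
table = mean_price.unstack("type")
print(table)
```

On the real data, the same call would be applied to the Series returned by `avo_type_year['AveragePrice'].mean()`.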