# Distributions and Descriptive Statistics
Ways to examine the distribution of data:
* Descriptive statistics
* Histogram
* Kernel density

# Descriptive Statistics
* Count (number of data points)
* Mean (average)
* Variance
* Standard deviation
* Maximum
* Minimum
* Median
* Quartiles

In [1]:
import numpy as np
import pandas as pd

In [2]:
x = np.array([ 18,   5,  10,  23,  19,  -8,  10,   0,   0,   5,   2,  15,   8,
                2,   5,   4,  15,  -1,   4,  -7, -24,   7,   9,  -6,  23, -13,
                1,   0,  16,  15,   2,   4,  -7, -18,  -2,   2,  13,  13,  -2,
               -2,  -9, -13, -16,  20,  -4,  -3, -11,   8, -15,  -1,  -7,   4,
               -4, -10,   0,   5,   1,   4,  -5,  -2,  -5,  -2,  -7, -16,   2,
               -3, -15,   5,  -8,   1,   8,   2,  12, -11,   5,  -5,  -7,  -4])

In [3]:
len(x)  # number of elements

78

In [4]:
np.mean(x)  # mean

0.69230769230769229

In [5]:
np.var(x)  # variance (population variance: ddof=0 by default)

96.059171597633139

In [6]:
np.std(x)  # standard deviation (population: ddof=0 by default)

9.8009780939268065

In [7]:
np.max(x)  # maximum

23

In [8]:
np.min(x)  # minimum

-24

In [9]:
np.median(x)  # median

0.5

In [11]:
np.percentile(x, 25)  # first quartile (Q1)

-5.75

In [12]:
np.percentile(x, 75)  # third quartile (Q3)

5.0

In [13]:
# all descriptive statistics in one call
s = pd.Series(x)
s.describe()

count    78.000000
mean      0.692308
std       9.864416
min     -24.000000
25%      -5.750000
50%       0.500000
75%       5.000000
max      23.000000
dtype: float64
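The `std` reported by `describe()` (9.864416) differs from `np.std(x)` (9.800978) because NumPy divides by n by default (`ddof=0`, population standard deviation), while pandas divides by n - 1 (`ddof=1`, sample standard deviation). The sketch below illustrates the difference on a small hypothetical sample; the array `values` is an assumption made for this example and is not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical small sample, used only for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
series = pd.Series(values)

population_std = np.std(values)       # ddof=0: divide by n
sample_std = np.std(values, ddof=1)   # ddof=1: divide by n - 1

print(population_std)                 # 2.0
print(sample_std, series.std())       # both use n - 1, so they agree
print(series.std(ddof=0))             # pandas can also compute the population value
```

Passing `ddof` explicitly to either library makes the two results comparable.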

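# Pivot Tables and Group Analysis
## Pivot table
A pivot table selects data by using two of the columns as keys.<br>
Key combinations that have no data are filled with NaN.

* Row index
* Column index

**pivot**(index="column to use as the row index", columns="column to use as the column index", values="column to use as the data")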

In [14]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 2.5, 3.0, 2.5, 3.5]
}
df = pd.DataFrame(data, columns=["state", "year", "pop"])
df

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  2.5
2    Ohio  2002  3.0
3  Nevada  2001  2.5
4  Nevada  2002  3.5


In [15]:
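df.pivot(index="state", columns="year", values="pop")  # keyword arguments; recent pandas versions no longer accept these positionally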

year    2000  2001  2002
state
Nevada   NaN   2.5   3.5
Ohio     1.5   2.5   3.0


The same result can be obtained with set_index followed by unstack (each key combination must be unique).

In [18]:
df.set_index(["state", "year"]).unstack()

        pop
year   2000 2001 2002
state
Nevada  NaN  2.5  3.5
Ohio    1.5  2.5  3.0
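The uniqueness requirement can be illustrated as follows. This is a minimal sketch on hypothetical data (the frame `dup` is an assumption for this example): `pivot` raises a `ValueError` when a key combination occurs more than once, whereas `pivot_table` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data: the key ("Ohio", 2000) appears twice
dup = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2000, 2001],
    "pop": [1.5, 1.7, 2.5],
})

try:
    dup.pivot(index="state", columns="year", values="pop")
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the duplicate values to keep
    pivot_failed = True

# pivot_table aggregates the duplicates instead (here with the mean)
table = dup.pivot_table(values="pop", index="state", columns="year", aggfunc="mean")

print(pivot_failed)
print(table)
```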


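## Group analysis
- Call the groupby method on the Series or DataFrame to analyze
- Apply a group operation to the result

### groupby method
The grouping key can be:
- A column or a list of columns
- The row index

### Group operation methods
https://pandas.pydata.org/pandas-docs/stable/api.html#groupby

- **size(), count()** : number of items (size counts every row; count skips NaN)
- **mean(), median(), min(), max()** : mean, median, minimum, maximum
- **sum(), prod(), std(), var(), quantile()** : sum, product, standard deviation, variance, quantile
- **first(), last()** : the first and the last item in each group

- **agg(), aggregate()** <br>
If no built-in group operation does what you need, write a function and pass it to agg().<br>
To run several group operations at once, pass a list of function-name strings.

- **transform()**
Computes a representative value per group and returns it aligned with the original rows, so it can be added to the original DataFrame as a new column.

- **describe()**
Returns several summary values per group as a DataFrame instead of a single representative value.

- **apply()**
Like describe(), returns a DataFrame rather than a single representative value; use it when no built-in group operation fits.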
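The differences between `agg`, `transform`, and `apply` are easier to see side by side. The following is a minimal sketch on hypothetical data; the frame `df_demo` and its column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})
grouped = df_demo.groupby("key")["value"]

# agg with a list of function names: one output column per statistic
summary = grouped.agg(["mean", "max"])

# transform: the group mean, repeated for every row of its group
df_demo["group_mean"] = grouped.transform("mean")

# apply: a custom function that returns several values per group
top2 = grouped.apply(lambda s: s.nlargest(2))

print(summary)
print(df_demo)
print(top2)
```

`agg` reduces each group to one row, `transform` keeps the original shape, and `apply` lets the function decide the shape of each group's result.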

In [19]:
np.random.seed(0)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

      data1     data2 key1 key2
0  1.764052 -0.977278    a  one
1  0.400157  0.950088    a  two
2  0.978738 -0.151357    b  one
3  2.240893 -0.103219    b  two
4  1.867558  0.410599    a  one


In [20]:
df.groupby(df.key1).mean(numeric_only=True)  # key2 is not numeric, so restrict the mean to numeric columns

         data1     data2
key1
a     1.343923  0.127803
b     1.609816 -0.127288


In [21]:
df.data1.groupby(df.key1).mean()

key1
a    1.343923
b    1.609816
Name: data1, dtype: float64

In [22]:
df.groupby(df.key1)["data1"].mean()  # select data1 first, then compute the mean

key1
a    1.343923
b    1.609816
Name: data1, dtype: float64

In [25]:
df.groupby(df.key1).mean(numeric_only=True)["data1"]  # compute the mean of every numeric column, then select data1

key1
a    1.343923
b    1.609816
Name: data1, dtype: float64

In [26]:
df.data1.groupby([df.key1, df.key2]).mean()  # mean for a composite key

key1  key2
a     one     1.815805
      two     0.400157
b     one     0.978738
      two     2.240893
Name: data1, dtype: float64

In [27]:
# unstack reshapes the result into pivot-table form
df.data1.groupby([df.key1, df.key2]).mean().unstack("key2")

key2       one       two
key1
a     1.815805  0.400157
b     0.978738  2.240893


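## pivot_table
Performs a group analysis like groupby, but returns a pivot table like pivot.<br>
It is equivalent to applying unstack to the result of groupby, which reshapes the result into a two-dimensional table.

* **pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, margins_name='All')**
    - data: the DataFrame to analyze (not needed when called as a method)
    - values: the column or columns of the DataFrame to aggregate
    - index: key column, or list of key columns, to use as the row index
    - columns: key column, or list of key columns, to use as the column index
    - aggfunc: aggregation function
    - fill_value: value that replaces NaN
    - margins: whether to append an extra row and column that aggregate over all rows and columns (computed with aggfunc, so these are means rather than sums by default)
    - margins_name: name of that extra row and column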
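The relationship between `pivot_table` and `groupby` followed by `unstack` can be checked directly. This is a minimal sketch on hypothetical data; `df_demo` and its values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group mean of data1, arranged with key1 as rows and key2 as columns
via_pivot = df_demo.pivot_table(values="data1", index="key1",
                                columns="key2", aggfunc="mean")

# The same computation spelled out with groupby and unstack
via_groupby = df_demo.groupby(["key1", "key2"])["data1"].mean().unstack("key2")

print(via_pivot)
print(np.allclose(via_pivot.values, via_groupby.values))
```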

In [28]:
df

      data1     data2 key1 key2
0  1.764052 -0.977278    a  one
1  0.400157  0.950088    a  two
2  0.978738 -0.151357    b  one
3  2.240893 -0.103219    b  two
4  1.867558  0.410599    a  one


In [29]:
df.pivot_table("data1", "key1", "key2")

key2       one       two
key1
a     1.815805  0.400157
b     0.978738  2.240893


In [30]:
# the margins are computed with aggfunc (mean by default), so the "Total" row and column hold means
df.pivot_table("data1", "key1", "key2", margins=True, margins_name="Total")

key2        one       two     Total
key1
a      1.815805  0.400157  1.343923
b      0.978738  2.240893  1.609816
Total  1.536783  1.320525  1.450280


## Example: tips data
A data analysis example based on the tips paid after meals at a restaurant

In [32]:
import seaborn as sns
tips = sns.load_dataset("tips")
tips.tail()

     total_bill   tip     sex smoker   day    time  size
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2


- total_bill: total bill for the meal
- tip: tip
- sex: sex
- smoker: smoker or non-smoker
- day: day of the week
- time: time of day
- size: party size

In [35]:
# add a column holding the tip as a fraction of the total bill
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.tail()

     total_bill   tip     sex smoker   day    time  size   tip_pct
239       29.03  5.92    Male     No   Sat  Dinner     3  0.203927
240       27.18  2.00  Female    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Male    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75    Male     No   Sat  Dinner     2  0.098204
243       18.78  3.00  Female     No  Thur  Dinner     2  0.159744


In [36]:
tips.describe()

       total_bill         tip        size     tip_pct
count  244.000000  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672    0.160803
std      8.902412    1.383638    0.951100    0.061072
min      3.070000    1.000000    1.000000    0.035638
25%     13.347500    2.000000    2.000000    0.129127
50%     17.795000    2.900000    2.000000    0.154770
75%     24.127500    3.562500    3.000000    0.191475
max     50.810000   10.000000    6.000000    0.710345


### Statistics by group

In [38]:
# by sex
tips.groupby("sex").count()

        total_bill  tip  smoker  day  time  size  tip_pct
sex
Male           157  157     157  157   157   157      157
Female          87   87      87   87    87    87       87


In [39]:
tips.groupby("sex").size()

sex
Male      157
Female     87
dtype: int64

In [40]:
tips.groupby(["sex", "smoker"]).size()

sex     smoker
Male    Yes       60
        No        97
Female  Yes       33
        No        54
dtype: int64

In [42]:
tips.pivot_table("tip_pct", "sex", "smoker", aggfunc="count", margins=True)

smoker   Yes     No    All
sex
Male    60.0   97.0  157.0
Female  33.0   54.0   87.0
All     93.0  151.0  244.0
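A frequency table like the one above can also be built with `pd.crosstab`, which counts co-occurrences of two columns directly. The sketch below uses a small hypothetical frame (`people` is an assumption for this example, not the tips data) so that it runs without downloading a dataset.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "smoker": ["Yes", "No", "No", "No", "Yes"],
})

# Count each (sex, smoker) combination and add row and column totals
counts = pd.crosstab(people["sex"], people["smoker"], margins=True)
print(counts)
```

Unlike the `pivot_table` call with `aggfunc="count"`, `crosstab` does not need a values column.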


In [43]:
tips.pivot_table("tip_pct", "sex", "smoker")

smoker       Yes        No
sex
Male    0.152771  0.160669
Female  0.182150  0.156921


In [45]:
tips.groupby(["sex", "smoker"])[["tip", "tip_pct"]].describe()

                tip
              count      mean       std   min  25%   50%     75%   max
sex    smoker
Male   Yes     60.0  3.051167  1.500120  1.00  2.0  3.00  3.8200  10.0
       No      97.0  3.113402  1.489559  1.25  2.0  2.74  3.7100   9.0
Female Yes     33.0  2.931515  1.219916  1.00  2.0  2.88  3.5000   6.5
       No      54.0  2.773519  1.128425  1.00  2.0  2.68  3.4375   5.2

              tip_pct
                count      mean       std       min       25%       50%       75%       max
sex    smoker
Male   Yes       60.0  0.152771  0.090588  0.035638  0.101845  0.141015  0.191697  0.710345
       No        97.0  0.160669  0.041849  0.071804  0.131810  0.157604  0.186220  0.291990
Female Yes       33.0  0.182150  0.071595  0.056433  0.152439  0.173913  0.198216  0.416667
       No        54.0  0.156921  0.036421  0.056797  0.139708  0.149691  0.181630  0.252672


In [46]:
# apply a custom function
def peak_to_peak(x):
    return x.max() - x.min()

tips.groupby(["sex", "smoker"])[["tip"]].agg(peak_to_peak)

                tip
sex    smoker
Male   Yes     9.00
       No      7.75
Female Yes     5.50
       No      4.20


In [49]:
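# apply several functions at once (list form)
# total_bill is selected before agg so that "mean" is not applied to the non-numeric columns
tips.groupby(["sex", "smoker"])[["total_bill"]].agg(["mean", peak_to_peak])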

              total_bill
                    mean peak_to_peak
sex    smoker
Male   Yes     22.284500        43.56
       No      19.791237        40.82
Female Yes     17.977879        41.23
       No      18.105185        28.58


In [50]:
# apply a different function to each column (dict form)
tips.groupby(["sex", "smoker"]).agg({'tip_pct':'mean', 'total_bill':peak_to_peak})

                tip_pct  total_bill
sex    smoker
Male   Yes     0.152771       43.56
       No      0.160669       40.82
Female Yes     0.182150       41.23
       No      0.156921       28.58
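When the result columns should have descriptive names, pandas also supports named aggregation, where each keyword argument of `agg` maps an output name to a `(column, function)` pair. The following is a minimal sketch on hypothetical data; the frame `bills` and the output names `mean_tip` and `bill_range` are assumptions made for this example.

```python
import pandas as pd

def peak_to_peak(x):
    """Range of a group: largest value minus smallest value."""
    return x.max() - x.min()

# Hypothetical data for illustration
bills = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "tip": [1.0, 4.0, 2.0, 3.0],
    "total_bill": [10.0, 30.0, 15.0, 20.0],
})

# output_name=(input_column, function)
result = bills.groupby("sex").agg(
    mean_tip=("tip", "mean"),
    bill_range=("total_bill", peak_to_peak),
)
print(result)
```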


In [51]:
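# the numeric columns are listed explicitly so that the default mean is not applied to non-numeric columns
tips.pivot_table(values=["size", "tip", "tip_pct", "total_bill"], index=["sex", "smoker"])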

                   size       tip   tip_pct  total_bill
sex    smoker
Male   Yes     2.500000  3.051167  0.152771   22.284500
       No      2.711340  3.113402  0.160669   19.791237
Female Yes     2.242424  2.931515  0.182150   17.977879
       No      2.592593  2.773519  0.156921   18.105185


In [52]:
# use smoker as the columns of the table
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker')

                 size             tip_pct
smoker            Yes        No       Yes        No
sex    day
Male   Thur  2.300000  2.500000  0.164417  0.165706
       Fri   2.125000  2.000000  0.144730  0.138005
       Sat   2.629630  2.656250  0.139067  0.162132
       Sun   2.600000  2.883721  0.173964  0.158291
Female Thur  2.428571  2.480000  0.163073  0.155971
       Fri   2.000000  2.500000  0.209129  0.165296
       Sat   2.200000  2.307692  0.163817  0.147993
       Sun   2.500000  3.071429  0.237075  0.165710


In [53]:
# fill missing (NaN) cells with 0 in the output
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum', fill_value=0)

day                   Thur  Fri  Sat  Sun
time   sex    smoker
Lunch  Male   Yes       23    5    0    0
              No        50    0    0    0
       Female Yes       17    6    0    0
              No        60    3    0    0
Dinner Male   Yes        0   12   71   39
              No         0    4   85  124
       Female Yes        0    8   33   10
              No         2    2   30   43


# Pandas Time Series Data

Time series data uses a DatetimeIndex, an index made of timestamps. <br>
The timestamps do not have to be evenly spaced.

## DatetimeIndex

An index of type **DatetimeIndex** is usually created in one of the following ways.

- pd.to_datetime function
- pd.date_range function

### to_datetime

In [5]:
import pandas as pd
import numpy as np

data_str = ["2017, 1, 1", "2017, 1, 4", "2017, 1, 5", "2017, 1, 6"]
idx = pd.to_datetime(data_str)
idx

DatetimeIndex(['2017-01-01', '2017-01-04', '2017-01-05', '2017-01-06'], dtype='datetime64[ns]', freq=None)

In [6]:
np.random.seed(0)
s = pd.Series(np.random.randn(4), index = idx)
s

2017-01-01    1.764052
2017-01-04    0.400157
2017-01-05    0.978738
2017-01-06    2.240893
dtype: float64
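The call above relies on pandas guessing the date format. When the layout of the strings is known, passing it explicitly through the `format` parameter removes the guesswork. The sketch below assumes the strings follow a "year, month, day" layout, as in the example above.

```python
import pandas as pd

date_strings = ["2017, 1, 1", "2017, 1, 4", "2017, 1, 5", "2017, 1, 6"]

# %Y = four-digit year, %m = month, %d = day; the commas and spaces are literal
idx = pd.to_datetime(date_strings, format="%Y, %m, %d")
print(idx)
```

An explicit format also makes a malformed string fail loudly instead of being parsed in an unexpected way.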

### date_range
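Given a start date and an end date, or a start date and a number of periods, date_range generates an index of dates and times within that range. <br>
The frequency can be set with the freq argument. <br>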
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects

In [7]:
pd.date_range("2017-1-2", "2017-1-31")

DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
               '2017-01-06', '2017-01-07', '2017-01-08', '2017-01-09',
               '2017-01-10', '2017-01-11', '2017-01-12', '2017-01-13',
               '2017-01-14', '2017-01-15', '2017-01-16', '2017-01-17',
               '2017-01-18', '2017-01-19', '2017-01-20', '2017-01-21',
               '2017-01-22', '2017-01-23', '2017-01-24', '2017-01-25',
               '2017-01-26', '2017-01-27', '2017-01-28', '2017-01-29',
               '2017-01-30', '2017-01-31'],
              dtype='datetime64[ns]', freq='D')

In [8]:
pd.date_range(start="2017-1-2", periods=30)

DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
               '2017-01-06', '2017-01-07', '2017-01-08', '2017-01-09',
               '2017-01-10', '2017-01-11', '2017-01-12', '2017-01-13',
               '2017-01-14', '2017-01-15', '2017-01-16', '2017-01-17',
               '2017-01-18', '2017-01-19', '2017-01-20', '2017-01-21',
               '2017-01-22', '2017-01-23', '2017-01-24', '2017-01-25',
               '2017-01-26', '2017-01-27', '2017-01-28', '2017-01-29',
               '2017-01-30', '2017-01-31'],
              dtype='datetime64[ns]', freq='D')

In [11]:
pd.date_range("2016-4-1", "2016-4-30", freq="B")

DatetimeIndex(['2016-04-01', '2016-04-04', '2016-04-05', '2016-04-06',
               '2016-04-07', '2016-04-08', '2016-04-11', '2016-04-12',
               '2016-04-13', '2016-04-14', '2016-04-15', '2016-04-18',
               '2016-04-19', '2016-04-20', '2016-04-21', '2016-04-22',
               '2016-04-25', '2016-04-26', '2016-04-27', '2016-04-28',
               '2016-04-29'],
              dtype='datetime64[ns]', freq='B')

In [12]:
pd.date_range("2016-4-1", "2016-12-31", freq="MS")

DatetimeIndex(['2016-04-01', '2016-05-01', '2016-06-01', '2016-07-01',
               '2016-08-01', '2016-09-01', '2016-10-01', '2016-11-01',
               '2016-12-01'],
              dtype='datetime64[ns]', freq='MS')

In [18]:
pd.date_range("2016-4-1", "2016-12-31", freq="BMS")

DatetimeIndex(['2016-04-01', '2016-05-02', '2016-06-01', '2016-07-01',
               '2016-08-01', '2016-09-01', '2016-10-03', '2016-11-01',
               '2016-12-01'],
              dtype='datetime64[ns]', freq='BMS')

In [19]:
pd.date_range("2016-1-1", "2016-12-31", freq="W-MON")

DatetimeIndex(['2016-01-04', '2016-01-11', '2016-01-18', '2016-01-25',
               '2016-02-01', '2016-02-08', '2016-02-15', '2016-02-22',
               '2016-02-29', '2016-03-07', '2016-03-14', '2016-03-21',
               '2016-03-28', '2016-04-04', '2016-04-11', '2016-04-18',
               '2016-04-25', '2016-05-02', '2016-05-09', '2016-05-16',
               '2016-05-23', '2016-05-30', '2016-06-06', '2016-06-13',
               '2016-06-20', '2016-06-27', '2016-07-04', '2016-07-11',
               '2016-07-18', '2016-07-25', '2016-08-01', '2016-08-08',
               '2016-08-15', '2016-08-22', '2016-08-29', '2016-09-05',
               '2016-09-12', '2016-09-19', '2016-09-26', '2016-10-03',
               '2016-10-10', '2016-10-17', '2016-10-24', '2016-10-31',
               '2016-11-07', '2016-11-14', '2016-11-21', '2016-11-28',
               '2016-12-05', '2016-12-12', '2016-12-19', '2016-12-26'],
              dtype='datetime64[ns]', freq='W-MON')

In [17]:
pd.date_range("2017-1-1", "2017-12-31", freq="Q-DEC")

DatetimeIndex(['2017-03-31', '2017-06-30', '2017-09-30', '2017-12-31'], dtype='datetime64[ns]', freq='Q-DEC')

## Shift operation
Shifting values or dates

In [20]:
ts = pd.Series(np.random.randn(4), index=pd.date_range("2000-1-1", periods=4, freq="M"))
ts

2000-01-31    1.867558
2000-02-29   -0.977278
2000-03-31    0.950088
2000-04-30   -0.151357
Freq: M, dtype: float64

In [21]:
# shift the values; the index stays in place
ts.shift(1)

2000-01-31         NaN
2000-02-29    1.867558
2000-03-31   -0.977278
2000-04-30    0.950088
Freq: M, dtype: float64

In [23]:
ts.shift(-1)

2000-01-31   -0.977278
2000-02-29    0.950088
2000-03-31   -0.151357
2000-04-30         NaN
Freq: M, dtype: float64

In [24]:
# shift the date index; the values stay in place
ts.shift(1, freq="M")

2000-02-29    1.867558
2000-03-31   -0.977278
2000-04-30    0.950088
2000-05-31   -0.151357
Freq: M, dtype: float64

In [25]:
ts.shift(1, freq="W")

2000-02-06    1.867558
2000-03-05   -0.977278
2000-04-02    0.950088
2000-05-07   -0.151357
Freq: WOM-1SUN, dtype: float64
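A common use of shifting values is comparing each observation with the previous one. The sketch below computes a period-over-period relative change on a small hypothetical series; the name `prices` and its values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for illustration
prices = pd.Series([100.0, 110.0, 99.0],
                   index=pd.date_range("2000-01-01", periods=3, freq="D"))

# shift(1) aligns each value with the previous one, so the ratio is "today / yesterday"
change = prices / prices.shift(1) - 1
print(change)
```

The first entry is NaN because there is no earlier value to compare with. pandas also provides `Series.pct_change()` for this kind of calculation.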

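## Resampling
* Up-sampling: converting to shorter intervals (higher frequency), so new time points have to be filled in
* Down-sampling: converting to longer intervals (lower frequency), so values have to be aggregated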

In [27]:
ts = pd.Series(np.random.randn(100), index=pd.date_range("2000-1-1", periods=100, freq="D"))
ts.tail(10)

2000-03-31    2.163236
2000-04-01    1.336528
2000-04-02   -0.369182
2000-04-03   -0.239379
2000-04-04    1.099660
2000-04-05    0.655264
2000-04-06    0.640132
2000-04-07   -1.616956
2000-04-08   -0.024326
2000-04-09   -0.738031
Freq: D, dtype: float64

In [28]:
ts.resample('W').mean()

2000-01-02    1.701728
2000-01-09    0.757735
2000-01-16    0.326132
2000-01-23    0.125678
2000-01-30    0.043661
2000-02-06    0.205133
2000-02-13    0.145183
2000-02-20   -0.490555
2000-02-27    0.235654
2000-03-05    0.150915
2000-03-12   -0.419241
2000-03-19   -0.953882
2000-03-26   -0.196795
2000-04-02    0.725372
2000-04-09   -0.031948
Freq: W-SUN, dtype: float64

In [29]:
ts.resample('M').first()

2000-01-31    1.922942
2000-02-29   -1.093062
2000-03-31    1.188030
2000-04-30    1.336528
Freq: M, dtype: float64

In [40]:
ts = pd.Series(np.random.randn(60), index=pd.date_range("2000-1-1", periods=60, freq="T"))
ts.head(10)

2000-01-01 00:00:00    1.648135
2000-01-01 00:01:00    0.164228
2000-01-01 00:02:00    0.567290
2000-01-01 00:03:00   -0.222675
2000-01-01 00:04:00   -0.353432
2000-01-01 00:05:00   -1.616474
2000-01-01 00:06:00   -0.291837
2000-01-01 00:07:00   -0.761492
2000-01-01 00:08:00    0.857924
2000-01-01 00:09:00    1.141102
Freq: T, dtype: float64

In [32]:
ts.resample('10min').sum()

2000-01-01 00:00:00   -0.906751
2000-01-01 00:10:00    2.848847
2000-01-01 00:20:00    3.808071
2000-01-01 00:30:00   -2.951444
2000-01-01 00:40:00   -0.998848
2000-01-01 00:50:00   -1.688519
Freq: 10T, dtype: float64

In [41]:
# closed: which edge of each bin is inclusive
# closed="right" includes the right edge (minutes 01 to 10, 11 to 20, and so on)
ts.resample('10min', closed="right").sum()

1999-12-31 23:50:00    1.648135
2000-01-01 00:00:00    0.951212
2000-01-01 00:10:00   -0.463708
2000-01-01 00:20:00    2.453638
2000-01-01 00:30:00   -1.141659
2000-01-01 00:40:00    0.503144
2000-01-01 00:50:00   -1.114498
Freq: 10T, dtype: float64
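The effect of `closed`, and of the related `label` argument, is easier to follow on a series whose values are all 1, because each sum is then simply the number of points in the bin. This is a minimal sketch on hypothetical data; the series `ones` is an assumption made for this example.

```python
import pandas as pd

# 20 one-minute points from 00:00 to 00:19, all equal to 1
ones = pd.Series(1, index=pd.date_range("2000-01-01", periods=20, freq="min"))

# Default for "10min": bins [00:00, 00:10) and [00:10, 00:20)
left_closed = ones.resample("10min").sum()

# closed="right": bins (23:50, 00:00], (00:00, 00:10], (00:10, 00:20]
right_closed = ones.resample("10min", closed="right").sum()

# label="right" names each bin after its right edge instead of its left edge
right_labeled = ones.resample("10min", closed="right", label="right").sum()

print(left_closed)
print(right_closed)
print(right_labeled)
```

With `closed="right"` the point at 00:00 falls into a bin of its own that is labeled 23:50 of the previous day, which explains the extra first row in the output above.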

In [39]:
# open, high, low, close: the first, largest, smallest, and last value in each bin
ts.resample('5min').ohlc()

                         open      high       low     close
2000-01-01 00:00:00 -0.704700  0.943261 -1.188945  0.773253
2000-01-01 00:05:00 -1.183881  0.606320 -2.659172  0.450934
2000-01-01 00:10:00 -0.684011  1.659551 -0.687838 -0.687838
2000-01-01 00:15:00 -1.214077  0.156704 -1.214077  0.156704
2000-01-01 00:20:00  0.578521  1.364532 -1.437791  1.364532
2000-01-01 00:25:00 -0.689449 -0.477974 -1.843070 -0.477974
2000-01-01 00:30:00 -0.479656  0.931848 -0.479656  0.931848
2000-01-01 00:35:00  0.339965  0.339965 -0.394850 -0.394850
2000-01-01 00:40:00 -0.267734  0.841631 -1.128011  0.841631
2000-01-01 00:45:00 -0.249459  0.643314 -1.570623 -1.570623


In [38]:
ts.resample('30s').ffill().head(20)

2000-01-01 00:00:00   -0.704700
2000-01-01 00:00:30   -0.704700
2000-01-01 00:01:00    0.943261
2000-01-01 00:01:30    0.943261
2000-01-01 00:02:00    0.747188
2000-01-01 00:02:30    0.747188
2000-01-01 00:03:00   -1.188945
2000-01-01 00:03:30   -1.188945
2000-01-01 00:04:00    0.773253
2000-01-01 00:04:30    0.773253
2000-01-01 00:05:00   -1.183881
2000-01-01 00:05:30   -1.183881
2000-01-01 00:06:00   -2.659172
2000-01-01 00:06:30   -2.659172
2000-01-01 00:07:00    0.606320
2000-01-01 00:07:30    0.606320
2000-01-01 00:08:00   -1.755891
2000-01-01 00:08:30   -1.755891
2000-01-01 00:09:00    0.450934
2000-01-01 00:09:30    0.450934
Freq: 30S, dtype: float64
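Forward filling is only one way to fill the new time points created by up-sampling. The sketch below compares three options on a small hypothetical series (the name `sparse` and its values are assumptions made for this example): leaving the new slots empty, repeating the last known value, and interpolating linearly.

```python
import pandas as pd

# Hypothetical series with one point every two days
sparse = pd.Series([0.0, 2.0, 4.0],
                   index=pd.date_range("2000-01-01", periods=3, freq="2D"))

as_is = sparse.resample("D").asfreq()               # new slots are left as NaN
filled = sparse.resample("D").ffill()               # repeat the last known value
interpolated = sparse.resample("D").interpolate()   # linear interpolation between points

print(as_is)
print(filled)
print(interpolated)
```

Which option is appropriate depends on what the values represent: forward filling suits quantities that stay constant until the next observation, while interpolation suits quantities that change smoothly.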