<img src="https://raw.githubusercontent.com/dataitgirls2/10minutes2pandas/master/_layouts/og_image_trsp.png" width="60%">

>[10 Minutes to pandas](https://dataitgirls2.github.io/10minutes2pandas/)
  > 1. Object Creation
  > 2. Viewing Data
  > 3. Selection
  > 4. Missing Data
  > 5. Operation
  > 6. Merge
  > 7. Grouping
  > 8. Reshaping
  > 9. Time Series
  > 10. Categoricals
  > 11. Plotting
  > 12. Getting Data In / Out
  > 13. Gotchas

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Object Creation
> *See the [Intro to data structures section](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html).*<br>
> Create a Series by passing a list of values; pandas assigns a default integer index.

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
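
The `float64` dtype above comes from the `np.nan` entry. A minimal sketch of that behavior, with made-up values and labels (the names `ints`, `with_nan`, and `labeled` are illustrative only):

```python
import numpy as np
import pandas as pd

# Without NaN, a list of Python ints is stored with an integer dtype.
ints = pd.Series([1, 3, 5])
# A single NaN makes the whole Series float64.
with_nan = pd.Series([1, 3, np.nan])
# Explicit labels replace the default 0..n-1 integer index.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(ints.dtype)      # int64
print(with_nan.dtype)  # float64
print(labeled["b"])    # 3
```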

>Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.

In [3]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.215793,-1.388198,-0.398252,1.477181
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858
2013-01-04,-0.404943,-1.319855,-0.155648,-2.401771
2013-01-05,0.663559,0.383396,-0.299668,0.303964
2013-01-06,-0.04713,0.190586,-0.174484,1.184288


> Create a DataFrame by passing a dict of objects that can be converted to something Series-like.

In [5]:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3]  * 4, dtype='int32'),
                    'E' : pd.Categorical(['test', 'train', 'test', 'train']),
                    'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## Viewing Data
>See the [Basics section](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html).

In [7]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-1.215793,-1.388198,-0.398252,1.477181
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858
2013-01-04,-0.404943,-1.319855,-0.155648,-2.401771
2013-01-05,0.663559,0.383396,-0.299668,0.303964


In [8]:
df.tail()

Unnamed: 0,A,B,C,D
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858
2013-01-04,-0.404943,-1.319855,-0.155648,-2.401771
2013-01-05,0.663559,0.383396,-0.299668,0.303964
2013-01-06,-0.04713,0.190586,-0.174484,1.184288


In [9]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [11]:
df.values

array([[-1.21579272, -1.38819787, -0.39825174,  1.47718078],
       [-0.84059446,  0.60662617,  1.06953106,  0.3205947 ],
       [ 0.8937758 ,  0.65832119,  1.24371021, -0.72985814],
       [-0.40494276, -1.31985489, -0.15564785, -2.40177109],
       [ 0.66355931,  0.38339641, -0.29966786,  0.30396394],
       [-0.04712969,  0.19058579, -0.17448384,  1.1842883 ]])

In [12]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.158521,-0.144854,0.214198,0.025733
std,0.829376,0.951621,0.73736,1.419648
min,-1.215793,-1.388198,-0.398252,-2.401771
25%,-0.731682,-0.942245,-0.268372,-0.471403
50%,-0.226036,0.286991,-0.165066,0.312279
75%,0.485887,0.550819,0.763236,0.968365
max,0.893776,0.658321,1.24371,1.477181


> Transpose the rows and columns.

In [13]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-1.215793,-0.840594,0.893776,-0.404943,0.663559,-0.04713
B,-1.388198,0.606626,0.658321,-1.319855,0.383396,0.190586
C,-0.398252,1.069531,1.24371,-0.155648,-0.299668,-0.174484
D,1.477181,0.320595,-0.729858,-2.401771,0.303964,1.184288


> Sort by an axis.

In [14]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,1.477181,-0.398252,-1.388198,-1.215793
2013-01-02,0.320595,1.069531,0.606626,-0.840594
2013-01-03,-0.729858,1.24371,0.658321,0.893776
2013-01-04,-2.401771,-0.155648,-1.319855,-0.404943
2013-01-05,0.303964,-0.299668,0.383396,0.663559
2013-01-06,1.184288,-0.174484,0.190586,-0.04713


> Sort by values.

In [15]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-01,-1.215793,-1.388198,-0.398252,1.477181
2013-01-04,-0.404943,-1.319855,-0.155648,-2.401771
2013-01-06,-0.04713,0.190586,-0.174484,1.184288
2013-01-05,0.663559,0.383396,-0.299668,0.303964
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858


## Selection
> Note: Standard Python / NumPy expressions for selecting and setting are intuitive and handy for interactive work, but for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc.<br>
> *See the [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) documentation and the [MultiIndex / Advanced Indexing documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).*
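
To make the four accessors concrete, here is a minimal sketch on a small hypothetical frame (`demo` and its values are made up for illustration): `.loc` and `.at` take labels, while `.iloc` and `.iat` take integer positions.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; labels and values are illustrative only.
idx = pd.date_range("2013-01-01", periods=3)
demo = pd.DataFrame(np.arange(6).reshape(3, 2), index=idx, columns=["A", "B"])

by_label = demo.loc[idx[1], "B"]    # label-based lookup
by_position = demo.iloc[1, 1]       # position-based lookup
fast_label = demo.at[idx[1], "B"]   # single scalar by label
fast_position = demo.iat[1, 1]      # single scalar by position

print(by_label, by_position, fast_label, fast_position)  # 3 3 3 3
```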

### Getting
> Select a single column, which yields a Series; equivalent to df.A.

In [16]:
df['A']

2013-01-01   -1.215793
2013-01-02   -0.840594
2013-01-03    0.893776
2013-01-04   -0.404943
2013-01-05    0.663559
2013-01-06   -0.047130
Freq: D, Name: A, dtype: float64

> Slice rows with [], either by position or by index label.

In [17]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-1.215793,-1.388198,-0.398252,1.477181
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858


In [18]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858
2013-01-04,-0.404943,-1.319855,-0.155648,-2.401771


### Selection by Label: loc[]
> *See [Selection by Label](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).*<br>
> Get a row by label with loc (dates was set as the row index above).

In [19]:
df.loc[dates[0]]

A   -1.215793
B   -1.388198
C   -0.398252
D    1.477181
Name: 2013-01-01 00:00:00, dtype: float64

> Select multiple columns by label with loc.

In [20]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,-1.215793,-1.388198
2013-01-02,-0.840594,0.606626
2013-01-03,0.893776,0.658321
2013-01-04,-0.404943,-1.319855
2013-01-05,0.663559,0.383396
2013-01-06,-0.04713,0.190586


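> Slice rows by label and name the columns explicitly. Both endpoints of a label slice are included. (The commented-out second line gives the same result.)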

In [21]:
df.loc['20130102' : '20130104', ['A', 'B']]
# df.loc['20130102' : '20130104', 'A' : 'B']

Unnamed: 0,A,B
2013-01-02,-0.840594,0.606626
2013-01-03,0.893776,0.658321
2013-01-04,-0.404943,-1.319855
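
Label slices with `.loc` include the end label, unlike positional slices. A minimal sketch of the difference, using a hypothetical frame named `demo` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; labels and values are illustrative only.
idx = pd.date_range("2013-01-01", periods=6)
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=["A", "B"])

label_slice = demo.loc[idx[1]:idx[3]]  # labels: both endpoints included
position_slice = demo.iloc[1:3]        # positions: end is excluded

print(len(label_slice))     # 3
print(len(position_slice))  # 2
```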


> Selecting a single row label reduces the dimension of the returned object (the result is a pandas.core.series.Series).

In [22]:
df.loc['20130102',['A','B']]

A   -0.840594
B    0.606626
Name: 2013-01-02 00:00:00, dtype: float64

> Get the scalar value at a specific row and column.

In [23]:
df.loc[dates[0],'A']

-1.215792718505012

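> For fast access to a single scalar, use <u>"*at*"</u>. It works like loc, but is meant for cases where you only need to get or set a single value in a DataFrame or Series.

In [24]:
df.at[dates[0], 'A']
# df.at?  # IPython / Jupyter only: shows the docstring for .at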

### Selection by Position: iloc[]
> *See [Selection by Position](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) for details.*<br>
> Select a row by its integer position.

In [25]:
df.iloc[3]

A   -0.404943
B   -1.319855
C   -0.155648
D   -2.401771
Name: 2013-01-04 00:00:00, dtype: float64

> Slice by integer position, similar to NumPy / Python.

In [26]:
df.iloc[3:5,0:2]

Unnamed: 0,A,B
2013-01-04,-0.404943,-1.319855
2013-01-05,0.663559,0.383396


> Select specific rows and columns with lists of integer positions (NumPy / Python style).

In [27]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,-0.840594,1.069531
2013-01-03,0.893776,1.24371
2013-01-05,0.663559,-0.299668


> Slice rows explicitly.

In [28]:
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,-0.840594,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,-0.729858


> Slice columns explicitly.

In [29]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,-1.388198,-0.398252
2013-01-02,0.606626,1.069531
2013-01-03,0.658321,1.24371
2013-01-04,-1.319855,-0.155648
2013-01-05,0.383396,-0.299668
2013-01-06,0.190586,-0.174484


> Get a single (scalar) value explicitly.

In [30]:
df.iloc[1,1]

0.6066261731899044

> For fast access to a scalar (equivalent to the previous method).

In [31]:
df.iat[1,1]

0.6066261731899044

### Boolean Indexing
>Use the values of a single column to select data.

In [32]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-03,0.893776,0.658321,1.24371,-0.729858
2013-01-05,0.663559,0.383396,-0.299668,0.303964


> Select values from a DataFrame where a boolean condition is met; values that fail the condition become NaN.

In [33]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,,,1.477181
2013-01-02,,0.606626,1.069531,0.320595
2013-01-03,0.893776,0.658321,1.24371,
2013-01-04,,,,
2013-01-05,0.663559,0.383396,,0.303964
2013-01-06,,0.190586,,1.184288
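
Boolean masks can also be combined. A minimal sketch, assuming a small hypothetical frame `demo` with made-up values; note that each condition is parenthesized, because `&`, `|` and `~` bind more tightly than comparison operators:

```python
import pandas as pd

# Hypothetical example frame; values are illustrative only.
demo = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -0.3], "B": [1.0, -2.0, 3.0, 0.4]})

both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]    # element-wise AND
either_positive = demo[(demo["A"] > 0) | (demo["B"] > 0)]  # element-wise OR
not_positive_a = demo[~(demo["A"] > 0)]                    # element-wise NOT

print(both_positive.index.tolist())    # [2]
print(either_positive.index.tolist())  # [0, 1, 2, 3]
print(not_positive_a.index.tolist())   # [0, 3]
```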


> Use the isin() method for filtering.

In [34]:
df2 = df.copy()

In [35]:
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-1.215793,-1.388198,-0.398252,1.477181,one
2013-01-02,-0.840594,0.606626,1.069531,0.320595,one
2013-01-03,0.893776,0.658321,1.24371,-0.729858,two
2013-01-04,-0.404943,-1.319855,-0.155648,-2.401771,three
2013-01-05,0.663559,0.383396,-0.299668,0.303964,four
2013-01-06,-0.04713,0.190586,-0.174484,1.184288,three


In [36]:
df2[df2['E'].isin(['two', 'four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.893776,0.658321,1.24371,-0.729858,two
2013-01-05,0.663559,0.383396,-0.299668,0.303964,four


### Setting
> Setting a new column automatically aligns the data by index.

In [37]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [38]:
df['f'] = s1
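
The alignment is easier to see on a smaller case. In this minimal sketch (the frame `demo`, the Series `shifted`, and the column name `F` are made up for illustration), the Series starts one day after the frame, so the first row gets NaN and the Series value for the extra date is dropped:

```python
import pandas as pd

# Hypothetical example frame; labels and values are illustrative only.
idx = pd.date_range("2013-01-01", periods=3)
demo = pd.DataFrame({"A": [10, 20, 30]}, index=idx)

# This Series covers 2013-01-02 through 2013-01-04.
shifted = pd.Series([1, 2, 3], index=pd.date_range("2013-01-02", periods=3))
demo["F"] = shifted  # matched on index labels, not on position

print(demo["F"].tolist())  # [nan, 1.0, 2.0]
```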

> Set a value by label (sets the first row of column A to 0).

In [39]:
df.at[dates[0], 'A'] = 0

> Set a value by position.

In [40]:
df.iat[0, 1] = 0

> Set values by assigning a NumPy array.

In [41]:
df.loc[:, 'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D,f
2013-01-01,0.0,0.0,-0.398252,5,
2013-01-02,-0.840594,0.606626,1.069531,5,1.0
2013-01-03,0.893776,0.658321,1.24371,5,2.0
2013-01-04,-0.404943,-1.319855,-0.155648,5,3.0
2013-01-05,0.663559,0.383396,-0.299668,5,4.0
2013-01-06,-0.04713,0.190586,-0.174484,5,5.0


> A where operation with setting.

In [42]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,f
2013-01-01,0.0,0.0,-0.398252,-5,
2013-01-02,-0.840594,-0.606626,-1.069531,-5,-1.0
2013-01-03,-0.893776,-0.658321,-1.24371,-5,-2.0
2013-01-04,-0.404943,-1.319855,-0.155648,-5,-3.0
2013-01-05,-0.663559,-0.383396,-0.299668,-5,-4.0
2013-01-06,-0.04713,-0.190586,-0.174484,-5,-5.0
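
The same kind of replacement can be written without in-place assignment. A minimal sketch using `DataFrame.mask` and `DataFrame.where` on a hypothetical frame `demo` with made-up values; both return a new object and leave the original untouched:

```python
import pandas as pd

# Hypothetical example frame; values are illustrative only.
demo = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

# mask(): replace values where the condition is True.
flipped = demo.mask(demo > 0, -demo)
# where(): keep values where the condition is True, replace the rest.
flipped_where = demo.where(demo <= 0, -demo)

print(flipped.values.tolist())        # [[-1.0, -3.0], [-2.0, -4.0]]
print(flipped.equals(flipped_where))  # True
```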


## Missing Data
> pandas primarily uses np.nan to represent missing data.<br>
> By default, missing values are excluded from computations. See the [Missing data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

>Reindexing lets you change, add, or delete the index on a specified axis.<br>
>Reindexing returns a copy of the data.

In [43]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [44]:
df1.loc[dates[0]:dates[1], 'E'] = 1

In [45]:
df1

Unnamed: 0,A,B,C,D,f,E
2013-01-01,0.0,0.0,-0.398252,5,,1.0
2013-01-02,-0.840594,0.606626,1.069531,5,1.0,1.0
2013-01-03,0.893776,0.658321,1.24371,5,2.0,
2013-01-04,-0.404943,-1.319855,-0.155648,5,3.0,


> Drop any rows that have missing data.

In [46]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,f,E
2013-01-02,-0.840594,0.606626,1.069531,5,1.0,1.0


> Fill in missing data.

In [47]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,f,E
2013-01-01,0.0,0.0,-0.398252,5,5.0,1.0
2013-01-02,-0.840594,0.606626,1.069531,5,1.0,1.0
2013-01-03,0.893776,0.658321,1.24371,5,2.0,5.0
2013-01-04,-0.404943,-1.319855,-0.155648,5,3.0,5.0


> Get a boolean mask showing where values are NaN.<br>
> <u>"*isna()*"</u> marks present values as False and NaN values as True.

In [48]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,f,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True
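
Because True counts as 1, the boolean mask can be summed to count missing values. A minimal sketch on a hypothetical frame `demo` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; values are illustrative only.
demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

missing_per_column = demo.isna().sum()       # number of NaN values per column
rows_with_missing = demo.isna().any(axis=1)  # True if the row has any NaN

print(int(missing_per_column["A"]), int(missing_per_column["B"]))  # 1 2
print(rows_with_missing.tolist())  # [True, True, False]
```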


## Operation
> See the [Basic section on Binary Ops](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html).

### Stats
> Operations generally exclude missing data.<br>
> Compute a descriptive statistic.<br>

> <u>"*mean()*"</u> works column by column by default (axis=0).

In [49]:
df.mean()

A    0.044111
B    0.086512
C    0.214198
D    5.000000
f    3.000000
dtype: float64

> The same operation along the other axis (one mean per row).

In [50]:
df.mean(1)

2013-01-01    1.150437
2013-01-02    1.367113
2013-01-03    1.959161
2013-01-04    1.223911
2013-01-05    1.949458
2013-01-06    1.993794
Freq: D, dtype: float64
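
A minimal sketch of how missing values interact with `mean()`, using a hypothetical frame `demo` with made-up values. NaN is skipped by default, and `skipna=False` makes it propagate instead:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; values are illustrative only.
demo = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, 5.0, 6.0]})

col_means = demo.mean()           # axis=0: one mean per column, NaN skipped
row_means = demo.mean(axis=1)     # axis=1: one mean per row, NaN skipped
strict = demo.mean(skipna=False)  # NaN propagates into the result

print(col_means.tolist())      # [1.5, 5.0]
print(row_means.tolist())      # [2.5, 3.5, 6.0]
print(strict.isna().tolist())  # [True, False]
```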

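Next, we operate on objects that have different dimensionality and need alignment. pandas automatically broadcasts along the specified dimension.

Translator's note: broadcasting is a term that comes from NumPy. It describes the rules that determine the result when an operation combines arrays of different dimensions, or an array and a scalar.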
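
As a minimal sketch of alignment plus broadcasting (the frame `demo` and the Series `offsets` are made up for illustration), subtracting a Series with `axis="index"` matches it to the row labels and applies it to every column:

```python
import pandas as pd

# Hypothetical example data; labels and values are illustrative only.
idx = pd.date_range("2013-01-01", periods=3)
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=idx)
offsets = pd.Series([1.0, 2.0, 3.0], index=idx)

# axis="index" aligns the Series with the row labels and
# broadcasts it across all columns.
result = demo.sub(offsets, axis="index")

print(result["A"].tolist())  # [0.0, 0.0, 0.0]
print(result["B"].tolist())  # [9.0, 18.0, 27.0]
```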