In [1]:
import numpy as np
import pandas as pd

## Computing and Summarizing Descriptive Statistics

In [2]:
df = pd.DataFrame([[1.4, np.nan], [7.1,-4.5],
                  [np.nan,np.nan],[0.75,-1.3]],
                 index = ['a','b','c','d'],
                 columns = ['one','two'])

df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [3]:
# Returns a Series containing the sum of each column

df.sum()

one    9.25
two   -5.80
dtype: float64

In [4]:
df.sum(axis = 'columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

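NA values are excluded from the calculation unless the entire row or column is NA. When every value is NA, `sum` returns 0 (row `c` above), whereas `mean` returns NaN.

This behavior can be controlled with the **skipna** option.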

In [5]:
df.mean(axis = 'columns', skipna = False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

#### Options for reduction methods

|Option|Description|
|-----|-----|
|**axis**|Axis to reduce over. For a DataFrame, 0 means rows and 1 means columns.|
|**skipna**|Whether to exclude missing values. Defaults to True.|
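
As a minimal sketch of how these options interact, the example below rebuilds the same `df` and compares a few reductions. The `min_count` argument of `sum` is not covered above; it is included on the assumption that a reader may want an all-NA row to produce NaN instead of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# axis=0 (or 'index') reduces down the rows: one result per column
col_sums = df.sum(axis=0)

# axis=1 (or 'columns') reduces across the columns: one result per row
row_sums = df.sum(axis=1)

# skipna=False lets any NaN in a row propagate into that row's result
row_sums_strict = df.sum(axis=1, skipna=False)

# min_count=1 requires at least one non-NA value, so the all-NA row 'c'
# yields NaN rather than 0
row_sums_min1 = df.sum(axis=1, min_count=1)

print(row_sums_min1)
```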


In [6]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


**describe** produces several summary statistics in a single call.
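
The output above is for numeric columns. As a small sketch of the non-numeric case, `describe` on a Series of strings reports a different set of statistics (`count`, `unique`, `top`, `freq`). The Series `labels` is a made-up example and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical categorical-like data: 'a' occurs 8 times, 'b' and 'c' 4 times each
labels = pd.Series(['a', 'a', 'b', 'c'] * 4)

# For object data, describe reports the number of values, the number of
# distinct values, the most frequent value, and how often it occurs
summary = labels.describe()
print(summary)
```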

### Correlation and Covariance

The **pandas-datareader** package can be used to fetch DataFrames containing stock prices and trading volumes.

In [7]:
import pandas_datareader.data as web

all_data = {ticker : web.get_data_yahoo(ticker)
           for ticker in ['AAPL','IBM','MSFT','GOOG']}

price = pd.DataFrame({ticker : data['Adj Close']
                     for ticker, data in all_data.items()})

volume = pd.DataFrame({ticker : data['Volume']
                      for ticker, data in all_data.items()})

In [8]:
all_data

{'AAPL':                   High         Low        Open       Close      Volume  \
 Date                                                                     
 2015-02-20  129.500000  128.050003  128.619995  129.500000  48948400.0   
 2015-02-23  133.000000  129.660004  130.020004  133.000000  70974100.0   
 2015-02-24  133.600006  131.169998  132.940002  132.169998  69228100.0   
 2015-02-25  131.600006  128.149994  131.559998  128.789993  74711700.0   
 2015-02-26  130.869995  126.610001  128.789993  130.419998  91287500.0   
 2015-02-27  130.570007  128.240005  130.000000  128.460007  62014800.0   
 2015-03-02  130.279999  128.300003  129.250000  129.089996  48096700.0   
 2015-03-03  129.520004  128.089996  128.960007  129.360001  37816300.0   
 2015-03-04  129.559998  128.320007  129.100006  128.539993  31666300.0   
 2015-03-05  128.750000  125.760002  128.580002  126.410004  56517100.0   
 2015-03-06  129.369995  126.260002  128.399994  126.599998  72842100.0   
 2015-03-09  129.

In [9]:
price

                  AAPL         IBM       MSFT        GOOG
Date
2015-02-20  119.173599  131.425018  39.547894  537.474365
2015-02-23  122.394508  130.830750  39.809380  530.453613
2015-02-24  121.630684  132.372681  39.755280  534.622192
2015-02-25  118.520218  130.750443  39.665104  542.380920
2015-02-26  120.020248  129.192459  39.728226  553.959106
2015-02-27  118.216537  130.051773  39.538872  556.871094
2015-03-02  118.796295  128.879242  39.565929  569.775696
2015-03-03  119.044762  129.320953  39.024906  572.069397
2015-03-04  118.290138  128.027969  38.826534  571.800110
2015-03-05  116.330009  129.441391  38.871628  573.754761


In [10]:
returns = price.pct_change()

returns.tail()

                AAPL       IBM      MSFT      GOOG
Date
2020-02-12  0.023748  0.011923  0.001464  0.006283
2020-02-13 -0.007121 -0.006439 -0.005414 -0.002378
2020-02-14  0.000246 -0.023394  0.008927  0.004014
2020-02-18 -0.018311  0.002654  0.010143 -0.000704
2020-02-19  0.014483 -0.001588  0.002999  0.004619
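
To illustrate what `pct_change` computes, here is a minimal sketch on a toy price series (the values and the names `toy_price`, `toy_returns`, and `manual` are invented for illustration). Each entry is the relative change from the previous row, so the first entry has no predecessor and is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, not taken from the downloaded data
toy_price = pd.Series([100.0, 110.0, 99.0, 99.0])

# Percent change relative to the previous row
toy_returns = toy_price.pct_change()

# The same quantity written out by hand: p_t / p_(t-1) - 1
manual = toy_price / toy_price.shift(1) - 1

print(toy_returns)
```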


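The `corr` method of a Series computes the correlation between two Series, using only the non-NA values whose index labels overlap after alignment.

The `cov` method computes the covariance in the same way.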

In [11]:
returns['MSFT'].corr(returns['IBM'])

0.4678460826113295

In [12]:
returns['MSFT'].cov(returns['IBM'])

8.804963427771954e-05

The columns can also be selected with the more concise attribute syntax.

In [13]:
returns.MSFT.corr(returns.IBM)

0.4678460826113295
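
The alignment and NA handling of `Series.corr` and `Series.cov` can be checked by hand. The sketch below uses two short made-up Series (`x` and `y`, not the stock data) with partly different index labels and some missing values, and reproduces both results from the sample covariance formula, assuming the default Pearson correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: labels 'c' and 'd' each have one NaN, 'f' exists only in y
x = pd.Series([0.01, -0.02, 0.03, np.nan, 0.015], index=list('abcde'))
y = pd.Series([0.02, -0.01, np.nan, 0.01, 0.03, 0.5], index=list('abcdef'))

# Align on the index and keep only rows where both values are present
pair = pd.concat([x, y], axis=1, keys=['x', 'y']).dropna()

# Sample covariance (n - 1 in the denominator)
dx = pair['x'] - pair['x'].mean()
dy = pair['y'] - pair['y'].mean()
manual_cov = (dx * dy).sum() / (len(pair) - 1)

# Pearson correlation = covariance / (std_x * std_y)
manual_corr = manual_cov / (pair['x'].std() * pair['y'].std())

print(manual_cov, x.cov(y))
print(manual_corr, x.corr(y))
```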

On a DataFrame, the `corr` and `cov` methods return the full correlation and covariance matrices, respectively, as DataFrames.

In [14]:
returns.corr()

          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.387410  0.583704  0.532157
IBM   0.387410  1.000000  0.467846  0.404681
MSFT  0.583704  0.467846  1.000000  0.671147
GOOG  0.532157  0.404681  0.671147  1.000000


In [15]:
returns.cov()

          AAPL       IBM      MSFT      GOOG
AAPL  0.000242  0.000079  0.000131  0.000125
IBM   0.000079  0.000170  0.000088  0.000080
MSFT  0.000131  0.000088  0.000208  0.000146
GOOG  0.000125  0.000080  0.000146  0.000227


In [16]:
returns.corrwith(returns.IBM)

AAPL    0.387410
IBM     1.000000
MSFT    0.467846
GOOG    0.404681
dtype: float64
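
`corrwith` computes the correlation of each column with another Series or DataFrame. The following is a rough sketch on random data (the frame `toy` and the other names are made up for illustration): passing a Series correlates it with every column, while passing a DataFrame correlates columns with matching names.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 rows of random values in three columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_normal((50, 3)), columns=['x', 'y', 'z'])

# Series argument: correlation of every column with column 'x'
by_series = toy.corrwith(toy['x'])

# DataFrame argument: columns are paired by name; 'other' is a positive
# linear transform of 'x' and 'y', so those correlations are 1
other = toy[['x', 'y']] * 2 + 1
by_frame = toy.corrwith(other)

print(by_series)
print(by_frame)  # 'z' has no partner column in 'other', so it is NaN
```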

### Counting Values

In [17]:
obj = pd.Series(['a','b','d','c','a','c','b','a','a','c','c'])

The **value_counts** method returns a Series containing the frequency of each value in a Series.

In [18]:
obj.value_counts()

c    4
a    4
b    2
d    1
dtype: int64

In [20]:
# value_counts sorts by count in descending order by default; sort=False turns the sorting off

pd.value_counts(obj.values, sort = False)

a    4
d    1
b    2
c    4
dtype: int64
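
As a short sketch of a few other ways to arrange these counts, the example below uses the same `obj` with the `normalize` option and with `sort_index`. The variable names `counts`, `shares`, and `by_label` are chosen here for illustration.

```python
import pandas as pd

obj = pd.Series(['a', 'b', 'd', 'c', 'a', 'c', 'b', 'a', 'a', 'c', 'c'])

# Default: counts sorted from most to least frequent
counts = obj.value_counts()

# normalize=True returns relative frequencies instead of raw counts
shares = obj.value_counts(normalize=True)

# Sort the result by label rather than by count
by_label = obj.value_counts().sort_index()

print(shares)
print(by_label)
```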