### Computing and summarizing descriptive statistics

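#### Options for reduction methods such as ```mean()``` and ```sum()```
- ```axis``` : the axis to reduce over. 0 or 'index' reduces down the rows (one result per column); 1 or 'columns' reduces across the columns (one result per row)
- ```skipna``` : whether to exclude missing values (default True)
- ```level``` : if the axis has a MultiIndex, group by the given level before reducing (removed from reductions in pandas 2.0; ```groupby(level=...)``` covers this case)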
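
The options above can be sketched on a small frame. The `frame` below and its `outer`/`inner` level names are made up for illustration, and the last step assumes pandas 2.0 or later, where per-level aggregation is written with `groupby(level=...)` instead of a `level` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index: outer key ('x', 'y') and inner key (1, 2).
idx = pd.MultiIndex.from_product([['x', 'y'], [1, 2]], names=['outer', 'inner'])
frame = pd.DataFrame({'one': [1.0, 2.0, np.nan, 4.0],
                      'two': [10.0, 20.0, 30.0, 40.0]}, index=idx)

# axis='index' (0): reduce down the rows, one result per column.
col_sums = frame.sum(axis='index')

# axis='columns' (1): reduce across the columns, one result per row.
row_sums = frame.sum(axis='columns')

# skipna=False propagates NaN instead of ignoring it.
strict_col_sums = frame.sum(axis='index', skipna=False)

# Per-level aggregation through groupby (assumed stand-in for level=...).
outer_sums = frame.groupby(level='outer').sum()
```

Here `col_sums['one']` ignores the NaN, while `strict_col_sums['one']` is NaN because one of its inputs is missing.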

In [2]:
import numpy as np
import pandas as pd

In [6]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index = ['a', 'b', 'c', 'd'],
                  columns = ['one', 'two'])
df

|  | one | two |
| -- | -- | -- |
| a | 1.4 | NaN |
| b | 7.1 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |


In [16]:
df.sum() # column totals (NaN is skipped by default)
df.sum(axis = 'columns') # row totals; only this last expression is displayed

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

- Row means without skipping missing values: with ```skipna = False```, any row that contains a NaN yields NaN

In [14]:
df.mean(axis = 'columns', skipna = False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

### Summary statistics methods

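| Method | Description |
| -- | -- |
| .count | Number of non-missing values |
| .describe | Summary statistics for a Series or for each column of a DataFrame |
| .min, .max | Minimum and maximum values |
| .argmin, .argmax | Integer positions of the minimum and maximum values |
| .idxmin, .idxmax | Index labels of the minimum and maximum values |
| .quantile | Sample quantile for a fraction between 0 and 1 |
| .sum | Sum |
| .mean | Mean |
| .median | Median (50% quantile) |
| .mad | Mean absolute deviation from the mean (removed in pandas 2.0) |
| .prod | Product of all values |
| .var | Sample variance |
| .std | Sample standard deviation |
| .skew | Sample skewness |
| .kurt | Sample kurtosis |
| .cumsum | Cumulative sum |
| .cummin, .cummax | Cumulative minimum, cumulative maximum |
| .cumprod | Cumulative product |
| .diff | First difference between consecutive values |
| .pct_change | Fractional (percent) change between consecutive values |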
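
A few of the methods in the table are easy to confuse, so here is a minimal sketch on a made-up Series `s`. It contrasts position-based `argmax` with label-based `idxmax`, shows the cumulative and differencing methods, and computes the mean absolute deviation by hand, assuming a pandas version where `.mad` is no longer available.

```python
import pandas as pd

# Hypothetical data with string labels so positions and labels differ.
s = pd.Series([3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])

position = s.argmax()        # integer position of the maximum
label = s.idxmax()           # index label of the maximum

running_total = s.cumsum()   # cumulative sum
running_max = s.cummax()     # cumulative maximum
step = s.diff()              # difference from the previous value (first is NaN)

median = s.quantile(0.5)     # the 0.5 quantile is the median

# Mean absolute deviation from the mean, written out explicitly.
mad = (s - s.mean()).abs().mean()
```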

In [22]:
df.idxmax()

one    b
two    d
dtype: object

In [20]:
df.describe()

|  | one | two |
| -- | -- | -- |
| count | 3.0 | 2.0 |
| mean | 3.083333 | -2.9 |
| std | 3.493685 | 2.262742 |
| min | 0.75 | -4.5 |
| 25% | 1.075 | -3.7 |
| 50% | 1.4 | -2.9 |
| 75% | 4.25 | -2.1 |
| max | 7.1 | -1.3 |


### Correlation and covariance

In [25]:
import pandas_datareader.data as web

In [26]:
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [28]:
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

- Percent change in stock prices

In [36]:
returns = price.pct_change()
returns.tail()

| Date | AAPL | IBM | MSFT | GOOG |
| -- | -- | -- | -- | -- |
| 2022-07-25 | -0.007398 | 0.002261 | -0.005876 | -0.001384 |
| 2022-07-26 | -0.008826 | -0.003579 | -0.026774 | -0.025598 |
| 2022-07-27 | 0.034235 | 0.00812 | 0.066852 | 0.07739 |
| 2022-07-28 | 0.003572 | 0.000775 | 0.028541 | 0.008715 |
| 2022-07-29 | 0.032793 | 0.01215 | 0.015665 | 0.01789 |


### ```.corr``` : correlation
### ```.cov``` : covariance

In [38]:
print(returns['MSFT'].corr(returns['IBM'])) # correlation
print(returns['MSFT'].cov(returns['IBM'])) # covariance

0.47651617507373006
0.00015257643620763167
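
The two numbers are related: the (Pearson) correlation is the covariance divided by the product of the two sample standard deviations. The sketch below checks this on synthetic data; `x` and `y` are made-up series, not the stock returns, so it runs without a network connection.

```python
import numpy as np
import pandas as pd

# Synthetic, partly correlated series (illustrative assumption, not market data).
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 0.5 * x + pd.Series(rng.normal(size=200))

cov_xy = x.cov(y)     # sample covariance
corr_xy = x.corr(y)   # Pearson correlation

# corr = cov / (std_x * std_y); Series.std uses the same sample normalization.
manual_corr = cov_xy / (x.std() * y.std())
```

This is why the covariance printed above is tiny while the correlation is not: covariance carries the scale of the returns, and correlation normalizes it away.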


#### Pairwise correlation and covariance between variables

In [39]:
returns.corr()

|  | AAPL | IBM | MSFT | GOOG |
| -- | -- | -- | -- | -- |
| AAPL | 1.0 | 0.433089 | 0.756459 | 0.682834 |
| IBM | 0.433089 | 1.0 | 0.476516 | 0.443673 |
| MSFT | 0.756459 | 0.476516 | 1.0 | 0.78689 |
| GOOG | 0.682834 | 0.443673 | 0.78689 | 1.0 |


In [40]:
returns.cov()

|  | AAPL | IBM | MSFT | GOOG |
| -- | -- | -- | -- | -- |
| AAPL | 0.000411 | 0.00015 | 0.000287 | 0.000259 |
| IBM | 0.00015 | 0.000294 | 0.000153 | 0.000142 |
| MSFT | 0.000287 | 0.000153 | 0.000349 | 0.000275 |
| GOOG | 0.000259 | 0.000142 | 0.000275 | 0.000349 |


### ```.corrwith()``` : correlation of each column with another Series or DataFrame

In [43]:
returns.corrwith(returns.IBM)

AAPL    0.433089
IBM     1.000000
MSFT    0.476516
GOOG    0.443673
dtype: float64
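
Passing a DataFrame instead of a Series makes `corrwith` pair up columns by name. A minimal sketch with made-up frames `left` and `right` (hypothetical names, synthetic data):

```python
import numpy as np
import pandas as pd

# Synthetic data; 'right' is built from 'left' so the expected signs are known.
rng = np.random.default_rng(1)
left = pd.DataFrame(rng.normal(size=(100, 2)), columns=['A', 'B'])
right = pd.DataFrame({'A': left['A'] * 2.0,   # scaled copy: correlation near +1
                      'B': -left['B']})       # negated copy: correlation near -1

# Series argument: every column of 'left' is correlated with that one Series.
by_series = left.corrwith(left['A'])

# DataFrame argument: column 'A' with 'A', column 'B' with 'B'.
by_frame = left.corrwith(right)
```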

### Unique values, value counts, and membership

### ```.unique()``` : drops duplicates and returns an array of the unique values, in order of first appearance

In [48]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [49]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

### ```.value_counts()``` : frequency of each value

In [52]:
obj.value_counts() # counts sorted in descending order
pd.value_counts(obj.values, sort = False) # same counts, unsorted; this is the displayed result

c    3
a    3
d    1
b    2
dtype: int64
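
As a small extension, counts can also be returned as relative frequencies. The sketch below reuses the same values as `obj` and relies on the `normalize` parameter of `Series.value_counts`.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()                  # absolute frequencies
shares = obj.value_counts(normalize=True)    # fractions that sum to 1
```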

### ```.isin()``` : boolean mask marking which values of the Series are in a given set

In [55]:
mask = obj.isin(['b', 'c'])
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

### ```Index.get_indexer()``` : for each value in an array, the integer position of that value in an Index of unique values

In [56]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)
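
One detail the example does not show: a value that is absent from the Index is reported as -1 rather than raising an error. A minimal sketch, where `'z'` is a made-up value that is deliberately missing:

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])
to_match = pd.Series(['c', 'a', 'z', 'b'])   # 'z' is not in unique_vals

# Position of each value in unique_vals; -1 marks "not found".
positions = unique_vals.get_indexer(to_match)

# Keep only the values that were actually found.
found = positions != -1
matched = to_match[found]
```

Because -1 is also a valid "last element" position in NumPy indexing, it is worth filtering on `positions != -1` before using the result to index another array.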

In [57]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

|  | Qu1 | Qu2 | Qu3 |
| -- | -- | -- | -- |
| 0 | 1 | 2 | 1 |
| 1 | 3 | 3 | 5 |
| 2 | 4 | 1 | 2 |
| 3 | 3 | 2 | 4 |
| 4 | 4 | 3 | 4 |


In [58]:
result = data.apply(pd.value_counts).fillna(0)
result

|  | Qu1 | Qu2 | Qu3 |
| -- | -- | -- | -- |
| 1 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 2.0 | 1.0 |
| 3 | 2.0 | 2.0 | 0.0 |
| 4 | 2.0 | 0.0 | 2.0 |
| 5 | 0.0 | 0.0 | 1.0 |
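
In the result, the row labels are the distinct values found anywhere in the frame and each cell is how often that value occurs in the column. The variant below is a sketch that calls the `Series.value_counts` method through a lambda instead of the top-level `pd.value_counts`, which recent pandas releases appear to flag as deprecated, and casts the filled counts back to integers.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values per column; values missing from a column become NaN, then 0.
counts = data.apply(lambda col: col.value_counts()).fillna(0).astype(int)
```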
