## 등분산 검정

- 종류
 - F-test : 2개의 집단의 등분산 검정을 실시하며, 각 집단은 정규분포를 따를 때 사용
 - Bartlett's test: 2개 집단의 등분산 검정을 실시하며, 각 집단은 정규분포를 따를 때 사용
 - Levene's test: 2개 집단 이상의 등분산 검정을 실시하며 각 집단은 정규분포를 따를 필요가 없음!

- 가설
 - 귀무 가설: 집단 간 분산은 같다
 - 대립 가설: 집단 간 분산은 다르다
 
- `scipy의 f.cdf()` : f 검정을 실시할 때 사용, F 검정통계량으로 P-value를 산출하는 함수
 - 입력인자: F-검정통계량, 첫번째 데이터의 자유도, 두번째 데이터의 자유도 필요 3가지

In [2]:
import pandas as pd
from scipy.stats import f
from scipy.stats import bartlett
from scipy.stats import levene

In [3]:
df = pd.read_csv('C:/Users/silan/Python/Data/financial_info_10k_persons.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     10000 non-null  int64  
 1   is_attrited            10000 non-null  int64  
 2   Age                    10000 non-null  int64  
 3   Gender                 10000 non-null  object 
 4   Dependent_cnt          10000 non-null  int64  
 5   Edu_level              10000 non-null  object 
 6   Marital_status         10000 non-null  object 
 7   Income                 10000 non-null  object 
 8   Card                   10000 non-null  object 
 9   Period_m               10000 non-null  int64  
 10  Total_rel_cnt          10000 non-null  int64  
 11  Inactive_last_12m      10000 non-null  int64  
 12  Contacts_cnt_last_12m  10000 non-null  int64  
 13  Credit_limit           10000 non-null  float64
 14  Total_trans_amt        10000 non-null  int64  
 15  Tot

In [6]:
df.head(3)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65
2,3,0,57,F,2,Uneducated,Single,$40K - $60K,Silver,36,6,3,0,19482.0,1421,22


In [7]:
ser_M = df.loc[df['Gender']=='M',"Period_m"]
ser_F = df.loc[df['Gender']=='F',"Period_m"]

In [8]:
F = ser_M.var() / ser_F.var()

In [9]:
F

1.040426345317289

In [10]:
result = f.cdf(F, dfd = len(ser_M) - 1, dfn = len(ser_F) - 1)
result

0.9187803061040568

In [11]:
(1 - result) * 2 # p-value

0.16243938779188638

In [12]:
bartlett(ser_F, ser_M)

BartlettResult(statistic=1.9563015878266161, pvalue=0.16190940989253869)

In [13]:
stat, p = bartlett(ser_F, ser_M) # bartlett 검정

In [14]:
stat

1.9563015878266161

In [15]:
stat, p = levene(ser_F, ser_M)
p

0.11651198398605053

In [17]:
round(stat, 4)

2.464

In [20]:
df['Avg_trans_amt'] = df['Total_trans_amt'] / df['Total_trans_cnt']

In [26]:
samp_m = df.loc[df['Gender']=='M', 'Avg_trans_amt']
samp_f = df.loc[df['Gender']=='F', 'Avg_trans_amt']

In [27]:
F = samp_m.var() / samp_f.var()

In [28]:
F

1.6665446172570928

In [31]:
df['Age_g'] = df['Age'] // 10 * 10

In [34]:
bartlett(df.loc[df['Age_g'] == 50, 'Avg_trans_amt'],
         df.loc[df['Age_g'] == 60, 'Avg_trans_amt'],
         df.loc[df['Age_g'] == 70, 'Avg_trans_amt'])

BartlettResult(statistic=10.989031521671865, pvalue=0.004109245841612487)

### 범주형 변수의 독립성 검정 (Chi-squared test)

- 독립성 검정의 특징
1) 2개의 명목형 변수를 대상으로 실시하는 분석
2) 독립 관점에서의 해석과 연관 관점에서의 해석이 존재
3) 연속형 변수의 경우 명목형 변수로 변환 후 실시

- 가설
 - 귀무 가설 (H0): 2개 변수가 서로 독립이다. (연관 X)
 - 대립 가설 (H1): 2개 변수가 서로 독립이 아니다. (연관 O)
 
- **scipy - chi2_contingency**
 - scipy의 독립성 검정을 실시하는 함수
 - 입력: 2개의 명목형 변수의 각 원소의 "빈도" <- crosstab 권장
 - 출력: 검정 통계량, P-value, 자유도, 기대도수 4개의 연산 결과가 튜플로 산출

In [2]:
import pandas as pd
from scipy.stats import chi2_contingency

In [3]:
df = pd.read_csv("C:/Users/silan/Python/Data/financial_info_10k_persons.csv")
df

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65
2,3,0,57,F,2,Uneducated,Single,$40K - $60K,Silver,36,6,3,0,19482.0,1421,22
3,4,0,57,F,2,Doctorate,Single,Less than $40K,Blue,44,2,2,3,9149.0,14401,100
4,5,0,63,F,1,Uneducated,Single,Unknown,Blue,55,4,3,1,16312.0,4366,68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,1,36,M,2,Graduate,Married,$40K - $60K,Blue,18,3,1,3,7758.0,569,23
9996,9997,0,54,M,4,Graduate,Married,$60K - $80K,Blue,36,4,3,3,6905.0,1370,25
9997,9998,0,46,M,3,Uneducated,Single,$60K - $80K,Blue,36,5,1,2,5489.0,3215,64
9998,9999,0,43,M,3,Graduate,Unknown,$40K - $60K,Blue,36,2,3,3,4878.0,5021,84


In [5]:
chi2_contingency(pd.crosstab(df['Gender'], df['Marital_status'])) # crosstab인 자료로 넣어주기

(4.093468963560284,
 0.2515464475739655,
 3,
 array([[ 392.8524, 2462.9028, 2067.3924,  392.8524],
        [ 346.1476, 2170.0972, 1821.6076,  346.1476]]))

In [7]:
stat, p, dof, e_val = chi2_contingency(pd.crosstab(df['Gender'], df['Marital_status']))
print(stat)
print(p)

4.093468963560284
0.2515464475739655


In [8]:
edu_high = df.loc[df["Edu_level"] == "High School"]
edu_high

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65
6,7,0,52,M,4,High School,Married,$80K - $120K,Blue,42,6,1,0,5738.0,1922,48
16,17,0,34,F,4,High School,Married,$40K - $60K,Blue,28,6,1,0,2432.0,4760,69
17,18,0,42,F,2,High School,Married,Less than $40K,Blue,31,3,5,2,3964.0,4580,75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9972,9973,0,36,F,3,High School,Single,Unknown,Blue,23,6,3,4,2804.0,2812,77
9981,9982,0,52,F,2,High School,Single,Less than $40K,Blue,44,6,3,3,2971.0,4458,63
9982,9983,0,41,F,2,High School,Single,$40K - $60K,Blue,36,4,5,3,1561.0,4391,67
9987,9988,0,43,F,3,High School,Married,Less than $40K,Blue,28,3,3,2,2590.0,3391,66


In [12]:
stat, p, dof, e_val = chi2_contingency(pd.crosstab(edu_high['Gender'], edu_high['is_attrited']),
                                      correction = True) # 연속성 t수정 허용하지 x correction = True
print(stat)
print(p)

3.7320501930487042
0.053377838253583124


In [13]:
stat, p, dof, e_val = chi2_contingency(pd.crosstab(df['Gender'], df['Card']),
                                      correction = True) # 연속성 t수정 허용하지 x correction = True
print(stat)
print(p)

66.45702170623164
2.4470625495771945e-14


In [14]:
df['Inactive_last_12m'].unique()

array([2, 3, 1, 6, 5, 4, 0], dtype=int64)

In [18]:
df['is_inactive_last_12m'] = (df['Inactive_last_12m'] >= 3) + 0 # 더미 변수화 해서 파생변수 만들기 

stat, p, dof, e_val = chi2_contingency(pd.crosstab(df['Inactive_last_12m'], df['is_attrited']))

# 3개월 이상일 때 1, 아닐 때 0
print(stat)
print(p)

392.26058849678714
1.2887936694268352e-81


## 시계열 분석

- 평활화(Smoothing): 시계열 데이터를 일련의 정제법을 사용하여 보다 부드럽게 만드는 과정
 - 이동 평균법: 단순이동 평균법 (Simple Moving Average), 가중이동평균법(Weighted Moving Average)
 - 지수 평활법: 단순/이중/삼중 지수 평활법이 있으며 각각 EWMA, WINTERS 등이 있음
 
- Time Series Decomposition: 시계열 분해
 - 일련의 공식을 활용하여 추세, 변동 등 세부 시계열 요소로 분리하는 과정
 - 서울시 지하철 승하차 데이터 정보 활용
 
- **pandas - rolling()**: 단순 이동평균을 수행하는 판다스 메서드
 - window에는 이동평균 대상이 되는 데이터 개수를 지정
 - 뒤에 붙이는 메서드에 따라 각 구간의 연산 결과가 달라짐 (ex. mean())
 - center = True (중심이동 평균) 가능
 
- **pandas - ewm()**: 단순 이동평균을 수행하는 판다스 메서드
 - alpha에는 지수평활계수 입력, rolling과 같이 mean 추가해줘야 함
 
- **statsmodels - seasonal_decompose()**
 - 시계열 분해를 위한 statsmodels의 함수
 - model 인자에 'multiplicative'를 입력하면 승법 모형 적용(기본은 가법모형 - 고전적인 방법)
 - 입력하는 시계열 데이터는 `pandas의 시리즈`, 인덱스:`시간데이터` 필수

In [19]:
subway_df = pd.read_csv("C:/Users/silan/Python/Data/seoul_subway.csv")
subway_df

Unnamed: 0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자
0,20191201,1호선,종각,19093,17141,20191204
1,20191201,1호선,종로3가,19646,17772,20191204
2,20191201,1호선,종로5가,13716,13149,20191204
3,20191201,1호선,동대문,11040,13079,20191204
4,20191201,1호선,신설동,8498,8322,20191204
...,...,...,...,...,...,...
216865,20201130,공항철도 1호선,검암,6292,6142,20201203
216866,20201130,공항철도 1호선,청라국제도시,5772,5315,20201203
216867,20201130,공항철도 1호선,운서,5174,5129,20201203
216868,20201130,공항철도 1호선,공항화물청사,2148,2384,20201203


In [22]:
df_sub = subway_df.loc[(subway_df["노선명"] == "1호선") & (subway_df["역명"] == "종로3가")]
df_sub.head()

Unnamed: 0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자
1,20191201,1호선,종로3가,19646,17772,20191204
593,20191202,1호선,종로3가,34277,32405,20191205
1184,20191203,1호선,종로3가,34957,32832,20191206
1776,20191204,1호선,종로3가,36007,33498,20191207
2426,20191205,1호선,종로3가,35536,33702,20191208


In [24]:
df_sub["MA_5"] = df_sub["승차총승객수"].rolling(window = 5).mean() # 5일이동평균

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub["MA_5"] = df_sub["승차총승객수"].rolling(window = 5).mean()


In [27]:
df_sub["MA_5"].head(10)

1           NaN
593         NaN
1184        NaN
1776        NaN
2426    32084.6
2960    35403.8
3602    36085.8
4142    33656.2
4733    33491.8
5324    33694.8
Name: MA_5, dtype: float64

In [28]:
df_sub["EWMA_01"] = df_sub["승차총승객수"].ewm(alpha = 0.1).mean() # 5일이동평균

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub["EWMA_01"] = df_sub["승차총승객수"].ewm(alpha = 0.1).mean() # 5일이동평균


In [29]:
df_sub['EWMA_01'].head()

1       19646.000000
593     27346.526316
1184    30154.819188
1776    31856.529224
2426    32755.034944
Name: EWMA_01, dtype: float64

In [30]:
from statsmodels.tsa.seasonal import seasonal_decompose

In [32]:
df = subway_df.set_index("사용일자")
df.head()

Unnamed: 0_level_0,노선명,역명,승차총승객수,하차총승객수,등록일자
사용일자,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20191201,1호선,종각,19093,17141,20191204
20191201,1호선,종로3가,19646,17772,20191204
20191201,1호선,종로5가,13716,13149,20191204
20191201,1호선,동대문,11040,13079,20191204
20191201,1호선,신설동,8498,8322,20191204


In [33]:
df_sub = df.loc[(df['노선명'] == '1호선') & (df['역명'] == '종각')]
df_sub.head()

Unnamed: 0_level_0,노선명,역명,승차총승객수,하차총승객수,등록일자
사용일자,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20191201,1호선,종각,19093,17141,20191204
20191202,1호선,종각,48153,46770,20191205
20191203,1호선,종각,49696,47715,20191206
20191204,1호선,종각,49877,48664,20191207
20191205,1호선,종각,51426,49816,20191208


In [37]:
df_sub.reset_index()

Unnamed: 0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자
0,20191201,1호선,종각,19093,17141,20191204
1,20191202,1호선,종각,48153,46770,20191205
2,20191203,1호선,종각,49696,47715,20191206
3,20191204,1호선,종각,49877,48664,20191207
4,20191205,1호선,종각,51426,49816,20191208
...,...,...,...,...,...,...
361,20201126,1호선,종각,30870,30073,20201129
362,20201127,1호선,종각,30432,29121,20201130
363,20201128,1호선,종각,12280,11753,20201201
364,20201129,1호선,종각,7749,7256,20201202


In [41]:
df_sub = df.loc[(df['노선명'] == '3호선') & (df['역명'] == '신사')]
df_sub.head()

Unnamed: 0_level_0,노선명,역명,승차총승객수,하차총승객수,등록일자
사용일자,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20191201,3호선,신사,18120,18435,20191204
20191202,3호선,신사,34918,37450,20191205
20191203,3호선,신사,36095,38205,20191206
20191204,3호선,신사,35617,38286,20191207
20191205,3호선,신사,35749,38480,20191208


In [42]:
df_sub = df_sub.reset_index()

In [43]:
df_sub['date'] = pd.to_datetime(df_sub['사용일자'], format = "%Y%m%d")
df_sub

Unnamed: 0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자,date
0,20191201,3호선,신사,18120,18435,20191204,2019-12-01
1,20191202,3호선,신사,34918,37450,20191205,2019-12-02
2,20191203,3호선,신사,36095,38205,20191206,2019-12-03
3,20191204,3호선,신사,35617,38286,20191207,2019-12-04
4,20191205,3호선,신사,35749,38480,20191208,2019-12-05
...,...,...,...,...,...,...,...
361,20201126,3호선,신사,27025,28089,20201129,2020-11-26
362,20201127,3호선,신사,27700,28677,20201130,2020-11-27
363,20201128,3호선,신사,17209,17574,20201201,2020-11-28
364,20201129,3호선,신사,8714,8898,20201202,2020-11-29


In [45]:
df_sub = df_sub.set_index(df_sub['date'])

In [46]:
df_sub

Unnamed: 0_level_0,사용일자,노선명,역명,승차총승객수,하차총승객수,등록일자,date
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2019-12-01,20191201,3호선,신사,18120,18435,20191204,2019-12-01
2019-12-02,20191202,3호선,신사,34918,37450,20191205,2019-12-02
2019-12-03,20191203,3호선,신사,36095,38205,20191206,2019-12-03
2019-12-04,20191204,3호선,신사,35617,38286,20191207,2019-12-04
2019-12-05,20191205,3호선,신사,35749,38480,20191208,2019-12-05
...,...,...,...,...,...,...,...
2020-11-26,20201126,3호선,신사,27025,28089,20201129,2020-11-26
2020-11-27,20201127,3호선,신사,27700,28677,20201130,2020-11-27
2020-11-28,20201128,3호선,신사,17209,17574,20201201,2020-11-28
2020-11-29,20201129,3호선,신사,8714,8898,20201202,2020-11-29


In [48]:
td = seasonal_decompose(df_sub["승차총승객수"],
                       model = "addictive",
                       extrapolate_trend = 1)

In [50]:
td.trend.tail()

date
2020-11-26    23306.571429
2020-11-27    22918.285714
2020-11-28    22202.571429
2020-11-29    21650.571429
2020-11-30    21098.571429
Name: trend, dtype: float64