# Keywords
## F-test, Bartlett's test, Levene's test, f.cdf, bartlett, levene

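## Types of `equal-variance` tests
- F-test: tests equality of variances for `'two groups'`; use it `'when each group follows a normal distribution'`
- Bartlett's test: tests equality of variances for `'two or more groups'`; use it `'when each group follows a normal distribution'`
 - Commonly used alongside ANOVA, which assumes equal variances
- Levene's test: tests equality of variances for `'two or more groups'`; the groups `'do not need to follow a normal distribution'`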

### Hypotheses
- Null hypothesis: the `variances` of the groups are equal
- Alternative hypothesis: the `variances` of the groups differ

---
## F-test: `f.cdf()`
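- The scipy function used when carrying out an F-test
- Takes an F test statistic and returns the cumulative probability of the F distribution, from which the p-value is computed
- Requires the F test statistic, the degrees of freedom of the first (numerator) sample, and the degrees of freedom of the second (denominator) sample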
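
For reference, the quantities behind this test can be written out as below. This is the standard two-sample variance-ratio test; the symbols $s_1^2$, $s_2^2$, $n_1$, $n_2$ and $G$ are notation introduced here for illustration and are not scipy names.

$$
F_{\text{obs}} = \frac{s_1^2}{s_2^2}, \qquad
F_{\text{obs}} \sim F(n_1 - 1,\; n_2 - 1) \ \text{under } H_0, \qquad
p = 2 \min\bigl\{ G(F_{\text{obs}}),\ 1 - G(F_{\text{obs}}) \bigr\}
$$

Here $s_1^2$ and $s_2^2$ are the sample variances of the numerator and denominator groups, $n_1$ and $n_2$ are their sizes, and $G$ is the CDF of $F(n_1 - 1,\; n_2 - 1)$, i.e. the value `f.cdf` returns when given the statistic, $n_1 - 1$, and $n_2 - 1$ in that order.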

In [1]:
import pandas as pd
from scipy.stats import f
from scipy.stats import bartlett
from scipy.stats import levene

In [2]:
df = pd.read_csv("financial_info_10k_persons.csv")
df.head(2)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65


Try an equal-variance test on Period_m

In [3]:
ser_M = df.loc[df["Gender"] == "M", "Period_m"] # split Period_m into male and female groups
ser_F = df.loc[df["Gender"] == "F", "Period_m"]

In [4]:
F = ser_M.var() / ser_F.var() # ratio of the two group variances
F # test statistic

1.040426345317289

In [5]:
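# f.cdf(x, dfn, dfd): dfn = numerator degrees of freedom, dfd = denominator degrees of freedom.
# Because F = var(M) / var(F), the matching choice is dfn = len(ser_M) - 1 and dfd = len(ser_F) - 1.
# The call below, which produced the output shown, assigns them the other way round and omits the -1 for ser_F.
result = f.cdf(F, dfd = len(ser_M)-1, dfn = len(ser_F))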
result

0.9187893064992898

In [31]:
p = (1 - result) * 2 # two-sided p-value from the upper-tail probability (appropriate here because F > 1)
p

0.1624213870014204

- The F-test is cumbersome: every step (variance ratio, degrees of freedom, p-value) has to be done by hand
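
The manual steps (variance ratio, `f.cdf`, doubling the tail probability) can be wrapped in a small helper. The following is a minimal sketch: `f_test_two_sided` is a hypothetical name, not a scipy function, and the two samples are synthetic stand-ins because the CSV is not reproduced here.

```python
import numpy as np
from scipy.stats import f


def f_test_two_sided(x, y):
    """Two-sided F-test for equal variances of two samples (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    stat = x.var(ddof=1) / y.var(ddof=1)   # sample variances, like pandas .var()
    dfn, dfd = len(x) - 1, len(y) - 1      # numerator df first, then denominator df
    cdf = f.cdf(stat, dfn, dfd)
    p_value = 2 * min(cdf, 1 - cdf)        # works whether stat is above or below 1
    return stat, p_value


rng = np.random.default_rng(0)             # synthetic data, not the CSV used in this note
a = rng.normal(loc=36, scale=8.0, size=500)
b = rng.normal(loc=36, scale=8.0, size=480)
stat, p_value = f_test_two_sided(a, b)
print(stat, p_value)
```

Taking the smaller of the two tails before doubling means the helper does not depend on which group happens to have the larger variance.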

## Bartlett's test - `bartlett()`
- The scipy function used to carry out Bartlett's test
- Pass the groups whose variances are to be compared as arguments

In [7]:
bartlett(ser_F, ser_M)

BartlettResult(statistic=1.9563015878266161, pvalue=0.16190940989253869)
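
To see what `bartlett()` computes, the textbook Bartlett statistic can be reproduced by hand and compared with scipy's result. This is a sketch on synthetic groups (the group sizes and spreads are arbitrary assumptions), not the data from this note.

```python
import numpy as np
from scipy.stats import bartlett, chi2

rng = np.random.default_rng(2)
# Three synthetic normal groups with (standard deviation, size) chosen arbitrarily
groups = [rng.normal(0, s, size=n) for s, n in [(1.0, 40), (1.2, 55), (0.9, 35)]]

k = len(groups)
n_i = np.array([len(g) for g in groups])
var_i = np.array([np.var(g, ddof=1) for g in groups])   # sample variances
N = n_i.sum()

pooled = np.sum((n_i - 1) * var_i) / (N - k)             # pooled variance
numer = (N - k) * np.log(pooled) - np.sum((n_i - 1) * np.log(var_i))
denom = 1 + (np.sum(1 / (n_i - 1)) - 1 / (N - k)) / (3 * (k - 1))
stat = numer / denom
p_value = chi2.sf(stat, k - 1)                           # chi-square with k - 1 df

ref = bartlett(*groups)
print(stat, p_value)
print(ref.statistic, ref.pvalue)
```

The two printed lines should agree up to floating-point rounding.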

## Levene's test - `levene()`
- The scipy function used to carry out Levene's test
- Pass the groups whose variances are to be compared as arguments

In [8]:
levene(ser_F, ser_M)

LeveneResult(statistic=2.4640198991740747, pvalue=0.11651198398605053)
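
Levene's test does not rely on normality because it amounts to a one-way ANOVA on the absolute deviations of each observation from its group center. The sketch below checks this on synthetic skewed samples (an assumption for illustration) and also shows the `center` parameter, whose default is `'median'`.

```python
import numpy as np
from scipy.stats import levene, f_oneway

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=200)   # skewed synthetic samples
y = rng.exponential(scale=1.5, size=220)

res_median = levene(x, y)                  # default: center='median'
res_mean = levene(x, y, center='mean')     # deviations taken from the group mean instead

# One-way ANOVA on absolute deviations from each group's median
dev_x = np.abs(x - np.median(x))
dev_y = np.abs(y - np.median(y))
res_anova = f_oneway(dev_x, dev_y)

print(res_median.statistic, res_median.pvalue)
print(res_anova.statistic, res_anova.pvalue)
print(res_mean.statistic, res_mean.pvalue)
```

The first two printed lines should match up to rounding; the third can differ because it uses a different center.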

## 1. Compare the variances of the average amount per transaction between men and women. What is the resulting test statistic?
- Use the file financial_info_10k_persons.csv
- Use the F-test

In [9]:
df = pd.read_csv("financial_info_10k_persons.csv")
df.head(2)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65


In [10]:
df["trans_1_mean"] = df["Total_trans_amt"] / df["Total_trans_cnt"]
df.head(2)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt,trans_1_mean
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67,62.432836
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65,63.707692


In [11]:
samp_m = df.loc[df["Gender"] == "M", "trans_1_mean"]
samp_f = df.loc[df["Gender"] == "F", "trans_1_mean"]
F = samp_m.var() / samp_f.var() # the ratio of the two variances is the test statistic
print(F)

1.6665446172570928


In [12]:
samp_f.var() / samp_m.var() # reciprocal ratio; either order is valid, but conventionally the larger variance goes in the numerator so that F >= 1

0.6000439410052308
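
On the question of which group goes in the numerator: the two ratios are reciprocals of each other, and as long as the degrees of freedom follow the numerator and denominator, the two-sided p-value is the same either way. A small sketch on synthetic samples (the `two_sided_p` helper and the data are assumptions for illustration):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
m = rng.normal(60, 11, size=400)   # synthetic stand-in for the male group
w = rng.normal(60, 10, size=450)   # synthetic stand-in for the female group


def two_sided_p(num, den):
    """Variance ratio num/den and its two-sided p-value (hypothetical helper)."""
    stat = np.var(num, ddof=1) / np.var(den, ddof=1)
    cdf = f.cdf(stat, len(num) - 1, len(den) - 1)   # dfn from num, dfd from den
    return stat, 2 * min(cdf, 1 - cdf)


stat_mw, p_mw = two_sided_p(m, w)
stat_wm, p_wm = two_sided_p(w, m)
print(stat_mw, stat_wm, stat_mw * stat_wm)   # the product of the two ratios is 1
print(p_mw, p_wm)                            # same p-value in both directions
```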

## 2. Compare the variances of the average amount per transaction across people in their 50s, 60s, and 70s. What is the resulting p-value?
- Use the file financial_info_10k_persons.csv
- Use Bartlett's test

In [13]:
df["Age"]

0       41
1       38
2       57
3       57
4       63
        ..
9995    36
9996    54
9997    46
9998    43
9999    45
Name: Age, Length: 10000, dtype: int64

In [14]:
samp_50 = df.loc[(df["Age"] >= 50) & (df["Age"] < 60), "trans_1_mean"]
samp_60 = df.loc[(df["Age"] >= 60) & (df["Age"] < 70), "trans_1_mean"]
samp_70 = df.loc[(df["Age"] >= 70) & (df["Age"] < 80), "trans_1_mean"]

In [15]:
bartlett(samp_50, samp_60, samp_70)

BartlettResult(statistic=10.989031521671865, pvalue=0.004109245841612487)

Answer

In [16]:
# solve it by creating a derived variable (age group in decades)
df["Age_g"] = (df["Age"] // 10) * 10
df.head(2)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt,trans_1_mean,Age_g
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67,62.432836,40
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65,63.707692,30


In [17]:
bartlett(df.loc[df["Age_g"] == 50, "trans_1_mean"],
         df.loc[df["Age_g"] == 60, "trans_1_mean"],
         df.loc[df["Age_g"] == 70, "trans_1_mean"])

BartlettResult(statistic=10.989031521671865, pvalue=0.004109245841612487)
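
With a derived group column in place, the per-group samples can also be collected with `groupby` instead of writing one `.loc` selection per group. A sketch on a synthetic frame that only mimics the `Age`, `Age_g`, and `trans_1_mean` columns (the values are random, not the CSV):

```python
import numpy as np
import pandas as pd
from scipy.stats import bartlett

rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "Age": rng.integers(30, 80, size=1000),           # synthetic ages 30..79
    "trans_1_mean": rng.normal(60, 12, size=1000),    # synthetic amounts
})
demo["Age_g"] = (demo["Age"] // 10) * 10              # decade of each age

# Keep only the decades of interest, then take one array per decade
target = demo[demo["Age_g"].isin([50, 60, 70])]
groups = [g.to_numpy() for _, g in target.groupby("Age_g")["trans_1_mean"]]
res_groupby = bartlett(*groups)

# Same test with one explicit selection per decade
res_manual = bartlett(*[demo.loc[demo["Age_g"] == a, "trans_1_mean"].to_numpy()
                        for a in (50, 60, 70)])
print(res_groupby.pvalue, res_manual.pvalue)
```

Unpacking a list with `*` keeps the call the same length no matter how many groups there are.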

## 3. Among men with no dependents, compare the variances of the average amount per transaction across education levels. What is the resulting p-value?
- Use the file financial_info_10k_persons.csv
- Use Levene's test

In [18]:
df.head(2)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt,trans_1_mean,Age_g
0,1,0,41,F,2,High School,Married,Less than $40K,Blue,36,6,2,2,4953.0,4183,67,62.432836,40
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65,63.707692,30


In [19]:
df["Edu_level"].unique()

array(['High School', 'Uneducated', 'Doctorate', 'Unknown', 'Graduate',
       'Post-Graduate', 'College'], dtype=object)

In [20]:
df2 = df.loc[(df["Dependent_cnt"] == 0) & (df["Gender"] == "M"), ["trans_1_mean", "Edu_level"]]

In [21]:
df["Dependent_cnt"]

0       2
1       0
2       2
3       2
4       1
       ..
9995    2
9996    4
9997    3
9998    3
9999    3
Name: Dependent_cnt, Length: 10000, dtype: int64

In [22]:
df3 = df2.reset_index(drop=True) # renumber rows from 0 and discard the old index
df3

Unnamed: 0,trans_1_mean,Edu_level
0,63.707692,High School
1,61.153846,College
2,57.931507,High School
3,38.628205,Unknown
4,43.208333,Unknown
...,...,...
408,49.103448,High School
409,49.928571,Uneducated
410,64.476923,Unknown
411,50.766667,Graduate


In [23]:
df3["Edu_level"].unique()

array(['High School', 'College', 'Unknown', 'Graduate', 'Uneducated',
       'Doctorate', 'Post-Graduate'], dtype=object)

In [24]:
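# First attempt: each argument is a boolean mask over all rows (True/False), not the trans_1_mean values
# of that education level, so the result below does not compare transaction-amount variances.
levene(df3["Edu_level"] == 'High School', df3["Edu_level"] == 'College', df3["Edu_level"] == 'Unknown',
       df3["Edu_level"] == 'Graduate', df3["Edu_level"] == 'Uneducated', df3["Edu_level"] == 'Doctorate',
       df3["Edu_level"] == 'Post-Graduate')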

LeveneResult(statistic=31.134743302162533, pvalue=1.9503299244555154e-36)
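
A comparison like `df["Edu_level"] == "College"` yields a True/False value for every row, so passing such masks to `levene()` compares the spread of 0/1 indicators (which depends on group proportions) rather than the spread of the amounts. The sketch below illustrates the difference on a synthetic frame; the column names mirror this note, but the data and group proportions are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

rng = np.random.default_rng(5)
demo = pd.DataFrame({
    "Edu_level": rng.choice(["High School", "College", "Graduate"],
                            size=300, p=[0.6, 0.3, 0.1]),
    "trans_1_mean": rng.normal(60, 10, size=300),   # same spread in every group
})

masks = [demo["Edu_level"] == lv for lv in demo["Edu_level"].unique()]
values = [demo.loc[m, "trans_1_mean"].to_numpy() for m in masks]

print([len(m) for m in masks])    # every mask covers all 300 rows
print([len(v) for v in values])   # actual group sizes

# Cast the masks to 0/1 floats to make explicit what they contain
res_masks = levene(*[m.to_numpy().astype(float) for m in masks])
res_values = levene(*values)      # compares the spread of the amounts
print(res_masks.pvalue, res_values.pvalue)
```

The mask-based call reports a very small p-value even though the amounts were generated with identical spread in every group.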

Answer

In [25]:
df_sub = df.loc[(df["Dependent_cnt"] == 0) & (df["Gender"] == "M")]
df_sub.head(2)

Unnamed: 0,ID,is_attrited,Age,Gender,Dependent_cnt,Edu_level,Marital_status,Income,Card,Period_m,Total_rel_cnt,Inactive_last_12m,Contacts_cnt_last_12m,Credit_limit,Total_trans_amt,Total_trans_cnt,trans_1_mean,Age_g
1,2,0,38,M,0,High School,Single,$80K - $120K,Blue,29,3,3,2,5983.0,4141,65,63.707692,30
26,27,0,32,M,0,College,Unknown,Less than $40K,Blue,36,3,3,4,3788.0,3975,65,61.153846,30


In [26]:
len(df_sub)

413

In [27]:
df_sub["Edu_level"].unique()

array(['High School', 'College', 'Unknown', 'Graduate', 'Uneducated',
       'Doctorate', 'Post-Graduate'], dtype=object)

In [28]:
df_sub["Edu_level"].unique()[6]

'Post-Graduate'

In [29]:
levels = df_sub["Edu_level"].unique() # the seven education levels, in order of first appearance
levene(*[df_sub.loc[df_sub["Edu_level"] == lv, "trans_1_mean"] for lv in levels]) # one sample of amounts per level

LeveneResult(statistic=0.8832361640792544, pvalue=0.5070685402777693)