# Hypothesis Testing: t-test
## 01 t-test Overview

### One-sample t-test
- Applies to a single sample drawn from a single population
- Tests whether the sample mean differs from a hypothesized population mean
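As a minimal sketch of what the one-sample test computes, the statistic can be written by hand as t = (sample mean - hypothesized mean) / (s / sqrt(n)) and compared with scipy. The sample values and the hypothesized mean `mu0` below are made up for illustration and are not taken from the course data.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample values, used only for illustration
sample = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0])
mu0 = 4.5  # hypothesized population mean (assumed value)

n = sample.size
# t = (sample mean - mu0) / (sample standard deviation / sqrt(n))
# ddof=1 gives the unbiased sample standard deviation
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# scipy returns the same statistic together with a two-sided p-value
t_scipy, p_value = ttest_1samp(sample, popmean=mu0)

print(round(float(t_manual), 3), round(float(t_scipy), 3))
```

The two printed statistics should agree up to floating-point error.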

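### Paired-sample t-test
- Applies to two paired samples drawn from the same population (common in medical statistics, e.g. comparing measurements before and after taking a drug)
- If the paired differences do not satisfy normality, use the Wilcoxon signed-rank test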
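The paired test is equivalent to a one-sample t-test on the pairwise differences against a mean of 0. The sketch below checks that equivalence and also runs the Wilcoxon signed-rank test as the non-parametric alternative. The before/after values are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp, wilcoxon

# Hypothetical before/after measurements on the same 8 subjects
before = np.array([142.0, 138.0, 150.0, 145.0, 160.0, 155.0, 148.0, 152.0])
after = np.array([135.0, 136.0, 144.0, 146.0, 150.0, 150.0, 140.0, 148.0])

# Paired t-test on the two aligned samples
t_paired, p_paired = ttest_rel(before, after)

# Same test expressed as a one-sample t-test on the differences
t_diff, p_diff = ttest_1samp(before - after, popmean=0)

# Non-parametric alternative when the differences are not normal
w_stat, p_wilcoxon = wilcoxon(before, after)

print(round(float(t_paired), 3), round(float(t_diff), 3))
```

Both samples must have the same length and the same row order, because each value is compared with its own partner.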

### Independent two-sample t-test
- Applies to two independent samples
- The formula for the test statistic depends on whether equal variances are assumed
- If the samples do not satisfy normality, use the Wilcoxon rank-sum test
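To illustrate how the statistic changes with the variance assumption, the sketch below computes the pooled-variance (Student) and unpooled (Welch) statistics by hand and compares them with `ttest_ind`. It also calls `mannwhitneyu`, which is the scipy function for the Wilcoxon rank-sum (Mann-Whitney U) test. The two groups are randomly generated for illustration only.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical groups with different sizes and spreads
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=3.0, size=50)

na, nb = group_a.size, group_b.size
va, vb = group_a.var(ddof=1), group_b.var(ddof=1)
diff = group_a.mean() - group_b.mean()

# Equal variances assumed: pooled variance in the denominator
pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
t_pooled_manual = diff / np.sqrt(pooled_var * (1 / na + 1 / nb))

# Equal variances not assumed (Welch): each variance scaled by its own n
t_welch_manual = diff / np.sqrt(va / na + vb / nb)

t_pooled, p_pooled = ttest_ind(group_a, group_b, equal_var=True)
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Rank-based alternative for non-normal samples
u_stat, p_mw = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(round(float(t_pooled), 3), round(float(t_welch), 3))
```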

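### Hypotheses
- Null hypothesis (H0): the two group means are equal (for a one-sample test, the population mean equals the hypothesized value).
- Alternative hypothesis (H1): the two group means are not equal.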
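In practice the hypotheses are usually resolved by comparing the p-value with a significance level. The sketch below assumes a significance level of 0.05, which is a common convention rather than something specified in these notes, and uses made-up data.

```python
from scipy.stats import ttest_ind

alpha = 0.05  # assumed significance level

# Hypothetical, clearly separated groups
group_a = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8]
group_b = [6.9, 7.1, 7.0, 6.8, 7.2, 7.0]

stat, p = ttest_ind(group_a, group_b)

# Reject H0 only when the p-value falls below the significance level
decision = "reject H0" if p < alpha else "fail to reject H0"
print(decision)
```

A p-value at or above the significance level means the data give no evidence against H0. It does not prove that the means are equal.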

## 02 Main Functions and Methods
### scipy - ttest_1samp()
- Function for running a one-sample t-test
- The hypothesized population mean is passed via the popmean argument

In [1]:
import pandas as pd

In [2]:
from scipy.stats import ttest_1samp

In [5]:
df = pd.read_csv("강의자료/실습파일/iris.csv")
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [6]:
ttest_1samp(df["Sepal.Length"], popmean = 4) # test statistic, p-value

Ttest_1sampResult(statistic=27.263680640799215, pvalue=8.764592435410748e-60)

In [8]:
result = ttest_1samp(df["Sepal.Length"], popmean = 4)
result[0].round(3)

27.264

In [10]:
stat, p = ttest_1samp(df["Sepal.Length"], popmean = 4)
print(stat.round(3))
print(p.round(3))

27.264
0.0


In [11]:
df["Sepal.Length"].mean()

5.843333333333335

In [15]:
stat, p = ttest_1samp(df["Sepal.Length"], popmean = 5.8433) # as popmean approaches the sample mean, the statistic approaches 0 and the p-value approaches 1
print(stat.round(3))
print(p.round(3))

0.0
1.0
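The same behavior can be seen on any sample by moving popmean step by step toward the sample mean. This sketch uses hypothetical values rather than the iris data.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample values, used only for illustration
sample = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0])
sample_mean = sample.mean()

results = []
# Shrink the gap between popmean and the sample mean down to zero
for offset in (0.5, 0.1, 0.01, 0.0):
    stat, p = ttest_1samp(sample, popmean=sample_mean - offset)
    results.append((offset, float(stat), float(p)))
    print(offset, round(float(stat), 3), round(float(p), 3))
```

The statistic shrinks as the offset shrinks, and the p-value rises toward 1.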


### scipy - ttest_rel()
- Function for running a paired-sample t-test
- Pass the two variables to be tested as the first two arguments, in order

In [17]:
from scipy.stats import ttest_rel

In [18]:
stat, p = ttest_rel(df["Sepal.Length"], df["Sepal.Width"])
print(stat.round(3))
print(p.round(3))

34.815
0.0


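### scipy - ttest_ind()
- Function for running an independent two-sample t-test
- Pass the two variables to be tested as the first two arguments, in order
- If the equal-variance assumption holds, set equal_var to True (the default); otherwise set it to False to run Welch's t-test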
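One common way to choose a value for equal_var is to run a test of equal variances first. The sketch below uses `scipy.stats.levene` with a 0.05 cutoff. Both the choice of Levene's test and the cutoff are assumptions for illustration, not something these notes prescribe, and the data are randomly generated.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Hypothetical groups with clearly different spreads
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=40)
group_b = rng.normal(loc=0.0, scale=4.0, size=40)

# Levene's test: H0 is that the groups have equal variances
lev_stat, lev_p = levene(group_a, group_b)

# Keep the equal-variance assumption only if Levene's test does not reject it
equal_var = bool(lev_p >= 0.05)

stat, p = ttest_ind(group_a, group_b, equal_var=equal_var)
print(equal_var, round(float(stat), 3), round(float(p), 3))
```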

In [19]:
from scipy.stats import ttest_ind

In [20]:
df["Species"].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [21]:
stat, p = ttest_ind(df.loc[df["Species"] == "setosa", "Petal.Length"],
                   df.loc[df["Species"] == "versicolor", "Petal.Length"])
print(stat.round(3))
print(p.round(3))

-39.493
0.0


## Q1 The mean temperature of the region where the data were collected is said to be 20 degrees. What is the p-value of a two-sided test on the collected data?

In [22]:
Q1 = pd.read_csv("강의자료/실습파일/bike.csv")
Q1.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [27]:
from scipy.stats import ttest_1samp
ttest_1samp(Q1["temp"], popmean = 20)[1].round(3)

0.002
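`ttest_1samp` runs a two-sided test by default. In SciPy 1.6 and later the direction can be set explicitly with the `alternative` argument. The sketch below shows how the two-sided p-value relates to the one-sided ones. The temperature values are hypothetical and are not taken from bike.csv.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical temperature readings, used only for illustration
temps = np.array([18.2, 21.5, 19.8, 22.1, 20.4, 23.0, 19.1, 21.7])

# Two-sided test (the default): H1 is mean != 20
p_two = ttest_1samp(temps, popmean=20, alternative="two-sided").pvalue

# One-sided tests: H1 is mean < 20, and H1 is mean > 20
p_less = ttest_1samp(temps, popmean=20, alternative="less").pvalue
p_greater = ttest_1samp(temps, popmean=20, alternative="greater").pvalue

# The two-sided p-value is twice the smaller one-sided p-value
print(round(float(p_two), 3), round(float(2 * min(p_less, p_greater)), 3))
```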

## Q2 For the January 2011 data, what is the test statistic when testing the difference in means between casual and registered in the same hour?

In [39]:
Q2 = pd.read_csv("강의자료/실습파일/bike.csv")
Q2

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


In [40]:
Q2["datetime"] = pd.to_datetime(Q2["datetime"])
Q2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB


In [41]:
Q2["year"] = Q2["datetime"].dt.year
Q2["month"] = Q2["datetime"].dt.month
Q2.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,year,month
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,2011,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,2011,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2011,1
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,2011,1
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,2011,1


In [45]:
from scipy.stats import ttest_rel
stat, p = ttest_rel(Q2.loc[(Q2["year"] == 2011) & (Q2["month"] == 1), "casual"],
                   Q2.loc[(Q2["year"] == 2011) & (Q2["month"] == 1), "registered"])
abs(stat.round(0))

21.0
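The same filter can also be built once as a boolean mask with the `.dt` accessor, without adding year and month columns. This is a sketch on a small hypothetical frame that only mimics the bike.csv column names.

```python
import pandas as pd
from scipy.stats import ttest_rel

# Hypothetical rows mimicking the datetime, casual and registered columns
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00:00", "2011-01-01 01:00:00", "2011-01-02 00:00:00",
        "2011-01-03 05:00:00", "2011-02-01 00:00:00", "2012-01-01 00:00:00",
    ]),
    "casual": [3, 8, 5, 2, 10, 7],
    "registered": [13, 32, 27, 11, 40, 90],
})

# Build the January 2011 mask once and reuse it for both columns
mask = (demo["datetime"].dt.year == 2011) & (demo["datetime"].dt.month == 1)
jan_2011 = demo.loc[mask]

# Rows stay aligned, so each casual value is paired with its own registered value
stat, p = ttest_rel(jan_2011["casual"], jan_2011["registered"])
print(len(jan_2011), round(float(stat), 3))
```

Selecting both columns from one filtered frame keeps the pairs aligned by row.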

## Q3 What is the test statistic when testing the difference in mean registered between weekdays and weekends?

In [46]:
Q3 = pd.read_csv("강의자료/실습파일/bike.csv")
Q3

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


In [47]:
Q3["datetime"] = pd.to_datetime(Q3["datetime"])
Q3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB


In [51]:
Q3["weekday"] = Q3["datetime"].dt.weekday
Q3["weekend"] = (Q3["weekday"] >= 5) + 0
Q3

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,day,weekday,weekend
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,5,5,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,5,5,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,5,5,1
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,5,5,1
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,5,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,2,2,0
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,2,2,0
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,2,2,0
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,2,2,0
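The weekend flag depends on how pandas numbers the days: `.dt.weekday` returns 0 for Monday through 6 for Sunday, so values of 5 or more are Saturday and Sunday. A small sketch on a few dates follows. The first and last dates appear in the output above, while the other two are added for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2011-01-01", "2011-01-02", "2011-01-03", "2012-12-19"]
))

# Monday=0, ..., Saturday=5, Sunday=6
weekday = dates.dt.weekday

# astype(int) turns the boolean comparison into a 0/1 flag
weekend = (weekday >= 5).astype(int)

print(weekday.tolist())  # [5, 6, 0, 2]
print(weekend.tolist())  # [1, 1, 0, 0]
```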


In [54]:
from scipy.stats import ttest_ind
stat, p = ttest_ind(Q3.loc[Q3["weekend"] == 1, "registered"],
                   Q3.loc[Q3["weekend"] == 0, "registered"])
abs(stat.round(0))

12.0