## Comparing means with scipy: ttest_1samp, ttest_rel, ttest_ind

## One-sample t-test
- Applies to a single sample drawn from one population
- Tests whether the sample mean differs from a hypothesized population mean
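
As a minimal sketch of what this test computes, the statistic is the gap between the sample mean and the hypothesized mean, divided by the standard error of the mean. The helper name `one_sample_t` and the five measurements are illustrative assumptions, not part of the notebook.

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return the one-sample t statistic for `sample` against `popmean`."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (sample_mean - popmean) / standard_error


# Hypothetical measurements, not read from iris.csv
values = [5.1, 4.9, 4.7, 4.6, 5.0]
t_stat = one_sample_t(values, popmean=4.5)
print(round(t_stat, 3))
```

A statistic far from 0 means the sample mean sits many standard errors away from the hypothesized mean.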

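## Paired-sample t-test
- Applies to two sets of measurements taken on the same subjects, so each value in one set is paired with a value in the other
- If the paired differences are not normally distributed, use the Wilcoxon signed-rank test instead
- e.g. comparing one person's blood sugar before and after taking insulin medication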
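
One way to see why pairing matters: the paired statistic is a one-sample t statistic computed on the per-subject differences, tested against 0. The sketch below uses invented blood-sugar readings, so the numbers are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical blood-sugar readings for the same five people
before = [150, 160, 145, 170, 155]
after = [140, 152, 143, 160, 150]

# A paired test works on the per-subject differences
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(n)

# Same formula as a one-sample t-test of the differences against 0
t_stat = mean_diff / se_diff
print(diffs, round(t_stat, 3))
```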

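## Independent two-sample t-test (the most commonly used)
- Applies to two independent sample groups
- Unlike the two t-tests above, the formula for the test statistic depends on whether the two groups have equal variances
- If the samples are not normally distributed, use the Wilcoxon rank-sum test instead
- e.g. comparing department A and department B, two unrelated and independent groups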
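
The dependence on the equal-variance assumption can be sketched by computing both versions of the statistic by hand. The pooled form corresponds to `equal_var=True` in `ttest_ind()`, and the Welch form to `equal_var=False`. The helper names `pooled_t` and `welch_t` and the two score lists are hypothetical.

```python
import math
import statistics


def pooled_t(a, b):
    """Student's t statistic, assuming both groups share one variance."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se


def welch_t(a, b):
    """Welch's t statistic, which lets each group keep its own variance."""
    na, nb = len(a), len(b)
    se = math.sqrt(statistics.variance(a) / na + statistics.variance(b) / nb)
    return (statistics.mean(a) - statistics.mean(b)) / se


# Hypothetical scores for two unrelated groups of different sizes
group_a = [70, 72, 68, 71, 69]
group_b = [60, 80, 65, 85, 75, 90, 55]
print(round(pooled_t(group_a, group_b), 3))
print(round(welch_t(group_a, group_b), 3))
```

When the two groups have the same size, the two statistics coincide. They diverge when both the sizes and the variances differ, as in this example.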

### Hypotheses
- Null hypothesis: the two group means are equal
- Alternative hypothesis: the two group means are not equal

---
## scipy
### ttest_1samp()
- Function for running a one-sample t-test
- Pass the hypothesized population mean through the `popmean` argument

### ttest_rel() - rel is short for "related"
- Function for running a paired-sample t-test
- Pass the two variables to compare, in order

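### ttest_ind() - ind is short for "independent"
- Function for running an independent two-sample t-test
- Pass the two variables to compare, in order
- `equal_var` defaults to True (equal variances assumed); set it to False when that assumption does not hold, which runs Welch's t-test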

In [43]:
import pandas as pd
from scipy.stats import ttest_1samp
from scipy.stats import ttest_rel
from scipy.stats import ttest_ind  # tip: type "tt" and press Tab to autocomplete instead of memorizing the names

### ttest_1samp()

In [44]:
df = pd.read_csv("iris.csv")
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [45]:
ttest_1samp(df["Sepal.Length"], popmean=4)  # popmean is an arbitrary value for now

Ttest_1sampResult(statistic=27.263680640799215, pvalue=8.764592435410748e-60)

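- statistic = the test statistic (t value)
- pvalue = the two-sided p-value: the probability, if the null hypothesis is true, of a test statistic at least this extreme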
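
The link between the two numbers can be sketched with the t distribution from scipy. This assumes iris.csv has the usual 150 rows, which the notebook does not show, so the degrees of freedom here are an assumption.

```python
from scipy.stats import t

# Statistic copied from the ttest_1samp output above
stat = 27.263680640799215
n = 150          # assumed row count of iris.csv
dof = n - 1      # degrees of freedom for a one-sample t-test

# Two-sided p-value: probability mass in both tails beyond |stat|
p_value = 2 * t.sf(abs(stat), dof)
print(p_value)
```

`t.sf` is the survival function (1 minus the CDF), so doubling it covers both tails of the distribution.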

In [46]:
result = ttest_1samp(df["Sepal.Length"], popmean=4)  # store the result in a variable so the numbers are easier to work with
result

Ttest_1sampResult(statistic=27.263680640799215, pvalue=8.764592435410748e-60)

In [47]:
result[0]

27.263680640799215

In [48]:
round(result[0], 3)

27.264

In [49]:
stat, p = ttest_1samp(df["Sepal.Length"], popmean=4)  # unpack both values in one assignment
print(stat)
print(p)

27.263680640799215
8.764592435410748e-60


In [50]:
stat, p = ttest_1samp(df["Sepal.Length"], popmean=4)  # unpack both values in one assignment
print(round(stat, 2))
print(round(p, 2))

27.26
0.0
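
Rounding a very small p-value to two decimals prints `0.0`, which hides its real magnitude. A small sketch of two alternative ways to display it follows; the `p < 0.001` label is a common reporting convention, not something the notebook prescribes.

```python
# p-value copied from the ttest_1samp output shown earlier
p = 8.764592435410748e-60

print(round(p, 2))    # rounding to two decimals collapses the value to 0.0
print(f"{p:.3e}")     # scientific notation keeps the magnitude visible

# Show a floor instead of an exact zero when the value is tiny
label = "p < 0.001" if p < 0.001 else f"p = {p:.3f}"
print(label)
```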


- A drawback of ttest_1samp: the result does not include the sample mean, so it has to be computed separately

In [51]:
df["Sepal.Length"].mean()

5.843333333333335

In [52]:
stat, p = ttest_1samp(df["Sepal.Length"], popmean=5.75)  # try a different popmean (hypothesized population mean)
print(round(stat, 2))
print(round(p, 2))

1.38
0.17


- The closer the hypothesized population mean is to the sample mean, the closer the test statistic gets to 0 and the larger the p-value becomes.
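
This relationship can be checked by hand on a toy sample: the numerator of the statistic is the distance between the sample mean and the hypothesized mean, so it shrinks as the two approach each other. The sample and the candidate means below are hypothetical.

```python
import math
import statistics

# Hypothetical sample whose mean is 5.0
sample = [4.6, 4.8, 5.0, 5.2, 5.4]
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Move the hypothesized mean step by step toward the sample mean
t_values = []
for popmean in [4.0, 4.5, 4.9, 5.0]:
    t_stat = (sample_mean - popmean) / standard_error
    t_values.append(t_stat)
    print(popmean, round(t_stat, 3))
```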

## Paired-sample t-test example

In [53]:
stat, p = ttest_rel(df["Sepal.Length"], df["Sepal.Width"])
print(round(stat, 3))
print(round(p, 3))

34.815
0.0


- The two means differ significantly, so the null hypothesis is rejected in favor of the alternative.

## ttest_ind() example

In [54]:
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [55]:
df["Species"].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [56]:
stat, p = ttest_ind(df.loc[df["Species"] == "setosa", "Petal.Length"], 
                    df.loc[df["Species"] == "versicolor", "Petal.Length"])
print(round(stat, 3))
print(round(p, 3))

-39.493
0.0


In [57]:
df.loc[df["Species"] == "setosa", "Petal.Length"].head(2)

0    1.4
1    1.4
Name: Petal.Length, dtype: float64

## 1. The average temperature of the region where the data was collected is said to be 20 degrees. What is the p-value of a two-sided test on the collected data?
- Uses bike.csv

In [58]:
df = pd.read_csv("bike.csv")
df

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


In [59]:
stat, p = ttest_1samp(df["temp"], popmean=20)
print(round(stat, 3))
print(round(p, 3))

3.091
0.002


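## 2. Using only the January 2011 data, what is the test statistic when testing the mean difference between casual and registered for the same time slots?
- Uses bike.csv
- Run a two-sided test and report the integer part of the absolute value of the test statistic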

In [60]:
df = pd.read_csv("bike.csv")
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [61]:
df["datetime"] = pd.to_datetime(df["datetime"])

In [62]:
df["datetime"]

0       2011-01-01 00:00:00
1       2011-01-01 01:00:00
2       2011-01-01 02:00:00
3       2011-01-01 03:00:00
4       2011-01-01 04:00:00
                ...        
10881   2012-12-19 19:00:00
10882   2012-12-19 20:00:00
10883   2012-12-19 21:00:00
10884   2012-12-19 22:00:00
10885   2012-12-19 23:00:00
Name: datetime, Length: 10886, dtype: datetime64[ns]

In [63]:
df['date'] = df["datetime"].dt.date

In [64]:
df

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,date
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,2011-01-01
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,2011-01-01
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,2011-01-01
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,2011-01-01
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,2011-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336,2012-12-19
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241,2012-12-19
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168,2012-12-19
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129,2012-12-19


In [65]:
df = df.groupby("date")[["casual", "registered"]].mean().reset_index()  # select multiple columns with a list


In [66]:
df = df.iloc[:19]

In [67]:
df2 = (df['casual'] - df['registered'])

In [68]:
(df['casual'] - df['registered']).mean()

-45.81425525275639

In [69]:
ttest_rel(df['casual'], df['registered'])

Ttest_relResult(statistic=-13.190291952592151, pvalue=1.0851829821843262e-10)

Solution

In [70]:
df = pd.read_csv("bike.csv")
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [71]:
df["datetime"] = pd.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,year,month
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,2011,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,2011,1


In [72]:
df_sub = df[(df["year"] == 2011) & (df["month"] == 1)]
len(df_sub)

431

In [73]:
df_sub

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,year,month
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16,2011,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40,2011,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32,2011,1
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13,2011,1
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1,2011,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426,2011-01-19 19:00:00,1,0,1,1,13.12,14.395,57,27.9993,4,108,112,2011,1
427,2011-01-19 20:00:00,1,0,1,1,13.12,15.150,49,19.9995,2,74,76,2011,1
428,2011-01-19 21:00:00,1,0,1,1,13.12,14.395,49,27.9993,4,55,59,2011,1
429,2011-01-19 22:00:00,1,0,1,1,12.30,15.150,52,11.0014,6,53,59,2011,1


In [74]:
stat, p = ttest_rel(df_sub["casual"], df_sub["registered"])
print(round(abs(stat), 3))
print(round(p, 3))

21.41
0.0


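## 3. What is the test statistic when testing the mean of registered between weekdays and weekends?
- Uses bike.csv
- Run a two-sided test and report the integer part of the absolute value of the test statistic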

In [75]:
import pandas as pd
from scipy.stats import ttest_1samp
from scipy.stats import ttest_rel
from scipy.stats import ttest_ind

In [76]:
df = pd.read_csv("bike.csv")
df

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


In [77]:
df1 = df[(df["holiday"] == 1)]
df1

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
372,2011-01-17 00:00:00,1,1,0,2,8.20,9.850,47,15.0013,1,16,17
373,2011-01-17 01:00:00,1,1,0,2,8.20,9.850,44,12.9980,1,15,16
374,2011-01-17 02:00:00,1,1,0,2,7.38,8.335,43,16.9979,0,8,8
375,2011-01-17 03:00:00,1,1,0,2,7.38,9.090,43,12.9980,0,2,2
376,2011-01-17 04:00:00,1,1,0,2,7.38,9.850,43,8.9981,1,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...
10257,2012-11-12 19:00:00,4,1,0,1,22.14,25.760,73,19.0012,30,323,353
10258,2012-11-12 20:00:00,4,1,0,2,21.32,25.000,77,19.0012,31,273,304
10259,2012-11-12 21:00:00,4,1,0,3,22.14,25.760,73,15.0013,10,145,155
10260,2012-11-12 22:00:00,4,1,0,1,21.32,25.000,77,16.9979,12,100,112


In [78]:
df2 = df[df["workingday"] == 1]
df2

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
47,2011-01-03 00:00:00,1,0,1,1,9.02,9.850,44,23.9994,0,5,5
48,2011-01-03 01:00:00,1,0,1,1,8.20,8.335,44,27.9993,0,2,2
49,2011-01-03 04:00:00,1,0,1,1,6.56,6.820,47,26.0027,0,1,1
50,2011-01-03 05:00:00,1,0,1,1,6.56,6.820,47,19.0012,0,3,3
51,2011-01-03 06:00:00,1,0,1,1,5.74,5.305,50,26.0027,0,30,30
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


Solution

In [79]:
df = pd.read_csv("bike.csv")

In [80]:
df["datetime"] = pd.to_datetime(df["datetime"])
df["wday"] = df["datetime"].dt.weekday
df["wend"] = (df["wday"] >= 5) + 0  # adding 0 converts the boolean to an integer, 1 or 0
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,wday,wend
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,5,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,5,1
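
The weekend flag relies on the weekday numbering, where Monday is 0 and Sunday is 6; pandas `.dt.weekday` follows the same convention as the standard library's `datetime.weekday()`. A minimal standard-library sketch, using two timestamps that appear in the bike.csv output in this notebook:

```python
from datetime import datetime

# Two timestamps that appear in the bike.csv output in this notebook
timestamps = ["2011-01-01 00:00:00", "2011-01-03 00:00:00"]

flags = []
for text in timestamps:
    moment = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    wday = moment.weekday()   # Monday is 0, Sunday is 6
    wend = int(wday >= 5)     # 1 for Saturday or Sunday, otherwise 0
    flags.append((wday, wend))
    print(text, wday, wend)
```

`int(...)` does the same job as the `+ 0` trick in the cell above: it turns the boolean comparison into 1 or 0.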


In [81]:
stat, p = ttest_ind(df.loc[df['wend']==1, "registered"],
                    df.loc[df['wend']==0, "registered"])
print(abs(round(stat, 3)))
print(round(p, 3))

12.073
0.0


In [82]:
df.loc[df['wend']==1, "registered"]

0         13
1         32
2         27
3         10
4          1
        ... 
10809     99
10810    108
10811     92
10812     83
10813     29
Name: registered, Length: 3163, dtype: int64

In [83]:
df.loc[df['wend']==0, "registered"]

47         5
48         2
49         1
50         3
51        30
        ... 
10881    329
10882    231
10883    164
10884    117
10885     84
Name: registered, Length: 7723, dtype: int64