## Key words
### random sampling, stratified sampling, systematic sampling, cluster sampling, pandas, sample, groupby, sklearn, model_selection, train_test_split

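## Types of (probability) sampling
### Simple Random Sampling
- Ordinary random sampling with no additional rules (plain random draws)
 - ex) When drawing 10 people, all 10 could turn out to be women
 - When the group mix matters, stratified sampling is used instead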

### Stratified Random Sampling
- Randomly sample a specified proportion (or count) of the data from each stratum (group)
 - ex) When drawing 10 people, you can fix the split in advance and draw 5 and 5
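
The difference between the two approaches can be sketched on a small made-up table. The `people` DataFrame and its `gender` column are assumptions for illustration only and are not part of bike.csv.

```python
import pandas as pd

# Hypothetical toy population: 10 women ("F") and 10 men ("M").
people = pd.DataFrame({
    "person_id": range(20),
    "gender": ["F"] * 10 + ["M"] * 10,
})

# Simple random sampling: the gender mix of the 10 picks is left to chance.
simple = people.sample(n=10, random_state=0)
print(simple["gender"].value_counts())

# Stratified sampling: draw exactly 5 rows from each gender group.
stratified = people.groupby("gender").sample(n=5, random_state=0)
print(stratified["gender"].value_counts())
```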
----
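### The two methods below are not covered here
### Systematic Sampling
- Pick the first sample at random, then take the items that lie a fixed sampling interval k apart from it
 - ex) Choose the first person at random, then pick one person out of every k (e.g. k = 3) after that
 - Used less often than one might expect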
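
Although systematic sampling is not covered in these notes, a minimal sketch may help make the idea concrete. The `population` table and the choice of k = 3 are assumptions for illustration; the slicing approach is one possible way to do it, not a dedicated pandas feature.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 30 items with a default 0..29 index.
population = pd.DataFrame({"value": range(30)})

k = 3  # sampling interval (assumed for this example)
rng = np.random.default_rng(42)
start = int(rng.integers(0, k))  # random starting position in [0, k)

# Take the starting row, then every k-th row after it.
systematic = population.iloc[start::k]
print(systematic.index.tolist())
```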

### Cluster Sampling
- Divide the population into a small number of clusters, then randomly select some of those clusters as the sample
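
Cluster sampling is also outside the scope of these notes, but a rough sketch could look like the following. The `students` table and its `classroom` column are made up for illustration, and keeping every row of the chosen clusters is an assumption (one-stage cluster sampling).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 classrooms (clusters) with 5 students each.
students = pd.DataFrame({
    "classroom": np.repeat(["A", "B", "C", "D", "E", "F"], 5),
    "student_id": range(30),
})

# Randomly choose 2 whole classrooms without replacement.
rng = np.random.default_rng(7)
chosen = rng.choice(students["classroom"].unique(), size=2, replace=False)

# Keep every student who belongs to one of the chosen classrooms.
cluster_sample = students[students["classroom"].isin(chosen)]
print(sorted(chosen), len(cluster_sample))
```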


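## Key functions and methods
### pandas - sample()
- Method that performs `simple random sampling`
- n is the number of rows to draw, frac is the fraction of rows to draw, and random_state fixes the sampling result
- Chaining it after groupby() makes `stratified sampling` possible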
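
The options listed above can be tried on a small synthetic table before touching the real data. The `toy` DataFrame below is an assumption for illustration; the calls themselves (`sample` with `n`, `frac`, `random_state`, and `groupby(...).sample`) are the ones described in this section.

```python
import pandas as pd

# Hypothetical table: 100 rows, 80 in group "a" and 20 in group "b".
toy = pd.DataFrame({"x": range(100), "group": ["a"] * 80 + ["b"] * 20})

by_count = toy.sample(n=5, random_state=1)     # draw exactly 5 rows
by_frac = toy.sample(frac=0.1, random_state=1)  # draw 10% of the rows
again = toy.sample(n=5, random_state=1)         # same seed, same rows

# Stratified-style sampling: 10% from each group separately.
per_group = toy.groupby("group").sample(frac=0.1, random_state=1)

print(len(by_count), len(by_frac))
print(by_count.equals(again))
print(per_group["group"].value_counts().to_dict())
```

With the same `random_state`, repeated calls return identical rows, which is what makes a result reproducible.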

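### sklearn - train_test_split()
- Function that splits an input DataFrame or array into two sets (train and test)
- Several datasets can be split in a single call
- Pass a count or a fraction to train_size or test_size to control the sample sizes
- random_state fixes the sampling result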
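
As a small illustration of splitting several datasets at once, the sketch below uses made-up arrays `X` and `y` (assumptions, not from bike.csv). It also shows `test_size` given once as a fraction and once as a row count.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: row i of X is [2*i, 2*i + 1], y[i] = i.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Split both arrays together; matching rows stay aligned across X and y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)
print(X_train.shape, X_test.shape)

# test_size can also be an integer number of rows.
X_tr2, X_te2 = train_test_split(X, test_size=2, random_state=123)
print(len(X_tr2), len(X_te2))
```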

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [17]:
df.sample(n=2)  # options: replace = sample with replacement or not, weights = sampling weights, random_state = fix the result
# without random_state, the sampled rows change on every run

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
3195,2011-08-03 01:00:00,3,0,1,2,31.98,35.605,52,11.0014,7,9,16
6774,2012-03-19 13:00:00,1,0,1,1,25.42,30.305,61,16.9979,69,194,263


In [35]:
df.sample(n=2, random_state=34)  # fixes the sampled rows

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
4219,2011-10-07 20:00:00,4,0,1,1,22.14,25.76,49,0.0,30,167,197
1409,2011-04-04 14:00:00,2,0,1,2,30.34,32.575,27,32.9975,47,76,123


Using groupby() to sample in a stratified fashion

In [36]:
df["season"].unique()

array([1, 2, 3, 4], dtype=int64)

In [37]:
# two ways to count the number of unique values
print(len(df["season"].unique()))
print(df["season"].nunique())

4
4


Note - df["season"] is a Series object.

In [39]:
df.groupby("season").sample(n = 2, random_state= 34)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
5779,2012-01-15 23:00:00,1,0,0,1,6.56,9.85,43,6.0032,3,26,29
6133,2012-02-11 19:00:00,1,0,0,2,8.2,7.575,40,36.9974,2,85,87
7668,2012-05-18 21:00:00,2,0,1,1,24.6,31.06,40,0.0,49,209,258
7831,2012-06-06 16:00:00,2,0,1,1,26.24,31.06,41,8.9981,107,345,452
3375,2011-08-10 13:00:00,3,0,1,1,33.62,36.365,38,19.0012,41,150,191
9443,2012-09-16 20:00:00,3,0,0,1,24.6,31.06,53,7.0015,57,267,324
10514,2012-12-04 12:00:00,4,0,1,1,21.32,25.0,68,12.998,39,273,312
4971,2011-12-01 05:00:00,4,0,1,1,10.66,12.12,56,16.9979,1,23,24


Using the frac option: frac is the fraction of all rows to draw

In [48]:
df.sample(frac = 0.005, random_state= 34)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
4219,2011-10-07 20:00:00,4,0,1,1,22.14,25.76,49,0.0,30,167,197
1409,2011-04-04 14:00:00,2,0,1,2,30.34,32.575,27,32.9975,47,76,123
6289,2012-02-18 07:00:00,1,0,0,1,9.84,14.395,70,0.0,8,33,41
7506,2012-05-12 03:00:00,2,0,0,1,19.68,23.485,59,0.0,14,20,34
7509,2012-05-12 06:00:00,2,0,0,1,17.22,21.21,67,6.0032,10,23,33
2717,2011-07-02 03:00:00,3,0,0,1,26.24,31.06,53,0.0,5,21,26
4094,2011-10-02 15:00:00,4,0,0,3,14.76,16.665,81,16.9979,29,144,173
1526,2011-04-09 11:00:00,2,0,0,2,14.76,18.18,81,0.0,51,91,142
9325,2012-09-11 22:00:00,3,0,1,1,22.96,26.515,64,7.0015,27,189,216
5508,2012-01-04 15:00:00,1,0,1,2,7.38,7.575,37,22.0028,9,81,90


In [46]:
# check how many rows were drawn
print(len(df.sample(frac = 0.005, random_state= 34)))
print(df.sample(frac = 0.005, random_state= 34).shape)

54
(54, 12)


## train_test_split()

In [49]:
df_train, df_test = train_test_split(df, train_size = 0.7, random_state = 123)
df_train.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
4046,2011-09-19 15:00:00,3,0,1,2,24.6,30.305,60,15.0013,44,143,187
9262,2012-09-09 07:00:00,3,0,0,1,22.14,25.76,73,11.0014,20,50,70


In [51]:
# check how many rows ended up in each set
print(len(df_train))
print(len(df_test))

7620
3266


## 1. How many rows are drawn when sampling 1.23% of the given data?
- bike.csv

In [54]:
import pandas as pd

In [55]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [57]:
df.sample(frac = 0.0123).shape[0]

134

## 2. When sampling 5% from each season, what is the total number of rows drawn?
- bike.csv

In [58]:
import pandas as pd

In [59]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [62]:
df.groupby("season").sample(frac=0.05).shape[0]

545
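
One detail worth keeping in mind for this kind of question: with groupby(), the fraction is applied to each group and the row count is rounded group by group, so the total is not always the same as sampling the same fraction from the whole table. The sketch below uses a made-up `toy` table (an assumption, unrelated to bike.csv) whose group sizes were chosen to make the effect visible.

```python
import pandas as pd

# Hypothetical table: three groups of 14 rows and one group of 58 rows (100 in total).
toy = pd.DataFrame({
    "group": ["a"] * 14 + ["b"] * 14 + ["c"] * 14 + ["d"] * 58,
    "value": range(100),
})

# 10% of the whole table: 100 * 0.1 = 10 rows.
overall = toy.sample(frac=0.1, random_state=0)

# 10% per group: 1.4 rounds to 1 for each small group, 5.8 rounds to 6 for "d".
per_group = toy.groupby("group").sample(frac=0.1, random_state=0)

print(len(overall))
print(per_group["group"].value_counts().sort_index().to_dict())
print(len(per_group))
```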

## 3. After an 8:2 split into training and test sets, what is the maximum temperature in the test set?
- bike.csv
- Seed : set to 123

In [63]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [64]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [66]:
df_train, df_test = train_test_split(df, train_size = 0.8, random_state = 123)
print(max(df_test["temp"]))
print(df_test["temp"].max())

39.36
39.36
