# 01 Overview of Derived Variables
## Definition of a derived variable
- A new variable created by combining existing variables

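## Examples of derived variables
- A wind-chill (perceived temperature) variable built from temperature, humidity, and wind speed
- A refund-ratio variable built from the number of orders and the number of refunds
- A primary-store variable derived from a customer's store visit history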
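
As a minimal sketch of the refund-ratio example above, the derived column is simply one existing column divided by another. The DataFrame `orders` and its column names are hypothetical and are not part of the course files.

```python
import pandas as pd

# Hypothetical order summary; names and values are made up for illustration.
orders = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "order_count": [10, 4, 8],
    "refund_count": [1, 2, 0],
})

# Derived variable: share of each customer's orders that were refunded.
orders["refund_ratio"] = orders["refund_count"] / orders["order_count"]
print(orders)
```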

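# 02 Key Functions and Methods
## numpy - where()
- A function that returns one of two outputs depending on a condition
- Works as an element-wise replacement for an if/else; pass the condition, the value to return where it is True, and the value to return where it is False, in that order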

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("강의자료/실습파일/iris.csv")
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [4]:
df["is_setosa"] = np.where(df["Species"] == "setosa", 1, 0)  # 1 where Species is "setosa", otherwise 0
df.head()  # confirm that the is_setosa column was created

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1


In [5]:
df.tail()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa
145,6.7,3.0,5.2,2.3,virginica,0
146,6.3,2.5,5.0,1.9,virginica,0
147,6.5,3.0,5.2,2.0,virginica,0
148,6.2,3.4,5.4,2.3,virginica,0
149,5.9,3.0,5.1,1.8,virginica,0


In [6]:
pd.crosstab(df["Species"], df["is_setosa"])

Species \ is_setosa,0,1
setosa,0,50
versicolor,50,0
virginica,50,0


In [8]:
# Alternatives to np.where(): arithmetic on a boolean treats True as 1 and False as 0
df["is_setosa_2"] = (df["Species"] == "setosa") + 0 
df["is_setosa_3"] = (df["Species"] == "setosa") * 1 
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa,is_setosa_2,is_setosa_3
0,5.1,3.5,1.4,0.2,setosa,1,1,1
1,4.9,3.0,1.4,0.2,setosa,1,1,1
2,4.7,3.2,1.3,0.2,setosa,1,1,1
3,4.6,3.1,1.5,0.2,setosa,1,1,1
4,5.0,3.6,1.4,0.2,setosa,1,1,1
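
The `+ 0` and `* 1` tricks work because arithmetic on a boolean Series converts it to integers. A small sketch on a toy Series (the values are made up) shows that an explicit `astype(int)` gives the same 0/1 values and may read more clearly:

```python
import pandas as pd

# Toy data standing in for the Species column.
species = pd.Series(["setosa", "versicolor", "setosa"])

mask = species == "setosa"   # boolean Series
as_int = mask.astype(int)    # explicit conversion to 0/1
via_add = mask + 0           # arithmetic turns True/False into 1/0
via_mul = mask * 1

print(as_int.tolist(), via_add.tolist(), via_mul.tolist())
```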


## pandas - rename()
- A method for renaming the columns of a DataFrame
- Pass the columns argument a dictionary that maps each existing name to its new name

In [10]:
df.rename(columns = {"Sepal.Length": "SL"})

Unnamed: 0,SL,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa,is_setosa_2,is_setosa_3
0,5.1,3.5,1.4,0.2,setosa,1,1,1
1,4.9,3.0,1.4,0.2,setosa,1,1,1
2,4.7,3.2,1.3,0.2,setosa,1,1,1
3,4.6,3.1,1.5,0.2,setosa,1,1,1
4,5.0,3.6,1.4,0.2,setosa,1,1,1
...,...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,0,0,0
146,6.3,2.5,5.0,1.9,virginica,0,0,0
147,6.5,3.0,5.2,2.0,virginica,0,0,0
148,6.2,3.4,5.4,2.3,virginica,0,0,0


In [11]:
df.head()  # rename() returns a new DataFrame; df itself is unchanged

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa,is_setosa_2,is_setosa_3
0,5.1,3.5,1.4,0.2,setosa,1,1,1
1,4.9,3.0,1.4,0.2,setosa,1,1,1
2,4.7,3.2,1.3,0.2,setosa,1,1,1
3,4.6,3.1,1.5,0.2,setosa,1,1,1
4,5.0,3.6,1.4,0.2,setosa,1,1,1


In [12]:
df = df.rename(columns = {"Sepal.Length": "SL"})  # assign the result back to df to keep the change
df.head()

Unnamed: 0,SL,Sepal.Width,Petal.Length,Petal.Width,Species,is_setosa,is_setosa_2,is_setosa_3
0,5.1,3.5,1.4,0.2,setosa,1,1,1
1,4.9,3.0,1.4,0.2,setosa,1,1,1
2,4.7,3.2,1.3,0.2,setosa,1,1,1
3,4.6,3.1,1.5,0.2,setosa,1,1,1
4,5.0,3.6,1.4,0.2,setosa,1,1,1
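
The same dictionary can carry several name pairs at once. This is a sketch on a one-row toy DataFrame; the short name `SW` is a hypothetical choice made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Sepal.Length": [5.1], "Sepal.Width": [3.5]})

# Rename two columns in one call; the result is a new DataFrame.
renamed = toy.rename(columns={"Sepal.Length": "SL", "Sepal.Width": "SW"})

print(list(toy.columns))      # original names are untouched
print(list(renamed.columns))  # new names appear only in the returned copy
```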


## pandas - apply()
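- A method that applies a function along the rows or the columns in one pass
- The axis argument sets the direction: axis = 0 (the default) applies the function to each column, moving down the rows; axis = 1 applies it to each row, moving across the columns
- User-defined functions or lambda functions (one-off functions) allow more complex operations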
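
Because the axis direction is easy to mix up, here is a minimal sketch on a toy DataFrame (values made up) that shows which shape each setting returns:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column, one result per column.
col_totals = toy.apply(lambda col: col.sum())

# axis=1: the function receives each row, one result per row.
row_totals = toy.apply(lambda row: row.sum(), axis=1)

# A lambda can express a custom calculation, here the spread within each row.
row_range = toy.apply(lambda row: row.max() - row.min(), axis=1)

print(col_totals.tolist(), row_totals.tolist(), row_range.tolist())
```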

In [7]:
bike = pd.read_csv("강의자료/실습파일/bike.csv")
bike.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [8]:
bike.columns  # check the column names: casual is at position 9, registered at position 10

Index(['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count'],
      dtype='object')

In [9]:
bike_sub = bike.iloc[:, 9:11]  # iloc selects by integer position; 9:11 covers positions 9 and 10
bike_sub.head()

Unnamed: 0,casual,registered
0,3,13
1,8,32
2,5,27
3,3,10
4,0,1


In [11]:
bike_sub = bike.loc[:, ["casual", "registered"]]
bike_sub.head()

Unnamed: 0,casual,registered
0,3,13
1,8,32
2,5,27
3,3,10
4,0,1


In [12]:
bike_sub = bike.loc[:, "casual": "registered"]
bike_sub.head()

Unnamed: 0,casual,registered
0,3,13
1,8,32
2,5,27
3,3,10
4,0,1


In [14]:
bike_sub.sum()

casual         392135
registered    1693341
dtype: int64

In [15]:
bike_sub.sum(axis = 1)

0         16
1         40
2         32
3         13
4          1
        ... 
10881    336
10882    241
10883    168
10884    129
10885     88
Length: 10886, dtype: int64

In [16]:
bike_sub.apply(func = sum)

casual         392135
registered    1693341
dtype: int64

In [17]:
bike_sub.apply(func = sum, axis = 1)

0         16
1         40
2         32
3         13
4          1
        ... 
10881    336
10882    241
10883    168
10884    129
10885     88
Length: 10886, dtype: int64

In [18]:
bike_sub.apply(func = lambda x: round(x.mean()))

casual         36
registered    156
dtype: int64

## pandas - astype()
- A method for changing the data type (dtype) of a Series
- int / float / str stand for integer / floating-point / string; pass the type you want to convert to
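
A quick sketch of the three conversions on a toy Series (values made up):

```python
import pandas as pd

s = pd.Series([3, 8, 5])

as_float = s.astype(float)  # integers become floating-point numbers
as_text = s.astype(str)     # integers become strings
back_to_int = as_text.astype(int)  # digit strings can be converted back

print(as_float.tolist(), as_text.tolist(), back_to_int.tolist())
```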

In [19]:
bike_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   casual      10886 non-null  int64
 1   registered  10886 non-null  int64
dtypes: int64(2)
memory usage: 170.2 KB


In [20]:
bike_sub["casual"]

0         3
1         8
2         5
3         3
4         0
         ..
10881     7
10882    10
10883     4
10884    12
10885     4
Name: casual, Length: 10886, dtype: int64

In [21]:
3 + "대"  # an integer and a string cannot be added ("대" is the Korean counter for vehicles)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [22]:
"3" + "대"  # two strings can be concatenated

'3대'

In [25]:
bike_sub["casual"].astype("str") + "대"  # convert to string so the unit suffix can be appended

0         3대
1         8대
2         5대
3         3대
4         0대
        ... 
10881     7대
10882    10대
10883     4대
10884    12대
10885     4대
Name: casual, Length: 10886, dtype: object

In [26]:
bike["datetime"][:3]

0    2011-01-01 00:00:00
1    2011-01-01 01:00:00
2    2011-01-01 02:00:00
Name: datetime, dtype: object

In [27]:
bike["datetime"][:3].str.slice(0, 4)  # extract only the year

0    2011
1    2011
2    2011
Name: datetime, dtype: object

In [28]:
bike["datetime"][:3].str.slice(5, 7)  # extract only the month

0    01
1    01
2    01
Name: datetime, dtype: object

In [29]:
bike_time = pd.to_datetime(bike["datetime"][:3])  # convert the data type to datetime
bike_time

0   2011-01-01 00:00:00
1   2011-01-01 01:00:00
2   2011-01-01 02:00:00
Name: datetime, dtype: datetime64[ns]

In [30]:
bike_time.dt.year  # year / month / day and so on are available directly through .dt

0    2011
1    2011
2    2011
Name: datetime, dtype: int64

In [31]:
bike_time.dt.month

0    1
1    1
2    1
Name: datetime, dtype: int64

In [32]:
bike_time.dt.hour

0    0
1    1
2    2
Name: datetime, dtype: int64

In [34]:
bike_time.dt.weekday

0    5
1    5
2    5
Name: datetime, dtype: int64
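
The value 5 returned by `dt.weekday` means Saturday, since pandas numbers weekdays from Monday = 0 to Sunday = 6. A short sketch on two made-up timestamps shows the numbering next to `dt.day_name()`, plus a weekend flag as one possible derived variable:

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series(["2011-01-01 00:00:00", "2011-01-03 09:00:00"]))

weekday_num = stamps.dt.weekday    # Monday=0 ... Sunday=6
weekday_name = stamps.dt.day_name()
is_weekend = stamps.dt.weekday >= 5  # Saturday (5) or Sunday (6)

print(weekday_num.tolist(), weekday_name.tolist(), is_weekend.tolist())
```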

## pandas - get_dummies()
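- A function that makes it easy to create dummy variables (one-hot encoding)
- Pass the nominal (categorical) variables to encode through the columns argument
- Setting drop_first to True drops the first dummy variable and creates the rest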
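
A minimal sketch on a four-row toy DataFrame (values made up) makes the effect of `drop_first` visible in the column names alone:

```python
import pandas as pd

toy = pd.DataFrame({"season": [1, 2, 3, 4], "count": [16, 40, 32, 13]})

# One dummy column per level of season.
full = pd.get_dummies(data=toy, columns=["season"])

# drop_first=True omits the dummy for the first level (season_1).
reduced = pd.get_dummies(data=toy, columns=["season"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the first level dropped, a row whose remaining dummies are all 0 belongs to season 1, so no information is lost.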

In [38]:
bike.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [37]:
bike_dum = pd.get_dummies(data = bike, columns = ["season"])  # creates dummies season_1 to season_4; these first rows have season 1, so season_1 is 1
bike_dum.head()

Unnamed: 0,datetime,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,season_1,season_2,season_3,season_4
0,2011-01-01 00:00:00,0,0,1,9.84,14.395,81,0.0,3,13,16,1,0,0,0
1,2011-01-01 01:00:00,0,0,1,9.02,13.635,80,0.0,8,32,40,1,0,0,0
2,2011-01-01 02:00:00,0,0,1,9.02,13.635,80,0.0,5,27,32,1,0,0,0
3,2011-01-01 03:00:00,0,0,1,9.84,14.395,75,0.0,3,10,13,1,0,0,0
4,2011-01-01 04:00:00,0,0,1,9.84,14.395,75,0.0,0,1,1,1,0,0,0


In [39]:
bike_dum = pd.get_dummies(data = bike, columns = ["season"],
                         drop_first = True)  # drop the first dummy (season_1)
bike_dum.head(2)

Unnamed: 0,datetime,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,season_2,season_3,season_4
0,2011-01-01 00:00:00,0,0,1,9.84,14.395,81,0.0,3,13,16,0,0,0
1,2011-01-01 01:00:00,0,0,1,9.02,13.635,80,0.0,8,32,40,0,0,0


## Q1. What is the mean absolute difference between the temp and atemp variables?
1) Use the bike.csv file  
2) Use the abs() method

In [41]:
Q1 = pd.read_csv("강의자료/실습파일/bike.csv")
Q1.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [61]:
Q1["diff"] = Q1["temp"] - Q1["atemp"]
Q1["diff"].abs().mean()

3.5091985118501188
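
The same calculation can be written without storing an intermediate column. This is a sketch on two rows copied from the head of bike.csv shown above, so its result differs from the full-file answer:

```python
import pandas as pd

# First two rows of temp and atemp as displayed above.
toy = pd.DataFrame({"temp": [9.84, 9.02], "atemp": [14.395, 13.635]})

# Subtract, take absolute values, then average, all in one expression.
mean_abs_diff = (toy["temp"] - toy["atemp"]).abs().mean()
print(mean_abs_diff)
```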

## Q2. On how many days did the daily maximum of casual exceed 25?
1) Use the bike.csv file  
2) Use the date attribute

In [62]:
Q2 = pd.read_csv("강의자료/실습파일/bike.csv")
Q2.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [80]:
Q2["datetime"] = pd.to_datetime(Q2["datetime"])
Q2["date"] = Q2["datetime"].dt.date
Q2.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,date
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,2011-01-01
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,2011-01-01
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2011-01-01
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,2011-01-01
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,2011-01-01


In [73]:
Q2_agg = Q2.groupby("date")["casual"].max()  # maximum of "casual" for each "date"
Q2_agg.head()

date
2011-01-01    47
2011-01-02    20
2011-01-03    14
2011-01-04    18
2011-01-05    12
Name: casual, dtype: int64

In [74]:
type(Q2_agg)

pandas.core.series.Series

In [75]:
Q2_agg_up25 = Q2_agg[Q2_agg > 25]
Q2_agg_up25.head()

date
2011-01-01    47
2011-01-15    33
2011-01-16    35
2011-02-06    52
2011-02-12    47
Name: casual, dtype: int64

In [77]:
Q2_agg_up25.count()

384

In [78]:
# Handling the same aggregation as a DataFrame
Q2_agg_v2 = Q2.groupby("date")["casual"].max().reset_index()
Q2_agg_v2.head()

Unnamed: 0,date,casual
0,2011-01-01,47
1,2011-01-02,20
2,2011-01-03,14
3,2011-01-04,18
4,2011-01-05,12
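
The DataFrame version stops before the count. One way to finish it, sketched on made-up toy data, is to sum a boolean condition on the aggregated column:

```python
import pandas as pd

# Toy data: several readings per date (values are made up).
toy = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02", "2011-01-03"],
    "casual": [3, 47, 20, 8, 26],
})

daily_max = toy.groupby("date")["casual"].max().reset_index()

# True counts as 1, so summing the condition counts the qualifying days.
n_days = int((daily_max["casual"] > 25).sum())
print(n_days)
```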


## Q3. When the mean of registered is computed per hour, which hour has the largest value?
1) Use the bike.csv file  
2) Use the hour attribute

In [79]:
Q3 = pd.read_csv("강의자료/실습파일/bike.csv")
Q3.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [85]:
Q3["datetime"] = pd.to_datetime(Q3["datetime"])
Q3["hour"] = Q3["datetime"].dt.hour
Q3.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,3
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,4


In [90]:
Q3_agg = Q3.groupby("hour")["registered"].mean().reset_index()
Q3_agg

Unnamed: 0,hour,registered
0,0,44.826374
1,1,27.345815
2,2,18.080357
3,3,9.076212
4,4,5.144796
5,5,18.311947
6,6,72.10989
7,7,202.202198
8,8,341.226374
9,9,190.824176


In [91]:
Q3_agg.loc[Q3_agg["registered"] == Q3_agg["registered"].max()]

Unnamed: 0,hour,registered
17,17,393.324561


In [92]:
Q3_agg.loc[Q3_agg["registered"].idxmax()]  # alternative: look up the row label of the maximum

hour           17.000000
registered    393.324561
Name: 17, dtype: float64
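
If `reset_index()` is skipped, the grouped result stays a Series indexed by hour, and `idxmax()` returns the hour itself. A sketch on made-up toy data:

```python
import pandas as pd

# Toy data: a few readings per hour (values are made up).
toy = pd.DataFrame({
    "hour": [8, 8, 17, 17, 3],
    "registered": [300, 380, 400, 390, 9],
})

hourly_mean = toy.groupby("hour")["registered"].mean()  # index is the hour
peak_hour = hourly_mean.idxmax()  # index label of the largest mean
print(peak_hour, hourly_mean.max())
```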