## Key words
### correlation coefficient, corr, pearsonr, spearmanr, kendalltau

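## Various correlation analyses
- Correlation analysis is used to check the linear relationship between two variables
- Two numeric variables are compared with Pearson's Correlation Coefficient; several other correlation coefficients also exist
- The closer the coefficient is to 0, the weaker the linear relationship; the closer its absolute value is to 1, the stronger the linear relationship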
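
The interpretation of values near 0 and near ±1 can be checked with a small hand-written Pearson calculation. This is a minimal sketch on made-up numbers; the helper name `pearson_r` and the toy lists are assumptions for illustration, not part of the notebook.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
rising_line = [2 * v + 1 for v in x]    # exact positive linear relationship
falling_line = [-v for v in x]          # exact negative linear relationship
u_shape = [(v - 3) ** 2 for v in x]     # clear relationship, but not a linear one

r_pos = pearson_r(x, rising_line)
r_neg = pearson_r(x, falling_line)
r_zero = pearson_r(x, u_shape)
print(r_pos, r_neg, r_zero)
```

The U-shaped case gives a coefficient of 0 even though the two lists are clearly related, which is why a value near 0 only says the linear relationship is weak.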

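## pandas - corr()
- A pandas DataFrame method that computes the pairwise correlation matrix of its columns
- The method argument accepts "pearson" (the default), "kendall", or "spearman" and computes the corresponding coefficient
- `Use it when only the coefficients are needed` (no p-values are returned)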
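
As a quick illustration of the method argument, the sketch below runs all three options on a small made-up DataFrame. The name `toy` and its values are hypothetical and are not taken from bike.csv.

```python
import pandas as pd

# Hypothetical toy data (not from bike.csv)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],   # always increases with x, but not linearly
    "z": [5, 3, 4, 1, 2],
})

pearson = toy.corr()                     # method="pearson" is the default
spearman = toy.corr(method="spearman")   # rank-based
kendall = toy.corr(method="kendall")     # rank-based (Kendall's tau)

print(pearson.round(3))
print(spearman.round(3))
print(kendall.round(3))
```

Because y always increases with x, the two rank-based coefficients for that pair are 1, while Pearson stays slightly below 1 since the relationship is curved.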

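## scipy - pearsonr()
- A scipy function that performs Pearson correlation analysis
- Takes two one-dimensional vectors as input and returns the correlation coefficient and the p-value, in that order
- For numeric, continuous variables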
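
A minimal usage sketch on made-up vectors follows. The lists `temps` and `rentals` and the 0.05 threshold are assumptions chosen only for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical measurements (not from bike.csv)
temps = [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5]
rentals = [3, 8, 7, 15, 14, 22, 25, 24]

# The result unpacks in the order (coefficient, p-value)
r, p_value = pearsonr(temps, rentals)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")

alpha = 0.05  # assumed significance level
is_significant = p_value < alpha
print("significant at 5%:", is_significant)
```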

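## scipy - spearmanr()
- A scipy function that performs Spearman correlation analysis
- Takes two one-dimensional vectors as input and returns the correlation coefficient and the p-value, in that order
- For ordinal variables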
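
Spearman's coefficient can be read as Pearson's coefficient computed on ranks, which is why it suits ordinal data. The sketch below checks this on made-up ordinal scores; the variable names and values are hypothetical.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical ordinal scores (1-5) and visit counts
satisfaction = [1, 2, 2, 3, 4, 5, 5, 3]
repeat_visits = [0, 1, 3, 2, 6, 9, 7, 4]

rho, p_value = spearmanr(satisfaction, repeat_visits)

# Same coefficient by hand: convert to average ranks (ties share a rank),
# then apply Pearson to the ranks
rho_by_hand, _ = pearsonr(rankdata(satisfaction), rankdata(repeat_visits))

print(round(rho, 6), round(rho_by_hand, 6))
```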

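## scipy - kendalltau()
- A scipy function that performs Kendall correlation analysis
- Takes two one-dimensional vectors as input and returns the correlation coefficient and the p-value, in that order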
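
Kendall's tau compares every pair of observations and counts how often the two variables move in the same direction. The sketch below is a simplified version that assumes there are no ties; scipy's `kendalltau` also handles ties, so this helper (`kendall_tau_no_ties`, a hypothetical name) is not a substitute for it.

```python
from itertools import combinations

def kendall_tau_no_ties(x, y):
    """Kendall's tau for data without ties: (concordant - discordant) / pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau_no_ties(x, y)
print(tau)  # 8 concordant and 2 discordant pairs out of 10 -> 0.6
```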

In [1]:
import pandas as pd
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import kendalltau

In [2]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [3]:
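df.corr(numeric_only=True) # Pearson is the default; numeric_only=True skips the non-numeric datetime column (needed in pandas 2.0+)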

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
season,1.0,0.029368,-0.008126,0.008879,0.258689,0.264744,0.19061,-0.147121,0.096758,0.164011,0.163439
holiday,0.029368,1.0,-0.250491,-0.007074,0.000295,-0.005215,0.001929,0.008409,0.043799,-0.020956,-0.005393
workingday,-0.008126,-0.250491,1.0,0.033772,0.029966,0.02466,-0.01088,0.013373,-0.319111,0.11946,0.011594
weather,0.008879,-0.007074,0.033772,1.0,-0.055035,-0.055376,0.406244,0.007261,-0.135918,-0.10934,-0.128655
temp,0.258689,0.000295,0.029966,-0.055035,1.0,0.984948,-0.064949,-0.017852,0.467097,0.318571,0.394454
atemp,0.264744,-0.005215,0.02466,-0.055376,0.984948,1.0,-0.043536,-0.057473,0.462067,0.314635,0.389784
humidity,0.19061,0.001929,-0.01088,0.406244,-0.064949,-0.043536,1.0,-0.318607,-0.348187,-0.265458,-0.317371
windspeed,-0.147121,0.008409,0.013373,0.007261,-0.017852,-0.057473,-0.318607,1.0,0.092276,0.091052,0.101369
casual,0.096758,0.043799,-0.319111,-0.135918,0.467097,0.462067,-0.348187,0.092276,1.0,0.49725,0.690414
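registered,0.164011,-0.020956,0.11946,-0.10934,0.318571,0.314635,-0.265458,0.091052,0.49725,1.0,0.970948
count,0.163439,-0.005393,0.011594,-0.128655,0.394454,0.389784,-0.317371,0.101369,0.690414,0.970948,1.0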


In [4]:
# Select only a subset of columns
df[["casual", "registered", "count"]].corr()

Unnamed: 0,casual,registered,count
casual,1.0,0.49725,0.690414
registered,0.49725,1.0,0.970948
count,0.690414,0.970948,1.0


In [5]:
df[["casual", "registered", "count"]].corr(method = "kendall") # the data is not ordinal; Kendall is used here only as a trial

Unnamed: 0,casual,registered,count
casual,1.0,0.582213,0.666411
registered,0.582213,1.0,0.919346
count,0.666411,0.919346,1.0


In [6]:
df[["casual", "registered", "count"]].corr(method = "spearman")

Unnamed: 0,casual,registered,count
casual,1.0,0.775785,0.847378
registered,0.775785,1.0,0.988901
count,0.847378,0.988901,1.0


Pearson

In [7]:
pearsonr(df["casual"], df["registered"]) # returns the correlation coefficient and the p-value

(0.49724968508700823, 0.0)

In [9]:
stat, p = pearsonr(df["casual"], df["registered"]) # unpack the correlation coefficient and the p-value
print(stat)
print(p)

0.49724968508700823
0.0


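### 1. When the correlations among temperature, feels-like temperature, relative humidity, and the bike rental count are analyzed, what is the lowest correlation coefficient?
- bike.csv
- Use a correlation method appropriate to the properties of the data
- Use the casual variable as the bike rental count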

In [10]:
import pandas as pd
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import kendalltau

In [11]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [16]:
df[["temp", "atemp", "humidity", "casual"]].corr() # this gives the answer; the default method is Pearson

Unnamed: 0,temp,atemp,humidity,casual
temp,1.0,0.984948,-0.064949,0.467097
atemp,0.984948,1.0,-0.043536,0.462067
humidity,-0.064949,-0.043536,1.0,-0.348187
casual,0.467097,0.462067,-0.348187,1.0


Answer

In [19]:
df[["temp", "atemp", "humidity", "casual"]].corr(method = "pearson").round(2) # the data is numeric, so use Pearson (the default)

Unnamed: 0,temp,atemp,humidity,casual
temp,1.0,0.98,-0.06,0.47
atemp,0.98,1.0,-0.04,0.46
humidity,-0.06,-0.04,1.0,-0.35
casual,0.47,0.46,-0.35,1.0
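
Instead of reading the lowest value off the matrix by eye, the off-diagonal entries can be flattened and searched. This is a sketch on made-up data with the same column names; `toy` and its values are hypothetical. It takes "lowest" to mean the smallest signed value; apply `abs()` first if the weakest relationship is wanted instead.

```python
import pandas as pd

# Hypothetical toy data reusing the notebook's column names
toy = pd.DataFrame({
    "temp": [10, 14, 18, 22, 26, 30],
    "atemp": [12, 15, 20, 25, 28, 33],
    "humidity": [80, 75, 60, 62, 50, 45],
    "casual": [3, 10, 18, 30, 35, 50],
})

corr = toy.corr(method="pearson")

# Flatten the matrix into a Series indexed by (variable, variable) pairs
pairs = corr.unstack()
# Drop the self-correlations on the diagonal (always 1)
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first != second]

lowest_pair = pairs.idxmin()   # tuple of the two variable names
lowest_value = pairs.min()
print(lowest_pair, round(lowest_value, 2))
```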


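### 2. We want to examine the correlation between feels-like temperature and the bike rental count for each season. Using an appropriate correlation method, which correlation coefficient is correct?
- bike.csv
- Use the casual variable as the bike rental count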

In [20]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [30]:
df1 = df.loc[(df["season"] == 1)]

In [32]:
df1[["atemp", "casual"]].corr()

Unnamed: 0,atemp,casual
atemp,1.0,0.478312
casual,0.478312,1.0


In [33]:
df2 = df.loc[(df["season"] == 2)]

In [34]:
df2[["atemp", "casual"]].corr()

Unnamed: 0,atemp,casual
atemp,1.0,0.378122
casual,0.378122,1.0


Answer

In [35]:
df[["season", "atemp", "casual"]].groupby("season").corr()

Unnamed: 0_level_0,Unnamed: 1_level_0,atemp,casual
season,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,atemp,1.0,0.478312
1,casual,0.478312,1.0
2,atemp,1.0,0.378122
2,casual,0.378122,1.0
3,atemp,1.0,0.381423
3,casual,0.381423,1.0
4,atemp,1.0,0.443751
4,casual,0.443751,1.0


In [37]:
df[["season", "atemp", "casual"]].groupby("season").max()

Unnamed: 0_level_0,atemp,casual
season,Unnamed: 1_level_1,Unnamed: 2_level_1
1,32.575,367
2,43.94,361
3,45.455,350
4,34.09,362


Extra) Removing the self-correlation rows (value 1)

In [40]:
df_corr = df[["season", "atemp", "casual"]].groupby("season").corr()
df_corr = df_corr.reset_index()
df_corr

Unnamed: 0,season,level_1,atemp,casual
0,1,atemp,1.0,0.478312
1,1,casual,0.478312,1.0
2,2,atemp,1.0,0.378122
3,2,casual,0.378122,1.0
4,3,atemp,1.0,0.381423
5,3,casual,0.381423,1.0
6,4,atemp,1.0,0.443751
7,4,casual,0.443751,1.0


In [41]:
df_corr = df_corr.loc[df_corr["atemp"] < 1, ]
df_corr

Unnamed: 0,season,level_1,atemp,casual
1,1,casual,0.478312,1.0
3,2,casual,0.378122,1.0
5,3,casual,0.381423,1.0
7,4,casual,0.443751,1.0


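### 3. We want to see how the correlation between temperature and bike rentals changes with the weather. What is the absolute value of the difference between the correlation coefficient on clear days and on all other days?
- Use bike.csv; a weather value of 1 indicates a clear day
- Use the casual variable as the bike rental count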

In [44]:
df = pd.read_csv("bike.csv")
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40


In [53]:
df1 = df.loc[df["weather"] == 1]

In [54]:
df1

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.910,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129


In [55]:
df2 = df.loc[df["weather"] != 1]

In [56]:
df2

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
5,2011-01-01 05:00:00,1,0,0,2,9.84,12.880,75,6.0032,0,1,1
13,2011-01-01 13:00:00,1,0,0,2,18.86,22.725,72,19.9995,47,47,94
14,2011-01-01 14:00:00,1,0,0,2,18.86,22.725,72,19.0012,35,71,106
15,2011-01-01 15:00:00,1,0,0,2,18.04,21.970,77,19.9995,40,70,110
16,2011-01-01 16:00:00,1,0,0,2,17.22,21.210,82,19.9995,41,52,93
...,...,...,...,...,...,...,...,...,...,...,...,...
10837,2012-12-17 23:00:00,4,0,1,3,17.22,21.210,94,15.0013,6,41,47
10838,2012-12-18 00:00:00,4,0,1,2,18.04,21.970,94,8.9981,0,18,18
10839,2012-12-18 01:00:00,4,0,1,2,18.04,21.970,94,8.9981,0,15,15
10840,2012-12-18 02:00:00,4,0,1,2,18.04,21.970,88,15.0013,2,5,7


In [61]:
df1 = df1[["temp", "casual"]].corr()

In [62]:
df2 = df2[["temp", "casual"]].corr()

In [64]:
abs(df1.iloc[1, 0] - df2.iloc[1, 0]) # the question asks for the absolute difference

0.024691646193832073

Answer

In [65]:
df["is_sunny"] = (df["weather"] == 1).astype(int) # 1 = clear day, 0 = otherwise
df.head(2)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,is_sunny
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,1
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1


In [69]:
df_corr = df.groupby("is_sunny")[["temp", "casual"]].corr()
df_corr

Unnamed: 0_level_0,Unnamed: 1_level_0,temp,casual
is_sunny,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,temp,1.0,0.446361
0,casual,0.446361,1.0
1,temp,1.0,0.471053
1,casual,0.471053,1.0


In [67]:
round(abs(df_corr.iloc[1, 0] - df_corr.iloc[3, 0]), 3)

0.025
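
The positional lookups `iloc[1, 0]` and `iloc[3, 0]` depend on the row order of the grouped matrix. A label-based variant is sketched below on made-up data; the names `toy`, `sky`, and the labels "clear" and "other" are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data reusing the notebook's column names
toy = pd.DataFrame({
    "weather": [1, 1, 1, 1, 1, 2, 3, 2, 2, 3],
    "temp": [10.0, 15.0, 20.0, 25.0, 30.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "casual": [5, 12, 20, 31, 38, 2, 9, 6, 15, 11],
})

# Label each row instead of creating a 0/1 column
sky = (toy["weather"] == 1).map({True: "clear", False: "other"}).rename("sky")

# One temp-casual coefficient per label
r_by_sky = (
    toy.groupby(sky)[["temp", "casual"]]
    .apply(lambda g: g["temp"].corr(g["casual"]))
)

# Look the two coefficients up by label, then take the absolute difference
gap = abs(r_by_sky["clear"] - r_by_sky["other"])
print(r_by_sky.round(3))
print(round(gap, 3))
```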