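# Logistic Regression - Money Ball

> Analyze the Moneyball dataset with (binary) logistic regression to see which teams reach the playoffs.

We analyze which teams reach the playoffs (the postseason). Setting the Wins feature aside, which feature has the greatest influence? How does a team win more games, score more runs, and ultimately reach the playoffs?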

# 1. Import Library

In [5]:
import numpy as np # linear algebra
import pandas as pd # CSV file handling
import os
# print(os.listdir("./")) # list the files in the current directory

In [6]:
from sklearn.linear_model import LogisticRegression # needed to run logistic regression with sklearn
import matplotlib.pyplot as plt # plotting library
import warnings
warnings.filterwarnings('ignore')

# Suppress warnings such as notices about changes in future versions. (Reference: https://rfriend.tistory.com/346)

# 2. Load Data & Data Exploration 

* Load the dataset
* Read the CSV file with pandas
* Inspect the loaded data

In [7]:
data = pd.read_csv("./baseball.csv")
data.head() # after loading the data with pandas, check the first rows with head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424


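About the dataset: https://www.kaggle.com/wduckett/moneyball-mlb-stats-19622012

## 2-1. What does each column mean?
When a dataset is not a familiar one, we need to understand what each feature means.

- Team: Major League team name
- League: league the team belongs to
- Year: year the data was recorded
- RS: (Runs Scored) runs scored
- RA: (Runs Allowed) runs allowed
- W: (Wins) wins
- OBP: (On-Base Percentage) how often a batter reaches base, for example on a hit or a walk
- SLG: (Slugging Percentage) a measure of a batter's power
- BA: (Batting Average) batting average
- Playoffs: whether the team reached the playoffs (binary)
- RankSeason: season rank
- RankPlayoffs: playoff rank
- G: games played
- OOBP: opponents' on-base percentage
- OSLG: opponents' slugging percentage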

In [8]:
# What features are there?
data.columns

Index(['Team', 'League', 'Year', 'RS', 'RA', 'W', 'OBP', 'SLG', 'BA',
       'Playoffs', 'RankSeason', 'RankPlayoffs', 'G', 'OOBP', 'OSLG'],
      dtype='object')

In [9]:
# How many features are there?
len(data.columns)

15

# 3. Data Exploration

Some features may have no effect on the dependent variable. Identifying and removing them keeps the model simpler and can make it more accurate. To that end we carry out preprocessing and EDA.
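One rough first look at which features relate to the target is the correlation of each numeric column with `Playoffs`. The sketch below uses a hypothetical six-row frame (the numbers are illustrative only). Correlation captures linear association alone, so treat it as a screening aid rather than a selection rule.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "RS": [734, 700, 712, 734, 613, 800],
    "RA": [688, 600, 705, 806, 759, 620],
    "Playoffs": [0, 1, 1, 0, 0, 1],
})

# Pearson correlation of every feature with the target, target itself removed.
corr_with_target = toy.corr()["Playoffs"].drop("Playoffs")
print(corr_with_target.sort_values())
```

Features whose correlation is close to zero are candidates for a closer look before deciding whether to drop them.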

In [10]:
# Visualization packages

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [11]:
# Use a Korean font in plots

matplotlib.rc('font', family='NanumBarunGothic')
plt.rcParams['axes.unicode_minus'] = False

In [12]:
display(data.info())
# 1232 entries, 15 columns (info() prints its report and returns None, which display() then shows)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1232 entries, 0 to 1231
Data columns (total 15 columns):
Team            1232 non-null object
League          1232 non-null object
Year            1232 non-null int64
RS              1232 non-null int64
RA              1232 non-null int64
W               1232 non-null int64
OBP             1232 non-null float64
SLG             1232 non-null float64
BA              1232 non-null float64
Playoffs        1232 non-null int64
RankSeason      244 non-null float64
RankPlayoffs    244 non-null float64
G               1232 non-null int64
OOBP            420 non-null float64
OSLG            420 non-null float64
dtypes: float64(7), int64(6), object(2)
memory usage: 144.5+ KB


None

In [13]:
# Check for missing values
data.isnull().sum()

Team              0
League            0
Year              0
RS                0
RA                0
W                 0
OBP               0
SLG               0
BA                0
Playoffs          0
RankSeason      988
RankPlayoffs    988
G                 0
OOBP            812
OSLG            812
dtype: int64

## 3-1. Handling null values

In [16]:
print("Season rank values", data['RankSeason'].unique())
print("Playoff rank values", data['RankPlayoffs'].unique())

print("Opponent OBP", data['OOBP'].unique())
print("Opponent SLG", data['OSLG'].unique())

# Season rank only goes up to 8th, and playoff rank only up to 5th.
# Opponent OBP and SLG appear to be missing for many rows.

Season rank values [nan  4.  5.  2.  6.  3.  1.  7.  8.]
Playoff rank values [nan  5.  4.  2.  3.  1.]
Opponent OBP [0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 0.337 0.339
 0.31  0.327 0.326 0.333 0.311 0.308 0.313 0.294 0.309 0.303 0.316 0.341
 0.322 0.325 0.323 0.321 0.304 0.34  0.332 0.296 0.338 0.307 0.328 0.334
 0.342 0.33  0.348 0.353 0.324 0.351 0.344 0.343 0.312 0.345 0.329 0.346
 0.352 0.318 0.361 0.32  0.362 0.354 0.364 0.349 0.358 0.356 0.347 0.367
 0.302 0.355 0.35  0.372 0.359 0.36  0.301 0.369 0.384 0.365 0.368 0.371
   nan]
Opponent SLG [0.415 0.378 0.403 0.428 0.424 0.405 0.39  0.43  0.47  0.402 0.427 0.423
 0.364 0.399 0.414 0.442 0.401 0.419 0.407 0.398 0.394 0.393 0.387 0.352
 0.408 0.438 0.373 0.409 0.361 0.46  0.392 0.413 0.397 0.431 0.396 0.432
 0.425 0.388 0.371 0.385 0.435 0.411 0.375 0.386 0.346 0.383 0.382 0.448
 0.376 0.434 0.395 0.416 0.4   0.417 0.379 0.449 0.368 0.37  0.404 0.41
 0.476 0.422 0.391 0.418 0.443 0.44  0.45  0.406 0.372 0.437 0.454 0.433
 0.455 0.38  0.436 0

In [18]:
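# Too many values are missing to drop the rows, so replacing the values is more appropriate.
# Fill missing season ranks with 9 and missing playoff ranks with 6 (one past the worst observed rank).
# Opponent OBP sits around 0.3xx and opponent SLG around 0.4xx with no obvious outliers, so fill them with the column mean.

# Assign the result back instead of calling fillna(inplace=True) on a selected column
data['RankSeason'] = data['RankSeason'].fillna(9)
data['RankPlayoffs'] = data['RankPlayoffs'].fillna(6)
data['OOBP'] = data['OOBP'].fillna(data['OOBP'].mean())
data['OSLG'] = data['OSLG'].fillna(data['OSLG'].mean())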

data.isnull().sum()

Team            0
League          0
Year            0
RS              0
RA              0
W               0
OBP             0
SLG             0
BA              0
Playoffs        0
RankSeason      0
RankPlayoffs    0
G               0
OOBP            0
OSLG            0
dtype: int64
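
The same four fills can also be written as a single `DataFrame.fillna` call with a per-column dictionary. A minimal sketch on a hypothetical toy frame (column names follow the dataset, the rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the four columns that contain NaN.
toy = pd.DataFrame({
    "RankSeason": [4.0, np.nan, 5.0, np.nan],
    "RankPlayoffs": [5.0, np.nan, 4.0, np.nan],
    "OOBP": [0.306, np.nan, 0.315, 0.331],
    "OSLG": [0.378, 0.415, np.nan, 0.428],
})

# Sentinel ranks for teams that missed the playoffs, column means for the rest.
fill_values = {
    "RankSeason": 9,
    "RankPlayoffs": 6,
    "OOBP": toy["OOBP"].mean(),
    "OSLG": toy["OSLG"].mean(),
}
toy_filled = toy.fillna(value=fill_values)

print(toy_filled.isnull().sum())
```

Returning a new frame rather than mutating in place keeps the raw data available for comparison.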

In [19]:
display(data.describe())

Unnamed: 0,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
count,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0
mean,1988.957792,715.081981,715.081981,80.904221,0.326331,0.397342,0.259273,0.198052,7.836039,5.349838,161.918831,0.332264,0.419743
std,14.819625,91.534294,93.079933,11.458139,0.015013,0.033267,0.012907,0.398693,2.467149,1.396357,0.624365,0.008924,0.015466
min,1962.0,463.0,472.0,40.0,0.277,0.301,0.214,0.0,1.0,1.0,158.0,0.294,0.346
25%,1976.75,652.0,649.75,73.0,0.317,0.375,0.251,0.0,9.0,6.0,162.0,0.332264,0.419743
50%,1989.0,711.0,709.0,81.0,0.326,0.396,0.26,0.0,9.0,6.0,162.0,0.332264,0.419743
75%,2002.0,775.0,774.25,89.0,0.337,0.421,0.268,0.0,9.0,6.0,162.0,0.332264,0.419743
max,2012.0,1009.0,1103.0,116.0,0.373,0.491,0.294,1.0,9.0,6.0,165.0,0.384,0.499


In [20]:
# W is a ratio-scale variable

## 2-2. Checking variable types

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1232 entries, 0 to 1231
Data columns (total 15 columns):
Team            1232 non-null object
League          1232 non-null object
Year            1232 non-null int64
RS              1232 non-null int64
RA              1232 non-null int64
W               1232 non-null int64
OBP             1232 non-null float64
SLG             1232 non-null float64
BA              1232 non-null float64
Playoffs        1232 non-null int64
RankSeason      1232 non-null float64
RankPlayoffs    1232 non-null float64
G               1232 non-null int64
OOBP            1232 non-null float64
OSLG            1232 non-null float64
dtypes: float64(7), int64(6), object(2)
memory usage: 144.5+ KB


### 1) Categorical variables

In [22]:
# categorical variable
categorical_col = list(data.select_dtypes(include='object').columns)
categorical_col

['Team', 'League']

In [23]:
# The Playoffs feature is categorical (0 or 1), but it is stored as an int type in the data.

### 2) Continuous variables

In [24]:
# numerical variable
numerical_col =  list(data.select_dtypes(include=('int64', 'float64')).columns)
numerical_col

['Year',
 'RS',
 'RA',
 'W',
 'OBP',
 'SLG',
 'BA',
 'Playoffs',
 'RankSeason',
 'RankPlayoffs',
 'G',
 'OOBP',
 'OSLG']

In [25]:
len(numerical_col)

13

### 3) Summary of variable types
- 2 categorical variables (Team, League) and 13 numerical variables

### 4) Values contained in each variable
- Print the number of unique values of each variable.

In [26]:
# for categorical_col

for col in categorical_col:
    print(col + ': ', len(set(data[str(col)])))

Team:  39
League:  2
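
pandas also offers `DataFrame.nunique()`, which returns the same kind of per-column count in one call. A small sketch on a hypothetical frame; note that `nunique()` ignores NaN by default, which only matters for columns that still contain missing values.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Team": ["ARI", "ATL", "BAL", "ARI"],
    "League": ["NL", "NL", "AL", "NL"],
    "G": [162, 162, 162, 161],
})

# Number of distinct values per column.
unique_counts = toy.nunique()
print(unique_counts)
```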


In [27]:
# 39 teams and 2 leagues
# There have been two leagues throughout the decades covered.

In [28]:
# for numerical_col

for col in numerical_col:
    print(col + ': ', len(set(data[str(col)])))

Year:  47
RS:  374
RA:  381
W:  63
OBP:  87
SLG:  162
BA:  75
Playoffs:  2
RankSeason:  9
RankPlayoffs:  6
G:  8
OOBP:  73
OSLG:  113


In [29]:
# The dataset covers 47 seasons.
# Playoffs has only 2 unique values, so converting it to a categorical variable would be fine.
# Q. What values does G (Games Played) take?

In [30]:
data.G.head()

0    162
1    162
2    162
3    162
4    162
Name: G, dtype: int64

In [31]:
# It is the number of games: how many games were played in a season.

In [32]:
data.G.mean()

161.91883116883116

In [33]:
# Across the 47 seasons, an average of 161.918 games were played.

# 2. Data Preprocessing 

### Task1. Dropping a column
- We want to know which features, other than W (Wins), most affect whether a team reaches the postseason.
- Use del or drop to remove the W column.

In [34]:
data.head()

Unnamed: 0,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,9.0,6.0,162,0.317,0.415
1,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,9.0,6.0,162,0.331,0.428
4,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,9.0,6.0,162,0.335,0.424


In [35]:
del data['W']
data.head()

Unnamed: 0,Team,League,Year,RS,RA,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,NL,2012,734,688,0.328,0.418,0.259,0,9.0,6.0,162,0.317,0.415
1,ATL,NL,2012,700,600,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,AL,2012,712,705,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,AL,2012,734,806,0.315,0.415,0.26,0,9.0,6.0,162,0.331,0.428
4,CHC,NL,2012,613,759,0.302,0.378,0.24,0,9.0,6.0,162,0.335,0.424


### Task2. Encoding: League
- The League feature takes the values AL and NL.

In [36]:
set(data.League)

{'AL', 'NL'}

In [37]:
# Assign the encoded column back instead of replacing in place on a selected column
data['League'] = data['League'].map({'AL': 0, 'NL': 1})

In [38]:
data.League.head()

0    1
1    1
2    0
3    0
4    1
Name: League, dtype: int64
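
When encoding labels with a dictionary, it can help to check how unexpected labels are handled. The sketch below uses `Series.map` on hypothetical toy data; the label "XX" is invented to show that unmapped values become NaN rather than raising an error.

```python
import pandas as pd

league_codes = {"AL": 0, "NL": 1}

# Labels that are all covered by the mapping.
toy = pd.Series(["NL", "NL", "AL", "AL", "NL"])
encoded = toy.map(league_codes)
print(encoded.tolist())  # [1, 1, 0, 0, 1]

# A hypothetical unseen label ("XX") is mapped to NaN.
unexpected = pd.Series(["AL", "XX"]).map(league_codes)
print(unexpected.isna().sum())  # 1
```

Checking for NaN after the mapping is a cheap way to catch a misspelled or new category.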

### Task3. Dropping a column
- Drop the Team column.
- The model is unlikely to be affected much without the Team column.

In [39]:
data.head()

Unnamed: 0,Team,League,Year,RS,RA,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,ARI,1,2012,734,688,0.328,0.418,0.259,0,9.0,6.0,162,0.317,0.415
1,ATL,1,2012,700,600,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,BAL,0,2012,712,705,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,BOS,0,2012,734,806,0.315,0.415,0.26,0,9.0,6.0,162,0.331,0.428
4,CHC,1,2012,613,759,0.302,0.378,0.24,0,9.0,6.0,162,0.335,0.424


In [40]:
del data['Team']
data.head()

Unnamed: 0,League,Year,RS,RA,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
0,1,2012,734,688,0.328,0.418,0.259,0,9.0,6.0,162,0.317,0.415
1,1,2012,700,600,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
2,0,2012,712,705,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
3,0,2012,734,806,0.315,0.415,0.26,0,9.0,6.0,162,0.331,0.428
4,1,2012,613,759,0.302,0.378,0.24,0,9.0,6.0,162,0.335,0.424


### Task4. Handling NaN values
- The first head() output showed NaN values.
- They were filled in section 3-1, so here we confirm that none remain.

In [41]:
data.isnull().sum()

League          0
Year            0
RS              0
RA              0
OBP             0
SLG             0
BA              0
Playoffs        0
RankSeason      0
RankPlayoffs    0
G               0
OOBP            0
OSLG            0
dtype: int64

In [42]:
# RankSeason, RankPlayoffs, OOBP and OSLG originally contained null values; inspect them after the fill.

data.RankSeason.head()

In [43]:
data.RankSeason.head()

0    9.0
1    4.0
2    5.0
3    9.0
4    9.0
Name: RankSeason, dtype: float64

In [44]:
data.RankPlayoffs.head()

0    6.0
1    5.0
2    4.0
3    6.0
4    6.0
Name: RankPlayoffs, dtype: float64

In [45]:
# OOBP는 Opponent On-Base Percentage. 
# OSLG는 Opponent Slugging Percentage.

data.describe()

Unnamed: 0,League,Year,RS,RA,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
count,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0,1232.0
mean,0.5,1988.957792,715.081981,715.081981,0.326331,0.397342,0.259273,0.198052,7.836039,5.349838,161.918831,0.332264,0.419743
std,0.500203,14.819625,91.534294,93.079933,0.015013,0.033267,0.012907,0.398693,2.467149,1.396357,0.624365,0.008924,0.015466
min,0.0,1962.0,463.0,472.0,0.277,0.301,0.214,0.0,1.0,1.0,158.0,0.294,0.346
25%,0.0,1976.75,652.0,649.75,0.317,0.375,0.251,0.0,9.0,6.0,162.0,0.332264,0.419743
50%,0.5,1989.0,711.0,709.0,0.326,0.396,0.26,0.0,9.0,6.0,162.0,0.332264,0.419743
75%,1.0,2002.0,775.0,774.25,0.337,0.421,0.268,0.0,9.0,6.0,162.0,0.332264,0.419743
max,1.0,2012.0,1009.0,1103.0,0.373,0.491,0.294,1.0,9.0,6.0,165.0,0.384,0.499


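* Use sklearn's SimpleImputer to fill OOBP and OSLG with the column mean. (Both columns were already filled in section 3-1, so this step leaves the values unchanged.)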
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [46]:
from sklearn.impute import SimpleImputer

# fill_value is only used with strategy='constant', so it is omitted here
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(data[['OOBP', 'OSLG']])

data[['OOBP', 'OSLG']] = imputer.transform(data[['OOBP', 'OSLG']])
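
To see what the imputer does when missing values are actually present, here is a minimal sketch on a hypothetical 4x2 array standing in for the OOBP and OSLG columns (the numbers are illustrative).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the OOBP and OSLG columns, with two gaps.
toy = np.array([
    [0.317, 0.415],
    [np.nan, 0.378],
    [0.315, np.nan],
    [0.331, 0.428],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

# statistics_ holds the per-column means learned during fit.
print(imputer.statistics_)
print(filled)
```

Fitting the imputer on the training split only and reusing it on the test split would avoid leaking test-set statistics into training.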

In [47]:
data.OOBP.isnull().sum()

0

In [48]:
data.OSLG.isnull().sum()

0

In [49]:
data.RankPlayoffs.isnull().sum()

0

In [50]:
data.RankSeason.isnull().sum()

0

In [51]:
del data['RankPlayoffs']
del data['RankSeason']

In [52]:
data.head()

Unnamed: 0,League,Year,RS,RA,OBP,SLG,BA,Playoffs,G,OOBP,OSLG
0,1,2012,734,688,0.328,0.418,0.259,0,162,0.317,0.415
1,1,2012,700,600,0.32,0.389,0.247,1,162,0.306,0.378
2,0,2012,712,705,0.311,0.417,0.247,1,162,0.315,0.403
3,0,2012,734,806,0.315,0.415,0.26,0,162,0.331,0.428
4,1,2012,613,759,0.302,0.378,0.24,0,162,0.335,0.424


In [53]:
del data['Year']

In [54]:
data.head()

Unnamed: 0,League,RS,RA,OBP,SLG,BA,Playoffs,G,OOBP,OSLG
0,1,734,688,0.328,0.418,0.259,0,162,0.317,0.415
1,1,700,600,0.32,0.389,0.247,1,162,0.306,0.378
2,0,712,705,0.311,0.417,0.247,1,162,0.315,0.403
3,0,734,806,0.315,0.415,0.26,0,162,0.331,0.428
4,1,613,759,0.302,0.378,0.24,0,162,0.335,0.424


# 3. train_test_split

In [68]:
from sklearn.linear_model import LogisticRegression

# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
y = data['Playoffs'].to_numpy()
# copy() so that deleting a column from X later leaves data untouched
X = data.loc[:, data.columns != 'Playoffs'].copy()


In [69]:
X.head()

Unnamed: 0,League,RS,RA,OBP,SLG,BA,G,OOBP,OSLG
0,1,734,688,0.328,0.418,0.259,162,0.317,0.415
1,1,700,600,0.32,0.389,0.247,162,0.306,0.378
2,0,712,705,0.311,0.417,0.247,162,0.315,0.403
3,0,734,806,0.315,0.415,0.26,162,0.331,0.428
4,1,613,759,0.302,0.378,0.24,162,0.335,0.424


In [70]:
# Drop the G feature

del X['G'] 

In [71]:
X.head()

Unnamed: 0,League,RS,RA,OBP,SLG,BA,OOBP,OSLG
0,1,734,688,0.328,0.418,0.259,0.317,0.415
1,1,700,600,0.32,0.389,0.247,0.306,0.378
2,0,712,705,0.311,0.417,0.247,0.315,0.403
3,0,734,806,0.315,0.415,0.26,0.331,0.428
4,1,613,759,0.302,0.378,0.24,0.335,0.424


In [72]:
y

array([0, 1, 1, ..., 1, 0, 0], dtype=int64)

In [73]:
# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Q1. The train_test_split function
What each argument of train_test_split() means

(1) Parameters
- arrays : the data to split (Python list, NumPy array, pandas DataFrame, etc.)
- test_size : proportion (float) or number (int) of samples in the test set (0.25 when train_size is also unset)
- train_size : proportion (float) or number (int) of samples in the training set (default = the complement of test_size)
- random_state : seed for the shuffling applied before the split (an int or a RandomState instance)
- shuffle : whether to shuffle before splitting (default = True)
- stratify : preserves the class proportions of the given data. For example, if the label set Y is binary with 25% zeros and 75% ones, setting stratify=Y keeps the 25%/75% ratio of zeros and ones in each of the resulting splits.

(2) Returns
- X_train, X_test, Y_train, Y_test : returned when both data and labels are passed in arrays; the pairing of each sample with its label is preserved.
- X_train, X_test : returned when only data, without labels, is passed in arrays



[Source] How to use sklearn's train_test_split()
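
The effect of `stratify` described above can be checked on synthetic labels. The sketch below assumes a hypothetical label vector with 20% positives, roughly the playoff rate in this dataset, and compares the class ratio in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 20 positives and 80 negatives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, both splits keep the 20% positive rate.
print(y_tr.mean(), y_te.mean())
```

Because only about one team in five reaches the playoffs, stratifying is a reasonable option when the test set is small.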

# 4. Feature Scaling

The features span very different ranges (RS and RA are in the hundreds, while OBP, SLG and BA are below 1), so we apply feature scaling.

In [74]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
warnings.filterwarnings(action='once')



## Q2. Scaling


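The goal of scaling here is to standardize the range of the independent variables. Standardization subtracts each feature's mean and divides by its standard deviation (sigma), so every feature ends up with mean 0 and standard deviation 1. Without it the features are measured on inconsistent scales, which makes them hard to compare and complicates the analysis.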
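
As a quick check of this definition, the sketch below standardizes a hypothetical two-column array (RS-like and OBP-like values) by hand and compares the result with `StandardScaler`. The test row is transformed with the statistics learned from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training rows: one column in the hundreds, one below 1.
train = np.array([[734.0, 0.328], [700.0, 0.320], [712.0, 0.311], [613.0, 0.302]])
test = np.array([[734.0, 0.315]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std, then scale
test_scaled = scaler.transform(test)        # reuse the training statistics

# Manual standardization: (x - mean) / std, per column.
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```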

# 5. Modeling 

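## Q3. The LogisticRegression() model and what its arguments mean

- random_state: seed of the random number generator used to shuffle the data. Fixing it keeps the fitted model reproducible.
- solver: algorithm used for the optimization problem. 'liblinear' works well on small datasets, while 'saga' is more efficient on large ones.
- multi_class: with 'ovr' (one-vs-rest), a binary problem is fit for each label. 'multinomial' minimizes the multinomial loss fit across the entire probability distribution and cannot be used when solver is 'liblinear'. 'auto' selects 'ovr' when the data is binary or solver = 'liblinear', and 'multinomial' otherwise.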

In [78]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=12, solver='liblinear', multi_class='ovr')
model.fit(X_train,y_train)
model.score(X_test, y_test)

0.8502024291497976

## Q4. Cross-validation (10-fold cross_validation)

In [94]:
from sklearn.model_selection import cross_val_score

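# Run the cross-validation once and reuse the scores
scores = cross_val_score(model, X, y, cv=10)
print('LR', scores.mean(), scores.std())

# Mean and standard deviation of the accuracy over the 10 folds (mean +/- 2*std gives a rough 95% interval)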



LR 0.8749769984479328 0.030557610905218604
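
The cross-validation above runs on the unscaled `X`. One way to combine scaling with cross-validation is to wrap both steps in a pipeline, so the scaler is refit on the training folds of each split. The sketch below is an illustration on synthetic data from `make_classification`, not a rerun of the notebook's result.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the imbalanced Playoffs label.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Scaling happens inside each fold, so no test-fold statistics leak into training.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=12),
)
scores = cross_val_score(pipe, X_toy, y_toy, cv=10)
print(f"LR {scores.mean():.3f} +/- {scores.std():.3f}")
```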




# 6. Feature Selection
- ref: https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python

- A way to select the important features.
- Based on the Kaggle material above.
- Which variables matter in the model I built?

In [86]:
X.head()

Unnamed: 0,League,RS,RA,OBP,SLG,BA,OOBP,OSLG
0,1,734,688,0.328,0.418,0.259,0.317,0.415
1,1,700,600,0.32,0.389,0.247,0.306,0.378
2,0,712,705,0.311,0.417,0.247,0.315,0.403
3,0,734,806,0.315,0.415,0.26,0.331,0.428
4,1,613,759,0.302,0.378,0.24,0.335,0.424


In [87]:
data.head()

Unnamed: 0,League,RS,RA,OBP,SLG,BA,Playoffs,G,OOBP,OSLG
0,1,734,688,0.328,0.418,0.259,0,162,0.317,0.415
1,1,700,600,0.32,0.389,0.247,1,162,0.306,0.378
2,0,712,705,0.311,0.417,0.247,1,162,0.315,0.403
3,0,734,806,0.315,0.415,0.26,0,162,0.331,0.428
4,1,613,759,0.302,0.378,0.24,0,162,0.335,0.424


In [88]:
from sklearn.feature_selection import RFE

cols = ["BA", "League", "OOBP", "OSLG", "RA", "RS", "SLG"]
X = data[cols]
y = data['Playoffs']

# Build a logreg and compute the feature importances
model = LogisticRegression()
# create the RFE and select 3 attributes
rfe = RFE(model, n_features_to_select=3)  # n_features_to_select must be passed by keyword in recent scikit-learn
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print('Selected features: %s' % list(X.columns[rfe.support_]))


Selected features: ['OSLG', 'RA', 'RS']




I picked out the three most important features,
and batting average (BA) was not among them. Interesting.
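
RFE with a linear model ranks features by coefficient magnitude, and those magnitudes depend on each feature's units. Standardizing the features first puts the coefficients on a comparable footing. The sketch below shows this setup on synthetic data from `make_classification`; it is an assumption-laden illustration, and the selection on the real data may or may not change.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 7 features, 3 of which carry signal.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=7, n_informative=3, n_redundant=0, random_state=0
)

# Standardize so coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_toy)

rfe = RFE(LogisticRegression(solver="liblinear"), n_features_to_select=3)
rfe.fit(X_scaled, y_toy)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 for kept features, larger numbers were eliminated earlier
```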

Q. How to calculate Odds ratio?

https://stackoverflow.com/questions/38646040/attributeerror-linearregression-object-has-no-attribute-coef

In [89]:
model.fit(X, y)
model.coef_



array([[-0.01307594,  0.00790457, -0.02705128, -0.03298225, -0.03250142,
         0.02834823, -0.00785103]])
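
For logistic regression, exponentiating a coefficient gives the odds ratio for a one-unit increase in that feature with the other features held fixed. A minimal sketch on synthetic data (the names `toy_model` and `odds_ratios` are chosen for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with 4 features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

toy_model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem; exp() turns log-odds into odds ratios.
odds_ratios = np.exp(toy_model.coef_[0])
print(odds_ratios)
```

An odds ratio above 1 means the feature raises the odds of the positive class, and below 1 means it lowers them. The size of a "one-unit increase" depends on the feature's scale, so ratios for RS (runs) and BA (a fraction below 1) are not directly comparable.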

In [91]:
X.head()

Unnamed: 0,BA,League,OOBP,OSLG,RA,RS,SLG
0,0.259,1,0.317,0.415,688,734,0.418
1,0.247,1,0.306,0.378,600,700,0.389
2,0.247,0,0.315,0.403,705,712,0.417
3,0.26,0,0.331,0.428,806,734,0.415
4,0.24,1,0.335,0.424,759,613,0.378


References

- https://scikit-learn.org/stable/modules/cross_validation.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html