## Titanic Data Analysis
+ Train models with several machine learning algorithms, using cross-validation, and evaluate their prediction accuracy

In [36]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [3]:
titanic = pd.read_csv('data/titanic.csv')

In [28]:
titanic.head(20)

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked | life | seat | port | gender | harbour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | S | live | 1st | southampthon | 0 | 1 |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | S | live | 1st | southampthon | 1 | 1 |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | S | dead | 1st | southampthon | 0 | 1 |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | S | dead | 1st | southampthon | 1 | 1 |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | S | dead | 1st | southampthon | 0 | 1 |
| 5 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | S | live | 1st | southampthon | 1 | 1 |
| 6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | S | live | 1st | southampthon | 0 | 1 |
| 7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | S | dead | 1st | southampthon | 1 | 1 |
| 8 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0 | 2 | 0 | 11769 | 51.4792 | S | live | 1st | southampthon | 0 | 1 |
| 9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | C | dead | 1st | cherbourg | 1 | 0 |


In [6]:
titanic.info()  # check for missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306 entries, 0 to 1305
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1306 non-null   int64  
 1   survived  1306 non-null   int64  
 2   name      1306 non-null   object 
 3   sex       1306 non-null   object 
 4   age       1306 non-null   float64
 5   sibsp     1306 non-null   int64  
 6   parch     1306 non-null   int64  
 7   ticket    1306 non-null   object 
 8   fare      1306 non-null   float64
 9   embarked  1306 non-null   object 
 10  life      1306 non-null   object 
 11  seat      1306 non-null   object 
 12  port      1306 non-null   object 
dtypes: float64(2), int64(4), object(7)
memory usage: 132.8+ KB
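The `info()` output above reports 1306 non-null values in every column, so no values are missing. A more direct check is to count nulls per column. The sketch below uses a small hypothetical frame (`toy`, not the Titanic data) purely to illustrate the call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (assumption for illustration)
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.3375, 151.55, np.nan, np.nan],
    "sex": ["female", "male", "female", "male"],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

On the notebook's frame, the analogous call would be `titanic.isna().sum()`.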


### Check the label distribution

In [7]:
titanic.life.value_counts()

dead    808
live    498
Name: life, dtype: int64

### Check the distribution of seat class, one of the features

In [8]:
titanic.seat.value_counts()

3rd    708
1st    321
2nd    277
Name: seat, dtype: int64

### Check the distribution of sex

In [11]:
titanic.sex.value_counts()

male      842
female    464
Name: sex, dtype: int64

### Check the distribution of the port of embarkation

In [12]:
titanic.port.value_counts()

southampthon    913
cherbourg       270
qeenstown       123
Name: port, dtype: int64

### Most machine learning models expect numeric inputs rather than string values
+ String values need to be converted to numeric values first

### Convert sex to a number with label encoding -> derived variable `gender`

In [18]:
titanic['gender'] = titanic['sex'].apply(lambda x: 0 if x == 'female' else 1)

In [21]:
titanic.iloc[:, [3,13]].head(5)
# titanic.loc[:, ['sex','gender']].head(5)

| | sex | gender |
|---|---|---|
| 0 | female | 0 |
| 1 | male | 1 |
| 2 | female | 0 |
| 3 | male | 1 |
| 4 | female | 0 |
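As an alternative to the `apply` with a lambda, the same encoding can be written with an explicit dictionary and `Series.map`. This is only a sketch on a hypothetical small series. One difference worth knowing: the lambda sends every non-`'female'` value to 1, while `map` turns values missing from the dictionary into `NaN`, which makes unexpected categories visible.

```python
import pandas as pd

# Hypothetical sample values (assumption for illustration)
sex = pd.Series(["female", "male", "female", "male", "female"], name="sex")

# Explicit mapping: female -> 0, male -> 1
mapping = {"female": 0, "male": 1}
gender = sex.map(mapping)
print(gender.tolist())

# A value that is not in the mapping becomes NaN instead of silently becoming 1
with_unknown = pd.Series(["female", "unknown"]).map(mapping)
print(with_unknown.isna().tolist())
```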


In [26]:
titanic['harbour'] = titanic['embarked'].apply(lambda x: 0 if x == 'C' else (1 if x == 'S' else 2))

In [27]:
titanic.iloc[:, [9,14]].head(20)

| | embarked | harbour |
|---|---|---|
| 0 | S | 1 |
| 1 | S | 1 |
| 2 | S | 1 |
| 3 | S | 1 |
| 4 | S | 1 |
| 5 | S | 1 |
| 6 | S | 1 |
| 7 | S | 1 |
| 8 | S | 1 |
| 9 | C | 0 |


In [23]:
titanic.embarked.value_counts()

S    913
C    270
Q    123
Name: embarked, dtype: int64
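The `harbour` codes 0/1/2 put the three ports on a numeric scale even though they have no natural order. Linear models such as logistic regression will treat that scale as meaningful. One common alternative, not used in this notebook, is one-hot encoding. A minimal sketch on hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample of embarkation codes (assumption for illustration)
embarked = pd.Series(["S", "S", "C", "Q"], name="embarked")

# One indicator column per port instead of a single ordinal code
harbour_onehot = pd.get_dummies(embarked, prefix="harbour", dtype=int)
print(harbour_onehot)
```

Each row has exactly one indicator set to 1, so no ordering between ports is implied.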

### Select the columns needed for the analysis and build the features and label

In [38]:
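# Same columns as positions 0, 4, 5, 6, 8, 13, 14, selected by name for readability
data = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender', 'harbour']]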
target = titanic.survived

### Split into training and evaluation data

In [41]:
Xtrain, Xtest, ytrain, ytest = train_test_split(data, target, train_size=0.7, random_state=2111041110)
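Because the labels are imbalanced (808 dead vs. 498 live), a plain random split can leave slightly different class ratios in the two parts. The notebook does not stratify. The sketch below shows what a stratified split would look like, using synthetic labels with the same class counts and a placeholder feature (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts: 0 = dead, 1 = live
y = pd.Series([0] * 808 + [1] * 498)
# Placeholder feature column (not real Titanic data)
X = pd.DataFrame({"feature": np.arange(len(y))})

# stratify=y keeps the survival rate nearly identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(len(X_train), len(X_test))
print(round(train_rate, 3), round(test_rate, 3))
```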

### Decision tree

In [42]:
dtclf = DecisionTreeClassifier()
dtclf.fit(Xtrain, ytrain)

DecisionTreeClassifier()

In [43]:
pred = dtclf.predict(Xtest)

In [44]:
accuracy_score(ytest, pred)

0.7755102040816326

### Logistic regression

In [45]:
lrclf = LogisticRegression()
lrclf.fit(Xtrain, ytrain)

LogisticRegression()

In [46]:
pred2 = lrclf.predict(Xtest)

In [47]:
accuracy_score(ytest, pred2)

0.8010204081632653

### Random forest

In [48]:
rfclf = RandomForestClassifier()
rfclf.fit(Xtrain, ytrain)

RandomForestClassifier()

In [49]:
pred3 = rfclf.predict(Xtest)

In [51]:
accuracy_score(ytest, pred3)

0.7882653061224489

### Cross-validation

In [61]:
dtclf = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(dtclf, data, target, cv=10, scoring = 'accuracy')
np.mean(scores)

0.7487081620669407

In [63]:
lrclf = LogisticRegression(max_iter=300)
scores = cross_val_score(lrclf, data, target, cv=10, scoring = 'accuracy')
np.mean(scores)

0.7501820317087493

In [65]:
rfclf = RandomForestClassifier()
scores = cross_val_score(rfclf, data, target, cv=10, scoring = 'accuracy')
np.mean(scores)

0.7364004697592484
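With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling. The first rows shown earlier all have `pclass` 1, so the file may be sorted, in which case each fold would cover a different slice of the passengers. Whether that explains the lower cross-validation scores is an assumption, not something this notebook verifies. The sketch below shows how shuffled folds and the spread of the scores could be reported, on synthetic data from `make_classification` rather than the Titanic features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: 7 features, like the notebook's feature set)
X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# Shuffle before splitting so that row order cannot shape the folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=300), X, y, cv=cv, scoring="accuracy"
)

# Report the spread as well as the mean
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```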

### Evaluating machine learning models
+ Is accuracy alone an appropriate way to judge model performance?

In [66]:
titanic.life.value_counts()

dead    808
live    498
Name: life, dtype: int64
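These counts give a reference point for the accuracy figures: a model that always predicts "dead" is already right for 808 of 1306 passengers. A minimal sketch using `DummyClassifier` on labels rebuilt from the counts above (the feature matrix is a placeholder, and the score is computed on the same rows it was fit on):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the value counts: 0 = dead (808), 1 = live (498)
y = np.array([0] * 808 + [1] * 498)
# Placeholder features; the dummy model ignores them
X = np.zeros((len(y), 1))

# Always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, dummy.predict(X))
print(round(baseline_accuracy, 4))
```

The result is 808 / 1306, roughly 0.62, so the models' scores should be read against that floor rather than against 0.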

### Survival by sex

In [69]:
titanic.groupby(['sex', 'life'])['life'].count()

sex     life
female  dead    127
        live    337
male    dead    681
        live    161
Name: life, dtype: int64
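The raw counts are easier to compare as rates. The sketch below rebuilds the table from the counts printed above and computes the survival rate per sex:

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.DataFrame(
    {"dead": [127, 681], "live": [337, 161]},
    index=pd.Index(["female", "male"], name="sex"),
)

# Survival rate = survivors / all passengers of that sex
survival_rate = counts["live"] / counts.sum(axis=1)
print(survival_rate.round(4))
```

On the notebook's frame, `pd.crosstab(titanic['sex'], titanic['life'], normalize='index')` would give the same rates directly.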

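#### Women survived at a much higher rate than men, so a model can be built from one simple rule
+ Input: female -> survived
+ Input: male -> died
+ Even this rule alone reaches accuracy comparable to the models: (337 + 681) / 1306 ≈ 0.78 on the full dataset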
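The rule can be scored like any other classifier. The sketch below rebuilds `(gender, survived)` pairs from the groupby counts above and evaluates the rule on all 1306 rows. The models earlier were scored on a held-out split or by cross-validation, so the comparison is only approximate. Precision and recall are included as examples of metrics beyond accuracy.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Rebuilt from the counts: gender 0 = female, 1 = male; survived 0 = dead, 1 = live
gender = np.array([0] * 127 + [0] * 337 + [1] * 681 + [1] * 161)
survived = np.array([0] * 127 + [1] * 337 + [0] * 681 + [1] * 161)

# The rule: predict survival for women, death for men
pred = np.where(gender == 0, 1, 0)

acc = accuracy_score(survived, pred)
prec = precision_score(survived, pred)
rec = recall_score(survived, pred)
cm = confusion_matrix(survived, pred)  # rows: true label, columns: predicted label

print(round(acc, 4), round(prec, 4), round(rec, 4))
print(cm)
```

The rule misses the 161 men who survived, which accuracy alone does not show.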