## 1. Practice: Predicting Titanic Survival with Machine Learning

---

### 1.1 Loading the libraries

- **pandas**: library for working with DataFrames
- **sklearn**: library for machine learning (scikit-learn)
- **lightgbm**: library providing the LightGBM models
- **xgboost**: library providing the XGBoost models

In [8]:
# !pip install lightgbm

In [9]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.preprocessing import LabelEncoder

from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from tqdm import tqdm

### 1.2 Data preprocessing

**Load the Titanic data with `pd.read_csv`.**

In [10]:
data=pd.read_csv('./titanic_train.csv')

**Check the first five rows of `data`.**

In [11]:
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**Check the number of null values in each column.**

In [12]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

**Check the max, min, and mean of `Age`.**


In [13]:
print(f"max: {data['Age'].max()}")
print(f"min: {data['Age'].min()}")
print(f"mean: {data['Age'].mean()}")

max: 80.0
min: 0.42
mean: 29.69911764705882


**Extract the title (the honorific such as Mr or Mrs) from the `Name` column and store it in the `Initial` column of `data`.**

In [14]:
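# Capture the letters immediately before the first period in each name (the title).
# A raw string avoids the invalid-escape warning for "\.", and no loop is needed
# because str.extract already operates on the whole column.
data['Initial'] = data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)

data['Initial']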

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: Initial, Length: 891, dtype: object
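
As a small illustration of what the regular expression captures, the sketch below applies the same pattern to three names taken from the `head()` output above. The `names` Series is a toy input used only for this example.

```python
import pandas as pd

# Toy input: three names copied from the head() output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# The pattern captures the run of letters immediately before the first period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

The surname is skipped because it is followed by a comma rather than a period.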

**Count the titles in `Initial` by `Sex`.**

Use `pd.crosstab` to build a cross-tabulation of `data.Initial` and `data.Sex`.

In [15]:
pd.crosstab(data.Initial, data.Sex).T

| Sex / Initial | Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| female | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 182 | 2 | 1 | 0 | 125 | 1 | 0 | 0 |
| male | 1 | 2 | 0 | 1 | 6 | 1 | 0 | 2 | 40 | 0 | 0 | 0 | 517 | 0 | 0 | 6 | 1 |


**Replace the unusual values in `Initial` with a small set of common titles.**

In [16]:
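# Assign the result back instead of using inplace=True on a column selection,
# which may not modify `data` under pandas copy-on-write.
data['Initial'] = data['Initial'].replace(
    ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],
    ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr']
)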
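
The two parallel lists are easy to misalign. One alternative, shown here only as a sketch, is to write the mapping as a dictionary so that each rare title sits next to its replacement. The name `title_map` and the toy Series are assumptions made for this example, and only a few of the pairs from the cell above are reproduced.

```python
import pandas as pd

# Hypothetical dictionary holding a subset of the pairs used in the cell above.
title_map = {"Mlle": "Miss", "Mme": "Miss", "Ms": "Miss", "Dr": "Mr", "Rev": "Other"}

# Toy input for illustration.
initial = pd.Series(["Mr", "Mlle", "Dr", "Rev", "Master"])

# Values without an entry in the dictionary (Mr, Master) are left unchanged.
cleaned = initial.replace(title_map)
print(cleaned.tolist())  # ['Mr', 'Miss', 'Mr', 'Other', 'Master']
```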

In [19]:
data['Initial']

0         Mr
1        Mrs
2       Miss
3        Mrs
4         Mr
       ...  
886    Other
887     Miss
888     Miss
889       Mr
890       Mr
Name: Initial, Length: 891, dtype: object

**Print the mean `Age` for each `Initial`.**

In [20]:
data.groupby('Initial')['Age'].mean()

Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64

**Where `Age` is null, fill it with the rounded mean `Age` of the passenger's `Initial` group.**

In [21]:
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
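
The hard-coded ages above are the rounded group means. A more general variant, sketched here on a toy frame, fills each missing `Age` with the mean of its own `Initial` group through `groupby(...).transform("mean")`. This variant uses the unrounded mean, so on the real data the filled values would differ slightly from the constants in the cell above.

```python
import pandas as pd

# Toy frame for illustration; the values are not taken from the Titanic data.
df = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, 40.0, None, 20.0, None],
})

# transform("mean") returns a Series aligned with df that holds each row's
# group mean; missing values are skipped when the mean is computed.
group_mean = df.groupby("Initial")["Age"].transform("mean")

# Only the missing entries are replaced.
df["Age"] = df["Age"].fillna(group_mean)
print(df["Age"].tolist())  # [30.0, 40.0, 35.0, 20.0, 20.0]
```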

**Check whether any null values remain in `Age`.**

In [22]:
data.Age.isnull().any() 

False

Add a column holding the family size (siblings and spouses, plus parents and children, plus the passenger).

In [23]:
data['FamilySize'] = data.SibSp + data.Parch + 1

Add a column indicating whether the passenger boarded alone.

In [24]:
data['IsAlone'] = 1
# Use a single .loc call on the DataFrame; the chained form
# data['IsAlone'].loc[...] = 0 triggers SettingWithCopyWarning.
data.loc[data['FamilySize'] > 1, 'IsAlone'] = 0


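Convert the columns that are stored as text but are categorical in meaning to integer codes. Note that `Embarked` still has two missing values at this point; they are not imputed before encoding.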

In [25]:
label = LabelEncoder()

data['Sex_Code'] = label.fit_transform(data['Sex'])
data['Embarked_Code'] = label.fit_transform(data['Embarked'])
data['Initial_Code'] = label.fit_transform(data['Initial'])
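
Because the same `LabelEncoder` instance is refitted for every column, only the mapping of the last column (`Initial`) remains available afterwards. If the mappings are needed later, for example to encode new data the same way, one option is to keep a separate encoder per column. The sketch below assumes a toy frame and a hypothetical `encoders` dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame for illustration.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Hypothetical dictionary: one fitted encoder per column.
encoders = {}
for col in ["Sex", "Embarked"]:
    encoders[col] = LabelEncoder()
    df[col + "_Code"] = encoders[col].fit_transform(df[col])

# classes_ lists the labels in sorted order; the position is the integer code.
print(encoders["Sex"].classes_.tolist())  # ['female', 'male']
print(df["Sex_Code"].tolist())            # [1, 0, 0]
print(df["Embarked_Code"].tolist())       # [2, 0, 1]
```

The sorted order explains the codes seen in the feature table: `female` is 0 and `male` is 1.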

Store the prediction target in `target`.

In [26]:
target = data["Survived"]

Store the features used for training in `data`.

In [27]:
data = data.drop(columns=["PassengerId", "Name", "Survived", "Sex", "Ticket", "Initial", "Cabin", "Embarked"])

In [28]:
data

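|   | Pclass | Age | SibSp | Parch | Fare | FamilySize | IsAlone | Sex_Code | Embarked_Code | Initial_Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 2 | 0 | 1 | 2 | 2 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 | 0 | 0 | 0 | 3 |
| 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 1 | 0 | 2 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 2 | 0 | 0 | 2 | 3 |
| 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 1 | 1 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 27.0 | 0 | 0 | 13.0000 | 1 | 1 | 1 | 2 | 4 |
| 887 | 1 | 19.0 | 0 | 0 | 30.0000 | 1 | 1 | 0 | 2 | 1 |
| 888 | 3 | 22.0 | 1 | 2 | 23.4500 | 4 | 0 | 0 | 2 | 1 |
| 889 | 1 | 26.0 | 0 | 0 | 30.0000 | 1 | 1 | 1 | 0 | 2 |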


Split the data into training and test sets with `train_test_split`. The train and test sizes use the defaults (75% train, 25% test).


In [29]:
X_train, X_test, Y_train, Y_test = train_test_split(data, target, random_state=75)
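
With the default `test_size` of 0.25, the 891 rows are split into 668 training rows and 223 test rows. The sketch below checks this on a placeholder list of the same length; `rows` is a stand-in, not the actual feature table.

```python
from sklearn.model_selection import train_test_split

# Placeholder for the 891 rows of the dataset.
rows = list(range(891))

# No test_size is given, so scikit-learn holds out 25% of the rows.
train_rows, test_rows = train_test_split(rows, random_state=75)
print(len(train_rows), len(test_rows))  # 668 223
```

If the class ratio should be preserved in both parts, `stratify=target` can be passed as well; the notebook does not do this.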

Create an `LGBMClassifier` and train it on `X_train` and `Y_train`.

In [30]:
lgbm = LGBMClassifier(random_state=75)
lgbm.fit(X_train, Y_train)

Feed the test data to the trained model and store the predictions in `Y_pred`.


In [31]:
Y_pred = lgbm.predict(X_test)

In [32]:
Y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0], dtype=int64)

Use the predictions `Y_pred` and the true labels `Y_test` to print accuracy, recall, and precision.

In [33]:
accuracy = accuracy_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)

print(f"accuracy: {accuracy}")
print(f"recall: {recall}")
print(f"precision: {precision}")

accuracy: 0.8071748878923767
recall: 0.6756756756756757
precision: 0.7246376811594203
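
To make the three scores concrete, the sketch below recomputes them from confusion-matrix counts. The counts (50 true positives, 19 false positives, 24 false negatives, 130 true negatives) are not printed by the notebook; they are inferred here from the reported scores and the 223 test predictions, so treat them as an assumption. The F1 score is included because `f1_score` is imported but never called.

```python
# Confusion-matrix counts inferred from the reported LightGBM scores (assumption).
tp, fp, fn, tn = 50, 19, 24, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
recall = tp / (tp + fn)                     # share of actual survivors that were found
precision = tp / (tp + fp)                  # share of predicted survivors that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy: {accuracy:.4f}")    # accuracy: 0.8072
print(f"recall: {recall:.4f}")        # recall: 0.6757
print(f"precision: {precision:.4f}")  # precision: 0.7246
print(f"f1: {f1:.4f}")                # f1: 0.6993
```

Recall is lower than precision here, meaning the model misses more survivors than it wrongly flags.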


Create an `XGBClassifier` and train it on `X_train` and `Y_train`.

In [34]:
xgb = XGBClassifier(random_state=75)
xgb.fit(X_train, Y_train)

Feed the test data to the trained model and store the predictions in `Y_pred`.


In [35]:
Y_pred = xgb.predict(X_test)

Use the predictions `Y_pred` and the true labels `Y_test` to print accuracy, recall, and precision.

In [36]:
accuracy = accuracy_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)

print(f"accuracy: {accuracy}")
print(f"recall: {recall}")
print(f"precision: {precision}")

accuracy: 0.820627802690583
recall: 0.6891891891891891
precision: 0.75


Create a `DecisionTreeClassifier` and train it on `X_train` and `Y_train`.

In [37]:
tree = DecisionTreeClassifier(random_state=75)
tree.fit(X_train, Y_train)

Feed the test data to the trained model and store the predictions in `Y_pred`.


In [38]:
Y_pred = tree.predict(X_test)

Use the predictions `Y_pred` and the true labels `Y_test` to print accuracy, recall, and precision.

In [39]:
accuracy = accuracy_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)

print(f"accuracy: {accuracy}")
print(f"recall: {recall}")
print(f"precision: {precision}")

accuracy: 0.7623318385650224
recall: 0.6486486486486487
precision: 0.64
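
The fit, predict, and score steps are repeated for each of the three models. One way to avoid the repetition is a small helper, sketched below. The function name `evaluate` is hypothetical, and the example runs on synthetic data from `make_classification` with a decision tree only; an `LGBMClassifier` or `XGBClassifier` could be passed in the same way, since both are used through the same `fit` and `predict` calls in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return accuracy, recall, and precision on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
    }


# Synthetic binary-classification data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=75)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=75)

scores = evaluate(DecisionTreeClassifier(random_state=75), X_train, X_test, y_train, y_test)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Collecting the returned dictionaries in a list would also make it straightforward to compare the models side by side.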
