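## Q. Titanic analysis dataset for developing a survivor prediction model

#### Titanic data preprocessing
- Analysis data: titanic3.csv
- Write a reusable user-defined preprocessing function and use it to preprocess the data
    - Null handling: replace missing Age values with the mean age; replace missing values in the remaining columns with 'N'
    - Drop unnecessary attribute columns
    - Label-encode string columns
- Derive insights through statistical and visual exploration
- Use exploratory analysis for feature engineering and derived variables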
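
The preprocessing requirements above could be sketched as a single reusable function. This is a minimal sketch, not the notebook's own implementation: the function name `preprocess`, the `drop_cols` default, and the toy `sample` frame are hypothetical. It also assumes that 'N' is only meaningful for non-numeric columns, and it swaps in pandas category codes for scikit-learn's `LabelEncoder` (both assign codes in sorted order for strings).

```python
import pandas as pd

def preprocess(df, drop_cols=("name", "ticket")):
    """Hypothetical helper: fill nulls, drop unused columns, label-encode strings."""
    out = df.copy()
    # Age: replace missing values with the mean age
    out["age"] = out["age"].fillna(out["age"].mean())
    # Drop the columns that are not needed (ignore names that are absent)
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Remaining non-numeric columns: replace missing values with 'N' (assumption)
    text_cols = out.select_dtypes(exclude="number").columns
    out[text_cols] = out[text_cols].fillna("N")
    # Label-encode each string column as integer codes (sorted category order)
    for col in text_cols:
        out[col] = out[col].astype("category").cat.codes
    return out

# Toy data standing in for titanic3.csv
sample = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "name": ["a", "b", "c", "d"],
    "sex": ["female", "male", "female", "male"],
    "age": [20.0, None, 40.0, 30.0],
    "embarked": ["S", "C", None, "Q"],
})
clean = preprocess(sample)
print(clean)
```

Because the function works on a copy, the same call can be repeated on the raw frame whenever the list of dropped columns changes.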

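#### Column information

- survived : survival (1: survived, 0: died)
- pclass : ticket class (1: 1st, 2: 2nd, 3: 3rd)
- name : passenger name
- sex : passenger sex
- age : passenger age
- sibsp : number of siblings and spouses aboard
- parch : number of parents and children aboard
- ticket : ticket number
- fare : ticket fare
- cabin : cabin number
- embarked : port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)
- boat : lifeboat
- body : body identification number
- home.dest : home / destination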

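## Model performance improvement and evaluation

### Example points to review for improving the dataset
* Variables = 'age_cat', 'male', 'female', 'fare_cat', 'family', 'town_C', 'town_Q', 'town_S'
* Replacing null age values with the mean was found to distort the overall distribution severely.
* When pclass and fare_cat are included in the model together, accuracy drops (0.82); when they are included one at a time, fare_cat gives the higher accuracy (0.82, a 3% difference). The suspected cause is that passengers whose fares are high enough to be near-outliers show a death rate similar to third-class passengers, which the pclass variable does not capture.
* One-hot encoding (dummies) for sex and embarked, instead of feeding them in as single variables, raises accuracy by roughly 5% and makes decision-tree splits easier.
* Using parch and sibsp as separate variables shows no meaningful pattern and lowers accuracy. Combining them into a derived variable, family, shows that even female passengers, who have a high survival rate overall, survive less often as the number of family members grows.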
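
The one-hot encoding and the `family` variable from the notes above could be built roughly as follows. This is a sketch on made-up rows, not the notebook's data: the `toy` frame is hypothetical, and the `town` prefix is an assumption chosen to match the `town_C`, `town_Q`, `town_S` names in the variable list.

```python
import pandas as pd

# Made-up rows standing in for the Titanic data
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1, 0],
    "sex": ["female", "male", "female", "male", "female", "female"],
    "embarked": ["S", "C", "Q", "S", "C", "S"],
    "sibsp": [0, 1, 0, 4, 1, 3],
    "parch": [0, 0, 2, 2, 1, 2],
})

# One-hot encode sex -> columns 'female', 'male'
sex_dummies = pd.get_dummies(toy["sex"], dtype=int)
# One-hot encode embarked -> columns 'town_C', 'town_Q', 'town_S'
town_dummies = pd.get_dummies(toy["embarked"], prefix="town", dtype=int)

encoded = pd.concat(
    [toy.drop(columns=["sex", "embarked"]), sex_dummies, town_dummies], axis=1
)

# Derived variable: total number of relatives aboard
encoded["family"] = encoded["sibsp"] + encoded["parch"]

# Survival rate of female passengers by family size
print(encoded[encoded["female"] == 1].groupby("family")["survived"].mean())
```

On the real data, the final group-by is the kind of check that would show whether survival among women falls as family size grows.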

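### Example preprocessing changes
- Change how null age values are handled: from mean imputation to row deletion
- Choose between pclass and fare_cat as the analysis variable
- One-hot encode embarked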
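
One way to see why mean imputation distorts the age distribution is to compare it with row deletion on a small series. The values below are made up for illustration and are not taken from titanic3.csv.

```python
import numpy as np
import pandas as pd

# Made-up ages with three missing values
ages = pd.Series([2.0, 25.0, np.nan, 40.0, np.nan, 60.0, np.nan, 33.0], name="age")

imputed = ages.fillna(ages.mean())  # mean imputation
dropped = ages.dropna()             # row deletion

# Mean imputation piles every missing row onto one value and shrinks the spread
share_at_mean = (imputed == ages.mean()).mean()
print(f"std after dropping: {dropped.std():.2f}, "
      f"std after mean imputation: {imputed.std():.2f}")
print(f"share of rows sitting exactly at the mean: {share_at_mean:.3f}")
```

Deletion keeps the observed spread but costs rows, so the trade-off depends on how many ages are missing.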

In [222]:
import pandas as pd
titanic_df = pd.read_csv('./dataset/titanic3.csv')
titanic_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [223]:
df = titanic_df.copy()

In [224]:
# Drop the columns that are not used as model features
df.drop(['pclass','name','ticket','cabin','boat','body','home.dest'],axis=1,inplace=True)

In [225]:
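# Fill missing ages with the mean age truncated to an integer.
# Assign the result back: fillna(inplace=True) on a selected column is chained assignment.
df['age'] = df['age'].fillna(int(df['age'].mean()))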

In [226]:
# Fill missing embarkation ports with the most frequent port
most_em = df['embarked'].value_counts(dropna=True).idxmax()
df['embarked'] = df['embarked'].fillna(most_em)

In [227]:
# Fill missing fares with the mean fare
df['fare'] = df['fare'].fillna(df['fare'].mean())
df.isnull().sum()

survived    0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
dtype: int64

In [228]:
def age_cat(age):
    cat = ''
    if age <= 10: cat = 'young'
    elif age <= 20: cat = 'teen'
    elif age <= 30: cat = 'adult'
    elif age <= 60: cat = 'mature'
    else: cat = 'elder'
    return cat

df['age_cat'] = df['age'].apply(age_cat)
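
As an alternative to the if/elif chain, the same age bands can be written with `pd.cut`. This is a sketch with made-up ages; the `bins` and `labels` names are illustrative. The bin edges are right-inclusive, which matches the `<=` comparisons in `age_cat`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.92, 10, 20, 29, 30, 45, 60, 71])

# Right-inclusive intervals: (-inf, 10], (10, 20], (20, 30], (30, 60], (60, inf)
bins = [-np.inf, 10, 20, 30, 60, np.inf]
labels = ["young", "teen", "adult", "mature", "elder"]

age_band = pd.cut(ages, bins=bins, labels=labels)
print(age_band.tolist())
# ['young', 'young', 'teen', 'adult', 'adult', 'mature', 'mature', 'elder']
```

One difference to keep in mind: `pd.cut` leaves a missing age as missing, whereas the if/elif chain would send a NaN to the final `else` branch and label it 'elder'. Here the ages are already filled before binning, so the two agree.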

In [229]:
df['family'] = df['sibsp'] + df['parch']

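# Bin family size into four groups (the function gets its own name so it is not confused with the 'family' column)
def family_to_cat(size):
    if size <= 2: cat = 1
    elif size <= 4: cat = 2
    elif size <= 6: cat = 3
    else: cat = 4
    return cat

df['family_cat'] = df['family'].apply(family_to_cat)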

In [230]:
from sklearn.preprocessing import LabelEncoder
def Laen(a):
    # Label-encode a column: each distinct value becomes an integer code (sorted order)
    le = LabelEncoder()
    a = le.fit_transform(a)
    return a


df['sex'] = Laen(df.sex)
df['embarked'] = Laen(df.embarked)
df['age_cat'] = Laen(df.age_cat)
df.corr()

Unnamed: 0,survived,sex,age,sibsp,parch,fare,embarked,age_cat,family,family_cat
survived,1.0,-0.528693,-0.047227,-0.027825,0.08266,0.244208,-0.175313,0.121724,0.026876,-0.070919
sex,-0.528693,1.0,0.05567,-0.109609,-0.213125,-0.185484,0.09796,-0.085891,-0.188583,-0.097398
age,-0.047227,0.05567,1.0,-0.190464,-0.128572,0.175036,-0.067573,-0.178175,-0.195553,-0.193382
sibsp,-0.027825,-0.109609,-0.190464,1.0,0.373587,0.160224,0.065567,0.200094,0.861952,0.774986
parch,0.08266,-0.213125,-0.128572,0.373587,1.0,0.221522,0.044772,0.245551,0.792296,0.693298
fare,0.244208,-0.185484,0.175036,0.160224,0.221522,1.0,-0.23797,0.094164,0.226465,0.163028
embarked,-0.175313,0.09796,-0.067573,0.065567,0.044772,-0.23797,1.0,0.048581,0.067598,0.115604
age_cat,0.121724,-0.085891,-0.178175,0.200094,0.245551,0.094164,0.048581,1.0,0.265823,0.198438
family,0.026876,-0.188583,-0.195553,0.861952,0.792296,0.226465,0.067598,0.265823,1.0,0.888688
family_cat,-0.070919,-0.097398,-0.193382,0.774986,0.693298,0.163028,0.115604,0.198438,0.888688,1.0
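
When reading the `age_cat` row of the correlation table, it helps to know that `LabelEncoder` assigns codes in sorted (alphabetical) order for strings, so the codes do not follow age order. The sketch below shows that mapping and, as an assumption about what might be preferable, an ordered alternative using `pd.Categorical`; the `order` list is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cats = ["young", "teen", "adult", "mature", "elder"]

le = LabelEncoder()
le.fit(cats)
# Codes follow alphabetical order of the class names, not age order
mapping = {name: int(code) for name, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)
# {'adult': 0, 'elder': 1, 'mature': 2, 'teen': 3, 'young': 4}

# Ordered alternative: codes increase with age
order = ["young", "teen", "adult", "mature", "elder"]
ordinal = pd.Categorical(cats, categories=order, ordered=True).codes
print(ordinal.tolist())
# [0, 1, 2, 3, 4]
```

With the alphabetical codes, a linear correlation between `age_cat` and `survived` has no ordinal meaning. Tree-based models can still split on such codes, but linear models treat them as a scale.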


In [231]:
def fare_cat(f):
    cat = ''
    if f <= 10: cat = 5
    elif f <= 40: cat = 4
    elif f <= 60: cat = 3
    elif f <= 100: cat = 2
    else: cat = 1
    return cat

df['fare_cat'] = df['fare'].apply(fare_cat)
df.drop(['fare','age','embarked'],axis=1,inplace=True)

In [232]:
df

Unnamed: 0,survived,sex,sibsp,parch,age_cat,family,family_cat,fare_cat
0,1,0,0,0,0,0,1,1
1,1,1,1,2,4,3,2,1
2,0,0,1,2,4,3,2,1
3,0,1,1,2,0,3,2,1
4,0,0,1,2,0,3,2,1
...,...,...,...,...,...,...,...,...
1304,0,0,1,0,3,1,1,4
1305,0,0,1,0,0,1,1,4
1306,0,1,0,0,0,0,1,5
1307,0,1,0,0,0,0,1,5


In [233]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

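# Separate the independent variables from the dependent variable
y_t_df = df['survived'] # dependent variable (target)
X_t_df = df.drop('survived', axis = 1) # independent variables (features)

# Scale the independent variables (standardization; currently disabled)
# X_t_df = preprocessing.StandardScaler().fit(X_t_df).transform(X_t_df)

# Split into training and test sets at 8:2 or 7:3 (test_size = 0.2 gives 8:2)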
X_train, X_test, y_train, y_test = train_test_split(X_t_df, y_t_df, test_size = 0.2,
                                                   random_state = 11)

print(X_train.shape)
print(X_test.shape)


(1047, 7)
(262, 7)
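
A variation worth considering is a stratified split, which keeps the survived/died ratio the same in both subsets. This is a sketch on synthetic arrays, not the notebook's data; `stratify` is a standard `train_test_split` parameter, and the 40/60 class balance is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a 40/60 class balance (illustrative only)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 3))
y = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y
)

# Both subsets keep the original share of positive labels
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```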


In [264]:
# Train and evaluate the models
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# No random_state is set, so the random forest score can vary slightly between runs
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
# Built-in round() works whether accuracy_score returns a Python or a NumPy float
accuracy_rf = round(accuracy_score(y_test, rf_pred), 2)

lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)
lr_pred = lr_model.predict(X_test)
accuracy_lr = round(accuracy_score(y_test,lr_pred), 2)

print('RF accuracy:{}, LR accuracy:{}'.format(accuracy_rf,accuracy_lr))

RF accuracy:0.82, LR accuracy:0.79
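
A single train/test split can make differences of a few percentage points look larger or smaller than they are. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` in place of the Titanic features, and the hyperparameters (`n_estimators=50`, `max_iter=1000`, `cv=5`) are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 7-feature Titanic matrix
X, y = make_classification(n_samples=300, n_features=7, random_state=11)

models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=11),
    "lr": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    # Five accuracy values per model, one for each held-out fold
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores
    print(f"{name}: mean={cv_scores.mean():.2f}, std={cv_scores.std():.2f}")
```

Reporting the fold-to-fold standard deviation next to the mean shows whether a gap between the two models is larger than the noise from the split itself.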
