## Titanic Dataset Challenge

- The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.

- Download `train.csv` and `test.csv` from the [Titanic challenge](https://www.kaggle.com/c/titanic) on [Kaggle](https://www.kaggle.com).
- Save the two files in the `datasets` directory as `titanic_train.csv` and `titanic_test.csv`, respectively.
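
As a minimal sketch of the file layout described above, the two paths can be built relative to the `datasets` directory rather than typed out as absolute paths. The helper name `titanic_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def titanic_paths(data_dir="datasets"):
    """Return the expected locations of the two renamed Kaggle files.

    `titanic_paths` is an illustrative helper, not part of the notebook.
    """
    base = Path(data_dir)
    return {
        "train": base / "titanic_train.csv",
        "test": base / "titanic_test.csv",
    }

paths = titanic_paths()
# List any file that has not been downloaded yet.
missing = [name for name, p in paths.items() if not p.exists()]
```

Once both files exist, each path can be passed directly to `pd.read_csv`.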

## 1. Data Exploration

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### 1.1 Loading the data

In [11]:
df1 = pd.read_csv('C:/Users/User/Desktop/코딩/JAY 연습장/eximg/titanic_train.csv')
df2 = pd.read_csv('C:/Users/User/Desktop/코딩/JAY 연습장/eximg/titanic_test.csv')

#### 1.2 A first look at the training data (`df1`)

In [12]:
df1

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


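* **Survived**: the target. 0 means the passenger did not survive; 1 means the passenger survived.
* **Pclass**: passenger class (1st, 2nd, or 3rd).
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of siblings and spouses travelling with the passenger.
* **Parch**: number of children and parents travelling with the passenger.
* **Ticket**: ticket ID.
* **Fare**: ticket price (in pounds).
* **Cabin**: cabin number.
* **Embarked**: port where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton).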


#### 1.3 Checking for missing data

In [16]:
df1.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
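
Raw counts are easier to judge as fractions of the 891 rows: 177 missing `Age` values is roughly 20% of the data, while 687 missing `Cabin` values is roughly 77%. The sketch below shows one way to get both views. It uses a small made-up frame (`sample`) instead of `df1`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for df1.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

missing_count = sample.isnull().sum()   # number of NaN values per column
missing_ratio = sample.isnull().mean()  # fraction of NaN values per column
```

The mean of a boolean mask is the share of `True` values, which is why `isnull().mean()` yields the missing ratio.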

#### 1.4 Summary statistics

In [22]:
df1.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


#### 1.5 Frequency of values in the Survived column

In [25]:
df1['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
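
The counts above can also be read as proportions. As a sketch, the series below is rebuilt from the two counts (549 and 342) rather than loaded from the CSV, and `normalize=True` turns counts into fractions.

```python
import pandas as pd

# Rebuild the Survived column from the counts shown above (549 zeros, 342 ones).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns relative frequencies instead of raw counts.
rates = survived.value_counts(normalize=True)
```

The share for `1` agrees with the `Survived` mean of 0.383838 reported by `describe()`, because the mean of a 0/1 column is the fraction of ones.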

#### 1.6 Frequency of values in the categorical features
- **Pclass**, **Sex**, **Embarked**
- The **Embarked** feature is the port where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.

In [26]:
df1['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [27]:
df1['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [64]:
df1['Embarked'].value_counts(dropna=False)  # dropna=False also counts NaN values

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

#### 1.7 Viewing the Name and Age columns sorted by Age

In [35]:
df1[['Name','Age']].sort_values(by='Age').head(10)

Unnamed: 0,Name,Age
803,"Thomas, Master. Assad Alexander",0.42
755,"Hamalainen, Master. Viljo",0.67
644,"Baclini, Miss. Eugenie",0.75
469,"Baclini, Miss. Helene Barbara",0.75
78,"Caldwell, Master. Alden Gates",0.83
831,"Richards, Master. George Sibley",0.83
305,"Allison, Master. Hudson Trevor",0.92
827,"Mallet, Master. Andre",1.0
381,"Nakid, Miss. Maria (""Mary"")",1.0
164,"Panula, Master. Eino Viljami",1.0


#### 1.8 Name and Age of passengers aged 60 or older

In [40]:
df1[df1['Age']>=60][['Name','Age']].sort_values(by='Age')

Unnamed: 0,Name,Age
587,"Frolicher-Stehli, Mr. Maxmillian",60.0
694,"Weir, Col. John",60.0
684,"Brown, Mr. Thomas William Solomon",60.0
366,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",60.0
625,"Sutton, Mr. Frederick",61.0
326,"Nysveen, Mr. Johan Hansen",61.0
170,"Van der hoef, Mr. Wyckoff",61.0
570,"Harris, Mr. George",62.0
829,"Stone, Mrs. George Nelson (Martha Evelyn)",62.0
555,"Wright, Mr. George",62.0


In [65]:
df1.loc[df1['Age']>=60, ['Name','Age']].sort_values(by='Age')  # back-to-back `][` indexing can be written as one `.loc[rows, columns]` call; `.loc` is required for this form

Unnamed: 0,Name,Age
587,"Frolicher-Stehli, Mr. Maxmillian",60.0
694,"Weir, Col. John",60.0
684,"Brown, Mr. Thomas William Solomon",60.0
366,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",60.0
625,"Sutton, Mr. Frederick",61.0
326,"Nysveen, Mr. Johan Hansen",61.0
170,"Van der hoef, Mr. Wyckoff",61.0
570,"Harris, Mr. George",62.0
829,"Stone, Mrs. George Nelson (Martha Evelyn)",62.0
555,"Wright, Mr. George",62.0
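
The two cells above return the same rows. A small check, sketched on three passengers taken from the outputs in this notebook (the frame `sample` is a stand-in for `df1`):

```python
import pandas as pd

# Three rows copied from the outputs above.
sample = pd.DataFrame({
    "Name": ["Weir, Col. John", "Heikkinen, Miss. Laina", "Sutton, Mr. Frederick"],
    "Age": [60.0, 26.0, 61.0],
})

# Chained indexing: filter rows first, then pick columns from the result.
chained = sample[sample["Age"] >= 60][["Name", "Age"]]

# Single .loc call: row mask and column list in one step.
single = sample.loc[sample["Age"] >= 60, ["Name", "Age"]]

same = chained.equals(single)
```

For reading data both forms behave alike. When assigning values, the single `.loc` call is generally the safer choice because it operates on the original frame rather than on an intermediate copy.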


#### 1.9 Female first-class passengers aged 60 or older

In [55]:
a=df1[df1['Age']>=60][['Name','Age','Pclass','Sex']]
b=a[a['Pclass']==1][['Name','Age','Pclass','Sex']]
c=b[b['Sex']=='female'][['Name','Age','Pclass','Sex']]
c

Unnamed: 0,Name,Age,Pclass,Sex
275,"Andrews, Miss. Kornelia Theodosia",63.0,1,female
366,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",60.0,1,female
829,"Stone, Mrs. George Nelson (Martha Evelyn)",62.0,1,female


In [56]:
df1[(df1["Age"] >= 60) & (df1["Pclass"] == 1) & (df1["Sex"] == 'female')] 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
366,367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",female,60.0,1,0,110813,75.25,D37,C
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,
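
Each condition needs its own parentheses because `&` binds more tightly than comparison operators such as `>=` and `==`. As an alternative sketch, `DataFrame.query` expresses the same filter as a string. The frame below reuses a few rows from the outputs above and is only a stand-in for `df1`.

```python
import pandas as pd

# Rows copied from the outputs shown earlier in this notebook.
sample = pd.DataFrame({
    "Name": [
        "Andrews, Miss. Kornelia Theodosia",
        "Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",
        "Braund, Mr. Owen Harris",
        "Montvila, Rev. Juozas",
    ],
    "Age": [63.0, 60.0, 22.0, 27.0],
    "Pclass": [1, 1, 3, 2],
    "Sex": ["female", "female", "male", "male"],
})

# Boolean masks combined with & (element-wise AND).
mask = (sample["Age"] >= 60) & (sample["Pclass"] == 1) & (sample["Sex"] == "female")
by_mask = sample[mask]

# The same filter written as a query string.
by_query = sample.query("Age >= 60 and Pclass == 1 and Sex == 'female'")
```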


#### 1.10 Maximum and minimum fare (Fare)

In [58]:
df1['Fare'].min()

0.0

In [59]:
df1['Fare'].max()

512.3292
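
Both values can be obtained in a single call with `Series.agg`. A sketch on four fares that appear in this notebook (the series `fare` stands in for `df1['Fare']`):

```python
import pandas as pd

# Four fares taken from the outputs above, including the minimum and maximum.
fare = pd.Series([7.25, 71.2833, 512.3292, 0.0], name="Fare")

# agg with a list of function names returns one labelled value per function.
summary = fare.agg(["min", "max"])
```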

#### 1.11 Survival rate by passenger class (Pclass)

In [60]:
df1.groupby(["Pclass","Survived"]).size()

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
dtype: int64

In [68]:
df1['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [66]:
df1.groupby('Pclass')['Survived'].mean()  # select the column first; averaging every column fails on text columns in pandas 2.x

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
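
The survival rates follow directly from the grouped counts: for each class, divide the number of survivors by the class total. A sketch that redoes this arithmetic from the counts printed above (the column names `died` and `survived` are chosen here for illustration):

```python
import pandas as pd

# Counts copied from the groupby(["Pclass", "Survived"]).size() output.
counts = pd.DataFrame(
    {"died": [80, 97, 372], "survived": [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Survival rate = survivors / all passengers in that class.
rate = counts["survived"] / counts.sum(axis=1)
```

For first class this is 136 / 216, which is about 0.63, while for third class it is 119 / 491, about 0.24.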

## 2. Data Preprocessing (handling missing data, binning, etc.)

#### 2.1 Cabin column: dropping it entirely

In [71]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [63]:
df1.drop('Cabin', axis=1)  # returns a new DataFrame; df1 keeps Cabin unless the result is assigned back

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


In [70]:
df1.dropna(thresh=600, axis=1)  # keep only columns with at least 600 non-null values; Cabin has 204

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C
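
With `axis=1`, `thresh` is the minimum number of non-null values a column must have to be kept. A sketch on a small made-up frame, where a threshold of 3 plays the role that 600 plays for `df1`:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame: Age has 3 non-null values, Cabin has 1.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Keep columns with at least 3 non-null values; Cabin is dropped.
kept = sample.dropna(axis=1, thresh=3)
```

Like `drop`, `dropna` returns a new frame, so the result has to be assigned to a name to be used later.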


#### 2.2 Embarked column: filling missing values with the most frequent port

In [72]:
df1["Embarked"].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

In [74]:
a = df1["Embarked"].value_counts(dropna=False).idxmax()
a

'S'

In [76]:
df1['Embarked'] = df1['Embarked'].fillna(a)  # assign back; in-place fillna on a selected column is deprecated in recent pandas


In [77]:
df1["Embarked"].value_counts(dropna=False)

S    646
C    168
Q     77
Name: Embarked, dtype: int64
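
An alternative sketch for finding the most frequent value is `Series.mode`, which ignores missing values by default. The series below is a made-up example, not the real column.

```python
import pandas as pd

# Hypothetical Embarked values with one missing entry.
embarked = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# mode() returns a Series because ties are possible; take the first entry.
most_common = embarked.mode()[0]

# Fill the gap and keep the result.
filled = embarked.fillna(most_common)
```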

#### 2.3 Age column: filling missing values with the median

In [78]:
df1["Age"].isnull().sum()

177

In [79]:
df1["Age"] = df1["Age"].fillna(df1["Age"].median())  # assign back; in-place fillna on a selected column is deprecated in recent pandas
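
`median()` skips missing values, so the fill value is computed from the known ages only. For `df1` that value is 28.0, the 50% row of the `describe()` output. The sketch below uses a short made-up series to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
age = pd.Series([22.0, 38.0, np.nan, 35.0, 26.0], name="Age")

# NaN is ignored: the median of 22, 26, 35, 38 is (26 + 35) / 2 = 30.5.
median_age = age.median()

filled = age.fillna(median_age)
```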

#### 2.4 Age column: binning into categories

* 0 to 18 years
* 18 to 25 years
* 25 to 35 years
* 35 to 60 years
* 60 to 80 years

In [80]:
bins = [0,18, 25, 35, 60, 80]
group_names = ['Children','Youth', 'YoungAdult', 'MiddleAged', 'Senior']
age_cats = pd.cut(df1["Age"], bins, labels=group_names)
age_cats

0           Youth
1      MiddleAged
2      YoungAdult
3      YoungAdult
4      YoungAdult
          ...    
886    YoungAdult
887         Youth
888    YoungAdult
889    YoungAdult
890    YoungAdult
Name: Age, Length: 891, dtype: category
Categories (5, object): ['Children' < 'Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [81]:
age_cats.value_counts()  # the top-level pd.value_counts is deprecated in pandas 2.1+

YoungAdult    373
MiddleAged    195
Youth         162
Children      139
Senior         22
Name: Age, dtype: int64
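
The age ranges share their boundary values, so it helps to know which bin a boundary age falls into. By default `pd.cut` uses right-closed intervals, that is (0, 18], (18, 25], and so on. A sketch with a few boundary ages chosen for illustration:

```python
import pandas as pd

bins = [0, 18, 25, 35, 60, 80]
group_names = ['Children', 'Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Boundary ages: each upper edge belongs to the lower bin.
ages = pd.Series([0.42, 18.0, 18.5, 25.0, 60.0, 80.0])
cats = pd.cut(ages, bins, labels=group_names)
labels = list(cats)

# An age of exactly 0 lies outside (0, 18] and becomes NaN.
zero_is_missing = pd.cut(pd.Series([0.0]), bins, labels=group_names).isna().all()
```

This also explains the large `YoungAdult` count: the 177 ages filled with the median of 28.0 all fall into the (25, 35] bin.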

* Converting the categorical data into dummy variables (one-hot encoding)

In [82]:
Age_dummies = pd.get_dummies(age_cats, dtype=int)  # dtype=int yields 0/1 columns; pandas 2.x defaults to True/False
Age_dummies = Age_dummies.add_prefix('Age_')
Age_dummies

Unnamed: 0,Age_Children,Age_Youth,Age_YoungAdult,Age_MiddleAged,Age_Senior
0,0,1,0,0,0
1,0,0,0,1,0
2,0,0,1,0,0
3,0,0,1,0,0
4,0,0,1,0,0
...,...,...,...,...,...
886,0,0,1,0,0
887,0,1,0,0,0
888,0,0,1,0,0
889,0,0,1,0,0
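
The dummy columns are only useful once they sit next to the other features. One possible next step, sketched here on three passengers from the first table (the frame `sample` stands in for `df1`), is to join them with `pd.concat`:

```python
import pandas as pd

bins = [0, 18, 25, 35, 60, 80]
group_names = ['Children', 'Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Three rows copied from the first table in this notebook.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Th...",
             "Heikkinen, Miss. Laina"],
    "Age": [22.0, 38.0, 26.0],
})

age_cats = pd.cut(sample["Age"], bins, labels=group_names)

# prefix="Age" produces the same column names as add_prefix('Age_').
dummies = pd.get_dummies(age_cats, prefix="Age", dtype=int)

# Align on the index and place the dummy columns beside the originals.
combined = pd.concat([sample, dummies], axis=1)
```

Because `age_cats` is categorical, `get_dummies` emits a column for every category, including ones with no rows in this small sample.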


#### 2.5 Checking for duplicate rows

In [83]:
df1.duplicated().sum()

0
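
`duplicated()` marks a row as `True` when an identical row appeared earlier, so a sum of 0 means every row is unique. The training data has no duplicates, so the sketch below uses a made-up frame with one repeated row to show what a non-zero result looks like and how `drop_duplicates` would remove it.

```python
import pandas as pd

# Hypothetical frame in which the last row repeats the second one.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 2],
    "Sex": ["male", "female", "female"],
})

n_duplicates = sample.duplicated().sum()  # counts repeats, not first occurrences
deduped = sample.drop_duplicates()        # keeps the first occurrence of each row
```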