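### Problem definition (goal setting)

- Train a model on the Titanic data to predict which passengers survived and which died.  
- Walk through the whole machine learning workflow to understand the process.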

In [1]:
# Import the required libraries (numpy / pandas / matplotlib / seaborn)

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

### Data collection

In [2]:
# data = pd.read_csv('./data/gender_submission.csv')
data = pd.read_csv('./data/train.csv', index_col='PassengerId')

data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
print(data.shape)

(891, 11)


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [5]:
data.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Load the data, using the existing PassengerId column as the index
train = pd.read_csv('./data/train.csv', index_col='PassengerId')
test = pd.read_csv('./data/test.csv', index_col='PassengerId')

In [8]:
# Check the data (shape)
train.shape, test.shape

((891, 11), (418, 10))

In [9]:
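# Question: why do the two files have a different number of columns?
# test.csv has no Survived column: Kaggle holds the answers for the test set.
# We predict on the test data and submit those predictions to Kaggle.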
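
The column mismatch can also be checked programmatically. The sketch below uses two tiny made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the real files) and `Index.difference` to list the columns that exist only in the training frame.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv / test.csv with only a few columns
train_demo = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Sex": ["male", "female"]})
test_demo = pd.DataFrame({"Pclass": [3, 2], "Sex": ["male", "female"]})

# Columns present in the training frame but absent from the test frame
missing_in_test = train_demo.columns.difference(test_demo.columns)
print(list(missing_in_test))
```

Run against the real `train` and `test` frames, the same call should report only the label column.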

In [10]:
train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**Structure of the Titanic data**
- It consists of three files.  
- train.csv : training data
- test.csv : evaluation data
- gender_submission.csv : sample submission file showing the required format

In [11]:
# Look at the columns of the loaded data
train.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [12]:
test.columns

Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')

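**Column information for the Titanic data**
- 'PassengerId' : passenger number
- 'Survived' : survival (1 : survived, 0 : died) / exists only in train!
- 'Pclass' : ticket class (1 to 3) / 1 : first class
- 'Name' : name
- 'Sex' : sex
- 'Age' : age
- 'SibSp' : number of siblings or spouses aboard
- 'Parch' : number of parents or children aboard
- 'Ticket' : ticket number
- 'Fare' : fare the passenger paid
- 'Cabin' : cabin number
- 'Embarked' : port of embarkation (C : Cherbourg / Q : Queenstown / S : Southampton)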

In [13]:
# Brief summary of the DataFrame
# train
train.info()

# Some columns contain missing values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [14]:
# test
test.info()
test.describe()

# There are missing values here too

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB


,Pclass,Age,SibSp,Parch,Fare
count,418.0,332.0,418.0,418.0,417.0
mean,2.26555,30.27259,0.447368,0.392344,35.627188
std,0.841838,14.181209,0.89676,0.981429,55.907576
min,1.0,0.17,0.0,0.0,0.0
25%,1.0,21.0,0.0,0.0,7.8958
50%,3.0,27.0,0.0,0.0,14.4542
75%,3.0,39.0,1.0,0.0,31.5
max,3.0,76.0,8.0,9.0,512.3292


**Missing value summary**
- train : Age / Cabin / Embarked  
- test : Age / Fare / Cabin  
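
A summary like the one above can be produced directly instead of reading it off `info()`. This is a minimal sketch on a made-up frame (`demo` and its values are assumptions for illustration); `isna().sum()` counts the missing entries per column.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in Age and Cabin, none in Fare
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

missing_counts = demo.isna().sum()                 # missing entries per column
missing_only = missing_counts[missing_counts > 0]  # keep only columns with gaps
print(missing_only)
```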

### Data preprocessing

**Exploring Age**

In [15]:
# Look at Age in train

train['Age']


PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     NaN
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

In [16]:
# Look at Age in test

test['Age']

PassengerId
892     34.5
893     47.0
894     62.0
895     27.0
896     22.0
        ... 
1305     NaN
1306    39.0
1307    38.5
1308     NaN
1309     NaN
Name: Age, Length: 418, dtype: float64

In [17]:
# Descriptive statistics for Age
train['Age'].describe()
# Age has missing values to fill; the distribution leans toward the low end (mean 29.7 > median 28.0).

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

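**Characteristics of the Age column**
- The dtype is float even though it is an age (the column contains fractional values such as 0.42, and NaN).
- The ages are concentrated toward the low end (the distribution is skewed).
- Values range from 0.42 to 80, which is too wide for a single overall mean to be a good fill value.
- Instead, look at the correlation with other columns and use the related ones to fill the missing values in more detail.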
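
One way to check a skew claim like this is to compare the mean with the median and look at the sample skewness. The sketch below uses a made-up series (`ages` is an assumption, not the Titanic data): a mean above the median together with a positive `skew()` suggests a long tail toward high values.

```python
import pandas as pd

# Made-up ages: most values are small, one is large
ages = pd.Series([1.0, 2.0, 3.0, 4.0, 30.0])

mean_age = ages.mean()      # pulled upward by the large value
median_age = ages.median()  # robust to the large value
skewness = ages.skew()      # positive when the tail is on the high side

print(mean_age, median_age, skewness)
```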


In [18]:
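# Correlation: quantifies how strongly each pair of features moves together (range: -1 = inverse, +1 = direct)
# The larger the absolute value (the closer to 1), the stronger the relationship.
train.corr(numeric_only=True)

# Only numeric columns appear in the result.
# numeric_only=True makes that explicit; newer pandas versions no longer drop non-numeric columns automatically.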


,Survived,Pclass,Age,SibSp,Parch,Fare
Survived,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0
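
To pick the column most related to Age without scanning the matrix by eye, the Age column of the correlation matrix can be ranked by absolute value. This is a sketch on a made-up frame (`demo` and its values are assumptions); `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Made-up frame: Age falls as Pclass rises, SibSp is unrelated to Age
demo = pd.DataFrame({
    "Age": [40.0, 38.0, 30.0, 22.0, 20.0],
    "Pclass": [1, 1, 2, 3, 3],
    "SibSp": [0, 1, 0, 1, 0],
    "Sex": ["male", "female", "male", "female", "male"],  # non-numeric, excluded
})

corr = demo.corr(numeric_only=True)

# Rank the other columns by the strength of their correlation with Age
ranked = corr["Age"].drop("Age").abs().sort_values(ascending=False)
print(ranked)
```

In the matrix above the same ranking puts Pclass first for Age (-0.369), which is why it is used as a grouping key next.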


In [19]:
# Build a pivot table from the columns most strongly related to Age
pt1 = train.pivot_table(values='Age',             # column to aggregate
                        index=['Pclass', 'Sex'],  # multi-index: split by Pclass first, then by Sex
                        # Sex is added because it strongly affects survival
                        aggfunc='mean'            # aggregation function (mean / sum / count)
                        )
pt1

Pclass,Sex,Age
1,female,34.611765
1,male,41.281386
2,female,28.722973
2,male,30.740707
3,female,21.75
3,male,26.507589


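First-class men average about 41 years.  
Second-class women average about 28.7 years.  
Third-class men average about 26.5 years.  
= Do older passengers tend to travel in a better class (a lower Pclass number)?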
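
The same per-group means can also be sketched with `groupby`, which some readers find easier to follow than `pivot_table`. The frame below is made up for illustration (`demo` is an assumption); both calls skip NaN when averaging.

```python
import numpy as np
import pandas as pd

# Made-up passengers; one third-class man has no recorded age
demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "female", "male", "male", "female"],
    "Age": [40.0, 42.0, 35.0, 26.0, np.nan, 22.0],
})

# Pivot table: a DataFrame with a (Pclass, Sex) multi-index and one Age column
by_pivot = demo.pivot_table(values="Age", index=["Pclass", "Sex"], aggfunc="mean")

# groupby: a Series with the same multi-index
by_group = demo.groupby(["Pclass", "Sex"])["Age"].mean()

print(by_pivot)
print(by_group)
```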

In [20]:
# Indexing into the multi-index
pt1.loc[1, 'male']

Age    41.281386
Name: (1, male), dtype: float64
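
Note that the lookup above returns a one-element Series, not a number. A minimal sketch of the difference, using a hypothetical two-row table `pt` shaped like `pt1`: passing the row key as a tuple and naming the column yields a scalar.

```python
import pandas as pd

# Hypothetical table shaped like pt1: (Pclass, Sex) multi-index, one Age column
pt = pd.DataFrame(
    {"Age": [34.6, 41.3]},
    index=pd.MultiIndex.from_tuples([(1, "female"), (1, "male")], names=["Pclass", "Sex"]),
)

row = pt.loc[(1, "male")]           # Series holding every column of that row
value = pt.loc[(1, "male"), "Age"]  # single scalar value

print(type(row).__name__, value)
```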

In [21]:
# Check for missing values with pd.isna()
pd.isna(train['Age'])

PassengerId
1      False
2      False
3      False
4      False
5      False
       ...  
887    False
888    False
889     True
890    False
891    False
Name: Age, Length: 891, dtype: bool

In [22]:
# Inspect the rows with boolean indexing
train[pd.isna(train['Age'])]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...
860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


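train[...] : the brackets normally take labels or positions for indexing.  
Passing a boolean Series instead of labels is boolean indexing.

The DataFrame above contains only the rows where the mask is True.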
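
As a small illustration of boolean indexing, the sketch below builds a mask on a made-up frame (`demo` is an assumption), counts the True entries, and selects the matching rows.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing ages
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, np.nan, np.nan]})

mask = demo["Age"].isna()    # boolean Series: True where Age is missing
n_missing = int(mask.sum())  # each True counts as 1
missing_rows = demo[mask]    # keeps only the rows where the mask is True

print(n_missing)
print(missing_rows)
```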

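- Age is missing in 177 rows, so filling each one by hand according to its Pclass and Sex is impractical.
- We use the apply function to handle them all at once.
- apply() : applies a function to a pandas object (here, to each row)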
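
A minimal sketch of how `apply` behaves with `axis=1`, on a made-up frame (`demo` and `describe_row` are hypothetical names): the function receives one row at a time as a Series, and the return values are collected into a new Series.

```python
import pandas as pd

demo = pd.DataFrame({"Pclass": [1, 3], "Sex": ["male", "female"]})

def describe_row(row):
    # With axis=1, `row` is a Series for one row; values are accessed by column name
    return f"{row['Pclass']}-{row['Sex']}"

# One call of describe_row per row; the results share the frame's index
labels = demo.apply(describe_row, axis=1)
print(labels)
```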

In [23]:
# Define a function that fills in a missing age
def fill_age(row):  # row: a single row of train or test (apply passes it in when axis=1)
    # If Age is missing in this row, look up the group mean in the pivot table
    if pd.isna(row['Age']):
        # Return the scalar mean age for this row's (Pclass, Sex) group
        return pt1.loc[(row['Pclass'], row['Sex']), 'Age']
    # If Age is present, keep the existing value
    else:
        return row['Age']

In [24]:
# Fill the missing Age values in train
train['Age'] = train.apply(fill_age, axis=1).astype('int64')
# axis=1 : call the function once per row

In [25]:
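# Fill the missing Age values in test (using the group means computed from train)
test['Age'] = test.apply(fill_age, axis=1).astype('int64')

# astype('int64') : converts float to int64 (the fractional part is dropped)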
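
As a possible alternative to the row-wise `apply`, the same group-mean fill can be sketched with `groupby(...).transform('mean')` and `fillna`. The frame below is made up (`demo` is an assumption), and unlike the notebook it computes the means from the frame being filled rather than from `pt1`, so it is not a drop-in replacement for the test set.

```python
import numpy as np
import pandas as pd

# Made-up passengers with one missing age per group
demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 42.0, np.nan, 20.0, 24.0, np.nan],
})

# Mean age of each row's (Pclass, Sex) group, aligned to the original rows
group_mean = demo.groupby(["Pclass", "Sex"])["Age"].transform("mean")

# Replace only the missing ages; existing values are left untouched
demo["Age_filled"] = demo["Age"].fillna(group_mean)
print(demo)
```

This avoids calling a Python function once per row, which tends to matter more on larger frames than on this one.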

In [26]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    int64  
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(1), int64(5), object(5)
memory usage: 83.5+ KB


In [27]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       418 non-null    int64  
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(1), int64(4), object(5)
memory usage: 35.9+ KB


In [28]:
train['Age']

PassengerId
1      22
2      38
3      26
4      35
5      35
       ..
887    27
888    19
889    21
890    26
891    32
Name: Age, Length: 891, dtype: int64

In [29]:
train['Age'].describe()

count    891.000000
mean      29.191919
std       13.313598
min        0.000000
25%       21.000000
50%       26.000000
75%       36.000000
max       80.000000
Name: Age, dtype: float64
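
The minimum of 0 in the summary above follows from the integer cast: `astype('int64')` drops the fractional part, so 0.42 becomes 0 and a filled mean of 21.75 becomes 21. The sketch below contrasts this with rounding first; the rounding variant is an assumption about what might be preferred, not what the notebook does.

```python
import pandas as pd

# Values taken from the outputs above: youngest age and two group means
ages = pd.Series([0.42, 21.75, 26.507589])

truncated = ages.astype("int64")        # fractional part dropped
rounded = ages.round().astype("int64")  # rounded to the nearest integer first

print(list(truncated), list(rounded))
```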

**Filling in Embarked**
- There are 2 missing values