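Building a survival prediction model
Using the training data (X_train, y_train), build a survival prediction model, apply it to the evaluation data (X_test), and write the resulting predictions to a CSV file in the following format. (Submitted models are scored on accuracy.)
(a) Provided data

    y_train: survival status (training)
    X_train, X_test: passenger information (training and evaluation)
(b) Data format and contents
    y_train (712 passengers)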

In [104]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [105]:
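def exam_data_load(df, target, id_name="", null_name=""):
    """Split df into exam-style train/test features and labels."""
    # Without an ID column, turn the index into one named "id".
    if id_name == "":
        df = df.reset_index().rename(columns={"index": "id"})
        id_name = "id"

    # Replace the placeholder used for missing values with NaN.
    if null_name != "":
        df[df == null_name] = np.nan

    # Fixed 80/20 split so the result is reproducible.
    X_train, X_test = train_test_split(df, test_size=0.2, random_state=2021)

    # Labels keep the ID column so they can be matched back to the features.
    y_train = X_train[[id_name, target]]
    X_train = X_train.drop(columns=[target])

    y_test = X_test[[id_name, target]]
    X_test = X_test.drop(columns=[target])
    return X_train, X_test, y_train, y_test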

In [106]:
path = '/Users/jochaeyeon/Downloads/titanic/train.csv'
df = pd.read_csv(path)
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [107]:
X_train, X_test, y_train, y_test = exam_data_load(df, target='Survived', id_name='PassengerId')

# PassengerId is the column that uniquely identifies each passenger in the Titanic dataset,
# so its values are never duplicated.

In [108]:
X_train.shape, y_train.shape, X_test.shape

((712, 11), (712, 2), (179, 11))

In [109]:
X_train.head() 

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
90,91,3,"Christmann, Mr. Emil",male,29.0,0,0,343276,8.05,,S
103,104,3,"Johansson, Mr. Gustaf Joel",male,33.0,0,0,7540,8.6542,,S
577,578,1,"Silvey, Mrs. William Baird (Alice Munger)",female,39.0,1,0,13507,55.9,E44,S
215,216,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C
191,192,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S


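What is an object variable?
In pandas, the object dtype is the one normally used for columns that hold strings.
Such a column can contain either of two kinds of data:

Categorical variables: data whose values fall into a small, fixed set of categories.
e.g. Sex ("male", "female"), Embarked ("S", "C", "Q")

Plain string data: free text that does not represent a meaningful grouping.
e.g. Name ("John Doe"), Ticket ("A/5 21171")

In short, anything that can be sorted into a handful of categories is a categorical variable!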
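One rough way to tell the two kinds of string column apart is to count distinct values: a categorical column has only a few, while a free-text column has close to one per row. The sketch below uses a small hand-made frame; the rows and the threshold `MAX_CATEGORIES = 3` are illustrative assumptions, not values from the notebook.

```python
import pandas as pd

# Hypothetical sample rows that mimic a few Titanic columns.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Allen, Mr. William Henry", "Newell, Miss. Madeleine"],
    "Sex": ["male", "female", "male", "female"],
    "Embarked": ["S", "S", "S", "C"],
    "Age": [22.0, 26.0, 35.0, 31.0],
})

# Keep only the non-numeric (string) columns.
string_cols = sample.select_dtypes(exclude="number")

# Count distinct values in each of those columns.
n_unique = string_cols.nunique()

# Assumed threshold: at most 3 distinct values counts as categorical.
MAX_CATEGORIES = 3
categorical_cols = n_unique[n_unique <= MAX_CATEGORIES].index.tolist()
text_cols = n_unique[n_unique > MAX_CATEGORIES].index.tolist()

print(categorical_cols)
print(text_cols)
```

On real data a threshold like this is only a starting point; a column such as Ticket still needs a look by eye.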

In [110]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 90 to 116
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Pclass       712 non-null    int64  
 2   Name         712 non-null    object 
 3   Sex          712 non-null    object 
 4   Age          575 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Ticket       712 non-null    object 
 8   Fare         712 non-null    float64
 9   Cabin        170 non-null    object 
 10  Embarked     711 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 66.8+ KB


In [111]:
y_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 90 to 116
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  712 non-null    int64
 1   Survived     712 non-null    int64
dtypes: int64(2)
memory usage: 16.7 KB


In [112]:
y_train["Survived"].value_counts()
# Count how many times each value (0 and 1) appears in the Survived column of y_train

Survived
0    441
1    271
Name: count, dtype: int64
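These counts also give a baseline to compare any model against: always predicting the majority class (0) would be right for 441 of the 712 passengers. A minimal sketch that rebuilds the labels from the counts above and computes that proportion (the variable names are illustrative):

```python
import pandas as pd

# Rebuild the labels from the counts shown above: 441 zeros and 271 ones.
survived = pd.Series([0] * 441 + [1] * 271, name="Survived")

# normalize=True returns proportions instead of raw counts.
proportions = survived.value_counts(normalize=True)

# Accuracy obtained by always predicting the most common class.
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 4))  # 441 / 712, about 0.6194
```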

In [113]:
y = y_train["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
# One-hot encode the string column (Sex); numeric columns pass through unchanged.
X = pd.get_dummies(X_train[features])
test = pd.get_dummies(X_test[features])
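Encoding the training and evaluation frames separately works here because both contain every category, but in general a category missing from one side produces mismatched columns. The sketch below shows one common way to guard against that, using `reindex` on two hypothetical toy frames (not the notebook's data):

```python
import pandas as pd

# Hypothetical tiny frames: the test rows contain no "female" passenger.
train_part = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]})
test_part = pd.DataFrame({"Pclass": [1, 3], "Sex": ["male", "male"]})

X_enc = pd.get_dummies(train_part)
test_enc = pd.get_dummies(test_part)  # no Sex_female column is created here

# Match the training columns and their order; missing dummies become 0.
test_aligned = test_enc.reindex(columns=X_enc.columns, fill_value=0)

print(list(X_enc.columns))
print(list(test_aligned.columns))
```

With the notebook's variables the same idea would read `test = test.reindex(columns=X.columns, fill_value=0)`.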

In [114]:
# X_test is still the raw frame (11 columns); the encoded evaluation features are in `test`.
X.shape, X_test.shape

((712, 5), (179, 11))

In [115]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=2021)
model.fit(X, y)
# Predict survival for the encoded evaluation features.
predictions = model.predict(test)

In [116]:
# Accuracy on the training data itself; it tends to overstate accuracy on unseen data.
model.score(X, y)

0.8356741573033708
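Because this score is measured on the rows the model was fitted to, it may be optimistic. Cross-validation gives a fairer estimate by scoring each fold on rows held out from training. The sketch below runs on synthetic stand-in data (`X_demo` and `y_demo` are made up for illustration and are not the Titanic features), so its score says nothing about the model above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded features (not the Titanic data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=200),
    "Sex_male": rng.integers(0, 2, size=200),
})
# Synthetic label: tied to Sex_male, with about 20% of labels flipped as noise.
y_demo = ((X_demo["Sex_male"] == 0) ^ (rng.random(200) < 0.2)).astype(int)

model_demo = RandomForestClassifier(n_estimators=50, max_depth=7, random_state=2021)

# 5-fold cross-validation: each fold is scored on rows the model did not see.
cv_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean())
```

In the notebook the equivalent call would be `cross_val_score(model, X, y, cv=5, scoring="accuracy")`.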

In [117]:
output = pd.DataFrame({'PassengerId': X_test.PassengerId, 'Survived': predictions})
output.head()

,PassengerId,Survived
210,211,0
876,877,0
666,667,0
819,820,0
736,737,0


In [118]:
output.to_csv("1234567.csv", index=False)
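Before submitting, it can help to read the file back and confirm that it has exactly the two expected columns and one row per evaluation passenger. A minimal sketch, using a hypothetical three-row frame and a temporary file name (`submission_demo.csv`) rather than the real submission:

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions for three passengers (illustration only).
output_demo = pd.DataFrame({"PassengerId": [211, 877, 667], "Survived": [0, 0, 0]})

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "submission_demo.csv")
    output_demo.to_csv(csv_path, index=False)
    # Read the file back exactly as a grader would see it.
    saved = pd.read_csv(csv_path)

print(list(saved.columns))
print(len(saved))
```

For the real file, the row count should equal `len(X_test)`, which is 179 here.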