1. Using KNN and random forest with every hyperparameter left at its default, predict survival on the Titanic data. Which model performs better?

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [12]:
df = pd.read_csv('titanic.csv')

In [13]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [14]:
df = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

In [16]:
# Fill missing ages with the mean of the observed ages.
imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']])

In [17]:
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])
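# Assign the result back; chained inplace fillna triggers warnings in newer pandas and may not update df.
df['Embarked'] = df['Embarked'].fillna('S')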
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])

In [18]:
X = df.drop(columns='Survived')
y = df['Survived']

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
knn = KNeighborsClassifier()  # all defaults (n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

In [21]:
# All defaults (n_estimators=100); random_state is not set, so the score can vary between runs.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

In [22]:
accuracy_knn = accuracy_score(y_test, y_pred_knn)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

In [23]:
print(f"KNN accuracy: {accuracy_knn:.4f}")
print(f"Random Forest accuracy: {accuracy_rf:.4f}")

KNN accuracy: 0.7207
Random Forest accuracy: 0.8045
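
Accuracy is the fraction of test rows predicted correctly, so the two scores above can be turned back into counts. The sketch below assumes the standard 891-row Titanic file, which gives a 179-row test set at `test_size=0.2`. That row count is an assumption (the notebook does not state it), although both printed accuracies are consistent with it.

```python
import math

# Assumption: the CSV holds the standard 891 Titanic rows (not stated in the notebook).
n_rows = 891
# train_test_split rounds the test share up, so 20% of 891 rows is 179 test rows.
n_test = math.ceil(0.2 * n_rows)

# Accuracies printed by the cell above.
reported = {"KNN": 0.7207, "Random Forest": 0.8045}

# accuracy = correct / n_test, so correct = round(accuracy * n_test).
correct = {name: round(acc * n_test) for name, acc in reported.items()}

for name, hits in correct.items():
    print(f"{name}: {hits}/{n_test} correct ({hits / n_test:.4f})")

gap = correct["Random Forest"] - correct["KNN"]
print(f"Random Forest gets {gap} more test passengers right than KNN.")
```

On this split the default random forest is the better model: under the stated assumption it classifies 144 of 179 test passengers correctly, against 129 for the default KNN.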


2. This time, let's set the hyperparameters. For KNN, vary the value of k; for random forest, pick any one hyperparameter that Jaemin wants. Try 5 different values for each hyperparameter. That gives 5 KNN models and 5 random forest models with different hyperparameters, right? Which of them performs best?

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [25]:
data = pd.read_csv('titanic.csv')

In [26]:
data = data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

In [27]:
imputer = SimpleImputer(strategy='mean')
data['Age'] = imputer.fit_transform(data[['Age']])

In [28]:
label_encoder = LabelEncoder()
data['Sex'] = label_encoder.fit_transform(data['Sex'])
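# Assign the result back; chained inplace fillna triggers warnings in newer pandas and may not update data.
data['Embarked'] = data['Embarked'].fillna('S')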
data['Embarked'] = label_encoder.fit_transform(data['Embarked'])

In [29]:
X = data.drop(columns='Survived')
y = data['Survived']

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [31]:
knn_params = [3, 5, 7, 9, 11]  # candidate values of k (n_neighbors)
rf_params = [50, 100, 200, 300, 400]  # candidate values of n_estimators

In [32]:
best_knn_accuracy = 0
best_knn_model = None
# Fit one KNN per candidate k and keep the most accurate one on the test split.
# The strict `>` keeps the first (smallest) k when accuracies tie.
for k in knn_params:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred_knn = knn.predict(X_test)
    accuracy_knn = accuracy_score(y_test, y_pred_knn)
    print(f"KNN (k={k}) accuracy: {accuracy_knn:.4f}")
    if accuracy_knn > best_knn_accuracy:
        best_knn_accuracy = accuracy_knn
        best_knn_model = knn

KNN (k=3) accuracy: 0.7151
KNN (k=5) accuracy: 0.7207
KNN (k=7) accuracy: 0.7207
KNN (k=9) accuracy: 0.7263
KNN (k=11) accuracy: 0.7263
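
The loop above picks k by accuracy on the test split, so the same rows both choose k and report its score. One common alternative is to choose k by cross-validation on the training rows only and use the test split once at the end. The sketch below illustrates this with `GridSearchCV`. It runs on synthetic data from `make_classification` as a stand-in for the Titanic features, and `cv=5` and the sample sizes are illustrative choices, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic features (assumption, for a self-contained demo).
X, y = make_classification(n_samples=400, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same five candidate k values as knn_params in the notebook.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}

# 5-fold cross-validation on the training rows only selects k.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
# The test split is used once, to score the already-chosen model.
test_accuracy = search.score(X_test, y_test)

print(f"best k by cross-validation: {best_k}")
print(f"mean CV accuracy: {search.best_score_:.4f}")
print(f"held-out test accuracy: {test_accuracy:.4f}")
```

With the notebook's own `X_train` and `y_train` in place of the synthetic arrays, the same pattern would apply to the random forest by swapping the estimator and using `{"n_estimators": [50, 100, 200, 300, 400]}` as the grid.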


In [33]:
best_rf_accuracy = 0
best_rf_model = None
# Fit one random forest per candidate n_estimators and keep the most accurate one.
# random_state=42 makes each fit reproducible.
for n in rf_params:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    y_pred_rf = rf.predict(X_test)
    accuracy_rf = accuracy_score(y_test, y_pred_rf)
    print(f"Random Forest (n_estimators={n}) accuracy: {accuracy_rf:.4f}")
    if accuracy_rf > best_rf_accuracy:
        best_rf_accuracy = accuracy_rf
        best_rf_model = rf

Random Forest (n_estimators=50) accuracy: 0.8156
Random Forest (n_estimators=100) accuracy: 0.8101
Random Forest (n_estimators=200) accuracy: 0.8045
Random Forest (n_estimators=300) accuracy: 0.7989
Random Forest (n_estimators=400) accuracy: 0.8045
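
To answer the question directly, the ten accuracies printed above can be collected and compared. The sketch below copies those numbers by hand. The 179-row test-set size is an assumption (20% of the standard 891-row Titanic file) and is used only to express accuracy gaps as passenger counts.

```python
# Accuracies copied from the two loops above: (model, hyperparameter value) -> accuracy.
results = {
    ("KNN", 3): 0.7151,
    ("KNN", 5): 0.7207,
    ("KNN", 7): 0.7207,
    ("KNN", 9): 0.7263,
    ("KNN", 11): 0.7263,
    ("Random Forest", 50): 0.8156,
    ("Random Forest", 100): 0.8101,
    ("Random Forest", 200): 0.8045,
    ("Random Forest", 300): 0.7989,
    ("Random Forest", 400): 0.8045,
}

# max() returns the first entry among ties, matching the strict `>` in the loops.
best = max(results, key=results.get)
best_knn = max((key for key in results if key[0] == "KNN"), key=results.get)

# Assumption: 179 test rows (20% of 891), so one passenger is about 0.0056 accuracy.
n_test = 179
rf_scores = [acc for (model, _), acc in results.items() if model == "Random Forest"]
rf_spread = round((max(rf_scores) - min(rf_scores)) * n_test)

print(f"best overall: {best[0]} with value {best[1]} ({results[best]:.4f})")
print(f"best KNN: k={best_knn[1]} ({results[best_knn]:.4f})")
print(f"random forest settings differ by about {rf_spread} test passengers")
```

On this single split the random forest with `n_estimators=50` scores highest, and every random forest beats every KNN. The five random forest settings are only about three test passengers apart under the stated assumption, so the ranking among them may not hold on a different split.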
