# Week 3: Comparing Classification Models on the Titanic Dataset

This time, let's compare several of the supervised-learning classification models we have covered.

### Requirements

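1. Download the Titanic dataset from Kaggle using any method you like, then load it.
(https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv)
2. Reuse the code from the previous assignment to preprocess the data, then separate the target (the label) from the features.
3. Build at least three classification models.
4. Compare their performance.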

### **Assignment Info**

■ Difficulty: 🟡 Medium

■ Scope: Week 3

■ Language and libraries: scikit-learn, pandas

In [14]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [15]:
# Load the data
train_data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
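
The requirements mention loading the data in more than one way. As one possible sketch, the helper below reads a local copy when it exists and otherwise downloads the CSV from the URL and saves it. The function name `load_titanic` and the default file name `titanic.csv` are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

TITANIC_URL = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"


def load_titanic(cache_path="titanic.csv", url=TITANIC_URL):
    """Hypothetical helper: read a local copy if present, else download and cache it."""
    cache = Path(cache_path)
    if cache.exists():
        # A previous run (or a manual download) already left a copy on disk
        return pd.read_csv(cache)
    df = pd.read_csv(url)          # pandas can read a CSV directly from an HTTP(S) URL
    df.to_csv(cache, index=False)  # keep a local copy for later runs
    return df
```

Calling `train_data = load_titanic()` would then behave like the cell above on the first run and avoid the network on later runs.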

In [16]:
# Display the data
train_data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [17]:
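# Select the needed columns and preprocess
# .copy() makes df an independent DataFrame, which avoids pandas' SettingWithCopyWarning
df = train_data[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill missing ages with the mean age
X = df.drop('Survived', axis=1)  # features
y = df['Survived']               # target (label)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)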
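
The cell above fills missing `Age` values with a mean computed over all rows before the split, and the `SimpleImputer` imported earlier goes unused. An alternative worth trying is to split first and fit the imputer on the training rows only, so the test rows do not influence the fill value. The sketch below shows the pattern on a small made-up DataFrame (`toy` and its values are hypothetical stand-ins, not Titanic rows).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic features
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 4.0],
    'Fare':     [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 11.13, 30.07, 16.70],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})
X_toy = toy.drop('Survived', axis=1)
y_toy = toy['Survived']

# Split first, so the test rows stay unseen during imputation
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Learn the column means from the training rows only, then reuse them on the test rows
imputer = SimpleImputer(strategy='mean')
X_tr_imp = pd.DataFrame(imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_imp = pd.DataFrame(imputer.transform(X_te), columns=X_te.columns, index=X_te.index)
```

On the real data the same two calls (`fit_transform` on `X_train`, `transform` on `X_test`) would replace the `fillna` line, provided the split happens before imputation.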

In [20]:
# Import the model classes and evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define the models
models = {
    'LogisticRegression': LogisticRegression(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Support Vector Machine': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier()
}
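
`SVC` and `KNeighborsClassifier` both work with distances between feature vectors, so a column with a large numeric range such as `Fare` can dominate columns like `Pclass`. A common remedy is to standardize the features inside a pipeline. The sketch below is an optional variation, not something the notebook itself runs, and the dictionary name `scaled_models` is an assumption for illustration; whether scaling improves the scores here would have to be checked by running it.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each pipeline standardizes the features (zero mean, unit variance) before the classifier.
# The scaler is fitted only on whatever data is passed to .fit(), i.e. the training set.
scaled_models = {
    'Support Vector Machine (scaled)': make_pipeline(StandardScaler(), SVC()),
    'K-Nearest Neighbors (scaled)': make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
```

Because a pipeline exposes the same `fit` and `predict` methods as a plain estimator, these entries could be merged into `models` with `models.update(scaled_models)` and evaluated by the same loop.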

In [22]:
# Train and evaluate each model
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    results[model_name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred)
    }

results_df = pd.DataFrame(results).T
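
The loop above scores each model on a single 80/20 split, so the ranking can shift with a different `random_state`. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` so that it runs on its own; the names `X_demo`, `y_demo`, `cv_models`, and `cv_df` are hypothetical, and on the real data `X` and `y` would take the place of the synthetic arrays.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

cv_models = {
    'LogisticRegression': LogisticRegression(max_iter=200),
    'Random Forest': RandomForestClassifier(random_state=42),
}

cv_summary = {}
for name, model in cv_models.items():
    # 5-fold cross-validation: five F1 scores per model, one per held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1')
    cv_summary[name] = {'F1 mean': scores.mean(), 'F1 std': scores.std()}

cv_df = pd.DataFrame(cv_summary).T
```

The standard deviation column gives a rough sense of how much a model's score moves between folds, which a single split cannot show.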


results_df

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| LogisticRegression | 0.731844 | 0.770833 | 0.5 | 0.606557 |
| Decision Tree | 0.636872 | 0.571429 | 0.486486 | 0.525547 |
| Random Forest | 0.715084 | 0.671642 | 0.608108 | 0.638298 |
| Support Vector Machine | 0.653631 | 0.75 | 0.243243 | 0.367347 |
| K-Nearest Neighbors | 0.675978 | 0.642857 | 0.486486 | 0.553846 |
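
When reading the table, it helps to remember that the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). That is why the Support Vector Machine row, with the lowest recall, also has the lowest F1 despite a precision of 0.75. The short check below recomputes F1 from the precision and recall values reported above; the dictionary name `reported` is just for illustration.

```python
# (precision, recall, reported F1) copied from the results table
reported = {
    'LogisticRegression':     (0.770833, 0.5,      0.606557),
    'Decision Tree':          (0.571429, 0.486486, 0.525547),
    'Random Forest':          (0.671642, 0.608108, 0.638298),
    'Support Vector Machine': (0.75,     0.243243, 0.367347),
    'K-Nearest Neighbors':    (0.642857, 0.486486, 0.553846),
}

recomputed = {}
for name, (p, r, f1_reported) in reported.items():
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    recomputed[name] = f1
    print(f"{name}: recomputed F1 = {f1:.6f}, reported = {f1_reported:.6f}")
```

The recomputed values should agree with the table up to the rounding of the inputs.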
