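Task1_0725. Using the Titanic survival prediction dataset train.csv, complete the following.
- Write a batch preprocessing function transform_features(df)
- Write a function that trains and evaluates a classification model
- Model and evaluate (accuracy) with dt, lr, and rf

- Predict and evaluate with the Estimator trained on the best hyperparameters found by GridSearchCV.
  - Run this for each of the Decision Tree, Random Forest, and Logistic Regression models
  - Apply a parameter grid suited to the selected model
  - Use cv=5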

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

def categorize_age(age):
  if age < 13:
      return 'Child'
  elif age < 20:
      return 'Teenager'
  elif age < 60:
      return 'Adult'
  else:
      return 'Senior'

# Batch preprocessing function transform_features(df)
def transform_features(df):
  # Outlier handling: drop rows whose Fare lies outside 1.5 * IQR of the quartiles
  Q1 = df['Fare'].quantile(0.25)
  Q3 = df['Fare'].quantile(0.75)
  IQR = Q3 - Q1
  fare_outliers = df[(df['Fare'] < (Q1 - 1.5 * IQR)) | (df['Fare'] > (Q3 + 1.5 * IQR))]

  df = df.drop(fare_outliers.index)

  # Missing-value handling
  imputer_most_frequent = SimpleImputer(strategy='most_frequent')
  df['Age'] = imputer_most_frequent.fit_transform(df[['Age']])
  df['Fare'] = imputer_most_frequent.fit_transform(df[['Fare']])
  df['Embarked'] = df['Embarked'].fillna('S')

  # Create the family_size column
  df['family_size'] = df['SibSp'] + df['Parch']

  df['AgeGroup'] = df['Age'].apply(lambda x: categorize_age(x))

  df['Pclass_Fare'] = df['Pclass'] * df['Fare']

  df['TicketCount'] = df.groupby('Ticket')['Ticket'].transform('count')

  # One-hot encoding
  df = pd.get_dummies(df, columns=['Embarked', 'Sex', 'SibSp', 'Parch', 'family_size', 'AgeGroup', 'TicketCount', 'Ticket'])

  # Drop columns that are not needed
  df.drop(columns=['PassengerId', 'Name', 'Cabin'], inplace=True)

  return df

# Load the data
df = pd.read_csv('/content/drive/MyDrive/KDT_2404/dataset/train.csv')

df = transform_features(df)

# Separate the features and the target
X = df.drop(columns=['Survived'])
y = df['Survived']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=60)

# Standardize the data (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


# Models and their hyperparameter grids
models = {
    'Logistic Regression': (LogisticRegression(max_iter=1000), {
        'C': [0.1, 1, 10, 100],
        'solver': ['newton-cg', 'lbfgs', 'liblinear']
    }),
    'Decision Tree': (DecisionTreeClassifier(), {
        'max_depth': [None, 10, 20, 30, 40],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }),
    'Random Forest': (RandomForestClassifier(), {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    })
}

results = {}

# Hyperparameter tuning and model training
for model_name, (model, params) in models.items():
    grid_search = GridSearchCV(model, params, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
    results[model_name] = {
        'Best Parameters': grid_search.best_params_,
        'Accuracy': accuracy,
        'ROC AUC': roc_auc
    }

# Print the results
for model_name, result in results.items():
    print(f'{model_name} - Best Parameters: {result["Best Parameters"]}, Accuracy: {result["Accuracy"]}, ROC AUC: {result["ROC AUC"]}')

# Results (takes about 5 minutes)
# Logistic Regression - Best Parameters: {'C': 1, 'solver': 'newton-cg'}, Accuracy: 0.8774193548387097, ROC AUC: 0.8546666666666667
# Decision Tree - Best Parameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 5}, Accuracy: 0.864516129032258, ROC AUC: 0.7828571428571429
# Random Forest - Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}, Accuracy: 0.8580645161290322, ROC AUC: 0.8727619047619047

# Model training and evaluation

# Hyperparameter settings using the best values found
# models = {
#   'Logistic Regression': LogisticRegression(C= 0.1, solver= 'saga'),
#   'Decision Tree': DecisionTreeClassifier(max_depth= 10, min_samples_leaf= 4, min_samples_split= 5),
#   'Random Forest': RandomForestClassifier(max_depth= 10, min_samples_leaf= 4, min_samples_split= 5)
# }

# # Train and evaluate the models
# for name, model in models.items():
#   model.fit(X_train, y_train)
#   y_pred = model.predict(X_test)
#   accuracy = accuracy_score(y_test, y_pred)
#   conf_matrix = confusion_matrix(y_test, y_pred)
#   class_report = classification_report(y_test, y_pred)
#   roc_auc = roc_auc_score(y_test, y_pred)

#   print(f'Model: {name}')
#   print(f'Accuracy: {accuracy:.4f}')
#   print('Confusion Matrix:')
#   print(conf_matrix)
#   print('Classification Report:')
#   print(class_report)
#   print(f'ROC AUC: {roc_auc:.4f}')
#   print('\n' + '='*60 + '\n')

Logistic Regression - Best Parameters: {'C': 1, 'solver': 'newton-cg'}, Accuracy: 0.8774193548387097, ROC AUC: 0.8546666666666667
Decision Tree - Best Parameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 5}, Accuracy: 0.864516129032258, ROC AUC: 0.7828571428571429
Random Forest - Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}, Accuracy: 0.8580645161290322, ROC AUC: 0.8727619047619047
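
The cell above scales the data once and then cross-validates on the scaled training set. A common variation is to put the scaler inside a Pipeline so that each of the 5 folds fits its own scaler. The following is a minimal sketch of that idea, not the notebook's method: it assumes synthetic data from make_classification as a stand-in for the Titanic features, and the step names "scaler" and "clf" are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption: not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=60
)

# The scaler is a pipeline step, so it is refit on each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid keys are prefixed with the step name and a double underscore.
param_grid = {
    "clf__C": [0.1, 1, 10, 100],
    "clf__solver": ["lbfgs", "liblinear"],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# score() uses the refit best pipeline and the scoring given above (accuracy).
test_accuracy = grid_search.score(X_test, y_test)
print(grid_search.best_params_, round(test_accuracy, 4))
```

The same pattern should carry over to the Decision Tree and Random Forest grids by swapping the "clf" step and prefixing their parameter names in the same way.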


In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, Binarizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

# 1. Load the data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
           'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country', 'income']

# na_values='?' treats entries recorded as '?' as missing values (NaN)
data = pd.read_csv(url, header=None, names=columns, na_values='?', skipinitialspace=True)

data.dropna(inplace=True)

Q1 = data['fnlwgt'].quantile(0.25)
Q3 = data['fnlwgt'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
capital_fnlwgt_outliers = data[(data['fnlwgt'] < lower_bound) | (data['fnlwgt'] > upper_bound)]
data = data.drop(capital_fnlwgt_outliers.index)

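Q1 = data['capital-gain'].quantile(0.25)
Q3 = data['capital-gain'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
capital_gain_outliers = data[(data['capital-gain'] < lower_bound) | (data['capital-gain'] > upper_bound)]
data = data.drop(capital_gain_outliers.index)

# capital-loss gets its own IQR bounds instead of reusing the capital-gain bounds
Q1 = data['capital-loss'].quantile(0.25)
Q3 = data['capital-loss'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
capital_loss_outliers = data[(data['capital-loss'] < lower_bound) | (data['capital-loss'] > upper_bound)]
data = data.drop(capital_loss_outliers.index)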

# Encode categorical variables
categorical_features = ['race', 'sex', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'native-country']
data = pd.get_dummies(data, columns=categorical_features, drop_first=True)

# Feature selection and data split
# Convert the 'income' variable to 0 and 1
data['income'] = data['income'].apply(lambda x: 1 if x.strip() == '>50K' else 0)
X = data.drop('income', axis=1)
y = data['income']

# Derived variable 1: age_group
# Note: X and y were already extracted above, so the derived columns below
# are added to `data` only and are not part of the model inputs.
data['age_group'] = pd.cut(data['age'], bins=[0, 18, 30, 45, 60, 100], labels=['0-18', '19-30', '31-45', '46-60', '61+'])

# Derived variable 2: hours_group
data['hours_group'] = pd.cut(data['hours-per-week'], bins=[0, 20, 40, 60, 100], labels=['0-20', '21-40', '41-60', '61+'])

# Derived variable 3: capital
data['capital'] = data['capital-gain'] - data['capital-loss']

# Drop columns that are no longer needed
data.drop(columns=['age', 'fnlwgt', 'education-num', 'income', 'capital-gain', 'capital-loss', 'hours-per-week'], inplace=True)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
  'Logistic Regression': LogisticRegression(C= 0.1, solver= 'saga'),
  'Decision Tree': DecisionTreeClassifier(max_depth= 10, min_samples_leaf= 2, min_samples_split= 10),
  'Random Forest': RandomForestClassifier(max_depth= 10, min_samples_leaf= 2, min_samples_split= 10)
}

# Train and evaluate the models
for name, model in models.items():
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  conf_matrix = confusion_matrix(y_test, y_pred)
  class_report = classification_report(y_test, y_pred)
  # Computed from hard 0/1 predictions, so this value equals balanced accuracy
  roc_auc = roc_auc_score(y_test, y_pred)

  print(f'Model: {name}')
  print(f'Accuracy: {accuracy:.4f}')
  print('Confusion Matrix:')
  print(conf_matrix)
  print('Classification Report:')
  print(class_report)
  print(f'ROC AUC: {roc_auc:.4f}')
  print('\n' + '='*60 + '\n')



Model: Logistic Regression
Accuracy: 0.8573
Confusion Matrix:
[[3889  228]
 [ 497  465]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.94      0.91      4117
           1       0.67      0.48      0.56       962

    accuracy                           0.86      5079
   macro avg       0.78      0.71      0.74      5079
weighted avg       0.85      0.86      0.85      5079

ROC AUC: 0.7140


Model: Decision Tree
Accuracy: 0.8466
Confusion Matrix:
[[3935  182]
 [ 597  365]]
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.96      0.91      4117
           1       0.67      0.38      0.48       962

    accuracy                           0.85      5079
   macro avg       0.77      0.67      0.70      5079
weighted avg       0.83      0.85      0.83      5079

ROC AUC: 0.6676


Model: Random Forest
Accuracy: 0.8539
Confusion Matrix:
[[3968  149]
 [ 593  369]]
Classific
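
The ROC AUC values printed above (for example 0.7140 for Logistic Regression) come from roc_auc_score(y_test, y_pred), which receives hard 0/1 labels. With binary labels as the score, the ROC curve has a single operating point and the area reduces to the mean of the true positive rate and the true negative rate. The sketch below checks this against the Logistic Regression confusion matrix printed above, then contrasts label-based and probability-based AUC on a small hand-made array (the toy values are an assumption for illustration).

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Confusion matrix printed above for Logistic Regression: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 3889, 228, 497, 465
auc_from_matrix = (tn / (tn + fp) + tp / (tp + fn)) / 2
print(f"{auc_from_matrix:.4f}")

# Toy example (hypothetical values) comparing the two ways of calling roc_auc_score.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.7, 0.9])
y_pred = (y_proba >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_pred)    # uses only the 0.5 threshold
auc_from_proba = roc_auc_score(y_true, y_proba)    # uses the full ranking
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(auc_from_labels, bal_acc, auc_from_proba)
```

Passing model.predict_proba(X_test)[:, 1] instead of y_pred, as the Titanic cell does, gives the ranking-based AUC.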

In [None]:
Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best parameters for Logistic Regression: {'C': 0.1, 'solver': 'saga'}
Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best parameters for Decision Tree: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 5}
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for GBM: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Fitting 3 folds for each of 32 candidates, totalling 96 fits
Best parameters for SVM: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
Fitting 3 folds for each of 16 candidates, totalling 48 fits
Best parameters for KNN: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'uniform'}
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Best parameters for LightGBM: {'learning_rate': 0.1, 'n_estimators': 100, 'num_leaves': 31}
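
Each "Fitting ... folds for each of ... candidates" line in the log is the number of grid combinations multiplied by the number of folds (36 candidates x 3 folds = 108 fits, for instance). The grids behind this log are not shown in the notebook, so the sketch below instead counts the Decision Tree grid from the first cell with cv=5, using ParameterGrid to enumerate the combinations.

```python
from sklearn.model_selection import ParameterGrid

# Decision Tree grid taken from the first cell of this notebook.
dt_grid = {
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

cv = 5
n_candidates = len(ParameterGrid(dt_grid))  # product of the list lengths
total_fits = n_candidates * cv              # excludes the final refit on all training data

print(f"Fitting {cv} folds for each of {n_candidates} candidates, totalling {total_fits} fits")
```

Counting fits this way before launching a search is a cheap check on how long it is likely to run.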

In [None]:
from sklearn.preprocessing import LabelEncoder

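# Missing-value handling function
def fillna(df):
    # Assign the result back instead of calling fillna(inplace=True) on a selected column
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df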

# Drop attributes that the machine learning algorithms do not need
def drop_features(df):
    df.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True)
    return df

# Apply label encoding.
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin','Sex','Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

# Call the data preprocessing functions defined above
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df
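
To make the label-encoding step in format_features concrete, here is a small sketch of what LabelEncoder does to the first letter of Cabin. The five cabin values are made up for illustration; 'N' plays the role of the fill value used by fillna.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Cabin values after fillna has replaced missing entries with 'N'.
cabin = pd.Series(["C85", "N", "E46", "N", "C123"])

# Keep only the deck letter, as format_features does with .str[:1].
cabin_letter = cabin.str[:1]

le = LabelEncoder()
codes = le.fit_transform(cabin_letter)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(le.classes_))
print(codes.tolist())
```

The integer codes follow alphabetical order of the labels, which tree-based models can split on freely, while a linear model such as Logistic Regression would read them as an ordered scale.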




In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# User-defined function: fit the model, then print evaluation metrics on the test set
def train_and_evaluate(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_proba)
    confusion = confusion_matrix(y_test, y_pred)

    print(f'Confusion matrix:\n{confusion}')
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
    print(f'F1 score: {f1:.4f}')
    print(f'ROC AUC: {roc_auc:.4f}')
    print('')
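
The metrics printed by train_and_evaluate can all be derived from the confusion matrix except ROC AUC, which needs the predicted probabilities. As a worked check, the sketch below recomputes them by hand from the Logistic Regression confusion matrix reported earlier in this notebook for the Adult data.

```python
import numpy as np

# Confusion matrix from the earlier Logistic Regression output: rows are true
# classes, columns are predicted classes, i.e. [[TN, FP], [FN, TP]].
cm = np.array([[3889, 228],
               [497, 465]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted positives, how many are right
recall = tp / (tp + fn)           # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 score: {f1:.2f}")
```

The results agree with the accuracy of 0.8573 and the class 1 row of the classification report (0.67, 0.48, 0.56).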

In [None]:
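# Reload the original data, then extract the feature dataset and the label dataset.
titanic_df = pd.read_csv('/content/drive/MyDrive/KDT_2404/dataset/train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df = titanic_df.drop('Survived',axis=1)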

X_titanic_df = transform_features(X_titanic_df)

In [None]:
from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test=train_test_split(X_titanic_df, y_titanic_df, test_size=0.2, random_state=11)
X_train, X_test, y_train, y_test=train_test_split(X_titanic_df, y_titanic_df, test_size=0.2, random_state=11, stratify=y_titanic_df)
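
The active split above passes stratify=y_titanic_df, which keeps the proportion of each class the same in the training and test sets. A minimal sketch of the effect, assuming a made-up label vector with 30% positives rather than the Titanic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 positives and 70 negatives.
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y
)

# With stratification both subsets keep the 30% positive rate.
print(y.mean(), y_train.mean(), y_test.mean())
```

Without stratify, the positive rate in a test set this small can drift from the overall rate depending on random_state.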

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create scikit-learn Classifier objects for Decision Tree, Random Forest, and Logistic Regression
dt_clf = DecisionTreeClassifier(random_state=10)
rf_clf = RandomForestClassifier(random_state=10)
lr_clf = LogisticRegression(max_iter=2000, random_state=10)
print('Training dt_clf')
print('='*12)
train_and_evaluate(dt_clf, X_train, X_test, y_train, y_test)
print('Training rf_clf')
print('='*12)
train_and_evaluate(rf_clf, X_train, X_test, y_train, y_test)
print('Training lr_clf')
print('='*12)
train_and_evaluate(lr_clf, X_train, X_test, y_train, y_test)

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators':[10,100,200], 'max_depth':[2,3,5,10,12],
             'min_samples_split':[2,3,5], 'min_samples_leaf':[1,5,8,10]}

grid_rfclf = GridSearchCV(rf_clf , param_grid=parameters , scoring='accuracy' , cv=5)
grid_rfclf.fit(X_train , y_train)


print('GridSearchCV best hyperparameters:',grid_rfclf.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_rfclf.best_score_))
best_rfclf = grid_rfclf.best_estimator_

train_and_evaluate(best_rfclf, X_train, X_test, y_train, y_test)
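
Beyond best_params_ and best_score_, a fitted GridSearchCV keeps the per-candidate cross-validation scores in cv_results_, which helps show how close the runner-up settings were. The sketch below is illustrative only: it assumes synthetic data and a much smaller grid than the one above so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the Titanic features).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Deliberately small grid: 2 x 2 = 4 candidates.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 5]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=10),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)

# One row per candidate, ordered from best to worst mean CV accuracy.
cv_table = (
    pd.DataFrame(grid.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table.to_string(index=False))
```

Because refit=True is the default, best_estimator_ is already trained on the full training set once fit returns.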