### Overview   
미국 거주자의 천식, 신장 질환, 당뇨, 피부암 여부 등 심장병 발생 가능성에 영향을 미칠 수 있는 정보를 통해 심장병 여부를 예측하는 Competition 입니다.

### 평가 지표
ROC AUC(Receiver Operating Characteristic - Area Under Curve) Score

### 변수 설명
- State : 거주하고 있는 미국의 주
- Sex : 성별
- GeneralHealth : 개인이 스스로 평가한 전반적인 건강 상태
- PhysicalHealthDays : 지난 30일 동안 신체적으로 건강하지 않았던 일수
- MentalHealthDays : 지난 30일 동안 정신적으로 건강하지 않았던 일수
- LastCheckupTime : 마지막으로 건강 검진을 받은 시
- PhysicalActivities : 신체 활동 또는 운동을 한 빈도
- SleepHours : 하루 평균 수면 시간
- RemovedTeeth : 영구 치아를 발치한 개수
- HadAngina : 협심증 경험 여부
- HadStroke : 뇌졸중 경험 여부
- HadAsthma : 천식 경험 여부
- HadSkinCancer : 피부암 경험 여부
- HadCOPD : 만성 폐쇄성 폐질환(COPD) 경험 여부
- HadDepressiveDisorder : 우울증 경험 여부
- HadKidneyDisease : 신장 질환 경험 여부
- HadArthritis : 관절염 경험 여부
- HadDiabetes : 당뇨병 경험 여부
- DeafOrHardOfHearing : 청각 장애 여부
- BlindOrVisionDifficulty : 시각 장애 여부
- DifficultyConcentrating : 집중에 어려움이 있는지 여부
- DifficultyWalking : 보행에 어려움이 있는지 여부
- DifficultyDressingBathing : 옷 입기나 목욕에 어려움이 있는지 여부
- DifficultyErrands : 혼자 외출하거나 볼일을 보는데 어려움이 있는지 여부
- SmokerStatus : 현재 흡연 상태
- ECigaretteUsage : 전자담배 사용 여부
- ChestScan : 흉부 CT나 X-ray를 촬영한 경험 여부
- RaceEthnicityCategory : 인종 및 민족 범주
- AgeCategory : 연령 범주
- HeightInMeters : 키 (미터 단위)
- WeightInKilograms : 체중 (킬로그램 단위)
- BMI : 체질량지수
- AlcoholDrinkers : 알코올을 섭취하는지 여부
- HIVTesting : HIV 검사를 받은 적이 있는지 여부
- FluVaxLast12 : 지난 12개월 내 독감 예방 접종 여부
- PneumoVaxEver : 폐렴구균 예방 접종 경험 여부
- TetanusLast10Tdap : 최근 10년 내 파상풍 예방접종(Tdap)을 받은 적이 있는지 여부
- HighRiskLastYear : 지난 1년 동안 고위험군으로 간주된 적이 있는지 여부
- CovidPos : 코로나19 양성 판정을 받은 적이 있는지 여부
- HadHeartAttack : 심장병 여부

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')
display( train_df, test_df )

Unnamed: 0,ID,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,...,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,HadHeartAttack
0,346631,Montana,Male,Excellent,0.0,0.0,5 or more years ago,Yes,7.0,1 to 5,...,85.73,21.29,No,No,Yes,No,"Yes, received tetanus shot, but not Tdap",No,Yes,0
1,147983,Kansas,Female,Very good,2.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,None of them,...,77.11,,Yes,No,Yes,No,"Yes, received Tdap",No,No,0
2,63785,Indiana,Male,Good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,8.0,1 to 5,...,89.81,31.96,Yes,,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No,0
3,43439,Virginia,Female,Very good,1.0,1.0,Within past year (anytime less than 12 months ...,Yes,7.0,None of them,...,68.04,24.96,Yes,No,No,Yes,"Yes, received Tdap",No,No,0
4,285789,Ohio,Female,Good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,None of them,...,81.65,27.37,Yes,Yes,Yes,No,"Yes, received Tdap",No,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
353648,78018,New Jersey,Female,Very good,0.0,2.0,Within past year (anytime less than 12 months ...,Yes,7.0,None of them,...,,,,,,,,,,0
353649,167103,Maryland,Female,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,None of them,...,52.16,19.14,Yes,Yes,Yes,No,"Yes, received Tdap",No,No,0
353650,53730,Ohio,Male,Good,2.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,1 to 5,...,118.84,41.03,No,,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,Yes,1
353651,182288,Alabama,Male,Poor,30.0,0.0,Within past year (anytime less than 12 months ...,No,5.0,"6 or more, but not all",...,68.04,24.96,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes,1


Unnamed: 0,ID,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,425295,Massachusetts,Male,Good,5.0,0.0,Within past 5 years (2 years but less than 5 y...,Yes,7.0,None of them,...,,83.91,,Yes,Yes,Yes,Yes,"Yes, received Tdap",No,No
1,169359,Louisiana,Male,Good,0.0,21.0,Within past year (anytime less than 12 months ...,No,8.0,1 to 5,...,1.75,65.77,21.41,Yes,Yes,No,Yes,"Yes, received Tdap",No,No
2,69449,New Hampshire,Male,Good,4.0,2.0,Within past year (anytime less than 12 months ...,Yes,7.0,None of them,...,1.88,92.08,26.06,No,No,Yes,,"Yes, received tetanus shot but not sure what type",No,No
3,10517,Ohio,Male,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,1 to 5,...,1.70,72.57,25.06,Yes,No,No,No,"Yes, received tetanus shot, but not Tdap",No,Yes
4,62046,Indiana,Male,Good,0.0,20.0,Within past year (anytime less than 12 months ...,No,5.0,,...,,,,Yes,No,Yes,No,"No, did not receive any tetanus shot in the pa...",No,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88409,228743,Wyoming,Female,Very good,3.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,1 to 5,...,1.70,95.25,32.89,,,,,,,
88410,259973,New York,Female,Good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,None of them,...,1.57,,,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No
88411,218347,Indiana,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,8.0,None of them,...,1.91,,,No,,Yes,No,"Yes, received tetanus shot but not sure what type",No,No
88412,373556,Rhode Island,Female,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,8.0,"6 or more, but not all",...,1.70,68.04,23.49,Yes,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No


In [2]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353653 entries, 0 to 353652
Data columns (total 41 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   ID                         353653 non-null  int64  
 1   State                      353653 non-null  object 
 2   Sex                        353653 non-null  object 
 3   GeneralHealth              352798 non-null  object 
 4   PhysicalHealthDays         345169 non-null  float64
 5   MentalHealthDays           346556 non-null  float64
 6   LastCheckupTime            347270 non-null  object 
 7   PhysicalActivities         352859 non-null  object 
 8   SleepHours                 349494 non-null  float64
 9   RemovedTeeth               344795 non-null  object 
 10  HadAngina                  350785 non-null  object 
 11  HadStroke                  352757 non-null  object 
 12  HadAsthma                  352484 non-null  object 
 13  HadSkinCancer              35

In [3]:
missing_percentage = train_df.isnull().mean() * 100

# 결과를 데이터프레임으로 정리
missing_data = pd.DataFrame({
    'Column': train_df.columns,
    'MissingPercentage': missing_percentage
}).sort_values(by='MissingPercentage', ascending=False)

# 결측치 비율이 있는 열만 표시
missing_data = missing_data[missing_data['MissingPercentage'] > 0]

print(missing_data)

                                              Column  MissingPercentage
TetanusLast10Tdap                  TetanusLast10Tdap          18.448592
PneumoVaxEver                          PneumoVaxEver          17.226208
HIVTesting                                HIVTesting          14.780307
ChestScan                                  ChestScan          12.478899
CovidPos                                    CovidPos          11.322963
HighRiskLastYear                    HighRiskLastYear          11.300060
BMI                                              BMI          10.848204
FluVaxLast12                            FluVaxLast12          10.494751
AlcoholDrinkers                      AlcoholDrinkers          10.356762
WeightInKilograms                  WeightInKilograms           9.355498
ECigaretteUsage                      ECigaretteUsage           7.921890
SmokerStatus                            SmokerStatus           7.876647
HeightInMeters                        HeightInMeters           6

In [4]:
# 결측치 처리를 할 때, 결측치 비율이 적은 것부터 처리하는 것이 일반적으로 더 효율적입니다. 
# 적은 결측치를 먼저 처리함으로써 데이터 손실 최소화

# 결측치 처리

In [5]:
# HadDiabetes : 당뇨병 경험 여부
# 임신 중에 발생하는 호르몬 변화 등 생리학적 변화와 연관이 있으며, 출산 후에는 대부분 정상혈당으로 돌아옵니
# Yes, but only during pregnancy (female)' 를 'No'로 처리 NA는 No 로 처리
train_df['HadDiabetes'].value_counts(dropna = False)

train_df['HadDiabetes'].fillna(train_df['HadDiabetes'].mode()[0], inplace=True)
train_df['HadDiabetes'] = train_df['HadDiabetes'].replace('Yes, but only during pregnancy (female)', 'No')
test_df['HadDiabetes'].fillna(test_df['HadDiabetes'].mode()[0], inplace=True)
test_df['HadDiabetes'] = test_df['HadDiabetes'].replace('Yes, but only during pregnancy (female)', 'No')

In [6]:
# PhysicalActivities : 신체 활동 또는 운동을 한 빈도
# 최빈값
train_df['PhysicalActivities'].value_counts(dropna = False)
train_df['PhysicalActivities'].fillna(train_df['PhysicalActivities'].mode()[0], inplace=True)
test_df['PhysicalActivities'].fillna(test_df['PhysicalActivities'].mode()[0], inplace=True)

In [7]:
# GeneralHealth : 개인이 스스로 평가한 전반적인 건강 상태 
# 최빈값
train_df['GeneralHealth'].value_counts(dropna = False)
train_df['GeneralHealth'].fillna(train_df['GeneralHealth'].mode()[0], inplace=True)
test_df['GeneralHealth'].fillna(test_df['GeneralHealth'].mode()[0], inplace=True)

In [8]:
# HadStroke : 뇌졸중 경험 여부
# 최빈값
train_df['HadStroke'].value_counts(dropna = False)
train_df['HadStroke'].fillna(train_df['HadStroke'].mode()[0], inplace=True)
test_df['HadStroke'].fillna(test_df['HadStroke'].mode()[0], inplace=True)

In [9]:
# HadAsthma : 천식 경험 여부
# 최빈값
train_df['HadAsthma'].value_counts(dropna = False)
train_df['HadAsthma'].fillna(train_df['HadAsthma'].mode()[0], inplace=True)
test_df['HadAsthma'].fillna(test_df['HadAsthma'].mode()[0], inplace=True)

In [10]:
# HadKidneyDisease : 신장 질환 경험 여부
# 최빈값
train_df['HadKidneyDisease'].value_counts(dropna = False)
train_df['HadKidneyDisease'].fillna(train_df['HadKidneyDisease'].mode()[0], inplace=True)
test_df['HadKidneyDisease'].fillna(test_df['HadKidneyDisease'].mode()[0], inplace=True)

In [11]:
# HadCOPD : 만성 폐쇄성 폐질환(COPD) 경험 여부
# 최빈값
train_df['HadCOPD'].value_counts(dropna = False)
train_df['HadCOPD'].fillna(train_df['HadCOPD'].mode()[0], inplace=True)
test_df['HadCOPD'].fillna(test_df['HadCOPD'].mode()[0], inplace=True)

In [12]:
# HadArthritis : 관절염 경험 여부
# 최빈값
train_df['HadArthritis'].value_counts(dropna = False)
train_df['HadArthritis'].fillna(train_df['HadArthritis'].mode()[0], inplace=True)
test_df['HadArthritis'].fillna(test_df['HadArthritis'].mode()[0], inplace=True)

In [13]:
# HadDepressiveDisorder : 우울증 경험 여부
# 최빈값
train_df['HadDepressiveDisorder'].value_counts(dropna = False)
train_df['HadDepressiveDisorder'].fillna(train_df['HadDepressiveDisorder'].mode()[0], inplace=True)
test_df['HadDepressiveDisorder'].fillna(test_df['HadDepressiveDisorder'].mode()[0], inplace=True)

In [14]:
# HadSkinCancer : 피부암 경험 여부
# 최빈값
train_df['HadSkinCancer'].value_counts(dropna = False)
train_df['HadSkinCancer'].fillna(train_df['HadSkinCancer'].mode()[0], inplace=True)
test_df['HadSkinCancer'].fillna(test_df['HadSkinCancer'].mode()[0], inplace=True)

In [15]:
# HadAngina : 협심증 경험 여부
# 최빈값
train_df['HadAngina'].value_counts(dropna = False)
train_df['HadAngina'].fillna(train_df['HadAngina'].mode()[0], inplace=True)
test_df['HadAngina'].fillna(test_df['HadAngina'].mode()[0], inplace=True)

In [16]:
# SleepHours : 하루 평균 수면 시간
# 최빈값
train_df['SleepHours'].value_counts(dropna = False)
train_df['SleepHours'].fillna(train_df['SleepHours'].mode()[0], inplace=True)
test_df['SleepHours'].fillna(test_df['SleepHours'].mode()[0], inplace=True)

In [17]:
# LastCheckupTime : 마지막으로 건강 검진을 받은 시
# 최빈값
train_df['LastCheckupTime'].value_counts(dropna = False)
train_df['LastCheckupTime'].fillna(train_df['LastCheckupTime'].mode()[0], inplace=True)
test_df['LastCheckupTime'].fillna(test_df['LastCheckupTime'].mode()[0], inplace=True)

In [18]:
# PhysicalHealthDays : 지난 30일 동안 신체적으로 건강하지 않았던 일수
# 최빈값
train_df['PhysicalHealthDays'].value_counts(dropna = False)  
train_df['PhysicalHealthDays'].fillna(train_df['PhysicalHealthDays'].mode()[0], inplace=True)
test_df['PhysicalHealthDays'].fillna(test_df['PhysicalHealthDays'].mode()[0], inplace=True)

In [19]:
# MentalHealthDays : 지난 30일 동안 정신적으로 건강하지 않았던 일수
# 최빈값
train_df['MentalHealthDays'].value_counts(dropna = False)  
train_df['MentalHealthDays'].fillna(train_df['MentalHealthDays'].mode()[0], inplace=True)
test_df['MentalHealthDays'].fillna(test_df['MentalHealthDays'].mode()[0], inplace=True)

In [20]:
#AgeCategory : 연령 범주 -> 비율에 따른 랜덤 샘플링 대체
train_df['AgeCategory'].value_counts(dropna = False)
nan_count = train_df['AgeCategory'].isna().sum()

# 각 연령 범주의 비율 계산
proportions = train_df['AgeCategory'].value_counts(normalize=True)

# 결측치 채우기
values = proportions.index  # 비결측치 값들
probabilities = proportions.values  # 각 값의 비율
random_samples = np.random.choice(values, size=nan_count, p=probabilities)

# 결측치 위치에 무작위 샘플 대체
train_df.loc[train_df['AgeCategory'].isna(), 'AgeCategory'] = random_samples
train_df['AgeCategory'].value_counts(dropna = False)


# geCategory : 연령 범주 -> 비율에 따른 랜덤 샘플링 대체
test_df['AgeCategory'].value_counts(dropna = False)
nan_count = test_df['AgeCategory'].isna().sum()

# 각 연령 범주의 비율 계산
proportions = test_df['AgeCategory'].value_counts(normalize=True)

# 결측치 채우기
values = proportions.index  # 비결측치 값들
probabilities = proportions.values  # 각 값의 비율
random_samples = np.random.choice(values, size=nan_count, p=probabilities)

# 결측치 위치에 무작위 샘플 대체
test_df.loc[test_df['AgeCategory'].isna(), 'AgeCategory'] = random_samples
test_df['AgeCategory'].value_counts(dropna = False)

Age 65 to 69       9663
Age 70 to 74       8914
Age 60 to 64       8897
Age 55 to 59       7445
Age 80 or older    7383
Age 50 to 54       6809
Age 75 to 79       6613
Age 40 to 44       5975
Age 45 to 49       5885
Age 35 to 39       5772
Age 18 to 24       5394
Age 30 to 34       5238
Age 25 to 29       4426
Name: AgeCategory, dtype: int64

In [21]:
# RemovedTeeth : 영구 치아를 발치한 개수
# 최빈값
train_df['RemovedTeeth'].value_counts(dropna = False)
train_df['RemovedTeeth'].fillna(train_df['RemovedTeeth'].mode()[0], inplace=True)
test_df['RemovedTeeth'].fillna(test_df['RemovedTeeth'].mode()[0], inplace=True)

In [22]:
# RaceEthnicityCategory : 인종 및 민족 범주
# 최빈값
train_df['RaceEthnicityCategory'].value_counts(dropna = False)
train_df['RaceEthnicityCategory'].fillna(train_df['RaceEthnicityCategory'].mode()[0], inplace=True)
test_df['RaceEthnicityCategory'].fillna(test_df['RaceEthnicityCategory'].mode()[0], inplace=True)

In [23]:
#DeafOrHardOfHearing : 청각 장애 여부 
# 최빈값
train_df['DeafOrHardOfHearing'].value_counts(dropna = False)
train_df['DeafOrHardOfHearing'].fillna(train_df['DeafOrHardOfHearing'].mode()[0], inplace=True)
test_df['DeafOrHardOfHearing'].fillna(test_df['DeafOrHardOfHearing'].mode()[0], inplace=True)

In [24]:
# BlindOrVisionDifficulty      
# 최빈값
train_df['BlindOrVisionDifficulty'].value_counts(dropna = False)
train_df['BlindOrVisionDifficulty'].fillna(train_df['BlindOrVisionDifficulty'].mode()[0], inplace=True)
test_df['BlindOrVisionDifficulty'].fillna(test_df['BlindOrVisionDifficulty'].mode()[0], inplace=True)

In [25]:
# - DifficultyDressingBathing : 옷 입기나 목욕에 어려움이 있는지 여부
# 최빈값
train_df['DifficultyDressingBathing'].value_counts(dropna = False)
train_df['DifficultyDressingBathing'].fillna(train_df['DifficultyDressingBathing'].mode()[0], inplace=True)
test_df['DifficultyDressingBathing'].fillna(test_df['DifficultyDressingBathing'].mode()[0], inplace=True)

In [26]:
# - DifficultyWalking : 보행에 어려움이 있는지 여부
# 최빈값
train_df['DifficultyWalking'].value_counts(dropna = False)
train_df['DifficultyWalking'].fillna(train_df['DifficultyWalking'].mode()[0], inplace=True)
test_df['DifficultyWalking'].fillna(test_df['DifficultyWalking'].mode()[0], inplace=True)

In [27]:
#  DifficultyConcentrating : 집중에 어려움이 있는지 여부
# 최빈값
train_df['DifficultyConcentrating'].value_counts(dropna = False)
train_df['DifficultyConcentrating'].fillna(train_df['DifficultyConcentrating'].mode()[0], inplace=True)
test_df['DifficultyConcentrating'].fillna(test_df['DifficultyConcentrating'].mode()[0], inplace=True)

In [28]:
# - DifficultyErrands : 혼자 외출하거나 볼일을 보는데 어려움이 있는지 여부
# 최빈값
train_df['DifficultyErrands'].value_counts(dropna = False)
train_df['DifficultyErrands'].fillna(train_df['DifficultyErrands'].mode()[0], inplace=True)
test_df['DifficultyErrands'].fillna(test_df['DifficultyErrands'].mode()[0], inplace=True)

In [29]:
#  HeightInMeters : 키 (미터 단위)
# 성별별로 평균값
train_df['HeightInMeters'].value_counts(dropna = False)
train_df['HeightInMeters'] = train_df.groupby('Sex')['HeightInMeters'].transform(lambda x: x.fillna(x.mean()))
test_df['HeightInMeters'] = test_df.groupby('Sex')['HeightInMeters'].transform(lambda x: x.fillna(x.mean()))

In [30]:
#  SmokerStatus : 현재 흡연 상태
# 최빈값
train_df['SmokerStatus'].value_counts(dropna = False)
train_df['SmokerStatus'].fillna(train_df['SmokerStatus'].mode()[0], inplace=True)
test_df['SmokerStatus'].fillna(test_df['SmokerStatus'].mode()[0], inplace=True)

In [31]:
# - ECigaretteUsage : 전자담배 사용 여부
# 최빈값
train_df['ECigaretteUsage'].value_counts(dropna = False)
train_df['ECigaretteUsage'].fillna(train_df['ECigaretteUsage'].mode()[0], inplace=True)
test_df['ECigaretteUsage'].fillna(test_df['ECigaretteUsage'].mode()[0], inplace=True)

In [32]:
# - WeightInKilograms : 체중 (킬로그램 단위)
# 성별별로 평균값 대체
train_df['WeightInKilograms'].value_counts(dropna = False)
train_df['WeightInKilograms'] = train_df.groupby('Sex')['WeightInKilograms'].transform(lambda x: x.fillna(x.mean()))
test_df['WeightInKilograms'] = test_df.groupby('Sex')['WeightInKilograms'].transform(lambda x: x.fillna(x.mean()))

In [33]:
 # AlcoholDrinkers : 알코올을 섭취하는지 여부
# 최빈값
train_df['AlcoholDrinkers'].value_counts(dropna = False)
train_df['AlcoholDrinkers'].fillna(train_df['AlcoholDrinkers'].mode()[0], inplace=True)
test_df['AlcoholDrinkers'].fillna(test_df['AlcoholDrinkers'].mode()[0], inplace=True)

In [34]:
# FluVaxLast12 : 지난 12개월 내 독감 예방 접종 여부
# 최빈값
train_df['FluVaxLast12'].value_counts(dropna = False)
train_df['FluVaxLast12'].fillna(train_df['FluVaxLast12'].mode()[0], inplace=True)
test_df['FluVaxLast12'].fillna(test_df['FluVaxLast12'].mode()[0], inplace=True)

In [35]:
# BMI : 체질량지수
# 성별별로 평균값
train_df['BMI'].value_counts(dropna = False)
train_df['BMI'] = train_df.groupby('Sex')['BMI'].transform(lambda x: x.fillna(x.mean()))
test_df['BMI'] = test_df.groupby('Sex')['BMI'].transform(lambda x: x.fillna(x.mean()))

In [36]:
# HighRiskLastYear : 지난 1년 동안 고위험군으로 간주된 적이 있는지 여부
# 최빈값
train_df['HighRiskLastYear'].value_counts(dropna = False)
train_df['HighRiskLastYear'].fillna(train_df['HighRiskLastYear'].mode()[0], inplace=True)
test_df['HighRiskLastYear'].fillna(test_df['HighRiskLastYear'].mode()[0], inplace=True)

In [37]:
# CovidPos : 코로나19 양성 판정을 받은 적이 있는지 여부
# 'Tested positive using home test without a health professional'  : yes로 처리
train_df['CovidPos'].value_counts(dropna = False)
train_df['CovidPos'] = train_df['CovidPos'].replace('Tested positive using home test without a health professional', 'Yes')
train_df['CovidPos'].fillna(train_df['CovidPos'].mode()[0], inplace=True)

test_df['CovidPos'] = test_df['CovidPos'].replace('Tested positive using home test without a health professional', 'Yes')
test_df['CovidPos'].fillna(test_df['CovidPos'].mode()[0], inplace=True)


In [38]:
# ChestScan : 흉부 CT나 X-ray를 촬영한 경험 여부
# 최빈값
train_df['ChestScan'].value_counts(dropna = False)
train_df['ChestScan'].fillna(train_df['ChestScan'].mode()[0], inplace=True)
test_df['ChestScan'].fillna(test_df['ChestScan'].mode()[0], inplace=True)

In [39]:
# HIVTesting : HIV 검사를 받은 적이 있는지 여부
# 최빈값
train_df['HIVTesting'].value_counts(dropna = False)
train_df['HIVTesting'].fillna(train_df['HIVTesting'].mode()[0], inplace=True)
test_df['HIVTesting'].fillna(test_df['HIVTesting'].mode()[0], inplace=True)

In [40]:
# PneumoVaxEver : 폐렴구균 예방 접종 경험 여부
# 최빈값
train_df['PneumoVaxEver'].value_counts(dropna = False)
train_df['PneumoVaxEver'].fillna(train_df['PneumoVaxEver'].mode()[0], inplace=True)
test_df['PneumoVaxEver'].fillna(test_df['PneumoVaxEver'].mode()[0], inplace=True)

In [41]:
# TetanusLast10Tdap : 최근 10년 내 파상풍 예방접종(Tdap)을 받은 적이 있는지 여부
# 아래처럼 처리
train_df['TetanusLast10Tdap'].value_counts(dropna = False)

train_df['TetanusLast10Tdap'] = train_df['TetanusLast10Tdap'].replace({
    'No, did not receive any tetanus shot in the past 10 years' : 'NO',
    'Yes, received tetanus shot but not sure what type': 'Unknown',
    'Yes, received Tdap': 'Tdap',
    'Yes, received tetanus shot, but not Tdap' : 'Not_Tdap'
})
test_df['TetanusLast10Tdap'] = test_df['TetanusLast10Tdap'].replace({
    'No, did not receive any tetanus shot in the past 10 years' : 'NO',
    'Yes, received tetanus shot but not sure what type': 'Unknown',
    'Yes, received Tdap': 'Tdap',
    'Yes, received tetanus shot, but not Tdap' : 'Not_Tdap'
})

train_df['TetanusLast10Tdap'].fillna(train_df['TetanusLast10Tdap'].mode()[0], inplace=True)
test_df['TetanusLast10Tdap'].fillna(test_df['TetanusLast10Tdap'].mode()[0], inplace=True)

In [42]:
missing_percentage = train_df.isnull().mean() * 100

# 결과를 데이터프레임으로 정리
missing_data = pd.DataFrame({
    'Column': train_df.columns,
    'MissingPercentage': missing_percentage
}).sort_values(by='MissingPercentage', ascending=False)

# 결측치 비율이 있는 열만 표시
missing_data = missing_data[missing_data['MissingPercentage'] > 0]

print(missing_data)

Empty DataFrame
Columns: [Column, MissingPercentage]
Index: []


# 시각화

In [43]:
categorical_columns = train_df.select_dtypes(include=['object']).columns
numerical_columns = train_df.select_dtypes(include=['int64', 'float64']).columns

In [44]:
'''

!!!!! 코드 빠르게 돌리기 위해 잠시 주석 처리 !!!!!


# 범주형 변수 시각화
for col in categorical_columns:
    plt.figure(figsize=(6, 4))
    sns.countplot(data=train_df, x=col)
    plt.title(f'{col} Distribution')
    plt.xticks(rotation=45)
    plt.show()
'''

"\n\n!!!!! 코드 빠르게 돌리기 위해 잠시 주석 처리 !!!!!\n\n\n# 범주형 변수 시각화\nfor col in categorical_columns:\n    plt.figure(figsize=(6, 4))\n    sns.countplot(data=train_df, x=col)\n    plt.title(f'{col} Distribution')\n    plt.xticks(rotation=45)\n    plt.show()\n"

In [45]:
'''

!!!!! 코드 빠르게 돌리기 위해 잠시 주석 처리 !!!!!


# 수치형 변수 시각화
from scipy.stats import skew

for col in numerical_columns:
    # 왜도 값 계산
    skew_val = skew(train_df[col].dropna())
    
    plt.figure(figsize=(6, 4))
    sns.histplot(train_df[col], kde=True)
    plt.title(f'{col} Distribution')
    
    plt.text(x=0.95, y=0.95, s=f'Skewness: {skew_val:.2f}', 
             ha='right', va='top', transform=plt.gca().transAxes, fontsize=10,
             bbox=dict(boxstyle="round,pad=0.3", edgecolor="gray", facecolor="lightgray"))
    
    plt.show()
'''

'\n\n!!!!! 코드 빠르게 돌리기 위해 잠시 주석 처리 !!!!!\n\n\n# 수치형 변수 시각화\nfrom scipy.stats import skew\n\nfor col in numerical_columns:\n    # 왜도 값 계산\n    skew_val = skew(train_df[col].dropna())\n    \n    plt.figure(figsize=(6, 4))\n    sns.histplot(train_df[col], kde=True)\n    plt.title(f\'{col} Distribution\')\n    \n    plt.text(x=0.95, y=0.95, s=f\'Skewness: {skew_val:.2f}\', \n             ha=\'right\', va=\'top\', transform=plt.gca().transAxes, fontsize=10,\n             bbox=dict(boxstyle="round,pad=0.3", edgecolor="gray", facecolor="lightgray"))\n    \n    plt.show()\n'

# Feature Engineering

## 1)  인코딩

In [46]:
# label encoding을 위한 dictionary
yes_or_no = {'No' : 1, 'Yes' : 2}

In [47]:
sta_map = {
    **dict.fromkeys(('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Rhode Island', 'Connecticut',
                     'New York', 'New Jersey', 'Pennsylvania', 'Maryland', 'Delaware'), '북동부'),
    **dict.fromkeys(('Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 
                     'Alabama', 'Tennessee', 'Kentucky', 'Mississippi', 'Arkansas', 
                     'Louisiana', 'West Virginia'), '남동부'),
    **dict.fromkeys(('Michigan', 'Ohio', 'Indiana', 'Illinois', 'Iowa', 'Missouri', 
                     'Kansas', 'Minnesota', 'Wisconsin', 'North Dakota', 
                     'South Dakota', 'Nebraska'), '중서부'),
    **dict.fromkeys(('Montana', 'Wyoming', 'Idaho', 'Colorado', 'Utah', 'Nevada', 
                     'Washington', 'Oregon', 'Alaska'), '북서부'),
    **dict.fromkeys(('California', 'Arizona', 'New Mexico', 'Texas', 'Oklahoma'), '남서부'),
    **dict.fromkeys(('Hawaii', 'Guam'), '태평양'),
    **dict.fromkeys(('Puerto Rico', 'Virgin Islands'), '카리브해'),
    'District of Columbia': '특수 구역'
}

train_df['State'] = train_df['State'].map(sta_map).fillna(train_df['State'])
test_df['State'] = test_df['State'].map(sta_map).fillna(test_df['State'])

In [48]:
sta_ord = { 
    '북동부': 1, '남서부': 2, '태평양': 3, '중서부': 4, 
    '북서부': 5, '남동부': 6, '카리브해': 7, '특수 구역': 8
}

train_df['State'] = train_df['State'].map(sta_ord).fillna(train_df['State'])
test_df['State'] = test_df['State'].map(sta_ord).fillna(test_df['State'])

train_df['State'].unique()

array([5, 4, 6, 2, 1, 3, 7, 8], dtype=int64)

In [49]:
sex_map = {'Male': 0, 'Female': 1}

train_df['Sex'] = train_df['Sex'].map(sex_map).fillna(train_df['Sex'])
test_df['Sex'] = test_df['Sex'].map(sex_map).fillna(test_df['Sex'])

In [50]:
gen_ord = {
    'Poor': 1, 'Fair': 2, 'Good': 3, 
    'Very good': 4, 'Excellent': 5
}

train_df['GeneralHealth'] = train_df['GeneralHealth'].map(gen_ord).fillna(train_df['GeneralHealth'])
test_df['GeneralHealth'] = test_df['GeneralHealth'].map(gen_ord).fillna(test_df['GeneralHealth'])

In [51]:
las_ord = {
    'Within past year (anytime less than 12 months ago)': 1, 'Within past 2 years (1 year but less than 2 years ago)': 2,
    'Within past 5 years (2 years but less than 5 years ago)': 3, '5 or more years ago': 4
}

train_df['LastCheckupTime'] = train_df['LastCheckupTime'].map(las_ord).fillna(train_df['LastCheckupTime'])
test_df['LastCheckupTime'] = test_df['LastCheckupTime'].map(las_ord).fillna(test_df['LastCheckupTime'])

In [52]:
# label encoding
train_df['PhysicalActivities'] = train_df['PhysicalActivities'].map(yes_or_no).fillna(train_df['PhysicalActivities'])
test_df['PhysicalActivities'] = test_df['PhysicalActivities'].map(yes_or_no).fillna(test_df['PhysicalActivities'])

In [53]:
rem_ord = {
    'None of them': 1, '1 to 5': 2,
    '6 or more, but not all': 3, 'All': 4
}

train_df['RemovedTeeth'] = train_df['RemovedTeeth'].map(rem_ord).fillna(train_df['RemovedTeeth'])
test_df['RemovedTeeth'] = test_df['RemovedTeeth'].map(rem_ord).fillna(test_df['RemovedTeeth'])

In [54]:
# label encoding
train_df['HadAngina'] = train_df['HadAngina'].map(yes_or_no).fillna(train_df['HadAngina'])
test_df['HadAngina'] = test_df['HadAngina'].map(yes_or_no).fillna(test_df['HadAngina'])

In [55]:
# label encoding
train_df['HadStroke'] = train_df['HadStroke'].map(yes_or_no).fillna(train_df['HadStroke'])
test_df['HadStroke'] = test_df['HadStroke'].map(yes_or_no).fillna(test_df['HadStroke'])

In [56]:
# label encoding
train_df['HadAsthma'] = train_df['HadAsthma'].map(yes_or_no).fillna(train_df['HadAsthma'])
test_df['HadAsthma'] = test_df['HadAsthma'].map(yes_or_no).fillna(test_df['HadAsthma'])

In [57]:
# label encoding
train_df['HadSkinCancer'] = train_df['HadSkinCancer'].map(yes_or_no).fillna(train_df['HadSkinCancer'])
test_df['HadSkinCancer'] = test_df['HadSkinCancer'].map(yes_or_no).fillna(test_df['HadSkinCancer'])

In [58]:
# label encoding
train_df['HadCOPD'] = train_df['HadCOPD'].map(yes_or_no).fillna(train_df['HadCOPD'])
test_df['HadCOPD'] = test_df['HadCOPD'].map(yes_or_no).fillna(test_df['HadCOPD'])

In [59]:
# label encoding
train_df['HadDepressiveDisorder'] = train_df['HadDepressiveDisorder'].map(yes_or_no).fillna(train_df['HadDepressiveDisorder'])
test_df['HadDepressiveDisorder'] = test_df['HadDepressiveDisorder'].map(yes_or_no).fillna(test_df['HadDepressiveDisorder'])

In [60]:
# label encoding
train_df['HadKidneyDisease'] = train_df['HadKidneyDisease'].map(yes_or_no).fillna(train_df['HadKidneyDisease'])
test_df['HadKidneyDisease'] = test_df['HadKidneyDisease'].map(yes_or_no).fillna(test_df['HadKidneyDisease'])

In [61]:
# label encoding
train_df['HadArthritis'] = train_df['HadArthritis'].map(yes_or_no).fillna(train_df['HadArthritis'])
test_df['HadArthritis'] = test_df['HadArthritis'].map(yes_or_no).fillna(test_df['HadArthritis'])

In [62]:
diabetes_ord = {
    'No': 1, 'No, pre-diabetes or borderline diabetes': 2,
    'Yes, but only during pregnancy (female)': 3, 'Yes': 4
}

train_df['HadDiabetes'] = train_df['HadDiabetes'].map(diabetes_ord).fillna(train_df['HadDiabetes'])
test_df['HadDiabetes'] = test_df['HadDiabetes'].map(diabetes_ord).fillna(test_df['HadDiabetes'])

In [63]:
# label encoding
train_df['DeafOrHardOfHearing'] = train_df['DeafOrHardOfHearing'].map(yes_or_no).fillna(train_df['DeafOrHardOfHearing'])
test_df['DeafOrHardOfHearing'] = test_df['DeafOrHardOfHearing'].map(yes_or_no).fillna(test_df['DeafOrHardOfHearing'])

In [64]:
# label encoding
train_df['BlindOrVisionDifficulty'] = train_df['BlindOrVisionDifficulty'].map(yes_or_no).fillna(train_df['BlindOrVisionDifficulty'])
test_df['BlindOrVisionDifficulty'] = test_df['BlindOrVisionDifficulty'].map(yes_or_no).fillna(test_df['BlindOrVisionDifficulty'])

In [65]:
# label encoding
train_df['DifficultyConcentrating'] = train_df['DifficultyConcentrating'].map(yes_or_no).fillna(train_df['DifficultyConcentrating'])
test_df['DifficultyConcentrating'] = test_df['DifficultyConcentrating'].map(yes_or_no).fillna(test_df['DifficultyConcentrating'])

In [66]:
# label encoding
train_df['DifficultyWalking'] = train_df['DifficultyWalking'].map(yes_or_no).fillna(train_df['DifficultyWalking'])
test_df['DifficultyWalking'] = test_df['DifficultyWalking'].map(yes_or_no).fillna(test_df['DifficultyWalking'])

In [67]:
# label encoding
train_df['DifficultyDressingBathing'] = train_df['DifficultyDressingBathing'].map(yes_or_no).fillna(train_df['DifficultyDressingBathing'])
test_df['DifficultyDressingBathing'] = test_df['DifficultyDressingBathing'].map(yes_or_no).fillna(test_df['DifficultyDressingBathing'])

In [68]:
# label encoding
train_df['DifficultyErrands'] = train_df['DifficultyErrands'].map(yes_or_no).fillna(train_df['DifficultyErrands'])
test_df['DifficultyErrands'] = test_df['DifficultyErrands'].map(yes_or_no).fillna(test_df['DifficultyErrands'])

In [69]:
smo_ord = {
    'Never smoked': 1, 'Former smoker': 2,
    'Current smoker - now smokes some days': 3, 'Current smoker - now smokes every day': 4
}

train_df['SmokerStatus'] = train_df['SmokerStatus'].map(smo_ord).fillna(train_df['SmokerStatus'])
test_df['SmokerStatus'] = test_df['SmokerStatus'].map(smo_ord).fillna(test_df['SmokerStatus'])

In [70]:
eci_ord = {
    'Never used e-cigarettes in my entire life': 1, 'Not at all (right now)': 2,
    'Use them some days': 3, 'Use them every day': 4
}

train_df['ECigaretteUsage'] = train_df['ECigaretteUsage'].map(eci_ord).fillna(train_df['ECigaretteUsage'])
test_df['ECigaretteUsage'] = test_df['ECigaretteUsage'].map(eci_ord).fillna(test_df['ECigaretteUsage'])

In [71]:
# label encoding
train_df['ChestScan'] = train_df['ChestScan'].map(yes_or_no).fillna(train_df['ChestScan'])
test_df['ChestScan'] = test_df['ChestScan'].map(yes_or_no).fillna(test_df['ChestScan'])

In [72]:
rac_map = {
    'White only, Non-Hispanic': 1, 'Hispanic': 2, 'Multiracial, Non-Hispanic': 3,
    'Other race only, Non-Hispanic': 4, 'Black only, Non-Hispanic': 5
}

train_df['RaceEthnicityCategory'] = train_df['RaceEthnicityCategory'].map(rac_map).fillna(train_df['RaceEthnicityCategory'])
test_df['RaceEthnicityCategory'] = test_df['RaceEthnicityCategory'].map(rac_map).fillna(test_df['RaceEthnicityCategory'])

In [73]:
age_ord = {
    'Age 18 to 24': 1, 'Age 25 to 29': 2, 'Age 30 to 34': 3, 'Age 35 to 39': 4,
    'Age 40 to 44': 5, 'Age 45 to 49': 6, 'Age 50 to 54': 7, 'Age 55 to 59': 8,
    'Age 60 to 64': 9, 'Age 65 to 69': 10, 'Age 70 to 74': 11, 'Age 75 to 79': 12, 'Age 80 or older': 13
}

train_df['AgeCategory'] = train_df['AgeCategory'].map(age_ord).fillna(train_df['AgeCategory'])
test_df['AgeCategory'] = test_df['AgeCategory'].map(age_ord).fillna(test_df['AgeCategory'])

In [74]:
# label encoding
train_df['AlcoholDrinkers'] = train_df['AlcoholDrinkers'].map(yes_or_no).fillna(train_df['AlcoholDrinkers'])
test_df['AlcoholDrinkers'] = test_df['AlcoholDrinkers'].map(yes_or_no).fillna(test_df['AlcoholDrinkers'])

In [75]:
# label encoding
train_df['HIVTesting'] = train_df['HIVTesting'].map(yes_or_no).fillna(train_df['HIVTesting'])
test_df['HIVTesting'] = test_df['HIVTesting'].map(yes_or_no).fillna(test_df['HIVTesting'])

In [76]:
# label encoding
train_df['FluVaxLast12'] = train_df['FluVaxLast12'].map(yes_or_no).fillna(train_df['FluVaxLast12'])
test_df['FluVaxLast12'] = test_df['FluVaxLast12'].map(yes_or_no).fillna(test_df['FluVaxLast12'])

In [77]:
# label encoding
train_df['PneumoVaxEver'] = train_df['PneumoVaxEver'].map(yes_or_no).fillna(train_df['PneumoVaxEver'])
test_df['PneumoVaxEver'] = test_df['PneumoVaxEver'].map(yes_or_no).fillna(test_df['PneumoVaxEver'])

In [78]:
tet_ord = {
    'NO': 1, 'Unknown': 2,
    'Not_Tdap': 3, 'Tdap': 4
}

train_df['TetanusLast10Tdap'] = train_df['TetanusLast10Tdap'].map(tet_ord).fillna(train_df['TetanusLast10Tdap'])
test_df['TetanusLast10Tdap'] = test_df['TetanusLast10Tdap'].map(tet_ord).fillna(test_df['TetanusLast10Tdap'])

In [79]:
# label encoding
train_df['HighRiskLastYear'] = train_df['HighRiskLastYear'].map(yes_or_no).fillna(train_df['HighRiskLastYear'])
test_df['HighRiskLastYear'] = test_df['HighRiskLastYear'].map(yes_or_no).fillna(test_df['HighRiskLastYear'])

In [80]:
cov_ord = {
    'No': 0,
    'Tested positive using home test without a health professional': 1, 'Yes': 2
}

train_df['CovidPos'] = train_df['CovidPos'].map(cov_ord).fillna(train_df['CovidPos'])
test_df['CovidPos'] = test_df['CovidPos'].map(cov_ord).fillna(test_df['CovidPos'])

## 2) Feature Selection

In [81]:
from sklearn.model_selection import train_test_split

X = train_df.drop(columns=['HadHeartAttack'])
y = train_df['HadHeartAttack']

# 훈련/검증 데이터 분리
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1004)

In [85]:
# 수치형 데이터
X_num = X[['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms', 'BMI']]

# 범주형 변수 (나머지 변수들)
X_cat = X.drop(columns=['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms', 'BMI'])

print('수치형은 anova, 범주형은 chi-square\n')
print('근데, anova와 chi-square은 데이터셋이 커지면 성능이 안 좋으니까,\n수치형엔 Kruskal-Wallis H 검정을 쓰고 범주형엔 Fisher\'s Exact Test을 쓸 거다.')
print('Kruskal-Wallis H 검정과  Fisher\'s Exact Test은 비모수적(샘플링) 방법이므로, 채택하겠다')

수치형은 anova, 범주형은 chi-square

근데, anova와 chi-square은 데이터셋이 커지면 성능이 안 좋으니까,
수치형엔 Kruskal-Wallis H 검정을 쓰고 범주형엔 Fisher's Exact Test을 쓸 거다.
Kruskal-Wallis H 검정과  Fisher's Exact Test은 비모수적(샘플링) 방법이므로, 채택하겠다


In [86]:
from imblearn.under_sampling import TomekLinks
from scipy.stats import kruskal, chi2_contingency

# 유의수준 설정
alpha = 0.05

# 샘플링 (1%) 3,000개의 데이터셋
sample_size_ratio = 0.01
sampled_data = X.sample(frac=sample_size_ratio, random_state=1004)

# 샘플링된 데이터셋에 대응되는 y 값 추가
sampled_y = y.loc[sampled_data.index]

# TomekLinks 적용
tl = TomekLinks()
X_resampled, y_resampled = tl.fit_resample(sampled_data, sampled_y)

# 선택된 수치형 변수 리스트
selected_num_features = []

# Kruskal-Wallis H 검정 (수치형 데이터)
for col in X_num.columns:  # X_num에 대해 직접 처리
    # 그룹화된 데이터로 Kruskal-Wallis H 검정
    groups = [X_resampled[X_resampled[col] == val][col].values for val in X_resampled[col].unique()]
    stat, p_value = kruskal(*groups)
    
    # p-value가 유의수준보다 작으면 수치형 피쳐 선택
    if p_value < alpha:
        selected_num_features.append(col)
    print(f'{col}에 대한 Kruskal-Wallis H 검정 결과 - p-value: {p_value}')

print()  # 띄어쓰기

# 선택된 범주형 변수 리스트
selected_cat_features = []

# 범주형 변수 (라벨 인코딩이 완료된 변수들)에 대해 Chi-Square 검정
for col in X_cat.columns:  # X_cat에 대해 직접 처리
    contingency_table = pd.crosstab(X_resampled[col], y_resampled)
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
    
    # p-value가 유의수준보다 작으면 범주형 피쳐 선택
    if p_value < alpha:
        selected_cat_features.append(col)
    print(f'{col}에 대한 Chi-Square 검정 결과 - p-value: {p_value}')

print()  # 띄어쓰기

# 최종 선택된 피쳐들 출력
print(f'\n선택된 수치형 피쳐들: {selected_num_features}')
print(f'선택된 범주형 피쳐들: {selected_cat_features}')

PhysicalHealthDays에 대한 Kruskal-Wallis H 검정 결과 - p-value: 0.0
MentalHealthDays에 대한 Kruskal-Wallis H 검정 결과 - p-value: 0.0
SleepHours에 대한 Kruskal-Wallis H 검정 결과 - p-value: 0.0
HeightInMeters에 대한 Kruskal-Wallis H 검정 결과 - p-value: 0.0
WeightInKilograms에 대한 Kruskal-Wallis H 검정 결과 - p-value: 0.0
BMI에 대한 Kruskal-Wallis H 검정 결과 - p-value: 1.7078209780233643e-278

ID에 대한 Chi-Square 검정 결과 - p-value: 0.49195209943163054
State에 대한 Chi-Square 검정 결과 - p-value: 0.5805145744825764
Sex에 대한 Chi-Square 검정 결과 - p-value: 7.617426579970585e-05
GeneralHealth에 대한 Chi-Square 검정 결과 - p-value: 8.01954878636721e-31
LastCheckupTime에 대한 Chi-Square 검정 결과 - p-value: 6.434782604430913e-08
PhysicalActivities에 대한 Chi-Square 검정 결과 - p-value: 1.814962356896757e-06
RemovedTeeth에 대한 Chi-Square 검정 결과 - p-value: 1.0511401714583255e-19
HadAngina에 대한 Chi-Square 검정 결과 - p-value: 1.0771814036316573e-128
HadStroke에 대한 Chi-Square 검정 결과 - p-value: 5.1901896751324495e-33
HadAsthma에 대한 Chi-Square 검정 결과 - p-value: 0.10581130481828376
Ha

In [87]:
# 최종 선택된 피쳐
X_train_selected = X_train[selected_num_features + selected_cat_features + ['ID']]
X_val_selected = X_val[selected_num_features + selected_cat_features + ['ID']]
print(X_train_selected)

        PhysicalHealthDays  MentalHealthDays  SleepHours  HeightInMeters  \
242524                 0.0               0.0         6.0        1.910000   
150644                 0.0               0.0         7.0        1.780768   
127341                 5.0               0.0         8.0        1.750000   
143102                 0.0               0.0         8.0        1.700000   
188099                30.0               4.0         6.0        1.700000   
...                    ...               ...         ...             ...   
173295                14.0               2.0         7.0        1.633174   
13637                  0.0               0.0         8.0        1.570000   
217119                30.0               0.0         4.0        1.630000   
240139                 0.0               0.0         8.0        1.780000   
48130                  4.0              30.0         7.0        1.630000   

        WeightInKilograms        BMI  Sex  GeneralHealth  LastCheckupTime  \
242524    

# Modeling

In [88]:
from imblearn.over_sampling import SMOTE

# 클래스 불균형 처리 (SMOTE 오버샘플링)
smote = SMOTE(random_state=1004, n_jobs=-1)  # n_jobs=-1 추가하여 병렬 처리
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_selected, y_train)

In [89]:
# 모델 스태킹
# 기본 모델 (XGBoost, Naive Bayes, CatBoost)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import KFold

base_models = [
    LogisticRegression(max_iter=1000, random_state=1004),
    RandomForestClassifier(n_estimators=100, random_state=1004, n_jobs=-1),
    XGBClassifier(eval_metric='auc', random_state=1004, scale_pos_weight=16.606, n_jobs=-1),
    CatBoostClassifier(learning_rate=0.1, depth=6, iterations=500, random_seed=1004, silent=True)
]

# 메타 모델 (Loogistic Regression)
meta_model = LogisticRegression(max_iter=1000, random_state=42)

In [90]:
# Optuna를 사용한 베이지안 최적화
import optuna
from sklearn.metrics import roc_auc_score

# KFold 교차 검증 준비
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# KFold 기반 교차 검증 함수
def cross_val_score_model(model, X, y, kf):
    scores = []
    for train_idx, valid_idx in kf.split(X):
        X_train_fold, X_valid_fold = X.iloc[train_idx], X.iloc[valid_idx]
        y_train_fold, y_valid_fold = y.iloc[train_idx], y.iloc[valid_idx]

        model.fit(X_train_fold, y_train_fold)
        if hasattr(model, "predict_proba"):
            y_pred = model.predict_proba(X_valid_fold)[:, 1]
        else:
            y_pred = model.predict(X_valid_fold)
        scores.append(roc_auc_score(y_valid_fold, y_pred))
    return np.mean(scores)

In [91]:
# 각 모델의 베이지안 최적화 Objective 함수 정의
def optimize_logistic_regression(trial):
    params = {
        "C": trial.suggest_loguniform("C", 1e-4, 10),
        "solver": trial.suggest_categorical("solver", ["lbfgs", "liblinear"]),
    }
    model = LogisticRegression(max_iter=1000, random_state=1004, **params)
    return cross_val_score_model(model, X_train_resampled, y_train_resampled, kf)

# 모델별 베이지안 최적화 수행
study_logistic = optuna.create_study(direction="maximize")
study_logistic.optimize(optimize_logistic_regression, n_trials=19)

[I 2024-11-23 16:19:43,699] A new study created in memory with name: no-name-4c93ecfc-6484-40d8-a681-83f23feb7847
[I 2024-11-23 16:21:05,691] Trial 0 finished with value: 0.9373187528358352 and parameters: {'C': 5.832914362872767, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9373187528358352.
[I 2024-11-23 16:21:54,312] Trial 1 finished with value: 0.925794530241195 and parameters: {'C': 0.0003958259577309504, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9373187528358352.
[I 2024-11-23 16:22:58,191] Trial 2 finished with value: 0.9366285226714692 and parameters: {'C': 0.007123320975419033, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9373187528358352.
[I 2024-11-23 16:24:09,035] Trial 3 finished with value: 0.937303114089767 and parameters: {'C': 0.016780472538198976, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9373187528358352.
[I 2024-11-23 16:24:28,247] Trial 4 finished with value: 0.7055294546794354 and parameters: {'C': 0.0020378630235286833, '

In [92]:
def optimize_random_forest(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
    }
    model = RandomForestClassifier(random_state=1004, n_jobs=-1, **params)
    return cross_val_score_model(model, X_train_resampled, y_train_resampled, kf)

study_rf = optuna.create_study(direction="maximize")
study_rf.optimize(optimize_random_forest, n_trials=10, n_jobs=-1)

[I 2024-11-23 16:38:51,220] A new study created in memory with name: no-name-932ece09-c070-4688-b0a1-27e918a4b1f6
[I 2024-11-23 16:55:35,200] Trial 1 finished with value: 0.9741960876624619 and parameters: {'n_estimators': 68, 'max_depth': 14, 'min_samples_split': 7, 'min_samples_leaf': 4}. Best is trial 1 with value: 0.9741960876624619.
[I 2024-11-23 16:58:54,808] Trial 2 finished with value: 0.9365976442324058 and parameters: {'n_estimators': 196, 'max_depth': 5, 'min_samples_split': 3, 'min_samples_leaf': 1}. Best is trial 1 with value: 0.9741960876624619.
[I 2024-11-23 17:01:51,438] Trial 3 finished with value: 0.942624831767049 and parameters: {'n_estimators': 198, 'max_depth': 6, 'min_samples_split': 5, 'min_samples_leaf': 4}. Best is trial 1 with value: 0.9741960876624619.
[I 2024-11-23 17:03:24,713] Trial 0 finished with value: 0.9370640687682211 and parameters: {'n_estimators': 250, 'max_depth': 5, 'min_samples_split': 2, 'min_samples_leaf': 4}. Best is trial 1 with value: 0.9

In [93]:
def optimize_xgb(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.2),
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "subsample": trial.suggest_uniform("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_uniform("colsample_bytree", 0.6, 1.0),
        "gamma": trial.suggest_uniform("gamma", 0, 5),
        "reg_alpha": trial.suggest_loguniform("reg_alpha", 1e-5, 10),
        "reg_lambda": trial.suggest_loguniform("reg_lambda", 1e-5, 10),
    }
    model = XGBClassifier(eval_metric="auc", use_label_encoder=False, random_state=1004, **params)
    return cross_val_score_model(model, X_train_resampled, y_train_resampled, kf)

study_xgb = optuna.create_study(direction="maximize")
study_xgb.optimize(optimize_xgb, n_trials=19, n_jobs=-1)

[I 2024-11-23 17:58:50,686] A new study created in memory with name: no-name-d5b06a4d-a95e-4de8-b260-74ffbd10ba89
[I 2024-11-23 18:02:04,812] Trial 0 finished with value: 0.9912729064463404 and parameters: {'max_depth': 6, 'learning_rate': 0.17303438765633802, 'n_estimators': 159, 'subsample': 0.9473949778114675, 'colsample_bytree': 0.8527704845692217, 'gamma': 4.671578502857488, 'reg_alpha': 0.8922968268261561, 'reg_lambda': 0.6550637071946124}. Best is trial 0 with value: 0.9912729064463404.
[I 2024-11-23 18:02:10,727] Trial 1 finished with value: 0.9663978067296691 and parameters: {'max_depth': 5, 'learning_rate': 0.019355638696036823, 'n_estimators': 160, 'subsample': 0.7376871704166348, 'colsample_bytree': 0.6482610166802991, 'gamma': 4.814496799974494, 'reg_alpha': 1.5056588376237805, 'reg_lambda': 0.012335759286032689}. Best is trial 0 with value: 0.9912729064463404.
[I 2024-11-23 18:05:14,632] Trial 2 finished with value: 0.9913519854375575 and parameters: {'max_depth': 6, 'lea

In [94]:
def optimize_catboost(trial):
    params = {
        "depth": trial.suggest_int("depth", 3, 10),
        "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.2),
        "iterations": trial.suggest_int("iterations", 100, 500),
    }
    model = CatBoostClassifier(silent=True, random_seed=1004, **params)
    return cross_val_score_model(model, X_train_resampled, y_train_resampled, kf)

study_catboost = optuna.create_study(direction="maximize")
study_catboost.optimize(optimize_catboost, n_trials=10)

[I 2024-11-23 18:11:39,927] A new study created in memory with name: no-name-56ee6ccb-f6f7-44b1-9558-195c6ac01b15
[I 2024-11-23 18:14:08,993] Trial 0 finished with value: 0.9911636597413404 and parameters: {'depth': 6, 'learning_rate': 0.10092336079775877, 'iterations': 228}. Best is trial 0 with value: 0.9911636597413404.
[I 2024-11-23 18:17:01,553] Trial 1 finished with value: 0.9893651586545683 and parameters: {'depth': 5, 'learning_rate': 0.04523476442360142, 'iterations': 310}. Best is trial 0 with value: 0.9911636597413404.
[I 2024-11-23 18:19:38,133] Trial 2 finished with value: 0.9866421149025107 and parameters: {'depth': 6, 'learning_rate': 0.023879857718413465, 'iterations': 250}. Best is trial 0 with value: 0.9911636597413404.
[I 2024-11-23 18:25:18,947] Trial 3 finished with value: 0.9917410566331712 and parameters: {'depth': 8, 'learning_rate': 0.06052656535047921, 'iterations': 438}. Best is trial 3 with value: 0.9917410566331712.
[I 2024-11-23 18:30:33,453] Trial 4 finis

In [96]:
# 최적화된 모델 구성
optimized_models = [
    LogisticRegression(max_iter=1000, random_state=1004, solver='liblinear'),
    RandomForestClassifier(random_state=1004, n_jobs=-1, n_estimators=267, min_samples_split=4, min_samples_leaf=4),
    XGBClassifier(eval_metric="auc", use_label_encoder=False, random_state=1004, max_depth=5, learning_rate=0.1177379236787768, subsample=0.704766141809691, colsample_bytree=0.9143688947775925, reg_alpha=0.015381931343944649, reg_lambda=1.2066576659412456),
    CatBoostClassifier(silent=True, random_seed=1004, learning_rate=0.06052656535047921),
]

# 최적화된 모델로 base_models 업데이트
base_models = optimized_models

# C값이 상대적으로 크기 때문에 규제가 약해져 과적합 위험
# max_depth=20 깊이가 커서 트리가 과도하게 세분화될 가능성
# n_estimators=483 트리 개수가 많아 과적합 위험
# gamma=1.9053 분할 기준이 낮아 필요 이상의 분할 가능성
# depth=8 깊이가 다소 커서 데이터에 따라 과적합 위험
# iterations=438 반복 횟수가 많아 모델 복잡도 증가

In [97]:
from sklearn.model_selection import KFold

# 메타 피처 생성 함수
def generate_meta_features(models, X, y, kf):
    meta_features = np.zeros((X.shape[0], len(models)))  # 모델 수 만큼 열을 생성
    
    for i, model in enumerate(models):
        print(f"Training base model {i + 1}/{len(models)}: {type(model).__name__}")
        
        # KFold 교차 검증을 통해 예측값 생성 (훈련 데이터만 해당)
        for train_idx, valid_idx in kf.split(X):
            X_train_fold, X_valid_fold = X.iloc[train_idx], X.iloc[valid_idx]  # .iloc로 인덱스를 사용
            y_train_fold = y.iloc[train_idx]
            
            model.fit(X_train_fold, y_train_fold)
            
            # 예측 확률값 또는 예측값 생성
            if hasattr(model, "predict_proba"):
                meta_features[valid_idx, i] = model.predict_proba(X_valid_fold)[:, 1]
            else:
                meta_features[valid_idx, i] = model.predict(X_valid_fold)
    
    return meta_features


# 메타 데이터 생성
X_train_meta = generate_meta_features(base_models, X_train_resampled, y_train_resampled, kf=kf)

Training base model 1/4: LogisticRegression
Training base model 2/4: RandomForestClassifier
Training base model 3/4: XGBClassifier
Training base model 4/4: CatBoostClassifier


In [98]:
# 메타 모델 학습 및 평가
meta_model.fit(X_train_meta, y_train_resampled)

In [100]:
train_features = X_train_resampled.columns

for feature in train_features:
    if feature not in test_df.columns:
        test_df[feature] = 0

# 훈련 데이터에 없는 특성 제거
test_df = test_df[train_features]

assert list(X_train_resampled.columns) == list(test_df.columns), "Features do not match between train and test datasets!"

# 테스트 데이터에 대한 메타 피처 생성 함수
def generate_meta_features_test(models, X):
    meta_features = np.zeros((X.shape[0], len(models)))  # 모델 수 만큼 열 생성

    for i, model in enumerate(models):
        print(f"Generating meta features for test data with model {i + 1}/{len(models)}: {type(model).__name__}")

        # 모델이 예측한 확률값 또는 예측값을 저장
        if hasattr(model, "predict_proba"):
            meta_features[:, i] = model.predict_proba(X)[:, 1]
        else:
            meta_features[:, i] = model.predict(X)

    return meta_features

X_test_meta = generate_meta_features_test(base_models, test_df)

Generating meta features for test data with model 1/4: LogisticRegression
Generating meta features for test data with model 2/4: RandomForestClassifier
Generating meta features for test data with model 3/4: XGBClassifier
Generating meta features for test data with model 4/4: CatBoostClassifier


# Prediction

In [101]:
from sklearn.metrics import roc_auc_score

# 검증 데이터에 대한 메타 피처 생성
X_val_meta = generate_meta_features(base_models, X_val, y_val, kf=kf)

# 메타 모델로 검증 데이터 예측
y_val_pred = meta_model.predict_proba(X_val_meta)[:, 1]
roc_auc = roc_auc_score(y_val, y_val_pred)
print(f"ROC AUC Score on Validation Data: {roc_auc:.5f}")

Training base model 1/4: LogisticRegression
Training base model 2/4: RandomForestClassifier
Training base model 3/4: XGBClassifier
Training base model 4/4: CatBoostClassifier
ROC AUC Score on Validation Data: 0.87621


# Submission

In [102]:
# 결과 제출 파일 생성
submission = pd.read_csv('submission.csv')
predictions = meta_model.predict(X_test_meta)
submission['ID'] = test_df['ID']
submission['HadHeartAttack'] = predictions
submission.to_csv('submission.csv', index=False)

# 제출 파일에서 클래스 분포 확인
print(submission['HadHeartAttack'].value_counts())

0    83323
1     5091
Name: HadHeartAttack, dtype: int64
