# Heart Disease Mortality Prediction Model
* Dataset - Kaggle
* Format - file data (CSV)
* Download - https://www.kaggle.com/nandvard/microsoft-data-science-capstone

# Columns 
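**area - area information**

* `area__rucc` - Rural-Urban Continuum Codes: three categories for metropolitan counties by population size, and six categories for nonmetropolitan counties by degree of urbanization and adjacency to a metropolitan area (USDA Economic Research Service, https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/)

* `area__urban_influence` - Urban Influence Codes: metropolitan counties classified by population size, and nonmetropolitan counties classified by their proximity to metropolitan areas and smaller cities (USDA Economic Research Service, https://www.ers.usda.gov/data-products/urban-influence-codes/)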

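**econ - economic indicators**

* `econ__economic_typology` - County Typology Codes: classifies counties by six categories of economic dependence (farming, mining, manufacturing, federal/state government, recreation, and nonspecialized) and by six policy-relevant themes (low education, low employment, persistent poverty, persistent child poverty, population loss, and retirement destination) (USDA Economic Research Service, https://www.ers.usda.gov/data-products/county-typology-codes.aspx)

* `econ__pct_civilian_labor` - Civilian labor force, annual average, as a share of the population (Bureau of Labor Statistics, http://www.bls.gov/lau/)

* `econ__pct_unemployment` - Unemployment, annual average, as a share of the population (Bureau of Labor Statistics, http://www.bls.gov/lau/)

* `econ__pct_uninsured_adults` - Percentage of adults without health insurance (Bureau of Labor Statistics, http://www.bls.gov/lau/)

* `econ__pct_uninsured_children` - Percentage of children without health insurance (Bureau of Labor Statistics, http://www.bls.gov/lau/)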


**health - health indicators**

* `health__pct_adult_obesity` - Percentage of adults who meet the clinical definition of obesity (National Center for Chronic Disease Prevention and Health Promotion)

* `health__pct_adult_smoking` - Percentage of adults who smoke (Behavioral Risk Factor Surveillance System)

* `health__pct_diabetes` - Percentage of the population with diabetes (National Center for Chronic Disease Prevention and Health Promotion, Division of Diabetes Translation)

* `health__pct_low_birthweight` - Percentage of births with low birthweight (National Center for Health Statistics)

* `health__pct_excessive_drinking` - Percentage of adults who drink excessively (Behavioral Risk Factor Surveillance System)

* `health__pct_physical_inacticity` - Percentage of adults who are physically inactive (National Center for Chronic Disease Prevention and Health Promotion). The misspelling "inacticity" is part of the dataset's column name.

* `health__air_pollution_particulate_matter` - Fine particulate matter in µg/m³ (CDC WONDER, https://wonder.cdc.gov/wonder/help/pm.html)

* `health__homicides_per_100k` - Deaths by homicide per 100,000 population (National Center for Health Statistics)

* `health__motor_vehicle_crash_deaths_per_100k` - Deaths from motor vehicle crashes per 100,000 population (National Center for Health Statistics)

* `health__pop_per_dentist` - Population per dentist (HRSA Area Resource File)

* `health__pop_per_primary_care_physician` - Population per primary care physician (HRSA Area Resource File)

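**demo - demographic information**

* `demo__pct_female` - Percentage of the population that is female (US Census Population Estimates)

* `demo__pct_below_18_years_of_age` - Percentage of the population under 18 years of age (US Census Population Estimates)

* `demo__pct_aged_65_years_and_older` - Percentage of the population aged 65 and older (US Census Population Estimates)

* `demo__pct_hispanic` - Percentage of the population that is Hispanic (US Census Population Estimates)

* `demo__pct_non_hispanic_african_american` - Percentage of the population that is non-Hispanic African American (US Census Population Estimates)

* `demo__pct_non_hispanic_white` - Percentage of the population that is non-Hispanic white (US Census Population Estimates)

* `demo__pct_american_indian_or_alaskan_native` - Percentage of the population that is American Indian or Alaskan Native (US Census Population Estimates)

* `demo__pct_asian` - Percentage of the population that is Asian (US Census Population Estimates)

* `demo__pct_adults_less_than_a_high_school_diploma` - Percentage of adults without a high school diploma (US Census, American Community Survey)

* `demo__pct_adults_with_high_school_diploma` - Percentage of adults with a high school diploma (US Census, American Community Survey)

* `demo__pct_adults_with_some_college` - Percentage of adults with some college education (US Census, American Community Survey)

* `demo__pct_adults_bachelors_or_higher` - Percentage of adults with a bachelor's degree or higher (US Census, American Community Survey)

* `demo__birth_rate_per_1k` - Births per 1,000 population (US Census Population Estimates)

* `demo__death_rate_per_1k` - Deaths per 1,000 population (US Census Population Estimates)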

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings


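# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
# NanumBarunGothic is a Korean font; it must be installed for Korean plot labels to render
plt.rc('font', family='NanumBarunGothic')
plt.rcParams['figure.figsize'] = (10, 7)

# Show floats with two decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)

%matplotlib inline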



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
filename = '/content/sample_data/Training_values.csv'

In [7]:
filename1 = '/content/sample_data/Training_labels.csv'

In [8]:
df1 = pd.read_csv(filename1)

In [9]:
df = pd.read_csv(filename)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3198 entries, 0 to 3197
Data columns (total 34 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   row_id                                            3198 non-null   int64  
 1   area__rucc                                        3198 non-null   object 
 2   area__urban_influence                             3198 non-null   object 
 3   econ__economic_typology                           3198 non-null   object 
 4   econ__pct_civilian_labor                          3198 non-null   float64
 5   econ__pct_unemployment                            3198 non-null   float64
 6   econ__pct_uninsured_adults                        3196 non-null   float64
 7   econ__pct_uninsured_children                      3196 non-null   float64
 8   demo__pct_female                                  3196 non-null   float64
 9   demo__pct_below_18_

In [11]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3198 entries, 0 to 3197
Data columns (total 2 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   row_id                            3198 non-null   int64
 1   heart_disease_mortality_per_100k  3198 non-null   int64
dtypes: int64(2)
memory usage: 50.1 KB


In [12]:
df = pd.merge(df, df1, on = 'row_id', how = 'left')
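Because the features and labels are joined only on `row_id`, it may be worth checking that the join is strictly one-to-one and that no feature row is left without a label. The sketch below uses small made-up frames in place of the two CSV files; `validate` and `indicator` are standard `pd.merge` parameters, while the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for Training_values.csv and Training_labels.csv (made-up numbers).
values = pd.DataFrame({
    "row_id": [0, 1, 4],
    "econ__pct_unemployment": [0.06, 0.04, 0.07],
})
labels = pd.DataFrame({
    "row_id": [0, 1, 4],
    "heart_disease_mortality_per_100k": [312, 257, 195],
})

# validate="one_to_one" raises pandas.errors.MergeError if row_id is duplicated
# on either side; indicator=True adds a _merge column recording the match status.
merged = pd.merge(values, labels, on="row_id", how="left",
                  validate="one_to_one", indicator=True)

# Count feature rows that did not find a label, then drop the helper column.
unmatched = int((merged["_merge"] != "both").sum())
print("rows without a label:", unmatched)
merged = merged.drop(columns="_merge")
```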

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3198 entries, 0 to 3197
Data columns (total 35 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   row_id                                            3198 non-null   int64  
 1   area__rucc                                        3198 non-null   object 
 2   area__urban_influence                             3198 non-null   object 
 3   econ__economic_typology                           3198 non-null   object 
 4   econ__pct_civilian_labor                          3198 non-null   float64
 5   econ__pct_unemployment                            3198 non-null   float64
 6   econ__pct_uninsured_adults                        3196 non-null   float64
 7   econ__pct_uninsured_children                      3196 non-null   float64
 8   demo__pct_female                                  3196 non-null   float64
 9   demo__pct_below_18_

In [14]:
df.describe()  # What information should we look for here?

Unnamed: 0,row_id,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__homicides_per_100k,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician,heart_disease_mortality_per_100k
count,3198.0,3198.0,3198.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3196.0,3198.0,3198.0,3198.0,3198.0,3198.0,3198.0,3196.0,2734.0,3196.0,3016.0,2220.0,3196.0,3170.0,1231.0,2781.0,2954.0,2968.0,3198.0
mean,3116.99,0.47,0.06,0.22,0.09,0.5,0.23,0.17,0.09,0.09,0.77,0.02,0.01,0.15,0.35,0.3,0.2,11.68,10.3,0.31,0.21,0.11,0.08,0.16,0.28,11.63,5.95,21.13,3431.43,2551.34,279.37
std,1830.24,0.07,0.02,0.07,0.04,0.02,0.03,0.04,0.14,0.15,0.21,0.08,0.03,0.07,0.07,0.05,0.09,2.74,2.79,0.04,0.06,0.02,0.02,0.05,0.05,1.56,5.03,10.49,2569.45,2100.46,58.95
min,0.0,0.21,0.01,0.05,0.01,0.28,0.09,0.04,0.0,0.0,0.05,0.0,0.0,0.02,0.07,0.11,0.01,4.0,0.0,0.13,0.05,0.03,0.03,0.04,0.09,7.0,-0.4,3.14,339.0,189.0,109.0
25%,1504.25,0.42,0.04,0.17,0.06,0.49,0.21,0.14,0.02,0.01,0.65,0.0,0.0,0.1,0.31,0.26,0.14,10.0,8.0,0.28,0.17,0.09,0.07,0.13,0.24,10.0,2.62,13.49,1812.25,1420.0,237.0
50%,3113.5,0.47,0.06,0.22,0.08,0.5,0.23,0.17,0.04,0.02,0.85,0.01,0.01,0.13,0.36,0.3,0.18,11.0,10.0,0.31,0.21,0.11,0.08,0.16,0.28,12.0,4.7,19.63,2690.0,1999.0,275.0
75%,4724.75,0.51,0.07,0.26,0.11,0.51,0.25,0.2,0.09,0.1,0.94,0.01,0.01,0.19,0.4,0.34,0.23,13.0,12.0,0.33,0.25,0.12,0.1,0.2,0.31,13.0,7.89,26.49,4089.75,2859.0,317.0
max,6276.0,1.0,0.25,0.5,0.28,0.57,0.42,0.35,0.93,0.86,0.99,0.86,0.34,0.47,0.56,0.47,0.8,29.0,27.0,0.47,0.51,0.2,0.24,0.37,0.44,15.0,50.49,110.45,28130.0,23399.0,512.0
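Two things stand out in the summary above: several columns have a `count` below 3198, and `health__homicides_per_100k` has a minimum of -0.40, which is odd for a rate. A minimal sketch of how both could be surfaced systematically, using a small made-up frame (the values are assumptions, only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "row_id": [0, 1, 4, 5],
    "econ__pct_unemployment": [0.06, 0.04, 0.07, 0.05],
    "health__pct_adult_smoking": [0.21, np.nan, 0.25, 0.17],
    "health__homicides_per_100k": [2.8, np.nan, np.nan, -0.4],
})

# Count and share of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("share_missing", ascending=False)
print(missing)

# Rates and percentages should not be negative: list numeric columns
# whose minimum is below zero so they can be inspected by hand.
numeric = toy.select_dtypes("number")
negative_min = numeric.columns[numeric.min() < 0].tolist()
print(negative_min)
```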


In [76]:
df[df['econ__pct_uninsured_adults'].isna()]  # Fill with a simple statistic, or with the mean per area__rucc group?

Unnamed: 0,row_id,area__rucc,area__urban_influence,econ__economic_typology,econ__pct_civilian_labor,econ__pct_unemployment,econ__pct_uninsured_adults,econ__pct_uninsured_children,demo__pct_female,demo__pct_below_18_years_of_age,demo__pct_aged_65_years_and_older,demo__pct_hispanic,demo__pct_non_hispanic_african_american,demo__pct_non_hispanic_white,demo__pct_american_indian_or_alaskan_native,demo__pct_asian,demo__pct_adults_less_than_a_high_school_diploma,demo__pct_adults_with_high_school_diploma,demo__pct_adults_with_some_college,demo__pct_adults_bachelors_or_higher,demo__birth_rate_per_1k,demo__death_rate_per_1k,health__pct_adult_obesity,health__pct_adult_smoking,health__pct_diabetes,health__pct_low_birthweight,health__pct_excessive_drinking,health__pct_physical_inacticity,health__air_pollution_particulate_matter,health__homicides_per_100k,health__motor_vehicle_crash_deaths_per_100k,health__pop_per_dentist,health__pop_per_primary_care_physician,yr,heart_disease_mortality_per_100k


In [50]:
# TODO: write a for loop (fill with the overall mean)


In [47]:
df['econ__pct_uninsured_adults'] = df['econ__pct_uninsured_adults'].fillna(df['econ__pct_uninsured_adults'].mean()) 

In [58]:
df['econ__pct_uninsured_children'] = df['econ__pct_uninsured_children'].fillna(df['econ__pct_uninsured_children'].mean())

In [60]:
df['demo__pct_female'] = df['demo__pct_female'].fillna(df['demo__pct_female'].mean())

In [59]:
df['demo__pct_below_18_years_of_age'] = df['demo__pct_below_18_years_of_age'].fillna(df['demo__pct_below_18_years_of_age'].mean())


In [62]:
df['demo__pct_aged_65_years_and_older'] = df['demo__pct_aged_65_years_and_older'].fillna(df['demo__pct_aged_65_years_and_older'].mean())


In [63]:
df['demo__pct_hispanic'] = df['demo__pct_hispanic'].fillna(df['demo__pct_hispanic'].mean())


In [64]:
df['demo__pct_non_hispanic_african_american'] = df['demo__pct_non_hispanic_african_american'].fillna(df['demo__pct_non_hispanic_african_american'].mean())


In [65]:
df['demo__pct_american_indian_or_alaskan_native'] = df['demo__pct_american_indian_or_alaskan_native'].fillna(df['demo__pct_american_indian_or_alaskan_native'].mean())

In [66]:
df['demo__pct_asian'] = df['demo__pct_asian'].fillna(df['demo__pct_asian'].mean())


In [68]:
df['health__pct_adult_obesity'] = df['health__pct_adult_obesity'].fillna(df['health__pct_adult_obesity'].mean())

In [57]:
df['health__pct_diabetes'] = df['health__pct_diabetes'].fillna(df['health__pct_diabetes'].mean())

In [52]:
df['health__pct_low_birthweight'] = df['health__pct_low_birthweight'].fillna(df['health__pct_low_birthweight'].mean())

In [54]:
df['health__pct_physical_inacticity'] = df['health__pct_physical_inacticity'].fillna(df['health__pct_physical_inacticity'].mean())
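The repeated `fillna` cells above could be collapsed into the for loop planned earlier. The sketch below picks numeric columns with a small share of missing values and fills each with its own overall mean. The cutoff `max_missing_share` is a hypothetical parameter, set high here only because the toy frame has four rows; on the real data a much smaller value would be needed to select just the columns with a couple of gaps. Note that the cells shown above do not include one for `demo__pct_non_hispanic_white`, which `describe()` also reports at 3196 non-null; a loop like this would pick it up automatically.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "econ__pct_uninsured_adults": [0.20, np.nan, 0.24, 0.22],
    "demo__pct_non_hispanic_white": [0.80, 0.90, np.nan, 0.70],
    "health__homicides_per_100k": [2.8, np.nan, np.nan, np.nan],
})

# Hypothetical cutoff: only columns with at most this share of NaN are mean-filled.
max_missing_share = 0.3

few_missing = [
    col for col in toy.select_dtypes("number").columns
    if 0 < toy[col].isna().mean() <= max_missing_share
]

# Fill each selected column with its own overall mean.
for col in few_missing:
    toy[col] = toy[col].fillna(toy[col].mean())

print(few_missing)
print(toy.isna().sum())
```

Columns above the cutoff, such as the homicide rate in this toy frame, are left untouched so they can be handled separately.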

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3198 entries, 0 to 3197
Data columns (total 35 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   row_id                                            3198 non-null   int64  
 1   area__rucc                                        3198 non-null   object 
 2   area__urban_influence                             3198 non-null   object 
 3   econ__economic_typology                           3198 non-null   object 
 4   econ__pct_civilian_labor                          3198 non-null   float64
 5   econ__pct_unemployment                            3198 non-null   float64
 6   econ__pct_uninsured_adults                        3198 non-null   float64
 7   econ__pct_uninsured_children                      3198 non-null   float64
 8   demo__pct_female                                  3198 non-null   float64
 9   demo__pct_below_18_

In [70]:
df['health__pct_excessive_drinking']  # Many NaNs here: still fill with the mean? An age-group mean? A regional mean?

0       nan
1      0.18
2      0.20
3       nan
4      0.19
       ... 
3193   0.17
3194   0.12
3195    nan
3196   0.11
3197    nan
Name: health__pct_excessive_drinking, Length: 3198, dtype: float64
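With only 2220 of 3198 values present (roughly 31% missing), a plain mean fill would hide the fact that many rows were imputed. One option, not taken from this notebook, is to keep a flag column recording which rows were missing before filling. In the sketch below the `_missing` suffix and the choice of the median are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with made-up values.
col = "health__pct_excessive_drinking"
toy = pd.DataFrame({col: [np.nan, 0.18, 0.20, np.nan, 0.19]})

# Hypothetical helper column: 1 where the value was originally missing.
toy[col + "_missing"] = toy[col].isna().astype(int)

# Fill the gaps with the median of the observed values.
toy[col] = toy[col].fillna(toy[col].median())
print(toy)
```

A model can then use the flag to treat imputed rows differently, at the cost of one extra feature per flagged column.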

In [71]:
df['health__homicides_per_100k']  # Should this column (homicides) be dropped?

0      2.80
1      2.30
2      9.31
3       nan
4       nan
       ... 
3193    nan
3194    nan
3195    nan
3196    nan
3197    nan
Name: health__homicides_per_100k, Length: 3198, dtype: float64
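According to `describe()`, only 1231 of 3198 rows have a homicide rate, so about 61.5% of the column is missing. One way to make the keep-or-drop decision explicit is a threshold on the share of missing values. The sketch below uses made-up values, and `drop_threshold` is a hypothetical cutoff rather than a recommended one.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "row_id": [0, 1, 4, 5, 6],
    "health__homicides_per_100k": [2.80, 2.30, np.nan, np.nan, np.nan],
    "health__motor_vehicle_crash_deaths_per_100k": [15.09, 19.79, 3.14, np.nan, 29.39],
})

# Hypothetical cutoff: drop any column with more than half of its values missing.
drop_threshold = 0.5

share_missing = toy.isna().mean()
to_drop = share_missing[share_missing > drop_threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```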

In [73]:
df['health__motor_vehicle_crash_deaths_per_100k']  # Curious which factors this is related to

0      15.09
1      19.79
2       3.14
3        nan
4      29.39
        ... 
3193   24.44
3194   23.45
3195     nan
3196   19.45
3197   15.53
Name: health__motor_vehicle_crash_deaths_per_100k, Length: 3198, dtype: float64
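A quick first look at which factors move together with the crash death rate is to rank the other numeric columns by their correlation with it. The sketch below runs on synthetic data in which a relationship with physical inactivity is deliberately built in, so the output shows only how the ranking works and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: the link between inactivity and crash deaths is built in
# on purpose so that the ranking has something to find.
rng = np.random.default_rng(0)
n = 200
inactivity = rng.uniform(0.1, 0.4, size=n)
toy = pd.DataFrame({
    "area__rucc": ["Metro", "Nonmetro"] * (n // 2),  # non-numeric, excluded below
    "health__pct_physical_inacticity": inactivity,
    "demo__pct_female": rng.uniform(0.45, 0.55, size=n),
    "health__motor_vehicle_crash_deaths_per_100k": 60 * inactivity + rng.normal(0, 1, size=n),
})

target = "health__motor_vehicle_crash_deaths_per_100k"

# DataFrame.corr computes Pearson correlation by default and skips NaN pairwise.
corr = toy.select_dtypes("number").corr()[target].drop(target)

# Order by absolute strength, strongest first.
corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr)
```

Correlation only captures linear association, so a scatter plot of the top candidates is a sensible follow-up.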

In [74]:
df['health__pop_per_dentist']  # Plan: fill with the mean grouped by area

0      1650.00
1      2010.00
2       629.00
3      1810.00
4      3489.00
         ...  
3193   1490.00
3194   6229.00
3195       nan
3196   2609.00
3197   7189.00
Name: health__pop_per_dentist, Length: 3198, dtype: float64

In [75]:
df['health__pop_per_primary_care_physician']  # Plan: fill with the mean grouped by area

0      1489.00
1      2480.00
2       690.00
3      6630.00
4      2590.00
         ...  
3193   1820.00
3194   3060.00
3195    940.00
3196   1559.00
3197   7140.00
Name: health__pop_per_primary_care_physician, Length: 3198, dtype: float64
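The group-wise fill planned in the comments above can be sketched with `groupby(...).transform("mean")`, which returns a series aligned with the original rows. Grouping by `area__rucc` is an assumption here (the comments only say "by area"), and the short labels "Metro" and "Nonmetro" are made up for the toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and simplified, hypothetical area labels.
col = "health__pop_per_dentist"
toy = pd.DataFrame({
    "area__rucc": ["Metro", "Metro", "Metro", "Nonmetro", "Nonmetro", "Nonmetro"],
    col: [1650.0, np.nan, 629.0, 3489.0, 6229.0, np.nan],
})

# Mean of the observed values within each area group, aligned to every row.
group_mean = toy.groupby("area__rucc")[col].transform("mean")
toy[col] = toy[col].fillna(group_mean)

# Fallback: if an entire group were missing, its group mean would be NaN,
# so any remaining gaps get the overall mean.
toy[col] = toy[col].fillna(toy[col].mean())
print(toy)
```

The same pattern applies to `health__pop_per_primary_care_physician` by changing `col`.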