
**Summary:**

This paper estimates the likelihood that a person develops coronary heart disease (CHD) from a set of health factors, with the goal of identifying which factors matter most. We do this with predictive models, looking at which variables let a model reach the highest accuracy. The dataset is a subset of the Framingham Heart Study. For each patient in the sample, it records a range of health measurements together with that patient's 10-year CHD risk.

For the predictive model, we considered linear models, k-nearest neighbors, and decision trees. We chose decision trees because they can capture non-linear relationships between the features and the target variable, and because they are easy to interpret, with the most informative feature selected at each split. After fitting the decision tree and evaluating it on our test set, we achieved a predictive accuracy of 84%, the best result we obtained. From this, we concluded that the most significant predictors of CHD were age, glucose, ‘totChol’, and ‘cigsPerDay’. This matters because it tells people what to focus on and be aware of in order to minimize their risk of developing CHD.
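
The modeling approach described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the generated values, the 80/20 split, the `max_depth=4` setting, and the random seeds are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Framingham data); column names mirror the write-up.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "age": rng.integers(30, 70, n),
    "glucose": rng.normal(82, 24, n),
    "totChol": rng.normal(236, 45, n),
    "cigsPerDay": rng.integers(0, 40, n),
})
# Hypothetical outcome that depends on age, purely so the tree has something to learn.
demo["TenYearCHD"] = (demo["age"] + rng.normal(0, 10, n) > 58).astype(int)

X = demo.drop(columns="TenYearCHD")
y = demo["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_depth=4 is an arbitrary choice for this sketch.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Accuracy on the held-out rows, and per-feature importances from the fitted tree.
accuracy = accuracy_score(y_test, tree.predict(X_test))
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(f"test accuracy: {accuracy:.2f}")
print(importances)
```

Sorting `feature_importances_` is one common way to rank predictors from a fitted tree. The ranking depends on the tree's depth and on the train/test split, so it is best read as indicative rather than definitive.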

**Data:**


To clean the data, a number of factors were taken into account. The process started by counting and displaying all the columns with missing values, since missing values can cause errors within models and, if handled carelessly, discard important information. Next, a correlation matrix was created to identify variables that are highly correlated with developing coronary heart disease according to the ‘TenYearCHD’ variable. External background research was also done to get an idea of the most important variables to consider. This helped guide the cleaning process and determine what data was and was not acceptable to lose. The variables that stood out as potentially important at this point related predominantly to age and blood pressure.
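
The two initial checks described above can be sketched as follows. The small `demo` frame is a made-up stand-in for the real dataset, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset; values are made up for illustration.
demo = pd.DataFrame({
    "age": [39, 46, 48, 61, 46, 43],
    "sysBP": [106.0, 121.0, 127.5, 150.0, 130.0, 180.0],
    "glucose": [77.0, 76.0, np.nan, 103.0, np.nan, 99.0],
    "education": [4.0, 2.0, 1.0, np.nan, 3.0, 2.0],
    "TenYearCHD": [0, 0, 0, 1, 0, 1],
})

# Count missing values per column, keeping only columns that have any.
missing_counts = demo.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Correlation of every other column with the target, strongest (by magnitude) first.
target_corr = (
    demo.corr()["TenYearCHD"]
    .drop("TenYearCHD")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```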


Lastly, the describe method was used to get a general idea of the standard deviation of each variable, which came in handy when deciding the best way to impute certain missing values (using the median versus the mean, for example).
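
As a small illustration of why the spread matters for that choice, the sketch below uses invented cholesterol-like values with one large outlier. The numbers are assumptions for the example, not values from the dataset.

```python
import pandas as pd

# Made-up values with one large outlier, for illustration only.
values = pd.Series(
    [195.0, 210.0, 225.0, 233.0, 240.0, 250.0, 600.0], name="totChol"
)

# describe() reports count, mean, std, min, quartiles, and max.
summary = values.describe()
print(summary)

# The outlier pulls the mean upward, while the median (the 50% row) is unaffected.
mean_value = values.mean()
median_value = values.median()
print(f"mean={mean_value:.1f}, median={median_value:.1f}")
```

When the mean and median are close, as the write-up later reports for ‘totChol’, either choice gives nearly the same imputed value.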


The **first variable** cleaned was education. This variable had a weak negative correlation with the ‘TenYearCHD’ variable (the lowest of the variables present within the data). Because of this, and the relatively low number of missing values, all missing values were simply set to the mean of the education field (the value ‘2’).

In [None]:
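# Check the mean of the education column
education_mean = df['education'].mean()
print(education_mean)

# Mean is essentially 2, so impute all missing values with 2.
# Assigning the result back avoids pandas' chained-assignment pitfalls with inplace=True.
df['education'] = df['education'].fillna(2)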

The **second variable** cleaned was ‘cigsPerDay’. This variable is heavily influenced by the ‘currentSmoker’ variable, since individuals who do not smoke should have a value of zero for this field. As a result, individuals with a ‘0’ for ‘currentSmoker’ whose ‘cigsPerDay’ value was missing were set to zero cigarettes per day. Individuals marked as current smokers with missing values were handled differently: to impute their value, the mean of cigarettes per day for all individuals who reported being a current smoker was used.


In [None]:
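# Rows where cigsPerDay is missing
missing_cigs = df['cigsPerDay'].isna()

# Non-smokers get 0; everyone else gets the imputed value of 9
df.loc[missing_cigs & (df['currentSmoker'] == 0), 'cigsPerDay'] = 0
df.loc[missing_cigs & (df['currentSmoker'] != 0), 'cigsPerDay'] = 9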
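
The conditional imputation described above can also be written so that the smoker mean is computed from the data instead of being typed in by hand. The sketch below is one possible way to do that on a toy frame. The `demo` values are invented, and treating `currentSmoker == 1` as the smoker flag is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in data; values are invented for illustration.
demo = pd.DataFrame({
    "currentSmoker": [0, 0, 1, 1, 1, 1],
    "cigsPerDay": [0.0, np.nan, 20.0, 10.0, np.nan, 30.0],
})

# Mean cigarettes per day among current smokers only (missing values are skipped).
smoker_mean = demo.loc[demo["currentSmoker"] == 1, "cigsPerDay"].mean()

missing = demo["cigsPerDay"].isna()
is_smoker = demo["currentSmoker"] == 1

# Non-smokers with a missing value get 0; smokers get the smoker mean.
demo.loc[missing & ~is_smoker, "cigsPerDay"] = 0
demo.loc[missing & is_smoker, "cigsPerDay"] = smoker_mean
print(demo)
```

Computing the value from the data keeps the imputation consistent if the dataset changes. It also makes explicit which subgroup the mean was taken over.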

The **third variable** cleaned was ‘BPMeds’. This variable had only 37 missing values and a weaker positive correlation with ‘TenYearCHD’ than similar variables within the dataset such as ‘diaBP’. Furthermore, so few individuals fell into this category (only around 3% of observations) that it was decided to impute all missing values with 0.


In [None]:
# Impute missing BPMeds values with 0 (not on blood pressure medication)
df['BPMeds'] = df['BPMeds'].fillna(0)

The **fourth variable** cleaned was ‘totChol’. This variable also had very few missing values. On average this value was 236.0 mg/dL across patients, with a standard deviation of 44.85 mg/dL. Because the standard deviation was slightly larger than expected, we imputed the missing values with the median of 233.0 mg/dL instead, although with the two values so close to one another this may not have been necessary.

In [None]:
# Median is 233.0, so impute missing totChol values with it
print(df['totChol'].median())
df['totChol'] = df['totChol'].fillna(233.0)

The **fifth variable** cleaned was ‘BMI’. This variable had only a few missing values, so the mean value was used to impute the missing fields.

In [None]:
# Mean is about 25.89, so impute missing BMI values with it
print(df['BMI'].mean())
df['BMI'] = df['BMI'].fillna(25.89)

The **final variable** cleaned was glucose. This variable had by far the most missing values in the dataset, but they were still a small share of the overall dataset (285 out of 3180 total observations). Furthermore, the variable had a rather high standard deviation and an elevated positive correlation with the diabetes variable. Because the values are believed to be missing completely at random, it was decided that dropping the rows with missing values would not significantly alter the results of our model.


In [None]:
# Drop the rows where glucose is missing, as described above
df = df.dropna(subset=['glucose'])
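
One rough way to sanity-check the missing-completely-at-random assumption for glucose is to compare the outcome rate between rows with and without a glucose value. The sketch below does this on invented data. It is an informal check under that assumption, not a formal test, and the `demo` values and group labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in data; values are invented for illustration.
demo = pd.DataFrame({
    "glucose": [77.0, np.nan, 85.0, 103.0, np.nan, 70.0, 90.0, np.nan],
    "TenYearCHD": [0, 0, 1, 1, 1, 0, 0, 0],
})

# Label each row by whether its glucose value is missing.
glucose_missing = demo["glucose"].isna()
group = glucose_missing.map({True: "missing", False: "observed"})

# Compare the CHD rate between the two groups.
chd_rate = demo.groupby(group)["TenYearCHD"].mean()
print(chd_rate)

# Share of rows that would be lost by dropping missing glucose values.
share_missing = glucose_missing.mean()
print(f"share of rows with missing glucose: {share_missing:.1%}")
```

If the outcome rates in the two groups differ noticeably, dropping the rows could bias the model, and imputation might be worth reconsidering.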