<a href="https://colab.research.google.com/github/Catherine-Nguyen88/project_chd/blob/main/DS3001_Project_2_Report_(Group_17).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DS3001 Project 2 (Group 17)**

Isabella Dressel, Camila Gutierrez, Catherine Nguyen, Rhiannon Staley

## **Summary**

The Framingham Heart Study began in 1948 to investigate the factors associated with coronary heart disease (CHD). The study followed a sample of adults in Framingham, Massachusetts across multiple generations. When it began over 70 years ago, little was known about the disease, and conclusions could only be drawn from mortality statistics or clinical studies. The investigators decided that the best way to learn more was to study a normal population, so that prevention and treatment techniques could be developed from the findings. One hypothesis going into the study was that multiple contributing factors eventually lead a person to develop CHD, including constitutional (hereditary), conditioning (environmental), and time factors.

In this project, we are given many of these factors and tasked with predicting the likelihood that a person develops CHD. We used three main methods to predict the outcome variable TenYearCHD: multiple linear regression, regression trees, and k-nearest neighbors (KNN). After fitting models on the training data and evaluating them on the test data, we used R² as our measure of model performance. For every method and model, including all variables gave better performance than narrowing down the set used to fit the model. This suggests that each factor we used contributed some predictive information about TenYearCHD. An essential part of applying these methods was finding parameter values that improved the models, such as tree depth for regression trees and the number of neighbors k for KNN. Of the methods we tried, regression trees gave the highest R² and were therefore our best model for predicting the likelihood that a person develops CHD.
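The R² metric (Rsq) used above can be written out by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is only an illustration: the function name `r_squared` and the five outcome/prediction pairs are made up for this example and are not taken from the project data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - sse / sst

# Invented binary outcomes and predicted risks, for illustration only.
y_true = [0, 0, 1, 0, 1]
y_pred = [0.1, 0.2, 0.6, 0.1, 0.5]
print(r_squared(y_true, y_pred))
```

A model that always predicts the mean of the outcome scores 0 under this definition, which is a useful reference point when the outcome is binary and R² values tend to be small.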


In [None]:
# clone from repo
! git clone https://github.com/Catherine-Nguyen88/project_chd

Cloning into 'project_chd'...
remote: Enumerating objects: 177, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 177 (delta 89), reused 69 (delta 62), pack-reused 74
Receiving objects: 100% (177/177), 6.80 MiB | 12.37 MiB/s, done.
Resolving deltas: 100% (123/123), done.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

In [None]:
# Importing Data and Cleaning Variables
train_data = pd.read_csv('https://raw.githubusercontent.com/Catherine-Nguyen88/project_chd/main/fhs_train.csv', low_memory=False)
test_data = pd.read_csv('https://raw.githubusercontent.com/Catherine-Nguyen88/project_chd/main/fhs_test.csv', low_memory=False)

# clean training data
train_data1 = train_data.loc[:,['sex', 'age', 'education', 'currentSmoker', 'cigsPerDay',
       'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol',
       'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD']]

# selecting variables to include for analysis
train_final = train_data1.loc[:,['sex', 'currentSmoker', 'cigsPerDay',
                          'diabetes', 'totChol', 'sysBP',
                          'diaBP', 'BMI', 'TenYearCHD']]
train_final1 = train_final.dropna()

# clean testing data
test_data1 = test_data.loc[:,['sex', 'age', 'education', 'currentSmoker', 'cigsPerDay',
       'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol',
       'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD']]

# selecting variables to include for analysis
test_final = test_data1.loc[:,['sex', 'currentSmoker', 'cigsPerDay',
                          'diabetes', 'totChol', 'sysBP',
                          'diaBP', 'BMI', 'TenYearCHD']]
test_final1 = test_final.dropna()

## **Data**

The data come from the Framingham Study, which began in 1948 through the US Public Health Service in Massachusetts. This longitudinal investigation of what leads to cardiovascular disease (CVD) studied 5,209 men and women. By building an understanding of the risk factors that lead to CVD, the research aims to shed light on the incidence of the disease. 32 clinical exams and an event follow-up collected data up until 2018. Our project focuses on building predictive algorithms that predict whether an individual develops CHD. We used 8 independent (predictor) variables (sex, currentSmoker, cigsPerDay, diabetes, totChol, sysBP, diaBP, and BMI) and 1 dependent (outcome) variable (TenYearCHD). sex is the recorded sex of the participant, with 1 designating males; currentSmoker is whether the participant currently smokes; cigsPerDay is the number of cigarettes smoked per day; diabetes is whether the participant has diabetes as of the first exam; totChol is total cholesterol (mg/dL); sysBP and diaBP are systolic and diastolic blood pressure (mmHg); and BMI is body mass index (weight in kg divided by height in m, squared: kg/m²). Finally, TenYearCHD is 1 when a patient developed coronary heart disease within 10 years of the exam. For all variables, rows with NAs were dropped. The period analyzed was 1948 to 2018, to examine possible correlations between risk factors and the development of CHD. The data were cleaned in Cleaning_Variables.ipynb and extracted in project1_extract_data.ipynb.
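Dropping NAs as described above removes any row that is missing a value in at least one selected column, so it can help to count missing values per column before filtering. The sketch below shows this on a toy table: the column names follow the variables listed above, but the four rows and their values are invented for illustration and are not Framingham records.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data; all values are invented.
toy = pd.DataFrame({
    "sex": [1, 0, 1, 0],
    "cigsPerDay": [20.0, np.nan, 0.0, 5.0],
    "totChol": [250.0, 195.0, np.nan, 210.0],
    "BMI": [26.1, 23.4, 28.9, 30.2],
    "TenYearCHD": [0, 0, 1, 0],
})

# How many values are missing in each column before filtering.
missing_per_column = toy.isna().sum()

# Keep only rows with no missing values (complete-case analysis).
complete = toy.dropna()
print(missing_per_column)
print(f"rows kept: {len(complete)} of {len(toy)}")

# Split into predictors and outcome for modeling.
X = complete.drop(columns="TenYearCHD")
y = complete["TenYearCHD"]
```

Comparing the row counts before and after shows how much of the sample the complete-case filter discards, which matters when one column accounts for most of the missing values.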



**Challenges**


## **Results**

## **Conclusion**

The Framingham study has been conducted in Framingham, Massachusetts since 1948 to learn more about the factors that contribute to cardiovascular disease. Over 70 years have passed and the study is still in progress. In 2019, the study was awarded $38 million and renewed for another 6 years.

In this project, we further investigated how various hereditary, environmental, and time factors can contribute to an individual's risk of developing coronary heart disease. After testing on a subset of variables, we concluded that using all the predictor variables with each predictive method gave the highest R² and was therefore the best way to predict the binary outcome variable TenYearCHD. A TenYearCHD value of 0 means an individual did not develop CHD within 10 years, while a value of 1 means they did.

We used regression trees, k-nearest neighbors (KNN), and multiple linear regression to build models that predict the TenYearCHD outcome. An important part of the project was tuning each model by finding parameter values that avoid both overfitting and underfitting. For regression trees, tree depths between 1 and 20 were tested, and a tree depth of 5 produced the model that best represented and predicted the test set. For KNN, values of k up to 200 were evaluated, and k=108 gave the lowest mean squared error. Avoiding overfitting ensures the model does not become unnecessarily complex and capture noise. Conversely, avoiding underfitting ensures the model is not so simple that it generalizes incorrectly and fails to capture the relationship between the predictors and the outcome variable, TenYearCHD. Of the predictive methods we applied, regression trees gave the highest R², 0.099. That model was built using all factors (features), and its optimal tree depth was 4.
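The parameter searches described above can be sketched as two loops, one over tree depth and one over k, each scoring a fitted model on held-out data. This is a minimal sketch under stated assumptions, not the code used for the reported results: the data are synthetic, `make_data` is a hypothetical helper, the k range starts at 1 because scikit-learn requires at least one neighbor, and the scores it prints will not match the values reported in this section.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical helper: synthetic predictors and a binary outcome."""
    X = rng.normal(size=(n, 4))
    risk = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] - 1.5)))
    y = (rng.random(n) < risk).astype(float)
    return X, y

X_train, y_train = make_data(600)
X_test, y_test = make_data(300)

# Baseline: multiple linear regression scored by R^2 on the held-out set.
linear = LinearRegression().fit(X_train, y_train)
linear_r2 = r2_score(y_test, linear.predict(X_test))

# Regression tree: try depths 1 to 20 and keep the depth with the highest R^2.
tree_r2 = {}
for depth in range(1, 21):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    tree_r2[depth] = r2_score(y_test, tree.predict(X_test))
best_depth = max(tree_r2, key=tree_r2.get)

# KNN: scale features to [0, 1] using the training data only, then try
# k = 1 to 200 and keep the k with the lowest mean squared error.
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
knn_mse = {}
for k in range(1, 201):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train_s, y_train)
    knn_mse[k] = mean_squared_error(y_test, knn.predict(X_test_s))
best_k = min(knn_mse, key=knn_mse.get)

print(f"linear regression R^2: {linear_r2:.3f}")
print(f"best tree depth: {best_depth}, R^2: {tree_r2[best_depth]:.3f}")
print(f"best k: {best_k}, MSE: {knn_mse[best_k]:.3f}")
```

Selecting the depth and k on the same data used to report the final score tends to make that score optimistic. One option is to choose the parameters with cross-validation on the training data and keep the test set for a single final evaluation.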




**Defending the Project**


**Additional Work**

**Appendix**