# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names


- Bohan Lei
- Zheran Li
- Ryan Choi
- Yiyao Liu
- Duye Liu


# Abstract 
Our goal in this project is to build models that predict whether a person has heart disease from their health status and demographic data. We will clean and encode the 2020 CDC survey data of about 400k adults, treating HeartDisease as the target and choosing features from BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, and SkinCancer. We will train several models, including KNN, decision trees, and decision boundaries, and compare their performance, measured primarily by classification accuracy. 

# Background

One study similar to our project objective used a machine learning model to predict out-of-hospital cardiac arrests (OHCA) from meteorological and chronological data. In this study, the researchers used the eXtreme Gradient Boosting algorithm to build a model that predicts daily OHCA based on a nationwide OHCA registry and high-resolution meteorological and chronological datasets from Japan<a name="ha1"></a>[<sup>[1]</sup>](#ha1note). Their results showed that the model combining meteorological and chronological variables had the best predictive accuracy in both the training and testing datasets[<sup>[1]</sup>](#ha1note). Their results also indicated that factors such as holidays, weekends, low ambient temperature, and large interday or intraday temperature differences are strongly associated with out-of-hospital cardiac arrest[<sup>[1]</sup>](#ha1note).  

Similarly, a recent study by the Cedars-Sinai Artificial Intelligence in Medicine division<a name="sota"></a>[<sup>[2]</sup>](#sotanote) found that combining two advanced imaging techniques, CTA and 18F-NaF PET, allowed the team to develop machine learning models that improve the prediction of heart attacks. The team used data from 293 patients aged 56 to 74 years over a span of 53 months and created three models to predict the likelihood of future heart attacks[<sup>[2]</sup>](#sotanote). Compared with the other two models, which were based on baseline characteristics and on the quantitative plaque analysis variables from CTA, the model that combines CTA and 18F-NaF PET produced the most accurate results[<sup>[2]</sup>](#sotanote).


# Problem Statement

According to the CDC and WHO, heart disease is the most common cause of death in the world, responsible for 16% of total deaths. Common key risk factors that increase one's chance of developing heart disease include, but are not limited to, high blood pressure, high cholesterol, physical inactivity, and smoking. Understanding which risk factors lead to heart disease, and avoiding them, makes heart disease much easier to prevent. The original data provided on Kaggle contains 401,958 rows and 279 columns, where each column corresponds to a survey question about a factor that could contribute to heart disease and each row holds one respondent's answers to those questions. Using this dataset, one can apply machine learning algorithms such as Decision Trees and KNN to detect patterns within the data, which will help us understand which factors have the least and greatest impact on heart disease. Doctors could then advise people who have a high chance of heart disease (e.g. because of family history) on what actions to take to lower that chance.  

# Data

In this project, we use the Personal Key Indicators of Heart Disease Dataset. The link to the dataset is below: 
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease <br>
This dataset is based on the 2020 annual Centers for Disease Control and Prevention (CDC) survey of around 400k adults about their health status. The dataset has 18 variables and about 320,000 observations in total. Each observation consists of one respondent's values for the 18 variables. The variables are: HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, and SkinCancer. 
Since this dataset is already clean, with no missing data, we do not need any further data cleaning. The most important variable is HeartDisease, as that is the variable our model will predict. We will use 1 to represent "yes" and 0 to represent "no". This transformation will also be applied to the other variables that contain only "yes" and "no": Smoking, AlcoholDrinking, Stroke, DiffWalking, Diabetic, PhysicalActivity, Asthma, KidneyDisease, and SkinCancer. For Sex, we denote "male" as 1 and "female" as 0. For AgeCategory, since a model cannot easily use an age range stored as text, we will use the starting age of each age group. For example, the category "75-79" will be replaced by 75. For Race and GenHealth, which have several categories each, we will either assign a small integer to each category or use one-hot encoding to convert them into columns of 1s and 0s. 
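
The encoding steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper name `encode_row`, the example answers, and the category levels passed in are hypothetical, and the age handling assumes every label starts with the group's lowest age. In the actual notebook a dataframe library would likely do the same mapping column by column.

```python
# Columns assumed to hold only "Yes"/"No" answers, as listed in the text above.
YES_NO_COLUMNS = [
    "HeartDisease", "Smoking", "AlcoholDrinking", "Stroke", "DiffWalking",
    "Diabetic", "PhysicalActivity", "Asthma", "KidneyDisease", "SkinCancer",
]


def encode_row(row, race_levels, health_levels):
    """Return a dict of numeric features for one survey response (hypothetical helper)."""
    encoded = {}
    for col in YES_NO_COLUMNS:
        # "Yes" -> 1, "No" -> 0; any other answer raises KeyError instead of
        # being silently miscoded.
        encoded[col] = {"yes": 1, "no": 0}[row[col].strip().lower()]
    # "Male" -> 1, anything else -> 0.
    encoded["Sex"] = 1 if row["Sex"].strip().lower() == "male" else 0
    # "75-79" -> 75: keep the starting age of the group (assumed label format).
    encoded["AgeCategory"] = int(row["AgeCategory"].replace("-", " ").split()[0])
    # One-hot encode the multi-level categorical columns.
    for level in race_levels:
        encoded[f"Race_{level}"] = int(row["Race"] == level)
    for level in health_levels:
        encoded[f"GenHealth_{level}"] = int(row["GenHealth"] == level)
    # Numeric columns pass through as floats.
    for col in ("BMI", "PhysicalHealth", "SleepTime"):
        encoded[col] = float(row[col])
    return encoded


# A made-up response used only to exercise the function.
example = {
    "HeartDisease": "No", "BMI": "16.6", "Smoking": "Yes", "AlcoholDrinking": "No",
    "Stroke": "No", "PhysicalHealth": "3", "DiffWalking": "No", "Sex": "Female",
    "AgeCategory": "75-79", "Race": "White", "Diabetic": "Yes",
    "PhysicalActivity": "Yes", "GenHealth": "Very good", "SleepTime": "5",
    "Asthma": "Yes", "KidneyDisease": "No", "SkinCancer": "Yes",
}
encoded = encode_row(
    example,
    race_levels=["White", "Black"],          # hypothetical subset of levels
    health_levels=["Good", "Very good"],     # hypothetical subset of levels
)
```

The strict yes/no lookup is a deliberate choice: if a column turns out to contain a third answer, the sketch fails loudly rather than mapping it to 0.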


# Proposed Solution

We plan to implement several classification algorithms. We will apply decision tree algorithms (including random forests), decision boundaries, and K-Nearest Neighbors (KNN) to our cleaned data. Our model should make good predictions about how likely an individual is to have heart disease given their medical data and records. Since we have more than ten columns to consider, we might need to reduce the size of the data. We will first check the relationships between the variables and try to remove those least related to the target. If that does not remove enough columns, we will use Principal Component Analysis (PCA) to reduce the dimensionality. Because we have a very large dataset, our solution can be tested either by cross-validation or by applying it to a held-out test set. Finally, we will compare our results with existing methods in papers or with other individuals' posted solutions on Kaggle to further validate the results.
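
The cross-validation idea mentioned above can be illustrated with a minimal, standard-library-only sketch that evaluates a small KNN classifier with k-fold splits. The function names, the toy data, and the choices of 5 folds and k = 3 are assumptions made for illustration; in the project itself a machine learning library would normally replace these hand-written pieces.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], x),  # Euclidean distance
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(X, y, n_folds=5, k=3, seed=0):
    """Return the mean held-out accuracy over n_folds folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Toy data: two well-separated clusters, so every fold is classified correctly.
X = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 5.0) for i in range(10)]
y = [0] * 10 + [1] * 10
mean_acc = cross_val_accuracy(X, y)
```

Each observation is used for testing exactly once, which is what distinguishes this from a single train/test split.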


# Evaluation Metrics

We consider classification accuracy to be the most important performance measure for our models, so we will use it as our evaluation metric. Comparing the accuracy of all the models gives a straightforward indication of which model is better. Classification accuracy is computed by dividing the number of correct predictions by the total number of predictions made, and it is one of the most common evaluation measures for machine learning models.
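
The accuracy definition above can be written out as a short sketch. The function name `binary_metrics` and the labels are made up for illustration. Precision and recall are not part of the metric stated in this section; they are included here only as an assumption about how the project description's request for more than one metric might be met.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        # Correct predictions divided by all predictions.
        "accuracy": (tp + tn) / len(pairs),
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of true positives that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 2 positives out of 10, one of which the model misses.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
report = binary_metrics(y_true, y_pred)
```

In this toy case 9 of 10 predictions are correct while only half of the positive cases are found, which shows how accuracy and recall can differ when one class is rare.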

# Ethics & Privacy

We consider ethical issues across five phases: data collection, data storage, data analysis, data modeling, and deployment. <br>
For the data collection phase, since this dataset was collected by the CDC and published for public use, we assume the collecting agency was held to official standards. We also assume that the data represents the US population reasonably well and does not have a strong collection bias. <br>
For the data storage phase, since this is a public dataset, we do not need to address data privacy ourselves. Also, since the data was collected in 2020 and is no longer being updated, we do not need to handle changes to the data. <br>
For the data analysis phase, the data itself could be biased against specific groups of people. We will try our best to detect and reduce such biases by using visualizations and inspecting the data throughout the learning process. <br>
For the data modeling phase, the models we build might also produce predictions that are biased against specific groups. To limit this, we will choose the best-performing model that shows the least bias. <br>
For the deployment phase, our models might be used for future predictions, which could carry prediction bias forward. We will address this problem once model building is complete. <br>


# Team Expectations 

- Be on time for meetings
- Respond to messages on Discord
- Finish assigned tasks on time, seek help if needed
- Divide tasks evenly
- Ask on the group chat before any changes are made


# Project Timeline Proposal

We will meet every week on Thursday night, but the meeting time is flexible if anyone has a scheduling conflict. We will first finish the project proposal and then start to wrangle and explore the dataset in our notebook. After getting feedback on the proposal, we will apply the algorithms and models to the dataset and compare their results. After that, we will start to write the final report. The specific timeline is shown below. 

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/18  |  8 PM |  Brainstorm topics/questions   | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 4/23  |  5 PM | Finish assigned parts of the proposal  | Final check of the project proposal | 
| 4/28  | 8 PM  | Explore the dataset and identify how to wrangle the data | Start the coding part of the project; finish data cleaning, data wrangling, and EDA |
| 5/5  | 8 PM  | Research different ways to implement the algorithms | Each member picks a part of applying the algorithms to the dataset |
| 5/12  | 8 PM  | Finish the algorithm code | Compare the results of the algorithms, discuss problems we encountered, and start to write the final project report (if applicable) |
| 5/19  | 8 PM  | Start to write the final report and organize the results | Discuss/edit full project |
| 5/26  | 8 PM  | Finish assigned parts of the final report | Discuss problems we have with either the code or the report |
| 6/2   | 8 PM  | Finish everything and prepare for submission | Final check of everything |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="ha1note"></a>1.[^](#ha1): https://heart.bmj.com/content/107/13/1084<br> 

<a name="sotanote"></a>2.[^](#sota): https://physicsworld.com/a/machine-learning-and-advanced-imaging-improve-prediction-of-heart-attacks/
