
# Visualizing the Impact of Lifestyle on Stroke Risk

## 1. Introduction 

### 1.1 Brief Context of Stroke as a Global Health Issue

Stroke remains a leading cause of death worldwide, with an increasing burden in lower-income countries. The [World Stroke Organization](https://www.world-stroke.org/news-and-blog/news/wso-global-stroke-fact-sheet-2022) highlights a 70.0% rise in incident strokes and a 43.0% increase in stroke-related deaths from 1990 to 2019, emphasizing the urgent need for effective stroke prevention and management strategies.

### 1.2 Problem Statement: Analyzing Lifestyle and Stroke Risk

**Problem Statement:** In this project, we aim to explore the relationship between lifestyle choices and the probability of stroke. Our goal is to visualize how healthy and unhealthy habits, alongside other factors, contribute to an individual's risk of experiencing a stroke. This involves analyzing data on various health indicators, such as age, hypertension, heart disease, glucose levels, BMI, and smoking status, to discern patterns that may predict stroke likelihood.

**Objective:** By creating a comprehensive visualization of these relationships, we seek to enhance understanding of stroke risk factors and support the development of targeted prevention strategies.

### 1.3 Questions and Assumptions

- **Questions:**
  1. How do different lifestyle choices impact the risk of stroke?
  2. Which factors are the most significant predictors of stroke?
  3. Can we identify specific patterns or trends among high-risk individuals?
  4. Does age have an impact on stroke risk?
  5. Do body mass index and glucose level have an impact on stroke risk?

- **Assumptions:**
  1. Lifestyle factors such as diet, exercise, and smoking have a significant impact on stroke risk.
  2. Demographic factors like age and gender also play a crucial role in determining stroke likelihood.
  3. Data visualization can effectively communicate complex relationships between multiple risk factors and stroke probability.

Our approach integrates analysis and visualization techniques to provide insights into how lifestyle choices affect stroke risk. We will use the Stroke Prediction Dataset, ensuring strict adherence to data privacy standards. This project not only aligns with our academic focus in Santé Biotech and Cybersecurity but also addresses a critical healthcare challenge. Through this endeavor, we aim to contribute to the understanding of stroke prevention and the broader discussion on healthcare data analytics and privacy.

### 1.4 Formalization of the Problem

**Problem Type:** Binary classification, i.e. given the input variables, can we predict whether a patient has had a stroke (Yes/No)?

In the current dataset, there are 11 features and one binary target variable (stroke). The features are as follows:

| Field             | Description                                                                 |
|-------------------|-----------------------------------------------------------------------------|
| id                | Unique identifier                                                           |
| gender            | "Male", "Female" or "Other"                                                 |
| age               | Age of the patient                                                          |
| hypertension      | 0 if the patient doesn't have hypertension, 1 if the patient has hypertension |
| heart_disease     | 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease |
| ever_married      | "No" or "Yes"                                                               |
| work_type         | "children", "Govt_job", "Never_worked", "Private" or "Self-employed"        |
| Residence_type    | "Rural" or "Urban"                                                          |
| avg_glucose_level | Average glucose level in blood                                              |
| bmi               | Body mass index                                                             |
| smoking_status    | "formerly smoked", "never smoked", "smokes" or "Unknown"*                   |
| stroke            | 1 if the patient had a stroke or 0 if not                                   |

\*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

### 1.5 Libraries and Tools

In [1]:
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install scikit-learn
%pip install xgboost


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Machine Learning Helpers
from sklearn.preprocessing import (StandardScaler, LabelEncoder, OneHotEncoder)
from sklearn.metrics import (confusion_matrix, accuracy_score, roc_auc_score, roc_curve, auc, precision_score, recall_score, f1_score, precision_recall_curve)
from sklearn.model_selection import (train_test_split, GridSearchCV, StratifiedKFold, cross_val_score)

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier 
from xgboost import XGBClassifier


## 2. Data exploration

### 2.1 Loading the data

We start by loading the data into a pandas dataframe.

In [3]:
df = pd.read_csv('./_dataset/healthcare-dataset-stroke-data.csv', sep=',', encoding='utf-8')

In [4]:
df.head()

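|   | id    | gender | age  | hypertension | heart_disease | ever_married | work_type     | Residence_type | avg_glucose_level | bmi  | smoking_status  | stroke |
|---|-------|--------|------|--------------|---------------|--------------|---------------|----------------|-------------------|------|-----------------|--------|
| 0 | 9046  | Male   | 67.0 | 0            | 1             | Yes          | Private       | Urban          | 228.69            | 36.6 | formerly smoked | 1      |
| 1 | 51676 | Female | 61.0 | 0            | 0             | Yes          | Self-employed | Rural          | 202.21            | NaN  | never smoked    | 1      |
| 2 | 31112 | Male   | 80.0 | 0            | 1             | Yes          | Private       | Rural          | 105.92            | 32.5 | never smoked    | 1      |
| 3 | 60182 | Female | 49.0 | 0            | 0             | Yes          | Private       | Urban          | 171.23            | 34.4 | smokes          | 1      |
| 4 | 1665  | Female | 79.0 | 1            | 0             | Yes          | Self-employed | Rural          | 174.12            | 24.0 | never smoked    | 1      |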


## 3. Data analysis and visualization: solutions and arguments for the choice of the two algorithms

## 4. Cybersecurity: Data Privacy and Ethics, implementing anonymization techniques

### 4.1 Data Privacy

In today's digital world, data privacy is a critical issue. As we increasingly
rely on technology to manage our lives, we must ensure that our data is
protected from misuse. This is especially important in the healthcare sector,
where sensitive information is at risk of being compromised. Both in the US and
the EU, there are strict regulations in place to protect patient data. 

In the US, the [Health Insurance Portability and Accountability Act (HIPAA)](https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html) establishes national standards for the protection of health information. In the EU, the [General Data Protection Regulation (GDPR)](https://commission.europa.eu/law/law-topic/data-protection/data-protection-eu_en) sets guidelines for the collection and processing of personal data. These regulations are designed to safeguard patient privacy and prevent unauthorized access to sensitive information. 

### 4.2 Anonymization Techniques

In this project, we will use the Stroke Prediction Dataset, which contains personal data that can be viewed as sensitive. The study by [Sweeney, L. (2000)](https://privacytools.seas.harvard.edu/sites/projects.iq.harvard.edu/files/privacytools/files/paper1.pdf) on the re-identification of individuals in the U.S. population showed that 87.1\% of the population could be uniquely identified from {5-digit ZIP, gender, date of birth} alone. Our dataset does not contain these exact fields, but combinations of its features (quasi-identifiers such as age, gender, work_type and Residence_type) could be used to single out individuals in a similar way. Therefore, we must take steps to protect the privacy of the individuals in our dataset.

To do so, we will use the following anonymization techniques:
- **Generalization:** We will replace the age and glucose level values with age ranges and glucose level ranges, respectively. This will reduce the granularity of the data, making it more difficult to identify individuals.
- **Hashing:** We will hash the value of the ID field, which will prevent the identification of individuals based on this feature.
- **Differential Privacy:** see below...
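
The first two techniques can be sketched in a few lines of pandas. This is a minimal illustration, not our final pipeline: the toy rows, the bin edges and labels, the `SALT` value and the helper names `hash_id` and `anonymize` are all assumptions made for the example.

```python
import hashlib

import pandas as pd

# Toy rows using the dataset's column names (values are for illustration only).
toy = pd.DataFrame({
    "id": [9046, 51676, 31112],
    "age": [67.0, 61.0, 80.0],
    "avg_glucose_level": [228.69, 202.21, 105.92],
})

# Hypothetical bin edges and labels; real ones should follow the data distribution.
AGE_BINS = [0, 18, 40, 60, 80, float("inf")]
AGE_LABELS = ["0-17", "18-39", "40-59", "60-79", "80+"]
GLUCOSE_BINS = [0, 100, 140, 200, float("inf")]
GLUCOSE_LABELS = ["<100", "100-139", "140-199", "200+"]

# Hypothetical secret salt: an unsalted hash of a small integer id
# could be reversed by simply hashing every possible id.
SALT = "replace-with-a-secret-value"


def hash_id(raw_id: int, salt: str = SALT) -> str:
    """Return a salted SHA-256 digest of the identifier."""
    return hashlib.sha256(f"{salt}:{raw_id}".encode("utf-8")).hexdigest()


def anonymize(frame: pd.DataFrame) -> pd.DataFrame:
    """Generalize age and glucose into ranges and hash the id column."""
    out = frame.copy()
    # right=False makes each bin closed on the left: [60, 80) -> "60-79".
    out["age"] = pd.cut(
        out["age"], bins=AGE_BINS, labels=AGE_LABELS, right=False
    ).astype(str)
    out["avg_glucose_level"] = pd.cut(
        out["avg_glucose_level"], bins=GLUCOSE_BINS, labels=GLUCOSE_LABELS, right=False
    ).astype(str)
    out["id"] = out["id"].map(hash_id)
    return out


anonymized = anonymize(toy)
print(anonymized)
```

Wider bins give more privacy but coarser features for the models, so the bin edges are themselves a privacy/utility parameter worth experimenting with.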

One of the biggest drawbacks of privacy protection is that it can limit the usefulness of the data: there is always a trade-off between privacy and accuracy. In this project, we will use the [Differential Privacy](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) technique to protect the privacy of the individuals in our dataset while still allowing for meaningful analysis, by adjusting the privacy budget $\epsilon$, which controls the amount of noise added to the data.

We'll compare the accuracy of our results **with and without differential privacy to determine the optimal value of $\epsilon$.**
 
##### What we will do (temporary section to work with)
Implementing differential privacy in the Stroke Prediction Dataset involves adding controlled noise to the data to protect individual privacy while still allowing for meaningful analysis. Differential privacy is a property of a randomized algorithm: it guarantees that the output of a query on a dataset changes very little whether or not any single individual's data is included in the input, so the output reveals almost nothing about that individual.
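
For reference, this informal statement corresponds to the standard definition of $\epsilon$-differential privacy, as given in the book linked above. The symbols $M$, $D$, $D'$, $S$, $f$ and $\Delta f$ are notation introduced here and are not used elsewhere in this report. A randomized mechanism $M$ is $\epsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single record and every set of possible outputs $S$:

$$
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S]
$$

For a numerical query $f$, the Laplace mechanism satisfies this definition by adding noise whose scale grows with the sensitivity of the query and shrinks with $\epsilon$:

$$
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\epsilon}\right), \qquad \Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert
$$

A smaller $\epsilon$ therefore means a larger noise scale and stronger privacy.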

Here's a step-by-step approach to implement differential privacy:

1. **Understanding the Dataset**: First, get a thorough understanding of the dataset. Identify the types of data (categorical, numerical, etc.) and understand their distributions and importance in stroke prediction.

2. **Choosing a Differential Privacy Mechanism**: Select an appropriate differential privacy mechanism. The most common mechanisms are the Laplace mechanism (for numerical data) and the Exponential mechanism (for categorical data). The choice depends on the data type and the queries you intend to perform.

3. **Setting the Privacy Budget (ε)**: The privacy budget, often denoted as ε (epsilon), is a parameter that determines the level of privacy. A lower ε means higher privacy (and more noise), but it can reduce the utility of the data. Choose an ε that balances privacy with the need for accurate analysis.

4. **Applying Noise to the Data**:
   - For numerical data (like age, avg_glucose_level, and bmi), you can add noise drawn from a Laplace distribution. The amount of noise depends on the sensitivity of the query and your chosen ε.
   - For categorical data (like gender, work_type, and smoking_status), use the Exponential mechanism to randomly choose an output based on the probability distribution that depends on the privacy parameter and the utility of each outcome.

5. **Testing and Validation**: After applying differential privacy, it's essential to test the dataset. Check if the privacy guarantees hold and if the dataset still provides meaningful insights for stroke prediction. This might involve comparing the results of analyses on the original and the differentially private dataset.

6. **Implementing in Machine Learning Models**: When using the differentially private dataset for machine learning, be aware that the added noise might impact the model's accuracy. It may require tuning the models differently than you would with the original data.

7. **Ongoing Monitoring and Adjustment**: Differential privacy implementation isn't a one-time process. It requires continuous monitoring and adjustments based on the outcomes of your analyses and the evolving requirements of your project.
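
Steps 3 and 4 can be illustrated with a short NumPy sketch. It applies both mechanisms to aggregate queries (a mean and a most-frequent category), which is the setting in which their guarantees are usually stated; perturbing every record individually is a different design that this sketch does not cover. The function names `dp_mean` and `dp_mode`, the clipping bounds, the toy values and the assumption that the dataset size is public are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed only to make the sketch repeatable


def dp_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with the Laplace mechanism.

    Values are clipped to [lower, upper], so replacing one record moves the
    mean by at most (upper - lower) / n. That bound is the sensitivity.
    """
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise


def dp_mode(categories, candidates, epsilon, rng):
    """Pick a frequent category with the exponential mechanism.

    The utility of a candidate is its count. Replacing one record changes
    any count by at most 1, so the sensitivity of the utility is 1.
    """
    counts = np.array(
        [sum(c == cand for c in categories) for cand in candidates], dtype=float
    )
    scores = epsilon * counts / 2.0  # exponent: epsilon * utility / (2 * sensitivity)
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    probabilities = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probabilities)]


# Toy values for illustration only.
ages = [67.0, 61.0, 80.0, 49.0, 79.0]
smoking = ["formerly smoked", "never smoked", "never smoked", "smokes", "never smoked"]
candidates = ["formerly smoked", "never smoked", "smokes", "Unknown"]

noisy_mean_age = dp_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng)
noisy_mode = dp_mode(smoking, candidates, epsilon=1.0, rng=rng)
print(noisy_mean_age, noisy_mode)
```

With only five records the noise dominates the result; on the full dataset the sensitivity of the mean is much smaller, so the same $\epsilon$ costs far less accuracy.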

Remember, the key challenge in implementing differential privacy is balancing privacy protection with the usefulness of the data. This often requires experimentation and fine-tuning to get right.
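
One way to start this experimentation is to measure how the error of a single noisy query changes with $\epsilon$. The sketch below is an assumption-laden illustration rather than a result: `true_count` is a made-up value, the list of $\epsilon$ values is arbitrary, and the query is a simple count with sensitivity 1.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed only to make the sketch repeatable

true_count = 200    # hypothetical count, not taken from the dataset
sensitivity = 1.0   # one record changes a count by at most 1
epsilons = [0.01, 0.1, 1.0, 10.0]
n_trials = 2000

mean_abs_error = {}
for eps in epsilons:
    # Laplace mechanism: the noise scale is sensitivity / epsilon.
    noisy_counts = true_count + rng.laplace(0.0, sensitivity / eps, size=n_trials)
    mean_abs_error[eps] = float(np.mean(np.abs(noisy_counts - true_count)))

for eps, err in mean_abs_error.items():
    print(f"epsilon={eps:<5} mean absolute error={err:.2f}")
```

The expected absolute error of Laplace noise equals its scale, so the error should fall roughly in proportion to $1/\epsilon$. The same loop can wrap a model's accuracy instead of a count when comparing results with and without differential privacy.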


## 5. Experimentation and discussion of the results

## 6. Group assessment (what you have learned, points for improvement, etc.)