# Project Statement

- Create a Jupyter notebook
- Create a data set by simulating a real-world phenomenon of my choosing 
- Model and synthesise such data using Python (the numpy.random package is suggested for this purpose)
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

# Introduction

A computer simulation is an application designed to imitate a real-life situation. Its advantages include:

- It can avoid danger and loss of life.
- Conditions can be varied and outcomes investigated.
- Critical situations can be investigated without risk.
- It is cost effective.
- Simulations can be sped up so behaviour can be studied easily over a long period of time.
- Simulations can be slowed down to study behaviour more closely. [1]

The project is to simulate a real-world phenomenon.

# The Framingham Heart Study 

The Framingham Heart Study is now considered one of the longest, most important epidemiological studies in medical history. In the 1960s, the study demonstrated the role cigarette smoking plays in the development of heart disease. Those findings helped to fuel the first anti-smoking campaigns of that era. The study provided researchers with knowledge of how dietary fat can increase the risk of heart disease. It showed a link between cholesterol levels in the blood and an individual's risk for developing heart disease. Later, Framingham data also demonstrated the beneficial role of high-density lipoprotein (HDL) cholesterol and the negative consequences of low-density lipoprotein (LDL) cholesterol. This program has helped to educate physicians, patients, and the public about the dangers of high blood cholesterol and to bring about reductions in Americans' blood cholesterol levels. [2]

## Set up

Import the modules required for the project.

In [None]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings


warnings.filterwarnings('ignore')
%matplotlib inline

# Investigate the Framingham Heart Data to reproduce a real-world dataset
Import the Framingham Heart Data and print out the first 10 rows. [3]

In [None]:
# import data from data folder
# https://github.com/TarekDib03/Analytics/blob/master/Week3%20-%20Logistic%20Regression/Data/framingham.csv
df = pd.read_csv("data/framingham.csv")

# Print first 10 entries
df.head(10)

## Cleaning up data

To begin cleaning the data, df.describe() outputs the count, mean, std, min and max of each column, as well as the lower (25th), 50th and upper (75th) percentiles. The 50th percentile is the same as the median. [4]

In [None]:
# Print the count, mean, std, min and max, plus the 25th, 50th (median) and 75th percentiles.
# This gives an overview of the dataset; a count lower than the number of rows indicates missing values.
df.describe()

The dataframe has 16 columns, and the project requires a minimum of four different variables. To select them, a correlation matrix is used to investigate the dependence between multiple variables at the same time. It is a symmetric table in which each row and column represents a variable, and each value is the correlation coefficient denoting the strength of the relationship between the two variables. [5]
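
As a minimal sketch of what a correlation matrix looks like, the cell below builds one from a small made-up dataframe. The column names and values are purely illustrative assumptions and are not taken from the Framingham data.

```python
import numpy as np
import pandas as pd

# Made-up toy values for illustration only (not Framingham data)
toy = pd.DataFrame({
    "age": [35, 42, 50, 58, 63, 47],
    "sysBP": [110, 118, 135, 142, 160, 125],
    "glucose": [80, 75, 90, 88, 110, 82],
})

corr = toy.corr()  # Pearson correlation by default

# The matrix is symmetric and every variable correlates perfectly with itself
is_symmetric = np.allclose(corr.values, corr.values.T)
diagonal_is_one = np.allclose(np.diag(corr.values), 1.0)

print(corr.round(2))
print("Symmetric:", is_symmetric, "| Diagonal all ones:", diagonal_is_one)
```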

Using TenYearCHD (the 10-year risk of coronary heart disease (CHD)) as the target, we identify how strongly each of the other variables is related to it.

In [None]:
# Rank the variables by the absolute value of their correlation with TenYearCHD.
# https://www.geeksforgeeks.org/sort-correlation-matrix-in-python/
# [9]

print('Get correlation of variables with TenYearCHD')
FHS_correlation = df.corr()['TenYearCHD']
# Sort by absolute correlation; [1:] drops TenYearCHD itself (correlation of 1 with itself)
corr_FHS = FHS_correlation.abs().sort_values(ascending=False)[1:]
round(corr_FHS,2)

A new dataframe is created using the variables with an absolute correlation of 0.10 (10%) and above. This ensures that more than the required four variables are selected:

'TenYearCHD','age','sysBP','prevalentHyp','diaBP','glucose', 'diabetes'
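
The same selection could be made programmatically instead of typing the column names by hand. The sketch below is an assumption about how this might be done, not the approach used in this notebook: `select_by_correlation` is a hypothetical helper, and the demo dataframe is randomly generated rather than loaded from `framingham.csv`.

```python
import numpy as np
import pandas as pd

def select_by_correlation(data, target, threshold=0.10):
    """Return columns whose absolute correlation with `target` is at least `threshold`."""
    corr = data.corr()[target].drop(target).abs()
    return corr[corr >= threshold].sort_values(ascending=False).index.tolist()

# Illustrative random data: 'signal' is built to track the target,
# 'noise' is generated independently, so its correlation should be near zero
rng = np.random.default_rng(seed=0)
target = rng.integers(0, 2, size=500)
demo = pd.DataFrame({
    "target": target,
    "signal": target * 2.0 + rng.normal(0, 1, size=500),
    "noise": rng.normal(0, 1, size=500),
})

selected = select_by_correlation(demo, "target")
print(selected)
```

Applied to this notebook, the call would look like `select_by_correlation(df, 'TenYearCHD')`, with the target column added back to the resulting list before subsetting.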

In [None]:
# Create a new dataframe from the variables with the highest correlation
# Cut-off is an absolute correlation of 10% and above

print('New dataframe with values with highest correlation')
# .copy() makes High_corr independent of df, so later changes do not touch the original
High_corr = df[['TenYearCHD','age','sysBP','prevalentHyp','diaBP','glucose', 'diabetes']].copy()
High_corr.describe()

In [None]:
Group1 = High_corr.loc[((High_corr['age'] > 40) & (High_corr['age'] <= 50))]

Count the missing values in each column of the dataframe by applying the pandas isna() function followed by sum(). [6]

In [None]:
# Count missing values per column
# https://cmdlinetips.com/2020/11/how-to-get-number-of-missing-values-in-each-column-in-pandas/
print('Count Missing values in dataframe')
High_corr.isna().sum()

As the project requires only one hundred or more data points across at least four variables, the decision was made to drop all rows containing missing values. [7]

In [None]:
# Drop rows with blank values
# https://statisticsglobe.com/drop-rows-blank-values-from-pandas-dataframe-python
print('Check cleaned data')
# Reassign instead of using inplace=True so a column subset of df is never modified in place
High_corr = High_corr.dropna()
High_corr.describe()

Verify that the rows with missing values have been removed.

In [None]:
print('Verify missing values are removed from dataframe')
High_corr.isna().sum()

In [None]:
# Rename the columns [9]
High_corr = High_corr.rename(columns = {'TenYearCHD' : 'At Risk',
                            'age' : 'Age',
                            'sysBP' : 'Systolic Blood Pressure',
                            'prevalentHyp' : 'Hypertensive',
                            'diaBP' : 'Diastolic Blood Pressure',
                            'glucose' : 'Glucose Levels',
                            'diabetes' : 'Diabetes'})

# Output Histograms using numpy.random.normal

https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
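
A minimal sketch of how a normal distribution could be used to synthesise one continuous variable and bin it for a histogram. The mean and standard deviation below are placeholder assumptions, not values measured from the Framingham data; in practice they would come from `describe()` on the cleaned dataframe. The sketch uses the `default_rng` generator interface, which the NumPy documentation recommends for new code in place of calling `numpy.random.normal` directly.

```python
import numpy as np

# Placeholder parameters (assumptions, not measured from the data)
assumed_mean = 130.0   # mmHg
assumed_std = 20.0     # mmHg
n_samples = 1000

rng = np.random.default_rng(seed=42)
synthetic_sbp = rng.normal(loc=assumed_mean, scale=assumed_std, size=n_samples)

# Bin the samples the way a histogram plot would
counts, bin_edges = np.histogram(synthetic_sbp, bins=10)

print("Sample mean:", round(synthetic_sbp.mean(), 1))
print("Sample std: ", round(synthetic_sbp.std(), 1))
print("Counts per bin:", counts)
```

Passing `synthetic_sbp` to `plt.hist` with the same `bins` value would draw these counts as a chart.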

In [None]:
# Get age groups 
# https://stackoverflow.com/questions/64900985/pandas-count-number-of-occurrences-of-each-value-between-ranges
High_corr['Age'].value_counts(bins=[0,40,50,60,70], normalize=True).sort_index()


In [None]:
Group0 = High_corr.loc[(High_corr['Age'] <= 40)]
Group1 = High_corr.loc[((High_corr['Age'] > 40) & (High_corr['Age'] <= 50))]
Group2 = High_corr.loc[((High_corr['Age'] > 50) & (High_corr['Age'] <= 60))]
Group3 = High_corr.loc[((High_corr['Age'] > 60) & (High_corr['Age'] <= 70))]

https://www.webmd.com/hypertension-high-blood-pressure/guide/diastolic-and-systolic-blood-pressure-know-your-numbers

Here’s how to understand your systolic blood pressure number:

    Normal: Below 120
    Elevated: 120-129
    Stage 1 high blood pressure (also called hypertension): 130-139
    Stage 2 hypertension: 140 or more
    Hypertensive crisis: 180 or more. Call 911.

This is what your diastolic blood pressure number means:

    Normal: Lower than 80
    Stage 1 hypertension: 80-89
    Stage 2 hypertension: 90 or more
    Hypertensive crisis: 120 or more.

https://academic.oup.com/aje/article/163/4/342/103626

glucose categories: normal (≤5.55 mmol/liter (100 mg/dl)), 
impaired (5.56–6.99 mmol/liter (101–126 mg/dl)), and diabetic 
(>6.99 mmol/liter (>126 mg/dl))

Systolic blood pressure:

- Normal: <= 120
- Elevated: 120 - 180
- Hypertensive: >= 180

Diastolic blood pressure:

- Normal: <= 80
- Elevated: 80 - 120
- Hypertensive: >= 120

Glucose levels:

- Normal: <= 100
- Impaired: 101 - 126
- Diabetic: > 126
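
The glucose thresholds above can be turned into labelled categories with `pandas.cut`. This is a sketch under assumptions: the readings are made up, and the upper bin is left open-ended with `np.inf` rather than capped at the dataset maximum.

```python
import numpy as np
import pandas as pd

# Made-up glucose readings in mg/dl, for illustration only
glucose = pd.Series([72, 100, 101, 126, 127, 250])

# Intervals are right-inclusive by default: (0, 100], (100, 126], (126, inf]
categories = pd.cut(
    glucose,
    bins=[0, 100, 126, np.inf],
    labels=["normal", "impaired", "diabetic"],
)

print(categories.tolist())
# ['normal', 'normal', 'impaired', 'impaired', 'diabetic', 'diabetic']
```

Because the bins are right-inclusive, a reading of exactly 100 is labelled normal and exactly 126 is labelled impaired, which matches the thresholds listed.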

In [None]:
# Age groups 
# https://stackoverflow.com/questions/35523635/extract-values-in-pandas-value-counts
# Group0['Systolic Blood Pressure'].value_counts(bins=[0,120,180,295], normalize=True).sort_index().tolist()
# Group0['Diastolic Blood Pressure'].value_counts(bins=[0,80,120,142], normalize=True).sort_index()
# Group0['Glucose Levels'].value_counts(bins=[0,100,126,394], normalize=True).sort_index()

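# Map each label to its dataframe so the loop works on the data, not on strings
agegroups = {'Group0': Group0, 'Group1': Group1, 'Group2': Group2, 'Group3': Group3}

# Store the proportions per age group instead of overwriting them on each pass
SBP, DBP, GL = {}, {}, {}
for name, ages in agegroups.items():
    SBP[name] = ages['Systolic Blood Pressure'].value_counts(bins=[0,120,180,295], normalize=True).sort_index().tolist()
    DBP[name] = ages['Diastolic Blood Pressure'].value_counts(bins=[0,80,120,142], normalize=True).sort_index().tolist()
    GL[name] = ages['Glucose Levels'].value_counts(bins=[0,100,126,394], normalize=True).sort_index().tolist()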



In [None]:
Group0['Diastolic Blood Pressure'].value_counts(bins=[0,80,120,142], normalize=True).sort_index()

In [None]:
Group0['Glucose Levels'].value_counts(bins=[0,100,126,394], normalize=True).sort_index()
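
One way the normalised counts could feed the simulation is as category probabilities for `Generator.choice`. The proportions below are placeholders chosen for illustration, not the values produced by the cells above, and the sample size of 200 is likewise an assumption.

```python
import numpy as np

# Placeholder proportions (assumed for illustration; they must sum to 1)
labels = ["normal", "impaired", "diabetic"]
proportions = [0.85, 0.12, 0.03]

rng = np.random.default_rng(seed=1)
synthetic_glucose_category = rng.choice(labels, size=200, p=proportions)

# Compare the simulated shares with the target proportions
values, counts = np.unique(synthetic_glucose_category, return_counts=True)
simulated_share = dict(zip(values.tolist(), (counts / counts.sum()).round(3).tolist()))
print(simulated_share)
```

With only 200 draws the simulated shares will differ slightly from the target proportions; a larger `size` brings them closer.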

# References 

[1] https://www.bbc.co.uk/bitesize/guides/zvxp34j/revision/3

[2] https://nfb.org//sites/default/files/images/nfb/publications/vodold/vspr9804.htm

[3] https://github.com/TarekDib03/Analytics/blob/master/Week3%20-%20Logistic%20Regression/Data/framingham.csv

[4] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

[5] https://www.geeksforgeeks.org/sort-correlation-matrix-in-python/

[6] https://cmdlinetips.com/2020/11/how-to-get-number-of-missing-values-in-each-column-in-pandas/

[7] https://statisticsglobe.com/drop-rows-blank-values-from-pandas-dataframe-python

[8] https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166

[9] https://www.kaggle.com/micahshull/python-heart-disease-framingham