# Model & Synthesize Data

The purpose of this Jupyter Notebook is to model and synthesize a data set. The task is to choose a real-world phenomenon that can be measured and for which at least one hundred data points could be collected across at least four different variables.

My focus for this project is the Vermont Oxford Network. Vermont Oxford Network (VON) is a nonprofit voluntary collaboration of health care professionals working together as an interdisciplinary community to change the landscape of neonatal care. Founded in 1988, VON now comprises teams of health professionals representing neonatal intensive care units and level I and II care centers around the world, in support of the mission to improve the quality and safety of medical care for newborn infants and their families through a coordinated program of research, education, and quality improvement projects. (ref: https://public.vtoxford.org/about-us/)

The National Perinatal Epidemiology Centre, Department of Obstetrics and Gynaecology, UCC, published a report in 2018, "Mortality Risk Amongst Very Low Birth Weight Infants Born in the Republic of Ireland" (ref: https://www.ucc.ie/en/media/research/nationalperinatalepidemiologycentre/annualreports/MortalityRiskAmongstVeryLowBirthWeightInfantsintheRepublicofIrelandReport2014-2016.pdf).

In 2014-2016, 19 neonatal units submitted data to the VON’s VLBW database. The inclusion criteria were as follows:
'Any infant who is born alive at your hospital and whose birth weight is between 401 and 1500 grams OR whose gestational age is between 22 weeks 0 days and 29 weeks 6 days (inclusive), regardless of where in your hospital the infant receives care'
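
The inclusion rule quoted above can be expressed as a small check. This is only a sketch for illustration: the function name `meets_von_vlbw_criteria` and its parameters are my own assumptions and do not come from VON or the report.

```python
def meets_von_vlbw_criteria(birth_weight_g: int, gest_weeks: int, gest_days: int = 0) -> bool:
    """Return True if a live-born infant meets the quoted inclusion criteria.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # Birth weight between 401 and 1500 grams (inclusive)
    weight_ok = 401 <= birth_weight_g <= 1500
    # Work in days so that 22 weeks 0 days and 29 weeks 6 days are inclusive bounds
    gest_age_days = gest_weeks * 7 + gest_days
    age_ok = 22 * 7 <= gest_age_days <= 29 * 7 + 6
    # The criteria are joined by OR: either condition is enough
    return weight_ok or age_ok


print(meets_von_vlbw_criteria(1450, 31))  # qualifies on birth weight
print(meets_von_vlbw_criteria(1923, 27))  # qualifies on gestational age
print(meets_von_vlbw_criteria(2635, 32))  # meets neither condition
```

Because the two conditions are joined by OR, an infant heavier than 1500 grams can still be included when born early enough, which is why the reported weights extend well above 1500 grams.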

This notebook aims to synthesise data to reflect the key variables & findings in this report.

# The Summary Approach

In order to complete the task of synthesizing data which assesses the mortality risk amongst very low birth weight babies, I have decided to take the following approach:

 - Explain the Variables
 - Generate a Data Frame Structure
 - Create the DF variables
 - Analyse & plot the dataset and highlight key findings

# The Variables - Explained

## Gestational Weeks
Gestational age is the common term used during pregnancy to describe how far along the pregnancy is. It is measured in weeks, from the first day of the woman's last menstrual cycle to the current date. A normal pregnancy can range from 38 to 42 weeks. Infants born before 37 weeks are considered premature. In this dataset, most values of this variable are expected to fall between 22 and 32 weeks. (ref: https://medlineplus.gov/ency/article/002367.htm)

## Birthweight
Birth weight is the first weight of your baby, taken just after he or she is born. A low birth weight is less than 2,500 grams (about 5.5 pounds). A high birth weight is more than 4,000 grams. Most weights in this dataset are expected to be below 5.5 pounds.

The gestational ages of the VLBW (Very Low Birth Weight) infants ranged from 21 weeks to 33 weeks and their birth weights ranged from 360g to 2,640g
(ref:https://www.ucc.ie/en/media/research/nationalperinatalepidemiologycentre/annualreports/MortalityRiskAmongstVeryLowBirthWeightInfantsintheRepublicofIrelandReport2014-2016.pdf)
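
The section mixes pounds and grams, so a quick conversion helps when comparing the 5.5 pound figure with the reported gram range. This is a minimal sketch; the helper name `pounds_to_grams` is an assumption for illustration.

```python
GRAMS_PER_POUND = 453.59237  # definition of the international avoirdupois pound


def pounds_to_grams(pounds: float) -> float:
    """Convert a weight in pounds to grams (hypothetical helper for this sketch)."""
    return pounds * GRAMS_PER_POUND


low_birth_weight_g = pounds_to_grams(5.5)
print(round(low_birth_weight_g))  # 2495
```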

## Congenital Anomaly 
Congenital anomalies are also known as birth defects, congenital disorders or congenital malformations. Whether the infant was born with a congenital anomaly is recorded as Yes or No.

## Disposition at 1 Year
Whether the infant survived, recorded one year after birth. There are 4 possible outcomes:
- Died: The infant did not survive
- Home: The infant has been discharged from hospital and is at home
- Still Hospitalised: The infant is still under medical care in hospital
- Unknown: There has been no contact with the infant's family, so the status of the infant is unknown
 



# How to create a Dataset 

In [18]:
# create a dataframe structure by starting with the column headings
import pandas as pd #use pandas to create the dataframe
import numpy as np # use numpy to generate the random variables
df = pd.DataFrame(columns= ["Gestational Weeks", "Birthweight", "Sex", "Congenital Anomaly", "Disposition at 1 Year"]) #gives headings to each of the columns

In [19]:
df # display the (still empty) dataframe

Unnamed: 0,Gestational Weeks,Birthweight,Sex,Congenital Anomaly,Disposition at 1 Year


In [22]:
# Testing to populate the data 
data = [['Alex',10, 6, 2, 5],['Bob',12, 6 ,2, 5],['Clarke',13, 6 ,2, 5]]
df = pd.DataFrame(data, columns = ['Gestational Weeks', 'Birthweight', 'Sex', 'Congenital Anomaly', 'Disposition at 1 Year'])
df

Unnamed: 0,Gestational Weeks,Birthweight,Sex,Congenital Anomaly,Disposition at 1 Year
0,Alex,10,6,2,5
1,Bob,12,6,2,5
2,Clarke,13,6,2,5


In [6]:
# The above table is the basic structure which I will use as the foundation for synthesising my dataset.
# The key learning is how to structure the code in order to populate the rows

## Gestational Weeks
In this dataset, gestational age ranges from 21 weeks to 33 weeks, with the majority of babies born around weeks 28, 29 and 30. Based on this, I think it is best to generate an array from a normal distribution.


In [5]:
#Setting Parameters for gestational weeks.  Majority of babies born weeks 28, 29, 30
# mean and standard deviation
gw = np.random.normal(28, 3, 100).round(0) # 28 is the mean, 3 is standard deviation, 100 is the size of the sample
gw

# took research to find how to round the floats to integers
# https://stackoverflow.com/questions/33160367/numpy-how-to-generate-a-normally-distributed-set-of-integers?rq=1

array([31., 28., 27., 29., 27., 26., 30., 25., 25., 30., 27., 28., 32.,
       24., 26., 29., 30., 28., 26., 25., 30., 34., 31., 28., 30., 28.,
       30., 27., 28., 32., 33., 27., 27., 28., 30., 29., 29., 27., 29.,
       27., 27., 25., 29., 26., 33., 26., 25., 24., 27., 25., 29., 23.,
       25., 32., 26., 33., 29., 30., 29., 30., 28., 31., 27., 28., 25.,
       29., 30., 24., 28., 30., 31., 25., 27., 27., 28., 25., 26., 29.,
       35., 31., 31., 27., 26., 31., 30., 34., 30., 31., 38., 30., 34.,
       27., 28., 30., 30., 29., 24., 28., 27., 29.])
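
A normal distribution has no hard limits, so some draws can land outside the 21 to 33 week range described above. One possible way to keep the values in range is to clip them after rounding. The sketch below is a variation rather than the method used in this notebook; the seed and the variable names `gw_raw` and `gw_clipped` are assumptions.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=42)

# Same parameters as above: mean 28, standard deviation 3, 100 samples
gw_raw = rng.normal(loc=28, scale=3, size=100)

# Round to whole weeks, clip to the reported 21-33 week range, store as integers
gw_clipped = np.clip(np.round(gw_raw), 21, 33).astype(int)

print(gw_clipped.min(), gw_clipped.max())
```

Clipping moves any out-of-range draw onto the nearest boundary, so weeks 21 and 33 may be slightly over-represented. Redrawing out-of-range values would avoid that at the cost of a little more code.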

## Birthweight
The birth weights of very low birth weight infants ranged from 360g to 2,640g. I model this variable with a uniform distribution, on the assumption that infants are born at many different birth weights, with no weight in this range more likely than any other.

In [6]:
# Setting parameters for birthweight, low = 360g, high = 2640g

bw = np.random.randint(360, 2641, size=100) # draws random integers from a discrete uniform distribution; the upper bound is exclusive, so 2641 allows 2640g
bw

array([1450,  667, 1923,  734,  983,  985, 1647, 1295, 2635, 2357,  676,
       2015, 1622, 2445, 1797,  467,  871,  900, 1450, 1966, 1146, 2166,
       2394,  391,  625, 1772, 1174,  732,  601, 2144, 1185, 2619,  775,
       2493, 1396, 2518, 2262, 2524, 1877, 1866, 2191, 1482, 1889, 2000,
       1594, 1607,  768,  673, 2625, 2090, 1145, 2306, 1789,  499, 1804,
        389, 1970,  497, 2086, 2514, 1292, 1719,  384, 2215, 2494, 2146,
       2057, 1108, 2538, 2107, 1034, 2086,  698, 1886, 1425, 2015, 1348,
       2405, 1405, 1719, 1167, 2425, 1672,  817, 1801,  696, 1763, 2007,
       2176, 2338, 1715, 1495, 1039, 2038, 2627,  530, 1784, 2125,  537,
       1660])
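
`np.random.randint` treats its upper bound as exclusive, which is easy to overlook when the intended range is 360g to 2,640g inclusive. As a sketch of an alternative, NumPy's newer `Generator` interface can include the upper bound explicitly. The seed and the name `bw_inclusive` are assumptions for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=1)

# endpoint=True makes the upper bound inclusive, so 2640 can be drawn
bw_inclusive = rng.integers(360, 2640, size=100, endpoint=True)

print(bw_inclusive.min(), bw_inclusive.max())
```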

## Sex

Is the infant male or female? For this dataset, I assume a 50/50 chance of either.

In [23]:
# using random.choice, I have ensured that there is a 50/50 chance of the infant being male or female
s = np.random.choice(["Male", "Female"], 100, p=[0.5, 0.5])
s

array(['Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male',
       'Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Male',
       'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male',
       'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male',
       'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female',
       'Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Female', 'Male',
       'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female',
       'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male',
       'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female',
       'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male',
       'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Male',
       'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female',
       'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female',
       'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female',
      

## Congenital Anomaly
About 9% of babies have a congenital anomaly. This figure is high in comparison to the VON data as a whole, as abortion was not generally available in Ireland during the period covered. Does the baby have a congenital anomaly? Yes or No?

In [10]:
# using random.choice, I have implemented a bias to reflect the real world
ca = np.random.choice(["Yes","No"], 100, p=[0.11, 0.89])
ca


array(['No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No',
       'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No',
       'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes',
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No',
       'No', 'Yes'], dtype='<U3')
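
With only 100 draws, the share of "Yes" values in a sample will rarely match the probability passed to `random.choice` exactly. The sketch below shows one way to check the realised proportions. The seed and the names `ca_sample` and `proportions` are assumptions, and the probabilities are copied from the cell above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)

# Same categories and probabilities as the cell above
ca_sample = rng.choice(["Yes", "No"], size=100, p=[0.11, 0.89])

# Share of each category in this particular sample
proportions = pd.Series(ca_sample).value_counts(normalize=True)
print(proportions)
```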

## Disposition at 1 Year


In [24]:
# One year after the infant's recorded birth, data is gathered again to establish the infant's status.
dip = np.random.choice(["Died", "At Home", "Hospital", "Unknown"], 100, p=[0.2, 0.5, 0.2, 0.1])
# I have estimated these probabilities based on the report's key findings and used them to generate this array
dip

array(['Unknown', 'Unknown', 'Died', 'At Home', 'At Home', 'At Home',
       'At Home', 'At Home', 'Died', 'Unknown', 'At Home', 'At Home',
       'Hospital', 'Hospital', 'At Home', 'Died', 'At Home', 'Hospital',
       'At Home', 'Hospital', 'Died', 'Hospital', 'At Home', 'At Home',
       'At Home', 'Died', 'Died', 'At Home', 'Unknown', 'At Home',
       'At Home', 'At Home', 'At Home', 'Unknown', 'At Home', 'Hospital',
       'At Home', 'At Home', 'At Home', 'Died', 'Died', 'At Home',
       'At Home', 'At Home', 'Died', 'Died', 'Unknown', 'Hospital',
       'At Home', 'Hospital', 'Died', 'At Home', 'At Home', 'Died',
       'Unknown', 'Unknown', 'Died', 'Hospital', 'At Home', 'At Home',
       'At Home', 'Hospital', 'At Home', 'At Home', 'Hospital', 'At Home',
       'At Home', 'Died', 'Died', 'At Home', 'Hospital', 'Hospital',
       'At Home', 'Died', 'Died', 'Hospital', 'At Home', 'At Home',
       'At Home', 'At Home', 'At Home', 'At Home', 'Unknown', 'At Home',
       'At Home
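
The probabilities passed to `random.choice` must sum to 1, otherwise NumPy raises a `ValueError`. Keeping each category next to its probability in a dictionary makes this easier to check. This is a sketch only: the names `disposition_probs` and `dip_sample` and the seed are assumptions, and the probabilities are the estimates used in the cell above.

```python
import numpy as np

# Estimated probabilities from the cell above, keyed by category
disposition_probs = {"Died": 0.2, "At Home": 0.5, "Hospital": 0.2, "Unknown": 0.1}

# Fail early with a clear message if the estimates do not sum to 1
total = sum(disposition_probs.values())
if not np.isclose(total, 1.0):
    raise ValueError(f"Probabilities sum to {total}, expected 1.0")

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=7)
dip_sample = rng.choice(
    list(disposition_probs.keys()),
    size=100,
    p=list(disposition_probs.values()),
)

print(dip_sample[:5])
```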

# Creating the Dataset

In [27]:
# Populating the Dataset
# using my trial and error example above, I have used this structure to create the dataset

df = pd.DataFrame({"Gestational Weeks" : gw, "Birthweight" : bw, "Sex" : s, "Congenital Anomaly" : ca, "Disposition" : dip})
df

Unnamed: 0,Gestational Weeks,Birthweight,Sex,Congenital Anomaly,Disposition
0,31.0,1450,Female,No,Unknown
1,28.0,667,Male,Yes,Unknown
2,27.0,1923,Male,No,Died
3,29.0,734,Female,No,At Home
4,27.0,983,Male,No,At Home
5,26.0,985,Female,No,At Home
6,30.0,1647,Male,No,At Home
7,25.0,1295,Male,Yes,At Home
8,25.0,2635,Female,No,Died
9,30.0,2357,Female,No,Unknown
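
As a first step towards the analysis, the dataset can be summarised with a cross-tabulation and a grouped mean. The sketch below rebuilds only the ten rows shown above so that it runs on its own; the names `preview`, `disposition_by_sex` and `mean_weight` are assumptions for illustration.

```python
import pandas as pd

# The ten rows displayed above, re-entered by hand so the sketch is self-contained
preview = pd.DataFrame({
    "Gestational Weeks": [31, 28, 27, 29, 27, 26, 30, 25, 25, 30],
    "Birthweight": [1450, 667, 1923, 734, 983, 985, 1647, 1295, 2635, 2357],
    "Sex": ["Female", "Male", "Male", "Female", "Male",
            "Female", "Male", "Male", "Female", "Female"],
    "Congenital Anomaly": ["No", "Yes", "No", "No", "No",
                           "No", "No", "Yes", "No", "No"],
    "Disposition": ["Unknown", "Unknown", "Died", "At Home", "At Home",
                    "At Home", "At Home", "At Home", "Died", "Unknown"],
})

# Count of infants in each Sex / Disposition combination
disposition_by_sex = pd.crosstab(preview["Sex"], preview["Disposition"])
print(disposition_by_sex)

# Mean birth weight (grams) for each disposition
mean_weight = preview.groupby("Disposition")["Birthweight"].mean()
print(mean_weight)
```

Each column in this notebook was drawn independently of the others, so any pattern that appears between columns in such a summary is down to chance rather than a modelled relationship.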
