To explore missing data, we will use the ["echocardiogram" data from UCI](https://archive.ics.uci.edu/dataset/38/echocardiogram).







In [None]:
# prompt: Input csv file from google drive called echocardiogram.csv

import pandas as pd

# Direct-download link to echocardiogram.csv on Google Drive
url = "https://drive.google.com/uc?export=download&id=19VA0LGyLsVnm6zIlajJf08bA_BEIV2yh"

# Load the data; use .head(), .info() and .describe() to explore it
echo_df = pd.read_csv(url)
echo_df

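| Unnamed: 0 | survival | alive | age | pericardialeffusion | fractionalshortening | epss | lvdd | wallmotion-score | wallmotion-index | mult | name | group | aliveat1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11.0 | 0.0 | 71.0 | 0.0 | 0.260 | 9.000 | 4.600 | 14.0 | 1.000 | 1.000 | name | 1 | 0.0 |
| 1 | 19.0 | 0.0 | 72.0 | 0.0 | 0.380 | 6.000 | 4.100 | 14.0 | 1.700 | 0.588 | name | 1 | 0.0 |
| 2 | 16.0 | 0.0 | 55.0 | 0.0 | 0.260 | 4.000 | 3.420 | 14.0 | 1.000 | 1.000 | name | 1 | 0.0 |
| 3 | 57.0 | 0.0 | 60.0 | 0.0 | 0.253 | 12.062 | 4.603 | 16.0 | 1.450 | 0.788 | name | 1 | 0.0 |
| 4 | 19.0 | 1.0 | 57.0 | 0.0 | 0.160 | 22.000 | 5.750 | 18.0 | 2.250 | 0.571 | name | 1 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128 | 7.5 | 1.0 | 64.0 | 0.0 | 0.240 | 12.900 | 4.720 | 12.0 | 1.000 | 0.857 | name |  |  |
| 129 | 41.0 | 0.0 | 64.0 | 0.0 | 0.280 | 5.400 | 5.470 | 11.0 | 1.100 | 0.714 | name |  |  |
| 130 | 36.0 | 0.0 | 69.0 | 0.0 | 0.200 | 7.000 | 5.050 | 14.5 | 1.210 | 0.857 | name |  |  |
| 131 | 22.0 | 0.0 | 57.0 | 0.0 | 0.140 | 16.100 | 4.360 | 15.0 | 1.360 | 0.786 | name |  |  |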
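
The comment in the cell above suggests `.head`, `.info` and `.describe`. A minimal sketch of that exploration follows. So that it runs without the download, it rebuilds a tiny frame from four of the rows displayed above and keeps only a few columns; the name `sample_df` is a stand-in chosen for this example. On the real data, the same calls would be made on `echo_df`.

```python
import numpy as np
import pandas as pd

# Stand-in for echo_df (assumption): rows 0, 4, 128 and 131 from the output
# above, restricted to a few columns so the example stays short.
sample_df = pd.DataFrame(
    {
        "survival": [11.0, 19.0, 7.5, 22.0],
        "alive": [0.0, 1.0, 1.0, 0.0],
        "age": [71.0, 57.0, 64.0, 57.0],
        "group": [1, 1, np.nan, np.nan],
        "aliveat1": [0.0, 0.0, np.nan, np.nan],
    },
    index=[0, 4, 128, 131],
)

print(sample_df.head())      # first rows (at most five)
sample_df.info()             # dtypes and non-null counts per column
print(sample_df.describe())  # summary statistics; NaN values are skipped
```

The non-null counts from `.info()` and the `count` row of `.describe()` are often the first place where missing values become visible.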


### Data set

The goal of this study is to determine whether a patient survived at least one year after the heart attack, so the key variable is **aliveat1**.

Here are the definitions of the features.

   1. survival -- the number of months patient survived (has survived, if patient is still alive).  Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive.  Check the second variable to confirm this.  Such patients cannot be used for the prediction task mentioned above.
   2. still-alive -- a binary variable.  0 = dead at end of survival period, 1 = still alive
   3. age-at-heart-attack -- age in years when heart attack occurred
   4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart.  0=no fluid, 1=fluid
   5. fractional-shortening -- a measure of contractility around the heart; lower numbers are increasingly abnormal
   6. epss -- E-point septal separation, another measure of contractility.  Larger numbers are increasingly abnormal.
   7. lvdd -- left ventricular end-diastolic dimension.  This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
   8. wall-motion-score -- a measure of how the segments of the left ventricle are moving
   9. wall-motion-index -- equals wall-motion-score divided by number of segments seen.  Usually 12-13 segments are seen in an echocardiogram.  Use this variable INSTEAD of the wall motion score.
   10. mult -- a derived variable which can be ignored
   11. name -- the name of the patient (I have replaced them with "name")
   12. group -- meaningless, ignore it
   13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year.  1 means patient was alive at 1 year.
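
The definitions of features 1, 2 and 13 imply a rule that can be checked in code. The sketch below is one reading of that rule, not necessarily how the recorded label was built: a patient with at least 12 months of survival counts as alive at one year, a patient who died before 12 months counts as 0, and a patient still alive with under 12 months of follow-up is left as missing because the outcome is unknown. The function name `label_alive_at_1`, the column `aliveat1_derived` and the small `demo` frame are assumptions made for illustration; the other column names match the loaded data.

```python
import numpy as np
import pandas as pd


def label_alive_at_1(df: pd.DataFrame) -> pd.Series:
    """Derive a one-year survival label from `survival` (months) and `alive`."""
    followed_a_year = df["survival"] >= 12
    died_early = (df["survival"] < 12) & (df["alive"] == 0)

    label = pd.Series(np.nan, index=df.index, name="aliveat1_derived")
    label[followed_a_year] = 1.0
    label[died_early] = 0.0
    # Patients still alive with < 12 months of follow-up stay NaN,
    # as do rows where survival or alive is itself missing.
    return label


# Rows 0, 1, 4 and 128 from the output shown earlier in this notebook.
demo = pd.DataFrame(
    {
        "survival": [11.0, 19.0, 19.0, 7.5],
        "alive": [0.0, 0.0, 1.0, 1.0],
        "aliveat1": [0.0, 0.0, 0.0, np.nan],
    },
    index=[0, 1, 4, 128],
)
demo["aliveat1_derived"] = label_alive_at_1(demo)
print(demo)
```

Under this reading, rows 1 and 4 get a derived label of 1 while the recorded `aliveat1` is 0, so the recorded column may follow a different rule than the written definition. Row 128 stays missing in both columns.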



### Missing Data evaluation


Let's check for missing values in our data.

In [None]:
# prompt: count missing values for each column

missing_values = echo_df.isnull().sum()
print(missing_values)
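
Raw counts are easier to compare as a share of rows. The sketch below uses a made-up four-row frame (`toy_df`, with illustrative values that are not taken from the data set) in place of `echo_df`. Because `isnull()` returns booleans, `.mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; swap in echo_df for the real data.
toy_df = pd.DataFrame(
    {
        "age": [71.0, 72.0, np.nan, 60.0],
        "epss": [9.0, np.nan, np.nan, 12.062],
        "name": ["name", "name", "name", "name"],
    }
)

summary = pd.DataFrame(
    {
        "n_missing": toy_df.isnull().sum(),
        "pct_missing": (toy_df.isnull().mean() * 100).round(1),
    }
).sort_values("pct_missing", ascending=False)
print(summary)
```

Sorting by percentage puts the columns that need the most attention at the top.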


## Class Discussion

- **aliveat1** is our target variable, but it is missing in a large percentage of rows.  What does that mean?  How do we deal with it?  Explore.


- Look at the **aliveat1** feature.  Is it completely consistent with the definition in the data description?

- How might we deal with other missing data issues in this data set?  
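
As one possible starting point for these questions (a sketch, not a recommendation): rows without a target cannot be used for supervised training, so a common option is to drop them, while gaps in numeric features can be filled with a simple statistic such as the column median. The frame `toy_df` and its values are made up for illustration; whether dropping or imputing is appropriate for the echocardiogram data is exactly what the discussion should settle.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; column names mirror the data set.
toy_df = pd.DataFrame(
    {
        "age": [71.0, np.nan, 55.0, 60.0, 64.0],
        "epss": [9.0, 6.0, np.nan, 12.0, 12.9],
        "aliveat1": [0.0, 0.0, 1.0, np.nan, np.nan],
    }
)

# 1. Keep only rows where the target is known.
labeled = toy_df.dropna(subset=["aliveat1"]).copy()

# 2. Fill missing feature values with each column's median,
#    computed on the labeled rows only.
feature_cols = ["age", "epss"]
medians = labeled[feature_cols].median()
labeled[feature_cols] = labeled[feature_cols].fillna(medians)
print(labeled)
```

Median imputation keeps every labeled row but shrinks the spread of the filled columns, which is one trade-off to weigh against simply dropping incomplete rows.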

