## Introduction to Stats in Python Studio

We are going to be working with this [dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) from Kaggle.  No need to download, as it is included in the git repository you just cloned.
<br>

Cardiovascular disease is the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
<br>

Heart failure is a common event caused by heart disease, and this dataset contains 12 features that can be used to predict mortality from heart failure. Your task is to look at two particular variables and record your observations about how useful they are for predicting death from heart failure.
<br>

In section one, you will run some simple exploratory data analysis (EDA) and apply statistical terminology to describe each variable in more detail. Section two explores how your variables are distributed. Finally, in section three you will make some inferences about your variables and decide whether they are good indicators for predicting death from heart failure.
<br>

Answer the questions and record your observations in the space provided. Feel free to add more code blocks if you'd like.
<br>



In [1]:
# Import the libraries we need, with aliases
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')


# Set style and font size
sns.set_style('darkgrid')
sns.set(font_scale=1.5)

In [7]:
# Read in data to a dataframe
df = pd.read_csv('heart3.csv')

## Section 1: First look at the data:

Run some simple EDA and look at the data and your variables. Answer the following questions.

In [9]:
df.describe()

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 0.431438 | 581.839465 | 0.41806 | 38.083612 | 0.351171 | 263358.029264 | 1.39388 | 136.625418 | 0.648829 | 0.32107 | 130.26087 | 0.32107 |
| std | 11.894809 | 0.496107 | 970.287881 | 0.494067 | 11.834841 | 0.478136 | 97804.236869 | 1.03451 | 4.412477 | 0.478136 | 0.46767 | 77.614208 | 0.46767 |
| min | 40.0 | 0.0 | 23.0 | 0.0 | 14.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 51.0 | 0.0 | 116.5 | 0.0 | 30.0 | 0.0 | 212500.0 | 0.9 | 134.0 | 0.0 | 0.0 | 73.0 | 0.0 |
| 50% | 60.0 | 0.0 | 250.0 | 0.0 | 38.0 | 0.0 | 262000.0 | 1.1 | 137.0 | 1.0 | 0.0 | 115.0 | 0.0 |
| 75% | 70.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 303500.0 | 1.4 | 140.0 | 1.0 | 1.0 | 203.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 80.0 | 1.0 | 850000.0 | 9.4 | 148.0 | 1.0 | 1.0 | 285.0 | 1.0 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [12]:
df[['DEATH_EVENT', 'smoking', 'sex', 'age']].groupby(['smoking', 'sex']).describe()

**DEATH_EVENT**

| smoking | sex | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 101.0 | 0.306931 | 0.463521 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 0 | 1 | 102.0 | 0.343137 | 0.477101 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 0 | 4.0 | 0.75 | 0.5 | 0.0 | 0.75 | 1.0 | 1.0 | 1.0 |
| 1 | 1 | 92.0 | 0.293478 | 0.457851 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |

**age**

| smoking | sex | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 101.0 | 59.650168 | 11.309325 | 40.0 | 50.0 | 60.0 | 65.0 | 95.0 |
| 0 | 1 | 102.0 | 61.702618 | 12.917774 | 40.0 | 51.5 | 60.0 | 70.0 | 95.0 |
| 1 | 0 | 4.0 | 63.0 | 10.132456 | 50.0 | 57.5 | 65.0 | 70.5 | 72.0 |
| 1 | 1 | 92.0 | 61.076087 | 11.468288 | 40.0 | 52.0 | 60.0 | 70.0 | 90.0 |


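Which of our columns are categorical data?

`anaemia`, `diabetes`, `high_blood_pressure`, `sex`, `smoking`, and `DEATH_EVENT`. They are stored as integers, but each takes only the values 0 and 1, so the numbers are labels for two categories rather than measurements.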
  
  
Which of our columns are continuous?  
  
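`age`, `creatinine_phosphokinase`, `ejection_fraction`, `platelets`, `serum_creatinine`, `serum_sodium`, and `time`. Each is a measurement on a numeric scale, even though several are recorded as whole numbers.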
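One way to check this split is to look for columns whose only values are 0 and 1. The sketch below uses a small hypothetical frame (the rows are made up for illustration and only the column names follow the dataset); the same comprehension could be run on `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; column names follow the dataset
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "smoking": [0, 0, 1, 0],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9],
    "DEATH_EVENT": [1, 1, 1, 0],
})

# A column that contains nothing but 0 and 1 is a binary category
binary_cols = [col for col in toy.columns if toy[col].isin([0, 1]).all()]

# Everything else is treated as a numeric measurement
numeric_cols = [col for col in toy.columns if col not in binary_cols]

print("Binary (categorical):", binary_cols)
print("Numeric (continuous):", numeric_cols)
```

The check is a heuristic: a measurement that happened to contain only 0s and 1s would also be flagged, so the result is worth comparing against the dataset description.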

### Statistical interpretation of our data?

#### First Variable:  Death Event

Mean, Min, Max, STD? Describe what this means.

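Overall: Mean = 0.32107, Min = 0, Max = 1, STD = 0.46767. Because `DEATH_EVENT` is coded as 0 or 1, the mean is the share of patients who died, about 32%. For non-smoking females (`smoking` = 0, `sex` = 0 in the grouped table above), the mean is 0.306931, with Min = 0 and Max = 1.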
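Because `DEATH_EVENT` only takes the values 0 and 1, its summary statistics follow from a single number, the share of 1s. The sketch below assumes 96 deaths among the 299 patients, a count inferred from the reported mean (0.32107 × 299 ≈ 96) rather than read from the file, and reproduces the mean and standard deviation shown by `df.describe()`.

```python
import math

n = 299       # number of patients, from the count row of df.describe()
deaths = 96   # assumption: inferred from mean * count, not read from the file

# For a 0/1 variable the mean is simply the proportion of 1s
p = deaths / n

# Sample standard deviation (n - 1 in the denominator, as pandas uses)
sample_std = math.sqrt(p * (1 - p) * n / (n - 1))

print(f"mean = {p:.5f}, std = {sample_std:.5f}")
```

The standard deviation of a binary variable carries no information beyond the mean, which is one reason proportions and counts are the more natural summaries for categorical columns.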

#### Second Variable: Smoking

Mean, Min, Max, STD? Describe what this means.

What could the numbers in our categorical data tell us?

Why might we want to keep our categorical data as 1's and 0's? Why may we want to use something like the code below to change it?

In [30]:
df['sex'] = df.sex.replace({1: "Male", 0: "Female"})
df['anaemia'] = df.anaemia.replace({1: "Yes", 0: "No"})
df['diabetes'] = df.diabetes.replace({1: "Yes", 0: "No"})
df['high_blood_pressure'] = df.high_blood_pressure.replace({1: "Yes", 0: "No"})
df['smoking'] = df.smoking.replace({1: "Yes", 0: "No"})

df['DEATH_EVENT'] = df.DEATH_EVENT.replace({1: "Died", 0: "Alive"})

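The trade-off between the two encodings can be sketched on a toy column. The values below are hypothetical, and `smoking_label` is a name introduced only for this example. Writing the labels to a new column, rather than overwriting the original, keeps the 0/1 codes available for arithmetic.

```python
import pandas as pd

# Hypothetical toy data, not rows from heart3.csv
toy = pd.DataFrame({"smoking": [1, 0, 0, 1, 0]})

# With 0/1 codes, the mean is the share of rows equal to 1
share_smokers = toy["smoking"].mean()

# Labels are easier to read in tables and plot legends;
# a separate column leaves the numeric codes untouched
toy["smoking_label"] = toy["smoking"].map({1: "Yes", 0: "No"})
label_counts = toy["smoking_label"].value_counts()

print(share_smokers)
print(label_counts)
```

Once a column holds strings, numeric methods such as `.mean()` or comparisons like `== 1` no longer behave as they did on the codes, which matters for any later cell that filters on the original values.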

## Section 2: Distribution of our data:

In [34]:
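# Plot the distribution of your variable.
# distplot is deprecated in recent seaborn versions; histplot is its replacement.
# smoking has only two categories, so a KDE curve is left out and bars are split by sex.

sns.histplot(data=df, x='smoking', hue='sex', multiple='dodge', discrete=True, shrink=0.8)
plt.show()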


In [None]:
# Create boxplot to show distribution of variable
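A boxplot can also be read from the quartiles that `df.describe()` already reported. As a sketch, the block below takes the `creatinine_phosphokinase` column from that output and applies the usual 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn boxplots; points beyond the fences are drawn as individual outliers.

```python
# Quartiles and extremes for creatinine_phosphokinase, copied from df.describe()
q1, q3 = 116.5, 582.0
minimum, maximum = 23.0, 7861.0

# Interquartile range: the height of the box
iqr = q3 - q1

# Whiskers extend at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

has_low_outliers = minimum < lower_fence
has_high_outliers = maximum > upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"low outliers: {has_low_outliers}, high outliers: {has_high_outliers}")
```

The maximum lies far above the upper fence while the minimum sits inside the lower one, so the boxplot for this column should show outliers on the high side only.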


In [None]:
# Feel free to add any additional graphs that help you answer the questions below.

In [None]:
# Another way to check the skewness of our variable
df['variable'].skew()

In [None]:
# Another way to check the kurtosis of our variable
df['variable'].kurtosis()

### Interpretation of how our data is distributed by variable?
Looking at the above graphs, what can you tell about the distribution of your variables?
<br><br><br><br><br>
What are the skewness and kurtosis of your variables? What does this mean?<br>
<br><br><br><br><br>
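As a rough guide, positive skewness means a long right tail and usually a mean above the median; in the `df.describe()` output, `creatinine_phosphokinase` has a mean of about 582 against a median of 250, which points that way. The sketch below uses made-up values, not rows from the dataset, to show the sign conventions pandas uses.

```python
import pandas as pd

# Made-up values for illustration: one large value creates a long right tail
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

skew_right = right_tailed.skew()        # positive for a long right tail
skew_sym = symmetric.skew()             # zero for a perfectly symmetric sample

# pandas reports excess kurtosis, so a normal distribution scores about 0
excess_kurt = right_tailed.kurtosis()

print(skew_right, skew_sym, excess_kurt)
```

Positive excess kurtosis indicates heavier tails than a normal distribution, which in practice often means a few extreme values.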
What are some of the differences you note looking at a categorical variable vs a continuous variable?
<br><br><br><br><br>

## Section 3: Finding Correlations

Let's start by splitting our data in two.

In [None]:
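# splitting the dataframe into 2 parts
# on basis of 'DEATH_EVENT' column values
# (matches both the original 1/0 codes and the "Died"/"Alive" labels from Section 1)
df_died = df[df['DEATH_EVENT'].isin([1, 'Died'])]
df_lived = df[df['DEATH_EVENT'].isin([0, 'Alive'])]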

In [None]:
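# Plot your variable based on if they died or lived
# (histplot replaces the deprecated distplot; stat='density' keeps the two groups comparable)

sns.histplot(df_died['variable'], kde=True, stat='density', color='tab:red', label='Died')
sns.histplot(df_lived['variable'], kde=True, stat='density', color='tab:blue', label='Lived')
plt.title("Chances of survival vs Variable")
plt.legend()
plt.show()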


In [None]:
# Feel free to add any additional graphs that help you answer the questions below.

#### What can you infer about the population from each of your variables if we treat our data as a sample?
<br><br><br><br><br>
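Treating the data as a sample means every proportion comes with sampling error. The sketch below uses the normal approximation for a rough 95% interval around the overall death rate, and contrasts it with the four-person group (`smoking` = 1, `sex` = 0). The counts are assumptions inferred from mean × count in the Section 1 output (96 of 299 overall, 3 of 4 in the small group), and `proportion_se` is a helper name introduced here.

```python
import math

def proportion_se(successes, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# Assumed counts, inferred from mean * count in the summary tables
overall_p = 96 / 299
overall_se = proportion_se(96, 299)

# Rough 95% interval using the normal approximation
ci_low = overall_p - 1.96 * overall_se
ci_high = overall_p + 1.96 * overall_se

# The smoking = 1, sex = 0 group has only 4 patients
small_group_se = proportion_se(3, 4)

print(f"overall death rate: {overall_p:.3f} (95% CI roughly {ci_low:.3f} to {ci_high:.3f})")
print(f"standard error with n = 4: {small_group_se:.3f}")
```

With only four patients the standard error is around 0.22, so the 0.75 death rate in that group says very little about female smokers in general.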
#### Do you think either of your variables is a good indicator for predicting death from heart failure? Why or why not?
<br><br><br><br><br>
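For two binary variables such as `smoking` and `DEATH_EVENT`, one possible check is to compare death rates between the groups and compute the phi coefficient, which is the Pearson correlation for a 2×2 table. The counts below are assumptions inferred from mean × count in the grouped output of Section 1, not values read from the file.

```python
import math

# Assumed 2x2 counts, inferred from the groupby table (mean * count per group)
smoker_died, smoker_total = 3 + 27, 4 + 92
nonsmoker_died, nonsmoker_total = 31 + 35, 101 + 102

smoker_rate = smoker_died / smoker_total
nonsmoker_rate = nonsmoker_died / nonsmoker_total

# Cells of the 2x2 table: rows are smoker / non-smoker, columns are died / lived
a, b = smoker_died, smoker_total - smoker_died
c, d = nonsmoker_died, nonsmoker_total - nonsmoker_died

# Phi coefficient: Pearson correlation between two 0/1 variables
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"death rate, smokers: {smoker_rate:.4f}")
print(f"death rate, non-smokers: {nonsmoker_rate:.4f}")
print(f"phi coefficient: {phi:.4f}")
```

A phi value close to zero, together with nearly equal death rates in the two groups, would suggest that smoking status on its own is a weak predictor in this sample.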