# Lab Exercise 06
---

## Research pipeline

We have already covered some topics and techniques for loading, cleaning, and manipulating your datasets. This work is extremely important, and it is often the longest, most tedious part of the research process. However, if you have very clean data, you will often be rewarded with fun and exciting analyses and results.

![research](img/research_protocol.png)


The analysis part is crucial and needed before you can produce good visualizations of your results. I am going to cover visualization techniques first, though, since some of these methods and functions will be very helpful for data exploration before you run a more formal analysis or apply machine learning.

## Visualization techniques

Visualization is important to depict certain aspects of your data. These include: 
- distributions
- statistical relationships (continuous variables)
- statistical relationships (categorical variables)

Some visualization methods are built into the modules we already use, most notably pandas, whose plotting methods are built on matplotlib. We can use these to quickly generate depictions of our data and gain a better understanding of it.

Let's load our dataset and use some more pandas methods.

In [None]:
import pandas as pd

df = pd.read_csv("stroke_data.csv")
df.columns

First, we will plot data distributions, which involves counting (frequencies) of samples for a given variable.

Let's use the `plot` method in pandas to plot a histogram of the age variable using the `kind` argument. This method requires the matplotlib module, so we will load that before we run `plot`.

In [None]:
import matplotlib.pyplot as plt

df['age'].plot(kind='hist')
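
To make the "counting" concrete, here is a minimal sketch of what a histogram computes before anything is drawn. It uses a small made-up `ages` Series (not the stroke dataset) and NumPy's `np.histogram`. The choice of 10 equal-width bins matches the default of the pandas histogram plot.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, made up for illustration (not taken from stroke_data.csv)
ages = pd.Series([3, 17, 24, 31, 38, 45, 45, 52, 59, 61, 67, 72, 79, 80, 82])

# np.histogram splits the range [min, max] into equal-width bins
# and counts how many samples fall into each one
counts, edges = np.histogram(ages, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

Each bar in the plot is one of these counts, drawn over its bin.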

We can do this for categorical variables as well. This involves first counting the values in each category and then using those counts in the plot.

We have to change the `kind` argument to `bar` to plot a bar chart depicting the counts for the `gender` variable categories.

In [None]:
df['gender'].value_counts().plot(kind='bar')
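
As a quick sketch of the counting step on its own, the snippet below runs `value_counts` on a small made-up Series (the labels are hypothetical and only stand in for the `gender` column). It also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical category labels, made up for illustration
gender = pd.Series(["Female", "Male", "Female", "Female", "Male", "Other"])

# value_counts returns a Series: index = category, value = number of rows,
# sorted from most to least frequent
counts = gender.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts
proportions = gender.value_counts(normalize=True)
print(proportions)
```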

We can also compare two variables and check for relationships that may exist between them. In pandas we again use the `plot` method, this time on the whole dataframe with `kind='scatter'`. Let's draw a scatter plot of age vs. avg_glucose_level.

In [None]:
df.plot(kind='scatter', x='age', y='avg_glucose_level')

These plots are excellent for quickly generating a visual to understand the data better. But for publication/report quality plots we can do better. So let's check out the seaborn module.

### Seaborn
Seaborn leverages matplotlib to draw "attractive and informative statistical graphics".

Let's import seaborn and use the `displot` function to generate a distribution plot. Seaborn works harmoniously with pandas dataframes to create the plots. We will select the age variable to run this analysis on (same as the pandas example).

In [None]:
import seaborn as sns

sns.displot(data=df, x='age')

With continuous variables, the histogram requires the data to be binned. We can modify the bin size for the `displot` function by using the `binwidth` argument. For the age variable, let's use a bin size of 5.

In [None]:
sns.displot(data=df, x='age', binwidth=5)

We can also set the number of bins, as opposed to the bin size. We do this using the `bins` argument.

In [None]:
sns.displot(data=df, x='age', bins=10)
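
The two arguments are two ways of fixing the same thing: choose the number of bins and the width follows from the data range, or choose the width and the number of bins follows. The sketch below illustrates this with NumPy on made-up ages. It is only an approximation of the idea, and seaborn's exact placement of bin edges may differ.

```python
import numpy as np

# Hypothetical ages, made up for illustration
ages = np.array([2, 9, 14, 23, 35, 41, 48, 56, 63, 70, 78, 80])

# Fixed number of bins: the width follows from the data range
edges_by_count = np.histogram_bin_edges(ages, bins=10)
width = edges_by_count[1] - edges_by_count[0]
print(f"10 bins over [{ages.min()}, {ages.max()}] -> width {width:.1f}")

# Fixed width: the number of bins follows from the data range
binwidth = 5
edges_by_width = np.arange(ages.min(), ages.max() + binwidth, binwidth)
print(f"width {binwidth} -> {len(edges_by_width) - 1} bins")
```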

If we want to plot the count for every single value in the range of ages (no bins) we can use the `discrete` argument.

In [None]:
sns.displot(data=df, x='age', discrete=True)

We can add more information to a single plot by using the `hue` argument. We will separate and color the age data by gender.

In [None]:
sns.displot(data=df, x='age', hue='gender')
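
Conceptually, adding `hue` means the samples are counted per bin and per group. A rough pandas equivalent, assuming a made-up `toy` dataframe and hand-picked 20-year bins (both hypothetical), is a crosstab of binned age against gender:

```python
import pandas as pd

# Hypothetical data, made up for illustration
toy = pd.DataFrame({
    "age":    [12, 25, 33, 41, 47, 58, 63, 66, 71, 79],
    "gender": ["Male", "Female", "Female", "Male", "Female",
               "Male", "Female", "Male", "Female", "Female"],
})

# Bin the ages into 20-year intervals: [0, 20), [20, 40), ...
age_bin = pd.cut(toy["age"], bins=[0, 20, 40, 60, 80], right=False)

# Count the rows in every (bin, gender) combination
per_group = pd.crosstab(age_bin, toy["gender"])
print(per_group)
```

Each column of this table corresponds to one colored set of bars in the plot.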

While the histogram is a very nice representation of the distribution, there are other methods to approximate and visualize the distribution. A kernel density estimate (KDE) plots a "smoothed" distribution. We can request it by passing `'kde'` to the `kind` argument.

In [None]:
sns.displot(data=df, x='age', kind='kde')
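
To show where the "smoothing" comes from, here is a from-scratch sketch of a Gaussian kernel density estimate in NumPy: one small bell curve is placed on every sample, and the curves are averaged. The ages are made up, and the fixed `bandwidth` is an assumption for illustration; seaborn chooses a bandwidth automatically.

```python
import numpy as np

# Hypothetical ages, made up for illustration
ages = np.array([22, 25, 31, 38, 44, 45, 52, 60, 61, 67, 73, 79], dtype=float)

bandwidth = 8.0  # assumed smoothing width in years; larger means smoother
grid = np.linspace(ages.min() - 4 * bandwidth, ages.max() + 4 * bandwidth, 500)

# One Gaussian bump per sample, evaluated at every grid point
z = (grid[:, None] - ages[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# A density should integrate to roughly 1 over the grid
step = grid[1] - grid[0]
print(f"area under curve: {density.sum() * step:.3f}")
```

Unlike a histogram, the y-axis of such a curve is a density rather than a count.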

We can overlay that on the histogram by keeping the default `kind='hist'` and passing `kde=True`.

In [None]:
sns.displot(data=df, x='age', kde=True)

Unlike the pandas plot method, seaborn allows us to plot univariate distributions of categorical variables using the `displot` function. 

We will draw a "histogram" using the gender variable.

In [None]:
sns.displot(data=df, x='gender', kind='hist')

If we want to look at the frequencies for a categorical variable using a proper barplot then we will have to use the `catplot` function. We will set the `kind` argument with `count` to generate a frequency plot.

This is exactly what we did with the pandas `plot` method, but we do not need to preprocess the data because seaborn does the counting for us.

In [None]:
sns.catplot(data=df, x='gender', kind='count')

We can use the `catplot` function to look at relationships between categorical and continuous variables. One such plot is the boxplot.

In [None]:
sns.catplot(data=df, x='gender', y='age', kind='box')

Boxplots provide a wealth of information regarding the distribution of samples with regard to a continuous and categorical variable. 

The middle line in the box represents the median (50th percentile) of the data, while the bottom and top edges of the box are the first quartile, Q1 (25th percentile), and the third quartile, Q3 (75th percentile), respectively. The "whiskers" are the lines coming out of the top and bottom of the box. They extend to the lowest and highest samples that still fall within Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, respectively, where IQR is the interquartile range, Q3 - Q1. Any samples beyond those limits are treated as outliers and appear as individual points above or below the whiskers.
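
The numbers behind a boxplot can be worked out by hand. The sketch below does so with NumPy on made-up ages (the value 120 is a deliberate outlier). It uses the common convention that whiskers are drawn to the most extreme samples that still fall inside the 1.5 * IQR limits.

```python
import numpy as np

# Hypothetical ages, made up for illustration; 120 is a deliberate outlier
ages = np.array([18, 25, 31, 38, 42, 47, 51, 55, 60, 64, 70, 120])

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme samples that are still inside the fences
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```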

If we want to look at relationships between two categorical variables, usually we would look at a crosstab (or contingency table). This shows the counts of samples across multiple variables.

This is easily done with pandas. We will show a simple bivariate crosstab using the `crosstab` function.

In [None]:
pd.crosstab(df['hypertension'], df['heart_disease'])

This is very useful, but we can go a step further and plot this information in a heatmap. This type of plot will depict the frequency information based on a color gradient.

Below, a lighter color indicates a higher frequency and a darker color a lower one.

In [None]:
sns.heatmap(pd.crosstab(df['hypertension'], df['heart_disease']))

We can make this plot even better by adding the counts directly to the plot using the `annot` argument.

In [None]:
sns.heatmap(pd.crosstab(df['hypertension'], df['heart_disease']), annot=True)

Or we can have pandas `crosstab` calculate relative frequencies (proportions of the total) with `normalize=True` and plot those instead.

In [None]:
sns.heatmap(pd.crosstab(df['hypertension'], df['heart_disease'], normalize=True), annot=True)
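
The `normalize` argument decides what each cell is divided by. As a small sketch on a made-up `toy` dataframe (hypothetical 0/1 values, not the stroke data), compare the raw counts with proportions of the grand total and proportions within each row:

```python
import pandas as pd

# Hypothetical 0/1 indicators, made up for illustration
toy = pd.DataFrame({
    "hypertension":  [0, 0, 0, 0, 0, 1, 1, 1],
    "heart_disease": [0, 0, 0, 1, 0, 0, 1, 1],
})

counts = pd.crosstab(toy["hypertension"], toy["heart_disease"])

# normalize=True divides every cell by the grand total
overall = pd.crosstab(toy["hypertension"], toy["heart_disease"], normalize=True)

# normalize="index" divides each row by its own total, so each row sums to 1
by_row = pd.crosstab(toy["hypertension"], toy["heart_disease"], normalize="index")

print(counts, overall, by_row, sep="\n\n")
```

Row-wise proportions are often easier to read when one group is much larger than the other.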

We also want to plot continuous variables with respect to other continuous variables. Like with pandas, we can do this with seaborn. To do this we use the `relplot` function.

Here we will generate a scatter plot of age vs. avg_glucose_level.

In [None]:
sns.relplot(data=df, x='age', y='avg_glucose_level')

We can layer a lot of information onto these scatterplots by changing the color of the points based on another variable. We do this by using the `hue` argument.

Below we will plot age vs. avg_glucose_level and use color to denote whether the patient has hypertension or not.

In [None]:
sns.relplot(data=df, x='age', y='avg_glucose_level', hue='hypertension')

Moreover, we can change the `style` of the points based on another variable. We will add heart_disease to the scatterplot.

In [None]:
sns.relplot(data=df, x='age', y='avg_glucose_level', hue='hypertension', style='heart_disease')

Lastly, we can change the `size` of the points based on yet another variable. Let's use a continuous variable such as bmi.

In [None]:
sns.relplot(data=df, x='age', y='avg_glucose_level', hue='hypertension', style='heart_disease', size='bmi')

While adding all of this information to a single plot may be helpful, it may not always be the best way to portray the information you need.

Based on this scatterplot we can see there is not a linear relationship between avg_glucose_level and age, but hypertension and heart_disease might be related to the two continuous variables. So rather than looking at the scatterplot, we can look at the density of samples at certain values and separate them by the two categorical variables.

We can do this by using `displot` again, but using two variables.

In [None]:
sns.displot(data=df, x='age', y='avg_glucose_level')
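
With two variables, the histogram becomes a grid: the plane is cut into rectangles and the samples in each rectangle are counted, with the count shown as color. A minimal NumPy sketch of that counting step, on randomly generated stand-in values (not the stroke data) and an arbitrary 8 by 6 grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, randomly generated for illustration
age = rng.uniform(0, 80, size=200)
glucose = rng.normal(100, 25, size=200)

# Cut the plane into a grid of rectangles and count the samples in each one
counts, age_edges, glucose_edges = np.histogram2d(age, glucose, bins=(8, 6))

print(counts.shape)       # (number of age bins, number of glucose bins)
print(int(counts.sum()))  # every sample lands in exactly one rectangle
```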

Breaking this down by hypertension...

In [None]:
sns.displot(data=df, x='age', y='avg_glucose_level', hue='hypertension')

We can also plot this using the kde method...

In [None]:
sns.displot(data=df, x='age', y='avg_glucose_level', hue='hypertension', kind='kde')

And again with heart_disease...

In [None]:
sns.displot(data=df, x='age', y='avg_glucose_level', hue='heart_disease', kind='kde')

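Lastly, if we want to use these plots in publications or presentations, we need to save or export them. Matplotlib provides `savefig` for this. It writes a figure to a filename of your choosing and infers the file type from the file extension. In a notebook with the default inline backend, each figure is closed once its cell finishes running, so calling `plt.savefig` in a later cell would save an empty image. Save from the same cell that draws the plot instead; seaborn's figure-level functions such as `displot` return an object with its own `savefig` method.

We will save our plot as "plot.png".

In [None]:
g = sns.displot(data=df, x='age', y='avg_glucose_level', hue='heart_disease', kind='kde')
g.savefig("plot.png")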
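
For publication-quality output it is often useful to control the resolution and to export a vector format as well. The sketch below is one possible approach, using only matplotlib on made-up values: it saves the same figure as PNG and PDF. The temporary output directory and the `dpi=300` setting are assumptions for illustration, and the `matplotlib.use("Agg")` line is only there so the example runs without a display (leave it out in a notebook).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist([22, 25, 31, 38, 44, 45, 52, 60, 61, 67], bins=5)  # made-up ages
ax.set_xlabel("age")

out_dir = tempfile.mkdtemp()  # hypothetical output location
png_path = os.path.join(out_dir, "plot.png")
pdf_path = os.path.join(out_dir, "plot.pdf")

# The extension selects the format; dpi and bbox_inches are optional extras
fig.savefig(png_path, dpi=300, bbox_inches="tight")
fig.savefig(pdf_path, bbox_inches="tight")
plt.close(fig)
```

Calling `savefig` on the figure object, rather than on `plt`, makes it explicit which figure is being written.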