# Week 4: Visualizing Data 📊
## Tutorial

In the pre-module, you saw one example of a plot we can make with pandas: ```hist()```. Plotting libraries let us generate and refine visualizations that would otherwise take a lot of time to draw and redraw by hand. Now it's your turn to learn to write the code that creates them!

In this module you will learn:
1. More plotting using pandas
2. Bivariate analysis
3. Plotting heatmaps using seaborn

<span style="background-color: #FFD700">**Complete the code below to load the dataset.**</span>

In [None]:
import pandas as pd
df = ...
df


## Pandas plot

The ```hist()``` function you learned about in the pre-module notebook is a convenient function for generating histograms, but we can do much more visualization with the general plotting function for DataFrames, called **plot()**.


| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
| plot() | kind, xlabel, ylabel, title | A plot of the data, using the specified kind of plot. | df.plot(kind, xlabel, ylabel, title) |

This function has many parameters you can specify and play with, but you do not need to worry about them for now. The important parameter we'd like to point out is ```kind```. Since plot() is a general plotting function, the **kind** parameter lets us specify what type of plot we want to produce.

These types of plots are available:

**kind**
* 'line' : line plot (default)
* 'bar' : vertical bar plot
* 'barh' : horizontal bar plot
* 'hist' : histogram
* 'box' : boxplot
* 'kde' : Kernel Density Estimation plot
* 'density' : same as 'kde'
* 'area' : area plot
* 'pie' : pie plot
* 'scatter' : scatter plot (DataFrame only)
* 'hexbin' : hexbin plot (DataFrame only)

You can specify the axis labels and title on the plot by giving strings for the parameters ```xlabel```, ```ylabel```, and ```title```.
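
As a minimal sketch of how ```kind``` and the label parameters fit together, the cell below plots a small made-up table. The DataFrame ```toy``` and its columns ```hours``` and ```score``` are invented for illustration and are not part of the heart failure dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 80],
})

# kind selects the plot type; x and y name the columns to draw
ax = toy.plot(
    kind="line",
    x="hours",
    y="score",
    xlabel="Hours studied",
    ylabel="Test score",
    title="Score by hours studied",
)
```

Changing only the ```kind``` string, for example to ```'bar'```, redraws the same data as a different type of plot.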

For example, if we wanted to plot a histogram for the ejection_fraction variable using plot() instead of hist(), we could specify it like this:

In [None]:
ef_hist = df['ejection_fraction'].plot(kind='hist', ylabel='Frequency', title='Histogram of ejection fraction')
ef_hist

Cool! What other kinds of plots can we make?

### Bar graph

Another way to examine the distribution of the data is through a bar graph. Say that instead of looking at the histogram of ejection_fraction (frequency of patients in each *bucket* of ejection fraction), we are now interested in how many patients there are for *every distinct* ejection fraction value. We can visualize the distribution using a bar graph with Number of Patients on the y-axis and Ejection Fraction on the x-axis. We first use the ```value_counts()``` function to get the frequency of each distinct value in the ejection_fraction column, and then we plot the bar graph.

The cell below calls value_counts() on the ejection fraction column. Run it, and you should see all distinct values of ejection fraction (35 to 70) in the left-hand column, with the corresponding frequency (number of patients) in the right-hand column. Finally, ```sort_index()``` sorts the rows by the values in the left-hand column (the index) in ascending order.

| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
| value_counts() | n/a | The frequency of each distinct value in the Series. | series.value_counts() |
| sort_index() | n/a | The Series sorted by its index in ascending order. | series.sort_index() |

Remember that a Series in pandas is like a list of values. You can replace "series" in the syntax above with a row or column of a DataFrame.
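
To see what these two functions do on their own, here is a small sketch on a made-up Series. The name ```ratings``` and its values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy Series, invented for this illustration only
ratings = pd.Series([3, 1, 3, 2, 3, 1])

# value_counts() counts how often each distinct value occurs,
# ordered from most frequent to least frequent
counts = ratings.value_counts()
print(counts)

# sort_index() reorders the result by the values themselves (the index)
counts_sorted = counts.sort_index()
print(counts_sorted)
```

In the sorted result, the distinct values 1, 2, and 3 appear in ascending order on the left, with counts of 2, 1, and 3 on the right.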

In [None]:
ef_dist = df['ejection_fraction'].value_counts().sort_index()
ef_dist

---

##### **Q1.** Call the plot function on this result to generate a bar graph. You should label the y-axis "Number of Patients", and the x-axis "Ejection Fraction".

<span style="background-color: #FFD700">**Complete the code below.**</span>

In [None]:
# TODO: fill in all parameters of this function
ef_dist.plot(...)

---

### Stacked bar plot

As you carry on your research, you suspect that older patients might be at greater risk of death. You want to see what fraction of patients in each age group have died, versus the fraction that lived in that age group. How would you go about this?

First, we want to sort each patient into their age group, and record this age group in a new column in the DataFrame. Then, we should get the counts of deaths per age group. We can then divide the death count of an age group by the total number of patients in that group to get the fractions of patients who died in that age group. We have provided the code below to do this data manipulation, and save the fractions in ```age_death_fractions```.
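
The two key steps described above, binning ages and turning counts into fractions, can be tried on toy data first. This is only a sketch: the ages, the counts, and the labels ```younger```, ```older```, ```survived```, and ```died``` are all invented for illustration.

```python
import pandas as pd

# Hypothetical ages, invented for this illustration only
ages = pd.Series([35, 40, 41, 79, 80])

# pd.cut assigns each age to an interval. By default each interval
# excludes its left edge and includes its right edge, so 40 falls in (30, 40]
groups = pd.cut(ages, bins=[30, 40, 50, 60, 70, 80])
print(groups)

# Hypothetical counts per group, invented for this illustration only
counts = pd.DataFrame(
    {"survived": [8, 2], "died": [2, 6]},
    index=["younger", "older"],
)

# Divide every row by its row total so that each row sums to 1
fractions = counts.div(counts.sum(axis=1), axis=0)
print(fractions)
```

Note that an age of exactly 80 lands in the interval (70, 80], not in the next one up.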


<span style="background-color: #FFD700">**Run the code below.**</span>

In [None]:
# Create age groups
age_groups = [30, 40, 50, 60, 70, 80, 90, 100]
df['age_group'] = pd.cut(df['age'], bins=age_groups)

# Create a new DataFrame with counts of DEATH_EVENT for each age group
age_death_counts = df.pivot_table(index='age_group', columns='DEATH_EVENT', aggfunc='size', fill_value=0)
# Calculate the total count for each age group
total_counts = age_death_counts.sum(axis=1)
# Convert counts to fractions by dividing by the total count
age_death_fractions = age_death_counts.div(total_counts, axis=0)
age_death_fractions


We could go ahead and generate a bar graph at this point, as we did in the previous section:

In [None]:
age_death_fractions.plot(kind="bar", ylabel='Fraction of patients')

---

##### **Q2.** This visualization works, but it would be quicker to interpret if we stacked the orange and blue bars.

To stack the two results for each outcome (death/no death), we just have to specify ```stacked=True``` in addition to specifying the ```kind``` as a bar graph when we call the ```plot()``` function.
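
Here is a minimal sketch of the same idea on a made-up table, so you can see the effect before trying it on the real data. The DataFrame ```toy_fractions``` and its labels are invented for illustration.

```python
import pandas as pd

# Hypothetical fractions, invented for this illustration only;
# each row sums to 1
toy_fractions = pd.DataFrame(
    {"survived": [0.8, 0.25], "died": [0.2, 0.75]},
    index=["younger", "older"],
)

# stacked=True draws one bar per row, with the columns stacked on top of each other
ax = toy_fractions.plot(kind="bar", stacked=True, ylabel="Fraction of patients")
```

Because each row sums to 1, every stacked bar reaches the same height, which makes the split within each group easy to compare.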


<span style="background-color: #FFD700">**Complete the code below to plot the data in ```age_death_fractions``` as a stacked bar graph.**</span>


In [None]:
# TODO: fill in the arguments for the stacked bar plot
age_death_fractions.plot(...)

---

##### **Q3.** What observations can you make about patients in the (70, 80] age group?

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 


---

##### **Q4.** According to the stacked bar plot, do you think older age groups are at greater risk of death, as per your initial hypothesis? Support your answer with evidence from the plot.

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

---

##### **Q5.** View ```age_death_counts``` by printing it in a separate code cell below. Note how many patients are in each age category, particularly the youngest and oldest groups. Why might you want to exclude these groups when creating a stacked bar plot with the fraction of patients?

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

---

### Box plot

Boxplots are another very common type of plot for visualizing the distribution of data. To create a boxplot, we specify ```'box'``` as the ```kind``` of plot when calling ```plot()``` on the column we want to plot.

In [None]:
age_box = df['age'].plot(kind="box", title="Boxplot of patient ages")


Take a look at the boxplot. Here is how to interpret it:

Box: The box represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). The height of the box indicates the spread of the middle 50% of the data.

Line inside the box: This line represents the median, which is the middle value of the dataset when it is sorted.

Whiskers: The whiskers extend from the box to the smallest and largest values that lie within a defined range. By default, this range reaches 1.5 times the IQR beyond each edge of the box. Any data points beyond the whiskers are considered potential outliers.

Outliers: Outliers will be shown as individual points beyond the whiskers. In this boxplot, there are no outliers.
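
The quantities behind a boxplot can also be computed by hand. The sketch below does this for a made-up Series, assuming the default whisker rule of 1.5 times the IQR; the name ```values``` and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)   # first quartile (bottom of the box)
q3 = values.quantile(0.75)   # third quartile (top of the box)
median = values.median()     # line inside the box
iqr = q3 - q1                # height of the box

# Whiskers stop at the most extreme data points inside these limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside the limits would be drawn as an individual point
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, median, q3, iqr)
print(outliers)
```

Here the box spans 3.25 to 7.75, the limits are -3.5 and 14.5, and the value 30 is the only point that would be drawn as an outlier.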

## Bivariate analysis

You've seen lots of ways to view the distribution of data with histograms, bar graphs, stacked bar graphs, and boxplots. However, there is so much more to data science than just distribution! We also wish to learn about the correlation between features in our dataset.

Bivariate analysis is useful when we want to know the relationship, or *correlation*, between two variables in the dataset. A correlation value falls between -1 and 1. The value can be interpreted as follows:

<span style="background-color: #AFEEEE">**-1 to 0:**</span>
 The two variables have a negative relationship; as one variable increases, the other decreases

<span style="background-color: #AFEEEE">**0:**</span>
 The two variables have no relationship with each other.

<span style="background-color: #AFEEEE">**0 to 1:**</span>
 The two variables have a positive relationship; as one variable increases, the other also increases.

This means that values near zero (regardless of the sign) are weakly correlated, and values near -1 or 1 are strongly correlated.
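
As a quick sketch of these three cases, the cell below computes the correlation between made-up columns using the pandas Series method ```corr()```. The DataFrame ```toy``` and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
toy = pd.DataFrame({
    "x":    [1, 2, 3, 4, 5],
    "up":   [2, 4, 6, 8, 10],   # rises whenever x rises
    "down": [10, 8, 6, 4, 2],   # falls whenever x rises
    "flat": [1, 3, 2, 3, 1],    # no overall upward or downward trend with x
})

r_up = toy["x"].corr(toy["up"])       # close to 1
r_down = toy["x"].corr(toy["down"])   # close to -1
r_flat = toy["x"].corr(toy["flat"])   # close to 0
print(r_up, r_down, r_flat)
```

Real data rarely produces values this clean; most correlations fall somewhere between these extremes.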

Since there are 12 variables in our heart failure dataset, there can be 12 x 12 = 144 comparisons. The DataFrame function **corr()** generates a table of correlation values.

| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
| corr() | n/a | A table of correlation values for each pair of features. | df.corr() |

<span style="background-color: #FFD700">**Run the code cell below.**</span>

In [None]:
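# numeric_only=True skips non-numeric columns such as age_group (requires pandas 1.5 or later)
df.corr(numeric_only=True)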

Take a second to view the generated table. It is not easy to see which correlation values are the highest or lowest from this table. We can pick out the strong correlations by creating a heatmap, which uses color intensity to distinguish between higher and lower values. Unfortunately, the pandas ```plot()``` function has no heatmap among its available kinds. For this reason, we will introduce a different library to help us do bivariate analysis.


### Seaborn
Introducing: <span style="background-color: #AFEEEE">**Seaborn**</span>! Seaborn is a data visualization library with a special heatmap function. We import the seaborn library and use the <span style="background-color: #AFEEEE">**heatmap()**</span> function to visualize the above correlation table as a heatmap.

| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
| heatmap() | data, annot, cmap (see documentation for more) | A heatmap object. | heatmap(data, annot, cmap) |

These parameters provide customization options for visualizing data in the heatmap:
* data (required): Represents the 2D dataset for the heatmap. This can be a NumPy array or a Pandas DataFrame. Index/column information in DataFrame is used for labeling.
* annot: Controls whether to display values in cells. If True, shows actual values; if an array is provided, uses it for annotation.
* cmap: Specifies the colormap for mapping data values to colors. Can be a colormap name, object, or list of colors. Default colormap depends on whether the center parameter is set.

The heatmap() function has more parameters for further customization; these are just the ones we are using in this example.
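
To get a feel for these parameters, here is a minimal sketch that draws a heatmap of a small made-up table rather than a correlation table. The DataFrame ```visits``` and its labels are invented for illustration, and ```'viridis'``` is just one of many available colormap names.

```python
import pandas as pd
import seaborn as sns

# Hypothetical 2D table, invented for this illustration only
visits = pd.DataFrame(
    {"morning": [12, 30], "evening": [25, 8]},
    index=["weekday", "weekend"],
)

# data is the table itself; annot=True writes each value into its cell;
# cmap picks the colormap by name
ax = sns.heatmap(visits, annot=True, cmap="viridis")
```

The row and column labels of the DataFrame become the axis labels, and the color bar beside the plot shows which colors correspond to which values.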


---

##### **Q6.** Let's generate a heatmap of our dataset.
<span style="background-color: #FFD700">**Complete the code below.**</span>

In [None]:
import seaborn as sns

# generate the heatmap
heatmap = sns.heatmap(...)  # TODO: fill out the arguments in this line

# Resize the plot for better viewing
heatmap.figure.set_figwidth(15)
heatmap.figure.set_figheight(10)

Use the color bar beside the heatmap to see which shades correspond to higher values: with some colormaps darker shades mean higher values, and with others lighter shades do. We see 1's along the diagonal because every variable is perfectly correlated with itself. Exclude same-variable pairs when answering the questions below.

---

##### **Q7.** Which pair of variables is most positively correlated? Explain what this means in your own words.

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

---

##### **Q8.** Which pair of variables is the second most positively correlated? Explain what this means in your own words.

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

---

##### **Q9.** Which pair of variables is most negatively correlated? Explain what this means in your own words.

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

---

##### **Q10.** Considering that values near 0 are weakly correlated, what do you think about the variables in this dataset based on the heatmap? Do you see more strong correlations or more weak correlations? Do you think this heatmap shows us a clear link between certain variables and patient death?

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

### **Graded exercise** (5 marks total):
We can create boxplots of data grouped by a column in the DataFrame. For example, what if we wanted to see the distribution of patient ages in the "death" category and "no death" category within DEATH_EVENT? We could generate a plot with two boxplots: one for death, and one for no death.

Complete the code below to create boxplots that show the distribution of patient ages for each category of DEATH_EVENT (death/no death), and answer the questions that follow. You will want to use a different pandas function for boxplots, called ```boxplot()```; it works in much the same way as ```plot(kind='box')```, but you do not have to specify the kind, and its ```by``` parameter lets you group the data by a column.


| Function | Input parameters | Output | Syntax |
| --- | --- | --- | --- |
| boxplot() | column, by | A boxplot of the specified column in the dataframe. | df.boxplot(column, by) |

* column: the column that the boxplot is generated for.
* by: the column that you wish to group the data by. One boxplot will be generated for each group.



---

##### **GQ1.** Complete the code below to create boxplots that show the distribution of patient ages for each category of DEATH_EVENT (death/no death). (3 marks)

<span style="background-color: #FFD700">**Complete the code below.**</span>

In [None]:

age_death_box = df.boxplot(...) # TODO: fill in the arguments for the boxplot

# Setting the title of the plot
import matplotlib.pyplot as plt
plt.title('Box plot of age grouped by death event')
plt.suptitle("") # get rid of the automatic 'Box plot grouped by group_by_column_name' title
plt.show()

---

##### **GQ2.** Look at the boxplots you generated. What are the age range and median age for patients who died, and for patients who did not die? (1 mark)

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 



---

##### **GQ3.** Compare the boxplots to your stacked bar plot. Are your observations on the boxplots consistent with your analysis of the stacked bar plot? What do these plots suggest? (1 mark)

<span style="background-color: #FFD700">**Write your answer here.**</span>

Answer: 

---

## Conclusion

In this module you have learned:
1. Plotting using pandas
    * histogram
    * bar graph
    * stacked bar graph
    * box plot
2. Bivariate analysis
3. Plotting heatmaps using seaborn

## Further Reading
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html