# Tutorial 3: Visualizing Data
------------------------------------------------------------------------------------

Pictures can tell a story much more efficiently than text, particularly when showing relationships between variables. You have surely made many plots of data in the past using Excel or other (expensive) statistical software packages. The purpose of this tutorial is to enable you to use the powerful Python libraries that create plots representing your data in the best possible way. It is assumed that you already understand which kinds of plots are useful for your data and are just learning how to translate that knowledge from other plotting tools to Python.

## Learning Objectives
- Make plots with Matplotlib
- Make plots with Pandas

## Prerequisites
- Foundations in Python coding
- Pandas & NumPy
- Knowledge of plot types

## Getting Started
Import the required libraries in the next code box.

In [None]:
%pip install jupyterquiz
%pip install matplotlib
from jupyterquiz import display_quiz
import numpy as np
import matplotlib.pyplot as plt
print("All packages installed")

# Matplotlib
The main plotting library in Python is **Matplotlib**. It can do 2D and 3D plotting in "Matlab" style, and even [animate plots](https://matplotlib.org/stable/users/explain/animations/animations.html#sphx-glr-users-explain-animations-animations-py) (although that is beyond the scope of this introduction).

It is typical to import the library (matplotlib.pyplot) as plt to simplify its use, much like we have been using np and pd as shorthand for NumPy and Pandas in this module.

You could devote a whole course to [Matplotlib](https://matplotlib.org/stable/tutorials/index) as there are so many ways to plot but we will briefly cover mechanics and a few examples.

While it may seem easier to use Excel and its drop-down menus to make plots, there are some fantastic advantages with Matplotlib.
1. You can save figures at a resolution (dots per inch, dpi) high enough for publication-quality images: fig.savefig('fig.tiff', dpi=600)
2. While it takes a few more steps than Excel:
 - you CAN control all the characteristics
 - the code is reusable and can be applied to the next plot, rather than starting from scratch for each Excel or even SigmaPlot figure
 - Python is free
 - you can plot much larger datasets
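
As a minimal sketch of the first point, the block below draws a small figure and writes it to disk at 600 dpi. The data values, the temporary folder, and the file name dose_response.png are assumptions made for illustration; savefig() and its dpi argument are standard Matplotlib.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Hypothetical data, used only to have something to draw
dose = [0, 1, 10, 100]
viability = [100, 85, 60, 25]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(dose, viability, "o-")
ax.set_xlabel("Dose (µM)")
ax.set_ylabel("Cell Viability %")

# Write to a temporary folder; the file extension selects the output format
out_path = os.path.join(tempfile.mkdtemp(), "dose_response.png")
fig.savefig(out_path, dpi=600, bbox_inches="tight")
print("Saved", out_path)
```

Changing the extension (for example to .tiff or .pdf) should switch the output format, provided your installation supports it.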


## Making a plot with PyPlot

With Pyplot, we establish a figure, then annotate it and/or add all the necessary plotting elements. By default, the plot() function draws a line from point to point. 

The example data are a dose-response series for two drugs:

- Dose (µM): 0, 0.1, 0.5, 1, 5, 10, 50, 100, 200, 500
- Cell viability with drug 1 (%): 100, 98, 92, 85, 70, 60, 40, 25, 15, 5
- Cell viability with drug 2 (%): 100, 99, 97, 94, 90, 85, 80, 75, 70, 65


In [None]:
# Basic plotting 
import numpy as np
x=np.array((0, 0.1, 0.5, 1, 5, 10, 50, 100, 200, 500))
y=np.array((100, 98, 92, 85, 70, 60, 40, 25, 15, 5))
plt.plot(x,y)

# Change the points to red circles
#plt.plot(x,y,"ro")

#Add a 2nd dose-response curve
#drug2=np.array((100, 99, 97, 94, 90, 85, 80, 75, 70, 65))
#plt.plot(x,drug2)


## Matplotlib Object Hierarchy
It might help you to have a mental map of how matplotlib organizes information to make the plot. Most of this is built into the pyplot module.

<img alt="Anatomy of a Matplotlib figure" src="https://matplotlib.org/stable/_images/anatomy.png" width="400" height="400">

At the top of the hierarchy is the **Figure** object, which holds one or more **Axes**.

Below that are individual lines, grids, legends, text boxes, ticks, and labels.

This gives us fine-grained control over the plot.
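
The hierarchy can be inspected directly. The sketch below uses plt.subplots() to get explicit Figure and Axes objects (pyplot normally manages these for you) and then checks how they nest. The data values and labels are placeholders chosen for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure
from matplotlib.lines import Line2D

fig, ax = plt.subplots()              # Figure at the top, one Axes inside it
(line,) = ax.plot([0, 1, 2], [0, 1, 4], label="demo")  # a Line2D drawn on the Axes

# Lower-level elements are added to the Axes
ax.set_title("Hierarchy demo")
ax.set_xlabel("x")
ax.grid(True)
ax.legend()

# Confirm the nesting: the Figure holds the Axes, the Axes holds the line
print(isinstance(fig, Figure), isinstance(ax, Axes), isinstance(line, Line2D))
print(ax in fig.axes, line in ax.get_lines())
```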


### Adding to Plots
In plt.plot(), we can indicate line styles with a text string and add many chart elements. 

Additional data series can be added to the same figure with further calls to plt.plot().

<div class="alert alert-block alert-info"> <b>Tip:</b> Try adding and changing the plot elements. </div>

In [None]:
# axes look funny, change range?
fig = plt.figure()             #Create a figure, to which all parts need to be added
plt.title("Dose-response Curve")    #Plot title
plt.xlim(-10, 500)                  #x axis range
plt.plot(x,y,"--")             #line style
plt.plot(x,y,"mo")             #point style and color
plt.ylabel("Cell Viability %")       #Y axis label
plt.xlabel("Dose (µM)") #X-axis label

There are many built-in marker options (far more than this short list):
<table>
    <tr>
        <th>Marker</th>
        <th>Description</th>
        <th>Marker</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>o</td>
        <td>circle</td>
        <td>*</td>
        <td>star</td>
    </tr>
    <tr>
        <td>.</td>
        <td>point</td>
        <td>s</td>
        <td>square</td>     
    </tr>
    <tr>
        <td>x</td>
        <td>x (cross)</td>
        <td>D</td>
        <td>diamond</td>
    </tr>
    <tr>
        <td>v</td>
        <td>triangle down</td>
        <td>^</td>
        <td>triangle up</td>
    </tr>
</table>

You can also choose from standard colors:
![colors](https://matplotlib.org/stable/_images/sphx_glr_named_colors_001.png)
<div class="alert alert-block alert-info"> <b>Tip:</b> Try using the color and symbol information to customize the plot above. </div>
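
One way to combine markers and colors is sketched below, reusing the dose and viability values from the first plot. The variable names, the choice of "teal", and halving the second series are arbitrary and for illustration only. The keyword arguments marker, color, and linestyle do the same job as the short format strings such as "ro".

```python
import numpy as np
import matplotlib.pyplot as plt

dose = np.array([0, 0.1, 0.5, 1, 5, 10, 50, 100, 200, 500])
viability = np.array([100, 98, 92, 85, 70, 60, 40, 25, 15, 5])

plt.figure()
# Keyword form: dashed teal line with square markers
(styled,) = plt.plot(dose, viability, marker="s", color="teal", linestyle="--")
# Format-string form: "g^" means green triangle-up markers with no connecting line
(short,) = plt.plot(dose, viability / 2, "g^")
plt.xlabel("Dose (µM)")
plt.ylabel("Cell Viability %")
```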

## Histograms

A histogram is a type of bar plot in which the X-axis shows the bin ranges and the Y-axis shows the frequency.

Data appropriate for a histogram are continuous. Histograms show the overall distribution of numerical data, so you are displaying the *density* of data points.

The key variables of histograms that affect their appearance are:
1. Bins: The range of values is divided into a set of intervals called bins. The height of each bar represents the frequency of data points falling within that interval.
2. X-axis: Represents the values or ranges of the data being plotted.
3. Y-axis: Represents the frequency or count of data points within each bin.


The histogram plots an array of data. In this first demonstration, we will just create arrays of random distributions made by NumPy.

Matplotlib will automatically create bins, but you CAN choose how many there are. Play around with the number of bins and look at the alternative (exponential) array.

You have a lot of options for how your histogram could look. Here are some examples: 
![HistogramTypes](./images/mpl_hist.bmp)
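
To see what binning does to the numbers, the sketch below counts a simulated binomial sample with np.histogram and then passes the same bin edges to plt.hist. The seed, the choice of 5 bins, and the axis labels are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)         # seeded so the result is repeatable
values = rng.binomial(100, 0.2, size=1000)  # same distribution as the tutorial example

# An integer asks for that many equal-width bins
counts, edges = np.histogram(values, bins=5)
print("bin edges:", edges)
print("counts:   ", counts)

# plt.hist accepts the same edges, so the bars match the counts above
plt.figure(figsize=(4, 2))
plot_counts, plot_edges, _ = plt.hist(values, bins=edges, edgecolor="black")
plt.xlabel("Successes out of 100 trials")
plt.ylabel("Frequency")
```

Five bins always produce six edges, and the counts add up to the number of data points.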




In [None]:
# Make a Histogram
binom = np.random.binomial(100,0.2,1000)
plt.figure(figsize=(4, 2))
plt.title("Binomial Distribution Simulation")
plt.hist(binom)
#plt.hist(binom, bins=3)
#plt.hist(binom, bins=50)
#expon= np.random.exponential(1,1000)
#plt.hist(expon)

More than one set of data may be compared with a histogram. As with line plots, a second set of data is easy to add to a plot.


In [None]:
import os
import numpy as np
import pandas as pd

# Load simulated protein expression levels for two groups
prot_df=pd.read_csv("."+ os.sep + "Datasets"+ os.sep + "sim_prot_exp.csv")
# Plot histograms for "Control" and "Treated"
plt.figure(figsize=(10, 6))

#The first one will appear behind whichever one is added 2nd.
plt.hist(prot_df[prot_df['Group'] == 'Control']['Protein Expression (AU)'], bins=15, alpha=0.7, label='Control', color='blue', edgecolor='black')
plt.hist(prot_df[prot_df['Group'] == 'Treated']['Protein Expression (AU)'], bins=15, alpha=0.7, label='Treated', color='green', edgecolor='black')

plt.title('Protein Expression Levels: Control vs. Treated')
plt.xlabel('Protein Expression (AU)')
plt.ylabel('Frequency')
plt.legend()
plt.grid(axis='y', linestyle='--', linewidth=0.3)
plt.show()


We can use the cancer expression dataset that we used in the Pandas tutorial. Examining the histogram of expression data can be part of quality control, to ensure that there are no obvious outliers.


In [None]:
# Make a Histogram with Cancer Data
import pandas as pd
import os
cancer=pd.read_csv("." + os.sep + "Datasets" + os.sep +"cancer.csv")     # data is from Science 286:531-537
cancer.index=cancer.iloc[:,1]  #make the gene accession numbers, column 2, the indices of the dataframe
cancer.drop(cancer.columns[0:2], axis=1, inplace=True)     #drop the two columns of gene name and information
#cancer.head(n=3)

### Histograms in Pandas
The central wrapper is DataFrame.plot(). The default is a line plot. You can change this with the kind argument: 'bar', 'scatter', 'pie', and others. You can also call for a histogram directly (some other plots, such as scatter, also require X and Y):

df.plot.typeofplot()
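
The different calling styles can be compared in a short sketch. The DataFrame below holds made-up values for two hypothetical samples (sample_A and sample_B); only the plotting calls matter here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical expression values for two samples
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({"sample_A": rng.normal(10, 2, 200),
                   "sample_B": rng.normal(12, 2, 200)})

ax_line = df.plot()                                  # default: one line per column
ax_hist = df.plot(kind="hist", bins=20, alpha=0.6)   # chosen with the kind argument
ax_hist2 = df.plot.hist(bins=20, alpha=0.6)          # equivalent accessor form
ax_scatter = df.plot(kind="scatter", x="sample_A", y="sample_B")  # needs x and y
```

Each call returns a Matplotlib Axes object, so titles and labels can be added afterwards in the usual way.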


In [None]:
cancer.plot.hist(legend=False)

# Show the plot
plt.show()

In [None]:
# Normalize each gene's expression levels (row-wise normalization)
cancer_norm = cancer.div(cancer.sum(axis=1), axis=0)
cancer_norm.plot.hist(legend=False)

# Show the plot
plt.show()
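
The row-wise normalization used above can be checked on a tiny table. The gene and sample names and the counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical counts: rows are genes, columns are samples
toy = pd.DataFrame({"s1": [2, 10], "s2": [2, 30], "s3": [4, 60]},
                   index=["geneA", "geneB"])

row_totals = toy.sum(axis=1)            # one total per gene (row)
toy_norm = toy.div(row_totals, axis=0)  # axis=0 aligns the totals with the row index

print(toy_norm)
print(toy_norm.sum(axis=1))             # each row now sums to 1
```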

### Histogram Styles
We can add arguments to plt.hist() for different styles.
Check the help for plt.hist() to see the available options.

In [None]:
# Histogram with Various Arguments
plt.figure()
plt.title("Binomial Distribution Simulation")
plt.grid(True)
plt.hist(binom,bins=20, align="left",
         orientation = "horizontal",
         color="r",alpha=0.75);

### Plot Legends
Legends are added with plt.legend().
We can use the label arguments already given to each plot call, or provide labels to plt.legend() directly.

In [None]:
# Legend
plt.figure()
plt.plot(x,x,label="identity")
plt.plot(x,x**2,label="square")
plt.plot(x,x*2,label="scale")
plt.legend(loc="upper left")
#help(plt.legend)

Pandas plotting works the same way for any DataFrame or Series. The central wrapper is DataFrame.plot(), and the default is a line plot.
You can change this with the kind argument: 'bar', 'scatter', 'pie', and others.
You can also call hist directly.

In [None]:
# Import Dataframe
import os
file= "." + os.sep + "Data"
states = pd.read_csv("." + os.sep +"Datasets" + os.sep + "states.csv")
states["Income_cat"] = pd.qcut(states.Income,3,labels=["low","med","high"])

# Histograms 
states.Income.plot(kind="hist")
states.Income.hist(color='k', alpha=0.5, bins=50)

To annotate a Pandas plot, for example with a title, first set up the blank canvas (the figure) and then annotate it.

## Scatter Plot

Scatterplots are powerful tools for visualizing the relationship between two variables by plotting individual data points on a two-dimensional plane, enabling you to uncover patterns, trends, or clusters in your data.

To create a scatterplot in Matplotlib, you need to provide two arrays (or similar structures) that represent the x-coordinates and y-coordinates of the data points. These arrays determine the position of each point on the plot.

**Basic Requirements for Matplotlib**
1. Two Arrays:
 * One for the x-axis (independent variable).
 * One for the y-axis (dependent variable).
The arrays must be of the same length.

2. Optional Enhancements:
 * Additional arguments for customization, such as color (c), size (s), labels, etc.
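
The requirements above can be sketched with simulated data. All of the values below are randomly generated for illustration; c sets the color of each point from a numeric category and s sets the marker area.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical measurements, for illustration only
rng = np.random.default_rng(seed=2)
x_vals = rng.uniform(0, 10, 50)
y_vals = 2 * x_vals + rng.normal(0, 3, 50)   # same length as x_vals
group = rng.integers(0, 2, 50)               # 0/1 category used for color
sizes = 20 + 5 * x_vals                      # marker area in points squared

plt.figure()
points = plt.scatter(x_vals, y_vals, c=group, s=sizes, cmap="viridis", alpha=0.8)
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.colorbar(points, label="group")
```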

**If you use pandas**, call df.plot(x='Column1', y='Column2', kind='scatter').
Pandas uses the column labels for the X and Y axes.

The cells below show both approaches with our states data.

In [None]:
states=pd.read_csv("." + os.sep + "Datasets" + os.sep +"states.csv")  

#Matplotlib version

plt.figure()
plt.scatter(states["Population"],states["Murder"])
plt.title('US states relationship between Murder rate and Population')
plt.xlabel('Population (10K)')
plt.ylabel('Murder rate per 100K')

In [None]:
# Pandas version
# plot life expectancy versus illiteracy
#states.plot(x="Life Exp",y="Illiteracy")  #The default is a line plot
states.plot(x="Life Exp", y="Illiteracy", kind="scatter")

### Scatter matrix

This is a popular way to view univariate and bivariate distributions.
The diagonal shows a histogram of each variable, and the off-diagonal panels are scatterplots of **all** pairs of variables.

In [None]:
# Scatter matrix
from pandas.plotting import scatter_matrix
scatter_matrix(states);  # The semicolon suppresses a long array listing all the components
#states.corr()

## Test your knowledge

Write Python code to do the following (answers are at the end), then run the quiz in the subsequent Python code box:
1. Make a HORIZONTAL histogram of the illiteracy data across the states (from the states dataset) with 5 bins
2. Add a title & choose a new color.
3. Load the diabetes dataset as a Pandas dataframe: pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t"). *Y is a quantitative measure of disease progression one year after baseline*
4. Create a scatter plot of BMI versus age, with each sex in a different color

In [None]:
#Write your plot code here

In [None]:
from jupyterquiz import display_quiz
plotqz= "PythonQuizQuestions/matplot.json"
display_quiz(plotqz)

## Multiple Plots per Figure
Subplots allow you to create multiple plots (axes) within a single figure in Matplotlib. They are particularly useful when you want to display related plots side by side for comparison or when visualizing multiple aspects of a dataset.

There are two main ways to create subplots:
1. plt.subplot() (Simple Grid-Based Subplots):

  - Specify a grid layout (e.g., 2x2) and plot in a specific cell.
  - Example: plt.subplot(2, 2, 1) means a 2x2 grid, and the plot is in the first cell.

2. plt.subplots() (More Flexible and Modern):

  - Creates a grid layout and returns figure and axes objects for better control.
  - Use axes[i, j] to reference individual subplots.
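
The second approach can be sketched as follows. The 2x2 layout, the functions plotted, and the titles are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 4 * np.pi, 400)

# One Figure and a 2x2 array of Axes, indexed as axes[row, column]
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(t, np.sin(t), "k--")
axes[0, 0].set_title("sin(t)")
axes[0, 1].plot(t, np.cos(t), "b-")
axes[0, 1].set_title("cos(t)")
axes[1, 0].plot(t, t ** 2, "r-")
axes[1, 0].set_title("t squared")
axes[1, 1].hist(np.sin(t), bins=20, color="gray")
axes[1, 1].set_title("histogram of sin(t)")

fig.tight_layout()   # reduce overlap between titles and tick labels
```

With a single row or a single column, plt.subplots() returns a one-dimensional array, so the panels are indexed as axes[i] instead.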

In [None]:
# Multiple Plots Per Figure
x = np.arange(0,4*np.pi,0.01)
plt.figure()
plt.subplot(211)
plt.plot(x,np.sin(x),"k--")
plt.ylim(-1.2,1.2)
plt.title('Time variations')
plt.subplot(212)
plt.plot(x,x**2,"b-")


# Save figure 
#plt.savefig("myplot.png")

## Boxplots
Boxplots and bar plots are used for different purposes in data visualization. 

A bar plot represents categorical data using bars, where the height of each bar shows a summary statistic like the mean or count. 

In contrast, a boxplot displays the **distribution** of numerical data, showing key statistics like the median, quartiles, and potential outliers, making it ideal for comparing variability and spread across groups.

While bar plots focus on central tendencies, boxplots emphasize the range and shape of the data.

The shape and location of the box and whiskers in a boxplot provide insights into the distribution of the data:

1. **Box (Middle 50% of Data):** The box represents the interquartile range (IQR), which is the range between the 25th percentile (Q1) and the 75th percentile (Q3). A wide box indicates high variability in the middle 50% of the data, while a narrow box suggests low variability.
2. **Line Inside the Box (Median):** The line inside the box shows the median (50th percentile), giving the central tendency of the data. If the median is closer to one end of the box, it indicates a skewed distribution.
3. **Whiskers:** The whiskers extend from the box to the smallest and largest data points that lie within 1.5 times the IQR of the box edges (Q1 and Q3). Short whiskers indicate that most data is tightly clustered around the median, while long whiskers suggest a wider spread.
4. **Outliers** (Points Beyond Whiskers): Any individual points outside the whiskers are considered outliers, representing unusual or extreme values.

We have already seen a histogram of the prot_df (simulated protein expression) pandas dataframe we loaded for this tutorial. A boxplot of the same data compares the two groups.


In [None]:
prot_df.boxplot(column="Protein Expression (AU)", by="Group", grid=False)

# Add title and labels
plt.title("Protein expression with treatment")
plt.suptitle("")  # Remove default "by" subtitle
plt.ylabel("Protein Expression")
plt.show()

### Seaborn approach
If you want to show the individual data points on the boxplot, it is very easy to do with the seaborn library (matplotlib requires iterating over the dataframe to add each point).
You can see that seaborn is even more intuitive in how it infers the labels you want on the figure.

In [None]:
import seaborn as sns
# Create boxplot
sns.boxplot(x="Group", y="Protein Expression (AU)", data=prot_df)

# Overlay data points
sns.stripplot(x="Group", y="Protein Expression (AU)", data=prot_df, color="red", alpha=0.6, jitter=True)

## Heatmaps

Heatmaps are useful for visualizing correlations, distributions, or other matrix-like data. They have been used extensively in bioinformatics because they are an intuitive way to show large amounts of related data. They are really quite simple to create with matplotlib.

Heatmaps can:
- Display abundance levels of proteins or metabolites across conditions or treatments.
- Show the levels of DNA methylation at different CpG sites across samples.
- Show the statistical significance (e.g., p-values) of enriched pathways or gene sets for different conditions.
- Display the expression levels of thousands of genes across multiple samples.

The key parameters needed for matplotlib are:
- **data:** A 2D array or matrix to visualize.
- **cmap:** Colormap (e.g., 'viridis', 'coolwarm', etc.).
- **interpolation:** Determines how values are smoothed (e.g., 'nearest' avoids smoothing).

In [None]:
# Example data
data = np.array([[1.0, 0.8, 0.5],
                 [0.8, 1.0, 0.3],
                 [0.5, 0.3, 1.0]])

# Row and column labels
labels = ['Gene1', 'Gene2', 'Gene3']

# Create the heatmap
plt.figure(figsize=(6, 6))
plt.imshow(data, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label='Correlation Coefficient')

# Add labels
plt.xticks(range(len(labels)), labels, rotation=45, ha='right')
plt.yticks(range(len(labels)), labels)
plt.title('Heatmap with Labels')
plt.show()

# Conclusion

You should now be able to make effective plots with Matplotlib. Now that you know the syntax, it might be useful to print a copy of the [Matplotlib cheatsheet](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png).

The next tutorial will explore [Inferential Statistics](./Submodule_2_Tutorial_4_InferentialStatistics.ipynb).


## Clean up
Remember to shut down your Jupyter Notebook compute instance when you are done for the day to avoid unnecessary charges. 

## Solutions to Test your Knowledge

In [None]:
# Histogram of states data with Various Arguments
plt.figure()
plt.title("Illiteracy")
plt.grid(True)
plt.hist(states['Illiteracy'], bins=5, align="left",
         orientation = "horizontal",
         color="m",alpha=0.75);
plt.xlabel('Frequency')             # call the function; assigning to plt.xlabel would overwrite it
plt.ylabel('Illiteracy rate (%)')
plt.show()

In [None]:
import pandas as pd

#get diabetes data and add the file into a DataFrame
# Use \t as the delimiter for tab-separated values
diabetes = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t")
# Display the DataFrame
print(diabetes.head())

In [None]:
# Scatter plot of BMI versus age, colored by sex
diabetes.plot.scatter(x="AGE", y="BMI", c="SEX", colormap='viridis', colorbar=False);
plt.title('Diabetic Subjects BMI with Age')