# **Data Science Basics in Python Series**

## **Lecture 3: Matplotlib for Univariate Data Visualization in Python**

**Michael Pyrcz, Associate Professor, The University of Texas at Austin**

### **Data Visualization with Matplotlib in Python for Engineers and Geostatisticians**

This tutorial demonstrates **Univariate Data Visualization in Python**. In Python, a common tool for visualization is the **Matplotlib Python package**:
* Initiated by John Hunter along with many contributors.
* An open-source project sponsored by [NumFOCUS](https://numfocus.org/).

This tutorial includes methods and operations that are commonly required by engineers and geostatisticians working with data visualization for the purpose of:
* Data checking/ Data cleaning
* Data mining/ Inferential Data Analysis
* Predictive Modeling

for Data Analytics, Geostatistics and Machine learning.

### **Data Visualization**

Data visualization includes any graphical representation of the data.

We will demonstrate the basic concepts with only
* univariate distributions, histograms

We will start simple and add more complexity and customization.

### **Project Goal**

Learn the basics of univariate data visualization in Python to build practical spatial data analytics, geostatistics and machine learning workflows.
* The focus is on customization, not a survey of available plot types.

### **Load and Configure the Required Libraries**

The following code loads the required libraries and sets the plotting defaults.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)  # Control of axis ticks
plt.rc('axes', axisbelow=True)                                     # set axes and grids in the background for all plots
from matplotlib.patches import Rectangle                           # Drawing shapes on plots

### **Loading the Dataset**

Let's load a tabular dataset from a .csv file, [spatial_nonlinear_MV_facies_v1.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/spatial_nonlinear_MV_facies_v1.csv)



In [None]:
table = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/spatial_nonlinear_MV_facies_v1.csv')

table = table.iloc[:,1:]
print('The table is a ' + str(type(table)) + '.')
table.head()

### **Extract the Feature from the Table**

In [None]:
por = table['Porosity'].values   # extract porosity feature as a 1D ndarray
print('The por is a ' + str(type(por)) + ' of shape ' + str(por.shape) + '.')
np.set_printoptions(precision=10, threshold=100)
por
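
Before plotting, it can help to check the extracted array for missing values and to look at its range, in line with the data checking purpose stated above. The following is a minimal sketch on a synthetic array; `por_demo`, its distribution parameters and the injected missing value are assumptions for illustration and are not part of the dataset.

```python
import numpy as np

# Synthetic stand-in for the porosity array (assumption: values in percent)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)
por_demo[10] = np.nan                       # inject one missing value for the demonstration

n_missing = int(np.isnan(por_demo).sum())   # count the missing values
por_clean = por_demo[~np.isnan(por_demo)]   # keep only the valid values

print('Missing values: ' + str(n_missing))
print('Min: ' + str(round(por_clean.min(), 2)) + ', Max: ' + str(round(por_clean.max(), 2)))
```

The same checks could be applied to `por` before passing it to a plotting function.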

### **Histogram**

Here's the basic histogram plot:
* Quite a plain plot
* Axes unlabeled

In [None]:
plt.hist(x=table['Porosity'].values)
plt.show()
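
To see what the basic histogram is counting, the binning can be reproduced with `np.histogram`, which also uses 10 equal-width bins by default. This is a small sketch on synthetic data; `por_demo` is an assumed stand-in for the porosity array.

```python
import numpy as np

# Synthetic stand-in for the porosity array (assumption)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)

# 10 equal-width bins spanning the minimum to the maximum of the data
counts, edges = np.histogram(por_demo, bins=10)

print('Number of bins: ' + str(len(counts)))
print('Bin width: ' + str(round(edges[1] - edges[0], 3)))
print('Total count: ' + str(counts.sum()))
```

Because the default bins run from the data minimum to the data maximum, the bin edges usually do not fall on round numbers, which is one motivation for specifying the bins explicitly later on.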

### **Design the Plot Space**

Let's improve the plot by considering and designing the plot space.
* Label the axes (`xlabel()`, `ylabel()`)
* Add a grid (`.grid()`) to improve our ability to perform 'ocular inspection'
* We explicitly control the plot size, start considering readability.
* Consider color (color = string) to separate the elements, i.e. foreground and background 

In [None]:
plt.hist(x=por, color='red')
plt.xlabel('Porosity (%)')
plt.ylabel('Frequency')
plt.title('Porosity Histogram')
plt.grid()

plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()
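
An alternative way to control the plot size explicitly is to request a figure size when the figure is created. The sketch below uses Matplotlib's object-oriented interface with `plt.subplots(figsize=...)`; the size of 6 by 4 inches and the synthetic `por_demo` array are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the porosity array (assumption)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)

fig, ax = plt.subplots(figsize=(6, 4))   # figure size in inches (width, height)
ax.hist(por_demo, color='red')
ax.set_xlabel('Porosity (%)')            # axes methods mirror plt.xlabel(), plt.ylabel(), plt.title()
ax.set_ylabel('Frequency')
ax.set_title('Porosity Histogram')
ax.grid()
```

In a notebook the figure is displayed at the end of the cell; in a script, a call to `plt.show()` would be added.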

### **Compose the Elements**

Let's consider how we can combine all the elements to improve clarity.
* Outline the histogram bars (edgecolor = string) to separate the binning of the data.
* Use transparency (alpha < 1.0) to further improve the 'ocular inspection'

In [None]:
plt.hist(alpha=0.2, edgecolor='black', x=por, color='red')
plt.xlabel('Porosity (%)')
plt.ylabel('Frequency')
plt.title('Porosity Histogram')
plt.grid()

plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()

### **Improve the Consistency Between Elements**

Let's improve the consistency of the plot elements 
* Specify the axes' extents (`.xlim()`, `.ylim()`) and align the y-axis increments with integer frequencies.
* Only show the grid on y and add a minor grid and ticks for readability.

In [None]:
plt.hist(bins=np.linspace(0,30,20), alpha=0.2, edgecolor='k', x=por, color='red')
plt.xlabel('Porosity (%)')
plt.ylabel('Frequency')
plt.title('Porosity Histogram')

plt.xlim(0,30)
plt.ylim(0,80)
plt.gca().yaxis.grid(True, which='major', linewidth=1.0)
plt.gca().yaxis.grid(True, which='minor', linewidth=0.2)
plt.gca().tick_params(which='major', length=7)
plt.gca().tick_params(which='minor', length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator())
plt.gca().yaxis.set_minor_locator(AutoMinorLocator())

plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show() 


### **Histogram Ultimate Control**

Let's further improve the consistency between our plot elements and add hierarchy to the labels.

* Specify the histogram bins (bins=list), grid and ticks to align with histogram bins.
* Adjust the font sizes (fontsize = float)

In [None]:
plt.hist(bins=np.linspace(0,30,31), alpha = 0.3, edgecolor='b', x=por, color='red')
plt.gca().tick_params(which='major', length=7)
plt.gca().tick_params(which='minor', length=4)
plt.gca().yaxis.grid(True, which='major', linewidth=1.0)
plt.gca().yaxis.grid(True, which='minor', linewidth=0.2)
plt.xlabel('Porosity (%)', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.title('Porosity Histogram', fontsize=20)
plt.xlim(0,30)
plt.ylim(0,45)

plt.gca().xaxis.set_major_locator(MultipleLocator(1))
plt.gca().yaxis.set_minor_locator(MultipleLocator(1))
plt.gca().yaxis.set_major_locator(MultipleLocator(5))

plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show() 
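
The bin specification is what makes the ticks line up with the bars. `np.linspace(0,30,31)` returns 31 edges, which define 30 bins of width 1, so a major tick every 1 unit on the x axis falls on each bin edge. By comparison, `np.linspace(0,30,20)` returns 20 edges and 19 bins of width 30/19. A short check:

```python
import numpy as np

edges_20 = np.linspace(0, 30, 20)    # 20 edges define 19 bins
edges_31 = np.linspace(0, 30, 31)    # 31 edges define 30 bins

width_20 = np.diff(edges_20)[0]      # 30/19, about 1.58, does not align with integer ticks
width_31 = np.diff(edges_31)[0]      # 1.0, aligns with MultipleLocator(1)

print('Bins: ' + str(len(edges_20) - 1) + ', width: ' + str(round(width_20, 3)))
print('Bins: ' + str(len(edges_31) - 1) + ', width: ' + str(round(width_31, 3)))
```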

### **Make a Custom Plot Formatting Function**

This is a lot of code. Let's make a custom function to format our histogram.
* We could make our function more flexible with the addition of function arguments.
* This is helpful for concise workflows, especially when small plots are reused.
* We will not discuss the definition and use of **styles**.

In [None]:
def format_hist():             # function declaration
    plt.xlabel('Porosity (%)', fontsize=15)
    plt.ylabel('Frequency', fontsize=15)
    plt.title('Porosity Histogram', fontsize=20)
    plt.xlim(0,30)
    plt.ylim(0,45)
    plt.gca().tick_params(which='major', length=7)
    plt.gca().tick_params(which='minor', length=4)
    plt.gca().yaxis.grid(True, which='major', linewidth=1.0)
    plt.gca().yaxis.grid(True, which='minor', linewidth=0.2)
    plt.gca().xaxis.set_major_locator(MultipleLocator(1))
    plt.gca().yaxis.set_minor_locator(MultipleLocator(1))
    plt.gca().yaxis.set_major_locator(MultipleLocator(5))
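
As noted above, the function could be made more flexible with arguments. The following is one possible sketch and not the function used in the rest of this lecture; the name `format_hist_flex`, its argument names and defaults, and the synthetic `por_demo` array are all assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

def format_hist_flex(xlabel='Porosity (%)', ylabel='Frequency', title='Porosity Histogram',
                     xlim=(0, 30), ylim=(0, 45), ax=None):
    """Hypothetical variant of format_hist() with arguments (names are assumptions)."""
    ax = plt.gca() if ax is None else ax          # default to the current axes
    ax.set_xlabel(xlabel, fontsize=15)
    ax.set_ylabel(ylabel, fontsize=15)
    ax.set_title(title, fontsize=20)
    ax.set_xlim(*xlim)
    ax.set_ylim(*ylim)
    ax.tick_params(which='major', length=7)
    ax.tick_params(which='minor', length=4)
    ax.yaxis.grid(True, which='major', linewidth=1.0)
    ax.yaxis.grid(True, which='minor', linewidth=0.2)
    ax.xaxis.set_major_locator(MultipleLocator(1))
    ax.yaxis.set_minor_locator(MultipleLocator(1))
    ax.yaxis.set_major_locator(MultipleLocator(5))
    return ax

# Synthetic stand-in for the porosity array (assumption)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)

fig, ax = plt.subplots()
ax.hist(por_demo, bins=np.linspace(0, 30, 31), alpha=0.2, edgecolor='b', color='red')
format_hist_flex(title='Synthetic Porosity Histogram', ylim=(0, 60), ax=ax)
```

With the defaults left unchanged, the sketch applies the same formatting as `format_hist()`, so the arguments only need to be supplied when a plot differs.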





### **Histogram with Custom Formatting Function**

Let's demonstrate the application of the formatting function.
* Same result with much less code

In [None]:
plt.hist(bins=np.linspace(0,30,31), alpha=0.2, edgecolor='b', x=por, color='red')
format_hist()
plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1)
plt.show()

### **Add a Custom Legend**

We may want to communicate key statistics with a custom legend by adding shapes and annotation to our plot.


In [None]:
plt.hist(bins=np.linspace(0,30,31), alpha=0.2, edgecolor='b', x=por, color='red')
format_hist()
plt.gca().add_patch(Rectangle((23.5,33.5), 4.5,9, facecolor='white', edgecolor='black', linewidth=0.5))
plt.text(24,40.5, 'Mean ' + str(round(np.average(por),1)))
plt.text(24, 38.5, 'P90   ' + str(round(np.percentile(por,90),1)))
plt.text(24, 36.5, 'P10   ' + str(round(np.percentile(por,10),1)))
plt.text(24, 34.5, "n       " + str(por.shape[0]))

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1)
plt.show()

### **Highlight Data Feature**

Let's highlight the summary statistics on the plot with lines.


In [None]:
plt.hist(bins=np.linspace(0,30,31), alpha=0.2, edgecolor='b', x=por, color='red')
format_hist()
plt.gca().add_patch(Rectangle((23.5,33.5), 4.5,9, facecolor='white', edgecolor='black', linewidth=0.5))
plt.text(24,40.5, 'Mean ' + str(round(np.average(por),1)))
plt.text(24, 38.5, 'P90   ' + str(round(np.percentile(por,90),1)))
plt.text(24, 36.5, 'P10   ' + str(round(np.percentile(por,10),1)))
plt.text(24, 34.5, "n       " + str(por.shape[0]))
p10 = np.percentile(por,10)
avg = np.average(por)
p90 = np.percentile(por,90)
plt.plot([p10,p10],[0.0,45], color='b', linestyle='dashed')
plt.plot([avg,avg],[0.0,45], color='b')
plt.plot([p90,p90],[0.0,45], color='b', linestyle='dashed')

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1)
plt.show()
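
The lines above are drawn with `plt.plot()` between y = 0 and y = 45, which matches the y limit set in `format_hist()`. An alternative is `axvline()`, which draws a vertical line across the full height of the axes whatever the y limits are. A minimal sketch on synthetic data follows; `por_demo` and the `_demo` statistics are assumed stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the porosity array (assumption)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)

p10_demo = np.percentile(por_demo, 10)
avg_demo = np.average(por_demo)
p90_demo = np.percentile(por_demo, 90)

fig, ax = plt.subplots()
ax.hist(por_demo, bins=np.linspace(0, 30, 31), alpha=0.2, edgecolor='b', color='red')
line_p10 = ax.axvline(p10_demo, color='b', linestyle='dashed')   # spans the full axes height
line_avg = ax.axvline(avg_demo, color='b')
line_p90 = ax.axvline(p90_demo, color='b', linestyle='dashed')
```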

### **From Start to End**

Let's compare our original and final plot

In [None]:
plt.figure(figsize = (10,4))

plt.subplot(121)
plt.hist(por)

plt.subplot(122)
plt.hist(bins=np.linspace(0,30,31), alpha=0.2, edgecolor='b', x=por, color='red')
format_hist()
plt.gca().add_patch(Rectangle((23.7,33.5),4.5,9, facecolor='white', edgecolor='black', linewidth=0.5))
plt.text(24, 40.5, 'Mean ' + str(round(np.average(por),1)))
plt.text(24, 38.5, 'P90   ' + str(round(np.percentile(por,90),1)))
plt.text(24, 36.5, 'P10   ' + str(round(np.percentile(por,10),1)))
plt.text(24, 34.5, "n       " + str(por.shape[0]))
p10 = np.percentile(por,10)
avg = np.average(por)
p90 = np.percentile(por,90)
plt.plot([p10,p10],[0.0,45], color='b', linestyle='dashed')
plt.plot([avg,avg],[0.0,45], color='b')
plt.plot([p90,p90],[0.0,45], color='b', linestyle='dashed')

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1)
plt.show()

### **Multiple Plots**

Now let's separate the sample by facies element.

* Facies 0 - Shalestone
* Facies 1 - Sandstone

In [None]:
sand = table[table['Facies'] == 1]['Porosity'].values   # extract the sandstone samples
shale = table[table['Facies'] == 0]['Porosity'].values  # extract the shalestone samples
print('There are ' + str(len(sand)) + ' sandstone samples.' )
print('There are ' + str(len(shale)) + ' shalestone samples.')
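
The same split can be cross-checked with a pandas `groupby()`, which counts the samples in each facies in one step. The sketch below uses a small synthetic table; the `demo` DataFrame and its values are assumptions, and only the column names `Facies` and `Porosity` are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in table with the same two column names (values are assumptions)
rng = np.random.default_rng(seed=73073)
demo = pd.DataFrame({'Facies': rng.integers(0, 2, size=200),
                     'Porosity': rng.normal(loc=15.0, scale=4.0, size=200)})

sand_demo = demo.loc[demo['Facies'] == 1, 'Porosity'].values    # boolean mask and column in one step
shale_demo = demo.loc[demo['Facies'] == 0, 'Porosity'].values

counts_by_facies = demo.groupby('Facies')['Porosity'].count()   # number of samples per facies
print(counts_by_facies)
```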

### **Combine Multiple Plots with Subplots**

We can combine plots with subplots
* The subplot specification is $[n_{rows}][n_{columns}][\text{current_index}]$, where current\_index runs from 1 to $n_{rows} \times n_{columns}$.
* The index starts at the top left and increases across the columns and then down the rows (like reading a page).

In [None]:
plt.figure(figsize=(10,4))

plt.subplot(121)
plt.hist(x=sand, color='yellow', alpha = 0.5, edgecolor='k', bins=np.linspace(0,30,31), label='sand')
plt.legend()
format_hist()

plt.subplot(122)
plt.hist(x=shale, color='brown', alpha=0.5, edgecolor='k', bins=np.linspace(0,30,31), label='shale')
plt.legend()
format_hist()

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1, wspace= 0.2)
plt.show()
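
To make the index order concrete, the sketch below builds an empty 2 by 3 grid and titles each subplot with its index. The grid size is an assumption for illustration. Note that `plt.subplot(121)` is shorthand for `plt.subplot(1, 2, 1)`.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(9, 4))
n_rows, n_cols = 2, 3

axes = []
for index in range(1, n_rows * n_cols + 1):      # index runs from 1 to n_rows * n_cols
    ax = plt.subplot(n_rows, n_cols, index)      # same as the 3-digit form when all values are below 10
    ax.set_title('index ' + str(index))
    axes.append(ax)
```

Indices 1 to 3 fill the top row from left to right, and indices 4 to 6 fill the bottom row.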

### **Combine Multiple Plots within the Same Plot Space**

We can combine plots in the same plot space. We do this by adding plots in sequence.

In [None]:
plt.hist(x=sand, color='yellow', alpha = 0.5, edgecolor='k', bins=np.linspace(0,30,31), label='sand')
plt.hist(x=shale, color='brown', alpha=0.5, edgecolor='k', bins=np.linspace(0,30,31), label='shale')
plt.legend()
format_hist()

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1)
plt.show()


### **Specifying Plot Order**

We can use the `zorder` argument to specify the plotting order. Elements with a higher `zorder` are drawn on top.

In [None]:
plt.figure(figsize=(10,4))

plt.subplot(121)
plt.hist(zorder=1 ,x=sand, color='yellow', alpha = 1, edgecolor='k', bins=np.linspace(0,30,31), label='sand')
plt.hist(zorder=2 ,x=shale, color='brown', alpha=1, edgecolor='k', bins=np.linspace(0,30,31), label='shale')
plt.legend()
format_hist()

plt.subplot(122)
plt.hist(zorder=2, x=sand, color='yellow', alpha = 1, edgecolor='k', bins=np.linspace(0,30,31), label='sand')
plt.hist(zorder=1, x=shale, color='brown', alpha=1, edgecolor='k', bins=np.linspace(0,30,31), label='shale')
plt.legend()
format_hist()

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1, wspace= 0.2)
plt.show()

### **Adding PDF to Histogram**

We can combine multiple types of plots. Here we add the PDFs estimated with kernel density estimation.

In [None]:
plt.figure(figsize=(8,4))

import seaborn as sns
plt.hist(zorder=1 ,x=sand, color='yellow', alpha = 0.5, edgecolor='k', bins=np.linspace(0,30,31), label='sand', density= True)
plt.hist(zorder=2 ,x=shale, color='brown', alpha= 0.5, edgecolor='k', bins=np.linspace(0,30,31), label='shale', density= True)
sns.kdeplot(x=sand, color='orange', alpha=0.8, levels=1)
sns.kdeplot(x=shale, color='black', alpha=0.8, levels=1)
plt.legend()
plt.xlabel('Porosity (%)')
plt.ylabel('Density / Probability')
plt.title('Normalized Histogram and PDFs')
plt.subplots_adjust(left=0.0, bottom=0.0, top=1.0, right=1.1)
plt.show()
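
The `density=True` argument is what puts the histogram on the same scale as a PDF: the bar heights are rescaled so that the total bar area (height times bin width) equals 1. This can be checked with `np.histogram`, which accepts the same argument. The `por_demo` array is a synthetic stand-in.

```python
import numpy as np

# Synthetic stand-in for the porosity array (assumption)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)

density, edges = np.histogram(por_demo, bins=np.linspace(0, 30, 31), density=True)
area = np.sum(density * np.diff(edges))      # sum of bar height times bar width

print('Total area under the normalized histogram: ' + str(round(area, 6)))
```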

### **Plotting a Cumulative Distribution Function**

It is quite easy to switch from a histogram to a CDF:
* We set the `cumulative` argument to `True`.
* We switch the `histtype` argument to `'step'` or `'stepfilled'`.

In [None]:
plt.hist(cumulative=True, histtype='stepfilled', x=shale, color='brown', alpha = 1, edgecolor='k', bins=np.linspace(0,30,31), label='shale', density=True )
plt.hist(cumulative=True, histtype='stepfilled', x=sand, color='yellow', alpha = 1, edgecolor='k', bins=np.linspace(0,30,31), label='sand', density=True )

plt.legend(loc='upper left')
plt.xlim(0,30)
plt.ylim(0,1)
plt.xlabel('Porosity (%)')
plt.ylabel('Cumulative Probabilities')
plt.title('Cumulative Distribution Function')

plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()
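
The cumulative histogram above is a binned approximation of the CDF. An empirical CDF without binning can be sketched by sorting the data and assigning each value the cumulative probability i/n. This is an alternative to the histogram approach, shown on a synthetic `por_demo` array (an assumption), and i/n is only one of several common plotting-position conventions.

```python
import numpy as np

# Synthetic stand-in for the porosity array (assumption)
rng = np.random.default_rng(seed=73073)
por_demo = rng.normal(loc=15.0, scale=4.0, size=500)

n = len(por_demo)
x_sorted = np.sort(por_demo)                 # data values in ascending order
cum_prob = np.arange(1, n + 1) / n           # cumulative probability i/n for the i-th sorted value

print('Smallest value: ' + str(round(x_sorted[0], 2)) + ', cumulative probability: ' + str(cum_prob[0]))
print('Largest value: ' + str(round(x_sorted[-1], 2)) + ', cumulative probability: ' + str(cum_prob[-1]))
```

The result could be plotted with `plt.step(x_sorted, cum_prob, where='post')` and formatted in the same way as the plots above.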