# **Data Science Basics in Python Series**

## **Lecture 4. Matplotlib for Bivariate Data Visualization in Python**

**Michael Pyrcz, Associate Professor, The University of Texas at Austin**

### **Data Visualization with Matplotlib in Python for Engineers and Geoscientists**


### **Data Visualization**
Data visualization includes any graphical representations of the data.

We will demonstrate the basic concepts with only:
* bivariate distributions with univariate histograms

We will start simple and add more complexity and customization.

### **Project Goal**

Learn the basics of bivariate data visualization in Python to build practical spatial data analytics, geostatistics and machine learning workflows.
* Focus on customization rather than a survey of the available plot types.

### **Load and Configure the Required Libraries**

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)  # control the axes ticks
plt.rc('axes', axisbelow=True)
from matplotlib.patches import Rectangle
import seaborn as sns

### **Loading the Dataset**

Let's load the tabular dataset from the `.csv` file [spatial_nonlinear_MV_facies_v1.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/spatial_nonlinear_MV_facies_v1.csv)


In [None]:
df = pd.read_csv(r'https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/nonlinear_facies_v1.csv')
df = df.iloc[:,1:]  # remove the first feature

print('The tabular data is a ' + str(type(df)) + ' with ' + str(len(df)) + ' samples.')
df.head()

### **Extract the Features from the Table**

I do this for concise and readable code.

In [None]:
por = df['Por'].values
perm = df['Perm'].values

print('The por is a ' + str(type(por)) + ' of shape ' + str(por.shape) + '.')
print('The perm is a ' + str(type(perm)) + ' of shape ' + str(perm.shape) + '.')

### **Summary Statistics for Plotting**

Let's calculate the minimum and maximum values for each feature and assign a good range for plotting

In [None]:
print('The porosity range is [' + str(np.min(por)) + ',' + str(np.max(por)) + '].')
print('The permeability range is [' + str(np.min(perm)) + ',' + str(np.max(perm)) + '].')
pormin = 0.0; pormax = 32.0
permmin = 0.0; permmax = 1350.0
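
The plotting ranges above were assigned by hand after reading the printed minimum and maximum. As a rough sketch of one way to automate that choice, the hypothetical helper `nice_range` below (not part of the original workflow) pads the data range outward to the nearest multiple of a chosen step. The demonstration values and the step size are made up for illustration.

```python
import numpy as np

def nice_range(values, step):
    """Hypothetical helper: pad the data range outward to multiples of `step`."""
    lo = np.floor(np.min(values) / step) * step   # round the minimum down
    hi = np.ceil(np.max(values) / step) * step    # round the maximum up
    return float(lo), float(hi)

# made-up values standing in for a feature column
demo = np.array([3.2, 14.8, 27.6])
demo_min, demo_max = nice_range(demo, step=4.0)
print('The plotting range is [' + str(demo_min) + ',' + str(demo_max) + '].')
```

A hand-picked range is still reasonable when we want several plots to share the same axes, which is the case in the rest of this lecture.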

### **Univariate**

Let's start with univariate, the basic histogram plots for each of the two features

In [None]:
plt.subplot(121)
plt.hist(x=por, edgecolor='k', bins=np.linspace(pormin,pormax,11), color='red', alpha=0.2)
plt.xlim([pormin, pormax])
plt.xlabel('Porosity (%)')
plt.ylabel('Frequency')
plt.title('Porosity Histogram')
plt.grid(axis='y')

plt.subplot(122)
plt.hist(x=perm, edgecolor='k', bins=np.linspace(permmin, permmax,11), color='red', alpha=0.2)
plt.xlim([permmin, permmax])
plt.xlabel('Permeability (mD)')
plt.ylabel('Frequency')
plt.title('Permeability Histogram')
plt.grid(axis='y')

plt.subplots_adjust(left=0.0, bottom=0.0, right = 2.0, top=1.1)
plt.show()

Interesting, the two features have distinctly different distributions
* porosity is symmetric while permeability is positively skewed

### **Scatter plot**

Let's now look at the relationship between the two features


In [None]:
plt.scatter(x=por, y=perm)

This is an interesting dataset
* nonlinear, heteroscedastic, multiple populations

### **Design the Plot space**

Let's improve the plot by considering and designing the plot space
* Label the axes (`plt.xlabel()`, `plt.ylabel()`)
* Add a grid (`plt.grid()`) to improve our ability to perform 'ocular inspection'
* Explicitly control the plot size and start considering readability
* Consider color (`color=string`) to separate elements, for instance foreground and background

In [None]:
plt.scatter(x=por,y=perm, color='red')
plt.xlabel('Porosity (%)')
plt.ylabel('Permeability (mD)')
plt.title('Permeability - Porosity Scatter Plot')
plt.grid()

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.1, right=1.0)
plt.show()
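
The cells in this lecture enlarge the plot with `plt.subplots_adjust()`. An alternative, sketched here with synthetic stand-in data, is to set the figure size in inches when the figure is created and to work with the returned `Axes` object. The size of 6.0 by 4.5 inches is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=73)
x_demo = rng.normal(15.0, 3.0, 200)                   # synthetic stand-in for porosity
y_demo = 20.0 * x_demo + rng.normal(0.0, 40.0, 200)   # synthetic stand-in for permeability

fig, ax = plt.subplots(figsize=(6.0, 4.5))            # width and height in inches
ax.scatter(x_demo, y_demo, color='red')
ax.set_xlabel('Porosity (%)')
ax.set_ylabel('Permeability (mD)')
ax.set_title('Permeability - Porosity Scatter Plot')
ax.grid()
```

In a notebook the figure renders at the end of the cell. Elsewhere, call `plt.show()` or `fig.savefig()` to see it.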

### **Compose the Elements**

Let's think more about how we can combine all the elements to improve clarity
* Outline the data points (edgecolor = string) to better separate the data.
* Use transparency (alpha < 1.0) to further improve our perception of relative data density, joint probability.

In [None]:
plt.scatter(x=por,y=perm, color='red', edgecolor='k', alpha=0.1)
plt.xlabel('Porosity (%)')
plt.ylabel('Permeability (mD)')
plt.title('Permeability - Porosity Scatter Plot')
plt.grid()

plt.subplots_adjust(left=0.0, bottom=0.0, top=1.1, right=1.0)
plt.show()

### **Improve the Consistency between Elements**

Let's improve the consistency of the plot elements
* Specify the axes' extents (`plt.xlim()`, `plt.ylim()`) so they match the plotting ranges assigned above
* Add a minor grid and ticks for readability

In [None]:
plt.scatter(x=por, y=perm, edgecolor='k', color='red', alpha=0.1)
plt.xlabel('Porosity (%)')
plt.ylabel('Permeability (mD)')
plt.title('Porosity - Permeability scatter plot')
plt.xlim([pormin, pormax])
plt.ylim([permmin, permmax])

plt.gca().grid(True, which='major', linewidth=1.0)
plt.gca().grid(True, which='minor', linewidth=0.2)
plt.gca().tick_params(which='major', length=7)
plt.gca().tick_params(which='minor', length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator())
plt.gca().yaxis.set_minor_locator(AutoMinorLocator())

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1)
plt.show()
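
`AutoMinorLocator` picks the minor tick spacing for us. When we want fixed spacing, `MultipleLocator` (imported at the top of this lecture) places ticks at multiples of a base value. The sketch below uses an empty plot and tick spacings of 4 and 150, which are assumptions chosen to divide the plotting ranges evenly.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, AutoMinorLocator

fig, ax = plt.subplots()
ax.set_xlim(0.0, 32.0)
ax.set_ylim(0.0, 1350.0)
ax.xaxis.set_major_locator(MultipleLocator(4.0))     # major tick every 4 porosity units
ax.yaxis.set_major_locator(MultipleLocator(150.0))   # major tick every 150 mD
ax.xaxis.set_minor_locator(AutoMinorLocator(4))      # 4 minor intervals per major interval
ax.yaxis.set_minor_locator(AutoMinorLocator(3))      # 3 minor intervals per major interval
ax.grid(True, which='major', linewidth=1.0)
ax.grid(True, which='minor', linewidth=0.2)

major_x = list(ax.get_xticks())                      # major tick locations on the x axis
```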

### **Make a Custom plot function**

We have a good plot now, but it requires quite a bit of code
* We can make a convenient function to make this plot for any dataset.

In [None]:
def my_scatterplot(x,xmin,xmax,xlabel,y,ymin,ymax,ylabel,title):
  plt.scatter(x=x,y=y,s=15,color='red',edgecolor='k', alpha=0.05)
  plt.xlabel(xlabel)
  plt.ylabel(ylabel)
  plt.title(title)
  plt.xlim([xmin,xmax])
  plt.ylim([ymin,ymax])
  plt.grid()

  plt.gca().grid(True,which='major',linewidth=1.0)
  plt.gca().grid(True,which='minor',linewidth=0.2)
  plt.gca().tick_params(which='major',length=7)
  plt.gca().tick_params(which='minor',length=4)
  plt.gca().xaxis.set_minor_locator(AutoMinorLocator())
  plt.gca().yaxis.set_minor_locator(AutoMinorLocator())

print('my_scatterplot is a ' + str(type(my_scatterplot)) + '.')

Now, let's try out our custom plot function

In [None]:
my_scatterplot(por,pormin,pormax,'Porosity (%)',perm,permmin,permmax,'Permeability (mD)','Porosity-Permeability scatter plot')
plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()
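
The function above always draws on the current axes. One possible variation, sketched below, accepts an `Axes` object so that the same styling can be reused across several panels. The name `scatter_on_ax` and the two synthetic datasets are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator

def scatter_on_ax(ax, x, xmin, xmax, xlabel, y, ymin, ymax, ylabel, title):
    """Hypothetical variant of my_scatterplot that draws on the Axes it is given."""
    ax.scatter(x, y, s=15, color='red', edgecolor='k', alpha=0.05)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.set_xlim([xmin, xmax])
    ax.set_ylim([ymin, ymax])
    ax.grid(True, which='major', linewidth=1.0)
    ax.grid(True, which='minor', linewidth=0.2)
    ax.tick_params(which='major', length=7)
    ax.tick_params(which='minor', length=4)
    ax.xaxis.set_minor_locator(AutoMinorLocator())
    ax.yaxis.set_minor_locator(AutoMinorLocator())

rng = np.random.default_rng(seed=73)
x_demo = rng.uniform(0.0, 32.0, 500)                         # synthetic feature
y_linear = 40.0 * x_demo + rng.normal(0.0, 60.0, 500)        # synthetic linear response
y_curved = 1.2 * x_demo ** 2 + rng.normal(0.0, 60.0, 500)    # synthetic nonlinear response

fig, axes = plt.subplots(1, 2, figsize=(12.0, 4.5))          # two panels side by side
scatter_on_ax(axes[0], x_demo, 0.0, 32.0, 'X', y_linear, 0.0, 1350.0, 'Y', 'Synthetic linear case')
scatter_on_ax(axes[1], x_demo, 0.0, 32.0, 'X', y_curved, 0.0, 1350.0, 'Y', 'Synthetic nonlinear case')
```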

### **Adding a $3^{rd}$ Dimension**

Of course, we can plot the facies categorical feature to observe the mixture of 3 distinct populations

In [None]:
scatter = plt.scatter(x=por,y=perm,c=df['Facies'],edgecolor='k', alpha=0.1, cmap=plt.cm.Dark2)
plt.xlabel('Porosity (%)')
plt.ylabel('Permeability (mD)')
plt.title('Porosity - Permeability Scatter plot')
plt.xlim([pormin, pormax])
plt.ylim([permmin,permmax])
plt.grid()

legend = plt.gca().legend(*scatter.legend_elements(), loc='lower right', title='Facies')
plt.gca().add_artist(legend)

plt.gca().grid(True,which='major',linewidth=1.0)
plt.gca().grid(True,which='minor',linewidth=0.2)
plt.gca().tick_params(which='major',length=7)
plt.gca().tick_params(which='minor',length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator())
plt.gca().yaxis.set_minor_locator(AutoMinorLocator())

plt.subplots_adjust(left=0.0, bottom=0.0,right=1.0, top=1.1)
plt.show()
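
A legend built from `scatter.legend_elements()` shows the numeric facies codes. If we want names in the legend, one option is to plot each facies separately with its own `label`. The sketch below uses synthetic data, and the mapping of codes 1, 2, 3 to 'Sand', 'Mixed', 'Shale' is an assumption that should be checked against the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=73)
facies_demo = rng.integers(1, 4, size=300)                  # synthetic facies codes 1, 2, 3
por_demo = rng.normal(10.0 + 4.0 * facies_demo, 2.0)        # synthetic porosity-like values
perm_demo = rng.lognormal(3.0 + 0.8 * facies_demo, 0.5)     # synthetic permeability-like values

facies_names = {1: 'Sand', 2: 'Mixed', 3: 'Shale'}          # assumed code-to-name mapping

fig, ax = plt.subplots()
for i, (code, name) in enumerate(facies_names.items()):
    mask = facies_demo == code                              # samples belonging to this facies
    ax.scatter(por_demo[mask], perm_demo[mask], color=plt.cm.Dark2(i),
               edgecolor='k', alpha=0.3, label=name)
ax.set_xlabel('Porosity (%)')
ax.set_ylabel('Permeability (mD)')
legend = ax.legend(loc='lower right', title='Facies')
legend_labels = [text.get_text() for text in legend.get_texts()]
```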


### **Bivariate Conditional Distributions**

It is often useful to calculate conditional statistics of one feature within bins of the other feature. First we make our bins over porosity.

In [None]:
nbins = 8  # number of bin boundaries, which gives nbins - 1 = 7 bins
por_bins = np.linspace(pormin, pormax, nbins)  # set the bin boundaries and then the centroids for plotting
por_centroids = np.linspace((por_bins[0] + por_bins[1])*0.5, (por_bins[nbins-2] + por_bins[nbins-1])*0.5, nbins-1)
df['por_bins'] = pd.cut(df['Por'],por_bins, labels=por_centroids)   # cut on the boundaries and labels with centroids
df.head()
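
The bin centroids are the midpoints between consecutive boundaries, so they can also be computed by averaging neighbouring edges. The sketch below shows this on a made-up three-sample table. The choice of 9 boundaries (8 bins of width 4) is for illustration and differs from the cell above.

```python
import numpy as np
import pandas as pd

edges = np.linspace(0.0, 32.0, 9)                 # 9 boundaries define 8 bins of width 4
centroids = 0.5 * (edges[:-1] + edges[1:])        # midpoint of each pair of boundaries

demo = pd.DataFrame({'Por': [1.0, 5.0, 31.0]})    # made-up values
demo['por_bins'] = pd.cut(demo['Por'], edges, labels=centroids)  # label each sample with its bin centroid
print(demo)
```

Note that `pd.cut` uses right-inclusive intervals by default, so a sample exactly at the lowest boundary is left unbinned unless `include_lowest=True` is set.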

Then we calculate the conditional statistics and add to our custom plot

In [None]:
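# Calculate the conditional statistics (observed=False keeps empty bins so the results align with por_centroids)
cond_exp = df.groupby('por_bins', observed=False)['Perm'].mean()
cond_P10 = df.groupby('por_bins', observed=False)['Perm'].quantile(.1)
cond_P90 = df.groupby('por_bins', observed=False)['Perm'].quantile(.9)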

my_scatterplot(por,pormin,pormax,'Porosity (%)',perm,permmin,permmax,'Permeability (mD)','Porosity - Permeability scatter plot')
plt.plot(por_centroids, cond_exp, color='black',label='Expectation')
plt.scatter(por_centroids,cond_exp,color='red',edgecolor='k',zorder=10)
plt.plot(por_centroids, cond_P90, linestyle='--', color='black', linewidth=1.0, label='P90')
plt.plot(por_centroids, cond_P10, linestyle='-.', color='black', linewidth=1.0, label='P10')
plt.legend(loc='lower right')
plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()
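
To see what the conditional statistics mean, it helps to work through a table small enough to check by hand. In the sketch below, the four samples and the two bins are made up. The conditional expectation in each bin is the mean of the permeability values whose porosity falls in that bin, and P10 and P90 are the matching quantiles.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Por': [1.0, 2.0, 5.0, 6.0],            # made-up porosity values
                    'Perm': [10.0, 30.0, 100.0, 300.0]})    # made-up permeability values
edges = np.array([0.0, 4.0, 8.0])                           # two bins: (0, 4] and (4, 8]
centroids = 0.5 * (edges[:-1] + edges[1:])                  # bin midpoints 2.0 and 6.0
toy['por_bins'] = pd.cut(toy['Por'], edges, labels=centroids)

grouped = toy.groupby('por_bins', observed=False)['Perm']   # permeability given the porosity bin
cond = pd.DataFrame({'mean': grouped.mean(),
                     'P10': grouped.quantile(0.1),
                     'P90': grouped.quantile(0.9)})
print(cond)
```

The first bin holds permeabilities 10 and 30, so its conditional mean is 20. The second bin holds 100 and 300, so its conditional mean is 200.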

### **Bivariate Joint Distributions**

We may want to visualize the bivariate joint distributions in our data. We can use the `seaborn` Python package to accomplish this with a kernel density estimate plot.



In [None]:
sns.kdeplot(data=df,x='Por',y='Perm',cmap=sns.color_palette('inferno',as_cmap=True),
            levels=np.linspace(0.05,0.9,10),bw_adjust=0.3, label='Train Density',
            cbar=True,fill=True) # estimate the joint PDF (fill replaces the deprecated shade argument)
plt.scatter(x=por,y=perm,s=10,marker='o',color='black',alpha=0.1)  # add the data scatter plot
plt.xlabel('Porosity (%)')
plt.ylabel('Permeability (mD)')
plt.title('Porosity - Permeability, Joint Density Plot')
plt.xlim([pormin, pormax])
plt.ylim([permmin, permmax])
plt.grid()

plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()
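
A kernel density estimate places a small smooth bump (kernel) on every data point and averages the bumps. The sketch below is a simplified NumPy version that uses a product of two Gaussian kernels with hand-picked bandwidths. It is not what `seaborn` computes internally. The helper name `kde2d`, the bandwidths and the data are all assumptions for illustration.

```python
import numpy as np

def kde2d(x, y, grid_x, grid_y, hx, hy):
    """Hypothetical helper: product-Gaussian KDE evaluated on a regular grid."""
    norm_x = hx * np.sqrt(2.0 * np.pi)
    norm_y = hy * np.sqrt(2.0 * np.pi)
    # kernel value of every grid node against every sample, one axis at a time
    kx = np.exp(-0.5 * ((grid_x[:, None] - x[None, :]) / hx) ** 2) / norm_x
    ky = np.exp(-0.5 * ((grid_y[:, None] - y[None, :]) / hy) ** 2) / norm_y
    # sum the kernel products over samples, then average
    return kx @ ky.T / x.size                    # shape (len(grid_x), len(grid_y))

rng = np.random.default_rng(seed=73)
x_demo = rng.normal(0.0, 1.0, 500)                    # synthetic feature
y_demo = 0.6 * x_demo + rng.normal(0.0, 0.8, 500)     # synthetic correlated feature

grid_x = np.linspace(-6.0, 6.0, 121)
grid_y = np.linspace(-6.0, 6.0, 121)
density = kde2d(x_demo, y_demo, grid_x, grid_y, hx=0.3, hy=0.3)

cell_area = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])
total_probability = density.sum() * cell_area         # should be close to 1
print('Total probability over the grid: ' + str(round(total_probability, 3)))
```

A smaller bandwidth follows the data more closely and a larger one smooths more, which is the role `bw_adjust` plays in the `seaborn` call above.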

### **Bivariate Marginals and Joint Distributions**

We may also want to visualize the bivariate marginal and joint distributions together. We can use the `seaborn` Python package to accomplish this with a joint plot.

In [None]:
sns.jointplot(x='Por',y='Perm',data=df, kind='kde', xlim=[pormin,pormax],ylim=[permmin,permmax],
              fill=False,levels=10,cmap=plt.cm.inferno, thresh=0.01)  # fill and levels replace the older shade and n_levels
plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()

### **Bivariate Marginals and Joint Distributions with Multiple Populations**

Let's repeat the previous bivariate plot with the populations separated by facies

In [None]:
sns.jointplot(x='Por',y='Perm',data=df,kind='kde',hue='Facies',
              xlim=[pormin,pormax],ylim=[permmin,permmax],palette=plt.cm.Dark2
              )
plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()

### **Bivariate Binned Marginals and Joint Distributions**

Finally, we can bin the bivariate marginal and joint distributions.

In [None]:
sns.jointplot(x='Por',y='Perm',data=df,kind='hist',xlim=[pormin,pormax],ylim=[permmin,permmax])
plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1)
plt.show()
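
The binned joint plot is built on a 2D histogram: each cell counts the samples that fall inside a porosity bin and a permeability bin at the same time. The sketch below uses `np.histogram2d` on synthetic data to show that summing the joint counts along one axis recovers the marginal histogram of the other feature. The bin counts and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=73)
por_demo = np.clip(rng.normal(15.0, 4.0, 1000), 0.0, 32.0)         # synthetic, kept inside the range
perm_demo = np.clip(rng.lognormal(5.0, 0.8, 1000), 0.0, 1350.0)    # synthetic, kept inside the range

por_edges = np.linspace(0.0, 32.0, 9)       # 8 porosity bins
perm_edges = np.linspace(0.0, 1350.0, 10)   # 9 permeability bins

joint_counts, _, _ = np.histogram2d(por_demo, perm_demo, bins=[por_edges, perm_edges])
por_marginal = joint_counts.sum(axis=1)     # sum over permeability bins
perm_marginal = joint_counts.sum(axis=0)    # sum over porosity bins

print('Joint table shape: ' + str(joint_counts.shape))
```

Dividing `joint_counts` by its total gives the joint probabilities, and the marginals follow from the same row and column sums.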

### **Conclusion**

This was a basic overview of bivariate visualization for data science basics in Python.