<a href="https://colab.research.google.com/github/Rocks-n-Code/PythonCourse/blob/master/3%20-%20Pandas%20%26%20Matplotlib.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas and Matplotlib
In this section we'll introduce pandas and make figures with matplotlib.

To get started with our code we'll import the libraries we need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

#Enable plots to show up in the jupyter notebook
%matplotlib inline

#Increase the number of columns that will display
pd.options.display.max_columns = 50

#Set the size of our plots
phi = (1 + 5 ** 0.5) / 2
plt.rcParams['figure.figsize'] = [10*phi, 10]
#You can also comment in code cells by using "#" to ignore the line. Useful for testing too.

We'll then need to load our file into a pandas "dataframe." Pandas will try to work out automatically what type of data is in each column. We want to preserve the leading "0" in our API numbers, so they need to be treated as strings rather than numbers. To do that, we tell pandas what type of data is in that column with the `dtype` argument. We'll be using data from the [USGS Core Research Center](https://www.usgs.gov/core-science-systems/nggdp/core-research-center) in Lakewood, CO.

In [None]:
file_path = 'https://raw.githubusercontent.com/Rocks-n-Code/PythonCourse/master/data/cores.csv'
data_types = {'API Num':str}
df = pd.read_csv(file_path, dtype=data_types)
df.shape #(Row count, Columns count)
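
To see why the `dtype` argument matters, here is a minimal sketch using a tiny in-memory CSV. The two API numbers are made up for illustration and are not taken from the core data.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; these API numbers are invented for the example
csv_text = "API Num,State\n0512345678,CO\n4900112345,WY\n"

# Let pandas guess the type: the column is read as integers
as_number = pd.read_csv(io.StringIO(csv_text))

# Force the column to be read as text
as_string = pd.read_csv(io.StringIO(csv_text), dtype={'API Num': str})

print(as_number['API Num'].tolist())  # leading zero is lost
print(as_string['API Num'].tolist())  # leading zero is kept
```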

We see that we have 16,321 rows and 25 columns of data. We already increased the number of displayed columns in the first cell, so to preview the data we'll use the `df.head()` command.

In [None]:
#Preview our data
df.head()

We can select the data we wish to work with from our dataframe, either by location or with filters. A single column or row selected from a dataframe is a pandas "series," and a single cell is just a value. Python counts from `0`, so the first row or column is index 0. Let's look at the API number (column index 7) in the 2nd row (index 1) using two different methods.

In [None]:
print(df.at[1,'API Num'])
print(df.iloc[1,7])
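
As a small sketch of the difference between the two methods, the toy dataframe below (invented values) shows that `.at` looks a cell up by row label and column name, while `.iloc` uses integer positions. `columns.get_loc` is one way to find a column's position instead of counting by hand.

```python
import pandas as pd

# Toy dataframe with made-up wells and API numbers
toy = pd.DataFrame({'Well Name': ['A-1', 'B-2', 'C-3'],
                    'API Num': ['0500100001', '0500100002', '0500100003']})

by_label = toy.at[1, 'API Num']            # row label 1, column name 'API Num'
by_position = toy.iloc[1, 1]               # 2nd row, 2nd column (both count from 0)
col_pos = toy.columns.get_loc('API Num')   # integer position of the column

print(by_label, by_position, col_pos)
```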

We can also filter to select data, using one or multiple criteria. Filtering rows this way returns a smaller `DataFrame`; selecting a single column returns a pandas `Series`.

In [None]:
df[df['Field'] == 'JANICE']

In [None]:
df[(df['Field'] == 'JANICE') & 
   (df['Source'] == 'CENTER OF SECTION') & 
   (df['Well Name'] == '1 HARRISON')]

Or we can define the column of data that we want. `Series.tolist()` will format our data into a list.

In [None]:
df.loc[(df['Field'] == 'JANICE') &
       (df['Source'] == 'CENTER OF SECTION') &
       (df['Well Name'] == '1 HARRISON'), 'API Num'].tolist()
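
The sketch below, on a toy dataframe with invented values, shows what type comes back from each kind of selection: a row filter gives a `DataFrame`, a single column gives a `Series`, and `.tolist()` turns that series into a plain Python list.

```python
import pandas as pd

# Toy dataframe; field names and depths are made up
toy = pd.DataFrame({'Field': ['JANICE', 'JANICE', 'OTHER'],
                    'Depth': [900, 1500, 2000]})

rows = toy[toy['Field'] == 'JANICE']    # filter rows -> DataFrame
column = toy['Depth']                   # pick one column -> Series

# Filter rows and pick a column in one step with .loc[rows, column]
deep_janice = toy.loc[(toy['Field'] == 'JANICE') & (toy['Depth'] > 1000), 'Depth']

print(type(rows).__name__, type(column).__name__)
print(deep_janice.tolist())
```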

Now let's see how many cores are in each state with a loop. `Series.unique()` will return the distinct values in a column, which gives us the list of states to loop over.

In [None]:
states = df['State'][df['State'].notnull()].unique().tolist()
total_count = 0
state_counts = []

#For loop
for state in states:
    #filter to those rows that are from the state in the loop, see the shape, take the row count
    state_count = df[df.State == state].shape[0] 
    state_counts.append(state_count)
    total_count += state_count
    print(state,':',state_count)
    
null_rows = df[df['State'].isnull()].shape[0]
print('Null :', null_rows)
print('Total :', total_count + null_rows)
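
The loop is useful for learning how filters work, but pandas can also do the same count in one call. This is a minimal sketch with `Series.value_counts()` on made-up data; `dropna=False` keeps the null rows in the tally.

```python
import pandas as pd

# Toy column of states with one missing value (made-up data)
toy = pd.DataFrame({'State': ['CO', 'WY', 'CO', None, 'ND', 'CO']})

counts = toy['State'].value_counts()                        # nulls are skipped
counts_with_null = toy['State'].value_counts(dropna=False)  # nulls are counted too

print(counts)
print('Total :', counts_with_null.sum())
```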


Now let's take that same data and graph it with a bar graph in matplotlib.

In [None]:
N = len(states)
ind = np.arange(N)
width = 0.35
p1 = plt.bar(ind, state_counts, width)
plt.ylabel('Cores')
plt.title('Cores by State')
plt.xticks(ind, states)
plt.yticks(np.arange(0, 6001, 500))
plt.show()
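
As an alternative sketch, matplotlib can take the state names directly as the x values, so the `ind` positions and the separate `xticks` call are not strictly needed. The data here is made up, and the figure/axes style (`plt.subplots()`) is simply one common way to write it.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: one row per core, made-up states
toy = pd.DataFrame({'State': ['CO', 'WY', 'CO', 'ND', 'CO', 'WY']})
counts = toy['State'].value_counts()   # sorted from most to fewest cores

fig, ax = plt.subplots()
bars = ax.bar(counts.index.tolist(), counts.values, width=0.35)
ax.set_ylabel('Cores')
ax.set_title('Cores by State')
```

In a notebook with `%matplotlib inline` the figure appears at the end of the cell; in a script you would add `plt.show()`.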

---

## Example Two: Clay Typing

In this example we will load some spectral gamma ray data, calculate Vshale, and crossplot potassium (K) against thorium (Th).


In [None]:
#Load Data & Set DEPT to Index
las_df = pd.read_csv('https://raw.githubusercontent.com/Rocks-n-Code/PythonCourse/master/data/Spectral_GR.csv')
las_df.set_index('DEPT',inplace=True)

#Preview Data
las_df.head()

Let's calculate Vshale. First we'll get a clean GR reading from the clean sands around 585' and a GR reading for shale around 4835'.

In [None]:
#Use .loc so the slice is always by depth (the DEPT index), never by row position
GRclean = las_df.loc[580:590, 'GR'].mean()
GRshale = las_df.loc[4830:4840, 'GR'].mean()
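
Slicing with `.loc` on a depth index selects by depth value and includes both end points. Here is a small sketch with a made-up four-sample log to show which rows fall in the window.

```python
import pandas as pd

# Toy log indexed by depth; GR values are invented
toy = pd.DataFrame({'GR': [30.0, 32.0, 34.0, 90.0]},
                   index=pd.Index([580.0, 585.0, 590.0, 600.0], name='DEPT'))

window = toy.loc[580:590, 'GR']   # label-based slice: 580 and 590 are both included
window_mean = window.mean()

print(window.index.tolist(), window_mean)
```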

Next we'll write a function for Vshale and apply it to our GR to make a new column.

In [None]:
def Vshale(gr,GRclean=GRclean,GRshale=GRshale):
    vshale = (gr - GRclean)/(GRshale-GRclean)
    return vshale

Next we will use `.apply(lambda x: <function>(x))` to calculate a new column.

In [None]:
las_df['VSHALE'] = las_df['GR'].apply(lambda x: Vshale(x))

In [None]:
las_df.head()
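
Because the Vshale formula is plain arithmetic, it can also be applied to the whole column at once without `.apply`. The sketch below uses made-up GR readings and end points; the `clip` step is an optional extra (not part of the calculation above) that keeps values between 0 and 1 when a reading falls outside the clean and shale end points.

```python
import pandas as pd

# Made-up GR readings and end points for illustration
gr = pd.Series([20.0, 60.0, 100.0, 140.0], name='GR')
gr_clean, gr_shale = 20.0, 120.0

# Same linear formula as Vshale(), applied to the whole series at once
vshale = (gr - gr_clean) / (gr_shale - gr_clean)

# Optional: limit the result to the physical range 0 to 1
vshale_clipped = vshale.clip(lower=0, upper=1)

print(vshale.tolist())
print(vshale_clipped.tolist())
```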

Let's crossplot the spectral gamma data to learn more about what type of clays we are dealing with. Coloring the points by our Vshale calculation lets us see how the relative volume of clay changes with clay type.

In [None]:
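#Background Image (recent matplotlib versions no longer read URLs directly, so open it with Pillow)
from urllib.request import urlopen
from PIL import Image
im = np.array(Image.open(urlopen('https://github.com/Rocks-n-Code/PythonCourse/blob/master/img/3_KTHcrossplot_crop.png?raw=true')))
implot = plt.imshow(im)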

#Image is 689x411 pixels and spans 0-5 (K) by 0-20 (Th) on its axes
colormap = plt.cm.gist_rainbow 
normalize = mpl.colors.Normalize(vmin=0, vmax=1)
plt.scatter(las_df['POTA'].apply(lambda x: x*(689/5)),       #Scale to image size & scale
            las_df['THOR'].apply(lambda x: -x*(411/20)+411), #Scale to image size & scale
            s=48,               #Size of dot
            c=las_df['VSHALE'], #column to use for color scale
            cmap=colormap,      #color map
            norm=normalize,
            alpha=0.3)          #Alpha

#Set Axis Scales
plt.xticks([0,689], [0,5])      #change the x axis
plt.yticks([411,0], [0,20])     #change the y axis

#Set Axis Labels
plt.xlabel('K (%)')             #label x axis
plt.ylabel('Th (ppm)')          #label y axis

#Set Color Bar
cbar = plt.colorbar(shrink=0.5)
cbar.set_label('VSHALE', rotation=90)

#Show Plot
plt.show()
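
An alternative to rescaling the data into pixel coordinates is to tell `imshow` what data range the image covers with its `extent` argument, then plot the data in its own units. This is only a sketch: a blank array stands in for the crossplot image, and the K and Th values are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the background image: a blank 411 x 689 pixel RGB array
im = np.ones((411, 689, 3))

# Made-up potassium (%) and thorium (ppm) readings
pota = np.array([1.0, 2.5, 4.0])
thor = np.array([4.0, 10.0, 16.0])

fig, ax = plt.subplots()
# extent = [left, right, bottom, top] in data units; aspect='auto' lets the image stretch
ax.imshow(im, extent=[0, 5, 0, 20], aspect='auto')
ax.scatter(pota, thor, s=48, alpha=0.3)   # no pixel scaling needed
ax.set_xlabel('K (%)')
ax.set_ylabel('Th (ppm)')
```

With this approach the tick labels already read in K (%) and Th (ppm), so the manual `xticks` and `yticks` relabeling is not required.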

---

## Example Three: Maturity in North Dakota

Let's take a look at some more data, this time from the National Energy Geochemical Survey database.

SOURCE: https://energy.usgs.gov/GeochemistryGeophysics/GeochemistryLaboratories/GeochemistryLaboratories-GeochemistryDatabase.aspx
        

KEY: https://mrdata.usgs.gov/geochem/about.php

Let's open a text file of the analysis.

In [None]:
chem = pd.read_csv('https://raw.githubusercontent.com/Rocks-n-Code/PythonCourse/master/data/Analysis_abrv.csv', #Original was 3,138,631 rows and 191 Mb; smaller file used for online example.
                   dtype={'OrderID':str,'SampleNumber':str,'AnalysisGroup':str,
                          'Matrix':str,'Analysis':str,'Param':str,'Units':str,'Comments':str},
                   encoding = "ISO-8859-1",
                   low_memory=False)

Preview your data.

In [None]:
print(chem.shape)
chem.head()

Let's filter down to only Rock-Eval data by filtering to the contents of a list with `.isin(<list>)`.

In [None]:
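#Make a list of values to filter to only rock-eval data.
parms = ['OI', 'S1', 'S2', 'S3', 'TMAX', 'TOC', 'HI', 'S2S3', 'PI', 'PC']
#.copy() gives us an independent dataframe, so adding columns later won't raise a SettingWithCopyWarning
rockeval = chem[(chem['Analysis']=='Rock-Eval')&(chem['Param'].isin(parms))].copy()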
print(rockeval.shape)
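
Here is a minimal sketch of `.isin()` on a toy table (all values invented). It returns a True/False mask that can be combined with other conditions using `&`, or inverted with `~` to exclude the listed values instead.

```python
import pandas as pd

# Toy analysis table; the rows are made up
toy = pd.DataFrame({'Analysis': ['Rock-Eval', 'Rock-Eval', 'XRD', 'Rock-Eval'],
                    'Param': ['TOC', 'TMAX', 'QTZ', 'XYZ'],
                    'Result': [2.1, 441.0, 35.0, 0.0]})

keep = ['TOC', 'TMAX']

# Rows that are Rock-Eval AND whose Param is in the list
subset = toy[(toy['Analysis'] == 'Rock-Eval') & (toy['Param'].isin(keep))]

# Rows whose Param is NOT in the list
excluded = toy[~toy['Param'].isin(keep)]

print(subset['Param'].tolist(), excluded['Param'].tolist())
```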

Now let's open an Excel file of sample information. We'll calculate two new columns in it, then merge it with our analysis data.

In [None]:
#Open an excel file to a dataframe
samples = pd.read_excel('https://github.com/Rocks-n-Code/PythonCourse/blob/master/data/Samples.xlsx?raw=true',converters={'API':str})

#Calculate TVD SS
samples['TVDSS_top'] = samples['ELEVF'] - samples['TOPF']
samples['TVDSS_bot'] = samples['ELEVF'] - samples['BOTF']

#Preview the first three rows of the dataframe
samples.head(3)

In [None]:
samples.shape

Let's make sure the data is in the correct format and merge the two dataframes.

In [None]:
#Set the columns to a string format
rockeval['SampleNumber'] = rockeval['SampleNumber'].astype(str)
samples['SampleNumber'] = samples['SampleNumber'].astype(str)

#Merge the sample location dataframe to the analysis dataframe
rockeval = rockeval.merge(samples,how='left',on=['OrderID','SampleNumber','Matrix'])

In [None]:
rockeval.head(4)

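Notice the Comments columns ending in '_x' and '_y'? This occurs when the same column name exists in both dataframes and is not one of the columns being merged on. The '_x' column comes from the left dataframe and the '_y' column from the right.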
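
The toy merge below (invented rows) reproduces this behavior and shows that the `suffixes` argument of `merge` can give the duplicated columns more meaningful names. The suffix names chosen here are just an illustration.

```python
import pandas as pd

# Two toy dataframes that share a key column and a 'Comments' column
analysis = pd.DataFrame({'SampleNumber': ['1', '2'],
                         'Result': [1.2, 3.4],
                         'Comments': ['rerun', 'ok']})
location = pd.DataFrame({'SampleNumber': ['1', '3'],
                         'Latitude': [47.0, 48.0],
                         'Comments': ['outcrop', 'core']})

# Default suffixes: Comments_x (left) and Comments_y (right)
default = analysis.merge(location, how='left', on='SampleNumber')

# Custom suffixes make it clearer where each column came from
named = analysis.merge(location, how='left', on='SampleNumber',
                       suffixes=('_analysis', '_sample'))

print(default.columns.tolist())
print(named.columns.tolist())
```

With `how='left'`, every row of the left dataframe is kept, and rows with no match on the right (sample '2' here) get null values in the right-hand columns.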

Let's check those calculated values and set them to the surface elevation if they are null.

In [None]:
print(rockeval[rockeval['TVDSS_top'].isnull()].shape)
rockeval['Z'] = rockeval['TVDSS_top'].where(rockeval['TVDSS_top'].notnull(),other=rockeval['ELEVF'])
#Keep every result except 'ND', which becomes null
rockeval['Result'] = rockeval['Result'].where(rockeval['Result']!='ND')

#Save out a copy for later
#rockeval.to_csv('data/Rock-Eval.csv',index=False)
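
`Series.where(condition, other=...)` keeps a value where the condition is True and takes the replacement where it is False. A short sketch on made-up elevations shows this, along with `fillna`, which gives the same result for the specific case of filling nulls.

```python
import pandas as pd

# Toy data: one sample is missing its subsea depth
toy = pd.DataFrame({'TVDSS_top': [-1500.0, None, -2200.0],
                    'ELEVF': [2100.0, 1950.0, 2300.0]})

# Keep TVDSS_top where it is not null, otherwise use ELEVF
z_where = toy['TVDSS_top'].where(toy['TVDSS_top'].notnull(), other=toy['ELEVF'])

# Equivalent shortcut for filling nulls from another column
z_fillna = toy['TVDSS_top'].fillna(toy['ELEVF'])

print(z_where.tolist())
```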

Let's take a portion of the pyrolysis data and look at maturity trends in North Dakota.

In [None]:
#Filter to TMAX data in North Dakota (both conditions combined in one filter)
nd = rockeval[(rockeval['STATE']=='North Dakota')&(rockeval['Param']=='TMAX')]
nullvals = ['ND','NA']
nd = nd[~nd.Result.isin(nullvals)].copy() # "~" inverts the filter
nd['Result'] = nd['Result'].astype(float)

#Make a figure with matplotlib
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d') #nrows, ncols, and index

#Make a list of our maturity cutoffs, colors, and labels.
maturity = [(0,435,'saddlebrown','Immature'), #upper bound meets the next window so 430-435 is not dropped
    (435,445,'lime','Early'),
    (445,450,'green','Peak'),
    (450,470,'darkolivegreen','Late'),
    (470,999, 'r','Gas')]

#Populate the figure with a for loop
for low,high,c,label in maturity:
    in_window = (nd['Result']>=low)&(nd['Result']<high) #rows in this maturity window
    ax.scatter(nd['Longitude'][in_window], nd['Latitude'][in_window], nd['Z'][in_window],
               c=c, marker='o', label=label)

#Set the legend (uses the label given to each scatter)
ax.legend()

#Rotate the figure
ax.view_init(45,260)

plt.show()
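
Sorting values into windows like this can also be done with `pd.cut`, which assigns each value a category label. The sketch below uses made-up Tmax values, and the bin edges are illustrative: they assume the Immature window runs up to 435 so that the windows have no gaps. `right=False` makes each window include its lower edge and exclude its upper edge, matching the `>=low` and `<high` test in the loop.

```python
import pandas as pd

# Made-up Tmax values (deg C), one per maturity window
tmax = pd.Series([425.0, 437.0, 447.0, 455.0, 480.0])

# Illustrative bin edges and labels (assumed contiguous windows)
edges = [0, 435, 445, 450, 470, 999]
labels = ['Immature', 'Early', 'Peak', 'Late', 'Gas']

# right=False -> each bin is [low, high)
stage = pd.cut(tmax, bins=edges, labels=labels, right=False)

print(stage.tolist())
```

The resulting column could then be used with `groupby` or as a filter, instead of repeating the window test for each plot.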

---

## Give it a try

Try to reload the already merged rock-eval data, filter to the TOC data, preview your data, describe your data, and make a plot.  I've given you the framework for a scatterplot below.

In [None]:
#Filter down to TOC analysis

#Use ".describe()" to find out about your results


In [None]:
#Make a scatter plot with your results
plt.scatter(toc['Longitude'], toc['Latitude'], c=toc['Result'])
plt.gray()

plt.show()