# Data Exploration 

Exploratory Data Analysis (EDA) is an important step within a data science workflow. It allows you to become familiar with your data and understand its contents, extent and variation. At this stage, you can identify patterns within the data and relationships between the features (well logs).

As petrophysicists and geoscientists, we commonly use log plots, histograms and crossplots (scatter plots) to analyse and explore well log data. Python provides a great toolset for quickly visualising the data from different perspectives.

In this workbook, we will cover:
- Reading in data from a CSV file
- Viewing data on a log plot
- Viewing data on a crossplot / scatter plot
- Viewing data on a histogram
- Visualising all well log curves on a crossplot and histogram using a pairplot

The first step is to import the libraries that we require. These will be:
- pandas for loading and storing the data
- matplotlib and seaborn for visualising the data
- numpy for a number of calculation methods

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Next, we will load the data in using the pandas `read_csv` function and assign it to the variable `df`. The data will now be stored within a structured object known as a dataframe.

In [2]:
df = pd.read_csv('data/spwla_volve_data.csv')

The first step after loading a dataset is to check its contents. The `.describe()` method provides summary statistics for each numeric column within the dataframe: the number of non-null points, the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th and 75th percentiles.

In [3]:
df.describe()

| | MD | BS | CALI | DT | DTS | GR | NPHI | RACEHM | RACELM | RHOB | RPCEHM | RPCELM | PHIF | SW | VSH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 27845.0 | 27845.0 | 27845.0 | 5493.0 | 5420.0 | 27845.0 | 27845.0 | 27845.0 | 27845.0 | 27845.0 | 27845.0 | 27600.0 | 27736.0 | 27736.0 | 27844.0 |
| mean | 3816.22496 | 8.5 | 8.625875 | 78.000104 | 131.027912 | 38.52914 | 0.188131 | 352.689922 | 97.55893 | 2.379268 | 1561.079977 | 30.041154 | 0.157434 | 0.531684 | 0.2724204 |
| std | 398.843662 | 0.0 | 0.079941 | 7.730495 | 13.230939 | 21.814711 | 0.05339 | 1367.355219 | 395.725094 | 0.162293 | 9570.308431 | 210.915588 | 0.075957 | 0.353637 | 0.1872371 |
| min | 3223.0 | 8.5 | 8.3049 | 54.28 | 83.574 | 6.8691 | 0.024 | 0.1974 | 0.2349 | 1.627 | 0.139 | 0.1366 | 0.001 | 0.043 | 1.82e-15 |
| 25% | 3503.0 | 8.5 | 8.5569 | 72.5625 | 123.403425 | 21.1282 | 0.157 | 1.8564 | 1.781 | 2.24 | 2.1483 | 1.884 | 0.091 | 0.201 | 0.1258 |
| 50% | 3713.3 | 8.5 | 8.625 | 77.228 | 131.86435 | 35.071 | 0.1839 | 4.0358 | 3.6812 | 2.356 | 5.1368 | 4.1954 | 0.178 | 0.433 | 0.24 |
| 75% | 4057.0 | 8.5 | 8.672 | 84.3429 | 138.0175 | 49.1783 | 0.2152 | 14.929 | 8.891 | 2.5025 | 24.6874 | 14.78265 | 0.225 | 1.0 | 0.354 |
| max | 4744.0 | 8.5 | 9.175 | 96.2776 | 186.0908 | 127.0557 | 0.541 | 6381.0991 | 2189.603 | 3.09 | 62290.77 | 5571.4351 | 0.292 | 1.0 | 1.0 |
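
In the summary above, BS has a standard deviation of 0 and identical minimum and maximum values (8.5), so it is constant across the dataset. As a minimal sketch, constant columns could be flagged directly from the `.describe()` output by comparing its `min` and `max` rows. The dataframe `demo` and its values are made up for illustration and are not the Volve data.

```python
import pandas as pd

# Small synthetic stand-in for the well log dataframe (illustrative values only)
demo = pd.DataFrame({
    "BS": [8.5, 8.5, 8.5, 8.5],
    "CALI": [8.67, 8.63, 8.63, 8.88],
    "GR": [53.9, 57.3, 59.0, 19.6],
})

summary = demo.describe()

# A column whose minimum equals its maximum holds a single repeated value
is_constant = summary.loc["min"] == summary.loc["max"]
constant_cols = is_constant[is_constant].index.tolist()
print(constant_cols)
```

A constant curve carries no information for pattern finding, which makes it a natural candidate for removal later on.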


The next method we can call upon is `.info()`. This provides a list of all of the columns within the dataframe, their data type (e.g., float, integer, string), and the number of non-null values.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27845 entries, 0 to 27844
Data columns (total 16 columns):
wellName    27845 non-null object
MD          27845 non-null float64
BS          27845 non-null float64
CALI        27845 non-null float64
DT          5493 non-null float64
DTS         5420 non-null float64
GR          27845 non-null float64
NPHI        27845 non-null float64
RACEHM      27845 non-null float64
RACELM      27845 non-null float64
RHOB        27845 non-null float64
RPCEHM      27845 non-null float64
RPCELM      27600 non-null float64
PHIF        27736 non-null float64
SW          27736 non-null float64
VSH         27844 non-null float64
dtypes: float64(15), object(1)
memory usage: 3.4+ MB
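
The non-null counts above show that DT and DTS are present for only about a fifth of the samples (5493 and 5420 out of 27845). One possible way to tabulate the missing fraction per column is sketched below, assuming a small made-up dataframe `demo` in place of the real data.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the well log dataframe (illustrative values only)
demo = pd.DataFrame({
    "MD": [1000.0, 1000.1, 1000.2, 1000.3],
    "GR": [55.0, 60.0, 58.0, 57.0],
    "DT": [np.nan, np.nan, 85.0, np.nan],
})

# isna() flags missing values; the mean of the flags is the missing fraction per column
missing_fraction = demo.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```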


## Data Visualisation

The next useful pair of methods available to us is `.head()` and `.tail()`. By default, these return the first and last five rows of the dataframe, respectively.

In [5]:
df.head()

| | wellName | MD | BS | CALI | DT | DTS | GR | NPHI | RACEHM | RACELM | RHOB | RPCEHM | RPCELM | PHIF | SW | VSH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15/9-F-1 A | 3431.0 | 8.5 | 8.6718 | 86.9092 | 181.2241 | 53.9384 | 0.3222 | 0.5084 | 0.8457 | 2.7514 | 0.6461 | 0.6467 | 0.02 | 1.0 | 0.6807 |
| 1 | 15/9-F-1 A | 3431.1 | 8.5 | 8.625 | 86.4334 | 181.1311 | 57.2889 | 0.3239 | 0.4695 | 0.8145 | 2.7978 | 0.7543 | 0.657 | 0.02 | 1.0 | 0.7316 |
| 2 | 15/9-F-1 A | 3431.2 | 8.5 | 8.625 | 85.9183 | 180.9487 | 59.0455 | 0.3277 | 0.5012 | 0.8048 | 2.8352 | 0.8718 | 0.6858 | 0.02 | 1.0 | 0.7583 |
| 3 | 15/9-F-1 A | 3431.3 | 8.5 | 8.625 | 85.3834 | 180.7211 | 58.255 | 0.3357 | 0.6048 | 0.7984 | 2.8557 | 0.9451 | 0.7913 | 0.02 | 1.0 | 0.7462 |
| 4 | 15/9-F-1 A | 3431.4 | 8.5 | 8.625 | 84.8484 | 180.493 | 59.4569 | 0.3456 | 0.7115 | 0.7782 | 2.8632 | 1.0384 | 0.873 | 0.02 | 1.0 | 0.7646 |


In [6]:
df.tail()

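| | wellName | MD | BS | CALI | DT | DTS | GR | NPHI | RACEHM | RACELM | RHOB | RPCEHM | RPCELM | PHIF | SW | VSH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27840 | 15/9-F-11 B | 4743.6 | 8.5 | 8.875 | NaN | NaN | 19.561 | 0.109 | 1.789 | 1.768 | 2.493 | 3.985 | 2.194 | 0.107 | 0.621 | 0.146 |
| 27841 | 15/9-F-11 B | 4743.7 | 8.5 | 8.851 | NaN | NaN | 16.974 | 0.116 | 1.719 | 1.751 | 2.468 | 3.158 | 1.996 | 0.119 | 0.624 | 0.12 |
| 27842 | 15/9-F-11 B | 4743.8 | 8.5 | 8.804 | NaN | NaN | 14.334 | 0.117 | 1.737 | 1.76 | 2.443 | 2.248 | 1.796 | 0.128 | 0.67 | 0.093 |
| 27843 | 15/9-F-11 B | 4743.9 | 8.5 | 8.726 | NaN | NaN | 12.617 | 0.114 | 1.719 | 1.767 | 2.427 | 1.67 | 1.6 | 0.132 | 0.736 | 0.076 |
| 27844 | 15/9-F-11 B | 4744.0 | 8.5 | 8.672 | NaN | NaN | 12.828 | 0.106 | 1.669 | 1.777 | 2.424 | 1.317 | 1.467 | 0.13 | 0.822 | 0.078 |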


From the introduction, we expect five wells in this dataset. We can check this by selecting the wellName column and calling the `.unique()` method, which returns an array of the distinct values within that column.

In [7]:
df['wellName'].unique()

array(['15/9-F-1 A', '15/9-F-1 B', '15/9-F-1 C', '15/9-F-11 A',
       '15/9-F-11 B'], dtype=object)
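
Beyond listing the well names, it can be useful to know how many samples each well contributes. A minimal sketch using `nunique()` and `value_counts()` follows. The dataframe `demo` is a made-up stand-in, and only its well names are borrowed from the output above.

```python
import pandas as pd

# Synthetic stand-in: a handful of rows spread over three wells (illustrative only)
demo = pd.DataFrame({
    "wellName": ["15/9-F-1 A", "15/9-F-1 A", "15/9-F-1 B",
                 "15/9-F-11 A", "15/9-F-1 B", "15/9-F-1 A"],
    "GR": [53.9, 57.3, 41.0, 22.5, 38.2, 59.0],
})

# Number of distinct wells in the column
n_wells = demo["wellName"].nunique()

# Number of rows (depth samples) per well, largest first
samples_per_well = demo["wellName"].value_counts()

print(n_wells)
print(samples_per_well)
```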

As seen above, we can select specific columns within the dataframe by name. If we do this for a numeric column, such as CALI, pandas returns a Series. Because the Series is long, the display is truncated to the first five and last five values, followed by the column name, length and data type.

In [10]:
df['CALI']

0        8.6718
1        8.6250
2        8.6250
3        8.6250
4        8.6250
          ...  
27840    8.8750
27841    8.8510
27842    8.8040
27843    8.7260
27844    8.6720
Name: CALI, Length: 27845, dtype: float64
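
The same bracket syntax extends to selecting several columns at once (by passing a list of names) and to filtering rows with a boolean mask. The sketch below illustrates both on a small made-up dataframe `demo`. The well names "A" and "B" are placeholders, not wells from the dataset.

```python
import pandas as pd

# Synthetic stand-in for the well log dataframe (illustrative values only)
demo = pd.DataFrame({
    "wellName": ["A", "A", "B", "B"],  # placeholder well names
    "MD": [3431.0, 3431.1, 4743.9, 4744.0],
    "CALI": [8.6718, 8.625, 8.726, 8.672],
})

# A list of column names returns a dataframe with just those columns
subset = demo[["MD", "CALI"]]

# A boolean mask keeps only the rows where the condition is True
well_b = demo[demo["wellName"] == "B"]

print(subset)
print(well_b)
```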

### Well Log Plots

In [None]:
# Create Plot Function

In [None]:
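def create_plot(wellname, dataframe, curves_to_plot, depth_curve, log_curves=None):
    """Plot each curve in its own track against depth, with depth increasing downwards."""
    # Avoid a mutable default argument; None means no logarithmic curves
    log_curves = log_curves or []
    num_tracks = len(curves_to_plot)

    # squeeze=False always returns a 2D array of axes, so a single track also works
    fig, ax = plt.subplots(nrows=1, ncols=num_tracks, figsize=(num_tracks*2, 10), squeeze=False)
    ax = ax.flatten()
    fig.suptitle(wellname, fontsize=20, y=1.05)

    for i, curve in enumerate(curves_to_plot):
        ax[i].plot(dataframe[curve], depth_curve)

        ax[i].set_title(curve, fontsize=14, fontweight='bold')
        # Reverse the y limits so that depth increases down the plot
        ax[i].set_ylim(depth_curve.max(), depth_curve.min())
        ax[i].grid(which='major', color='lightgrey', linestyle='-')

        # Only the first track shows the depth label and tick labels
        if i == 0:
            ax[i].set_ylabel('DEPTH (m)', fontsize=18, fontweight='bold')
        else:
            plt.setp(ax[i].get_yticklabels(), visible=False)

        # Check to see if we have any logarithmic scaled curves
        if curve in log_curves:
            ax[i].set_xscale('log')
            ax[i].grid(which='minor', color='lightgrey', linestyle='-')

    plt.tight_layout()
    plt.show()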

In [None]:
grouped = df.groupby('wellName')

In [None]:
dfs_wells = []
wellnames = []

# Split up the data by well
for well, data in grouped:
    dfs_wells.append(data)
    wellnames.append(well)

curves_to_plot = ['BS', 'CALI', 'DT', 'DTS', 'GR', 'NPHI', 'RACEHM', 'RACELM', 'RHOB', 'RPCEHM', 'RPCELM', 'PHIF', 'SW', 'VSH']
logarithmic_curves = ['RACEHM', 'RACELM', 'RPCEHM', 'RPCELM']

print(wellnames)
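
The loop above relies on the fact that iterating over a pandas groupby object yields (group name, sub-dataframe) pairs. As an alternative sketch, the same split could be held in a dictionary keyed by well name, which avoids keeping two parallel lists in step. The dataframe `demo` and the name `wells` are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for the well log dataframe (illustrative values only)
demo = pd.DataFrame({
    "wellName": ["15/9-F-1 A", "15/9-F-1 B", "15/9-F-1 A", "15/9-F-1 B"],
    "MD": [3431.0, 3500.0, 3431.1, 3500.1],
    "GR": [53.9, 41.0, 57.3, 38.2],
})

# Each iteration yields the group key and the rows belonging to that group
wells = {name: data for name, data in demo.groupby("wellName")}

for name, data in wells.items():
    print(name, len(data))
```

A single well can then be looked up by name, for example `wells["15/9-F-1 A"]`, rather than by its position in a list.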



In [None]:
well = 0
create_plot(wellnames[well], dfs_wells[well], curves_to_plot, dfs_wells[well]['MD'], logarithmic_curves)

In [None]:
well = 1
create_plot(wellnames[well], dfs_wells[well], curves_to_plot, dfs_wells[well]['MD'], logarithmic_curves)
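
The log plots rest on two matplotlib mechanisms: reversing the y limits so that depth increases downwards, and switching resistivity tracks to a logarithmic x scale. The sketch below isolates both on synthetic curves. The arrays `depth`, `gr` and `res` are made up, and `sharey=True` is used here as one possible alternative to hiding the tick labels track by track.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic curves (illustrative only): a gamma ray-like and a resistivity-like log
depth = np.linspace(3000.0, 3010.0, 101)
gr = 60 + 30 * np.sin(depth)
res = np.logspace(-1, 3, depth.size)

# sharey=True links the depth axis of both tracks
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(4, 6), sharey=True)

axes[0].plot(gr, depth)
axes[0].set_title("GR")
# Passing (max, min) reverses the axis; the shared track follows automatically
axes[0].set_ylim(depth.max(), depth.min())

axes[1].plot(res, depth)
axes[1].set_title("RES")
# Resistivity spans several orders of magnitude, so use a logarithmic scale
axes[1].set_xscale("log")
axes[1].grid(which="both", color="lightgrey", linestyle="-")

fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell. In a script, a call to `plt.show()` would be needed.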

### Standard Crossplots

In [None]:
# Create Den Neu, Ac Den and Ac Neu Xplots
import matplotlib.colors

def make_xplot(grouped_wells, xvar, yvar, color, rows=1, cols=1, xscale=(0, 1), yscale=(0, 1), vmin=0, vmax=1):
    # squeeze=False always returns a 2D array of axes, so axs.flat also works for a single panel
    fig, axs = plt.subplots(rows, cols, figsize=(25, 5), squeeze=False)
    cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["green", "yellow", "red"])

    # One panel per well: iterate over the groupby object passed in, not a global
    for (name, welldata), ax in zip(grouped_wells, axs.flat):
        # With data=welldata, the strings xvar, yvar and color are looked up as column names
        sc = ax.scatter(x=xvar, y=yvar, data=welldata, s=5, c=color, vmin=vmin, vmax=vmax, cmap=cmap)
        # sns.scatterplot(x=xvar, y=yvar, data=welldata, hue=color, ax=ax, legend=False, palette="viridis")

        ax.set_ylim(yscale[0], yscale[1])
        ax.set_xlim(xscale[0], xscale[1])
        ax.set_ylabel(yvar)
        ax.set_xlabel(xvar)
        ax.set_title(name)

        fig.colorbar(sc, ax=ax)
    plt.tight_layout()
    plt.show()

In [None]:
make_xplot(grouped, 'NPHI', 'RHOB', 'CALI', 1, 5, [-0.15, 0.6], [3.5,1.5], 8.5, 9)

In [None]:
make_xplot(grouped, 'DT', 'DTS', 'GR', 1, 5, [40, 140], [40, 240], 0, 100)

### Seaborn Pairplot

In [None]:
sns.pairplot(df, vars=['CALI', 'DT','DTS', 'RHOB', 'GR', 'NPHI'], diag_kind='kde', plot_kws = {'alpha': 0.8, 's': 20, 'edgecolor': 'k'})

### Correlation Matrix

In [None]:
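# Restrict to numeric columns: recent pandas versions raise an error on the text column wellName
corr = df.select_dtypes(include='number').corr()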

plt.figure(figsize = (10,10))
sns.heatmap(corr, annot=True, cmap='plasma')
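
A heatmap of many curves can be hard to read at a glance, so it may help to list the curve pairs ranked by the strength of their correlation. The sketch below does this on a small made-up dataframe `demo`. The values, the helper names `pairs`, `curve_a`, `curve_b` and `r` are all illustrative assumptions. Note that a constant column such as BS has no defined correlation and is dropped.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the well log dataframe (illustrative values only)
demo = pd.DataFrame({
    "wellName": ["demo"] * 5,
    "NPHI": [0.10, 0.15, 0.20, 0.25, 0.30],
    "RHOB": [2.50, 2.42, 2.36, 2.27, 2.20],
    "GR": [40.0, 90.0, 30.0, 80.0, 60.0],
    "BS": 8.5,  # constant column, correlation is undefined (NaN)
})

# Pearson correlation between the numeric columns only
corr = demo.select_dtypes(include="number").corr()

# Indices of the upper triangle (k=1 skips the diagonal) so each pair appears once
cols = corr.columns
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.DataFrame({
    "curve_a": cols[i],
    "curve_b": cols[j],
    "r": corr.to_numpy()[i, j],
}).dropna()

# Rank by absolute correlation, strongest first
pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
print(pairs)
```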

### Creating Working Dataset

In [None]:
# Drop BS, RPCELM and RACELM

In [None]:
# Save DF to Pickle
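
The two placeholder cells above describe dropping BS, RPCELM and RACELM and then saving the result to a pickle file. A minimal sketch of how that could look is given below on a small made-up dataframe. The names `demo` and `working`, and the file name `working_dataset.pkl`, are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic stand-in for the well log dataframe (illustrative values only)
demo = pd.DataFrame({
    "wellName": ["15/9-F-1 A", "15/9-F-1 A"],
    "BS": [8.5, 8.5],
    "GR": [53.9, 57.3],
    "RACELM": [0.8457, 0.8145],
    "RPCELM": [0.6467, 0.657],
})

# drop() returns a new dataframe without the listed columns
working = demo.drop(columns=["BS", "RPCELM", "RACELM"])

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "working_dataset.pkl"  # hypothetical file name
    # Pickle preserves column dtypes and the index exactly
    working.to_pickle(pickle_path)
    restored = pd.read_pickle(pickle_path)

print(restored.columns.tolist())
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.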