### A real dataset

Today we will take a look at a well log dataset together. There is a nice open source dataset of SECARB Cranfield Well Logs in Franklin, Mississippi (https://edx.netl.doe.gov/dataset/secarb-cranfield-well-logs). We refined the data slightly to make it easier to load into the notebook. You should have downloaded the `borehole.txt` file together with this notebook.

In [None]:
# the usual imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# importing the dataset as a pandas dataframe
# Note that the file 'borehole.txt' has to be located in the same folder as the notebook
# or alternatively the full path has to be defined
# sep=r'\s+' splits on any whitespace (delim_whitespace=True is deprecated in newer pandas)
data = pd.read_csv('borehole.txt', sep=r'\s+')

In [None]:
data.head()

Here is some information on the different columns and signals. We are not well logging experts, and the main focus of this exercise is data analysis, but it is always a good idea to have a rough idea of what you are looking at:

MNEM | .UNIT  | DESCRIPTION 
:---: | :---: |:--:
DEPT |.F | DEPTH (BOREHOLE)
CTEM |.DEGF | Cartridge Temperature
DCAL |.IN | Differential Caliper 
DT   |.US/F | Delta-T
GR_EDTC|.GAPI | Gamma Ray  
GTEM |.DEGF | Generalized Borehole Temperature 
ITT  |.S | Integrated Transit Time 
RWA  |.OHMM | Apparent Water Resistivity 
SPHI |.CFCF | Sonic Porosity 
STIT |.F | Stuck Tool Indicator, Total 
TENS |.LBF | Cable Tension 

If you are interested in more detail, check out the original dataset or some of the references at the end of this notebook.

### 1. Checking the data and reducing its size

Let's start by checking our data. Take a look at the dataframe. Is it complete? Do we need to clean it? For the purpose of this exercise we will reduce the size of the dataset a little bit and concentrate on some specific logs.

<div class="alert alert-info">
    
**Your task:**

Create a smaller dataframe containing only the columns `'DEPT', 'DT', 'GR_EDTC', 'GTEM', 'RWA'` and only the first 2000 entries. Hint for the second part: Take a look at what keyword argument the `df.head()` method takes.

</div>
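As a hint on the mechanics, here is a minimal sketch on a made-up toy dataframe. The column names `a`, `b`, `c` and the values are assumptions for illustration only, not the borehole data. It shows one possible way to select columns with a list of labels and to limit the number of rows with the `n` keyword of `head()`.

```python
import pandas as pd

# hypothetical toy data, not the borehole dataset
toy = pd.DataFrame({'a': range(10), 'b': range(10, 20), 'c': range(20, 30)})

# select columns by passing a list of labels, then keep the first n rows
toy_small = toy[['a', 'c']].head(n=3)
print(toy_small)
```

The same pattern should carry over to the borehole dataframe with the column names and row count given in the task.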



In [None]:
# YOUR CODE HERE
# Take a look at the dataframe


In [None]:
# YOUR CODE HERE
# create a subset (smaller dataframe)


Let's also visualize the log plots.


<div class="alert alert-info">

**Your task:**

Plot the different well log signals against depth. Note: as in a proper well log plot, the depth axis should be inverted, running from shallow (top) to deep (bottom); check out `invert_yaxis()`. Try to put all four plots in a single figure.

</div>

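If the subplot layout is unfamiliar, the following sketch shows one possible arrangement on synthetic data. The names `sig_1` and `sig_2` and the sine/cosine curves are assumptions for illustration, not real logs. It places one panel per signal side by side with a shared, inverted depth axis.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical synthetic "logs", not the borehole data
depth = np.linspace(0, 100, 200)
signals = {'sig_1': np.sin(depth / 10), 'sig_2': np.cos(depth / 15)}

# one panel per signal, side by side, sharing the depth axis
fig, axes = plt.subplots(1, len(signals), sharey=True, figsize=(6, 6))
for ax, (name, values) in zip(axes, signals.items()):
    ax.plot(values, depth)  # signal on x, depth on y
    ax.set_xlabel(name)
axes[0].set_ylabel('depth')

# the y-axis is shared, so inverting it once flips every panel
axes[0].invert_yaxis()
```

Sharing the y-axis keeps the depth scale identical across panels, which is what makes the logs comparable at a glance.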
In [None]:
# YOUR CODE HERE


### 2. Covariance and correlation intuition

Today we focus on the relationship between different properties. You discussed the correlation coefficient (or $r$ value) in the lecture. It takes values between -1 and 1. 

<div class="alert alert-info">
    
**Your task:**

Just from looking at the well log plot you created, make an educated guess for the correlation coefficients of the following parameter pairs: `DT` and `GR_EDTC`, `DT` and `RWA`, `RWA` and `GR_EDTC`.

</div>

Note: We will ignore the temperature in the following analysis.

In [None]:
# YOUR GUESS HERE
corr_DT_GREDTC_guess = 
corr_DT_RWA_guess =
corr_RWA_GREDTC_guess = 

To get a better feeling for this relation it is a good idea to plot the properties against each other. 

<div class="alert alert-info">
    
**Your task:**

Plot the three well log signals against each other using scatterplots. Does the plot support your guess from above? Note of caution: We did not normalize the data, but as matplotlib automatically scales the axes we still get a good impression.

</div>

In [None]:
# YOUR CODE HERE


### 3. Calculate covariances

We now have a basic idea of what our three datasets look like and even plotted them against each other to get a visual intuition of their correlation. A measure of this correlation is the covariance. We can calculate it using the following equation:

$$Cov_{x,y}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n}$$

Remember from last week the definition of the standard deviation, which is the square root of the variance of a dataset:

$$ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2} $$

Hopefully you see why this new measure is called covariance!
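As a quick sanity check of the formula, here is a small worked example with made-up numbers. The values $x = (1, 2, 3)$ and $y = (2, 4, 9)$ are assumptions chosen for easy arithmetic, not borehole data. With $\bar{x} = 2$ and $\bar{y} = 5$, the deviations are $(-1, 0, 1)$ and $(-3, -1, 4)$, so:

$$Cov_{x,y}=\frac{(-1)(-3)+(0)(-1)+(1)(4)}{3}=\frac{7}{3}\approx 2.33$$

Replacing $y$ by $x$ in the same formula gives

$$Cov_{x,x}=\frac{(-1)^2+0^2+1^2}{3}=\frac{2}{3}=\sigma_x^2$$

which is exactly the variance of $x$.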

<div class="alert alert-info">  
    
**Your task:**

Write a function that calculates the covariance between two arrays of measured properties. Use this function to calculate the covariances between all three given signals. Hint: Don't just use `numpy.cov`, write the equation properly as practice.

</div>

In [None]:
def calc_cov(x,y):
    # YOUR CODE HERE
    
    return cov


### 4. Calculate correlation coefficient


Let's now calculate the *correlation coefficient*. It is the scaled version of the covariance and allows us to compare the strength of correlation between different properties. It can be derived from the covariance using the following formula:

$$\rho_{x,y}=\frac{Cov_{x,y}}{\sigma_x \sigma_y}$$
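To illustrate the scaling with made-up numbers (assumed values for easy arithmetic, not borehole data), take $x = (1, 2, 3)$ and $y = (2, 4, 9)$. For these, $Cov_{x,y} = 7/3$, $\sigma_x = \sqrt{2/3}$ and $\sigma_y = \sqrt{26/3}$, so:

$$\rho_{x,y}=\frac{7/3}{\sqrt{2/3}\,\sqrt{26/3}}=\frac{7}{\sqrt{52}}\approx 0.97$$

If $y$ were expressed in different units so that every value is 10 times larger, both $Cov_{x,y}$ and $\sigma_y$ would grow by a factor of 10 and $\rho_{x,y}$ would stay the same. This is why the correlation coefficient, unlike the covariance, can be compared across properties with different units.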

<div class="alert alert-info">  
    
**Your task:**

Write a function that calculates the correlation coefficient between two arrays of measured properties. You can use your covariance calculation from before as a basis. Determine the correlation coefficients between the three provided properties and compare them to the respective covariances.

</div>

In [None]:
def calc_cor_coeff(x,y):
    # YOUR CODE HERE
    
    return cor_coeff


<div class="alert alert-info">  
    
**Your task:**

Compare the results to your guess. Discuss the meaning of covariance and correlation coefficient in this setting. What are important observations?
Check your results with `pandas`. Use the `df.corr()` method.

</div>
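One thing to keep in mind when cross-checking covariances against library functions: `numpy.cov` divides by $n - 1$ by default, whereas the formula in this notebook divides by $n$ (pandas' `df.cov()` also uses $n - 1$ by default). The sketch below uses made-up toy arrays (assumed values, not borehole data) to show the difference. The correlation coefficient is not affected, because the normalization cancels out.

```python
import numpy as np

# hypothetical toy arrays, not the borehole data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 9.0])

# np.cov returns the 2x2 covariance matrix; [0, 1] is Cov(x, y)
cov_sample = np.cov(x, y)[0, 1]              # default: divides by n - 1
cov_population = np.cov(x, y, ddof=0)[0, 1]  # divides by n, like the formula above

# the correlation coefficient does not depend on this choice
r = np.corrcoef(x, y)[0, 1]

print(cov_sample, cov_population, r)
```

So a small mismatch between a hand-written covariance and a library value is not necessarily a bug; it may just be the $n$ versus $n - 1$ convention.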

In [None]:
# YOUR CODE HERE

### References

The dataset can be found here:

* https://edx.netl.doe.gov/dataset/secarb-cranfield-well-logs

And some information on some of the signals:

* http://www.hitchnerexplorationservices.com/blog/post/1953736
* https://www.glossary.oilfield.slb.com/Terms/i/interval_transit_time.aspx
* https://petrowiki.org/Gamma_ray_logs
* https://petrowiki.org/Porosity_evaluation_with_acoustic_logging
