# Well log correlation analysis

This notebook performs a correlation analysis between variables recorded for a well. Its main objective is to identify possible relationships or dependencies between these well characteristics.

It uses three correlation measures: Pearson's, Kendall's and Spearman's correlation coefficients. They quantify the strength and direction of the relationship between numerical variables.

Heatmaps are also generated to visualize the correlations found. They give an intuitive graphical view of the magnitude of the correlations, which helps to identify patterns and trends between the well logs and other well characteristics. The set of variables to analyze can be selected in the notebook.

The notebook is intended for exploratory data analysis in oil industry projects: understanding how the key variables of a well relate to each other supports informed decision making and process optimization.

## Libraries

The libraries used throughout the notebook are imported here.

In [None]:
#======================Plotting libraries====================================#
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.colors
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns

#======================Numerical and data analysis libraries=================#
import numpy as np
import pandas as pd
import scipy.stats
from scipy.stats import pearsonr
from scipy.stats import kendalltau
from scipy.stats import spearmanr

#======================Graphical user interface libraries====================#
import tkinter as tk
from tkinter.filedialog import askopenfilename

## Import Database 

In this section the data to be analyzed is imported. The file must be in comma-separated text format (".csv"). A summary table with statistics of the dataset is then generated.
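
Where no display is available for the file dialog, the same load-and-summarize step can be sketched without tkinter. The sample values, the column names `DEPTH`, `GR` and `RHOB`, and the use of `io.StringIO` in place of a real file path are assumptions made only for illustration.

```python
import io

import pandas as pd

# Hypothetical well log data; in practice pass the path of your own ".csv" file
csv_text = "DEPTH,GR,RHOB\n1000,45.0,2.45\n1001,60.0,2.50\n1002,75.0,2.55\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, std, min, quartiles, max), one row per column
summary = sample.describe().T
print(summary)
```

Transposing the summary with `.T` puts one variable per row, which is easier to read when the file has many columns.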

In [None]:
#====================Create the (hidden) window used to browse to the data file====================#
root = tk.Tk()
root.withdraw()

#====================Select the data file to open in the dialog====================================#
datos = askopenfilename(filetypes = [("csv files", "*.csv")])

#====================Read the data and show a summary with its descriptive statistics==============#
rgp = pd.read_csv(datos)
rgp.describe().T

## Add the name of your well or file

In [None]:
pozo = 'well name'

## Names of your variables

In [None]:
rgp.columns.tolist()

## Variable selection
In this section you can choose the variables whose correlation you want to analyze.

In [None]:
rgp_n1 = rgp[[
    'Variable 1',
    'Variable 2',
    'Variable 3',
    'Variable 4',
    'Variable 5']]  # Placeholder names: replace them with column names from the list above
#rgp_n1 = rgp # Uncomment to analyze all variables
rgp_n1.describe().apply(lambda s: s.apply('{0:.4f}'.format)).T

## Standardization of variables (Z-score Standardization)
Standardization is a technique used to transform numerical variables so that they have a mean of zero and a standard deviation of one. This transformation is applied by subtracting the mean from the data and dividing by the standard deviation. The formula for z-score standardization is as follows:

$\text{variable}_{\text{standardized}} = \frac{\text{variable} - \text{mean}}{\text{standard deviation}}$
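
The effect of the formula can be checked on a small table. This is a minimal sketch with made-up values; `GR` and `RHOB` are placeholder column names, not columns assumed to exist in your file.

```python
import numpy as np
import pandas as pd

# Hypothetical log values; GR and RHOB are placeholder column names
df = pd.DataFrame({"GR": [40.0, 55.0, 70.0, 85.0, 100.0],
                   "RHOB": [2.60, 2.52, 2.50, 2.41, 2.35]})

# pandas .std() uses the sample standard deviation (ddof=1) by default
z = (df - df.mean()) / df.std()

print(z.mean().round(10))  # approximately 0 for every column
print(z.std().round(10))   # 1 for every column

# Standardization is a positive linear rescaling, so correlations are unchanged
same_corr = np.allclose(df.corr(), z.corr())
print(same_corr)
```

Because the three coefficients used in this notebook are unaffected by such a rescaling, standardization changes the axes of the scatter plots but not the correlation values.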

In [None]:
rgp_n1 = (rgp_n1-rgp_n1.mean())/rgp_n1.std()
rgp_n1.describe().apply(lambda s: s.apply('{0:.5f}'.format)).T

# Correlation Matrices
Correlation matrices summarize the pairwise relationships between multiple variables. Scatter plots show these relationships graphically, while correlation coefficients such as Pearson's, Kendall's or Spearman's quantify their strength and direction. When interpreting the coefficients, keep in mind that a high value does not imply causality.
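
As a minimal sketch of how the three coefficients can differ, the matrices can be computed with the `method` argument of pandas `DataFrame.corr`. The columns `A`, `B` and `C` are made-up data: `B` increases with `A` but not linearly, and `C` decreases linearly with `A`.

```python
import pandas as pd

# Hypothetical data: B rises monotonically (not linearly) with A, C falls linearly with A
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["B"] = df["A"] ** 3
df["C"] = -2.0 * df["A"] + 10.0

# One correlation matrix per method
matrices = {m: df.corr(method=m) for m in ("pearson", "kendall", "spearman")}
for name, mat in matrices.items():
    print(name)
    print(mat.round(2))
```

For the pair `A`, `B` the rank-based coefficients reach 1 because the order is fully preserved, while Pearson's coefficient stays below 1 because the relationship is not a straight line.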

### Functions

In [None]:
def corr_func(x, y, ax=None, **kws):
    """Plot the correlation coefficient in the top right hand corner of a plot.
    """
    r, _ = pearsonr(x, y)
    s, _ = spearmanr(x, y)
    k, _ = kendalltau(x, y)
    fontsize = 30
    ax = ax or plt.gca()
    ax.annotate(f'P = {r:.2f}', xy=(.5, .70), xycoords=ax.transAxes,
                fontsize=fontsize, ha='center')
    ax.annotate(f'S = {s:.2f}', xy=(.5, .50), xycoords=ax.transAxes,
                fontsize=fontsize, ha='center')
    ax.annotate(f'K = {k:.2f}', xy=(.5, .30), xycoords=ax.transAxes,
                fontsize=fontsize, ha='center')

#-----------------------------------------------------------------------------------------------------------------------------
def corr_pairs(df):
    """Return the variable pairs whose absolute Pearson correlation is at least 0.7."""
    corr_df = df.corr().abs()  # Correlation matrix in absolute value, so strong negative correlations are kept
    upper = corr_df.where(np.triu(np.ones(corr_df.shape, dtype=bool), k=1))  # Keep each pair once and drop the diagonal
    corr_pairs = upper.unstack().dropna().sort_values(ascending=False)  # Flatten the matrix and sort the values
    corr_pairs = corr_pairs[corr_pairs >= 0.7]  # Keep only the strongly correlated pairs
    return corr_pairs

#-----------------------------------------------------------------------------------------------------------------------------
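def most_common_vars(df, n):
    """Returns the n most frequent column names of a DataFrame.

    Column names are normally unique, so every name has a count of one and
    the result is simply n of the column names, not a correlation ranking.
    """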
    var_counts = pd.Series(df.columns).value_counts()
    return list(var_counts.head(n).index)

#-----------------------------------------------------------------------------------------------------------------------------
def corrfuncK(x, y, **kwds):
    cmap = kwds['cmap']
    norm = kwds['norm']
    ax = plt.gca()
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    sns.despine(ax=ax, bottom=True, top=True, left=True, right=True)
    r, _ = kendalltau(x, y)
    facecolor = cmap(norm(r))
    ax.set_facecolor(facecolor)
    lightness = (max(facecolor[:3]) + min(facecolor[:3]) ) / 2
    ax.annotate(f"{r:.2f}", xy=(.5, .5), xycoords=ax.transAxes,
                color='white' if lightness < 0.7 else 'black', size=40, ha='center', va='center')
    
#-----------------------------------------------------------------------------------------------------------------------------
def corrfuncS(x, y, **kwds):
    cmap = kwds['cmap']
    norm = kwds['norm']
    ax = plt.gca()
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    sns.despine(ax=ax, bottom=True, top=True, left=True, right=True)
    r, _ = spearmanr(x, y)
    facecolor = cmap(norm(r))
    ax.set_facecolor(facecolor)
    lightness = (max(facecolor[:3]) + min(facecolor[:3]) ) / 2
    ax.annotate(f"{r:.2f}", xy=(.5, .5), xycoords=ax.transAxes,
                color='white' if lightness < 0.7 else 'black', size=40, ha='center', va='center')
    
#-----------------------------------------------------------------------------------------------------------------------------
def corrfuncP(x, y, **kwds):
    cmap = kwds['cmap']
    norm = kwds['norm']
    ax = plt.gca()
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    sns.despine(ax=ax, bottom=True, top=True, left=True, right=True)
    r, _ = pearsonr(x, y)
    facecolor = cmap(norm(r))
    ax.set_facecolor(facecolor)
    lightness = (max(facecolor[:3]) + min(facecolor[:3]) ) / 2
    ax.annotate(f"{r:.2f}", xy=(.5, .5), xycoords=ax.transAxes,
                color='white' if lightness < 0.7 else 'black', size=40, ha='center', va='center')

## Correlation matrix

In [None]:
data = rgp_n1.copy()
g = sns.PairGrid(data)
g.map_upper(corr_func)
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot)
g.fig.suptitle("Correlation plot of the variables for well "+pozo, fontsize=50, y=1)
for ax in g.axes.flat:
    ax.set_xlabel(ax.get_xlabel(), fontsize=29)
    ax.set_ylabel(ax.get_ylabel(), fontsize=29)
plt.savefig('Heatmap_correlaciones_'+pozo+'.png', format="png", bbox_inches="tight")
plt.show()

## Variables with high correlation (positive and negative)

In [None]:
high_corr_pairs = corr_pairs(data)
high_corr_pairs

In [None]:
most_common_vars(data, 15)  # the second argument sets how many variable names are returned
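
Counting column names does not by itself rank variables by correlation, because each name appears once. One possible way to rank variables by how many strongly correlated pairs they take part in is sketched here. The helper name `vars_in_high_corr_pairs`, the 0.7 threshold and the demo data are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

import numpy as np
import pandas as pd


def vars_in_high_corr_pairs(df, threshold=0.7, n=15):
    """Hypothetical helper: rank variables by how many strong pairs they belong to."""
    corr = df.corr().abs()
    # Upper triangle only (k=1): each pair once, diagonal excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).unstack().dropna()
    strong = pairs[pairs >= threshold]
    counts = Counter()
    for var_a, var_b in strong.index:
        counts[var_a] += 1
        counts[var_b] += 1
    return counts.most_common(n)


# Hypothetical data: A, B and C are strongly related to each other, D is not
demo = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "C": [6.0, 5.1, 3.9, 3.2, 1.8, 1.1],
    "D": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
})
ranking = vars_in_high_corr_pairs(demo)
print(ranking)
```

Variables that never reach the threshold with any other variable, such as `D` in the demo data, do not appear in the ranking.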

## Kendall's Tau Correlation Heat Map
Kendall's correlation coefficient is a non-parametric measure used to assess the association between two ranked or ordered variables. Unlike Pearson's correlation coefficient, Kendall's coefficient does not assume a linear relationship and is especially useful when the data contain outliers or when the relationship between the variables is monotonic but not linear.

The equation for Kendall's correlation coefficient, denoted as "$\tau$" or "Kendall's tau", is

$\tau = \frac{{C - D}}{{\frac{1}{2} n (n - 1)}}$

where:

"$C$" is the number of concordant pairs (pairs of observations that have the same order in both variables).

"$D$" is the number of discordant pairs (pairs of observations that have opposite orders in the two variables).

"$n$" is the sample size (number of observations).

Kendall's correlation coefficient varies between -1 and 1, where:

A value of 1 indicates a perfect positive correlation (all pairs of items have the same order in both variables).

A value of -1 indicates a perfect negative correlation (all pairs of elements have an opposite order in both variables).

A value of 0 indicates no association between the two orderings (as many concordant as discordant pairs).
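
The formula can be verified by counting pairs by hand on a small sample. This is a sketch with made-up values that contain no ties. `scipy.stats.kendalltau` computes the tau-b variant by default, which coincides with the formula above when there are no ties.

```python
from itertools import combinations

from scipy.stats import kendalltau

# Hypothetical observations without ties
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

concordant = discordant = 0
for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
    product = (x_i - x_j) * (y_i - y_j)
    if product > 0:
        concordant += 1   # both variables order the pair the same way
    elif product < 0:
        discordant += 1   # the variables order the pair in opposite ways

n = len(x)
tau_manual = (concordant - discordant) / (0.5 * n * (n - 1))
tau_scipy, _ = kendalltau(x, y)
print(concordant, discordant, tau_manual, tau_scipy)
```

With 8 concordant and 2 discordant pairs out of 10, both calculations give a tau of 0.6.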

In [None]:
d1 = rgp_n1
g = sns.PairGrid(d1)
g.map_lower(plt.scatter, s=10)
g.map_diag(sns.histplot, kde=False)
g.map_upper(corrfuncK, cmap=plt.get_cmap('seismic'), norm=plt.Normalize(vmin=-1, vmax=1))
g.fig.subplots_adjust(wspace=0.06, hspace=0.06) # equal spacing in both directions
g.fig.suptitle('Kendall correlation heat map and scatter plots for well '+pozo,size=50, y=1.01)
for ax in g.axes.flat:
    ax.set_xlabel(ax.get_xlabel(), fontsize=30)
    ax.set_ylabel(ax.get_ylabel(), fontsize=30)
plt.savefig('Heatmap_scattergram_Kendall_'+pozo+'.png', format="png", bbox_inches="tight")
plt.show()

## Spearman's correlation heat map
Spearman's correlation coefficient is a non-parametric measure used to assess the correlation between two ranked or ordered variables. Unlike Pearson's correlation coefficient, Spearman's coefficient does not assume a linear relationship and is based on the ranks of the data rather than the observed values.

The formula for Spearman's correlation coefficient is expressed as follows:

$\rho_{spearman} = 1 - \frac{6\sum{d_i^2}}{n(n^2-1)}$

where:

"$d_i$" is the difference between the two ranks of observation $i$ (its rank in the first variable minus its rank in the second).

"$n$" is the sample size (number of observations).

This simplified formula is exact when there are no tied ranks. Spearman's correlation coefficient varies between -1 and 1, where:

A value of 1 indicates a perfectly monotonic positive correlation (all pairs of items have the same order in both variables).

A value of -1 indicates a perfectly monotonic negative correlation (all pairs of elements have an opposite order in both variables).

A value of 0 indicates no monotonic correlation.
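
The rank-difference formula can be checked against `scipy.stats.spearmanr` on a small sample. The values are made up and contain no ties, which is the case where the simplified formula applies.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical observations without ties
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.2, 0.9, 2.5, 2.1, 3.3])

# d_i: rank of each observation in x minus its rank in y
d = rankdata(x) - rankdata(y)
n = len(x)
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, _ = spearmanr(x, y)
print(rho_manual, rho_scipy)
```

Here the squared rank differences sum to 4, so both calculations give a coefficient of 0.8.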

In [None]:
d1 = rgp_n1
g = sns.PairGrid(d1)
g.map_lower(plt.scatter, s=10)
g.map_diag(sns.histplot, kde=False)
g.map_upper(corrfuncS, cmap=plt.get_cmap('seismic'), norm=plt.Normalize(vmin=-1, vmax=1))
g.fig.subplots_adjust(wspace=0.06, hspace=0.06) # equal spacing in both directions
g.fig.suptitle('Spearman correlation heat map and scatter plots for well '+pozo,size=50, y=1.01)
for ax in g.axes.flat:
    ax.set_xlabel(ax.get_xlabel(), fontsize=30)
    ax.set_ylabel(ax.get_ylabel(), fontsize=30)
plt.savefig('Heatmap_scattergram_Spearman_'+pozo+'.png', format="png", bbox_inches="tight")
plt.show()

## Pearson's correlation heat map
Pearson's correlation coefficient is a parametric measure used to assess the linear correlation between two continuous variables. It is a measure of the strength and direction of the linear relationship between the variables.

The equation of Pearson's correlation coefficient is:

$\rho_{pearson} = \frac{{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}}{{\sqrt{{\sum{(X_i - \bar{X})^2}} \sum{(Y_i - \bar{Y})^2}}}}$

where:

"$X_i$" and "$Y_i$" are the values of the two variables.

"$\bar{X}$" and "$\bar{Y}$" are the means of the two variables.

Pearson's correlation coefficient varies between -1 and 1, where:

A value of 1 indicates a perfect positive linear correlation (the points lie on an ascending line).

A value of -1 indicates a perfect negative linear correlation (the points lie on a descending line).

A value of 0 indicates no linear correlation (no apparent linear relationship between the variables).
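
The equation can be evaluated term by term with NumPy and compared with `scipy.stats.pearsonr`. This is a minimal sketch on made-up, nearly linear data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical observations that follow an almost straight ascending line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Numerator: sum of cross products; denominator: root of the product of the sums of squares
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_scipy, _ = pearsonr(x, y)
print(r_manual, r_scipy)
```

Both values agree and are close to 1, as expected for points that lie near an ascending line.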

In [None]:
d1 = rgp_n1
g = sns.PairGrid(d1)
g.map_lower(plt.scatter, s=10)
g.map_diag(sns.histplot, kde=False)
g.map_upper(corrfuncP, cmap=plt.get_cmap('seismic'), norm=plt.Normalize(vmin=-1, vmax=1))
g.fig.subplots_adjust(wspace=0.06, hspace=0.06) # equal spacing in both directions
g.fig.suptitle('Pearson correlation heat map and scatter plots for well '+pozo,size=50, y=1.01)
for ax in g.axes.flat:
    ax.set_xlabel(ax.get_xlabel(), fontsize=30)
    ax.set_ylabel(ax.get_ylabel(), fontsize=30)
plt.savefig('Heatmap_scattergram_Pearson_'+pozo+'.png', format="png", bbox_inches="tight")
plt.show()