## Data Disclaimer

All the data serving as input to these notebooks was generously donated by GEOLINK  
and is licensed under CC BY-SA 4.0.

If you use their data, please reference their dataset properly to give them credit for their contribution.

In [1]:
import lasio
import matplotlib.pyplot as plt
%matplotlib inline
import os
import numpy as np
from sklearn import preprocessing
from operator import itemgetter

## Trying to find the greatest number of wells with the highest number of common well logs

This is starting to sound a lot like a [Project Euler](https://projecteuler.net/) challenge...  

_Why do we want to do this?_  
For any machine learning algorithm, if we don't want to deal with imputing missing data, we'll have to restrict ourselves to a set of well logs that is common to every well we use.  
And the more wells that share that set, the better.

In [2]:
##### run this if you want to create the headers yourself ####
fname_and_headers = []
for f in os.listdir("../geolink_wells/"):
    try:
        fname_and_headers.append((f, lasio.read("../geolink_wells/"+f).keys()))
    except ValueError:
        print("Error in: ", f)
np.save("../data/log_headers.npy", np.array(fname_and_headers, dtype="object"))

In [3]:
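# The saved array has dtype=object, so recent NumPy versions need allow_pickle=True to load it.
fname_and_headers = np.load("../data/log_headers.npy", allow_pickle=True)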

Let's just join all these keys together so that we can encode them.

In [4]:
all_headers = []
for f, key in fname_and_headers:
    all_headers += key

I've decided to encode the header names as integers so that we can work with numbers (sets of strings would probably work just as well, lazy me).

In [5]:
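enc = preprocessing.LabelEncoder()
# Fit the encoder on every header name and keep the integer code of each occurrence.
encoded_headers = enc.fit_transform(all_headers)
# Print the integer code assigned to each distinct header name.
print(enc.transform(np.unique(all_headers)))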

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27]
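
As the aside about strings suggests, Python sets work on strings directly, so the integer encoding is a convenience rather than a requirement. A minimal sketch with made-up header lists (`headers_a` and `headers_b` are hypothetical and not taken from the GEOLINK files):

```python
# Hypothetical header lists standing in for the keys returned by lasio.
headers_a = ["DEPT", "GR", "CALI", "RHOB", "NPHI"]
headers_b = ["DEPT", "GR", "RHOB", "DTC"]

# Sets of strings intersect exactly like sets of integers.
common = set(headers_a) & set(headers_b)
print(sorted(common))
```

Working with the names themselves would also remove the need to map codes back to names later on.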


So we've now encoded all our unique labels in the header files, next let's transform all our header files to this encoding.  
We'll also make them sets so that we can do simple intersections and comparisons.

In [6]:
header_sets = []
for f, key in fname_and_headers:
    header_sets.append((f, set(enc.transform(key))))
print(header_sets[0:3])

[('35_11-5.las', {1, 4, 5, 7, 9, 10, 11, 13, 14, 15, 16, 18, 21, 24}), ('25_11-24.las', {0, 1, 3, 4, 5, 7, 9, 10, 11, 13, 14, 15, 16, 18, 19, 20}), ('34_4-5.las', {11, 4})]


Stop... *for loop time*:  
We'll now intersect each encoded well log header with every other well log header.
This should give us a fairly big list of intersections from which we'll find all the unique ones.

In [7]:
header_intersections = []
for i in range(len(header_sets)):
    for j in range(len(header_sets)):
        if i != j:
            header_intersections.append(header_sets[i][1].intersection(header_sets[j][1]))
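
The nested loop above visits every ordered pair, so each pair of wells is intersected twice (once as `(i, j)` and once as `(j, i)`). If only the unordered pairs matter, `itertools.combinations` may be a tidier option. A small sketch on made-up data (`example_header_sets` is hypothetical and only mimics the shape of `header_sets`):

```python
from itertools import combinations

# Hypothetical (file name, header set) pairs in the same shape as header_sets.
example_header_sets = [
    ("a.las", {1, 4, 11}),
    ("b.las", {4, 11, 15}),
    ("c.las", {4, 11}),
]

# combinations() yields each unordered pair once, so there are
# n * (n - 1) / 2 intersections instead of n * (n - 1).
pair_intersections = [
    left & right
    for (_, left), (_, right) in combinations(example_header_sets, 2)
]
print(len(pair_intersections))
print(pair_intersections)
```

Every count is simply halved, so the ranking of the intersections should stay the same.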

In [8]:
print("Total number of intersections", len(header_intersections))

unique_headers, count_headers  = np.unique(header_intersections, return_counts=True)
##Oof, but these are all unsorted
print(count_headers[0:10])
print(unique_headers[0:10])

Total number of intersections 49506
[7686   12    2    1    8    1    1    3    1    5]
[{11, 4} {10, 11, 4, 15} {4, 10, 11, 15, 18}
 {1, 4, 5, 7, 10, 11, 13, 15, 16, 21, 24} {4, 10, 11, 15, 18}
 {4, 7, 10, 11, 16} {4, 10, 11, 13, 15, 16} {4, 10, 11, 15, 18}
 {1, 4, 5, 7, 10, 11, 13, 15, 16, 24} {4, 10, 11, 15, 18}]
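
Note that the output above repeats `{4, 10, 11, 15, 18}` several times, so `np.unique` does not seem to be collapsing identical sets. A likely reason (an assumption, not verified against this dataset) is that `<` on sets means "proper subset", which does not give the total order that sorting relies on. One alternative is to count hashable `frozenset` objects with `collections.Counter`. A minimal sketch on made-up intersections (`example_intersections` is hypothetical):

```python
from collections import Counter

# Hypothetical intersections; note that {4, 11} appears three times.
example_intersections = [{4, 11}, {4, 10, 11, 15}, {11, 4}, {4, 11}]

# frozenset is hashable, so Counter can tally identical sets directly.
counts = Counter(frozenset(s) for s in example_intersections)

# most_common() returns (set, count) pairs, highest count first.
for header_set, count in counts.most_common():
    print(sorted(header_set), count)
```

With this approach the counts come out already sorted, and each distinct combination appears exactly once.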


Some Stack Overflow magic from [here](https://stackoverflow.com/questions/13668393/python-sorting-two-lists/13668413) to sort both lists by count:

In [9]:
sorted_unique_headers, sorted_count_headers = [list(x) for x in zip(*sorted(zip(unique_headers, count_headers), key=itemgetter(1)))]
sorted_unique_headers, sorted_count_headers = sorted_unique_headers[::-1], sorted_count_headers[::-1]
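
The sort-then-reverse step can also be written in one pass with `reverse=True`. A small sketch on made-up values (`example_sets` and `example_counts` are hypothetical stand-ins for `unique_headers` and `count_headers`):

```python
from operator import itemgetter

# Hypothetical stand-ins for unique_headers and count_headers.
example_sets = [{4, 11}, {4, 10, 11, 15}, {4, 7, 10, 11, 16}]
example_counts = [7686, 12, 191]

# Sort the (set, count) pairs by count, largest first, then unzip them.
pairs = sorted(zip(example_sets, example_counts), key=itemgetter(1), reverse=True)
sorted_sets, sorted_counts = (list(x) for x in zip(*pairs))
print(sorted_counts)
```

The result should match the two-step version except for the relative order of entries with equal counts.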

In [10]:
N = 1
print("Total number of unique header combinations: ", len(sorted_count_headers))
print("Top ", 25, " unique intersections: ", sorted_count_headers[0:25])
print("Header names for the most common non-trivial well log combinations: ")
print(*np.unique(all_headers)[list(sorted_unique_headers[N])])

Total number of unique header combinations:  21190
Top  25  unique intersections:  [7686, 191, 133, 125, 118, 102, 100, 88, 87, 83, 83, 82, 81, 80, 80, 78, 76, 74, 69, 69, 68, 67, 67, 65, 58]
Header names for the most common non-trivial well log combinations: 
CALI DEPT DRHO DTC GR LITHOLOGY_GEOLINK NPHI RDEP RHOB RMED


Let's get the names of the LAS files in which all of these well logs are present.

In [11]:
out_well_fnames = []
for f, well_header in fname_and_headers:
    if len(set(enc.transform(well_header)).intersection(sorted_unique_headers[N])) == len(sorted_unique_headers[N]):
        out_well_fnames.append(f)
print(len(out_well_fnames))
print(out_well_fnames)

172
['35_11-5.las', '25_11-24.las', '15_9-2.las', '31_2-21 S.las', '34_8-3.las', '33_9-6.las', '16_11-1 S.las', '31_2-10.las', '30_6-8.las', '32_2-1.las', '35_11-10.las', '16_5-3.las', '30_6-11.las', '7_3-1.las', '35_11-7.las', '16_1-2.las', '25_7-2.las', '16_2-11 A.las', '31_5-4 S.las', '16_2-6.las', '31_4-3.las', '33_9-11.las', '15_9-9.las', '35_11-1.las', '15_9-13.las', '16_10-2.las', '35_9-7.las', '31_2-8.las', '16_2-7.las', '33_5-2.las', '16_4-1.las', '16_7-4.las', '25_5-3.las', '30_6-22.las', '34_2-2 R.las', '35_8-6 S.las', '15_9-15.las', '31_2-9.las', '25_5-4.las', '31_6-8.las', '34_7-21.las', '16_10-3.las', '15_9-12.las', '31_6-1.las', '30_3-5 S.las', '36_7-1.las', '17_4-1.las', '35_6-2 S.las', '25_10-10.las', '31_4-5.las', '25_11-15.las', '17_11-1.las', '16_7-2.las', '30_3-2 R.las', '25_2-5.las', '25_11-5.las', '31_5-3.las', '31_2-1.las', '34_10-16 R.las', '25_5-1.las', '30_3-3.las', '29_3-1.las', '34_10-7.las', '31_2-2 R.las', '34_7-5.las', '31_3-3.las', '31_4-6.las', '31_5-2
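
The length comparison in the cell above is a subset test in disguise: a file qualifies when every required log is in its header. Here is a sketch of the same check with `set.issubset`, on made-up data (`required_logs` and `example_files` are hypothetical):

```python
# Hypothetical required logs and per-file headers.
required_logs = {"DEPT", "GR", "RHOB"}
example_files = [
    ("a.las", ["DEPT", "GR", "RHOB", "NPHI"]),
    ("b.las", ["DEPT", "GR"]),
    ("c.las", ["RHOB", "GR", "DEPT"]),
]

# A file qualifies when every required log appears in its header.
matching = [name for name, header in example_files if required_logs.issubset(header)]
print(matching)
```

`issubset` accepts any iterable, so the header list does not need to be converted to a set first.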

We can now read these LAS files, convert them to dataframes and, hopefully, put them into a format that is better suited to ML tasks.

In [24]:
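from tqdm import tqdm_notebook as tqdm
well_dataframes = []
for f in tqdm(out_well_fnames):
    # Read each LAS file and convert it to a dataframe indexed by depth.
    well_dataframes.append(lasio.read("../geolink_wells/"+f).df())
    # NOTE: this break stops after the first well (a quick check); remove it to load all wells.
    break
print(well_dataframes)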



[             LITHOLOGY_GEOLINK       CALI      DRHO       NPHI      RHOB  \
DEPT                                                                       
443.227997                 NaN        NaN       NaN        NaN       NaN   
443.380402                 NaN        NaN       NaN        NaN       NaN   
443.532806                 NaN        NaN       NaN        NaN       NaN   
443.685211                 NaN        NaN       NaN        NaN       NaN   
443.837616                 NaN        NaN       NaN        NaN       NaN   
443.990021                 NaN        NaN       NaN        NaN       NaN   
444.142426                 NaN        NaN       NaN        NaN       NaN   
444.294800                 NaN        NaN       NaN        NaN       NaN   
444.447205                 NaN        NaN       NaN        NaN       NaN   
444.599609                 NaN        NaN       NaN        NaN       NaN   
444.752014                 NaN        NaN       NaN        NaN       NaN   
444.904419

In [25]:
well_dataframes[0].head()

| DEPT | LITHOLOGY_GEOLINK | CALI | DRHO | NPHI | RHOB | PEF | GR | DTC | DTS | RDEP | SP | RSHA | RMED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 443.227997 | NaN | NaN | NaN | NaN | NaN | NaN | 42.707977 | NaN | NaN | NaN | NaN | NaN | NaN |
| 443.380402 | NaN | NaN | NaN | NaN | NaN | NaN | 39.981758 | NaN | NaN | NaN | NaN | NaN | NaN |
| 443.532806 | NaN | NaN | NaN | NaN | NaN | NaN | 37.370552 | NaN | NaN | NaN | NaN | NaN | NaN |
| 443.685211 | NaN | NaN | NaN | NaN | NaN | NaN | 32.266106 | NaN | NaN | NaN | NaN | NaN | NaN |
| 443.837616 | NaN | NaN | NaN | NaN | NaN | NaN | 28.130108 | NaN | NaN | NaN | NaN | NaN | NaN |


In [26]:
well_dataframes[0].describe()

| | LITHOLOGY_GEOLINK | CALI | DRHO | NPHI | RHOB | PEF | GR | DTC | DTS | RDEP | SP | RSHA | RMED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7362.0 | 18460.0 | 18459.0 | 7374.0 | 18460.0 | 18449.0 | 21906.0 | 18592.0 | 7009.0 | 21895.0 | 18588.0 | 7448.0 | 18589.0 |
| mean | 9.363896 | 16.540613 | -0.025488 | 28.226891 | 2.297814 | 10.994915 | 79.193541 | 108.309069 | 997.036344 | 3.05293 | 65.354793 | 33.88346 | 4.039052 |
| std | 4.966183 | 3.277885 | 0.157227 | 9.811672 | 0.291549 | 46.07112 | 22.984406 | 27.760326 | 229.318267 | 3.932845 | 49.456203 | 217.865715 | 8.712975 |
| min | 1.0 | 7.949219 | -1.460555 | 1.771864 | 1.209807 | 1.621701 | 14.090071 | 48.600006 | 548.679138 | 0.409521 | 0.751601 | 0.129722 | 0.144774 |
| 25% | 6.0 | 13.066406 | -0.013127 | 19.844249 | 2.040244 | 4.289062 | 61.75 | 84.900003 | 796.639954 | 1.036037 | 14.50015 | 3.081051 | 1.048237 |
| 50% | 7.0 | 17.875 | -0.006289 | 29.318535 | 2.431646 | 5.233411 | 84.856308 | 104.600006 | 957.731873 | 1.455078 | 51.500401 | 6.475961 | 1.704242 |
| 75% | 16.0 | 18.90625 | 0.001086 | 35.745694 | 2.51694 | 8.390556 | 93.5625 | 134.699989 | 1203.506348 | 3.574219 | 108.899994 | 9.858277 | 5.066406 |
| max | 18.0 | 22.890625 | 0.480722 | 62.623631 | 3.362264 | 1000.0 | 193.0 | 163.600006 | 1552.308594 | 95.125 | 157.837494 | 2000.0 | 362.879578 |


Let's only keep those columns that are shared amongst all wells.

In [43]:
logs_of_interest = list(np.unique(all_headers)[list(sorted_unique_headers[N])])
logs_of_interest.remove('DEPT')

In [45]:
well_dataframes[0][logs_of_interest]

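| DEPT | CALI | DRHO | DTC | GR | LITHOLOGY_GEOLINK | NPHI | RDEP | RHOB | RMED |
|---|---|---|---|---|---|---|---|---|---|
| 443.227997 | NaN | NaN | NaN | 42.707977 | NaN | NaN | NaN | NaN | NaN |
| 443.380402 | NaN | NaN | NaN | 39.981758 | NaN | NaN | NaN | NaN | NaN |
| 443.532806 | NaN | NaN | NaN | 37.370552 | NaN | NaN | NaN | NaN | NaN |
| 443.685211 | NaN | NaN | NaN | 32.266106 | NaN | NaN | NaN | NaN | NaN |
| 443.837616 | NaN | NaN | NaN | 28.130108 | NaN | NaN | NaN | NaN | NaN |
| 443.990021 | NaN | NaN | NaN | 26.835022 | NaN | NaN | NaN | NaN | NaN |
| 444.142426 | NaN | NaN | NaN | 27.441097 | NaN | NaN | NaN | NaN | NaN |
| 444.294800 | NaN | NaN | NaN | 29.518074 | NaN | NaN | NaN | NaN | NaN |
| 444.447205 | NaN | NaN | NaN | 30.653519 | NaN | NaN | NaN | NaN | NaN |
| 444.599609 | NaN | NaN | NaN | 30.506626 | NaN | NaN | NaN | NaN | NaN |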


In [46]:
well_dataframes[0][logs_of_interest].describe()

| | CALI | DRHO | DTC | GR | LITHOLOGY_GEOLINK | NPHI | RDEP | RHOB | RMED |
|---|---|---|---|---|---|---|---|---|---|
| count | 18460.0 | 18459.0 | 18592.0 | 21906.0 | 7362.0 | 7374.0 | 21895.0 | 18460.0 | 18589.0 |
| mean | 16.540613 | -0.025488 | 108.309069 | 79.193541 | 9.363896 | 28.226891 | 3.05293 | 2.297814 | 4.039052 |
| std | 3.277885 | 0.157227 | 27.760326 | 22.984406 | 4.966183 | 9.811672 | 3.932845 | 0.291549 | 8.712975 |
| min | 7.949219 | -1.460555 | 48.600006 | 14.090071 | 1.0 | 1.771864 | 0.409521 | 1.209807 | 0.144774 |
| 25% | 13.066406 | -0.013127 | 84.900003 | 61.75 | 6.0 | 19.844249 | 1.036037 | 2.040244 | 1.048237 |
| 50% | 17.875 | -0.006289 | 104.600006 | 84.856308 | 7.0 | 29.318535 | 1.455078 | 2.431646 | 1.704242 |
| 75% | 18.90625 | 0.001086 | 134.699989 | 93.5625 | 16.0 | 35.745694 | 3.574219 | 2.51694 | 5.066406 |
| max | 22.890625 | 0.480722 | 163.600006 | 193.0 | 18.0 | 62.623631 | 95.125 | 3.362264 | 362.879578 |
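
One possible next step towards an ML-friendly format is to stack the per-well dataframes into a single table, keeping only the shared logs and labelling each row with its well. The sketch below uses small made-up dataframes rather than the GEOLINK files. The names `example_wells` and `example_logs` are hypothetical, and dropping incomplete rows is just one option, in line with the earlier wish to avoid imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical per-well dataframes indexed by depth, standing in for lasio's .df() output.
well_a = pd.DataFrame(
    {"GR": [42.7, 40.0], "RHOB": [np.nan, 2.3], "SP": [1.0, 2.0]},
    index=pd.Index([443.2, 443.4], name="DEPT"),
)
well_b = pd.DataFrame(
    {"GR": [80.0, 85.0], "RHOB": [2.4, 2.5]},
    index=pd.Index([500.0, 500.2], name="DEPT"),
)
example_wells = {"a.las": well_a, "b.las": well_b}
example_logs = ["GR", "RHOB"]

# Keep only the shared logs, label each row with its well, and stack the wells.
combined = pd.concat(
    {name: df[example_logs] for name, df in example_wells.items()},
    names=["WELL", "DEPT"],
)

# Drop depths where any of the shared logs is missing.
complete = combined.dropna()
print(complete)
```

The resulting `(WELL, DEPT)` index keeps track of which well each sample came from, which should make it easier to split training and test data by well later.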
