MIT License

 Copyright (c) 2020 Matthew W. Bauer, P.G.

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:

 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.

 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
___

Source: https://github.com/Rocks-n-Code/PythonCourse/
___

# Use Python to Improve Your Understanding of Geologic Conditions in the Rockies:
## Bottom Hole Temperature Workflow

By Matthew W. Bauer, P.G.

How much time and money does your team spend gathering datasets, fixing errors, and relating multiple data sources before you can interpret? Are there conventional workflows that take so long to apply that you can't use them over regional areas? What about datasets whose variable and sample counts are so large that it is hard to wrap your mind around them? Have you tried to understand the economics and risk of a project without having absolute inputs?

Adding Python programming to your workflows can help with all of that. It isn’t a magic bullet, so understanding what it can and can’t do is important. Workflow automation and machine learning can't replace the domain expertise, abstract thought, or creativity of a good geologist. That said, coding literacy provides large benefits to earth scientists by letting them acquire and utilize large datasets efficiently. I also argue that those benefits, especially the time savings in accessing and cleaning data, can start to be realized earlier in the learning process than the traditional “10,000-hour” learning threshold. So what packages should you start learning?

The Python library `pandas` lets us efficiently open more file formats, at larger sizes, than Excel can handle. Finding and fixing errors in datasets with boolean filters and regular expressions can eliminate the days of sifting through an Excel workbook fixing issues by hand. The ease and speed with which higher-level math can be applied to these multidimensional arrays can replace the multiple columns and complex nested functions required in Excel. Merging datasets in `pandas` greatly simplifies the sometimes tedious VLOOKUP process. The fuzzy string matching capabilities of the library `fuzzywuzzy`, combined with `pandas`, can make looking up missing API numbers from well names with slight variations in spelling a thing of the past. The Python libraries `lasio` and `welly` allow us to open, interact with, and save LAS logs. Tight time and data budgets can limit the number of well logs, tops, and downhole tests that we base our interpretations on and, potentially, increase a project's risk. Automating the process of accessing public data, also known as scraping, with the libraries `requests`, `selenium`, and `urllib` can increase the amount of data we can access and vastly improve our understanding of natural systems.
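
To make the fuzzy-matching idea concrete, here is a minimal sketch that looks up an API number from a misspelled well name. It uses the standard library's `difflib` as a stand-in for `fuzzywuzzy`, so the scoring will differ from that library. The function name `closest_well_name`, the 0.8 cutoff, and the well names and API numbers are all made up for illustration.

```python
import difflib

def closest_well_name(name, known_names, cutoff=0.8):
    """Return the known name most similar to `name`, or None if nothing is close enough."""
    # Compare in upper case with outer whitespace removed so formatting differences don't matter
    lookup = {k.upper().strip(): k for k in known_names}
    matches = difflib.get_close_matches(name.upper().strip(), list(lookup),
                                        n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

# Hypothetical well names and API numbers, for illustration only
api_by_name = {
    "SMITH FEDERAL 1-24": "05-123-00001",
    "JONES RANCH 12": "05-045-00002",
}

match = closest_well_name("Smith Fedral 1-24", api_by_name)
print(match, api_by_name.get(match))
```

The cutoff is a judgment call: set it too low and different wells get merged, too high and simple typos are missed, so it is worth spot-checking the matches by hand.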

### Bottom Hole Temperatures
Let's take a look at an example of an automated workflow for LAS logs. Whether you're mapping geothermal prospects or building maturity models, bottom hole temperature (BHT) data is essential. Some G&G software packages allow parsing of LAS file headers; otherwise, you are left reviewing one file at a time. Parsing can be automated, which makes it achievable to process approximately 64k files on a solid-state drive for BHT data overnight. First, we will import the libraries that we will be using and set some options for working in a Jupyter notebook.

In [None]:
#Import libraries
import pandas as pd     #library for working with dataframes
import geopandas as gpd #library for working with geospatial data
import lasio            #library for working with LAS files
import glob             #library for finding files
import os               #library for interacting with files
import re               #library for regular expressions

#Set options
pd.set_option('display.max_columns', None)
%matplotlib notebook

With the library `glob` we can search for our files using the file path and the wildcard character "*". On Windows, `glob` can return paths containing backslashes, so once we have the list of files we’ll make the slash direction uniform.

In [None]:
##Find your LAS files
las_path = "D:/CO/CO_LAS/*.las"
files = [x.replace('\\','/') for x in glob.glob(las_path)]
print(len(files),'LAS files found.')
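
If you'd like to try the search pattern without a folder of LAS files, the sketch below builds a throwaway directory with a few empty files and runs the same `glob` and slash-normalizing steps. The file names are hypothetical. They follow the convention the loop in the next cell relies on, with the API number placed before the first underscore.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names: API number, underscore, then anything else
    for name in ["0512312345_run1.las", "0504500002_run1.las", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()

    # Same wildcard search and slash cleanup as above
    pattern = os.path.join(tmp, "*.las")
    found = sorted(p.replace("\\", "/") for p in glob.glob(pattern))

    # File name is the last path piece; API is the text before the first underscore
    names = [p.split("/")[-1] for p in found]
    apis = [n.split("_")[0] for n in names]

print(len(found), "LAS files found.")
print(apis)
```

Only the two `.las` files are picked up; the text file is ignored by the wildcard.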

Next, we will loop through those files and use the library `lasio` to open each one. We will then pull the LAS file header parameters into a temporary `pandas` dataframe and filter it down to the aliases of temperature readings in the header info. We can also collect the maximum measured depth from each log. We'll add this, or concatenate it, to our primary dataframe, which we'll save out occasionally and again once we are finished.

In [None]:
temp_alias = ['BHT','BHT:1','BHT:2','BHT:3','BHT:4',
    'MRT','MRT:1','MRT:2','MRT2:1',
    'MRT1:1','MRT2:2','MRT1:2','MRT:3',
    'MRT2','MRT3','MRT4','MRT1',
    'MRT                         192',
    'MAXRECTEMP',
    'BOTTEMP','BOTTOMHOLETEMP','TEMP','TEMPERATURE','MAXTEMP',
    'BHTEMP','BHTEMP:1','BHTEMP:2','BHTEMP_SRC']

celsius_units = ['degC','DEGC']

depth_alias = ['DEPT','DEPTH','M_DEPTH','DPTH','DEPT:1','MD','DEPTH:1','DEPTH_HOLE','BDEP',
    'DMEA','TOTAL_DEPTH','TVD','DEP','TDEP','DEPTMEAS','DEPT_PNN','DEPT_CBL',
    '"DEPTH"','TVD:1','DEP:1']

#Make empty list and dataframe
err_files = []
bht_df = pd.DataFrame()

#Place files that cause hangups here
hangups = []

#Saveout counter
i = 0

#Loop through files
for file in files:
    if file in hangups: continue
    i += 1
    try:
        las = lasio.read(file)
        params = {'mnemonic' : [x.mnemonic for x in las.params],
                  'unit' : [x.unit for x in las.params],
                  'value' : [x.value for x in las.params],
                  'descr' : [x.descr for x in las.params]}

        temp = pd.DataFrame(params)
        temp['API'] = file.split('/')[-1].split('_')[0]
        temp['file'] = file.split('/')[-1]
        
        #Pull Maximum Depth
        depths = [x.mnemonic for x in las.curves if x.mnemonic in depth_alias]
        max_depth = max([las[x].max() for x in depths])
        temp['MaxMD'] = max_depth

        #Filter to just temperature aliases
        temp = temp[temp.mnemonic.isin(temp_alias)]
        
        bht_df = pd.concat([bht_df,temp],ignore_index=True)
        
    except Exception as e:
        print(file,e)
        err_files.append(file)
        
    if i > 100:
        i = 0
        print(round(files.index(file)/len(files)*100,2),'%complete')
        bht_df.to_csv('bht_df.csv',index=False)

        
bht_df.to_csv('bht_df.csv',index=False)
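
To see where the `mnemonic`, `unit`, `value`, and `descr` fields come from, here is a simplified sketch of how one LAS 2.0 header line breaks apart. It is not how `lasio` is implemented and it skips the edge cases a real reader has to handle. The function name `parse_header_line` and the sample line are hypothetical.

```python
def parse_header_line(line):
    """Split one LAS 2.0 header line (MNEM.UNIT VALUE : DESCRIPTION) into its parts.

    Simplified on purpose: assumes the line contains a period and a colon.
    """
    mnemonic, _, rest = line.partition('.')    # mnemonic ends at the first period
    unit, _, rest = rest.partition(' ')        # unit runs up to the first space
    value, _, descr = rest.rpartition(':')     # description follows the last colon
    return {'mnemonic': mnemonic.strip(),
            'unit': unit.strip(),
            'value': value.strip(),
            'descr': descr.strip()}

# Hypothetical parameter line, for illustration only
sample = "BHT .DEGF      185.0 : Bottom Hole Temperature"
row = parse_header_line(sample)
print(row)
```

In this sketch the value stays a string, which is one reason the cleanup that follows converts everything to text before stripping characters.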


Once we have the BHT data parsed from the LAS files we will need to clean it up to make it usable. The values contain non-number characters such as "°" or " F" that we will need to remove. Regular expressions, or `re`, can do this easily in a single line of code. Regular expressions do this by expanding our search capabilities to ranges of characters or even patterns of characters. In this case, we'll use `re` to search for anything that isn't a number or a decimal place and remove those characters.

In [None]:
pre_count = bht_df.shape[0]

#Drop Null Values
bht_df = bht_df[bht_df.value.notnull()]

#Remove non-number characters other than "."
bht_df.value = bht_df.value.apply(lambda x: re.sub("[^0-9.]","",str(x))) #str() because some header values are already numbers

#Drop empty values
bht_df = bht_df[bht_df.value != '']

#Drop multiple "." Tool IP addresses?
bht_df = bht_df[bht_df.value.apply(lambda x: str(x).count('.') <= 1)]
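
Here is the same cleanup applied to a handful of made-up header values, so you can see what the pattern keeps and what it throws away. The sample strings are hypothetical.

```python
import re

# Hypothetical raw header values, for illustration only
raw_values = ["185.0", "172 F", "98°C", "DEGF 201.5", "N/A", "10.0.0.12", "-40 F"]

# Anything that is not a digit or a period gets removed
pattern = re.compile(r"[^0-9.]")
cleaned = [pattern.sub("", str(v)) for v in raw_values]

# Drop empty strings and anything with more than one period
kept = [v for v in cleaned if v != "" and v.count(".") <= 1]

print(cleaned)
print(kept)
```

Note that the minus sign is stripped along with everything else, so "-40 F" comes out as "40". That is unlikely to matter for bottom hole temperatures, but it is worth knowing before reusing the pattern on other data.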

We can now change the variable type from strings to floats. Once the values are numeric we can convert the Celsius readings to Fahrenheit and change the unit label. Doing this with `where` allows us to convert units in select locations by using a boolean that is only false where we want to change the data in that column.

In [None]:
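#Convert to numbers; leftovers that still aren't numeric (e.g. a lone ".") become NaN and fall out in the filters below
bht_df['value'] = pd.to_numeric(bht_df['value'], errors='coerce')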

for C_col in ['DEGC','degC']:
    bht_df['value'] = bht_df['value'].where(bht_df['unit'] != C_col,
                                            other=bht_df['value'].apply(lambda x: (9/5)*x + 32))
    
    bht_df['unit'] = bht_df['unit'].where(bht_df['unit'] != C_col,
                                           other='degF')
#Drop Bad Values
bht_df = bht_df[bht_df.value >= 0]
bht_df = bht_df[bht_df.value < 500]

print(pre_count - bht_df.shape[0],'of',pre_count,'rows dropped.')
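
If the `where` logic feels backwards, it may help to see the same steps written out one row at a time in plain Python. This is only a sketch of the idea, not a replacement for the vectorized version above. The function name `to_fahrenheit` and the sample readings are hypothetical.

```python
CELSIUS_UNITS = {"degC", "DEGC"}

def to_fahrenheit(value, unit):
    """Convert a reading to degF if its unit label is Celsius; otherwise leave it alone."""
    if unit in CELSIUS_UNITS:
        return value * 9 / 5 + 32, "degF"
    return value, unit

# Hypothetical (value, unit) pairs, for illustration only
readings = [(85.0, "degC"), (185.0, "degF"), (100.0, "DEGC"), (650.0, "degF")]

converted = [to_fahrenheit(v, u) for v, u in readings]

# Same plausibility window as above: keep 0 <= value < 500
plausible = [(v, u) for v, u in converted if 0 <= v < 500]

print(converted)
print(plausible)
```

Only rows labeled as Celsius are touched, and the 650 degree reading is dropped by the range check.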

### Merging Datasets

Merging large datasets in Excel usually involves ordering the lookup table and using VLOOKUP to pull the column we want. We are going to use `geopandas` to open a shapefile and manipulate it with the same syntax as `pandas`.

In [None]:
wells = gpd.read_file('./WellSpot/Wells.shp')
wells.head()

To merge depth and location data from the `wells` geodataframe into the `bht_df` dataframe we will need a common column. To achieve this we'll format the API number in `bht_df` to create a new column, “API_Label”, that is in the same format as the one in `wells`. Applying a lambda function makes this easy and efficient.

In [None]:
#Make API_Label
bht_df['API_Label'] = bht_df.API.apply(lambda x: x[:2] + '-' + x[2:5] + '-' + x[5:10])

#Merge
bht_df = bht_df.merge(wells[['API_Label','Max_MD','Max_TVD','geometry']],
                      how='left',
                      on='API_Label')
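
The slicing in that lambda follows the layout of an API number: two digits for the state, three for the county, and five for the well. Below is a slightly more defensive sketch of the same formatting that also rejects strings that are too short. The function name `api_label` and the numbers are hypothetical.

```python
def api_label(api):
    """Format the first 10 digits of an API number as SS-CCC-WWWWW."""
    # Keep digits only, in case the input already contains dashes or spaces
    digits = ''.join(ch for ch in str(api) if ch.isdigit())
    if len(digits) < 10:
        raise ValueError(f"Expected at least 10 digits, got {api!r}")
    return f"{digits[:2]}-{digits[2:5]}-{digits[5:10]}"

# Hypothetical API numbers, for illustration only
print(api_label("0512312345"))      # 10-digit API
print(api_label("05123123450000"))  # 14-digit API, trailing codes ignored
```

Keep API numbers as strings throughout. If they are ever read in as integers, the leading zero in a state code like Colorado's "05" is lost and the slices no longer line up.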

### Plotting Data for QA

Pandas allows us to easily visualize our data in a scatter plot and colorize it with a property of our data. This makes it easier to identify potentially erroneous data points. We will calculate the count of each unique value using `value_counts` then plot setting the color to that property.

In [None]:
#Creating Counts column: how many times each temperature value occurs
vcounts = bht_df.value.value_counts()
bht_df['count'] = bht_df.value.map(vcounts)

#Plotting Values with TVD
bht_df[bht_df.value < 400].plot.scatter(x='value',
                                        y='Max_TVD',
                                        c='count',
                                        colormap='viridis')

Note the values that are not following the depth trend. Whether they are defaults or "pencil-whipped", these values do not appear to be correct, so let's remove them. We can do this easily by combining `isin`, which checks whether each value is present in a list, with `~`, which inverts the resulting boolean.

In [None]:
#Removing suspect values
bad_values = [212,211.99986,200,150,70,32,0]
bht_df = bht_df[~bht_df['value'].isin(bad_values)]

#Plot data
bht_df.plot.scatter(x='value',
                    y='Max_TVD',
                    c='count',
                    colormap='viridis')
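
The counting idea behind the color scale can also be sketched without `pandas`. The example below tallies how often each value occurs and then removes a hand-picked list of suspect values, mirroring `value_counts` and `~isin`. The temperatures and the suspect list are made up for illustration, and a repeated value is only a hint to look closer, not proof that it is wrong.

```python
from collections import Counter

# Hypothetical temperature readings in degF, for illustration only
temps = [212.0, 185.0, 212.0, 143.5, 212.0, 200.0, 200.0, 167.2]

# Equivalent of value_counts: how many times each value occurs
counts = Counter(temps)
repeated = sorted(v for v, n in counts.items() if n >= 2)

# Equivalent of ~isin: keep everything that is NOT in the suspect list
suspect = {212.0, 200.0}
filtered = [t for t in temps if t not in suspect]

print(repeated)
print(filtered)
```

In practice the suspect list comes from looking at the plot, as we did above, rather than from a fixed count threshold.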

### Saving Data
How you'd like to use this data dictates the output file type. Let's look at three of the most common.

#### CSV
CSVs allow us a lot of flexibility with their small size and wide range of programs that can open them. If you're a Petrel user you may need tab-delimited import data so let's save out that format as well.

In [None]:
#Save out to csv
bht_df.to_csv("CO_BHT.csv", index=False)

#Save out to tab-delimited
bht_df.to_csv("CO_BHT_tab.csv", index=False, sep="\t")

#### Excel
I may rip on Excel, but don't think that it doesn't have its place in data analysis. It is the standard software for working with data in most industries, makes nice plots, and has some decent tools built in. Because of that, Excel is an excellent choice for sharing work with other people.

In [None]:
bht_df.to_excel("CO_BHT.xlsx", index=False)

#### GIS
The power of preserving and being able to reference the spatial location of data cannot be overstated. Geopandas allows us to convert to a geodataframe and save easily. It also makes managing CRS a breeze.

In [None]:
#Make the GeoDataFrame
gdf = gpd.GeoDataFrame(bht_df, geometry='geometry',crs=wells.crs)

#Save GeoDataFrame
gdf.to_file('bht_gdf.shp')

#Change CRS to WGS84 (EPSG:4326); the {'init': ...} form is deprecated in newer versions
gdf = gdf.to_crs(epsg=4326)

#Save Out Copy
gdf.to_file('bht_gdf_WGS84.shp')

### Conclusion

In this workflow, we've shown how to use Python to parse header information out of a large number of LAS files. Although these BHT values have not yet been corrected, you can already see regional trends. With that, you can identify areas of interest and apply corrections to the wells we have already identified as having BHT data in public LAS files.

![BHT Data in Colorado](BHT.png)

If you'd like this code in a Jupyter notebook, the data it produced, Python lessons, or other workflows, visit my GitHub at https://github.com/Rocks-n-Code/PythonCourse.