# GEOL 451 Exercise 7

<img src="../resistivity-graph.jpg" alt="Typical Resistivities of Geologic Materials" />

## (Pre)Processing ERT Data Using QGIS, Python, and Resipy

# Getting to know our data

Using the code cell below, or by opening the .dat file containing our ERT data directly (located at */workspaces/GEOL451/ERT/ERTSampleData/GEOL451_700E_ERT.dat* in GitHub Codespaces), follow the instructions and answer the questions that follow.

In [8]:
ertDatFile = r"/workspaces/GEOL451/ERT/ERTSampleData/GEOL451_700E_ERT.dat"
with open(ertDatFile, 'r') as edfile:
    fileText = edfile.read()
print(fileText)

!CLASSPROJECT FISHER451
2.00
11
3
Type of measurement (0=app.resistivity,1=resistance)
1
1980
2
0
4 6.00 0.00 4.00 0.00 0.00 0.00 2.00 0.00 0.615662
4 8.00 0.00 6.00 0.00 0.00 0.00 2.00 0.00 0.16689
4 8.00 0.00 6.00 0.00 2.00 0.00 4.00 0.00 0.655988
4 10.00 0.00 8.00 0.00 0.00 0.00 2.00 0.00 0.0659453
4 12.00 0.00 8.00 0.00 0.00 0.00 4.00 0.00 0.349269
4 10.00 0.00 8.00 0.00 2.00 0.00 4.00 0.00 0.151074
4 12.00 0.00 10.00 0.00 0.00 0.00 2.00 0.00 0.0468058
4 14.00 0.00 12.00 0.00 0.00 0.00 2.00 0.00 0.0259796
4 10.00 0.00 8.00 0.00 4.00 0.00 6.00 0.00 0.511538
4 12.00 0.00 10.00 0.00 2.00 0.00 4.00 0.00 0.0852895
4 16.00 0.00 14.00 0.00 0.00 0.00 2.00 0.00 0.0204932
4 12.00 0.00 10.00 0.00 4.00 0.00 6.00 0.00 0.176473
4 14.00 0.00 12.00 0.00 2.00 0.00 4.00 0.00 0.0435733
4 16.00 0.00 12.00 0.00 0.00 0.00 4.00 0.00 0.117699
4 14.00 0.00 10.00 0.00 2.00 0.00 6.00 0.00 0.37657
4 18.00 0.00 16.00 0.00 0.00 0.00 2.00 0.00 0.0113021
4 14.00 0.00 12.00 0.00 4.00 0.00 6.00 0.00 0.0705695
4 16.

***Equation 1:*** 

$$\rho_a = \pi a n (n+1)(n+2) \frac{V}{I}$$

Where: 
* ***$\rho_a$*** is the apparent resistivity of our measurement
* ***a*** is the spacing between the two current electrodes (this is always the same as the spacing between the two potential electrodes)
* ***n*** is the factor by which the a-spacing is multiplied to get the distance between the innermost current electrode and the innermost potential electrode (see figure below)
* ***I*** is current (what we inject into the subsurface for each measurement)
* ***V*** is voltage (what we actually are measuring with the potential dipole)

<img style="float: left" src="../ERTSampleData/DipoleDipole_a_n.png" width="50%" height="50%" />


For example, if the current electrodes in the diagram above were located at 0 meters and 2 meters and the potential electrodes were at 8 and 10 meters, the a-spacing would be 2 and the n-factor would be 3. In this case, multiplying n * a gives 3 * 2 = 6, which is the spacing between the inner current electrode (C2, at 2 meters) and the inner potential electrode (P1, at 8 meters).

Additionally, Ohm’s law states:

***Equation 2:***

   $$V = IR$$

   or, alternatively

   $$R = \frac{V}{I}$$

Where:
* ***V*** = voltage (what we measure)
* ***I*** = current (what we inject into the subsurface for each measurement)
* ***R*** = resistance (what is output into the .dat file)
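
As a rough sketch of how Equations 1 and 2 fit together, the function below derives the a-spacing and n-factor from four electrode positions and converts a resistance to an apparent resistivity. The function name, its argument order, and the resistance of 0.5 ohm are made up for illustration; the electrode positions are the ones from the worked example above (current electrodes at 0 and 2 m, potential electrodes at 8 and 10 m).

```python
import math

def dipole_dipole_rho_a(c1, c2, p1, p2, resistance):
    """Return (a, n, rho_a) for a collinear dipole-dipole measurement.

    c1, c2: x positions of the current electrodes (m)
    p1, p2: x positions of the potential electrodes (m)
    resistance: R = V / I in ohms (Equation 2)
    """
    a = abs(c2 - c1)  # dipole length (a-spacing)
    if not math.isclose(a, abs(p2 - p1)):
        raise ValueError("current and potential dipoles must have the same length")
    # Gap between the innermost current and innermost potential electrodes
    gap = min(abs(c - p) for c in (c1, c2) for p in (p1, p2))
    n = gap / a
    rho_a = math.pi * a * n * (n + 1) * (n + 2) * resistance  # Equation 1
    return a, n, rho_a

# Hypothetical resistance of 0.5 ohm for the geometry in the example above
a, n, rho_a = dipole_dipole_rho_a(0.0, 2.0, 8.0, 10.0, resistance=0.5)
print(f"a = {a:g} m, n = {n:g}, rho_a = {rho_a:.2f} ohm-m")
```

The same function can be pointed at any line of the .dat file once you have worked out which columns hold the current and potential electrode positions.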

# Q1. For the first data point in the provided .dat file, provide the following values:
* The a-spacing
* The n-value 
* The apparent resistivity ($\rho_a$) of that data point (You should be able to calculate $\rho_a$ based on the a-spacing, n-value, and the above equations (Eq. 1 and 2)).

Explain how you got your answer (i.e., show your work).

---

Now let’s take a look at the GEOL451_ERTDataFile.txt file. 

Open it in a text editor or any other viewer you prefer.

The advantage of looking at the .txt file is that it provides the measurement parameters and some extra pieces of information that may be useful in understanding and manipulating your data.


In this file, there are 95 lines of metadata before we reach the actual data. About 5 lines up from the start of the data itself (at around line 90) are two lines that say:
  - StackLimitsHigh: 5
  - StackLimitsLow: 2 

When ERT data is collected, current is injected into the ground from one current electrode to the other and a measurement is taken. Then, after a pause, current is injected in the opposite direction. (Among other things, this prevents the electrodes from becoming polarized.) This is one cycle of measurement. Most of the time, we “stack” our data by running multiple cycles and comparing them to one another. If the error between those measurements is small (the acceptable “Error Limit” used by the instrument is one of the parameters shown in the .txt file), the measurement is accepted. Otherwise, additional cycles are run until the error falls within the limit or we reach the high stack limit.
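
The stacking logic described above can be sketched in a few lines of Python. This is only an illustration: the function name, the `read_cycle` callable, the 1% error limit, and the use of standard deviation relative to the mean as the "error" are all assumptions, not the instrument's documented algorithm. Only the stack limits of 2 and 5 come from the .txt file.

```python
import statistics

def stack_measurement(read_cycle, stack_low=2, stack_high=5, error_limit_pct=1.0):
    """Repeat measurement cycles until the readings agree or the high limit is hit.

    read_cycle: callable returning one resistance reading per cycle (hypothetical)
    Returns (mean_reading, percent_variation, cycles_used).
    """
    readings = []
    while True:
        readings.append(read_cycle())
        cycles = len(readings)
        if cycles < stack_low:
            continue  # always take at least the minimum number of cycles
        mean = statistics.fmean(readings)
        # Percent variation: sample standard deviation relative to the mean
        variation = 100.0 * statistics.stdev(readings) / abs(mean)
        if variation <= error_limit_pct or cycles >= stack_high:
            return mean, variation, cycles

# Made-up readings: a quiet measurement and a noisy one
quiet = iter([0.6150, 0.6160])
noisy = iter([0.60, 0.66, 0.58, 0.64, 0.61])
quiet_result = stack_measurement(lambda: next(quiet))
noisy_result = stack_measurement(lambda: next(noisy))
print("cycles used:", quiet_result[2], "and", noisy_result[2])
```

In this sketch the quiet measurement is accepted at the low stack limit, while the noisy one keeps cycling until it reaches the high stack limit and is recorded with a larger percent variation.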

To take a closer look at the data, let’s use Python. We will read in the .txt file using the following code.

Set the `filePath_to_txt_data_file` variable equal to the filepath to the .txt file containing the ERT data in the *ERTSampleData* folder.

In [14]:
import pandas as pd
filePath_to_txt_data_file = "/workspaces/GEOL451/ERT/ERTSampleData/GEOL451_700E_ERT.txt"
data = pd.read_csv(filePath_to_txt_data_file, skiprows=95, delimiter='\t', nrows=1980, index_col='N').iloc[:, :36]
data  # This should be a DataFrame with 1980 rows and 36 columns containing all your data

| N | Time | MeasID | Seq# | DPID | Channel | A(x) | A(y) | A(z) | B(x) | B(y) | ... | Var(%) | App.R(Ohmm) | Cycles | Ready? | Pint(V) | Pext(V) | Temp(C) | Latitude | Longitude | Pos.Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021-06-12 19:26:59 | 355 | 0 | 9110 | 1 | 6.0 | 0.0 | 0.0 | 4.0 | 0.0 | ... | 0.0 | 23.2099 | 2 | 1 | 0.0 | 12.6 | 59.5 | 40.313000 | -88.326733 | 3 |
| 2 | 2021-06-12 19:26:56 | 354 | 0 | 9679 | 1 | 8.0 | 0.0 | 0.0 | 6.0 | 0.0 | ... | 0.1 | 25.1665 | 2 | 1 | 0.0 | 12.4 | 59.6 | 40.312998 | -88.326733 | 3 |
| 3 | 2021-06-12 19:26:56 | 354 | 0 | 8254 | 2 | 8.0 | 0.0 | 0.0 | 6.0 | 0.0 | ... | 0.0 | 24.7302 | 2 | 1 | 0.0 | 12.4 | 59.6 | 40.312998 | -88.326733 | 3 |
| 4 | 2021-06-12 19:26:53 | 353 | 0 | 8258 | 1 | 10.0 | 0.0 | 0.0 | 8.0 | 0.0 | ... | 0.1 | 24.8608 | 2 | 1 | 0.0 | 12.4 | 59.5 | 40.312998 | -88.326733 | 3 |
| 5 | 2021-06-12 19:26:46 | 351 | 0 | 8250 | 1 | 12.0 | 0.0 | 0.0 | 8.0 | 0.0 | ... | 0.0 | 26.3342 | 2 | 1 | 0.0 | 12.4 | 59.5 | 40.313015 | -88.326738 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1976 | 2021-06-12 19:07:41 | 14 | 0 | 9725 | 4 | 124.0 | 0.0 | 0.0 | 122.0 | 0.0 | ... | 0.0 | 51.8723 | 2 | 1 | 0.0 | 12.5 | 53.6 | 40.313005 | -88.326728 | 3 |
| 1977 | 2021-06-12 19:07:18 | 7 | 0 | 8036 | 7 | 126.0 | 0.0 | 0.0 | 124.0 | 0.0 | ... | 0.0 | 49.8305 | 2 | 1 | 0.0 | 12.7 | 53.6 | 40.313015 | -88.326733 | 3 |
| 1978 | 2021-06-12 19:07:45 | 15 | 0 | 9371 | 1 | 124.0 | 0.0 | 0.0 | 122.0 | 0.0 | ... | 0.0 | 50.2085 | 2 | 1 | 0.0 | 12.5 | 53.6 | 40.313003 | -88.326728 | 3 |
| 1979 | 2021-06-12 19:07:18 | 7 | 0 | 9649 | 4 | 126.0 | 0.0 | 0.0 | 124.0 | 0.0 | ... | 0.1 | 49.3022 | 2 | 1 | 0.0 | 12.7 | 53.6 | 40.313015 | -88.326733 | 3 |


We will look at the number of cycles that each measurement took. To do this, use the following command:

In [None]:
data['Cycles'].value_counts()

# Q2. What number of cycles is most common? Second most common? Why might this be?

---

Now, we can use Python to split our data into two subsets: one containing the measurements that used the highest number of cycles (5), and one containing all other data. Use the following code.

In [22]:
hiStackData = data[data['Cycles'] == 5]
otherData = data[data['Cycles'] != 5]
print('Mean variance of 5-stacked data:', hiStackData['Var(%)'].mean())
print('Mean variance of all other data:', otherData['Var(%)'].mean())

Mean variance of 5-stacked data: 1.224590163934426
Mean variance of all other data: 0.21141219385096408


# Q3. How do the two values (average percent variation of the two data subsets) compare to one another? What might this indicate?

___

Running a simple command (for example, `hiStackData.index`, since the measurement number `N` is the DataFrame's index rather than a regular column) would let us identify the measurements that required a high number of stacks, and we could use this information to remove them from the .dat file.

For now, we will continue to work with our raw data, but it is helpful to be aware of ways to manipulate and clean it. Cleaning the data manually like this takes some care, since we would then need to write the data back out in the proper format before reading it into ResiPy. Fortunately, ResiPy has several commands that achieve similar results.
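
To give a sense of what the manual route involves, here is a minimal sketch that drops selected measurements from .dat-style text and updates the declared number of data points. It rests on several assumptions: that the file has nine header lines with the data count on the seventh (as the printed file above appears to), that measurement number `N` in the .txt file matches the order of the data lines in the .dat file, and that any lines after the data block should be passed through unchanged. The function name is hypothetical, and the sample text is a shortened copy of the file with its count changed to 3.

```python
def drop_measurements(dat_text, drop_numbers, header_lines=9, count_line=6):
    """Remove selected measurements from .dat-style text and fix the data count.

    drop_numbers: 1-based positions of the data lines to remove
    header_lines, count_line: assumed file layout (0-based index for count_line)
    """
    lines = dat_text.splitlines()
    header, body = lines[:header_lines], lines[header_lines:]
    n_data = int(header[count_line])
    data, trailer = body[:n_data], body[n_data:]
    drop = set(drop_numbers)
    kept = [line for i, line in enumerate(data, start=1) if i not in drop]
    header[count_line] = str(len(kept))  # keep the declared count in sync
    return "\n".join(header + kept + trailer) + "\n"

# Shortened sample: three data lines, with the count line edited to match
sample = """!CLASSPROJECT FISHER451
2.00
11
3
Type of measurement (0=app.resistivity,1=resistance)
1
3
2
0
4 6.00 0.00 4.00 0.00 0.00 0.00 2.00 0.00 0.615662
4 8.00 0.00 6.00 0.00 0.00 0.00 2.00 0.00 0.16689
4 8.00 0.00 6.00 0.00 2.00 0.00 4.00 0.00 0.655988
"""

cleaned = drop_measurements(sample, drop_numbers=[2])
print(cleaned)
```

With the real data, `drop_numbers` could be the index of the high-stack subset, provided the numbering assumption above holds for your file.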

Now that we have looked at a few possible ways to identify potentially erroneous measurements in our data, let’s do all the steps necessary to run an inversion. 

# Inversion using Resipy

## Run this cell to setup your notebook

In [None]:
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pandas as pd

import numpy as np # numpy for electrode generation
from resipy import Project

#This adjusts the plot settings in the notebook
plt.rcParams['xtick.bottom'] = plt.rcParams['xtick.labelbottom'] = False
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True

plt.rcParams["figure.figsize"] = (25, 8) #Adjust these values to adjust plot sizes. (1st value is horizontal/width, 2nd is vertical/height)

Now, let's read our data file into a Resipy Project. You will need to specify the filepath of your .dat ERT data file (GEOL451_700E_ERT.dat).

In [None]:
ERTFile = 

kIN = Project(typ='R2') #create a Project object in your working directory
kIN.createSurvey(fname=ERTFile, ftype='ResInv') # read the .dat data file
k = kIN # NOTE: this binds a second name to the same Project; it is not an independent copy, so edits made through k also affect kIN
k.showPseudo() # Show a plot of our pseudosection

The last command here plots a pseudosection. A pseudosection is so-called because:

* the points are plotted with pseudo-depth that is indicative of an approximate depth of measurement but does not take into consideration the fact that the measurement includes information about how the electrical field travelled over the entire subsurface region between the dipoles (in the case of dipole-dipole), and 
* it does not indicate the actual resistivity at the location of the point, but only the value of the calculated resistance or apparent resistivity as measured by that specific data point. Again, this is a bulk measurement that fails to consider the fact that the measurement encapsulates the entire region between the dipoles.

However, these are still useful plots to give us a general idea of the distribution of resistivity across the profile and to give us a first glance at the quality of data. In general, it is important not to try to interpret much with respect to the geology of the site from a pseudosection.
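
To make the idea of pseudo-depth concrete, the sketch below places a dipole-dipole measurement using one common plotting convention: the point sits horizontally midway between the two dipole centers, at a depth equal to half the distance between them (where 45-degree lines drawn down from the dipole centers would cross). This is an illustrative assumption; the pseudo-depth rule that `showPseudo()` actually uses may differ, and the function name is hypothetical.

```python
def pseudosection_point(c1, c2, p1, p2):
    """Return (x, pseudo_depth) for a collinear dipole-dipole measurement."""
    current_mid = (c1 + c2) / 2.0
    potential_mid = (p1 + p2) / 2.0
    x = (current_mid + potential_mid) / 2.0
    pseudo_depth = abs(potential_mid - current_mid) / 2.0
    return x, pseudo_depth

# Same geometry as the earlier example: C at 0 and 2 m, P at 8 and 10 m
x, depth = pseudosection_point(0.0, 2.0, 8.0, 10.0)
print(f"plot at x = {x} m, pseudo-depth = {depth} m")
```

Note that the depth depends only on electrode geometry, not on anything measured, which is part of why a pseudosection should not be read as a cross-section of the ground.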

Now, using steps shown in the lecture on ERT Data Processing and the outputs that derive from these steps, answer the following questions. For any plots, you may right click on the plot itself in Colab and save or copy the file. You can also take a screenshot of the plot if you prefer. 

# Q4. Show the plot of your raw pseudosection. How many measurements does this file contain? How many electrodes were used to collect the data?

---

## Preprocess data to get x, y, z location

In the ERT Data Preprocessing lecture, the ERT Data Processing lecture, and in the provided notebook at ***/workspaces/GEOL451/ERT/ERTProcessing/DEMO_ERTProcessing.ipynb***, you are shown how to: 
* Extract elevation values along your profile in QGIS
* Format the elevation values in a spreadsheet software
* Reformat/interpolate the data for use in Resipy in python
* Import the electrode information (with elevations) into your project (i.e., use `k.importElec()` to update your electrodes with elevation information) in Resipy

**Carry out the above steps** for the profile delineated by the GPS points in the file at `"ERT/ERTSampleData/GEOL451_700E_ERT_GPSPts.csv"`
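
The interpolation step can be sketched in plain Python as below. The elevation samples and electrode positions are invented for illustration (only the 2 m electrode spacing echoes the data file), and the function name is hypothetical. With NumPy installed, `np.interp(electrode_x, profile_x, profile_z)` performs the same linear interpolation, including holding the end values constant outside the sampled range.

```python
from bisect import bisect_left

def interpolate_elevation(profile_x, profile_z, electrode_x):
    """Linearly interpolate elevation at each electrode position.

    profile_x must be sorted in increasing order; positions outside the
    sampled range are clamped to the nearest end value.
    """
    elevations = []
    for x in electrode_x:
        if x <= profile_x[0]:
            elevations.append(profile_z[0])
        elif x >= profile_x[-1]:
            elevations.append(profile_z[-1])
        else:
            i = bisect_left(profile_x, x)  # first sample at or beyond x
            x0, x1 = profile_x[i - 1], profile_x[i]
            z0, z1 = profile_z[i - 1], profile_z[i]
            elevations.append(z0 + (z1 - z0) * (x - x0) / (x1 - x0))
    return elevations

# Made-up elevation samples every 10 m along the profile
profile_x = [0.0, 10.0, 20.0]
profile_z = [200.0, 202.0, 201.0]
electrode_x = [0.0, 2.0, 4.0, 12.0]

electrode_z = interpolate_elevation(profile_x, profile_z, electrode_x)
for x, z in zip(electrode_x, electrode_z):
    print(f"x = {x:5.1f} m, z = {z:.2f} m")
```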

After you have read in the elevation data, answer the following questions:

# Q5. Create a plot of the elevation (z) data that you imported into your project and include it in your answer. Use your updated `k.elec` DataFrame to answer the following questions:
* What is the maximum elevation value?  (use pandas to find the exact maximum value programmatically, do not estimate from the plot)
* What is the minimum elevation value? (use pandas to find the exact minimum value programmatically, do not estimate from the plot)

---

Before we run the inversion on our data, let’s practice making minor edits to the data. 

As in the lecture, let’s filter the data using the `k.filterAppResist()` function. Use the following to filter your data:
* Set the minimum value at 0 ohm-meters 
* Set the maximum value at 75 ohm-meters. 

For the environment and conditions in which this data was collected, we will assume that clayey sediment might have a resistivity of 0-40 ohm-meters and sandy sediment might have a resistivity of 30-100 ohm-meters. (This is probably fairly accurate, but it is not exact and depends on a number of outside factors.)
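
As a plain-Python illustration of what a minimum/maximum filter does, the sketch below keeps values within a range and reports how many were removed. The apparent resistivity values are invented, the function name is hypothetical, and treating the bounds as inclusive is an assumption; check the ResiPy documentation for how `filterAppResist()` handles values exactly at the limits.

```python
def filter_by_range(values, vmin, vmax):
    """Return (kept_values, number_removed) for an inclusive [vmin, vmax] range."""
    kept = [v for v in values if vmin <= v <= vmax]
    return kept, len(values) - len(kept)

# Made-up apparent resistivities in ohm-meters
app_res = [23.2, 25.2, 24.7, 81.4, 26.3, -3.1, 51.9]
kept, removed = filter_by_range(app_res, vmin=0, vmax=75)
print(f"kept {len(kept)} points, removed {removed}")
```

Comparing the removed values against the assumed clay and sand ranges is a quick way to judge whether a filter is discarding plausible data or genuine outliers.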

# Q6. How many points did this step remove (setting minimum and maximum values)? Assuming the statement above about grain-size and the maximum and minimum values you found above, comment on whether the data we removed had unrealistically high apparent resistivity values. Or, put another way, how important was it to remove that data to ensure we are using realistic data for the inversion?

---

Create a model mesh of your choice. By default, ResiPy will create a triangular mesh. You may add the cl_factor argument to the command you use to create the mesh, but you may also use the default values if you wish.

Now, you can run an inversion. Again, you can experiment with the inversion parameters as much as you like, but the default values will also produce a satisfactory inversion. After you are done, use the `showResults()` command as in the lecture to plot your results.

# Q7. Copy or otherwise include the plot showing your inverted data, with the following specifications: please clip the corners, and crop the max depth. 
You may use whatever colormap you prefer (see [here](https://matplotlib.org/stable/gallery/color/colormap_reference.html) for options available to you). As needed, select minimum and maximum values for the colormap that are reasonable, and which adequately display the variation in the subsurface, or use the default values. Easily available options for manipulating the plot are listed in the Resipy API reference [here](https://resipy.org/api.html#resipy.Project.Project.showResults).

---

Now that you have run an inversion, ResiPy will have calculated the error between the data that was collected in the field and the data produced using forward modeling calculations derived from the geologic model. You can use these error calculations to see where the model was unable to fully capture the details of your original data, given the constraints of the mathematics and your input parameters. In the lecture we showed two ways to view these errors. 

For this next question, use the `showInvError()` method to visualize the model error. Then, use the `filterInvError()` command to filter out the datapoints where the normalized error is greater than 3 or less than -3. Invert your data again. 
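
The arithmetic behind an RMS misfit and a normalized-error cutoff can be sketched as follows. The error values are invented, and the assumption that the misfit is the root-mean-square of the normalized errors is a simplification for illustration; the value ResiPy reports comes from the inversion itself. Note also that this only recomputes the RMS over existing residuals, whereas a real second inversion produces a new model and new residuals.

```python
import math

def rms(errors):
    """Root-mean-square of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up normalized errors for seven data points
normalized_errors = [0.4, -1.2, 3.8, 0.9, -4.5, 1.1, -0.3]

# Keep only points whose normalized error lies within [-3, 3]
kept = [e for e in normalized_errors if -3 <= e <= 3]

rms_before = rms(normalized_errors)
rms_after = rms(kept)
print(f"RMS before: {rms_before:.2f}, after: {rms_after:.2f}, "
      f"points removed: {len(normalized_errors) - len(kept)}")
```

Because the largest errors dominate the sum of squares, removing a few poorly fit points lowers the RMS sharply, which is worth keeping in mind when deciding whether a lower misfit really means a better model.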

# Q8. Please write down the values for the Final RMS Misfit from your original inversion and from this second inversion. How do they compare? 
You do not need to include a copy of the plot of this inversion, but comment on how it compares to the previous inversion. (Some questions you might think about in answering: is this new inversion “better”, and what might constitute a “better” inversion?)

---

For this final question, use the plot from the first set of inverted data that was included in Q7 as your reference.

# Q9. We have well data at this site that you will work with in a later exercise. With that well data, we will plot the depth/thickness of a layer of clay near the surface. Is your inverted model consistent with a surface clay layer along at least part of the site? Explain.

---