
<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>

## Subsurface Data Analytics 

### Tabular Data Structures / DataFrames in Python 

#### Michael Pyrcz, Associate Professor, University of Texas at Austin 

##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

### Exercise: Tabular Data Structures / DataFrames in Python 

This is a tutorial for / demonstration of **Tabular Data Structures in Python**.  In Python, the common tool for dealing with Tabular Data Structures is the DataFrame from the pandas Python package. 

* Tabular Data in subsurface data analytics includes any data set with a limited number of samples as opposed to gridded maps that provide exhaustively sampled data.

This tutorial includes the methods and operations that would commonly be required by Geoscientists, Engineers and Data Scientists working with Tabular Data Structures for the purpose of:

1. Data Checking and Cleaning
2. Data Mining / Inferential Data Analysis
3. Data Analytics / Building Predictive Models with Geostatistics and Machine Learning

Learning to work with Pandas DataFrames is essential for dealing with tabular data (e.g. well data) in subsurface modeling workflows and for subsurface machine learning.

##### Tabular Data Structures

In Python we will commonly store our data in two formats, tables and arrays.  For sampled data with typically multiple features $1,\ldots,m$ over $1,\ldots,n$ samples we will work with tables.  For exhaustive maps and models usually representing a single feature on a regular grid over $1,\ldots,n_{i}$ for $i = 1,\ldots,n_{dim}$ we will work with arrays.

The pandas package provides a convenient DataFrame object for working with data in a table and the numpy package provides a convenient ndarray object for working with gridded data. In the following tutorial we will focus on DataFrames although we will utilize ndarrays a couple of times.  There is another section on Gridded Data Structures that focuses on ndarrays.
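
As a minimal sketch of the two containers, a table holds several features over the samples while an array holds a single feature on a regular grid. The feature names, values and grid size here are made up for illustration and are not taken from the tutorial dataset.

```python
import numpy as np
import pandas as pd

# hypothetical sampled data: n = 3 samples, m = 2 features
table = pd.DataFrame({'porosity': [0.12, 0.15, 0.19],
                      'perm': [6.2, 6.3, 92.3]})

# hypothetical exhaustive 2D map of one feature on a 4 x 5 regular grid
grid = np.zeros((4, 5))

print(table.shape)   # (number of samples, number of features)
print(grid.shape)    # (number of cells in each dimension)
```

The 'shape' attribute is available on both objects, which makes it a quick first check after loading either kind of data.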

#### Additional Resources

These workflows are based on standard methods with their associated limitations and assumptions. For more information see:

* [pandas DataFrames Lecture](https://www.youtube.com/watch?v=cggieFcKdiM&list=PLG19vXLQHvSDUmEOmBoaxGbFAbvaLdfx4&index=8&t=0s)

I have provided various workflows for subsurface data analytics, geostatistics and machine learning:

* [Python](https://git.io/fh4eX)

* [Excel](https://github.com/GeostatsGuy/LectureExercises/blob/master/Lecture7_CI_Hypoth_eg_R.xlsx) 
* [R](https://github.com/GeostatsGuy/LectureExercises/blob/master/Lecture7_CI_Hypoth_eg.R)  

and all of my University of Texas at Austin 

* [Lectures](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig/featured?view_as=subscriber)

#### Workflow Goals

Learn the basics for working with Tabular Data Structures in Python. This includes:

* Loading tabular data
* Visualizing tabular data
* Data QC and Cleaning
* Interacting with the tabular data

#### Objective 

I want to provide hands-on experience with building subsurface modeling workflows. Python provides an excellent vehicle to accomplish this. I have coded a package called GeostatsPy with GSLIB: Geostatistical Library (Deutsch and Journel, 1998) functionality that provides basic building blocks for building subsurface modeling workflows. 

The objective is to remove the hurdles of subsurface modeling workflow construction by providing building blocks and sufficient examples. This is not a coding class per se, but we need the ability to 'script' workflows working with numerical methods.    

#### Getting Started

Here are the steps to get set up in Python with the GeostatsPy package:

1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). 
2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. 
3. In the terminal type: pip install geostatspy. 
4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. 

You will need to copy the data file to your working directory.  It is available here:

* Tabular data - 2D_MV_200wells.csv [here](https://github.com/GeostatsGuy/GeoDataSets/blob/master/2D_MV_200wells.csv).

I have put together various subsurface workflows for data analytics, geostatistics and machine learning. Go [here](https://git.io/fh4eX) for other example workflows and source code. 

#### Load the required libraries

The following code loads the required libraries.


In [68]:
import os                                                   # to set current working directory 
import numpy as np                                          # arrays and matrix math
import pandas as pd                                         # DataFrames

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing `python -m pip install [package-name]`. More assistance is available with the respective package docs.  



#### Set the working directory

I always like to do this so I don't lose files and to simplify subsequent reads and writes (avoid including the full address each time).  Also, in this case make sure to place the required (see below) data file in this directory.  When we are finished with this tutorial we will write our new dataset back to this directory.  

In [94]:
os.chdir("c:/PGE383")                                       # set the working directory

#### Loading Data 

Let's load the provided multivariate, spatial dataset.  '2D_MV_200wells.csv' is available [here](https://github.com/GeostatsGuy/GeoDataSets/blob/master/2D_MV_200wells.csv).  It is a comma delimited file with: 

* X and Y coordinates ($m$)
* facies 1 and 2 (1 is sandstone and 2 interbedded sand and mudstone)
* porosity (fraction)
* permeability ($mD$)
* acoustic impedance ($\frac{kg}{m^3} \cdot \frac{m}{s} \cdot 10^6$). 

We load it with the pandas 'read_csv' function into a data frame we called 'df' and then preview it by printing a slice and by utilizing the 'head' DataFrame member function (with a nice and clean format, see below).

**Python Tip: using functions from a package** just type the label for the package that we declared at the beginning:

```python
import pandas as pd
```

so we can access the pandas function 'read_csv' with the command: 

```python
pd.read_csv()
```

but `read_csv` has required input parameters. The essential one is the name of the file. For our circumstance all the other default parameters are fine. If you want to see all the possible parameters for this function, just go to the docs [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).  

* The docs are always helpful
* There is often a lot of flexibility for Python functions, possible through using various input parameters

Also, the function has an output, a pandas DataFrame loaded from the data.  So we have to specify the name / variable representing that new object.

```python
df = pd.read_csv("2D_MV_200wells.csv")  
```

Let's run this command to load the data and then look at the resulting DataFrame to ensure that we loaded it.

In [95]:
df = pd.read_csv("2D_MV_200wells.csv")                      # read a .csv file in as a DataFrame
#print(df.iloc[0:5,:])                                      # display first 5 samples in the table as a preview
df.head()                                                   # we could also use this command for a table preview 

Unnamed: 0,X,Y,facies_threshold_0.3,porosity,permeability,acoustic_impedance
0,565,1485,1,0.1184,6.17,2.009
1,2585,1185,1,0.1566,6.275,2.864
2,2065,2865,2,0.192,92.297,3.524
3,3575,2655,1,0.1621,9.048,2.157
4,1835,35,1,0.1766,7.123,3.979


#### Summary Statistics

It is useful to review the summary statistics of our loaded DataFrame.  That can be accomplished with the 'describe' DataFrame member function.  Each column of the result summarizes one feature; appending '.transpose()' would switch the axes if that is easier to read.

In [96]:
df.describe()

Unnamed: 0,X,Y,facies_threshold_0.3,porosity,permeability,acoustic_impedance
count,200.0,200.0,200.0,200.0,200.0,200.0
mean,2053.4,1876.15,1.33,0.1493,25.287462,3.000435
std,1113.524641,1137.58016,0.471393,0.032948,64.470135,0.592201
min,25.0,35.0,1.0,0.05,0.01582,2.009
25%,1112.5,920.0,1.0,0.132175,1.36675,2.48325
50%,2160.0,1855.0,1.0,0.15015,4.8255,2.9645
75%,2915.0,2782.5,2.0,0.1742,14.597,3.527
max,3955.0,3995.0,2.0,0.2232,463.641,3.984
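
The summary table can also be transposed so that each feature becomes a row, which is often easier to scan when there are many features. This is a small sketch with made-up values; 'df_demo' is a hypothetical stand-in for the tutorial DataFrame.

```python
import pandas as pd

# hypothetical values, used only to show the layout of the summary table
df_demo = pd.DataFrame({'porosity': [0.10, 0.15, 0.20],
                        'perm': [1.0, 10.0, 100.0]})

summary = df_demo.describe()                  # one column per feature
summary_t = df_demo.describe().transpose()    # one row per feature
print(summary_t[['count', 'mean', 'min', 'max']])
```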


#### Rename a Variable / Features

Let's rename the facies, permeability and acoustic impedance for convenience.

In [97]:
df = df.rename(columns={'facies_threshold_0.3': 'facies','permeability':'perm','acoustic_impedance':'ai'}) # rename columns of the DataFrame
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai
0,565,1485,1,0.1184,6.17,2.009
1,2585,1185,1,0.1566,6.275,2.864
2,2065,2865,2,0.192,92.297,3.524
3,3575,2655,1,0.1621,9.048,2.157
4,1835,35,1,0.1766,7.123,3.979


#### Slicing and Subsets

It is straightforward to extract subsets from a DataFrame to make a new DataFrame.  

* This is useful for cleaning up data by removing features that are no longer of interest.  

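If the samples are in random order then the first $n_{s}$ samples are a random sample of size $n_{s}$.  Below we make a new DataFrame, 'df_subset', with rows 0 to 4 and columns 2 to 5 (facies, porosity, perm and ai), so the **X and Y coordinates are removed**.  Note that the end of an 'iloc' slice is excluded and a slice that runs past the last column, like '2:7' here, simply stops at the last column.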

In [98]:
df_subset = df.iloc[0:5,2:7]                                # make a new dataframe with just the first 5 samples and no X,Y
print(df_subset)

   facies  porosity    perm     ai
0       1    0.1184   6.170  2.009
1       1    0.1566   6.275  2.864
2       2    0.1920  92.297  3.524
3       1    0.1621   9.048  2.157
4       1    0.1766   7.123  3.979
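
The position-based 'iloc' used above has a label-based counterpart, 'loc'. The sketch below extracts the same subset both ways from a hypothetical 3-sample table; 'df_demo' and its values are made up for illustration.

```python
import pandas as pd

# hypothetical 3-sample table
df_demo = pd.DataFrame({'X': [565, 2585, 2065],
                        'Y': [1485, 1185, 2865],
                        'porosity': [0.1184, 0.1566, 0.1920],
                        'perm': [6.170, 6.275, 92.297]})

by_position = df_demo.iloc[0:2, 2:4]               # rows 0-1, columns 2-3; end is excluded
by_label = df_demo.loc[0:1, ['porosity', 'perm']]  # labels 0 to 1; end is included
print(by_position.equals(by_label))
```

The two slices only agree because the index labels here still match the row positions; once rows are dropped or sorted, 'loc' and 'iloc' address different rows.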


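Let's demonstrate some more complicated slicing options.  We demonstrate two methods:

* list the exact indexes that you want

```python
df_subset2 = df.iloc[[0,2,4,5,10,43],:]      # extract rows 0,2,4,5...,43 for all columns
```

* combine a slice of the rows with a list of the column indexes that you want, e.g. `df.iloc[2:,[2,4,5]]` for all rows from 2 onward and columns 2, 4 and 5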

In [99]:
df_subset2 = df.iloc[[0,2,4,5,10,43],:]       # new dataframe with samples 0, 2 ,...,43 and all features
print(df_subset2)

df_subset3 = df.iloc[2:,[2,4,5]]              # new dataframe with all samples from 2 and features 2,4,5
print(df_subset3)

       X     Y  facies  porosity    perm     ai
0    565  1485       1    0.1184   6.170  2.009
2   2065  2865       2    0.1920  92.297  3.524
4   1835    35       1    0.1766   7.123  3.979
5   3375  2525       1    0.1239   1.468  2.337
10  2125  1105       1    0.1369   3.693  3.627
43  1785  1045       1    0.1517   2.766  3.562
     facies       perm     ai
2         2   92.29700  3.524
3         1    9.04800  2.157
4         1    7.12300  3.979
5         1    1.46800  2.337
6         1   31.93300  3.491
7         2  116.78100  2.187
8         1    3.00300  2.048
9         1    5.21300  2.251
10        1    3.69300  3.627
11        1    0.26270  2.860
12        1    9.91400  2.742
13        1   14.31100  3.045
14        1    0.77310  2.323
15        2   22.57800  2.711
16        2   18.74300  2.583
17        1    9.09200  3.801
18        2    0.46420  3.771
19        1  146.31900  2.341
20        1    0.06073  3.872
21        1   31.18100  2.316
22        1    1.43200  3.081
23  

#### Adding a Variable / Features

It is also easy to add a column to our data frame.  

* Note, we assume that the array is in the same order as the DataFrame.  This could be an issue if any rows were removed from either before adding etc.  

To demonstrate we make a 1D numpy array of zeros using the 'zeros' function and add it to our DataFrame with the feature name indicated as 'zero'.

In [100]:
zeros = np.zeros(200)                                       # make an array of zeros
df['zero'] = pd.Series(zeros)                               # add the array to our DataFrame
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,zero
0,565,1485,1,0.1184,6.17,2.009,0.0
1,2585,1185,1,0.1566,6.275,2.864,0.0
2,2065,2865,2,0.192,92.297,3.524,0.0
3,3575,2655,1,0.1621,9.048,2.157,0.0
4,1835,35,1,0.1766,7.123,3.979,0.0
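
The ordering note above matters most after rows have been removed. As a minimal sketch on a hypothetical table ('df_demo' and its values are made up), a numpy array is assigned in row order while a pandas Series is matched on index labels:

```python
import numpy as np
import pandas as pd

# hypothetical table whose sample 1 has been removed, so the index is 0, 2, 3
df_demo = pd.DataFrame({'porosity': [0.12, 0.15, 0.19, 0.16]}).drop(1)

values = np.array([10.0, 20.0, 30.0])         # one value per remaining row

df_demo['by_position'] = values               # ndarray: assigned in row order
df_demo['by_label'] = pd.Series(values)       # Series: matched on index labels 0, 1, 2
print(df_demo)
```

In the 'by_label' column the row with index 3 ends up as NaN because the new Series has no label 3, and the value stored under label 1 is silently discarded.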


We can also remove unwanted columns without having to subset the DataFrame.  

* That's why we just added a column of zeros, I wanted to also demonstrate removing a column.

We do this with the 'drop' member function of the DataFrame object. We just have to give the column name and by indicating axis=1 we specify to drop a column instead of a row.

In [101]:
df = df.drop('zero',axis=1)                                      # remove the zero column
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai
0,565,1485,1,0.1184,6.17,2.009
1,2585,1185,1,0.1566,6.275,2.864
2,2065,2865,2,0.192,92.297,3.524
3,3575,2655,1,0.1621,9.048,2.157
4,1835,35,1,0.1766,7.123,3.979


#### Standardizing / Manipulating Variables / Features

We may want to make new features by using mathematical operators applied to existing features.

* We can use any combinations of features, constants, math and even other data tables

For example, we can make a porosity feature that is in percentage instead of fraction (called 'porosity100'), or a ratio of permeability divided by porosity (called 'permpor') that may be useful for subsequent calculations such as the Lorenz Coefficient.  

In [102]:
df['porosity100'] = df['porosity']*100                      # add a new column with porosity in percentage
df['permpor'] = df['perm']/df['porosity']           # add a new feature with ratio of perm / por 
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor
0,565,1485,1,0.1184,6.17,2.009,11.84,52.111486
1,2585,1185,1,0.1566,6.275,2.864,15.66,40.070243
2,2065,2865,2,0.192,92.297,3.524,19.2,480.713542
3,3575,2655,1,0.1621,9.048,2.157,16.21,55.817397
4,1835,35,1,0.1766,7.123,3.979,17.66,40.334088


#### Truncating and Categorizing Variables / Features

We could also use conditional statements when assigning values to a new feature.  

* We could use any condition with any combination of features and variables from any source

For example, we could have a categorical porosity measure for high and low porosity, called 'tporosity' for truncated porosity.

In [103]:
df['tporosity'] = np.where(df['porosity']>=0.12, 'high', 'low') # conditional statement assign a new feature
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity
0,565,1485,1,0.1184,6.17,2.009,11.84,52.111486,low
1,2585,1185,1,0.1566,6.275,2.864,15.66,40.070243,high
2,2065,2865,2,0.192,92.297,3.524,19.2,480.713542,high
3,3575,2655,1,0.1621,9.048,2.157,16.21,55.817397,high
4,1835,35,1,0.1766,7.123,3.979,17.66,40.334088,high
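
Conditions on several features can be combined before they are passed to 'np.where'. The sketch below uses a hypothetical table and arbitrary thresholds; the 'quality' feature name is made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical samples; the thresholds below are arbitrary
df_demo = pd.DataFrame({'porosity': [0.10, 0.15, 0.20],
                        'perm': [0.5, 50.0, 5.0]})

# element-wise conditions are combined with & (and) or | (or), each in parentheses
good = (df_demo['porosity'] >= 0.12) & (df_demo['perm'] >= 10.0)
df_demo['quality'] = np.where(good, 'good', 'poor')
print(df_demo)
```

The parentheses are required because '&' and '|' bind more tightly than the comparison operators in Python.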


#### Truncating to Remove Nonphysical Values

Here's an example where we use a conditional statement to assign a very low permeability value (0.0001 mD) for all porosity values below a threshold. Of course, this is for demonstration, in practice a much lower porosity threshold would likely be applied.  

In [104]:
df['perm_cutoff'] = np.where(df['porosity']>=0.12, df['perm'],0.0001) # conditional statement assign a new feature
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity,perm_cutoff
0,565,1485,1,0.1184,6.17,2.009,11.84,52.111486,low,0.0001
1,2585,1185,1,0.1566,6.275,2.864,15.66,40.070243,high,6.275
2,2065,2865,2,0.192,92.297,3.524,19.2,480.713542,high,92.297
3,3575,2655,1,0.1621,9.048,2.157,16.21,55.817397,high,9.048
4,1835,35,1,0.1766,7.123,3.979,17.66,40.334088,high,7.123


#### Dealing with Missing Data

What about missing or invalid values?  Let's assign a single porosity value to NaN, 'not a number', indicating a missing or erroneous value.  

**Python Tip: manipulating DataFrames manually**

We use this command to access any one of the individual records in our data table.

```python
df.at[irow,'feature name'] 
```

We can get the value or set the value to a new value:

```python
df.at[irow,'feature name'] = new_value
```

We will then check for the number of NaN values in our DataFrame.  Then we can search for and display the sample with the NaN porosity value.

In [105]:
df.at[1,'porosity'] = np.nan # let's give ourselves a NaN / missing value in our table (np.NaN was removed in NumPy 2.0)

# Count the number of samples with missing values
print('Number of null values in our DataFrame = ', str(df.isnull().sum().sum()))  
# Find the row(s) with missing values and look at them
nan_rows = df[df['porosity'].isnull()]                      # find the row with missing values
print(nan_rows)

Number of null values in our DataFrame =  1
      X     Y  facies  porosity   perm     ai  porosity100    permpor  \
1  2585  1185       1       NaN  6.275  2.864        15.66  40.070243   

  tporosity  perm_cutoff  
1      high        6.275  


We can see that sample 1 (see the '1' on the left hand side, that is the sample index in the table) has a NaN porosity value.  
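
When more than one feature has gaps, it helps to count the missing values per feature and to flag the affected samples. This is a minimal sketch on a hypothetical table; 'df_demo' and its values are made up.

```python
import numpy as np
import pandas as pd

# hypothetical table with one missing porosity and one missing permeability
df_demo = pd.DataFrame({'porosity': [0.12, np.nan, 0.19],
                        'perm': [6.2, 6.3, np.nan]})

print(df_demo.isnull().sum())          # missing values per feature
print(df_demo.isnull().sum().sum())    # missing values in the whole table
print(df_demo.isnull().any(axis=1))    # True for each sample with a missing value
```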

#### Drop Samples with Missing Values

Now we may choose to remove the sample with the NaN.  The 'dropna' DataFrame member function will remove all samples with NaN entries from the entire DataFrame.  By visualizing the index at the left of the DataFrame preview you can confirm that sample 1 is removed (it jumps from 0 to 2).

In [106]:
df = df.dropna()                                            # drop any rows (samples) with at least one missing value        
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity,perm_cutoff
0,565,1485,1,0.1184,6.17,2.009,11.84,52.111486,low,0.0001
2,2065,2865,2,0.192,92.297,3.524,19.2,480.713542,high,92.297
3,3575,2655,1,0.1621,9.048,2.157,16.21,55.817397,high,9.048
4,1835,35,1,0.1766,7.123,3.979,17.66,40.334088,high,7.123
5,3375,2525,1,0.1239,1.468,2.337,12.39,11.848265,high,1.468
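
Dropping every sample with any NaN can discard more data than needed. As a sketch on a hypothetical table ('df_demo' is made up), the 'subset' parameter of 'dropna' limits the check to the features that matter for the next step:

```python
import numpy as np
import pandas as pd

# hypothetical table with one missing porosity and one missing permeability
df_demo = pd.DataFrame({'porosity': [0.12, np.nan, 0.19],
                        'perm': [6.2, 6.3, np.nan]})

drop_any = df_demo.dropna()                     # drop samples missing any feature
drop_por = df_demo.dropna(subset=['porosity'])  # only require porosity
print(list(drop_any.index), list(drop_por.index))
```

Here only sample 0 survives the first call, while the second call also keeps sample 2, which has a valid porosity.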


#### Searching with Tabular Data

One could extract samples into a new DataFrame with multiple criteria.  This is shown below.

In [107]:
df_extract = df.loc[(df['porosity'] > 0.12) & (df['perm'] > 10.0)] # extract with multiple conditions to a new table
df_extract.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity,perm_cutoff
2,2065,2865,2,0.192,92.297,3.524,19.2,480.713542,high,92.297
6,2295,1325,1,0.179,31.933,3.491,17.9,178.396648,high,31.933
7,3715,3045,2,0.1914,116.781,2.187,19.14,610.141066,high,116.781
13,545,3765,1,0.1817,14.311,3.045,18.17,78.761695,high,14.311
15,1385,2415,2,0.1774,22.578,2.711,17.74,127.271702,high,22.578
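
The criteria can be stored as a boolean mask and reused, combined with '|' (or), or inverted with '~' (not). A minimal sketch on a hypothetical table, with made-up values and the same thresholds as above:

```python
import pandas as pd

# hypothetical samples
df_demo = pd.DataFrame({'porosity': [0.10, 0.15, 0.20, 0.11],
                        'perm': [0.5, 50.0, 5.0, 20.0]})

mask = (df_demo['porosity'] > 0.12) & (df_demo['perm'] > 10.0)
both = df_demo.loc[mask]                                   # both conditions hold
either = df_demo.loc[(df_demo['porosity'] > 0.12) | (df_demo['perm'] > 10.0)]
not_both = df_demo.loc[~mask]                              # complement of 'both'
print(list(both.index), list(either.index), list(not_both.index))
```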


#### Making Data Frames

We already covered the idea of making a DataFrame by loading data from a file. 

It is also possible to build a brand new DataFrame from a set of 1D arrays.  

* Note, they must have the same size and be sorted consistently.  

We will extract 'porosity' and 'perm' features as arrays.

**Python Tip: extracting data from DataFrames**

We can extract the data for a single feature with this command:

```python
series_1d = df['feature_name']
```

The 'series' retains information about the feature that was included in the DataFrame including the name and the indexing. This is fine, but some methods don't work with series so we can also extract the data as a 1D ndarray.  By adding the '.values' the series is converted to a 1D array.

```python
ndarray_1d = df['feature_name'].values
```

We then use the pandas DataFrame command to make a new DataFrame with each 1D array and the column names specified as 'porosity' and 'permeability'.

In [108]:
por = df['porosity'].values                                 # extract porosity column as vector
perm = df['perm'].values                                    # extract permeability column as vector
df_new = pd.DataFrame({'porosity': por, 'permeability': perm}) # make a new DataFrame from the vectors
df_new.head()

Unnamed: 0,porosity,permeability
0,0.1184,6.17
1,0.192,92.297
2,0.1621,9.048
3,0.1766,7.123
4,0.1239,1.468


#### Information About Our Tabular Data

We can reach in and retrieve the actual raw information in the DataFrame including the column names and actual values as a numpy array.  

* We can't edit them like this, but we can access and use this information.  

This includes:

1. 'index' with information about the index (i.e. index from start to stop with step)
2. 'columns' with the names of the features 
3. 'values' with the data table entries as a 2D array.  

Let's look at these components of our DataFrame:

In [109]:
print(df.index)                                             # get information about the index
print(df.columns)                                           # get the list of feature names
print(df.values)                                            # get the 2D array with all the table data

Int64Index([  0,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            190, 191, 192, 193, 194, 195, 196, 197, 198, 199],
           dtype='int64', length=199)
Index(['X', 'Y', 'facies', 'porosity', 'perm', 'ai', 'porosity100', 'permpor',
       'tporosity', 'perm_cutoff'],
      dtype='object')
[[565 1485 1 ... 52.111486486486484 'low' 0.0001]
 [2065 2865 2 ... 480.71354166666674 'high' 92.29700000000001]
 [3575 2655 1 ... 55.81739666872301 'high' 9.048]
 ...
 [375 1705 1 ... 18.198334595003786 'high' 2.404]
 [3795 535 1 ... 0.25968483256730135 'low' 0.0001]
 [3455 1645 1 ... 6.578073089700997 'high' 0.99]]


Here's a method for getting a list of the DataFrame feature names:

In [110]:
list(df)                                                    # get a list with the feature names

['X',
 'Y',
 'facies',
 'porosity',
 'perm',
 'ai',
 'porosity100',
 'permpor',
 'tporosity',
 'perm_cutoff']

#### More Precise Information from Tabular Data

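Let's interact with the DataFrame more surgically, one feature and sample at a time.  Here we retrieve the 4th column feature name and the porosity value for the row at position 1.  Since 'values' is indexed by position and we dropped sample 1 above, the row at position 1 is the sample with index label 2.  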

In [111]:
col2_name = df.columns[3]                                   # get the name of the 4th feature (porosity)
print(col2_name)                                          
por1 = df.values[1,3]                                       # get the value at row position 1 of the 4th feature (porosity)
print('Porosity value for sample number 1 is ' + str(por1) + '.') 

porosity
Porosity value for sample number 1 is 0.192.
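
The difference between a row position and an index label is easy to miss after rows have been dropped. The sketch below contrasts the two on a hypothetical table ('df_demo' is made up and reuses a few porosity values for illustration):

```python
import pandas as pd

# hypothetical table with sample 1 removed, so the index labels are 0, 2, 3
df_demo = pd.DataFrame({'porosity': [0.1184, 0.1566, 0.1920, 0.1621]}).drop(1)

by_position = df_demo['porosity'].values[1]    # second row in the table
by_iat = df_demo.iat[1, 0]                     # same, position-based accessor
by_label = df_demo.at[2, 'porosity']           # row with index label 2
print(by_position, by_iat, by_label)
```

All three expressions return the same record here; asking for label 1 with 'at' would instead raise a KeyError because that sample no longer exists.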


We can also manually change values.  

* We can use the 'at' pandas DataFrame member function to manually get and set individual records.  

We look up the porosity value for sample 2 and then we use the 'at' DataFrame member function again to change the value to 0.1000.  

In [112]:
por = df.at[2,'porosity']                               # get the value for sample 2 of the porosity feature
print('The value of porosity for sample 2 is ' + str(por) + '.')
df.at[2,'porosity'] = 0.10                              # set the value for sample 2 of the porosity feature
print('The value of porosity for sample 2 is now 0.1000.')
df.head()

The value of porosity for sample 2 is 0.192.
The value of porosity for sample 2 is now 0.1000.


Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity,perm_cutoff
0,565,1485,1,0.1184,6.17,2.009,11.84,52.111486,low,0.0001
2,2065,2865,2,0.1,92.297,3.524,19.2,480.713542,high,92.297
3,3575,2655,1,0.1621,9.048,2.157,16.21,55.817397,high,9.048
4,1835,35,1,0.1766,7.123,3.979,17.66,40.334088,high,7.123
5,3375,2525,1,0.1239,1.468,2.337,12.39,11.848265,high,1.468


#### Sorting a DataFrame

Let's try sorting our unconventional wells in descending order of the permeability to porosity ratio, 'permpor'.  An example application would be to calculate the Lorenz coefficient. 

We can use the command

```python
df.sort_values('permpor')
```

In [113]:
df = df.sort_values('permpor', ascending = False)
df.head(n=13)

Unnamed: 0,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity,perm_cutoff
97,1855,3025,2,0.2154,463.641,3.091,21.54,2152.465181,high,463.641
63,845,3915,1,0.1982,410.57,2.781,19.82,2071.493441,high,410.57
148,1975,2745,2,0.2158,361.704,3.839,21.58,1676.107507,high,361.704
138,1855,3095,2,0.2019,257.99,2.886,20.19,1277.810797,high,257.99
117,2665,3205,2,0.2159,273.98,2.551,21.59,1269.013432,high,273.98
55,1815,3345,2,0.1965,199.952,2.747,19.65,1017.56743,high,199.952
26,1785,3145,2,0.215,201.363,2.877,21.5,936.572093,high,201.363
129,505,475,1,0.1524,131.478,3.76,15.24,862.716535,high,131.478
19,2485,3525,1,0.1729,146.319,2.341,17.29,846.263736,high,146.319
127,1615,2285,2,0.1974,159.567,3.167,19.74,808.343465,high,159.567


#### Resetting Indices

The DataFrame indices are now out of order. Let's reset the DataFrame indices with:

```python
df.reset_index()
```

In [114]:
df = df.reset_index()
df

Unnamed: 0,index,X,Y,facies,porosity,perm,ai,porosity100,permpor,tporosity,perm_cutoff
0,97,1855,3025,2,0.21540,463.64100,3.091,21.540,2152.465181,high,463.6410
1,63,845,3915,1,0.19820,410.57000,2.781,19.820,2071.493441,high,410.5700
2,148,1975,2745,2,0.21580,361.70400,3.839,21.580,1676.107507,high,361.7040
3,138,1855,3095,2,0.20190,257.99000,2.886,20.190,1277.810797,high,257.9900
4,117,2665,3205,2,0.21590,273.98000,2.551,21.590,1269.013432,high,273.9800
5,55,1815,3345,2,0.19650,199.95200,2.747,19.650,1017.567430,high,199.9520
6,26,1785,3145,2,0.21500,201.36300,2.877,21.500,936.572093,high,201.3630
7,129,505,475,1,0.15240,131.47800,3.760,15.240,862.716535,high,131.4780
8,19,2485,3525,1,0.17290,146.31900,2.341,17.290,846.263736,high,146.3190
9,127,1615,2285,2,0.19740,159.56700,3.167,19.740,808.343465,high,159.5670
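
By default 'reset_index' keeps the old index labels in a new column called 'index', as seen in the output above. If the old labels are not needed, the 'drop' parameter discards them. A minimal sketch on a hypothetical table ('df_demo' and its values are made up):

```python
import pandas as pd

# hypothetical table, sorted so that the index is out of order
df_demo = pd.DataFrame({'permpor': [52.1, 480.7, 55.8]})
df_sorted = df_demo.sort_values('permpor', ascending=False)

kept = df_sorted.reset_index()               # old labels kept in a new 'index' column
dropped = df_sorted.reset_index(drop=True)   # old labels discarded
print(list(kept.columns), list(dropped.columns))
```

Keeping the old labels is useful when the samples need to be traced back to their original rows later.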


#### Shallow and Deep Copies

Let's explore the topic of shallow and deep copies with data frames.

* many Python methods assume shallow copies, and this confuses many people 

* this means when we do this:

```python
data_frame2 = data_frame
```

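we are creating a pointer to the original DataFrame, not a new DataFrame!  

* the pointer to the object is loosely known here as a shallow copy; strictly it is a second name for the same object, and pandas uses 'shallow copy' for the result of `data_frame.copy(deep=False)`

* anything we do through this second name will be done to the original object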

To demonstrate this, let's assign the DataFrame to a second name, 'df2'.  The preview below shows 'df_subset2', the slice that we extracted earlier.

In [87]:
df2 = df 
df_subset2.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai
0,565,1485,1,0.1184,6.17,2.009
2,2065,2865,2,0.192,92.297,3.524
4,1835,35,1,0.1766,7.123,3.979
5,3375,2525,1,0.1239,1.468,2.337
10,2125,1105,1,0.1369,3.693,3.627


Now let's go ahead and change the facies for sample '0' to 5

* we can use the .loc member function to change a single value in a Pandas DataFrame

In [88]:
df2.loc[0,'facies'] = 5
df2.head(n=5)

Unnamed: 0,X,Y,facies,porosity,perm,ai,random
0,565,1485,5,0.1184,6.17,2.009,0.640639
1,2585,1185,1,0.1566,6.275,2.864,0.155636
2,2065,2865,2,0.192,92.297,3.524,0.043813
3,3575,2655,1,0.1621,9.048,2.157,0.621904
4,1835,35,1,0.1766,7.123,3.979,0.917467


Now let's check the original DataFrame

In [89]:
df.head(n=5)

Unnamed: 0,X,Y,facies,porosity,perm,ai,random
0,565,1485,5,0.1184,6.17,2.009,0.640639
1,2585,1185,1,0.1566,6.275,2.864,0.155636
2,2065,2865,2,0.192,92.297,3.524,0.043813
3,3575,2655,1,0.1621,9.048,2.157,0.621904
4,1835,35,1,0.1766,7.123,3.979,0.917467


Let's make a new feature, add it to the copy of the DataFrame and find out if it is in the original DataFrame.

In [90]:
df2['random'] = np.random.rand(len(df2))                    # one random value per row, whatever the current number of samples
df.head(n=5)

Unnamed: 0,X,Y,facies,porosity,perm,ai,random
0,565,1485,5,0.1184,6.17,2.009,0.833788
1,2585,1185,1,0.1566,6.275,2.864,0.734056
2,2065,2865,2,0.192,92.297,3.524,0.226
3,3575,2655,1,0.1621,9.048,2.157,0.677244
4,1835,35,1,0.1766,7.123,3.979,0.252736


We added an array to the copy of the DataFrame and it was added to the original DataFrame!

Let's make a deep copy and try again.

In [91]:
df3 = df.copy()
df3.loc[1,'facies'] = 7
df3.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,random
0,565,1485,5,0.1184,6.17,2.009,0.833788
1,2585,1185,7,0.1566,6.275,2.864,0.734056
2,2065,2865,2,0.192,92.297,3.524,0.226
3,3575,2655,1,0.1621,9.048,2.157,0.677244
4,1835,35,1,0.1766,7.123,3.979,0.252736


Let's check the original.  

In [92]:
df.head()

Unnamed: 0,X,Y,facies,porosity,perm,ai,random
0,565,1485,5,0.1184,6.17,2.009,0.833788
1,2585,1185,1,0.1566,6.275,2.864,0.734056
2,2065,2865,2,0.192,92.297,3.524,0.226
3,3575,2655,1,0.1621,9.048,2.157,0.677244
4,1835,35,1,0.1766,7.123,3.979,0.252736


We made a deep copy so now the edits to the copy do not impact the original object.

* be careful when copying and check if you have a deep or shallow copy
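
One way to check what you have is the 'is' operator, which tests whether two names refer to the same object. The sketch below compares a plain assignment with 'copy()' on a hypothetical table; the names 'df_demo', 'alias' and 'deep' are made up for illustration.

```python
import pandas as pd

# hypothetical table
df_demo = pd.DataFrame({'facies': [1, 1, 2]})

alias = df_demo            # second name for the same object
deep = df_demo.copy()      # independent copy (deep=True is the default)

alias.loc[0, 'facies'] = 5     # also changes df_demo
deep.loc[1, 'facies'] = 7      # does not change df_demo

print(alias is df_demo, deep is df_demo)
print(df_demo['facies'].tolist())
```

The edit made through 'alias' shows up in 'df_demo', while the edit made to 'deep' does not.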

#### Writing the Tabular Data to a File

It may be useful to write the DataFrame out for storage or curation and / or to be utilized with another platform (even R or Excel!).  It is easy to write the DataFrame back to a comma delimited file.  We have the 'to_csv' DataFrame member function to accomplish this.  The file will write to the working directory (another reason we set that at the beginning).  Go to that folder and open this new file with TextPad, Excel or any other program that opens text files to check it out.

In [None]:
df.to_csv("2D_MV_200wells_out.csv")                      # write out the df DataFrame to a comma delimited file 
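
By default 'to_csv' also writes the index as an unnamed first column. If that column is not wanted when the file is read back, 'index=False' leaves it out. The sketch below writes to an in-memory buffer rather than to disk; 'df_demo' and its values are hypothetical.

```python
import io
import pandas as pd

# hypothetical table with a gap in the index, as after dropping a sample
df_demo = pd.DataFrame({'porosity': [0.125, 0.25], 'perm': [6.25, 92.5]}, index=[0, 2])

buffer = io.StringIO()                    # in-memory stand-in for a file on disk
df_demo.to_csv(buffer, index=False)       # index=False leaves out the index column
buffer.seek(0)                            # rewind before reading the text back
df_back = pd.read_csv(buffer)
print(list(df_back.columns))
```

Note that the original index labels are lost this way; keep the default if the sample labels need to survive the round trip.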

#### More Exercises

There are so many more exercises and tests that one could attempt to gain experience with the pandas package and DataFrame objects in Python. I'll end here for brevity, but I invite you to continue. Check out the docs at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html.  I'm always happy to discuss,

*Michael*

#### The Author:

### Michael Pyrcz, Associate Professor, University of Texas at Austin 
*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. 


#### Want to Work Together?

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! 

* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

* I can be reached at mpyrcz@austin.utexas.edu.


Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin