# 🌎 GPGN268 - Geophysical Data Analysis
- **Instructor:** Bia Villas Boas  
- **TA:** Seunghoo Kim

## Lecture 14: Introduction to pandas.

#### 🎯 Learning Objectives from this Lecture:
- Import the Pandas library.
- Use Pandas to load a simple CSV data set.
- Get some basic information about a Pandas DataFrame.

### Using the Pandas library to do statistics on tabular data.

[Pandas](http://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

- Pandas is a widely-used Python library for statistics, particularly on tabular data.
- Borrows many features from R’s dataframes.
    - A 2-dimensional table whose columns have names and potentially have different data types.
- Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas.
- Read a Comma Separated Values (CSV) data file with `pd.read_csv`.
    - Argument is the name of the file to be read.
    - Assign result to a variable to store the data that was read.
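
As a minimal sketch of the import and `pd.read_csv` steps listed above, the example below reads a tiny CSV from an in-memory string instead of a file on disk. The column names and values in `csv_text` are made up for illustration; with a real file you would pass its path as the argument.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk, so the example runs anywhere.
csv_text = """DEPTH,RES
0.1,0.61
0.2,0.63
0.3,0.64
"""

# pd.read_csv accepts a file path or any file-like object.
# Assign the result to a variable to keep the data that was read.
data = pd.read_csv(io.StringIO(csv_text))
print(data)
```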
    
In this lecture, we will go over the basic capabilities of Pandas. It is a very deep library, and you will need to dig into the [documentation](http://pandas.pydata.org/pandas-docs/stable/) for more advanced usage.

In [1]:
import numpy as np
import pandas as pd

### Reading data into a pandas dataframe
Let's use pandas to read one of our well-log data files.

In [2]:
path = '/Users/bia/work/classes/GPGN268/coursework-villasboas/ds01-well-log/data/iodp-logging-data/EXP372/U1517A/372-U1517A_res-phase-nscope.csv'
df = pd.read_csv(path)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 16


Instead of returning a table, pandas raised a `ParserError`. The message says that line 5 has 16 fields where only 2 were expected, which suggests that the first few lines of the file have a different format from the rest. Let's use the terminal to look at the data file. Open a new terminal window from JupyterLab and navigate to the data folder:

```
$ cd ~/work/classes/GPGN268/coursework-villasboas/ds01-well-log/data/iodp-logging-data/EXP372/U1517A
$ cat 372-U1517A_res-phase-nscope.csv | head
```
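
The sketch below illustrates how a list passed to `skiprows` deals with a file that has extra lines around the header. The layout of `raw_text` is an assumption, not the contents of the real IODP file: four two-field metadata lines, a header line, a units line, and then the data. The metadata values and units are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file layout (an assumption, not the real IODP file):
# four metadata lines, a header line, a units line, then the data.
raw_text = """Expedition,372
Hole,U1517A
Tool,example
Type,example
DEPTH_LSF,P16B,P16H
m,ohm.m,ohm.m
0.1348,0.609138,0.606420
0.2872,0.627580,0.606823
"""

# Peek at the first few lines, much like `head` in the terminal.
for line in raw_text.splitlines()[:6]:
    print(line)

# skiprows takes 0-based line numbers: skip lines 0-3 (metadata) and
# line 5 (units), so that line 4 is used as the header.
df_demo = pd.read_csv(io.StringIO(raw_text), skiprows=[0, 1, 2, 3, 5])
print(df_demo)
```

Skipping the units line matters: if it were left in, pandas would likely read every column as text rather than as numbers.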


In [3]:
df = pd.read_csv(path, skiprows=[0, 1, 2, 3, 5])

In [4]:
df

,DEPTH_LSF,P16B,P16H,P16L,P22B,P22H,P22L,P28B,P28H,P28L,P34B,P34H,P34L,P40B,P40H,P40L
0,0.1348,0.609138,0.606420,0.609138,0.625390,0.628182,0.625390,0.633311,0.642043,0.633311,0.635440,0.646374,0.635440,0.640069,0.650177,0.640069
1,0.2872,0.627580,0.606823,0.627579,0.646545,0.632340,0.646545,0.654598,0.649771,0.654598,0.654733,0.655505,0.654733,0.657815,0.659040,0.657815
2,0.4396,0.644199,0.610627,0.644199,0.663118,0.632921,0.663118,0.674138,0.650050,0.674138,0.678495,0.658393,0.678495,0.684064,0.664154,0.684064
3,0.5920,0.663892,0.619708,0.663892,0.681017,0.637017,0.681017,0.693011,0.653297,0.693011,0.699174,0.662530,0.699174,0.704357,0.666172,0.704357
4,0.7444,0.682845,0.630840,0.682845,0.697899,0.645322,0.697899,0.709838,0.659337,0.709838,0.717195,0.668764,0.717195,0.722439,0.674389,0.722439
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1221,186.2152,1.349504,1.321876,1.365414,1.367666,1.349495,1.378745,1.385877,1.375210,1.392777,1.399951,1.391715,1.405571,1.409557,1.401155,1.415531
1222,186.3676,1.307980,1.264115,1.329541,1.327183,1.286108,1.349225,1.350870,1.316730,1.371008,1.374388,1.345850,1.392854,1.392158,1.364525,1.411478
1223,186.5200,1.283353,1.242980,1.300715,1.326024,1.298547,1.340192,1.358904,1.344285,1.367395,1.381087,1.372510,1.386490,1.402996,1.395565,1.408120
1224,186.6724,1.253862,1.187660,1.279550,1.308796,1.272658,1.326299,1.348609,1.325449,1.361735,1.371802,1.356053,1.381517,1.399445,1.405782,1.395303


We have now read our data using pandas, but what type of object is `df`?

In [5]:
type(df)

pandas.core.frame.DataFrame

Previously, we used `numpy` to read data into NumPy arrays. With pandas, our data is stored as a pandas DataFrame.
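
To make the difference concrete, here is a small sketch comparing a DataFrame with the plain NumPy array it holds. The table `demo` is a hypothetical two-row example that reuses two column names from the lecture data.

```python
import numpy as np
import pandas as pd

# Small hypothetical table with two of the column names from the lecture.
demo = pd.DataFrame({"DEPTH_LSF": [0.1348, 0.2872], "P16B": [0.609138, 0.627580]})

values = demo.to_numpy()  # plain NumPy array: values only, no labels
print(type(demo))
print(type(values), values.shape)

# Columns are selected by name in a DataFrame, by position in an array.
print(demo["P16B"].to_numpy())
print(values[:, 1])
```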

### Use pandas.DataFrame.set_index to specify that a column’s values should be used as row headings
The row labels are currently integers (0 to 1225), but we really want the depths to be the row labels. We can use `set_index` to do that:

In [6]:
df.set_index('DEPTH_LSF', inplace=True)
df

DEPTH_LSF,P16B,P16H,P16L,P22B,P22H,P22L,P28B,P28H,P28L,P34B,P34H,P34L,P40B,P40H,P40L
0.1348,0.609138,0.606420,0.609138,0.625390,0.628182,0.625390,0.633311,0.642043,0.633311,0.635440,0.646374,0.635440,0.640069,0.650177,0.640069
0.2872,0.627580,0.606823,0.627579,0.646545,0.632340,0.646545,0.654598,0.649771,0.654598,0.654733,0.655505,0.654733,0.657815,0.659040,0.657815
0.4396,0.644199,0.610627,0.644199,0.663118,0.632921,0.663118,0.674138,0.650050,0.674138,0.678495,0.658393,0.678495,0.684064,0.664154,0.684064
0.5920,0.663892,0.619708,0.663892,0.681017,0.637017,0.681017,0.693011,0.653297,0.693011,0.699174,0.662530,0.699174,0.704357,0.666172,0.704357
0.7444,0.682845,0.630840,0.682845,0.697899,0.645322,0.697899,0.709838,0.659337,0.709838,0.717195,0.668764,0.717195,0.722439,0.674389,0.722439
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186.2152,1.349504,1.321876,1.365414,1.367666,1.349495,1.378745,1.385877,1.375210,1.392777,1.399951,1.391715,1.405571,1.409557,1.401155,1.415531
186.3676,1.307980,1.264115,1.329541,1.327183,1.286108,1.349225,1.350870,1.316730,1.371008,1.374388,1.345850,1.392854,1.392158,1.364525,1.411478
186.5200,1.283353,1.242980,1.300715,1.326024,1.298547,1.340192,1.358904,1.344285,1.367395,1.381087,1.372510,1.386490,1.402996,1.395565,1.408120
186.6724,1.253862,1.187660,1.279550,1.308796,1.272658,1.326299,1.348609,1.325449,1.361735,1.371802,1.356053,1.381517,1.399445,1.405782,1.395303
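
Two related ways of getting the same result are sketched below on a tiny in-memory CSV (the `csv_text` values are copied from the first two rows of the table above, but the three-column layout is a simplification). Without `inplace=True`, `set_index` returns a new DataFrame, and the `index_col` argument of `pd.read_csv` sets the index at read time.

```python
import io

import pandas as pd

csv_text = """DEPTH_LSF,P16B,P16H
0.1348,0.609138,0.606420
0.2872,0.627580,0.606823
"""

demo = pd.read_csv(io.StringIO(csv_text))

# Without inplace=True, set_index returns a new DataFrame
# and leaves `demo` unchanged.
by_depth = demo.set_index("DEPTH_LSF")

# index_col sets the row labels while the file is being read.
by_depth_direct = pd.read_csv(io.StringIO(csv_text), index_col="DEPTH_LSF")

print(by_depth.index.name)
print(by_depth.equals(by_depth_direct))
```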


### Use pandas.DataFrame.head() to show the first few rows of the DataFrame

- You can also specify the number of rows with `df.head(10)`

In [14]:
df.head()

DEPTH_LSF,P16B,P16H,P16L,P22B,P22H,P22L,P28B,P28H,P28L,P34B,P34H,P34L,P40B,P40H,P40L
0.1348,0.609138,0.60642,0.609138,0.62539,0.628182,0.62539,0.633311,0.642043,0.633311,0.63544,0.646374,0.63544,0.640069,0.650177,0.640069
0.2872,0.62758,0.606823,0.627579,0.646545,0.63234,0.646545,0.654598,0.649771,0.654598,0.654733,0.655505,0.654733,0.657815,0.65904,0.657815
0.4396,0.644199,0.610627,0.644199,0.663118,0.632921,0.663118,0.674138,0.65005,0.674138,0.678495,0.658393,0.678495,0.684064,0.664154,0.684064
0.592,0.663892,0.619708,0.663892,0.681017,0.637017,0.681017,0.693011,0.653297,0.693011,0.699174,0.66253,0.699174,0.704357,0.666172,0.704357
0.7444,0.682845,0.63084,0.682845,0.697899,0.645322,0.697899,0.709838,0.659337,0.709838,0.717195,0.668764,0.717195,0.722439,0.674389,0.722439
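
A short sketch of how the row-count argument of `head` behaves, using a hypothetical 20-row DataFrame called `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"value": np.arange(20)})

print(len(demo.head()))    # with no argument, head returns 5 rows
print(len(demo.head(10)))  # ask for 10 rows
print(len(demo.head(50)))  # asking for more rows than exist returns them all
```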


### Use pandas.DataFrame.tail() to show the last few rows of the DataFrame

- You can also specify the number of rows with `df.tail(10)`

In [16]:
df.tail()

DEPTH_LSF,P16B,P16H,P16L,P22B,P22H,P22L,P28B,P28H,P28L,P34B,P34H,P34L,P40B,P40H,P40L
186.2152,1.349504,1.321876,1.365414,1.367666,1.349495,1.378745,1.385877,1.37521,1.392777,1.399951,1.391715,1.405571,1.409557,1.401155,1.415531
186.3676,1.30798,1.264115,1.329541,1.327183,1.286108,1.349225,1.35087,1.31673,1.371008,1.374388,1.34585,1.392854,1.392158,1.364525,1.411478
186.52,1.283353,1.24298,1.300715,1.326024,1.298547,1.340192,1.358904,1.344285,1.367395,1.381087,1.37251,1.38649,1.402996,1.395565,1.40812
186.6724,1.253862,1.18766,1.27955,1.308796,1.272658,1.326299,1.348609,1.325449,1.361735,1.371802,1.356053,1.381517,1.399445,1.405782,1.395303
186.8248,1.253858,1.187652,1.279547,1.308793,1.272654,1.326297,1.348608,1.325446,1.361734,1.371801,1.35605,1.381517,1.399445,1.405784,1.395301
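
One way to think about `tail(n)` is as positional slicing from the end of the table. The sketch below checks this on a hypothetical 20-row DataFrame; `iloc` is the standard pandas position-based indexer, which this lecture has not introduced yet.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"value": np.arange(20)})

last_three = demo.tail(3)
print(last_three)

# tail(n) selects the same rows as positional slicing from the end.
print(last_three.equals(demo.iloc[-3:]))
```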


### Use the DataFrame.info() method to find out more about a dataframe.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 1226 entries, 0.1348 to 186.8248
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   P16B    1226 non-null   float64
 1   P16H    1226 non-null   float64
 2   P16L    1226 non-null   float64
 3   P22B    1226 non-null   float64
 4   P22H    1226 non-null   float64
 5   P22L    1226 non-null   float64
 6   P28B    1226 non-null   float64
 7   P28H    1226 non-null   float64
 8   P28L    1226 non-null   float64
 9   P34B    1226 non-null   float64
 10  P34H    1226 non-null   float64
 11  P34L    1226 non-null   float64
 12  P40B    1226 non-null   float64
 13  P40H    1226 non-null   float64
 14  P40L    1226 non-null   float64
dtypes: float64(15)
memory usage: 153.2 KB


#### Here is the information given to us
- This is a DataFrame
- There are 1226 rows, with index values (depths) from 0.1348 to 186.8248.
- There are fifteen columns, each of which has 1226 non-null 64-bit floating point values.
- We will talk later about null values, which are used to represent missing observations.
- Uses 153.2 KB of memory.
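
The same pieces of information can also be pulled out one at a time, which is handy when you want to use them in code rather than read them. This is a sketch on a hypothetical three-row table; the missing value (`np.nan`) is added on purpose to show how the non-null count changes.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one deliberately missing value in P16B.
demo = pd.DataFrame(
    {"P16B": [0.609138, 0.627580, np.nan], "P16H": [0.606420, 0.606823, 0.610627]},
    index=pd.Index([0.1348, 0.2872, 0.4396], name="DEPTH_LSF"),
)

print(demo.shape)          # (number of rows, number of columns)
print(demo.dtypes)         # data type of each column
print(demo.notna().sum())  # non-null count per column, as reported by info()
```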

### The DataFrame.columns variable stores information about the dataframe’s columns.

- Note that this is data, not a method (it doesn’t have parentheses), so do not use `()` to try to call it.

In [8]:
df.columns

Index(['P16B', 'P16H', 'P16L', 'P22B', 'P22H', 'P22L', 'P28B', 'P28H', 'P28L',
       'P34B', 'P34H', 'P34L', 'P40B', 'P40H', 'P40L'],
      dtype='object')
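
The sketch below shows attribute access on a small hypothetical DataFrame, and what happens if you add parentheses by mistake.

```python
import pandas as pd

demo = pd.DataFrame({"P16B": [0.61, 0.63], "P16H": [0.61, 0.61]})

print(demo.columns)        # attribute access, no parentheses
print(list(demo.columns))  # convert to a plain Python list of names

try:
    demo.columns()         # an Index object is not callable
except TypeError as err:
    print("TypeError:", err)
```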

### Use DataFrame.T to transpose a DataFrame

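- Sometimes we want to treat columns as rows and vice versa.
- Transpose (written `.T`) swaps the row and column labels. When every column has the same dtype, as here, pandas can often do this without copying the data; with mixed dtypes a copy is made.
- Like `columns`, it is a member variable, so it is written without parentheses.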

In [9]:
df.T

DEPTH_LSF,0.1348,0.2872,0.4396,0.5920,0.7444,0.8968,1.0492,1.2016,1.3540,1.5064,...,185.4532,185.6056,185.7580,185.9104,186.0628,186.2152,186.3676,186.5200,186.6724,186.8248
P16B,0.609138,0.62758,0.644199,0.663892,0.682845,0.731617,0.761301,0.804455,0.844715,0.910954,...,1.310211,1.30825,1.291665,1.339404,1.295607,1.349504,1.30798,1.283353,1.253862,1.253858
P16H,0.60642,0.606823,0.610627,0.619708,0.63084,0.694116,0.738197,0.792348,0.84885,0.908871,...,1.253256,1.230254,1.216608,1.269162,1.238327,1.321876,1.264115,1.24298,1.18766,1.187652
P16L,0.609138,0.627579,0.644199,0.663892,0.682845,0.731617,0.761301,0.804454,0.844715,0.910954,...,1.339484,1.350304,1.328359,1.383009,1.322925,1.365414,1.329541,1.300715,1.27955,1.279547
P22B,0.62539,0.646545,0.663118,0.681017,0.697899,0.745441,0.776537,0.815348,0.853703,0.915684,...,1.34868,1.361344,1.338605,1.370615,1.322849,1.367666,1.327183,1.326024,1.308796,1.308793
P22H,0.628182,0.63234,0.632921,0.637017,0.645322,0.712391,0.751259,0.798334,0.856051,0.914294,...,1.309219,1.305284,1.281755,1.31384,1.272447,1.349495,1.286108,1.298547,1.272658,1.272654
P22L,0.62539,0.646545,0.663118,0.681017,0.697899,0.745441,0.776537,0.815348,0.853703,0.915684,...,1.372062,1.398481,1.372327,1.410082,1.349988,1.378745,1.349225,1.340192,1.326299,1.326297
P28B,0.633311,0.654598,0.674138,0.693011,0.709838,0.756381,0.789001,0.826379,0.864378,0.925173,...,1.372776,1.396027,1.369928,1.401347,1.347189,1.385877,1.35087,1.358904,1.348609,1.348608
P28H,0.642043,0.649771,0.65005,0.653297,0.659337,0.728494,0.762553,0.808085,0.865635,0.923758,...,1.340607,1.359139,1.324898,1.364223,1.30072,1.37521,1.31673,1.344285,1.325449,1.325446
P28L,0.633311,0.654598,0.674138,0.693011,0.709838,0.756381,0.789,0.826378,0.864378,0.925173,...,1.393662,1.423079,1.399942,1.429269,1.37508,1.392777,1.371008,1.367395,1.361735,1.361734
P34B,0.63544,0.654733,0.678495,0.699174,0.717195,0.762395,0.795984,0.832787,0.871146,0.933119,...,1.385789,1.411744,1.384295,1.417707,1.362655,1.399951,1.374388,1.381087,1.371802,1.371801
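
A minimal sketch of what transposing does to the shape and labels, using a hypothetical three-row, two-column table indexed by depth:

```python
import pandas as pd

demo = pd.DataFrame(
    {"P16B": [0.609138, 0.627580, 0.644199], "P16H": [0.606420, 0.606823, 0.610627]},
    index=pd.Index([0.1348, 0.2872, 0.4396], name="DEPTH_LSF"),
)

transposed = demo.T

print(demo.shape, transposed.shape)  # rows and columns swap
print(list(transposed.index))        # the old column names are now row labels
print(list(transposed.columns))      # the old depths are now column labels
```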


### Use DataFrame.describe() to get summary statistics about data

DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.

In [10]:
df.describe()

,P16B,P16H,P16L,P22B,P22H,P22L,P28B,P28H,P28L,P34B,P34H,P34L,P40B,P40H,P40L
count,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0,1226.0
mean,1.341873,1.303217,1.356043,1.373337,1.339729,1.387603,1.40006,1.371811,1.41303,1.418855,1.393411,1.431126,1.43627,1.412525,1.448081
std,0.223072,0.258176,0.224783,0.213214,0.245582,0.213473,0.206943,0.236748,0.205067,0.202675,0.231264,0.199095,0.201042,0.229986,0.195603
min,0.249499,0.229939,0.249499,0.264302,0.234238,0.264302,0.279298,0.261545,0.279298,0.291382,0.269856,0.291382,0.301117,0.264072,0.301117
25%,1.271618,1.240749,1.285699,1.297918,1.27345,1.310082,1.317118,1.293844,1.32836,1.329962,1.306659,1.341992,1.339276,1.315795,1.354473
50%,1.340664,1.319333,1.357524,1.363818,1.346272,1.379367,1.387204,1.367307,1.40447,1.40477,1.385873,1.421322,1.422908,1.406614,1.438637
75%,1.461021,1.444107,1.475915,1.489279,1.474051,1.499507,1.505077,1.499532,1.521235,1.521931,1.521799,1.539843,1.552212,1.548337,1.567602
max,2.068212,2.06821,2.018047,2.09974,2.099739,2.012895,2.126745,2.126745,2.015623,2.148747,2.148747,2.023546,2.170441,2.170441,2.02949
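
Our well-log table has only numeric columns, so the effect of `include='all'` is not visible in it. The sketch below uses a hypothetical table with an added text column (`hole`) to show the difference.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "P16B": [0.609138, 0.627580, 0.644199],
        "hole": ["U1517A", "U1517A", "U1517A"],  # hypothetical text column
    }
)

numeric_only = demo.describe()          # numeric columns only
everything = demo.describe(include="all")  # text columns are summarized too

print(numeric_only)
print(everything)
```

For the text column, pandas reports statistics such as the number of unique values instead of a mean, and fills the statistics that do not apply with `NaN`.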


#### ✅ Activity: 
Look at the table above and try to explain to one of your peers what exactly `df.describe()` represents. Discuss the meaning of each summary statistic.  

#### 🔎 Key Points
- Use the Pandas library to get basic statistics out of tabular data.
- Use `DataFrame.set_index` (or the `index_col` argument of `pd.read_csv`) to specify that a column’s values should be used as row headings.
- Use `DataFrame.info` to find out more about a dataframe.
- The `DataFrame.columns` variable stores information about the dataframe’s columns.
- Use `DataFrame.T` to transpose a dataframe.
- Use `DataFrame.describe` to get summary statistics about data.