# Introduction to data science: data

We'll explore the Pandas package for simple data handling tasks using geoscience data examples, and we'll follow it up with a quick look at `scikit-learn` for fitting machine learning models and making predictions.

## Reading a CSV

Pandas reads tabular data from many file formats; [here is a list](https://pandas.pydata.org/docs/user_guide/io.html) of all the formats that it can read and write. CSV is one of the most common, so let's load one!

Conveniently, `pd.read_csv()` accepts either a URL or a local file path:

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/scienxlab/datasets/main/rpc/3-lithologies.csv"
df = pd.read_csv(url)
df.head()

In [None]:
df.dtypes

## Exploring your data

- [ ] Get a column, plot a column
- [ ] Do some maths
- [ ] Use `df.loc`
- [ ] Change the index with `df.set_index()`
- [ ] Use `df.describe()` with `include='all'`
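
As a minimal sketch of the checklist above, the cell below walks through each step on a tiny made-up table. The values, the `Sample` column and the assumption that `Vp` is in m/s are illustrative only; they are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "Sample": ["A1", "A2", "A3", "A4"],
    "Lithology": ["sandstone", "shale", "sandstone", "limestone"],
    "Vp": [3500.0, 2800.0, 4000.0, 5500.0],
    "Rho": [2300.0, np.nan, 2450.0, 2650.0],
})

# Get a column: indexing with a single name returns a Series.
vp = toy["Vp"]

# Do some maths: arithmetic is vectorized over the whole column
# (m/s to km/s, assuming Vp is in m/s).
toy["Vp_kms"] = vp / 1000

# Change the index, then select rows and columns by label with .loc.
indexed = toy.set_index("Sample")
a3_vp = indexed.loc["A3", "Vp"]
sandstones = indexed.loc[indexed["Lithology"] == "sandstone", ["Vp", "Rho"]]

# Summarize numeric and non-numeric columns in one table.
summary = indexed.describe(include="all")
```

In a notebook, `toy["Vp"].plot()` would plot the column, provided `matplotlib` is installed.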

### Exercise

- Have a quick look at [the Pandas documentation](https://pandas.pydata.org/docs/).
- How are the missing values distributed across the lithologies?
- Use your new library to compute the Gardner density estimate and put it in a new column called `Rho_Gardner`.
- Use `sns.kdeplot()` to compare the distribution of the new column to the actual `Rho` data. (You could also use `sns.histplot()` but it needs a bit of parameterization.)
- Fill the empty (NaN) values in density with your computed densities. <a title="Use `pd.Series.fillna()` on the 'Rho' column.">Hover for hint.</a>

In [None]:
df.groupby('Lithology').count()
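
`count()` reports the number of non-missing values in each group, so the gaps have to be inferred by comparing columns. As an alternative sketch, we could count the missing values directly; the toy table here is hypothetical and only mimics the column names of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the notebook you would use the real df instead.
toy = pd.DataFrame({
    "Lithology": ["sandstone", "shale", "sandstone", "shale"],
    "Vp": [3500.0, 2800.0, np.nan, 3000.0],
    "Rho": [np.nan, np.nan, 2450.0, 2400.0],
})

# Flag the missing cells, then sum the flags within each lithology.
missing = toy[["Vp", "Rho"]].isna().groupby(toy["Lithology"]).sum()
print(missing)
```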

In [None]:
# One way.
from my_package import gardner

df['Rho_Gardner'] = gardner(df['Vp'])

In [None]:
def gardner(vp):
    return 0.31 * vp**0.25

In [None]:
# Functional.
df['Rho_Gardner'] = df['Vp'].map(gardner)

If we wanted to be more awesome, we could try fitting our own Gardner parameters to the data we have for this rock type.
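
One possible way to do that fit, sketched under some assumptions: Gardner's relation `rho = a * vp**b` becomes a straight line after taking logs, so an ordinary least-squares line fit in log-log space gives estimates of `a` and `b`. The helper name `fit_gardner` and the synthetic data are hypothetical, chosen so that the known parameters can be recovered.

```python
import numpy as np

# Hypothetical synthetic data generated from known parameters (a=0.31, b=0.25)
# with a little multiplicative noise, so the fit can be checked.
rng = np.random.default_rng(42)
vp = rng.uniform(2000, 6000, size=200)
rho = 0.31 * vp**0.25 * np.exp(rng.normal(0, 0.01, size=200))

def fit_gardner(vp, rho):
    """Least-squares fit of rho = a * vp**b in log-log space."""
    # Ignore samples where either value is missing or infinite.
    mask = np.isfinite(vp) & np.isfinite(rho)
    # log(rho) = b * log(vp) + log(a), so polyfit returns [b, log(a)].
    b, log_a = np.polyfit(np.log(vp[mask]), np.log(rho[mask]), deg=1)
    return np.exp(log_a), b

a, b = fit_gardner(vp, rho)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On the real data, something like `fit_gardner(df['Vp'], df['Rho'])` on the rows for one lithology should work, as long as it runs before the missing densities are filled in; otherwise the filled values would bias the fit.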

In [None]:
import seaborn as sns

ax = sns.kdeplot(df['Rho_Gardner'])
_ = sns.kdeplot(df['Rho'], ax=ax)

In [None]:
df['Rho'] = df['Rho'].fillna(df['Rho_Gardner'])

## Visual exploration

We can easily visualize the properties of each facies, and how they compare, using a pair plot. The `seaborn` library builds on `matplotlib` and makes this kind of plot easy to produce with `sns.pairplot()`.

In [None]:
import seaborn as sns

sns.pairplot(df)

In [None]:
sns.pairplot(df,
             hue="Lithology",
             vars=['Vp','Vs','Rho'])

For finer control over each element of the pair plot, we can use the `PairGrid` object directly.

In [None]:
import matplotlib.pyplot as plt

g = sns.PairGrid(df, hue="Lithology", vars=['Vp','Vs','Rho'], height=4)

g.map_upper(plt.scatter, alpha=0.4)  
g.map_lower(plt.scatter, alpha=0.4)
g.map_diag(plt.hist, bins=20)  
g.add_legend()

## Export for machine learning

### Exercise

Export the following columns to a new CSV, in this order: Rho, Vp, Vs, and Lithology. Some details:

- Call the file `mydata.csv`.
- Use 3 decimal places for all floats.
- Make sure Pandas does not include the RPC catalog numbers.
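
One possible approach, sketched on a hypothetical toy table: select the columns in the required order, then let `to_csv()` handle the float formatting and drop the index. The `RPC` column name is an assumption about where the catalog numbers live.

```python
import pandas as pd

# Hypothetical toy data; 'RPC' stands in for the catalog-number column.
toy = pd.DataFrame({
    "RPC": [101, 102],
    "Lithology": ["sandstone", "shale"],
    "Vp": [3500.12345, 2800.5],
    "Vs": [2000.0, 1500.98765],
    "Rho": [2300.0, 2450.55555],
})

# Selecting with a list of names both orders the columns and leaves out 'RPC'.
cols = ["Rho", "Vp", "Vs", "Lithology"]

# float_format controls the decimals; index=False omits the row index.
toy[cols].to_csv("mydata.csv", float_format="%.3f", index=False)

with open("mydata.csv") as f:
    lines = f.read().splitlines()
print(lines)
```

If the catalog numbers have been made the index with `set_index()`, `index=False` is what keeps them out of the file.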

<hr />

<p style="color:gray">©2024 Matt Hall / Equinor. Licensed CC BY.</p>