# Introduction to data science: data

[**This notebook is available on Google Colab.**](https://colab.research.google.com/drive/1U_8I6k80LvgC6bwE6RgRvF6QI0cIdEQK)

#### This is a very quick introduction. Time needed: about an hour.

We'll explore the Pandas package for simple data handling tasks using geoscience data examples, and we'll follow it up with a quick look at `scikit-learn` for fitting machine learning models and making predictions.

## Reading a CSV

Pandas reads files from disk in tabular form &mdash; [here is a list](https://pandas.pydata.org/docs/user_guide/io.html) of all the formats that it can read and write. A very common format is CSV, so let's load one!

Conveniently, you can point `pandas` at the CSV with either a URL or a file path:

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/scienxlab/datasets/main/rpc/rpc-3-imbalanced.csv"
df = pd.read_csv(url)
df.head()
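
`pd.read_csv()` also accepts a file-like object, which is handy for experimenting without a network connection. Here is a minimal sketch, assuming a tiny made-up table: the values are invented for illustration, and the `RPC` column name is an assumption rather than something confirmed from the real file.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
text = """RPC,Vp,Vs,Rho,Lithology
1,3000.0,1500.0,2.40,sandstone
2,4500.0,2500.0,,limestone
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(text))
toy.head()
```

The empty `Rho` field in the second row is read as `NaN`, which is how Pandas marks missing values.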

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Let's explore the data</h3>

- Get a column, plot a column
- Do (vectorized) math on column(s)
- Using `df.loc`
- Changing the index with `df.set_index()`
- Selecting subsets of `df` with `df.loc`
- Using `df.describe()` with `include='all'`
</div>
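
The operations listed above can be sketched on a small stand-in table. The values are made up, the column names follow the ones used in this notebook, and `RPC` is an assumed name for the catalog-number column.

```python
import pandas as pd

# Made-up values for illustration; 'RPC' is an assumed column name.
toy = pd.DataFrame({
    "RPC": [101, 102, 103, 104],
    "Vp": [3000.0, 3200.0, 4500.0, 4700.0],
    "Vs": [1500.0, 1600.0, 2500.0, 2600.0],
    "Lithology": ["sandstone", "sandstone", "limestone", "limestone"],
})

vp = toy["Vp"]                     # get a column as a Series
ratio = toy["Vp"] / toy["Vs"]      # vectorized math, element by element

indexed = toy.set_index("RPC")     # returns a new DataFrame indexed by RPC
row = indexed.loc[102]             # select one row by its index label

# Select rows by condition and columns by name in one step.
fast = indexed.loc[indexed["Vp"] > 4000, ["Vp", "Lithology"]]

# include='all' also summarizes non-numeric columns such as Lithology.
summary = toy.describe(include="all")
```

Note that `set_index()` returns a new DataFrame and leaves `toy` unchanged unless you assign the result back.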

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Exercise</h3>

- Have a quick look at [the Pandas documentation](https://pandas.pydata.org/docs/).
- How are the missing values distributed across the lithologies?
- The Gardner equation for computing density is given by $\rho_{\text{Gardner}} = 0.310 \times (V_p)^{0.25}$. Define a function that computes the Gardner density, then use it to estimate the density and store the result in a new column called `Rho_Gardner`.
- Use `sns.kdeplot()` to compare the distribution of the new column to the actual `Rho` data.
    - EXTRA: You could also use `sns.histplot()`, which gives a slightly different representation. How do the two differ, and can you make the histogram look closer to the output of `kdeplot()`? Can you spot any issues with using either method with its default parameters?
- Fill the empty (NaN) values in density with your computed densities. <a title="Use `pd.Series.fillna()` on the 'Rho' column."><ins>Hover here for hint</ins>.</a>
</div>
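
If you get stuck, here is one possible sketch of the Gardner and fill steps on a made-up table. The values are invented, and the toy `Rho` column is assumed to be in the same units as the Gardner estimate; check the units of `Rho` in the real data before filling.

```python
import numpy as np
import pandas as pd


def gardner(vp, alpha=0.310, beta=0.25):
    """Gardner density estimate: alpha * vp**beta. Works on scalars and Series."""
    return alpha * vp ** beta


# Made-up values for illustration only.
toy = pd.DataFrame({
    "Vp": [3000.0, 4500.0, 5200.0],
    "Rho": [2.40, np.nan, 2.70],
    "Lithology": ["sandstone", "limestone", "limestone"],
})

toy["Rho_Gardner"] = gardner(toy["Vp"])

# Count the missing Rho values in each lithology.
missing = toy["Rho"].isna().groupby(toy["Lithology"]).sum()

# Fill only the gaps in Rho; measured values are left untouched.
toy["Rho"] = toy["Rho"].fillna(toy["Rho_Gardner"])
```

Passing a Series to `fillna()` aligns on the index, so each gap is filled with the estimate from the same row.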

## Visual exploration of the data

We can easily visualize the properties of each variable, both individually and pairwise, using a pair plot (`sns.pairplot()`). The `seaborn` library builds on `matplotlib` to make these kinds of plots easy.

In [None]:
import seaborn as sns

sns.pairplot(df)

In [None]:
sns.pairplot(df,
             hue="Lithology",
             vars=['Vp','Vs','Rho'])

The `PairGrid` object gives us much finer control over each element of the pair plot.

In [None]:
import matplotlib.pyplot as plt

g = sns.PairGrid(df, hue="Lithology", vars=['Vp','Vs','Rho'], height=4)

g.map_upper(plt.scatter, alpha=0.4)  # scatter plots above the diagonal
g.map_lower(plt.scatter, alpha=0.4)  # scatter plots below the diagonal
g.map_diag(plt.hist, bins=20)        # histograms on the diagonal
g.add_legend()

## Export for machine learning

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>Exercise</h3>

Export the following columns to a new CSV, in this order: Rho, Vp, Vs, and Lithology. Some details:

- Call the file `mydata.csv`.
- Use 3 decimal places for all floats.
- Make sure Pandas does not include the RPC catalog numbers.
</div>
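
One possible approach uses the `columns`, `float_format` and `index` arguments of `DataFrame.to_csv()`. The sketch below assumes a made-up table with an `RPC` column (an assumed name) and writes to an in-memory buffer so that the result can be inspected; pass `"mydata.csv"` instead of the buffer to write a file.

```python
import io

import pandas as pd

# Made-up values for illustration; 'RPC' is an assumed column name.
toy = pd.DataFrame({
    "RPC": [101, 102],
    "Lithology": ["sandstone", "limestone"],
    "Vp": [3000.0, 4500.12345],
    "Vs": [1500.0, 2500.0],
    "Rho": [2.4, 2.65],
})

buffer = io.StringIO()
toy.to_csv(
    buffer,
    columns=["Rho", "Vp", "Vs", "Lithology"],  # chooses and orders the columns
    float_format="%.3f",                       # 3 decimal places for floats
    index=False,                               # leave out the row index
)
csv_text = buffer.getvalue()
print(csv_text)
```

Because `RPC` is not listed in `columns`, it is left out of the output, and `index=False` stops Pandas from writing the row labels as an extra first column.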

<hr />

<p style="color:gray">Adapted from ©2024 Matt Hall / Equinor. Licensed CC BY.</p>