# How to inspect your data?


**Authors**: Antoine Lucas (lucas@ipgp.fr) - Grégory Sainton (sainton@ipgp.fr)

**Version**: 1.0

**Date**: 2020/12/17

----
**Purpose**:
This short notebook shows you how to quickly inspect a new dataset.

This step is an **unavoidable prerequisite** to any data analysis, whether you call it `data science`, `machine learning`, or `deep learning`. As emphasized during the course, don't expect good results from bad input data: **Garbage in, garbage out**.

During your future Machine Learning projects, you will face several challenges:
- Insufficient quantity of training data
- Non-representative training data
- Poor-quality data
- Irrelevant features
- Overfitting or underfitting the training data


----
**Reference**: A. Géron, Machine Learning avec Scikit-Learn, Ed. DUNOD, 201

See especially the author's GitHub: https://github.com/ageron

## Pandas profiling tool

Among the tools used to inspect data, ```pandas_profiling``` is increasingly popular in the data science community.

### Install using conda 
`conda create -n pandas-profiling`

`conda activate pandas-profiling`

`conda install -c conda-forge pandas-profiling`

### Add the extension to your Jupyter notebook

`jupyter nbextension enable --py widgetsnbextension`
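
Once the installation steps above are done, a quick check from Python can confirm that the packages are visible in the active environment. This is a minimal sketch: the helper `is_installed` is a hypothetical name introduced here, and the list of modules to check is an assumption based on the commands above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Modules this notebook relies on (list assumed from the install steps above)
for name in ("pandas", "pandas_profiling", "ipywidgets"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a module is reported as missing, the most likely cause is that the notebook kernel is not running in the `pandas-profiling` conda environment.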

## Test on the data of the first Lab

In the first lab, we will work on data from [ObsEra](http://www.ipgp.fr/en/obsera/obsera-observatory-of-water-and-erosion-in-the-caribbean). As an introduction, we suggest you try some tools and commands on this dataset.

Make sure that you have cloned the repository:

```git clone https://pss-gitlab.math.univ-paris-diderot.fr/dralucas/earth-data-science```


The data are saved in the `./data` directory, which contains the `/CHEM` and `/HYDRO` subdirectories. Here we focus only on the first one.


### Load the data

In [None]:
%reset -f
# The previous line resets all variables on each run

import os, sys
import pandas as pd
import numpy as np
from glob import glob

# Chemical data
ObseraDir_chem = './data/CHEM/'
# sorted() makes the file order reproducible (glob returns files in arbitrary order)
filelist_chem = sorted(glob(ObseraDir_chem + 'C*.csv'))

# Load the first file; its fields are separated by semicolons
data_chem = pd.read_csv(filelist_chem[0], sep=';')
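
The cell above loads only the first file matched by the pattern. If several files need to be inspected together, one possible approach is to read each one and concatenate them. The sketch below is self-contained: it writes two tiny made-up files to a temporary directory, so the file names, column names, and values are purely illustrative and are not those of the ObsEra dataset.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# Build a throwaway directory with two tiny semicolon-separated files.
# File names, columns, and values are invented for this example.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "C_site_a.csv").write_text(
    "Date;Conductivity\n2020-01-01;101.5\n2020-01-02;98.0\n"
)
(tmp_dir / "C_site_b.csv").write_text("Date;Conductivity\n2020-01-01;250.0\n")

# sorted() keeps the file order reproducible from one run to the next
filelist = sorted(glob(str(tmp_dir / "C*.csv")))

frames = []
for path in filelist:
    df = pd.read_csv(path, sep=";")
    df["source_file"] = Path(path).name  # remember where each row came from
    frames.append(df)

# Stack all files into a single DataFrame with a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Keeping a `source_file` column makes it easy to trace a suspicious value back to the file it came from.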

### Inspect with Pandas 

#### Basic Pandas tools

- `df.head()` -> displays the first rows of the DataFrame (5 rows by default)
- `df.tail()` -> displays the last rows of the DataFrame (5 rows by default)
- `df.info()` -> displays the column names, types, and non-null counts


In [None]:
data_chem.head()

In [None]:
data_chem.tail()

In [None]:
data_chem.info()

`df.info()` is very useful to see the type of each field, especially when some are categorical rather than numerical, which may guide the choice of your ML models.
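
To act on what `df.info()` reports, it can help to get the same information as objects you can reuse. The sketch below runs on a small synthetic table (column names and values are illustrative only) and shows one way to count missing values and to separate numerical columns from the others.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real table; names and values are made up
df = pd.DataFrame(
    {
        "Conductivity": [101.5, np.nan, 98.0, 250.0],
        "Site": ["A", "A", "B", None],
        "Samples": [3, 5, 2, 4],
    }
)

print(df.dtypes)  # type of each column

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Split the columns by kind: numerical versus everything else
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numeric_cols)
print("non-numerical:", categorical_cols)
```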

In [None]:
data_chem.describe()

`df.describe()` focuses on the numerical data and gives basic statistics (count, mean, standard deviation, minimum, quartiles, maximum) for each field.

One can focus only on a single field: 

In [None]:
data_chem["Conductivity"].describe()
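
When reading the output of `describe()`, two details are worth keeping in mind: missing values are excluded from the count, and the standard deviation is the sample standard deviation. The sketch below illustrates this on a short synthetic series (the values are made up; only the column name is borrowed from the dataset).

```python
import numpy as np
import pandas as pd

# Five entries, one of which is missing
s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], name="Conductivity")

summary = s.describe()
print(summary)

# The same quantities computed one by one
print("non-missing values:", s.count())  # NaN is not counted
print("sample std:", s.std())            # same value as summary["std"]
print("median:", s.median())             # same value as summary["50%"]
```

Comparing `count` with the number of rows is a quick way to spot fields with many missing values.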

##### Basic plots of the data 

Here is a command to plot a `histogram` of every numerical field. This is very useful because, most of the time, you will have to normalize your data to improve and speed up your model.

In [None]:
data_chem.hist(figsize=(12,8))
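
As an illustration of what normalizing can mean in practice, here is a minimal sketch of two common rescalings written with plain pandas: standardization (zero mean, unit standard deviation) and min-max scaling (values mapped to [0, 1]). The table is synthetic, and which rescaling suits a given model is a choice this notebook does not prescribe.

```python
import pandas as pd

# Synthetic values; the column names only mimic the dataset
df = pd.DataFrame(
    {
        "Conductivity": [100.0, 150.0, 200.0, 250.0],
        "Suspended Load": [1.0, 2.0, 4.0, 8.0],
    }
)

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (df - df.mean()) / df.std(ddof=0)

# Min-max scaling: map each column linearly onto [0, 1]
min_max = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(min_max)
```

Plotting the histograms again after rescaling shows that the shape of each distribution is unchanged; only the axis values differ.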

Now, another command plots the `scatter matrix`, which lets you look at the correlations between your features.

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["Conductivity", "Suspended Load", "Twater (°C)"] # test with a subset of data
scatter_matrix(data_chem[attributes])
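
The scatter matrix gives a visual impression of the correlations; `DataFrame.corr()` gives the corresponding numbers (Pearson correlation by default). The sketch below uses a synthetic table whose columns are built to be perfectly correlated or anti-correlated, so the expected coefficients are known in advance.

```python
import pandas as pd

# Synthetic columns with known linear relationships
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [3.0, 5.0, 7.0, 9.0, 11.0],  # exactly 2 * x + 1
        "z": [5.0, 4.0, 3.0, 2.0, 1.0],   # exactly 6 - x
    }
)

# Pairwise Pearson correlation coefficients between all numerical columns
corr = df.corr()
print(corr.round(2))
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 only rules out a linear one, which is why the scatter plots remain worth looking at.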

#### One tool to rule them all: Pandas profiling

You may have wondered why we asked you to install the Pandas profiling tool without using it.
Its time has come!

In [None]:
try:
    from pandas_profiling import ProfileReport
    profile = ProfileReport(data_chem, title="Pandas Profiling Report")
    
except ModuleNotFoundError:
    print("The 'pandas_profiling' module is not installed")


In [None]:
profile