# MiBiPreT example: Data handling using Griftpark data

## Background: Griftpark contaminant site

Text to be added.

In [1]:
#!pip install mibipret

Optional: if the package is not pip-installed, append the local path to the module instead:

In [2]:
import sys
path = '/home/alraune/GitHub/MiBiPreT/mibipret'
sys.path.append(path) # append the path to module

In [3]:
#from mibipret.data.data import load_csv, check_columns, check_units, check_values, standardize

In [11]:
import mibipret.data.data as md

### Data handling

**Load in data:**

`load_csv()` returns the loaded data as a pandas DataFrame together with the units. Both are printed when the `verbose` flag is `True`.

In [5]:
file_path = './grift_BTEXIIN.csv'
data_raw, units = md.load_csv(file_path, verbose=True)

 Running function 'load_csv()' on data file  ./grift_BTEXIIN.csv
Units of quantities:
-------------------
  sample nr obs_well well type Depth  pH     EC Redox  pE  NPOC sulfate  ...  \
0       NaN      NaN       NaN     m NaN  uS/cm    mV NaN  mg/L    mg/L  ...   

  O Xylene Cumene Propylbenzene M-Ethyltoluene O-Ethyltoluene  \
0     ug/L   ug/L          ug/L           ug/L           ug/L   

  1,2,4-Trimethylbenzene 1,2,3-Trimethylbenzene Indane Indene Naphthalene  
0                   ug/L                   ug/L   ug/L   ug/L        ug/L  

[1 rows x 29 columns]
________________________________________________________________
Loaded data as pandas DataFrame:
--------------------------------
       sample nr           obs_well  well type      Depth    pH     EC Redox  \
0            NaN                NaN        NaN          m   NaN  uS/cm    mV   
1   2019-031-001        A-32mm-52,5  dsn 32 mm      -52.5  7.31  149.8  -197   
2   2019-031-002        A-32mm-65,5  dsn 32 mm      -65.
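
The report above shows that the units sit in the first row (index 0) of the raw DataFrame. As a minimal sketch of how such a layout could be split by hand with plain pandas (this is not MiBiPreT's implementation; the toy CSV and the names `toy_units` and `toy_samples` are made up for illustration):

```python
import io

import pandas as pd

# Toy CSV mimicking the layout above: header row, one units row, then samples.
# The sample values are invented for illustration only.
toy_csv = io.StringIO(
    "sample nr,Depth,pH,Naphthalene\n"
    ",m,,ug/L\n"
    "S-001,-10.0,7.1,120\n"
    "S-002,-20.0,7.3,45\n"
)

# Read everything as strings so the units row does not disturb type inference.
raw = pd.read_csv(toy_csv, dtype=str)

# Row 0 holds the units; the remaining rows hold the measurements.
toy_units = raw.iloc[0]
toy_samples = raw.iloc[1:].reset_index(drop=True)

print(toy_units["Depth"])  # unit of the depth column
print(len(toy_samples))    # number of sample rows
```

Reading with `dtype=str` keeps the values untouched at this stage; conversion to numbers is then a separate, explicit step.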

**Check on column names:**

In [6]:
column_names_known, column_names_unknown, column_names_standard = md.check_columns(data_raw, verbose=True)
#print("\nQuantities to be checked on column names: \n",column_names_unknown)

 Running function 'check_columns()' on data
23 quantities identified in provided data.
List of names with standard names:
----------------------------------
sample nr  -->  sample_nr
obs_well  -->  obs_well
well type  -->  well_type
Depth  -->  depth
pH  -->  pH
EC  -->  EC
Redox  -->  redoxpot
pE  -->  pE
NPOC  -->  NOPC
sulfate  -->  sulfate
ammonium  -->  ammonium
sulfide  -->  sulfide
methane  -->  methane
Fe II  -->  ironII
Mn  -->  manganese
Benzene  -->  benzene
Toluene  -->  toluene
Ethylbenzene  -->  ethylbenzene
P/M Xylene  -->  pm_xylene
O Xylene  -->  o_xylene
Indane  -->  indane
Indene  -->  indene
Naphthalene  -->  naphthalene
----------------------------------

Renaming can be done by running standardize().

________________________________________________________________
6 quantities have not bee identified in provided data:
-------------------------------------------------------
Cumene
Propylbenzene
M-Ethyltoluene
O-Ethyltoluene
1,2,4-Trimethylbenzene
1,2,3-Trimethylbe
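
To illustrate the idea behind such a name check, here is a minimal sketch in plain Python. It assumes a lookup table from normalized input names to standard names; the table `STANDARD_NAMES` and the function `match_columns` are hypothetical and contain only a few pairs taken from the report above, so this is not how `check_columns()` is necessarily implemented.

```python
# Hypothetical lookup table: normalized input name -> standard name.
# Only a few pairs copied from the report above; not MiBiPreT's actual table.
STANDARD_NAMES = {
    "sample nr": "sample_nr",
    "depth": "depth",
    "redox": "redoxpot",
    "fe ii": "ironII",
    "benzene": "benzene",
}


def match_columns(columns):
    """Split column names into a rename map (known) and a list of unknowns."""
    known, unknown = {}, []
    for name in columns:
        key = name.strip().lower()  # ignore case and surrounding whitespace
        if key in STANDARD_NAMES:
            known[name] = STANDARD_NAMES[key]
        else:
            unknown.append(name)
    return known, unknown


known, unknown = match_columns(["sample nr", "Depth", "Redox", "Fe II", "Cumene"])
print(known)
print(unknown)
```

Names that end up in the unknown list are worth inspecting by hand, since they may simply contain a typo.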

**Check on units:**

In [7]:
check_list = md.check_units(data_raw, verbose=True)
#print("\nQuantities to be checked on units: \n",check_list)

 Running function 'check_units()' on data
________________________________________________________________
 All identified quantities given in requested units.
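
A unit check of this kind can be pictured as a comparison of the given units against a table of expected ones. The sketch below is an assumption for illustration: `EXPECTED_UNITS` and `find_unit_mismatches` are hypothetical names, and the expected units are simply those printed by `load_csv()` above, not a table taken from MiBiPreT.

```python
# Hypothetical expected units per standard quantity name (assumption for
# illustration; the units are those printed by load_csv() above).
EXPECTED_UNITS = {"depth": "m", "EC": "uS/cm", "redoxpot": "mV", "naphthalene": "ug/L"}


def find_unit_mismatches(units):
    """Return the quantities whose given unit differs from the expected one."""
    mismatches = []
    for quantity, expected in EXPECTED_UNITS.items():
        given = units.get(quantity)
        # Compare case-insensitively; quantities absent from the data are skipped.
        if given is not None and given.strip().lower() != expected.lower():
            mismatches.append(quantity)
    return mismatches


print(find_unit_mismatches({"depth": "m", "EC": "uS/cm", "naphthalene": "mg/L"}))
```

An empty list corresponds to the case reported above, where all identified quantities are given in the requested units.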


**Check on values in columns (transformation to numerical values, handling of NaN values):**

Returns a DataFrame without units in which the quantities are converted to numerical type.

In [8]:
data_pure = md.check_values(data_raw, verbose=True)

 Running function 'check_values()' on data
________________________________________________________________
________________________________________________________________
Quantities with values transformed to numerical(int/float):
-----------------------------------------------------------
redoxpot
pH
EC
pE
NOPC
sulfate
sulfide
ammonium
methane
ironII
manganese
benzene
toluene
ethylbenzene
pm_xylene
o_xylene
indane
indene
naphthalene
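
For readers who want to see what a transformation to numerical values can look like in plain pandas, the following sketch uses `pd.to_numeric` with `errors="coerce"`. It illustrates the general technique on invented toy data and is not a description of what `check_values()` does internally.

```python
import pandas as pd

# Toy columns with values stored as strings, including non-numeric entries.
toy = pd.DataFrame({"pH": ["7.31", "7.05", "n.a."], "benzene": ["120", "<1", "45"]})

# errors="coerce" turns every entry that cannot be parsed into NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)        # column types after conversion
print(numeric.isna().sum())  # number of NaN values per column
```

Counting the NaN values per column after the conversion is a quick way to see how many entries could not be interpreted as numbers.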


**Standardization of input data:**

Runs all checks on the data (column names, units and values) in one go and returns the transformed data with standard column names and values in numerical type where possible. With `reduce=False`, the data is not reduced to the columns containing known quantities, so columns with unknown names are kept. Unknown quantities can be contaminants or metabolites not yet included in the analysis scheme, or quantities with a typo in the name.

In [9]:
data, units = md.standardize(data_raw, reduce=False, verbose=True)
print("Number of columns:", data.shape[1])

 Running function 'standardize()' on data
23 quantities identified and renamed:
-------------------------------------
sample nr  -->  sample_nr
obs_well  -->  obs_well
well type  -->  well_type
Depth  -->  depth
pH  -->  pH
EC  -->  EC
Redox  -->  redoxpot
pE  -->  pE
NPOC  -->  NOPC
sulfate  -->  sulfate
ammonium  -->  ammonium
sulfide  -->  sulfide
methane  -->  methane
Fe II  -->  ironII
Mn  -->  manganese
Benzene  -->  benzene
Toluene  -->  toluene
Ethylbenzene  -->  ethylbenzene
P/M Xylene  -->  pm_xylene
O Xylene  -->  o_xylene
Indane  -->  indane
Indene  -->  indene
Naphthalene  -->  naphthalene
________________________________________________________________
6 quantities not identified (and removed) from standard data:
--------------------------------------------------------------
Cumene
Propylbenzene
M-Ethyltoluene
O-Ethyltoluene
1,2,4-Trimethylbenzene
1,2,3-Trimethylbenzene
 Running function 'check_units()' on data
_____________________________________________________________

**Standardization of input data:**

Rerun the data standardization without reporting (`verbose=False`). This time the data is reduced to the known (identified) quantities with `reduce=True`, and the standardized DataFrame is saved to a CSV file through the `store_csv` option.

In [10]:
file_standard = './grift_BTEXIIN_standard.csv'
data, units = md.standardize(data_raw, reduce=True, store_csv=file_standard, verbose=False)
print("Number of columns:", data.shape[1])

________________________________________________________________
Number of columns: 23
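
The effect of reducing and storing can be mimicked on toy data with plain pandas. The sketch below is an assumption for illustration only: the column list `known_columns` stands in for the result of a column check, and the file is written to a temporary directory rather than to the path used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data: two columns with known names and one unidentified column.
toy = pd.DataFrame(
    {"depth": [-10.0, -20.0], "benzene": [120.0, 45.0], "Cumene": [3.0, 1.0]}
)
known_columns = ["depth", "benzene"]  # assumption: result of a column check

# Keep only the identified quantities.
reduced = toy[known_columns]

# Write to a temporary file and read it back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "toy_standard.csv"
    reduced.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

print("Number of columns:", reloaded.shape[1])
```

Reading the stored file back is a simple check that the reduced data was written with the expected columns.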
