<a href="https://colab.research.google.com/github/ObiAU/LoL-TensorFlow-Projects/blob/main/MDS_Thesis_Initial_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import RMSprop
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns

# Overall Notes/Ideas

* Consider comparing lithium against trace elements and lithium against major elements
  * In other words, lithium against 'ppm' elements and lithium against 'WT%' elements.
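A minimal sketch of how that split could be made from the column names alone, assuming the unit suffixes `(WT%)` and `(PPM)` seen in the GEOROC headers and assuming `LI(PPM)` is the target. The helper name `split_by_unit` is hypothetical.

```python
def split_by_unit(columns, target="LI(PPM)"):
    """Group column names by unit suffix, leaving the target column out."""
    major = [c for c in columns if c.endswith("(WT%)")]
    trace = [c for c in columns if c.endswith("(PPM)") and c != target]
    return major, trace

# Toy column list for illustration only
cols = ["SIO2(WT%)", "K2O(WT%)", "LI(PPM)", "RB(PPM)", "CS(PPM)", "MINERAL"]
major_cols, trace_cols = split_by_unit(cols)
print(major_cols)
print(trace_cols)
```

The two lists can then be used to select feature sets for the two comparisons.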

####################################################


The initial dataset is taken from the GEOROC geochemical database and is a large dataset of micas taken from different papers.

* Column R gives the kind of mica (mineral name) – you should **exclude** celadonite, glauconite, hydromica, hydromuscovite, margarite, sericite and yangzhumingite in the first instance.

* That leaves two broad kinds of mica – muscovite (which is essentially free of Fe and Mg) and biotite (which contains Fe and Mg).

I’m sure you’ll have lots of questions about all this! So just get in touch when you want to.



It's also worth saying that I’m sure this is also **not complete**. For example, I know there was a very recent paper **(Breiter et al. 2023)** which has a lot of data presented in figures but not in dataset form – there are probably also others. A detailed literature search might produce more, and there may be good ways to extract data from digital figures.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving FULL_GEOROC_2022_12_SGFTFN_MICA.xlsx to FULL_GEOROC_2022_12_SGFTFN_MICA.xlsx


In [3]:
ini = pd.read_excel('FULL_GEOROC_2022_12_SGFTFN_MICA.xlsx')

## Further Instructions

You can delete age information, that’s no problem.
Age represents a geological age (i.e. timing of when the mineral formed) so is not really relevant.

Similarly, you’ll find columns with **isotopic compositions (e.g. AR36(CCMSTP/G), AR39(MOL/G))** which are also not relevant to what you’re doing. Those can be ignored, hidden or deleted.

## Drop Non-Relevant Columns

Drop He3 (CR), He4 (CS), Ne20 (CY), the Ar36-Ar40 columns (DG-DO), K40 (DQ), Ca42 (DS), Ca43 (DT), Ge73 (EG), Ge74 (EH), KR84 (EL), Mo95-100 (ES-EW), XE132 (FC), RA226 (GC) and everything from GG to JC inclusive.
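Because the drop list is built up by hand, it can end up with repeated names or names that are not in the sheet. The sketch below, with a hypothetical helper `columns_to_drop`, keeps each candidate once and only if it is present. The column names are toy values.

```python
def columns_to_drop(candidates, available):
    """Keep each candidate once, and only if it actually exists."""
    available = set(available)
    # dict.fromkeys removes repeats while preserving the original order
    return [c for c in dict.fromkeys(candidates) if c in available]

available = ["SAMPLE NAME", "AGE(MA)", "K40(PPM)", "LI(PPM)"]
candidates = ["AGE(MA)", "K40(PPM)", "K40(PPM)", "XE132(CCMSTP/G)"]
to_drop = columns_to_drop(candidates, available)
print(to_drop)
```

The result could then be passed to `DataFrame.drop(columns=to_drop)`.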

In [4]:
# We can update this as we go
non_relevant = ['AGE(KA)', 'AGE(MA)', 'CITATION', 'AR36(CCMSTP/G)',
                'AR39(CCMSTP/G)', 'AR39(MOL/G)', 'AR40(AT/G)',
                'AR40(CCMSTP/G)', 'AR40(MOL/G)', 'AR40(PPM)', 'AR40(PPB)',
                'HE3(CCMSTP/G)', 'HE4(CCMSTP/G)', 'K40(PPM)',
                'NE20(CCMSTP/G)', 'CA42(PPM)', 'CA43(PPM)', 'GE73(PPM)',
                'GE74(PPM)', 'KR84(CCMSTP/G)', 'MO95(PPM)', 'MO97(PPM)',
                'MO98(PPM)', 'MO100(PPM)', 'XE132(CCMSTP/G)', 'RA226_ACT(BQ/KG)',
                'ND143_ND144','EPSILON_ND','EPSILON_ND_INI','SM147_ND144',
                'SR87_SR86', 'RB87_SR86', 'PB206_PB204','PB207_PB204',
 'PB207_PB206', 'PB208_PB204', 'PB208_PB206','HF176_HF177', 'LU176_HF177',
 'HE4_HE3', 'K40_AR36','K40_CA44', 'AR36_AR38', 'AR36_AR39',
 'AR36_AR40','AR37_AR39', 'AR37_AR40','AR38_AR36', 'AR38_AR39', 'AR38_AR40',
 'AR39_AR36','AR39_AR40', 'AR40_AR36', 'AR40_AR38', 'AR40_AR39', 'AR40_K40', 'PB207_U235',
 'PB210_TH230_ACT', 'RA226_TH230_ACT', 'TH230_TH232_ACT', 'TH230_U238_ACT', 'TH232_PB204',
 'TH232_PB208', 'TH232_U238', 'U234_U238_ACT','U235_PB204', 'U238_PB204', 'U238_PB206',
 'U238_TH230_ACT', 'U238_TH232_ACT', 'N2(MOL/G)', 'D7LI(VS LSVEC)', 'D7LI(VS NBS8545)',
 'D7LI(VS IRMM-016)', 'D11B(VS NBS951)', 'B11_B10', 'D13C(VS VPDB)', 'D15N(PER MIL)',
 'D18O(PER MIL)', 'D18O(VS SMOW)', 'D18O(VS VSMOW)', 'D25MG(VS DSM3)', 'D26MG(VS DSM3)',
 'CA40_CA44', 'D41K(VS SRM3141A)', 'D44_40CA(VS SRM915A)', 'D44_42CA(VS SRM915A)',
 'D49TI(VS OL-TI)', 'D56FE(VS IRMM-014)', 'D57FE(VS IRMM-014)', 'D60NI(VS NBS986)',
 'D66ZN(VS JMC3-0749)', 'D68ZN(VS JMC3-0749)', 'D98_95MO(PER MIL)', 'D98MO(VS NIST3134)',
 'D98_95MO(VS NIST3134)', 'D137_134BA(VS SRM3104A)', 'D138_134BA(VS SRM3104A)',
 'DD(PER MIL)', 'DD(VS SMOW)', 'DD(VS VSMOW)', 'SM2O3(WT%)', 'EU2O3(WT%)',
 'GD2O3(WT%)','DY2O3(WT%)','SO2(WT%)','SO3(WT%)','S(WT%)','H2S(WT%)','LOI(WT%)',
 'O(WT%)', 'YB2O3(WT%)', 'PR2O3(WT%)', 'AS2O3(WT%)', 'CH4(WT%)','O2(WT%)',
'CO(WT%)', 'H2O(WT%)', 'H2OP(WT%)', 'H2OM(WT%)','CO2(WT%)', 'OH(WT%)']

# You can scrap AJ-AN inclusive, AU, BR, CB-CI inclusive and CP.
# Other elements that won’t be of interest for your regression include BS-BV, CA

In [5]:
# Drop Non-Relevant Columns
df = ini.drop(non_relevant, axis = 1)
df.columns

Index(['SAMPLE NAME', 'TECTONIC SETTING', 'LOCATION', 'LOCATION COMMENT',
       'LATITUDE (MIN.)', 'LATITUDE (MAX.)', 'LONGITUDE (MIN.)',
       'LONGITUDE (MAX.)', 'LAND/SEA (SAMPLING)', 'ELEVATION (MIN.)',
       ...
       'ANORTHITE(MOL%)', 'ANNITE(MOL%)', 'ENSTATITE(MOL%)',
       'FERROSILITE(MOL%)', 'MARGARITE(MOL%)', 'MUSCOVITE(MOL%)',
       'ORTHOCLASE(MOL%)', 'PARAGONITE(MOL%)', 'PHLOGOPITE(MOL%)',
       'WOLLASTONITE(MOL%)'],
      dtype='object', length=154)

### **Exclude** CELADONITE, GLAUCONITE, HYDROMICA, HYDROMUSCOVITE, MARGARITE, SERICITE and YANGZHUMINGITE from the column titled 'MINERAL'

In [6]:
df['MINERAL'].value_counts()

BIOTITE               25656
PHLOGOPITE            15369
MICA                   5274
MUSCOVITE              1960
PHENGITE                228
CELADONITE              160
SERICITE                126
PARAGONITE              107
ANNITE                   85
YANGZHUMINGITE           54
ZINNWALDITE              46
SIDEROPHYLLITE            8
HYDROMUSCOVITE            7
MARGARITE                 3
LEPIDOLITE                3
HYDROMICA                 3
GLAUCONITE                3
PHENGITE-MUSCOVITE        2
Name: MINERAL, dtype: int64

In [7]:
excluded = ['CELADONITE', 'GLAUCONITE', 'HYDROMICA', 'HYDROMUSCOVITE',
            'MARGARITE', 'SERICITE', 'YANGZHUMINGITE']

In [8]:
extract = ~df['MINERAL'].isin(excluded)
df = df[extract]

In [9]:
df['MINERAL'].value_counts()

BIOTITE               25656
PHLOGOPITE            15369
MICA                   5274
MUSCOVITE              1960
PHENGITE                228
PARAGONITE              107
ANNITE                   85
ZINNWALDITE              46
SIDEROPHYLLITE            8
LEPIDOLITE                3
PHENGITE-MUSCOVITE        2
Name: MINERAL, dtype: int64

## Mol Calculations

Because the data have been taken from published papers in exactly the way that they are presented there, you’ll see that some elements are reported as “ppm” (parts per million by mass) whereas others are reported as “Wt %” (parts per hundred by mass), and some are reported as the “oxide” version (e.g. thorium oxide, ThO2) instead of “element” (just thorium).

ppm, parts per million, is equivalent to µg/g (micrograms, i.e. 10^-6 g, of an element per gram of rock).

wt%, parts per hundred, is equivalent to 10 milligrams of an element per gram of rock.



To convert wt% Rb (rubidium) to ppm Rb (for example) we just multiply by 10^4

So **10,000 ppm Rb = 1 wt%**



But some materials are reported in their “oxide” form because they can easily combine with oxygen during some kinds of analysis. To convert from oxide to element, we need to calculate the number of moles of that element. A mole is the amount of a substance that contains Avogadro’s number (6.022 x 10^23) of atoms or molecules; its mass in grams equals the relative atomic or molecular mass. The number of moles of titanium, Ti, is the **same whether we have it in the form of TiO2 or Ti (element)**. However, one mole of TiO2 has a greater mass than one mole of Ti because we **also have to account for the oxygen.**

As an example,

If we have 10g of TiO2 in 1 kg of rock, we can calculate the number of moles of this:

n = m / RMM, where n is the number of moles, m is the mass and RMM is the relative molecular mass.

n (TiO2) = 10/79.9 (atomic mass of Ti = 47.9, atomic mass of O = 16, so RMM of TiO2 = 47.9 + 2 × 16 = 79.9).

So n (TiO2) = 0.125 moles



Then the number of moles of Ti (element) is also 0.125, so the mass of Ti = 0.125 * 47.9 ≈ 6.0 g (RMM for Ti is 47.9).



For something like Na2O, the number of moles of Na2O is half the number of moles of Na (metal).

To find the RMM of a compound you add up its individual RAMs
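The worked example above can be sketched as a single conversion factor: the mass fraction of the metal in its oxide. The atomic masses are rounded values assumed for illustration, and the function name is hypothetical.

```python
# Rounded relative atomic masses (assumed, one decimal place)
ATOMIC_MASS = {"O": 16.0, "Ti": 47.9, "Na": 23.0}

def oxide_to_element_factor(metal, n_metal, n_oxygen):
    """Mass fraction of the metal in an oxide with n_metal metal atoms
    and n_oxygen oxygen atoms per formula unit."""
    metal_mass = n_metal * ATOMIC_MASS[metal]
    oxide_mass = metal_mass + n_oxygen * ATOMIC_MASS["O"]
    return metal_mass / oxide_mass

tio2_factor = oxide_to_element_factor("Ti", 1, 2)   # TiO2
ti_grams = 10 * tio2_factor                         # Ti in 10 g of TiO2
na2o_factor = oxide_to_element_factor("Na", 2, 1)   # Na2O: two Na per oxide
print(round(ti_grams, 2), round(na2o_factor, 3))
```

Multiplying an oxide mass (or wt%) by this factor gives the element mass (or wt%), which is the same arithmetic as going through moles explicitly.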

### To perform these mol calculations we need to wrap them in functions

In [10]:
# We will write mol, wt%, and ppm calculation functions here

def wt2ppm(item): # to ppm
  return item * 10000

def ppm2wt(item): # to WT%
  return item / 10000

def ppb2m(item): # to ppm
  return item / 1000

def ppb2wt(item): # to WT%
  return item / 10000000


### Major elements that should be left as oxide include SiO2, TiO2, Al2O3, FeO (or Fe2O3), MnO, MgO, CaO, Na2O, K2O, P2O5

It’s generally preferable to have the trace elements (those in lower concentrations) as ppm. Major elements that should be left as oxide include SiO2, TiO2, Al2O3, FeO (or Fe2O3), MnO, MgO, CaO, Na2O, K2O, P2O5, H2O (if present). I notice that some of these elements are reported as element (wt%) e.g. Si (wt%) in column CJ. You can convert the above elements from “wt% element” to “wt% oxide” (i.e. from Si to SiO2) and then merge these columns. Basically, some analytical techniques will generate a report in wt% oxides, some in wt%element and some in ppm. We’re not fussed about the technique but more about the overall range of concentrations.
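As a small illustration of the "convert, then merge" idea, the sketch below converts Si (wt%) to SiO2 (wt%) and uses it only where no SiO2 value was reported. The toy values are made up, and the atomic masses are the same rounded ones used elsewhere in these notes.

```python
import pandas as pd

# RMM(SiO2) / RAM(Si), using rounded atomic masses
SI_TO_SIO2 = (28.1 + 2 * 16.0) / 28.1

toy = pd.DataFrame({
    "SIO2(WT%)": [35.0, None, None],   # reported directly as oxide
    "SI(WT%)":   [None, 16.0, None],   # reported as element
})

# Fill missing oxide values with the converted element values, row by row
toy["SIO2(WT%)"] = toy["SIO2(WT%)"].fillna(toy["SI(WT%)"] * SI_TO_SIO2)
merged = toy["SIO2(WT%)"].tolist()
print(merged)
```

Rows where neither form was reported stay NaN.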



In [11]:
# Functions to convert materials from their oxide form to elements
def ZRO2(item):
  return item*(91.2/(91.2 + 32))

def HFO2(item):
  return item*(178.5/(178.5 + 32))

def THO2(item):
  return item*(232/(232 + 32))

def UO2(item):
  return item*(238/(238 + 32))

def UO3(item):
  return item*(238/(238 + 48))

def CR2O3(item):
  return item*((2*52)/(2*52 + 48))

def LA2O3(item):
  return item*((2*138.9)/(2*138.9 + 48))

def CE2O3(item):
  return item*((2*140.1)/(2*140.1 + 48))

def ND2O3(item):
  return item*((2*144.2)/(2*144.2 + 48))

def Y2O3(item):
  return item*((2*88.9)/(2*88.9 + 48))

def V2O3(item):
  return item*((2*50.9)/(2*50.9 + 48))

def V2O5(item):
  return item*((2*50.9)/(2*50.9 + 80))

def NB2O3(item):
  return item*((2*92.9)/(2*92.9 + 48))

def NB2O5(item):
  return item*((2*92.9)/(2*92.9 + 80))

def TA2O5(item):
  return item*((2*180.9)/(2*180.9 + 80))

def WO3(item):
  return item*((183.8)/(183.8 + 48))

def BAO(item):
  return item*(137.3/(137.3 + 16))

def SRO(item):
  return item*(87.6/(87.6 + 16))

def PBO(item):
  return item*(207.2/(207.2 + 16))

def SNO2(item):
  return item*(118.7/(118.7 + 32))

def NIO(item):
  return item*(58.7/(58.7 + 16))

def ZNO(item):
  return item*(65.4/(65.4 + 16))

def COO(item):
  return item*(58.9/(58.9 + 16))

def CUO(item):
  return item*(63.5/(63.5 + 16))

def CS2O(item):
  return item*((2*132.9)/(2*132.9 + 16))

def RB2O(item):
  return item*((2*85.5)/(2*85.5 + 16))

def LI2O(item):
  return item*((2*6.9)/(2*6.9 + 16))

def F2O(item):
  return item*((2*19)/(2*19 + 16))

def CL2O(item): # BZ
  return item*((2*35.5)/(2*35.5 + 16))


##############################################################################

# You can scrap AJ-AN inclusive, AU, BR, CA-CI inclusive and CP.
# Other elements that won’t be of interest for your regression include BS-BV

In [12]:
# Functions to convert certain elements to their oxide form
# Elements that should be left as oxide include SiO2, TiO2,
# Al2O3, FeO (or Fe2O3), MnO, MgO, CaO, Na2O, K2O, P2O5, H2O (if present).

# SiO2
def Si(revert):
  return revert*((28.1 + 32)/28.1)

# Al2O3
def Al(revert):
  return revert*((2*27 + 48)/(2*27))

# Fe2O3 to FeO
def FeO(revert):
  return revert*((2*(55.8 + 16))/(2*55.8 + 48))

# Fe to FeO
def Fe(revert):
  return revert*((55.8 + 16)/(55.8))

# MgO
def Mg(revert):
  return revert*((24.3 + 16)/24.3)

# Na2O
def Na(revert):
  return revert*((2*23 + 16)/(2*23))

# K2O
def K(revert):
  return revert*((2*39.1 + 16)/(2*39.1))

# P2O5
def P(revert):
  return revert*((2*31 + 80)/(2*31))

# CaO
def Ca(revert):
  return revert*((40.1 + 16)/40.1)

# MnO
def Mn(revert):
  return revert*((54.9 + 16)/54.9)

## Conversions and Merging

#### Notes

* FeOT and FeO should be merged in advance
* Same with Fe2O3T and Fe2O3

## Pre-Merging

In [13]:
# Merge FeOT and FeO
df['FEO(WT%)'] = df['FEOT(WT%)'].fillna(df['FEO(WT%)'])
# Merge Fe2O3T and Fe2O3
df['FE2O3(WT%)'] = df['FE2O3T(WT%)'].fillna(df['FE2O3(WT%)'])

## Transforming Columns to Floats

Some columns contain values that are neither floats nor ints, which makes them incompatible with the functions above:

* CR2O3
* BAO
* NIO
* LI2O
* RB2O
* K

We will iterate over these values to reduce code duplication and ensure conversion consistency.
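A sketch of the per-cell parsing this implies, assuming the problem cells look like `"0.8 \ 0.9"` (several readings separated by a backslash) and that only the first reading is wanted. The helper `first_number` and the sample values are hypothetical.

```python
import math

def first_number(value):
    """Return the first numeric token of a cell, or NaN if nothing parses."""
    if isinstance(value, (int, float)):
        return float(value)
    # Treat backslashes as separators, then split on whitespace
    tokens = str(value).replace("\\", " ").split()
    if not tokens:
        return math.nan
    try:
        return float(tokens[0])
    except ValueError:
        return math.nan

samples = [0.42, "1.5", "0.8 \\ 0.9", "n.d.", ""]
parsed = [first_number(v) for v in samples]
print(parsed)
```

Returning NaN for unparseable text keeps those rows usable for the later `fillna` merges instead of raising an error halfway through a column.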



In [14]:
nonfloat = ['CR2O3(WT%)', 'BAO(WT%)', 'NIO(WT%)',
            'LI2O(WT%)', 'RB2O(WT%)', 'K(WT%)']

In [15]:
for column in nonfloat:
    # Convert the column to string type (in case it's not already)
    df[column] = df[column].astype(str)

    # Remove backslashes (literal match, not a regex) and split the string into a list of values
    df[column] = df[column].str.replace('\\', '', regex=False).str.split()

    # Extract the first value from the list (assuming you're only interested in the first value)
    df[column] = df[column].str[0]

    # Convert the column to float
    df[column] = df[column].astype(float)



## Column Conversions

### Oxides

In [16]:
# Here, we convert each relevant column and will merge them in later code cells

# We use .pipe() to chain functions
# Converting WT% to WT% Oxides
# SiO2
df['SI(WT%)'] = df['SI(WT%)'].apply(Si)
df['SI(PPM)'] = df['SI(PPM)'].pipe(ppm2wt).pipe(Si)

# Al2O3
df['AL(WT%)'] = df['AL(WT%)'].apply(Al)
df['AL(PPM)'] = df['AL(PPM)'].pipe(ppm2wt).pipe(Al)

# Fe2O3 (FeOT and Fe2O3T are the only ones measured with the T method)
df['FE2O3(WT%)'] = df['FE2O3(WT%)'].apply(FeO)
# Fe
df['FE(WT%)'] = df['FE(WT%)'].apply(Fe)
df['FE(PPM)'] = df['FE(PPM)'].pipe(ppm2wt).pipe(Fe)

# MnO
df['MN(PPM)'] = df['MN(PPM)'].pipe(ppm2wt).pipe(Mn)

# MgO
df['MG(WT%)'] = df['MG(WT%)'].apply(Mg)
df['MG(PPM)'] = df['MG(PPM)'].pipe(ppm2wt).pipe(Mg)

# CaO
df['CA(PPM)'] = df['CA(PPM)'].pipe(ppm2wt).pipe(Ca)

# Na2O
df['NA(WT%)'] = df['NA(WT%)'].apply(Na)
df['NA(PPM)'] = df['NA(PPM)'].pipe(ppm2wt).pipe(Na)

# K2O
df['K(WT%)'] = df['K(WT%)'].apply(K)
df['K(PPM)'] = df['K(PPM)'].pipe(ppm2wt).pipe(K)

# P2O5
df['P(PPM)'] = df['P(PPM)'].pipe(ppm2wt).pipe(P)

### Non-Oxides

In [17]:
# Trace elements; those in lower concentrations, should be left as ppm
# MAJOR ELEMENTS - SiO2, TiO2, Al2O3, FeO, MnO, MgO, CaO, Na2O, K2O, P2O5

# ZrO2 // ZR(PPM) - EP
df['ZRO2(WT%)'] = df['ZRO2(WT%)'].pipe(ZRO2).pipe(wt2ppm)

# HFO2 // HF(PPM) - FT
df['HFO2(WT%)'] = df['HFO2(WT%)'].pipe(HFO2).pipe(wt2ppm)

# THO2 // TH(PPM) - GD
df['THO2(WT%)'] = df['THO2(WT%)'].pipe(THO2).pipe(wt2ppm)

# UO2, UO3 // U(PPM), U(PPB) - GE, GF
df['UO2(WT%)'] = df['UO2(WT%)'].pipe(UO2).pipe(wt2ppm)
df['UO3(WT%)'] = df['UO3(WT%)'].pipe(UO3).pipe(wt2ppm)
df['U(PPB)'] = df['U(PPB)'].pipe(ppb2m)

# CR2O3 // CR(PPM) - DX
df['CR2O3(WT%)'] = df['CR2O3(WT%)'].pipe(CR2O3).pipe(wt2ppm)
df['CR(WT%)'] = df['CR(WT%)'].pipe(wt2ppm)

# LA2O3 // LA(PPM) - FF
df['LA2O3(WT%)'] = df['LA2O3(WT%)'].pipe(LA2O3).pipe(wt2ppm)

# CE2O3 // CE(PPM) - FG
df['CE2O3(WT%)'] = df['CE2O3(WT%)'].pipe(CE2O3).pipe(wt2ppm)

# ND2O3 // ND(PPM) - FI
df['ND2O3(WT%)'] = df['ND2O3(WT%)'].pipe(ND2O3).pipe(wt2ppm)

# Y2O3 // Y(PPM) - EO
df['Y2O3(WT%)'] = df['Y2O3(WT%)'].pipe(Y2O3).pipe(wt2ppm)

# V2O3, V2O5 // V(PPM) - DW
df['V2O3(WT%)'] = df['V2O3(WT%)'].pipe(V2O3).pipe(wt2ppm)
df['V2O5(WT%)'] = df['V2O5(WT%)'].pipe(V2O5).pipe(wt2ppm)

# NB2O3, NB2O5 // NB(PPM) - EQ
df['NB2O3(WT%)'] = df['NB2O3(WT%)'].pipe(NB2O3).pipe(wt2ppm)
df['NB2O5(WT%)'] = df['NB2O5(WT%)'].pipe(NB2O5).pipe(wt2ppm)

# TA2O5 // TA(PPM) - FU
df['TA2O5(WT%)'] = df['TA2O5(WT%)'].pipe(TA2O5).pipe(wt2ppm)

# WO3 // W(PPM) - FV
df['WO3(WT%)'] = df['WO3(WT%)'].pipe(WO3).pipe(wt2ppm)

# BAO // BA(PPM) - FE
df['BAO(WT%)'] = df['BAO(WT%)'].pipe(BAO).pipe(wt2ppm)

# SRO // SR(PPM) - EN
df['SRO(WT%)'] = df['SRO(WT%)'].pipe(SRO).pipe(wt2ppm)

# PBO // PB(PPM) - FZ
df['PBO(WT%)'] = df['PBO(WT%)'].pipe(PBO).pipe(wt2ppm)
df['PB(PPB)'] = df['PB(PPB)'].pipe(ppb2m)

# SNO2 // SN(PPM) - EZ
df['SNO2(WT%)'] = df['SNO2(WT%)'].pipe(SNO2).pipe(wt2ppm)

# NIO // NI(PPM) - EB
df['NIO(WT%)'] = df['NIO(WT%)'].pipe(NIO).pipe(wt2ppm)

# ZNO // ZN(PPM) - ED
df['ZNO(WT%)'] = df['ZNO(WT%)'].pipe(ZNO).pipe(wt2ppm)

# COO // CO(PPM) - EA
df['COO(WT%)'] = df['COO(WT%)'].pipe(COO).pipe(wt2ppm)

# CUO // CU(PPM) - EC
df['CUO(WT%)'] = df['CUO(WT%)'].pipe(CUO).pipe(wt2ppm)

# CS2O // CS(PPM) - FD
df['CS2O(WT%)'] = df['CS2O(WT%)'].pipe(CS2O).pipe(wt2ppm)

# RB2O // RB(PPM) - EM
df['RB2O(WT%)'] = df['RB2O(WT%)'].pipe(RB2O).pipe(wt2ppm)

# LI2O // LI(PPM) - CT
df['LI2O(WT%)'] = df['LI2O(WT%)'].pipe(LI2O).pipe(wt2ppm)

# F2O // F(PPM) - CX
df['F(WT%)'] = df['F(WT%)'].pipe(wt2ppm)
df['F2O(WT%)'] = df['F2O(WT%)'].pipe(F2O).pipe(wt2ppm)

# CL2O // CL(PPM) - DF
df['CL(WT%)'] = df['CL(WT%)'].pipe(wt2ppm)
df['CL2O(WT%)'] = df['CL2O(WT%)'].pipe(CL2O).pipe(wt2ppm)

## Column Merging

IMPORTANT - When merging the columns, each value must stay in its original row.
This is because each value corresponds to measurements taken from specific rocks/minerals.

Thus, we will use .fillna() to merge these columns: it aligns on the row index and only fills cells that are NaN, so existing values are never moved or overwritten.
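A small check of that row-alignment behaviour on toy data: the fallback series is deliberately given in a different row order, and `fillna` still matches values by index label rather than by position.

```python
import pandas as pd

# Primary column with gaps, indexed by (toy) sample row labels
primary = pd.Series([100.0, None, None], index=[10, 11, 12])

# Fallback values listed in a different order on purpose
fallback = pd.Series([7.0, 8.0, 9.0], index=[12, 11, 10])

merged = primary.fillna(fallback)
print(merged)
```

Row 10 keeps its original value, while rows 11 and 12 receive the fallback values that carry the same labels.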

### Oxides

In [18]:
# Oxides (We will leave major elements in WT% form)

# SIO2
df['SIO2(WT%)'] = df['SIO2(WT%)'].fillna(df['SI(PPM)']).fillna(df['SI(WT%)'])

# AL2O3
df['AL2O3(WT%)'] = df['AL2O3(WT%)'].fillna(df['AL(WT%)']).fillna(df['AL(PPM)'])

# FEO
df['FEO(WT%)'] = df['FEO(WT%)'].fillna(df['FE2O3(WT%)']).fillna(df['FE(WT%)']).fillna(df['FE(PPM)'])

# MNO
df['MNO(WT%)'] = df['MNO(WT%)'].fillna(df['MN(PPM)'])

# MGO
df['MGO(WT%)'] = df['MGO(WT%)'].fillna(df['MG(PPM)']).fillna(df['MG(WT%)'])

# CAO
df['CAO(WT%)'] = df['CAO(WT%)'].fillna(df['CA(PPM)'])

# NA2O
df['NA2O(WT%)'] = df['NA2O(WT%)'].fillna(df['NA(WT%)']).fillna(df['NA(PPM)'])

# K2O
df['K2O(WT%)'] = df['K2O(WT%)'].fillna(df['K(WT%)']).fillna(df['K(PPM)'])

# P2O5
df['P2O5(WT%)'] = df['P2O5(WT%)'].fillna(df['P(PPM)'])

### Non-Oxides

In [19]:
# Trace elements should be left in ppm format

# ZR
df['ZR(PPM)'] = df['ZR(PPM)'].fillna(df['ZRO2(WT%)'])

# HF
df['HF(PPM)'] = df['HF(PPM)'].fillna(df['HFO2(WT%)'])

# TH
df['TH(PPM)'] = df['TH(PPM)'].fillna(df['THO2(WT%)'])

# U
df['U(PPM)'] = df['U(PPM)'].fillna(df['UO2(WT%)']).fillna(df['UO3(WT%)']).fillna(df['U(PPB)'])

# CR
df['CR(PPM)'] = df['CR(PPM)'].fillna(df['CR2O3(WT%)']).fillna(df['CR(WT%)'])

# LA
df['LA(PPM)'] = df['LA(PPM)'].fillna(df['LA2O3(WT%)'])

# CE
df['CE(PPM)'] = df['CE(PPM)'].fillna(df['CE2O3(WT%)'])

# ND
df['ND(PPM)'] = df['ND(PPM)'].fillna(df['ND2O3(WT%)'])

# Y
df['Y(PPM)'] = df['Y(PPM)'].fillna(df['Y2O3(WT%)'])

# V
df['V(PPM)'] = df['V(PPM)'].fillna(df['V2O3(WT%)']).fillna(df['V2O5(WT%)'])

# NB
df['NB(PPM)'] = df['NB(PPM)'].fillna(df['NB2O3(WT%)']).fillna(df['NB2O5(WT%)'])

# TA
df['TA(PPM)'] = df['TA(PPM)'].fillna(df['TA2O5(WT%)'])

# W
df['W(PPM)'] = df['W(PPM)'].fillna(df['WO3(WT%)'])

# BA
df['BA(PPM)'] = df['BA(PPM)'].fillna(df['BAO(WT%)'])

# SR
df['SR(PPM)'] = df['SR(PPM)'].fillna(df['SRO(WT%)'])

# PB
df['PB(PPM)'] = df['PB(PPM)'].fillna(df['PBO(WT%)']).fillna(df['PB(PPB)'])

# SN
df['SN(PPM)'] = df['SN(PPM)'].fillna(df['SNO2(WT%)'])

# NI
df['NI(PPM)'] = df['NI(PPM)'].fillna(df['NIO(WT%)'])

# ZN
df['ZN(PPM)'] = df['ZN(PPM)'].fillna(df['ZNO(WT%)'])

# CO
df['CO(PPM)'] = df['CO(PPM)'].fillna(df['COO(WT%)'])

# CU
df['CU(PPM)'] = df['CU(PPM)'].fillna(df['CUO(WT%)'])

# CS
df['CS(PPM)'] = df['CS(PPM)'].fillna(df['CS2O(WT%)'])

# RB
df['RB(PPM)'] = df['RB(PPM)'].fillna(df['RB2O(WT%)'])

# LI
df['LI(PPM)'] = df['LI(PPM)'].fillna(df['LI2O(WT%)'])

# F
df['F(PPM)'] = df['F(PPM)'].fillna(df['F(WT%)']).fillna(df['F2O(WT%)'])

# CL
df['CL(PPM)'] = df['CL(PPM)'].fillna(df['CL(WT%)']).fillna(df['CL2O(WT%)'])


### Drop columns that have been merged

In [20]:
postmerged = ['SI(WT%)', 'SI(PPM)', 'AL(WT%)', 'AL(PPM)', 'FE2O3(WT%)',
              'FE2O3T(WT%)', 'FEOT(WT%)', 'MN(PPM)', 'MG(WT%)', 'MG(PPM)',
              'CA(PPM)', 'NA(WT%)', 'NA(PPM)', 'K(WT%)', 'K(PPM)',
              'P(PPM)', 'ZRO2(WT%)', 'HFO2(WT%)', 'THO2(WT%)', 'UO2(WT%)',
              'UO3(WT%)', 'U(PPB)', 'CR2O3(WT%)', 'CR(WT%)', 'LA2O3(WT%)',
              'CE2O3(WT%)', 'ND2O3(WT%)', 'Y2O3(WT%)', 'V2O3(WT%)', 'V2O5(WT%)',
              'NB2O3(WT%)', 'NB2O5(WT%)', 'TA2O5(WT%)', 'WO3(WT%)', 'BAO(WT%)',
              'SRO(WT%)', 'PBO(WT%)', 'PB(PPB)','SNO2(WT%)', 'NIO(WT%)',
              'ZNO(WT%)', 'COO(WT%)', 'CUO(WT%)', 'CS2O(WT%)', 'RB2O(WT%)',
              'LI2O(WT%)', 'F(WT%)', 'F2O(WT%)', 'CL(WT%)', 'CL2O(WT%)',
              'FE(WT%)', 'FE(PPM)', 'AR(MOL/G)']

In [21]:
df = df.drop(postmerged, axis = 1)
df.columns

Index(['SAMPLE NAME', 'TECTONIC SETTING', 'LOCATION', 'LOCATION COMMENT',
       'LATITUDE (MIN.)', 'LATITUDE (MAX.)', 'LONGITUDE (MIN.)',
       'LONGITUDE (MAX.)', 'LAND/SEA (SAMPLING)', 'ELEVATION (MIN.)',
       ...
       'ANORTHITE(MOL%)', 'ANNITE(MOL%)', 'ENSTATITE(MOL%)',
       'FERROSILITE(MOL%)', 'MARGARITE(MOL%)', 'MUSCOVITE(MOL%)',
       'ORTHOCLASE(MOL%)', 'PARAGONITE(MOL%)', 'PHLOGOPITE(MOL%)',
       'WOLLASTONITE(MOL%)'],
      dtype='object', length=101)

We can now drop all rows with NaN in the LI(PPM) column, since every lithium source column has been merged into it.

In [22]:
# .copy() gives an independent DataFrame, so later column assignments
# do not trigger SettingWithCopyWarning
Cdf = df.dropna(subset=['LI(PPM)']).reset_index(drop=True).copy()
Cdf

Unnamed: 0,SAMPLE NAME,TECTONIC SETTING,LOCATION,LOCATION COMMENT,LATITUDE (MIN.),LATITUDE (MAX.),LONGITUDE (MIN.),LONGITUDE (MAX.),LAND/SEA (SAMPLING),ELEVATION (MIN.),...,ANORTHITE(MOL%),ANNITE(MOL%),ENSTATITE(MOL%),FERROSILITE(MOL%),MARGARITE(MOL%),MUSCOVITE(MOL%),ORTHOCLASE(MOL%),PARAGONITE(MOL%),PHLOGOPITE(MOL%),WOLLASTONITE(MOL%)
0,samp. LR57,INTRAPLATE VOLCANICS,NORTH ATLANTIC CRATON_PROTEROZOIC / SCOTLAND /...,,57.9,58.4,-5.15,-5.16,subaerial,,...,,,,,,,,,,
1,samp. 40920,CONVERGENT MARGIN,BISMARCK ARC - NEW BRITAIN ARC / BISMARCK ARC ...,UASILAU-YAU YAU,-5.6,-5.6,150.20,150.20,subaerial,,...,,,,,,,,,,
2,samp. 40922,CONVERGENT MARGIN,BISMARCK ARC - NEW BRITAIN ARC / BISMARCK ARC ...,UASILAU-YAU YAU,-5.6,-5.6,150.20,150.20,subaerial,,...,,,,,,,,,,
3,samp. 40933,CONVERGENT MARGIN,BISMARCK ARC - NEW BRITAIN ARC / BISMARCK ARC ...,UASILAU-YAU YAU,-5.6,-5.6,150.20,150.20,subaerial,,...,,,,,,,,,,
4,samp. 40938,CONVERGENT MARGIN,BISMARCK ARC - NEW BRITAIN ARC / BISMARCK ARC ...,UASILAU-YAU YAU,-5.6,-5.6,150.20,150.20,subaerial,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4704,samp. ZJ-8,INTRAPLATE VOLCANICS,CENTRAL ASIAN FOLDBELT - MESOZOIC / CHINA - ME...,ZHAOJIANGGOU TA-NB DEPOSIT,42.0,42.0,112.00,112.00,subaerial,,...,,,,,,,,,,
4705,samp. ZJ-8-2,INTRAPLATE VOLCANICS,CENTRAL ASIAN FOLDBELT - MESOZOIC / CHINA - ME...,ZHAOJIANGGOU TA-NB DEPOSIT,42.0,42.0,112.00,112.00,subaerial,,...,,,,,,,,,,
4706,samp. ZJ-8-2,INTRAPLATE VOLCANICS,CENTRAL ASIAN FOLDBELT - MESOZOIC / CHINA - ME...,ZHAOJIANGGOU TA-NB DEPOSIT,42.0,42.0,112.00,112.00,subaerial,,...,,,,,,,,,,
4707,samp. ZJ-8-2,INTRAPLATE VOLCANICS,CENTRAL ASIAN FOLDBELT - MESOZOIC / CHINA - ME...,ZHAOJIANGGOU TA-NB DEPOSIT,42.0,42.0,112.00,112.00,subaerial,,...,,,,,,,,,,


In [54]:
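def convert_mixed_value(value):
    """Return a cell as float, keeping only the part before the first backslash."""
    # Numeric cells (including NaN) pass straight through
    if isinstance(value, (float, int)):
        return float(value)
    # Text cells with a backslash: keep the part before the first one
    if "\\" in value:
        parts = value.split('\\')
        number = parts[0].strip()
        return float(number)
    else:
        return float(value)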

In [52]:
numericals = ['SIO2(WT%)', 'TIO2(WT%)', 'AL2O3(WT%)', 'FEO(WT%)', 'CAO(WT%)',
       'MGO(WT%)', 'MNO(WT%)', 'K2O(WT%)', 'NA2O(WT%)', 'P2O5(WT%)', 'LI(PPM)',
       'BE(PPM)', 'B(PPM)', 'N(PPM)', 'F(PPM)', 'S(PPM)', 'CL(PPM)', 'SC(PPM)',
       'TI(PPM)', 'V(PPM)', 'CR(PPM)', 'CO(PPM)', 'NI(PPM)', 'CU(PPM)',
       'ZN(PPM)', 'GA(PPM)', 'GE(PPM)', 'AS(PPM)', 'SE(PPM)', 'BR(PPM)',
       'RB(PPM)', 'SR(PPM)', 'Y(PPM)', 'ZR(PPM)', 'NB(PPM)', 'MO(PPM)',
       'AG(PPM)', 'CD(PPM)', 'IN(PPM)', 'SN(PPM)', 'SB(PPM)', 'TE(PPM)',
       'CS(PPM)', 'BA(PPM)', 'LA(PPM)', 'CE(PPM)', 'PR(PPM)', 'ND(PPM)',
       'SM(PPM)', 'EU(PPM)', 'GD(PPM)', 'TB(PPM)', 'DY(PPM)', 'HO(PPM)',
       'ER(PPM)', 'TM(PPM)', 'YB(PPM)', 'LU(PPM)', 'HF(PPM)', 'TA(PPM)',
       'W(PPM)', 'AU(PPM)', 'HG(PPM)', 'TL(PPM)', 'PB(PPM)', 'BI(PPM)',
       'TH(PPM)', 'U(PPM)']

In [55]:
for col in numericals:
  Cdf[col] = Cdf[col].apply(convert_mixed_value)



### Categorization

With our reduced and cleaned-up dataset, it may prove beneficial to recategorize some of our categorical columns.

 * Filter the dataset to choose only MINERAL = biotite.

 * Instead of categorising by geographical location, probably the most interesting would be to categorise by TECTONIC SETTING.
  * This will give 3 or 4 categories

* The next significant category would probably be rock name.
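A sketch of the first two steps on toy data: filter to biotite, then summarise lithium by tectonic setting. The values are invented, and the median is just one possible summary.

```python
import pandas as pd

toy = pd.DataFrame({
    "MINERAL": ["BIOTITE", "MUSCOVITE", "BIOTITE", "BIOTITE"],
    "TECTONIC SETTING": ["RIFT VOLCANICS", "RIFT VOLCANICS",
                         "CONVERGENT MARGIN", "RIFT VOLCANICS"],
    "LI(PPM)": [200.0, 50.0, 80.0, 300.0],
})

# Keep only biotite rows
biotite = toy[toy["MINERAL"] == "BIOTITE"]

# Median lithium per tectonic setting
li_by_setting = biotite.groupby("TECTONIC SETTING")["LI(PPM)"].median()
print(li_by_setting)
```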


In [26]:
Cdf['TECTONIC SETTING'].value_counts()

INTRAPLATE VOLCANICS                           2222
OCEAN ISLAND                                    920
RIFT VOLCANICS                                  817
CONVERGENT MARGIN                               650
CONTINENTAL FLOOD BASALT                         50
ARCHEAN CRATON (INCLUDING GREENSTONE BELTS)      48
COMPLEX VOLCANIC SETTINGS                         2
Name: TECTONIC SETTING, dtype: int64

We do not need to consider further categorization of Tectonic Setting at this point

## Rock Name

You can use the following combinations of categories

* Granite (any kind; combine with microgranite, rhyolite, aplite, rhyodacite, leucogranite)
* Granodiorite (with dacite, tonalite, trondhjemite)
* Diorite (any kind; combine with andesite)
* Gabbro
* Syenite (with monzonite, monzogranite, melasyenite, phonolite)
* Carbonatite (any kind; combine with lamprophyre, vogesite, nephelinite)

Ignore the few lines that have “phlogopite, amphibole” or equivalent – those are not actually rock names so consider them as NOT GIVEN.
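One way to encode these groupings is an ordered rule table, sketched below. It is an alternative illustration, not the function used later in these notes. It assumes substring matching on the full rock name, that rule order resolves overlaps (MONZOGRANITE is tested before GRANITE, RHYODACITE before DACITE), and that a name whose first term is a mineral is not a rock name. All names here are hypothetical.

```python
# First terms that indicate a mineral list rather than a rock name (assumed)
NON_ROCK_FIRST_TERMS = {"PHLOGOPITE", "AMPHIBOLE"}

# Ordered rules: the first group with a matching keyword wins
RULES = [
    ("SYENITE", ("MONZOGRANITE", "MONZONITE", "SYENITE", "PHONOLITE")),
    ("GRANITE", ("RHYODACITE", "GRANITE", "RHYOLITE", "APLITE")),
    ("GRANODIORITE", ("GRANODIORITE", "DACITE", "TONALITE", "TRONDHJEMITE")),
    ("DIORITE", ("DIORITE", "ANDESITE")),
    ("GABBRO", ("GABBRO",)),
    ("CARBONATITE", ("CARBONATITE", "LAMPROPHYRE", "VOGESITE", "NEPHELINITE")),
]

def group_rock_name(name):
    """Map a raw rock name to one of the combined categories."""
    first_term = name.split(",")[0].strip()
    if first_term in NON_ROCK_FIRST_TERMS:
        return "NOT GIVEN"
    for group, keywords in RULES:
        if any(keyword in name for keyword in keywords):
            return group
    return name  # leave unlisted rock names unchanged

print(group_rock_name("MONZOGRANITE, BIOTITE"))
print(group_rock_name("LEUCOGRANITE, 2-MICA"))
```

Keeping the rules in a table makes the precedence explicit, which matters because several names contain another category's keyword.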

In [24]:
pd.set_option('display.max_rows', None)
Cdf['ROCK NAME'].value_counts()

NOT GIVEN                                                    918
PHONOLITE                                                    853
GRANITE                                                      587
GRANITE, BIOTITE                                             326
GRANODIORITE                                                 236
LAMPROITE                                                    175
LEUCOGRANITE, 2-MICA                                         120
KIMBERLITE                                                    70
SYENITE                                                       66
GABBRO                                                        66
GRANITE, MUSCOVITE                                            64
GRANITE, 2-MICA                                               60
ECLOGITE, XENOLITH                                            58
ALBITITE                                                      57
LEUCOGRANITE, TOURMALINE                                      51
MONZOGRANITE, BIOTITE-HOR

In [23]:
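def categorize_rock(item):
  # Exact match only: compound names such as 'MONZOGRANITE, ...' fall through
  # to the 'GRANITE' substring test below
  if item == 'MONZOGRANITE':
    return 'SYENITE'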
  elif item in ['RHYOLITE', 'APLITE', 'RHYODACITE']:
    return 'GRANITE'
  elif 'GRANITE' in item:
    return 'GRANITE'
  elif item in ['DACITE', 'TONALITE', 'TRONDHJEMITE']:
    return 'GRANODIORITE'
  elif 'GRANODIORITE' in item:
    return 'GRANODIORITE'
  elif 'DIORITE' in item:
    return 'DIORITE'
  elif item == 'ANDESITE':
    return 'DIORITE'
  elif 'GABBRO' in item:
    return 'GABBRO'
  elif 'SYENITE' in item:
    return 'SYENITE'
  elif item in ['MONZONITE', 'MELASYENITE', 'PHONOLITE']:
    return 'SYENITE'
  elif 'CARBONATITE' in item:
    return 'CARBONATITE'
  elif item in ['LAMPROPHYRE', 'VOGESITE', 'NEPHELINITE']:
    return 'CARBONATITE'
  elif 'PHLOGOPITE' in item:
    return 'NOT GIVEN'
  else:
    return item

In [24]:
testdf = Cdf['ROCK NAME'].apply(categorize_rock)
testdf.value_counts()

GRANITE                                1623
SYENITE                                 998
NOT GIVEN                               946
GRANODIORITE                            356
LAMPROITE                               175
GABBRO                                  100
KIMBERLITE                               70
DIORITE                                  59
CARBONATITE                              59
ECLOGITE, XENOLITH                       58
ALBITITE                                 57
SCHIST, METAPELITIC                      41
LHERZOLITE, XENOLITH                     26
ALN÷ITE                                  19
LHERZOLITE, SPINEL, XENOLITH             14
PEGMATITE, GRANITIC                      12
TRACHYTE                                 11
GREISEN                                  11
CLINOPYROXENITE                          10
HARZBURGITE, SPINEL, XENOLITH             9
WEHRLITE, XENOLITH                        8
CLINOPYROXENITE, GARNET, XENOLITH         4
MEGACRYST, OLIVINE, XENOLITH    

### Qs

For example, "PERIDOTITE, PHLOGOPITE-RICHTERITE, XENOLITH" appears to be a rock name.
It seems to represent the composition of the rock based on those minerals/xenoliths?

Should I return this as 'NOT GIVEN' just like with 'PHLOGOPITE, AMPHIBOLE'?

There are approx 20 cases such as this, I assume they should all be returned as 'NOT GIVEN'?

### Further Recategorization Ideas/Notes

* Location does not need to be examined - categorising tectonic setting gives a more relevant comparison.
* Latitude and Longitude min and max are governed entirely by tectonic setting.

* Examine the relationship/relevancy of both Rock Name and Mineral

##################################################################################
# NEXT STEPS

The dataset is now cleaned and the relevant columns are merged etc. We need to now consider what machine learning models we will develop alongside other techniques:

* Think about/Ask whether to include AGE(KA) and AGE(MA).
* Describe exact mol calculation techniques plus the motivation behind conversions and merging.

* Consider Dimension Reduction (PCA etc)
  * If we decide to not use it, justify why.

* Consider Data Imputation - to fill in/add in missing values
  * Display EDA with and w/o imputation
  * Compare Models with and w/o imputation

* Consider model comparisons based on different:
  * normalization methods
  * categorizations
  * model architectures (e.g. random forests)

* Consider building separate models for lithium conc against major elements and lithium conc against trace elements
  * Perform the corresponding comparisons
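As a starting point for the imputation comparison, the sketch below scores the same random forest under two imputation strategies using cross-validation. Everything here is an assumption made for illustration: the data are synthetic, the target is treated as a regression on a continuous value, and `RandomForestRegressor` with median/mean `SimpleImputer` is only one of many possible setups.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for element concentrations and a lithium-like target
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Knock out roughly 20% of the entries to mimic missing measurements
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

def mean_r2(strategy):
    """Mean 5-fold R^2 for a forest trained on data imputed with `strategy`."""
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        RandomForestRegressor(n_estimators=50, random_state=0),
    )
    return cross_val_score(model, X_missing, y, cv=5, scoring="r2").mean()

scores = {strategy: mean_r2(strategy) for strategy in ("mean", "median")}
print(scores)
```

Putting the imputer inside the pipeline means it is refitted on each training fold, so the held-out fold does not leak into the imputed values.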

################################

# Modelling

For effective modelling


In [56]:
Cdf['ROCK NAME'] = Cdf['ROCK NAME'].apply(categorize_rock)
Cdf.columns



Index(['SAMPLE NAME', 'TECTONIC SETTING', 'LOCATION', 'LOCATION COMMENT',
       'LATITUDE (MIN.)', 'LATITUDE (MAX.)', 'LONGITUDE (MIN.)',
       'LONGITUDE (MAX.)', 'LAND/SEA (SAMPLING)', 'ELEVATION (MIN.)',
       ...
       'ANORTHITE(MOL%)', 'ANNITE(MOL%)', 'ENSTATITE(MOL%)',
       'FERROSILITE(MOL%)', 'MARGARITE(MOL%)', 'MUSCOVITE(MOL%)',
       'ORTHOCLASE(MOL%)', 'PARAGONITE(MOL%)', 'PHLOGOPITE(MOL%)',
       'WOLLASTONITE(MOL%)'],
      dtype='object', length=101)
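
The `SettingWithCopyWarning` above usually means the DataFrame was itself created by slicing or filtering another DataFrame. A minimal sketch of one common way to avoid it, assuming that is how `Cdf` was built: take an explicit `.copy()` at the point of filtering. The toy data and `categorize_rock_demo` are hypothetical stand-ins, not the notebook's own table or its `categorize_rock`.

```python
import pandas as pd

# Toy frame standing in for the full mica table (values are made up).
full = pd.DataFrame({
    'MINERAL': ['BIOTITE', 'SERICITE', 'MUSCOVITE'],
    'ROCK NAME': ['GRANITE', 'SCHIST', 'PHLOGOPITE, AMPHIBOLE'],
})

def categorize_rock_demo(name):
    # Hypothetical stand-in for the notebook's categorize_rock.
    return 'NOT GIVEN' if ',' in name else name

# Copy explicitly when filtering, so later assignments act on an
# independent DataFrame rather than on a possible view of `full`.
subset = full[full['MINERAL'] != 'SERICITE'].copy()
subset['ROCK NAME'] = subset['ROCK NAME'].apply(categorize_rock_demo)

print(subset)
```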

In [29]:
Cdf['ROCK TEXTURE'].value_counts()

PORPHYRITIC                                        142
MICROGRANULAR                                       78
MASSIVE MACROCRYSTIC                                66
FINE-GRAINED                                        34
SUBEQUIGRANULAR                                     32
INEQUIGRANULAR                                      21
GRANULAR                                            20
COARSE-GRAINED, MEGACRYSTIC                         16
MEDIUM- TO FINE-GRAINED, PORPHYRITIC                12
FINE- TO MEDIUM-GRAINED                             11
COARSE                                              11
EQUIGRANULAR                                        11
FOLIATED                                             8
PORPHYROBLASTIC                                      7
COARSE-GRAINED, ANISOTROPIC                          7
COARSE-GRANULAR                                      7
PORPHYRITIC, FINE-GRAINED                            6
MACROCRYSTIC                                         6
MEDIUM-GRA

In [57]:
# Before splitting the dataset, drop the columns that will not be used as features
arbitrary = ['ALBITE(MOL%)', 'ANORTHITE(MOL%)', 'ANNITE(MOL%)',
             'ENSTATITE(MOL%)', 'FERROSILITE(MOL%)', 'MARGARITE(MOL%)',
             'MUSCOVITE(MOL%)', 'ORTHOCLASE(MOL%)', 'PARAGONITE(MOL%)',
             'PHLOGOPITE(MOL%)', 'WOLLASTONITE(MOL%)', 'LOCATION COMMENT',
             'SAMPLE NAME', 'LOCATION', 'SPOT', 'CRYSTAL',
             'RIM/CORE (MINERAL GRAINS)', 'GRAIN SIZE', 'PRIMARY/SECONDARY',
             'ALTERATION'
             ]

init = Cdf.drop(arbitrary, axis = 1)
init.columns

Index(['TECTONIC SETTING', 'LATITUDE (MIN.)', 'LATITUDE (MAX.)',
       'LONGITUDE (MIN.)', 'LONGITUDE (MAX.)', 'LAND/SEA (SAMPLING)',
       'ELEVATION (MIN.)', 'ELEVATION (MAX.)', 'ROCK NAME', 'ROCK TEXTURE',
       'DRILLING DEPTH (MIN.)', 'DRILLING DEPTH (MAX.)', 'MINERAL',
       'SIO2(WT%)', 'TIO2(WT%)', 'AL2O3(WT%)', 'FEO(WT%)', 'CAO(WT%)',
       'MGO(WT%)', 'MNO(WT%)', 'K2O(WT%)', 'NA2O(WT%)', 'P2O5(WT%)', 'LI(PPM)',
       'BE(PPM)', 'B(PPM)', 'N(PPM)', 'F(PPM)', 'S(PPM)', 'CL(PPM)', 'SC(PPM)',
       'TI(PPM)', 'V(PPM)', 'CR(PPM)', 'CO(PPM)', 'NI(PPM)', 'CU(PPM)',
       'ZN(PPM)', 'GA(PPM)', 'GE(PPM)', 'AS(PPM)', 'SE(PPM)', 'BR(PPM)',
       'RB(PPM)', 'SR(PPM)', 'Y(PPM)', 'ZR(PPM)', 'NB(PPM)', 'MO(PPM)',
       'AG(PPM)', 'CD(PPM)', 'IN(PPM)', 'SN(PPM)', 'SB(PPM)', 'TE(PPM)',
       'CS(PPM)', 'BA(PPM)', 'LA(PPM)', 'CE(PPM)', 'PR(PPM)', 'ND(PPM)',
       'SM(PPM)', 'EU(PPM)', 'GD(PPM)', 'TB(PPM)', 'DY(PPM)', 'HO(PPM)',
       'ER(PPM)', 'TM(PPM)', 'YB(PPM)', 'LU(PPM)', 'HF

In [58]:
# The categorical variables here are Tectonic Setting, Rock Name,
# Land/Sea (sampling), rock texture, and mineral

Categoricals = ['TECTONIC SETTING', 'ROCK NAME', 'ROCK TEXTURE',
                'LAND/SEA (SAMPLING)', 'MINERAL']
# One-hot encode each categorical column
for col in Categoricals:
  one_hot = pd.get_dummies(init[col], prefix=col)
  # Concatenate the one-hot encoded columns back onto the DataFrame
  init = pd.concat([init, one_hot], axis = 1)
  # Drop the original categorical column
  init = init.drop(col, axis=1)
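
As an aside, the same encoding can be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below uses a made-up three-row frame; `dtype=int` is passed so the dummy columns are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'MINERAL': ['BIOTITE', 'MUSCOVITE', 'BIOTITE'],
    'ROCK NAME': ['GRANITE', 'GRANITE', 'PEGMATITE'],
    'LI(PPM)': [120.0, 45.0, 300.0],
})
categoricals = ['MINERAL', 'ROCK NAME']

# Encodes every listed column, prefixes the new columns with the
# original column name, and drops the original columns.
encoded = pd.get_dummies(toy, columns=categoricals, dtype=int)
print(encoded.columns.tolist())
```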

In [62]:
init

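| | LATITUDE (MIN.) | LATITUDE (MAX.) | LONGITUDE (MIN.) | LONGITUDE (MAX.) | ELEVATION (MIN.) | ELEVATION (MAX.) | DRILLING DEPTH (MIN.) | DRILLING DEPTH (MAX.) | SIO2(WT%) | TIO2(WT%) | ... | LAND/SEA (SAMPLING)_subaerial | LAND/SEA (SAMPLING)_subaquatic | MINERAL_BIOTITE | MINERAL_LEPIDOLITE | MINERAL_MICA | MINERAL_MUSCOVITE | MINERAL_PARAGONITE | MINERAL_PHENGITE | MINERAL_PHLOGOPITE | MINERAL_ZINNWALDITE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57.9 | 58.4 | -5.15 | -5.16 | | | | | 33.800 | 6.900 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | -5.6 | -5.6 | 150.20 | 150.20 | | | | | 36.150 | 3.470 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | -5.6 | -5.6 | 150.20 | 150.20 | | | | | 35.710 | 4.690 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | -5.6 | -5.6 | 150.20 | 150.20 | | | | | 34.830 | 4.070 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | -5.6 | -5.6 | 150.20 | 150.20 | | | | | 36.880 | 3.130 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4704 | 42.0 | 42.0 | 112.00 | 112.00 | | | | | 43.833 | 0.000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4705 | 42.0 | 42.0 | 112.00 | 112.00 | | | | | 44.181 | 0.000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4706 | 42.0 | 42.0 | 112.00 | 112.00 | | | | | 44.135 | 0.003 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4707 | 42.0 | 42.0 | 112.00 | 112.00 | | | | | 44.512 | 0.000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |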


### We begin with a standard test neural net


In [60]:
# NON-NORMALIZED
# 80:20 split
ini_train = init.sample(frac=0.8, random_state=42)
ini_val = init.drop(ini_train.index)

# split the labels from the rest of the features
initrain_features = ini_train.copy()
inival_features = ini_val.copy()

initrain_labels = initrain_features.pop('LI(PPM)')
inival_labels = inival_features.pop('LI(PPM)')
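
For the later comparison against a normalized model, one option is z-score standardisation using statistics computed on the training split only, so that no information from the validation rows leaks into training. The sketch below mirrors the split above on synthetic numbers; the column subset and the z-score choice are assumptions, and other normalization methods would slot in the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not GEOROC values).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'SIO2(WT%)': rng.normal(36, 2, 100),
    'RB(PPM)': rng.normal(500, 150, 100),
    'LI(PPM)': rng.normal(200, 50, 100),
})

# Same 80:20 split pattern as in the cell above.
train = demo.sample(frac=0.8, random_state=42)
val = demo.drop(train.index)

train_x, val_x = train.copy(), val.copy()
train_y = train_x.pop('LI(PPM)')
val_y = val_x.pop('LI(PPM)')

# Statistics come from the training features only.
mean = train_x.mean()
std = train_x.std().replace(0, 1)  # guard against constant columns

train_x_norm = (train_x - mean) / std
val_x_norm = (val_x - mean) / std  # validation reuses the training statistics
```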

In [65]:
# Build the model
Norm_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation = 'relu', input_shape=(initrain_features.shape[1],)),
    # tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation = 'relu'),
    # tf.keras.layers.Dropout(0.2),
  # tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dense(32, activation = 'relu'),
    # tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation = 'linear')
])

Norm_model.compile(loss='mean_squared_error', # MSE because we are predicting a continuous value
                   optimizer = tf.keras.optimizers.Adam(learning_rate=0.01),
                   # 'accuracy' is a classification metric; track mean absolute error for this regression target
                   metrics = ['mae'])

history = Norm_model.fit(initrain_features, initrain_labels, epochs=15,
                  batch_size = 16, validation_data = (inival_features, inival_labels))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
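
For a continuous target such as `LI(PPM)`, regression metrics like MAE, RMSE and R² are more informative than classification accuracy. A minimal sketch of computing them with scikit-learn follows; the four values are invented for illustration, and in the notebook `y_pred` would instead come from something like `Norm_model.predict(inival_features)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented lithium concentrations (ppm), for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

mae = mean_absolute_error(y_true, y_pred)           # mean |error|, in ppm
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large errors more
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained

print(f"MAE={mae:.1f} ppm, RMSE={rmse:.2f} ppm, R2={r2:.2f}")
# prints "MAE=17.5 ppm, RMSE=19.36 ppm, R2=0.97"
```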
