# Homework 03 of EPS 88

## Learning from a bigger classification of basalt source

In the 2006 paper

>Vermeesch, P. (2006). Tectonic discrimination of basalts with classification trees. Geochimica et Cosmochimica Acta, 70, 1839-1848. https://doi.org/10.1016/j.gca.2005.12.016

Vermeesch wrote:

> *"If a much larger database were compiled, the trees would grow and their discriminative power increase, but they would still be easy to interpret"*

In a more recent paper, Doucet et al. compiled far more data. Rather than 756 basalt data points, they compiled 29,407, of which 22,005 correspond to the categories of Vermeesch (2006).

> Doucet, L. S., Tetley, M. G., Li, Z.-X., Liu, Y., & Gamaleldien, H. (2022). Geochemical fingerprinting of continental and oceanic basalts: A machine learning approach. Earth-Science Reviews, 233, https://doi.org/10.1016/j.earscirev.2022.104192

Your task in this assignment is to use the data of Doucet et al. (2022) to evaluate whether the predictive power of the classification tree approach increases with this increase in data size, as predicted by Vermeesch (2006).

## Import scientific Python libraries

In addition to the standard scientific Python libraries, a number of functions from `sklearn` will be needed as well.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

## Import data

We will import the data from Doucet et al. 2022 that is provided as their supplemental table 1.

In [2]:
Doucet_data = pd.read_csv('../data/Doucet2022.csv',header=11)

Doucet_data

Unnamed: 0,X1,type,location,SiO2,TiO2,Al2O3,MgO,Fe2O3,FeO,FeOt,...,Ho,Er,Tm,Yb,Lu,Hf,Ta,Pb,Th,U
0,26,ARC-C,ANDEAN-ARC-1,46.40,0.54,11.72,19.60,,,11.92,...,0.26,0.8,0.11,0.6,0.10,,,,,
1,27,ARC-C,ANDEAN-ARC-1,45.80,0.64,10.63,21.40,,,11.79,...,,,,,,,,,,
2,28,ARC-C,ANDEAN-ARC-1,47.30,0.58,10.94,20.80,,,10.55,...,0.35,0.9,0.13,0.8,0.11,,,,,
3,29,ARC-C,ANDEAN-ARC-1,52.00,1.30,18.17,5.47,,,8.75,...,0.60,1.4,0.20,1.2,0.18,,,,,
4,30,ARC-C,ANDEAN-ARC-1,51.70,0.81,18.02,7.26,,,8.04,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29402,10976,MORB,PETDB,51.41,15.54,1.32,7.52,,8.50,8.50,...,,3.6,,3.4,0.55,3.5,0.96,1.13,1.05,0.37
29403,10977,MORB,PETDB,45.15,3.18,15.11,7.50,,9.20,9.20,...,,2.1,,1.7,0.24,6.6,4.73,3.92,5.66,1.87
29404,10978,MORB,PETDB,50.36,1.33,15.83,8.72,,8.74,8.74,...,,2.8,,2.7,0.40,2.1,0.13,0.41,0.13,0.05
29405,10979,MORB,PETDB,51.31,1.10,15.94,8.60,,8.62,8.62,...,,3.0,,2.9,,2.0,0.06,0.30,0.04,0.02


The Doucet et al. 2022 study includes data from additional basalt types. To test Vermeesch's hypothesis, let's filter the data to be those from:

- ***Island arc basalts (IAB)*** *In the Doucet et al. dataset these are called `ARC-O` standing for oceanic arc.*
- ***Mid-ocean ridge (MORB)***
- ***Ocean-island (OIB)***

The code below filters to these types and creates a new dataframe called `basalt_data`.

In [3]:
basalt_data = Doucet_data[Doucet_data['type'].isin(['MORB', 'OIB', 'ARC-O'])]

basalt_data

Unnamed: 0,X1,type,location,SiO2,TiO2,Al2O3,MgO,Fe2O3,FeO,FeOt,...,Ho,Er,Tm,Yb,Lu,Hf,Ta,Pb,Th,U
2012,9,ARC-O,IZU-BONIN,52.80,0.30,13.68,9.76,,,8.42,...,,,,,,,,,,
2013,14,ARC-O,IZU-BONIN,52.07,0.53,14.46,9.41,,,9.03,...,,,,,,,,,,
2014,15,ARC-O,IZU-BONIN,52.84,0.58,14.85,8.56,,,9.55,...,,1.48,,1.51,0.23,,,0.54,0.14,0.06
2015,17,ARC-O,IZU-BONIN,52.87,0.56,14.80,8.80,,,8.84,...,,,,,,,,,,
2016,18,ARC-O,IZU-BONIN,52.58,0.56,14.63,9.26,,,8.68,...,,1.58,,1.63,1.25,,,0.21,0.16,0.08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29402,10976,MORB,PETDB,51.41,15.54,1.32,7.52,,8.50,8.50,...,,3.60,,3.40,0.55,3.5,0.96,1.13,1.05,0.37
29403,10977,MORB,PETDB,45.15,3.18,15.11,7.50,,9.20,9.20,...,,2.10,,1.70,0.24,6.6,4.73,3.92,5.66,1.87
29404,10978,MORB,PETDB,50.36,1.33,15.83,8.72,,8.74,8.74,...,,2.80,,2.70,0.40,2.1,0.13,0.41,0.13,0.05
29405,10979,MORB,PETDB,51.31,1.10,15.94,8.60,,8.62,8.62,...,,3.00,,2.90,,2.0,0.06,0.30,0.04,0.02


## Build a decision tree classifier

Take the same approach that we did in class to build a decision tree classifier that distinguishes between the different `type` values (as they are called in the Doucet et al. (2022) dataset). You will want to take these steps:

- Encode the target variable 'type' using LabelEncoder
- Split the data into features (X) and target (y)
    - When you do this split, drop the `['type','location','X1']` columns from X, as we don't want them to be part of the classification. You can drop them with this code:
    > `X = basalt_data.drop(['type','location','X1'], axis=1)`
- Impute missing values using median imputation
- Split the data into training and testing sets
- Train the decision tree classifier
- Make predictions on the test set
- Evaluate the classifier
- Plot the tree
- Get and display the feature importances from the classifier
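
The steps above can be sketched end to end as follows. This is a minimal illustration, not a solution: it assumes a small synthetic dataframe (`toy`, with random values that carry no geochemical meaning) in place of `basalt_data`, and it prints a text rendering of the tree with `export_text` rather than drawing it with `plot_tree`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# ASSUMPTION: `toy` is a synthetic stand-in for `basalt_data`.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'X1': np.arange(n),
    'type': rng.choice(['ARC-O', 'MORB', 'OIB'], size=n),
    'location': 'SYNTHETIC',
    'SiO2': rng.normal(50, 2, n),
    'TiO2': rng.normal(1.5, 0.5, n),
    'MgO': rng.normal(8, 1, n),
})
toy.loc[toy['type'] == 'OIB', 'TiO2'] += 1.5    # make one feature informative
toy.loc[rng.random(n) < 0.2, 'MgO'] = np.nan    # mimic missing measurements

# Encode the target variable and separate features from target.
encoder = LabelEncoder()
y = encoder.fit_transform(toy['type'])
X = toy.drop(['type', 'location', 'X1'], axis=1)

# Fill missing values with each column's median.
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42)

# Train, predict, and evaluate.
classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'accuracy: {accuracy:.3f}')

# Text view of the tree (plot_tree draws the same structure as a figure).
print(export_text(classifier, feature_names=list(X.columns)))

# Feature importances, largest first.
importances = pd.Series(classifier.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

The values chosen here (`test_size=0.2`, `max_depth=4`, `random_state=42`) are illustrative only; with the real data you would substitute `basalt_data` for `toy` and choose your own settings.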

### Setting the `max_depth`
One consideration when setting up the classifier is the `max_depth` parameter, which can be set to constrain the maximum depth of the tree.

The default setting is `max_depth=None`, which means the tree keeps splitting until every leaf contains samples from a single category (or too few samples to split further). For interpretability, it could be beneficial to set a `max_depth` value like so:

```
classifier = DecisionTreeClassifier(max_depth=12)
```

Once you have your machine learning classifier working, experiment with the tradeoff of predictive accuracy that comes with decreasing the depth of the tree and try to find a balance.
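
One way to explore this tradeoff is to loop over several `max_depth` values and record the training accuracy, test accuracy, and tree size for each. The sketch below assumes synthetic three-class data from `make_classification` rather than the basalt data, and the list of depths is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic data standing in for the imputed basalt features.
X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for depth in [2, 4, 8, 12, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results[depth] = {
        'train': accuracy_score(y_train, clf.predict(X_train)),
        'test': accuracy_score(y_test, clf.predict(X_test)),
        'depth': clf.get_depth(),        # depth the tree actually reached
        'leaves': clf.get_n_leaves(),    # number of terminal nodes
    }
    r = results[depth]
    print(f"max_depth={depth}: train={r['train']:.3f} test={r['test']:.3f} "
          f"depth={r['depth']} leaves={r['leaves']}")
```

A widening gap between training and test accuracy as the depth grows suggests that the extra branches are fitting noise rather than adding predictive power.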

**How does the accuracy of the decision tree based on the larger dataset from Doucet et al. (2022) compare to that using the smaller dataset from Vermeesch (2006)?**

*Write your answer here*

**What `max_depth` value do you think represents a good balance between predictive power and model complexity?**

*Write your answer here*

**What similarities and differences are there between the importance of different data fields (feature importance) between the decision tree built on the Vermeesch (2006) data compilation vs that built on the Doucet et al. (2022) data compilation?**

*Write your answer here*

## Comparing the classification of the Vermeesch (2006) dataset

- Import the Vermeesch (2006) dataset and apply the decision tree classifier learned from the Doucet et al. (2022) dataset to it.
- Comment on how well the decision tree based on Doucet et al. (2022) applies to other datasets.

This will entail making sure that the column names and classification names are the same.
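
The alignment step might look something like the sketch below. Everything about the second dataframe is hypothetical: the label column name `affinity`, the `(wt%)` suffixes, and the absent `MgO` column are invented for illustration, so check the real Vermeesch (2006) file for its actual names. Both dataframes hold random synthetic values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# ASSUMPTION: synthetic stand-in for the training data.
train = pd.DataFrame({
    'type': rng.choice(['ARC-O', 'MORB', 'OIB'], size=200),
    'SiO2': rng.normal(50, 2, 200),
    'TiO2': rng.normal(1.5, 0.5, 200),
    'MgO': rng.normal(8, 1, 200),
})
feature_names = ['SiO2', 'TiO2', 'MgO']

encoder = LabelEncoder()
y_train = encoder.fit_transform(train['type'])
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(train[feature_names])
classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
classifier.fit(X_train, y_train)

# ASSUMPTION: a second dataset with different (hypothetical) naming.
other = pd.DataFrame({
    'affinity': rng.choice(['IAB', 'MORB', 'OIB'], size=50),
    'TiO2 (wt%)': rng.normal(1.5, 0.5, 50),
    'SiO2 (wt%)': rng.normal(50, 2, 50),
})

# 1. Rename columns to match the training features.
renamed = other.rename(columns={'SiO2 (wt%)': 'SiO2', 'TiO2 (wt%)': 'TiO2'})
# 2. Put columns in the training order; absent features become NaN.
X_other = renamed.reindex(columns=feature_names)
# 3. Translate class names, then reuse the fitted encoder.
labels = renamed['affinity'].replace({'IAB': 'ARC-O'})
y_other = encoder.transform(labels)
# 4. Reuse the fitted imputer so gaps get the training medians.
X_other_imputed = imputer.transform(X_other)

y_pred = classifier.predict(X_other_imputed)
other_accuracy = accuracy_score(y_other, y_pred)
print(f'accuracy on the second dataset: {other_accuracy:.3f}')
```

Reusing the fitted `encoder` and `imputer`, rather than fitting new ones, keeps the integer class codes and fill values identical to those the tree saw during training.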

## Comparing to a Support Vector Machine (SVM) classifier

- Implement the SVM classifier on the `basalt_data` dataset. Recall that the SVM classifier requires normalization of the data. Use the `StandardScaler` to normalize the data.
- Compare the accuracy of the SVM classifier to the decision tree classifier.
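
A minimal sketch of this comparison, assuming synthetic data whose features have deliberately been put on very different scales, is shown below. Wrapping `StandardScaler` and `SVC` in a pipeline is one convenient option, not the only one; the RBF kernel and `max_depth=12` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic data standing in for the imputed basalt features.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)
# Spread the features over several orders of magnitude.
X = X * np.array([1, 1, 10, 10, 100, 100, 1000, 1000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then applies
# the same scaling to whatever is passed to predict().
svm_model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm_model.fit(X_train, y_train)
svm_accuracy = accuracy_score(y_test, svm_model.predict(X_test))

# Decision trees split on one feature at a time, so no scaling is applied.
tree_model = DecisionTreeClassifier(max_depth=12, random_state=42)
tree_model.fit(X_train, y_train)
tree_accuracy = accuracy_score(y_test, tree_model.predict(X_test))

print(f'SVM accuracy:           {svm_accuracy:.3f}')
print(f'decision tree accuracy: {tree_accuracy:.3f}')
```

Using the same train/test split for both models keeps the two accuracy values directly comparable.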

## Adding more categories

- Implement the SVM classifier on Doucet data with all of the categories included, meaning, **do not filter** to just the `ARC-O`, `MORB`, and `OIB` categories.
- Plot the confusion matrix for all of the categories.
- Comment on how well the SVM classifier does with the additional categories.
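
As a rough illustration of the confusion matrix step, the sketch below trains an SVM on synthetic four-class data and tabulates true versus predicted classes. The data are random, and only four of the category names are used here; the full Doucet et al. (2022) dataset contains more.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic data; the labels are borrowed names only.
class_names = np.array(['ARC-C', 'ARC-O', 'MORB', 'OIB'])
X, y_int = make_classification(n_samples=1200, n_features=8, n_informative=5,
                               n_redundant=0, n_classes=4, random_state=0)
y = class_names[y_int]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

svm_model = make_pipeline(StandardScaler(), SVC())
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=class_names)
cm_table = pd.DataFrame(cm,
                        index=[f'true {c}' for c in class_names],
                        columns=[f'pred {c}' for c in class_names])
print(cm_table)

accuracy = accuracy_score(y_test, y_pred)
print(f'accuracy: {accuracy:.3f}')
```

To draw the matrix as a figure, `ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names).plot()` from `sklearn.metrics` is one option. Off-diagonal cells show which pairs of categories the classifier confuses most often.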

