In [1]:
import numpy as np
import pandas as pd

from src.definitions import ROOT_DIR

from src.data.download_data import download_competition_files
from src.model.train_model import score

In [2]:
%load_ext autoreload
%autoreload 2

# Import data

The competition winner's predictions for the open test data are not available in the competition data folder (GitHub or Google Drive). To create them, I pointed Colab to the [winner notebook](https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition/blob/master/lithology_competition/code/OlawaleI/FORCE_Submission_File.ipynb) on GitHub.

I also updated the data input folder to point to the [competition shared Google Drive folder](https://drive.google.com/drive/folders/1GIkjq4fwgwbiqVQxYwoJnOJWVobZ91pL) by adding a shortcut to it in my personal Google Drive. This way I can access the data without duplicating cloud storage.

Finally, I created a [Google Drive shared folder](https://drive.google.com/drive/folders/1ilFw-gfCSbvRjkbEDygTjuxN-Ixa3wxg) named `lith_pred` to keep the results of the Colab runs.

In this notebook, I explore the result of running the winning code on the open test data, in an attempt to reproduce the reported score and test the scoring function.

## Lithology mapping

In [3]:
lithology_numbers = {30000: 0,
                     65030: 1,
                     65000: 2,
                     80000: 3,
                     74000: 4,
                     70000: 5,
                     70032: 6,
                     88000: 7,
                     86000: 8,
                     99000: 9,
                     90000: 10,
                     93000: 11}

## Olawale open test y_pred

In [4]:
output_root = ROOT_DIR / 'data/external'
olawale_open_test_pred_path = output_root / 'olawale_open_test_pred.csv'

In [5]:
if not olawale_open_test_pred_path.is_file():
    download_competition_files()

In [6]:
olawale_open_test_pred = pd.read_csv(olawale_open_test_pred_path)

In [7]:
olawale_open_test_pred.head()

Unnamed: 0,# lithology
0,65000
1,65000
2,65000
3,65000
4,65000


In [8]:
y_pred = olawale_open_test_pred['# lithology'].map(lithology_numbers).values.ravel()

y_pred

array([2, 2, 2, ..., 1, 1, 0])
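
`Series.map` with a dictionary returns `NaN` for any code that is missing from the dictionary, so a coverage check before scoring can catch unmapped labels early. The following is a minimal sketch on a small made-up series, not the competition data; the helper name `map_lithology` is hypothetical and is not part of this repository.

```python
import numpy as np
import pandas as pd

# Same mapping as defined in this notebook.
lithology_numbers = {30000: 0, 65030: 1, 65000: 2, 80000: 3,
                     74000: 4, 70000: 5, 70032: 6, 88000: 7,
                     86000: 8, 99000: 9, 90000: 10, 93000: 11}


def map_lithology(codes: pd.Series) -> np.ndarray:
    """Map lithology codes to class indices, failing loudly on unknown codes.

    Hypothetical helper for illustration only.
    """
    mapped = codes.map(lithology_numbers)
    if mapped.isna().any():
        # Collect the codes that had no entry in the dictionary.
        unknown = sorted(codes[mapped.isna()].unique().tolist())
        raise ValueError(f"Unmapped lithology codes: {unknown}")
    return mapped.to_numpy(dtype=int)


# Made-up example values, not competition data.
example = pd.Series([65000, 65000, 65030, 30000])
print(map_lithology(example))
```

Applying the same check to both `y_pred` and `y_true` would ensure that the arrays passed to `score` contain only valid class indices.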

## y_true

In [9]:
csv_open_test_path = ROOT_DIR / 'data/external/open_test_y_true.csv'

csv_open_test = pd.read_csv(csv_open_test_path, sep=',')

In [10]:
csv_open_test.head()

Unnamed: 0,WELL,DEPTH_MD,FORCE_2020_LITHOFACIES_LITHOLOGY
0,15/9-14,480.628001,65000
1,15/9-14,480.780001,65000
2,15/9-14,480.932001,65000
3,15/9-14,481.084001,65000
4,15/9-14,481.236001,65000


In [11]:
y_true = csv_open_test['FORCE_2020_LITHOFACIES_LITHOLOGY'].map(lithology_numbers).values.ravel()

# Score

In [12]:
olawale_open_test_score = score(y_true, y_pred)

In [13]:
print(f'Olawale open test score is: {olawale_open_test_score:.4f}')

Olawale open test score is: -0.5148
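
The implementation of `score` in `src.model.train_model` is not shown in this notebook. Assuming it follows the competition's penalty-matrix form, where each sample contributes the penalty at row `y_true`, column `y_pred`, and the score is the negative mean penalty, the calculation can be sketched as below. The function name `penalty_score` and the 3x3 matrix are made up for illustration; the real penalty matrix covers all 12 lithology classes and is not reproduced here.

```python
import numpy as np


def penalty_score(y_true, y_pred, penalty_matrix):
    """Negative mean penalty over all samples (hypothetical sketch).

    penalty_matrix[i, j] is the cost of predicting class j when the
    true class is i. A perfect prediction scores 0; larger errors give
    more negative scores.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Fancy indexing picks one penalty per (true, predicted) pair.
    return -penalty_matrix[y_true, y_pred].mean()


# Made-up 3-class penalty matrix, NOT the competition matrix.
toy_penalty = np.array([[0.0, 2.0, 4.0],
                        [2.0, 0.0, 1.0],
                        [4.0, 1.0, 0.0]])

y_true_toy = np.array([0, 1, 2, 2])
y_pred_toy = np.array([0, 2, 2, 0])
print(penalty_score(y_true_toy, y_pred_toy, toy_penalty))  # -1.25
```

Under this form, a score closer to zero means fewer or less costly misclassifications.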


As with the hidden test (notebook 2.3), this score is slightly lower (more negative) than the final test score of -0.5118 reported on the leaderboard. The difference could be due to randomness in the training process when the winning code is rerun.