# Machine-learning Notebook (without Dask)
- Machine_Learning_vB2_20170802
- A cleaner version of Machine_Learning_vB in the same folder
- preceeded by feature creation notebooks

## Contents
1. [Read In a HDF5 from the previous notebook that creates the features](#readInFirstHDF5)
2. [Add column for train or test based on a split %, like 80%/20%, split based on well UWI](#trainVsTestCol)
3. [Rebalance the classes by throwing out some of the rows away from the pick and duplicating some rows at or near the known pick.](#rebalanceClasses)
4. [Identify which columns to use as training features](#identifyTrainingFeatures)
5. [Identify which columns to use as labels](#identifyLabelCol)
6. [Split single dataframe into 4 for train-features,train-labels,test-features,test-labels](#splitDataframe)
7. [Machine learning using standard XGBoost classifier and not yet Dask](#machineLearningNoDask)
8. [Evaluate the initial results](#ml_evaluation)
9. [Turning row-by-row classification prediction into single well pick depth prediction](#classificationToPick)


In [3]:
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
import welly
from welly import Well
import lasio
import glob
from sklearn import neighbors
import pickle
import math
import dask
import dask.dataframe as dd
from dask.distributed import Client
# import pdvega
# import vega
import random
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import mean_squared_error


  return f(*args, **kwds)
  return f(*args, **kwds)


In [4]:
print(welly.__version__)
print(dask.__version__)
print(pd.__version__)

0.3.5
0.18.2
0.23.3


In [5]:
%%timeit
import os
env = %env

78.6 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [6]:
#### Had to change display options to get this to print in full!
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.options.display.max_colwidth = 100000

In [7]:
knn_dir = "../WellsKNN/"
load_dir = "../loadLAS"
features_dir = "../createFeatures/"

## If you open this notebook fresh and jump to a point below where a pick file is read in, you still need to load everything above! 

------------

# Reading in the last hdf5 file<a name="readInFirstHDF5"></a>

In [10]:
h5_to_load = 'df_all_wells_wKNN_DEPTHtoDEPT_KNN1PredTopMcM_20180724.h5'
h5_key = 'df'
df_all_Col_preSplit = pd.read_hdf(features_dir+h5_to_load, h5_key)

In [11]:
df_all_Col_preSplit.head()

Unnamed: 0,CALI,COND,DELT,DENS,DEPT,DEPTH,DPHI,DPHI:1,DPHI:2,DT,GR,GR:1,GR:2,IL,ILD,ILD:1,ILD:2,ILM,LITH,LLD,LLS,NPHI,PHID,PHIN,RESD,RHOB,RT,SFL,SFLU,SN,SNP,SP,UWI,SitID,McMurray_Base_HorID,McMurray_Top_HorID,McMurray_Base_DEPTH,McMurray_Top_DEPTH,McMurray_Base_Qual,McMurray_Top_Qual,lat,lng,NN1_McMurray_Top_DEPTH,NN1_McMurray_Base_DEPTH,NN1_thickness,MM_Top_Depth_predBy_NN1thick,HorID,Pick,Quality,HorID_paleoz,Pick_paleoz,Quality_paleoz,diff_TMcM_Pick_v_DEPT,diff_TPal_Pick_v_DEPT,cat_isTopMcMrNearby_known,cat_isTopPalNearby_known,DistFrom_NN1_TopDepth_Abs,NewWell,LastBitWell,TopWellDept,BotWellDept,FromTopWell,FromBotWell,WellThickness,closerToBotOrTop,closTopBotDist,rowsToEdge,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
0,167.003,,,,149.602,,0.227,,,,102.473,,,,0.0,,,,,,,0.46,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,228.348,235.058,0,0,210.058,True,False,149.602,396.102,0.0,246.5,246.5,FromTopWell,0.0,0,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,199.159,,,,149.852,,0.263,,,,122.589,,,,4.202,,,,,,,0.55,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,228.098,234.808,0,0,209.808,False,False,149.602,396.102,0.25,246.25,246.5,FromTopWell,0.25,1,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202
2,200.496,,,,150.102,,0.252,,,,120.196,,,,4.643,,,,,,,0.537,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,227.848,234.558,0,0,209.558,False,False,149.602,396.102,0.5,246.0,246.5,FromTopWell,0.5,2,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643
3,203.933,,,,150.352,,0.244,,,,115.975,,,,5.28,,,,,,,0.513,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,227.598,234.308,0,0,209.308,False,False,149.602,396.102,0.75,245.75,246.5,FromTopWell,0.75,3,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28
4,203.664,,,,150.602,,0.24,,,,109.271,,,,6.592,,,,,,,0.487,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,227.348,234.058,0,0,209.058,False,False,149.602,396.102,1.0,245.5,246.5,FromTopWell,1.0,4,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592


-------------

# Train vs Test Column creation <a name="trainVsTestCol"></a>

We'll do this based on UWIs, so we don't have any datapoints from train wells in our test datset. This is more like reality than if we'd sample train and test rows randomally from the whole dataframe.

Get all the UWIs

In [12]:
UWIs = list(df_all_Col_preSplit['UWI'].unique())

Find the number of wells if you want 80%

In [13]:
numberOfTrainingWells = math.floor(len(UWIs)*0.8)
numberOfTrainingWells

1525

Randomly select that number of UWIs for training and the ones left for test

In [14]:
UWIs_training = random.sample(UWIs, numberOfTrainingWells)

In [15]:
UWIs_test = [x for x in UWIs if x not in UWIs_training]

In [16]:
print("train",len(UWIs_training))
print("test",len(UWIs_test))

train 1525
test 382


In [17]:
df_all_Col_preSplit_wTrainTest = df_all_Col_preSplit.copy()

In [18]:
df_all_Col_preSplit_wTrainTest['trainOrTest'] = np.where(df_all_Col_preSplit_wTrainTest['UWI'].isin(UWIs_training), 'train', 'test')

In [19]:
df_all_Col_preSplit_wTrainTest.tail()

Unnamed: 0,CALI,COND,DELT,DENS,DEPT,DEPTH,DPHI,DPHI:1,DPHI:2,DT,GR,GR:1,GR:2,IL,ILD,ILD:1,ILD:2,ILM,LITH,LLD,LLS,NPHI,PHID,PHIN,RESD,RHOB,RT,SFL,SFLU,SN,SNP,SP,UWI,SitID,McMurray_Base_HorID,McMurray_Top_HorID,McMurray_Base_DEPTH,McMurray_Top_DEPTH,McMurray_Base_Qual,McMurray_Top_Qual,lat,lng,NN1_McMurray_Top_DEPTH,NN1_McMurray_Base_DEPTH,NN1_thickness,MM_Top_Depth_predBy_NN1thick,HorID,Pick,Quality,HorID_paleoz,Pick_paleoz,Quality_paleoz,diff_TMcM_Pick_v_DEPT,diff_TPal_Pick_v_DEPT,cat_isTopMcMrNearby_known,cat_isTopPalNearby_known,DistFrom_NN1_TopDepth_Abs,NewWell,LastBitWell,TopWellDept,BotWellDept,FromTopWell,FromBotWell,WellThickness,closerToBotOrTop,closTopBotDist,rowsToEdge,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,trainOrTest
1482751,,,,,359.0,,0.014,,,,61.724,,,,53.94,,,,,,,0.191,,,,,,,,,,,00/10-35-081-15W4/0,154240,14000,13000,348.0,321.0,1,3,56.066128,-112.234008,300.5,323.5,23.0,325.0,13000,321.0,3,14000,348.0,1,-38.0,-11.0,0,0,34.0,False,False,140.0,360.0,219.0,1.0,220.0,FromBotWell,1.0,4,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,61.724,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,53.94,train
1482752,,,,,359.25,,0.014,,,,59.927,,,,63.882,,,,,,,0.167,,,,,,,,,,,00/10-35-081-15W4/0,154240,14000,13000,348.0,321.0,1,3,56.066128,-112.234008,300.5,323.5,23.0,325.0,13000,321.0,3,14000,348.0,1,-38.25,-11.25,0,0,34.25,False,False,140.0,360.0,219.25,0.75,220.0,FromBotWell,0.75,3,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,59.927,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,63.882,train
1482753,,,,,359.5,,0.011,,,,58.729,,,,74.245,,,,,,,0.155,,,,,,,,,,,00/10-35-081-15W4/0,154240,14000,13000,348.0,321.0,1,3,56.066128,-112.234008,300.5,323.5,23.0,325.0,13000,321.0,3,14000,348.0,1,-38.5,-11.5,0,0,34.5,False,False,140.0,360.0,219.5,0.5,220.0,FromBotWell,0.5,2,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,58.729,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,74.245,train
1482754,,,,,359.75,,0.007,,,,57.529,,,,93.046,,,,,,,0.148,,,,,,,,,,,00/10-35-081-15W4/0,154240,14000,13000,348.0,321.0,1,3,56.066128,-112.234008,300.5,323.5,23.0,325.0,13000,321.0,3,14000,348.0,1,-38.75,-11.75,0,0,34.75,False,False,140.0,360.0,219.75,0.25,220.0,FromBotWell,0.25,1,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,57.529,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,93.046,train
1482755,,,,,360.0,,0.006,,,,56.926,,,,138.167,,,,,,,0.14,,,,,,,,,,,00/10-35-081-15W4/0,154240,14000,13000,348.0,321.0,1,3,56.066128,-112.234008,300.5,323.5,23.0,325.0,13000,321.0,3,14000,348.0,1,-39.0,-12.0,0,0,35.0,False,True,140.0,360.0,220.0,0.0,220.0,FromBotWell,0.0,0,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,56.926,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,138.167,train


--------------

# Rebalance class, aka label, populations to deal with lopsided class populations<a name="rebalanceClasses"></a>

#### Because we have a lot more rows far away from the pick than exactly at the pick or close to the pick, we run the risk of being class heavy in some areas. This can result in not enough ability to identify the sparsely populate classes, like right at the pick. 
#### We'll attemp to deal with this problem by throwing out some of the rows far away from the pick and duplicating some of the rows right at or near the pick.

In [20]:
#### create a copy for the test below to avoid rewriting accidentally
df_test_5 = df_all_Col_preSplit_wTrainTest.copy()

In [21]:
def countRowsByClassOfNearPickOrNot(df,arrayOfClass,divisionInt,classToShrink):
    """
    Takes as input a dataframe, array of classes, an integer to divide by, and  a column, and a class within the column to shrink.
    Returns the dataframe minus the rows that match the ClassToShrink in the Col and prints details about the number of rows of the various classes.
    """
    for eachClass in arrayOfClass:
        print("length of rows with "+str(eachClass)+" in cat_isTopMcMrNearby_known:",len(df[df['cat_isTopMcMrNearby_known'] == eachClass]))
    df_NearPickZeroSmall = df.loc[(df.index%10 != 3) & (df['cat_isTopMcMrNearby_known'] == classToShrink)]
    print("length of rows with 0 in cat_isTopMcMrNearby_known and %"+str(divisionInt)+" == 0 is:",len(df_NearPickZeroSmall))
    print("% reduction in classs 0 is:", math.floor(len(df_NearPickZeroSmall) / len(df['cat_isTopMcMrNearby_known'] == classToShrink) * 100),"%")
    total_after_reduction_in_bigger_class = len(df[df['cat_isTopMcMrNearby_known'] == classToShrink]) -len(df_NearPickZeroSmall)
    print("if taken out using this remainder, the total number of 0 class will be: ",total_after_reduction_in_bigger_class)
#     print("ratio between that class away from pick and classes near pick is :":)
    return df_NearPickZeroSmall

In [22]:
class_array_NearPick = [100,95,60,0]
test_df_return = countRowsByClassOfNearPickOrNot(df_test_5,class_array_NearPick,2,0)

length of rows with 100 in cat_isTopMcMrNearby_known: 1014
length of rows with 95 in cat_isTopMcMrNearby_known: 4819
length of rows with 60 in cat_isTopMcMrNearby_known: 61345
length of rows with 0 in cat_isTopMcMrNearby_known: 1415578
length of rows with 0 in cat_isTopMcMrNearby_known and %2 == 0 is: 1274047
% reduction in classs 0 is: 85 %
if taken out using this remainder, the total number of 0 class will be:  141531


In [23]:
def dropsRowsWithMatchClassAndDeptRemainderIsZero(df,Col,RemainderInt,classToShrink):
    """
    Takes as input a dataframe, a column, a remainder integer, and a class within the column.
    Returns the dataframe minus the rows that match the ClassToShrink in the Col and have a depth from the DEPT col with a remainder of zero.
    """
    print("original lenght of dataframe = ",len(df))
    df_new = df.drop(df[(df[Col] == classToShrink) & (df.index%10 != 0)].index)
    print("length of new dataframe after dropping rows = ",len(df_new))
    print("number of rows dropped = ",len(df)-len(df_new))
    print("length of 0 class is :",len(df_new[df_new[Col] == classToShrink]))
    return df_new

In [24]:
df_all_Col_preSplit_wTrainTest_ClassBalanced = dropsRowsWithMatchClassAndDeptRemainderIsZero(df_all_Col_preSplit_wTrainTest,'cat_isTopMcMrNearby_known',7,0)

original lenght of dataframe =  1482756
length of new dataframe after dropping rows =  208758
number of rows dropped =  1273998
length of 0 class is : 141580


In [25]:
df_all_Col_preSplit_wTrainTest_ClassBalanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208758 entries, 0 to 1482750
Columns: 132 entries, CALI to trainOrTest
dtypes: bool(2), float64(115), int64(12), object(3)
memory usage: 209.0+ MB


In [26]:
def addsRowsToBalanceClasses(df,rangeFor100,rangeFor95):
    """
    Input is a dataframe, range for class 100, and range for class 95
    Copies the rows with labels that don't occur very much so they are a larger part of dataframe
    returns the new dataframe with additional copies of rows added on
    """
    df_class100 = df[df['cat_isTopMcMrNearby_known'] == 100]
    df_class95 = df[df['cat_isTopMcMrNearby_known'] == 95]
    for each1 in range(rangeFor100):
        #print(each1)
        df = df.append(df_class100, ignore_index=True)
    for each2 in range(rangeFor95):
        #print(each2)
        df = df.append(df_class95, ignore_index=True)
    return df

In [27]:
df_all_Col_preSplit_wTrainTest_ClassBalanced2 = addsRowsToBalanceClasses(df_all_Col_preSplit_wTrainTest_ClassBalanced,50,10)

In [28]:
len(df_all_Col_preSplit_wTrainTest_ClassBalanced2)

307648

In [29]:
df_all_Col_preSplit_wTrainTest_ClassBalanced2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307648 entries, 0 to 307647
Columns: 132 entries, CALI to trainOrTest
dtypes: bool(2), float64(115), int64(12), object(3)
memory usage: 305.7+ MB


In [30]:
df_all_Col_preSplit_wTrainTest_ClassBalanced = df_all_Col_preSplit_wTrainTest_ClassBalanced2

# Identify which columns to use as features <a name="identifyTrainingFeatures"></a>

Get a list of columns

In [31]:
col_list = df_all_Col_preSplit_wTrainTest_ClassBalanced.columns
print(col_list)

Index(['CALI', 'COND', 'DELT', 'DENS', 'DEPT', 'DEPTH', 'DPHI', 'DPHI:1', 'DPHI:2', 'DT',
       ...
       'ILD_min_11winSize_dirAroundnLarge', 'ILD_min_21winSize_dirAroundMin', 'ILD_min_21winSize_dirAboveMin', 'ILD_min_21winSize_dirAroundMax', 'ILD_min_21winSize_dirAboveMax', 'ILD_min_21winSize_dirAroundMean', 'ILD_min_21winSize_dirAboveMean', 'ILD_min_21winSize_dirAbovenLarge', 'ILD_min_21winSize_dirAroundnLarge', 'trainOrTest'], dtype='object', length=132)


In [32]:
col_list = list(col_list)
col_list

['CALI',
 'COND',
 'DELT',
 'DENS',
 'DEPT',
 'DEPTH',
 'DPHI',
 'DPHI:1',
 'DPHI:2',
 'DT',
 'GR',
 'GR:1',
 'GR:2',
 'IL',
 'ILD',
 'ILD:1',
 'ILD:2',
 'ILM',
 'LITH',
 'LLD',
 'LLS',
 'NPHI',
 'PHID',
 'PHIN',
 'RESD',
 'RHOB',
 'RT',
 'SFL',
 'SFLU',
 'SN',
 'SNP',
 'SP',
 'UWI',
 'SitID',
 'McMurray_Base_HorID',
 'McMurray_Top_HorID',
 'McMurray_Base_DEPTH',
 'McMurray_Top_DEPTH',
 'McMurray_Base_Qual',
 'McMurray_Top_Qual',
 'lat',
 'lng',
 'NN1_McMurray_Top_DEPTH',
 'NN1_McMurray_Base_DEPTH',
 'NN1_thickness',
 'MM_Top_Depth_predBy_NN1thick',
 'HorID',
 'Pick',
 'Quality',
 'HorID_paleoz',
 'Pick_paleoz',
 'Quality_paleoz',
 'diff_TMcM_Pick_v_DEPT',
 'diff_TPal_Pick_v_DEPT',
 'cat_isTopMcMrNearby_known',
 'cat_isTopPalNearby_known',
 'DistFrom_NN1_TopDepth_Abs',
 'NewWell',
 'LastBitWell',
 'TopWellDept',
 'BotWellDept',
 'FromTopWell',
 'FromBotWell',
 'WellThickness',
 'closerToBotOrTop',
 'closTopBotDist',
 'rowsToEdge',
 'GR_min_5winSize_dirAroundMin',
 'GR_min_5winSize_dirA

## Manually copy the list above and take out some that are labels or aren't things you want to use as training
- At some point come back and see if I can instead use a standard list of things to not include and make the list of columns to use as features more automatically???

In [33]:
## NOTE WE ARE LEAVING THE UWI in for now but will take it out after dataframe is split into train/test portions!!!!
train_feat_bigList = [
 'UWI',
 'trainOrTest',
 'CALI',
 'COND',
 'DELT',
 'DENS',
 'DPHI',
 'DPHI:1',
 'DPHI:2',
 'DT',
 'GR',
 'GR:1',
 'GR:2',
 'IL',
 'ILD',
 'ILD:1',
 'ILD:2',
 'ILM',
 'LITH',
 'LLD',
 'LLS',
 'NPHI',
 'PHID',
 'PHIN',
 'RESD',
 'RHOB',
 'RT',
 'SFL',
 'SFLU',
 'SN',
 'SNP',
 'SP',
 'McMurray_Base_Qual',
 'McMurray_Top_Qual',
 'lat',
 'lng',  
 'NN1_thickness',
 'MM_Top_Depth_predBy_NN1thick',
 'Quality',
 'Quality_paleoz',
 'DistFrom_NN1_TopDepth_Abs',
 'BotWellDept',
 'FromTopWell',
 'FromBotWell',
 'WellThickness',
 'rowsToEdge',
 'GR_min_5winSize_dirAroundMin',
 'GR_min_5winSize_dirAboveMin',
 'GR_min_5winSize_dirAroundMax',
 'GR_min_5winSize_dirAboveMax',
 'GR_min_5winSize_dirAroundMean',
 'GR_min_5winSize_dirAboveMean',
 'GR_min_5winSize_dirAbovenLarge',
 'GR_min_5winSize_dirAroundnLarge',
 'GR_min_7winSize_dirAroundMin',
 'GR_min_7winSize_dirAboveMin',
 'GR_min_7winSize_dirAroundMax',
 'GR_min_7winSize_dirAboveMax',
 'GR_min_7winSize_dirAroundMean',
 'GR_min_7winSize_dirAboveMean',
 'GR_min_7winSize_dirAbovenLarge',
 'GR_min_7winSize_dirAroundnLarge',
 'GR_min_11winSize_dirAroundMin',
 'GR_min_11winSize_dirAboveMin',
 'GR_min_11winSize_dirAroundMax',
 'GR_min_11winSize_dirAboveMax',
 'GR_min_11winSize_dirAroundMean',
 'GR_min_11winSize_dirAboveMean',
 'GR_min_11winSize_dirAbovenLarge',
 'GR_min_11winSize_dirAroundnLarge',
 'GR_min_21winSize_dirAroundMin',
 'GR_min_21winSize_dirAboveMin',
 'GR_min_21winSize_dirAroundMax',
 'GR_min_21winSize_dirAboveMax',
 'GR_min_21winSize_dirAroundMean',
 'GR_min_21winSize_dirAboveMean',
 'GR_min_21winSize_dirAbovenLarge',
 'GR_min_21winSize_dirAroundnLarge',
 'ILD_min_5winSize_dirAroundMin',
 'ILD_min_5winSize_dirAboveMin',
 'ILD_min_5winSize_dirAroundMax',
 'ILD_min_5winSize_dirAboveMax',
 'ILD_min_5winSize_dirAroundMean',
 'ILD_min_5winSize_dirAboveMean',
 'ILD_min_5winSize_dirAbovenLarge',
 'ILD_min_5winSize_dirAroundnLarge',
 'ILD_min_7winSize_dirAroundMin',
 'ILD_min_7winSize_dirAboveMin',
 'ILD_min_7winSize_dirAroundMax',
 'ILD_min_7winSize_dirAboveMax',
 'ILD_min_7winSize_dirAroundMean',
 'ILD_min_7winSize_dirAboveMean',
 'ILD_min_7winSize_dirAbovenLarge',
 'ILD_min_7winSize_dirAroundnLarge',
 'ILD_min_11winSize_dirAroundMin',
 'ILD_min_11winSize_dirAboveMin',
 'ILD_min_11winSize_dirAroundMax',
 'ILD_min_11winSize_dirAboveMax',
 'ILD_min_11winSize_dirAroundMean',
 'ILD_min_11winSize_dirAboveMean',
 'ILD_min_11winSize_dirAbovenLarge',
 'ILD_min_11winSize_dirAroundnLarge',
 'ILD_min_21winSize_dirAroundMin',
 'ILD_min_21winSize_dirAboveMin',
 'ILD_min_21winSize_dirAroundMax',
 'ILD_min_21winSize_dirAboveMax',
 'ILD_min_21winSize_dirAroundMean',
 'ILD_min_21winSize_dirAboveMean',
 'ILD_min_21winSize_dirAbovenLarge',
 'ILD_min_21winSize_dirAroundnLarge']

In [34]:
len(train_feat_bigList)

110

In [35]:
df_train_feat = df_all_Col_preSplit_wTrainTest_ClassBalanced[train_feat_bigList]

In [36]:
df_train_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307648 entries, 0 to 307647
Columns: 110 entries, UWI to ILD_min_21winSize_dirAroundnLarge
dtypes: float64(103), int64(5), object(2)
memory usage: 258.2+ MB


Describing the dataframe here to find out which columns are not populated very much and have a lot of blanks. We'll likely exclude those columns. At this point doing it manually.

In [37]:
df_train_feat.describe()

Unnamed: 0,CALI,COND,DELT,DENS,DPHI,DPHI:1,DPHI:2,DT,GR,GR:1,GR:2,IL,ILD,ILD:1,ILD:2,ILM,LITH,LLD,LLS,NPHI,PHID,PHIN,RESD,RHOB,RT,SFL,SFLU,SN,SNP,SP,McMurray_Base_Qual,McMurray_Top_Qual,lat,lng,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,FromBotWell,WellThickness,rowsToEdge,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
count,101670.0,299.0,11903.0,376.0,279301.0,125.0,125.0,3548.0,307214.0,125.0,125.0,669.0,305061.0,125.0,125.0,1580.0,142.0,543.0,120.0,289830.0,757.0,258.0,516.0,22925.0,83.0,971.0,1579.0,318.0,139.0,3469.0,307648.0,307648.0,307648.0,307648.0,306894.0,306894.0,307648.0,307648.0,306894.0,307648.0,307648.0,307648.0,307648.0,307648.0,307211.0,307214.0,307211.0,307214.0,307211.0,307214.0,307214.0,307213.0,307212.0,307214.0,307212.0,307214.0,307212.0,307214.0,307214.0,307213.0,307213.0,307213.0,307213.0,307213.0,307213.0,307213.0,307213.0,307214.0,307213.0,307212.0,307213.0,307212.0,307213.0,307212.0,307212.0,307213.0,305061.0,305059.0,305061.0,305059.0,305061.0,305059.0,305059.0,305061.0,305060.0,305058.0,305060.0,305058.0,305060.0,305058.0,305058.0,305060.0,305059.0,305058.0,305059.0,305058.0,305059.0,305058.0,305058.0,305059.0,305057.0,305057.0,305057.0,305057.0,305057.0,305057.0,305057.0,305058.0
mean,182.76178,97.331595,339.759574,2147.886944,0.251156,0.347784,0.347784,500.181759,79.313089,59.771432,59.771432,25.097675,23.539498,24.244392,24.244392,74.728143,59.682401,206.408972,108.736838,0.416785,0.280582,0.305053,51.134554,1766.692709,137.096514,130.46675,93.908314,45.015197,0.327813,-34.276468,1.333326,1.804049,55.857735,-112.079129,38.724685,403.431947,1.804049,1.333326,53.485277,470.89904,138.6051,82.252457,220.857556,199.356294,71.573041,71.984871,86.120706,84.991337,78.972551,78.517582,78.517582,78.97372,69.013426,69.910424,88.06509,86.434257,78.734864,78.087746,80.821154,81.849248,65.58226,67.632727,90.750929,88.581815,78.32995,77.784944,83.872589,85.310916,61.739544,66.41741,94.620626,91.347238,78.004253,78.507993,88.076699,88.620342,18.918723,17.821049,30.749279,29.813357,23.988587,23.612654,23.612654,23.97185,17.4273,16.606689,34.325458,31.405646,24.307483,23.335367,25.728592,26.420769,14.762723,15.276471,39.864485,33.327836,24.647141,22.71106,28.084316,30.63066,12.963767,15.90267,47.33765,35.854583,25.116095,23.140083,31.582544,36.04866
std,57.210784,60.958102,63.411731,81.185622,0.539236,0.061691,0.061691,125.42866,24.889367,15.979349,15.979349,69.33101,245.88139,48.047357,48.047357,151.337433,11.263283,168.562692,202.206645,0.740066,0.076542,0.077695,153.349352,927.213034,375.578843,280.789476,372.107834,29.898305,0.052246,89.303414,0.770345,0.791408,0.736551,1.142158,25.972044,164.765071,0.791408,0.770345,101.181928,195.932171,78.213462,82.747857,90.783465,154.032392,23.347048,22.829652,25.581937,24.977939,23.731531,23.342336,23.342336,23.774766,22.852957,22.457657,25.678358,24.989769,23.174332,22.824596,23.442316,24.069699,22.259578,22.355627,25.79694,25.188644,22.32609,22.33478,23.68764,24.484797,22.140071,23.6889,26.630201,25.447193,21.355424,22.488905,24.295444,25.090013,232.580203,219.028231,359.760831,366.560134,249.217423,254.393799,254.393799,249.219228,226.761084,212.805504,370.933495,322.603312,250.036889,240.136791,258.198623,253.317067,208.293287,207.296141,429.400735,315.888515,245.152889,227.051376,252.597238,262.158582,200.013889,223.999296,368.618895,277.614152,234.163046,237.309351,263.499156,286.332415
min,-225.905,2.103,-25.876,2016.2,-0.603,0.005,0.005,17.96,-109.091,24.257,24.257,2.52,-16.304,3.027,3.027,2.3759,15.531,17.9041,4.0053,-0.251,0.0,0.058,2.8,-245.448,4.4666,0.0,0.196,6.11,0.131,-197.34,1.0,1.0,54.764109,-114.774119,-184.0,-46.09,1.0,1.0,0.0,34.0,0.0,0.0,32.0,0.0,-109.091,-109.091,-63.467,-63.467,-78.9142,-78.9142,-78.9142,-78.9142,-109.091,-109.091,-4.191,-8.929,-39.840429,-39.840429,-15.835,-15.835,-109.091,-109.091,-8.929,-8.929,-8.929,-8.929,-8.929,-8.929,-109.091,-109.091,-8.929,-109.091,-8.929,-109.091,-109.091,-109.091,-27.548,-27.548,-10.58,-16.304,-14.9082,-16.304,-16.304,-16.304,-27.548,-27.548,-16.304,-16.304,-16.304,-16.304,-16.304,-16.304,-27.548,-27.548,-16.304,-16.304,-16.304,-16.304,-16.304,-16.304,-27.548,-27.548,-16.304,-16.304,-16.304,-16.304,-16.304,-16.304
25%,159.93625,40.174,316.7185,2090.5,0.196,0.321,0.321,448.36,63.614,55.795,55.795,10.13,5.54,8.386,8.386,7.8529,53.91475,85.1245,20.330125,0.358,0.2425,0.2563,10.0,2032.18,4.4956,11.675,6.9411,18.915,0.297,-79.55,1.0,1.0,55.288691,-112.868349,19.5,306.5,1.0,1.0,6.0,375.0,82.5,30.5,195.0,90.0,56.511,57.508,70.495,69.88,64.0274,63.85775,63.85775,64.0104,54.148,55.705,72.63875,71.363,64.150821,63.949857,66.3698,66.9478,51.244,53.542,75.535,73.544,64.354545,64.072909,69.5,70.70385,47.346,51.683,80.146,76.79875,64.975524,65.187929,73.974,74.573,4.948,4.93,6.272,6.185,5.6746,5.606,5.606,5.6686,4.77,4.724,6.604,6.451,5.746857,5.646214,5.8986,5.9908,4.493,4.386,7.082,6.746,5.893818,5.689386,6.3176,6.4773,4.025,3.722,7.83,7.054,6.032905,5.566762,6.714,7.0444
50%,172.888,100.889,347.179,2134.3999,0.245,0.388,0.388,495.12,79.238,55.795,55.795,15.22,8.635,12.921,12.921,12.4003,65.643,166.5989,38.66015,0.402,0.284,0.2826,17.26,2202.25,8.7512,20.93,13.7857,41.75,0.318,-55.605,1.0,2.0,55.717502,-111.896661,39.5,440.5,2.0,1.0,17.28,487.426,158.25,58.0,220.0,180.0,71.391,71.797,86.224,85.237,78.924,78.6358,78.6358,78.9236,68.879,69.608,88.0775,86.803,78.647857,78.184714,81.0045,81.9834,65.444,66.937,90.894,89.133,78.280091,77.872455,84.351,85.5946,61.028,64.4,95.045,93.26,78.101286,78.845476,89.9814,89.844,7.572,7.576,10.106,9.858,8.8318,8.7188,8.7188,8.8212,7.22,7.237,10.771,10.328,8.965071,8.765571,9.2516,9.4528,6.792,6.711,11.922,10.921,9.224,8.769773,10.0,10.5102,6.102,5.837,13.568,11.37,9.528143,8.43781,10.626,11.8002
75%,210.3625,145.815,377.0245,2176.799925,0.29,0.388,0.388,546.08,94.482,64.411,64.411,28.17,14.914,12.921,12.921,30.43205,65.643,284.4281,66.676175,0.448,0.3164,0.3465,30.0,2293.61,44.64295,58.525,31.94705,63.6,0.3535,-31.01,1.0,2.0,56.226374,-111.137684,54.5,495.0,2.0,1.0,67.75,555.0,182.75,107.69725,239.997,271.0,86.667,86.75175,100.972,99.86875,93.6018,93.0996,93.0996,93.6064,83.87625,84.156,102.88825,101.264,93.051464,92.434225,95.341,96.3406,79.7,81.241,105.456,103.536,92.340636,91.880727,98.5358,99.84055,74.926,79.453,109.253,106.719,90.894333,91.927,103.032,103.4012,12.48,12.468,18.061,17.518,15.2396,14.9518,14.9518,15.22,11.726,11.861,19.602,18.452,15.441143,15.065964,16.0888,16.5432,10.841,10.985,22.096,19.80475,15.876636,15.068,17.70315,18.837,9.571,9.937,25.963,20.65,16.279857,14.618571,19.0358,21.9952
max,1417.29,434.515,657.774,2632.8,45.0,0.394,0.394,926.44,506.788,101.147,101.147,1254.4999,99960.8281,301.379,301.379,2262.7151,84.234,1774.2136,1091.0129,58.05,0.5365,0.6112,2323.78,3004.72,1720.2471,1579.9999,2465.0735,173.15,0.459,303.437,4.0,3.0,57.807827,-110.008902,137.5,948.07,3.0,4.0,1517.51,2149.67,860.4,853.018,860.4,1706.0,483.785,506.788,1127.0,506.788,516.9468,506.788,506.788,506.788,483.785,506.788,1127.0,506.788,529.336714,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,99960.8281,99960.8281,100000.0,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,100000.0,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281


### Two lists of columns to not use as training features

Columns taken out as they either contain information probably captures in other columns, are related to labels too closely, or other reasons.

In [38]:
takeOutColList = [
    'FromBotWell',
    'FromTopWel'
    'MM_Top_Depth_predBy_NN1thick',
    'rowsToEdge',
     'McMurray_Top_Qual',
     'lat',
     'lng',    
]

Columns taken out as they aren't present often enough in the well dataset

In [39]:
training_feats_w_lowCount = ['RHOB','SP','CALI','COND','DELT','DENS','DPHI:1','DPHI:2','DT','GR:1','GR:2','IL','ILD:1','ILD:2','ILM','LITH','LLD','LLS','PHID','PHIN','RESD','RT','SFL','SFLU','SN','SNP','Sp']

Next few lines to combine the two lists above and take those columns out of dataframe

In [40]:
train_feats_minusLowCount = [x for x in train_feat_bigList if x not in training_feats_w_lowCount]

In [41]:
train_feats_minusLowCount = [x for x in train_feats_minusLowCount if x not in takeOutColList]

In [42]:
df_train_featWithHighCount = df_train_feat[train_feats_minusLowCount]

Number of columns for training

In [43]:
len(train_feats_minusLowCount)

79

In [44]:
df_train_featWithHighCount.describe()

Unnamed: 0,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
count,279301.0,307214.0,305061.0,289830.0,307648.0,306894.0,306894.0,307648.0,307648.0,306894.0,307648.0,307648.0,307648.0,307211.0,307214.0,307211.0,307214.0,307211.0,307214.0,307214.0,307213.0,307212.0,307214.0,307212.0,307214.0,307212.0,307214.0,307214.0,307213.0,307213.0,307213.0,307213.0,307213.0,307213.0,307213.0,307213.0,307214.0,307213.0,307212.0,307213.0,307212.0,307213.0,307212.0,307212.0,307213.0,305061.0,305059.0,305061.0,305059.0,305061.0,305059.0,305059.0,305061.0,305060.0,305058.0,305060.0,305058.0,305060.0,305058.0,305058.0,305060.0,305059.0,305058.0,305059.0,305058.0,305059.0,305058.0,305058.0,305059.0,305057.0,305057.0,305057.0,305057.0,305057.0,305057.0,305057.0,305058.0
mean,0.251156,79.313089,23.539498,0.416785,1.333326,38.724685,403.431947,1.804049,1.333326,53.485277,470.89904,138.6051,220.857556,71.573041,71.984871,86.120706,84.991337,78.972551,78.517582,78.517582,78.97372,69.013426,69.910424,88.06509,86.434257,78.734864,78.087746,80.821154,81.849248,65.58226,67.632727,90.750929,88.581815,78.32995,77.784944,83.872589,85.310916,61.739544,66.41741,94.620626,91.347238,78.004253,78.507993,88.076699,88.620342,18.918723,17.821049,30.749279,29.813357,23.988587,23.612654,23.612654,23.97185,17.4273,16.606689,34.325458,31.405646,24.307483,23.335367,25.728592,26.420769,14.762723,15.276471,39.864485,33.327836,24.647141,22.71106,28.084316,30.63066,12.963767,15.90267,47.33765,35.854583,25.116095,23.140083,31.582544,36.04866
std,0.539236,24.889367,245.88139,0.740066,0.770345,25.972044,164.765071,0.791408,0.770345,101.181928,195.932171,78.213462,90.783465,23.347048,22.829652,25.581937,24.977939,23.731531,23.342336,23.342336,23.774766,22.852957,22.457657,25.678358,24.989769,23.174332,22.824596,23.442316,24.069699,22.259578,22.355627,25.79694,25.188644,22.32609,22.33478,23.68764,24.484797,22.140071,23.6889,26.630201,25.447193,21.355424,22.488905,24.295444,25.090013,232.580203,219.028231,359.760831,366.560134,249.217423,254.393799,254.393799,249.219228,226.761084,212.805504,370.933495,322.603312,250.036889,240.136791,258.198623,253.317067,208.293287,207.296141,429.400735,315.888515,245.152889,227.051376,252.597238,262.158582,200.013889,223.999296,368.618895,277.614152,234.163046,237.309351,263.499156,286.332415
min,-0.603,-109.091,-16.304,-0.251,1.0,-184.0,-46.09,1.0,1.0,0.0,34.0,0.0,32.0,-109.091,-109.091,-63.467,-63.467,-78.9142,-78.9142,-78.9142,-78.9142,-109.091,-109.091,-4.191,-8.929,-39.840429,-39.840429,-15.835,-15.835,-109.091,-109.091,-8.929,-8.929,-8.929,-8.929,-8.929,-8.929,-109.091,-109.091,-8.929,-109.091,-8.929,-109.091,-109.091,-109.091,-27.548,-27.548,-10.58,-16.304,-14.9082,-16.304,-16.304,-16.304,-27.548,-27.548,-16.304,-16.304,-16.304,-16.304,-16.304,-16.304,-27.548,-27.548,-16.304,-16.304,-16.304,-16.304,-16.304,-16.304,-27.548,-27.548,-16.304,-16.304,-16.304,-16.304,-16.304,-16.304
25%,0.196,63.614,5.54,0.358,1.0,19.5,306.5,1.0,1.0,6.0,375.0,82.5,195.0,56.511,57.508,70.495,69.88,64.0274,63.85775,63.85775,64.0104,54.148,55.705,72.63875,71.363,64.150821,63.949857,66.3698,66.9478,51.244,53.542,75.535,73.544,64.354545,64.072909,69.5,70.70385,47.346,51.683,80.146,76.79875,64.975524,65.187929,73.974,74.573,4.948,4.93,6.272,6.185,5.6746,5.606,5.606,5.6686,4.77,4.724,6.604,6.451,5.746857,5.646214,5.8986,5.9908,4.493,4.386,7.082,6.746,5.893818,5.689386,6.3176,6.4773,4.025,3.722,7.83,7.054,6.032905,5.566762,6.714,7.0444
50%,0.245,79.238,8.635,0.402,1.0,39.5,440.5,2.0,1.0,17.28,487.426,158.25,220.0,71.391,71.797,86.224,85.237,78.924,78.6358,78.6358,78.9236,68.879,69.608,88.0775,86.803,78.647857,78.184714,81.0045,81.9834,65.444,66.937,90.894,89.133,78.280091,77.872455,84.351,85.5946,61.028,64.4,95.045,93.26,78.101286,78.845476,89.9814,89.844,7.572,7.576,10.106,9.858,8.8318,8.7188,8.7188,8.8212,7.22,7.237,10.771,10.328,8.965071,8.765571,9.2516,9.4528,6.792,6.711,11.922,10.921,9.224,8.769773,10.0,10.5102,6.102,5.837,13.568,11.37,9.528143,8.43781,10.626,11.8002
75%,0.29,94.482,14.914,0.448,1.0,54.5,495.0,2.0,1.0,67.75,555.0,182.75,239.997,86.667,86.75175,100.972,99.86875,93.6018,93.0996,93.0996,93.6064,83.87625,84.156,102.88825,101.264,93.051464,92.434225,95.341,96.3406,79.7,81.241,105.456,103.536,92.340636,91.880727,98.5358,99.84055,74.926,79.453,109.253,106.719,90.894333,91.927,103.032,103.4012,12.48,12.468,18.061,17.518,15.2396,14.9518,14.9518,15.22,11.726,11.861,19.602,18.452,15.441143,15.065964,16.0888,16.5432,10.841,10.985,22.096,19.80475,15.876636,15.068,17.70315,18.837,9.571,9.937,25.963,20.65,16.279857,14.618571,19.0358,21.9952
max,45.0,506.788,99960.8281,58.05,4.0,137.5,948.07,3.0,4.0,1517.51,2149.67,860.4,860.4,483.785,506.788,1127.0,506.788,516.9468,506.788,506.788,506.788,483.785,506.788,1127.0,506.788,529.336714,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,506.788,99960.8281,99960.8281,100000.0,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,100000.0,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281


In [45]:
used_features = list(df_train_featWithHighCount.columns)

In [46]:
used_features

['UWI',
 'trainOrTest',
 'DPHI',
 'GR',
 'ILD',
 'NPHI',
 'McMurray_Base_Qual',
 'NN1_thickness',
 'MM_Top_Depth_predBy_NN1thick',
 'Quality',
 'Quality_paleoz',
 'DistFrom_NN1_TopDepth_Abs',
 'BotWellDept',
 'FromTopWell',
 'WellThickness',
 'GR_min_5winSize_dirAroundMin',
 'GR_min_5winSize_dirAboveMin',
 'GR_min_5winSize_dirAroundMax',
 'GR_min_5winSize_dirAboveMax',
 'GR_min_5winSize_dirAroundMean',
 'GR_min_5winSize_dirAboveMean',
 'GR_min_5winSize_dirAbovenLarge',
 'GR_min_5winSize_dirAroundnLarge',
 'GR_min_7winSize_dirAroundMin',
 'GR_min_7winSize_dirAboveMin',
 'GR_min_7winSize_dirAroundMax',
 'GR_min_7winSize_dirAboveMax',
 'GR_min_7winSize_dirAroundMean',
 'GR_min_7winSize_dirAboveMean',
 'GR_min_7winSize_dirAbovenLarge',
 'GR_min_7winSize_dirAroundnLarge',
 'GR_min_11winSize_dirAroundMin',
 'GR_min_11winSize_dirAboveMin',
 'GR_min_11winSize_dirAroundMax',
 'GR_min_11winSize_dirAboveMax',
 'GR_min_11winSize_dirAroundMean',
 'GR_min_11winSize_dirAboveMean',
 'GR_min_11winSize_

-----------------

## Identify which columns to use as labels<a name="identifyLabelCol"></a>

#### The column 'cat_isTopMcMrNearby_known' is what we'll use as labels.
- 100 = exactly the Top McMurray Pick
- 95 if the distance between that depth and the Top McMurray Pick is -0.5 < x and x <0.5
- 60 if the distance between that depth and the Top McMurray Pick is -5 < x and x < 5
- 0 = not near the Top McMurray Pick

The function used to make these classes or lables as column was:
`df_all_wells_wKNN_DEPTHtoDEPT['cat_isTopMcMrNearby_known']=df_all_wells_wKNN_DEPTHtoDEPT['diff_TMcM_Pick_v_DEPT'].apply(lambda x: 100 if x==0 else ( 95 if (-0.5 < x and x <0.5) else 60 if (-5 < x and x <5) else 0))`

In [47]:
df_all_Col_preSplit_wTrainTest_ClassBalanced['cat_isTopMcMrNearby_known'].unique()

array([  0,  60,  95, 100])

In [48]:
labels = df_all_Col_preSplit_wTrainTest_ClassBalanced[['cat_isTopMcMrNearby_known','UWI','trainOrTest']]

In [49]:
labels.head()

Unnamed: 0,cat_isTopMcMrNearby_known,UWI,trainOrTest
0,0,00/10-32-080-20W4/0,test
1,0,00/10-32-080-20W4/0,test
2,0,00/10-32-080-20W4/0,test
3,0,00/10-32-080-20W4/0,test
4,0,00/10-32-080-20W4/0,test


In [50]:
labels.tail()

Unnamed: 0,cat_isTopMcMrNearby_known,UWI,trainOrTest
307643,95,00/11-18-079-03W5/0,test
307644,95,00/11-18-079-03W5/0,test
307645,95,00/11-18-079-03W5/0,test
307646,95,00/10-35-081-15W4/0,train
307647,95,00/10-35-081-15W4/0,train


In [51]:
len(labels)

307648

The lengths of training dataframes and labels dataframes should be the same. We'll take out UWI and trainOrTest further down.

-----------------

## Now separate into 4 dataframes = <a name="splitDataframe"></a>
### train_labels
### train_feat 
### test_labels
### test_feat
Then take off UWI and TrainTest col

### Create label dataframes

In [52]:
#### split based on train in trainOrTest col
labels_train = labels[labels['trainOrTest'] == 'train' ]
#### Keep only the 'cat_isTopMcMrNearby_known' column, so now it is just a series of labels
labels_train = labels_train['cat_isTopMcMrNearby_known']
#### split based on test in trainOrTest col
labels_test = labels[labels['trainOrTest'] == 'test' ]
#### Keep only the 'cat_isTopMcMrNearby_known' column, so now it is just a series of labels
labels_test = labels_test['cat_isTopMcMrNearby_known']

### Create training dataframes

In [53]:
#### split based on train in trainOrTest col and drop UWI and TrainOrTest columns
df_train_featWithHighCount_train = df_train_featWithHighCount[df_train_featWithHighCount['trainOrTest'] == 'train' ].drop(columns=['UWI', 'trainOrTest'])
#### split based on test in trainOrTest col and drop UWI and TrainOrTest columns
df_train_featWithHighCount_test = df_train_featWithHighCount[df_train_featWithHighCount['trainOrTest'] == 'test' ].drop(columns=['UWI', 'trainOrTest'])

### Rename to avoid overwriting & keep with previous work

In [54]:
train_X = df_train_featWithHighCount_train
train_y = labels_train
test_X = df_train_featWithHighCount_test
test_y = labels_test

### Inspect to make sure column headers and lengths make sense

In [55]:
print(len(train_X))
train_X.head()

246700


Unnamed: 0,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
390,0.185,101.752,3.723,0.537,2,23.78,421.84,1,2,183.096,445.994,1.0,208.25,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723
391,0.212,100.657,2.95,0.516,2,23.78,421.84,1,2,180.596,445.994,3.5,208.25,100.349,100.657,104.476,100.657,101.5134,100.657,100.657,100.657,100.349,100.657,106.802,100.657,102.304429,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,2.95,2.95,3.254,2.95,3.1066,2.95,2.95,2.95,2.95,2.95,3.414,2.95,3.194286,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95
392,0.175,100.744,3.409,0.532,2,23.78,421.84,1,2,178.096,445.994,6.0,208.25,99.221,100.744,106.397,106.397,102.582,104.5656,104.5656,102.582,99.221,100.744,106.397,100.744,103.294,100.744,100.744,100.744,99.221,100.744,106.729,100.744,103.907273,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,3.299,3.299,3.558,3.493,3.423,3.3906,3.3906,3.423,3.299,3.409,3.632,3.409,3.449143,3.409,3.409,3.409,3.299,3.409,3.632,3.409,3.478455,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409
393,0.265,91.018,4.864,0.489,2,23.78,421.84,1,2,175.596,445.994,8.5,208.25,67.81,91.018,102.635,102.635,88.5874,98.7966,98.7966,88.5874,58.59,91.018,102.635,105.471,85.816714,100.284,102.356,94.8634,53.847,91.018,102.635,91.018,83.585545,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,3.541,3.298,7.441,4.864,5.2084,3.8452,3.8452,5.2084,3.298,3.298,9.898,4.864,5.605429,3.758,3.934,6.4798,3.298,4.864,10.327,4.864,6.026455,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864
394,0.298,74.735,7.736,0.426,2,23.78,421.84,1,2,173.096,445.994,11.0,208.25,71.051,69.128,74.946,74.946,73.5338,73.189,73.189,73.5338,70.149,53.847,74.946,74.946,72.959,68.991429,73.189,73.9026,63.148,74.735,74.946,74.735,71.501545,74.735,74.735,74.735,53.847,74.735,97.36,74.735,75.98419,74.735,74.735,74.735,7.182,7.736,8.256,9.347,7.6842,8.4198,8.4198,7.6842,7.111,7.736,8.756,10.327,7.755429,8.909429,9.3252,7.999,7.11,7.736,9.94,7.736,7.981636,7.736,7.736,7.736,4.864,7.736,10.327,7.736,7.779095,7.736,7.736,7.736


In [56]:
print(len(train_y))
train_y.head()

246700


390    0
391    0
392    0
393    0
394    0
Name: cat_isTopMcMrNearby_known, dtype: int64

In [57]:
print(len(test_X))
test_X.head()

60948


Unnamed: 0,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
0,0.227,102.473,0.0,0.46,1,25.0,359.66,3,1,210.058,396.102,0.0,246.5,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.269,26.625,30.179,0.355,1,25.0,359.66,3,1,207.558,396.102,2.5,246.5,25.825,26.625,50.213,26.625,32.768,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,20.262,30.179,30.37,30.179,27.1088,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179
2,0.339,31.562,21.793,0.428,1,25.0,359.66,3,1,205.058,396.102,5.0,246.5,23.605,31.562,49.258,31.562,34.0164,31.562,31.562,31.562,22.7,31.562,60.528,31.562,36.187143,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,16.583,21.793,25.975,21.793,21.5708,21.793,21.793,21.793,14.586,21.793,26.774,21.793,21.316286,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793
3,0.291,51.257,7.449,0.452,1,25.0,359.66,3,1,202.558,396.102,7.5,246.5,40.739,37.621,72.481,51.257,53.8586,43.6546,43.6546,53.8586,37.621,37.621,87.074,60.965,56.284,47.374143,50.6518,63.1256,37.621,51.257,88.401,51.257,59.979636,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,5.151,7.449,13.402,13.402,8.5374,11.2332,11.2332,8.5374,4.945,7.449,13.402,13.402,8.707571,10.551143,11.5662,10.1714,4.945,7.449,13.402,7.449,8.569,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449
4,0.275,24.048,28.931,0.384,1,25.0,359.66,3,1,200.058,396.102,10.0,246.5,24.048,24.048,38.122,71.567,28.8192,43.1398,43.1398,28.8192,24.048,24.048,54.197,88.401,33.679571,55.188,66.9006,37.4412,24.048,24.048,82.216,24.048,43.925455,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,17.23,9.333,28.931,28.931,23.7516,18.6082,18.6082,23.7516,12.412,6.043,28.931,28.931,21.132,15.137571,18.6082,23.7516,6.879,28.931,28.931,28.931,17.155545,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931


In [58]:
print(len(test_y))
test_y.head()

60948


0    0
1    0
2    0
3    0
4    0
Name: cat_isTopMcMrNearby_known, dtype: int64

-------------------

## Save the dataframes as a dict
- train_X 
- train_y 
- test_X
- test_y
- & full

### Write pandas dataframes to HDF5

In [59]:
# Write hdf5 to current directory
# df = df_all_Col_preSplit_wTrainTest_ClassBalanced
# key = preSplit
df_all_Col_preSplit_wTrainTest.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML.h5', key='preSplitpreBal', mode='w')

In [60]:
# Write hdf5 to current directory
# df = df_all_Col_preSplit_wTrainTest_ClassBalanced
# key = preSplit

train_X.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML.h5', key='train_X')

In [61]:
# Write hdf5 to current directory
# df = train_y
# key = train_y

train_y.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML.h5', key='train_y')

In [62]:
# Write hdf5 to current directory
# df = df_all_Col_preSplit_wTrainTest_ClassBalanced
# key = test_X

test_X.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML.h5', key='test_X')

In [63]:
# Write hdf5 to current directory
# df = df_all_Col_preSplit_wTrainTest_ClassBalanced
# key = test_y

test_y.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML.h5', key='test_y')

---------------------

# Machine-learning<a name=machineLearningNoDask></a>

In [1]:
seed = 123

In [None]:
# .values.ravel()
model = XGBClassifier(
    gamma=0, 
    reg_alpha=0.2, 
    max_depth=3, 
    subsample=0.8, 
    colsample_bytree= 0.8, 
    n_estimators= 300, 
    learning_rate= 0.03, 
    min_child_weight= 3,n_jobs=8)
model.fit(train_X,train_y)


In [None]:
result = model.predict(test_X)
result

In [75]:
type(result)

numpy.ndarray

In [76]:
len(result)

61662

In [77]:
test_y_indexValues = test_y.index.values
df_result = pd.DataFrame(result, index=test_y_indexValues, columns=['TopMcMr_Pick_pred'])
df_results_2 = pd.concat([test_y, df_result], axis=1)

In [78]:
df_results_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61662 entries, 390 to 307637
Data columns (total 2 columns):
cat_isTopMcMrNearby_known    61662 non-null int64
TopMcMr_Pick_pred            61662 non-null int64
dtypes: int64(2)
memory usage: 1.4 MB


In [79]:
df_results_2.head()

Unnamed: 0,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred
390,0,0
391,0,0
392,0,0
393,0,0
394,0,0


In [80]:
# test_df_pred = test_y.copy()
# test_df_pred['Pick_pred'] = result
# test_df_pred.head()

# Examination of first-level results<a name="ml_evaluation"></a>

In [81]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# make predictions for test data
# y_pred = model.predict(X_test)
# predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(df_results_2['cat_isTopMcMrNearby_known'], df_results_2['TopMcMr_Pick_pred'])

#### Results of accuracy prediction where only exact label matches count on a row by row basis, so 60=60,100=100

In [82]:
accuracy

0.63246407836268692

#### Making another dataframe to make rows that lump in classes to combare to other groups of prediction classes

In [83]:
df_results_3 = df_results_2.copy()

In [84]:
df_results_3[0:500]

Unnamed: 0,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred
390,0,0
391,0,0
392,0,0
393,0,0
394,0,0
395,0,0
396,0,0
397,0,0
398,0,0
399,0,0


In [85]:
df_results_3['cat_isTopMcMrNearby_known_95or100'] = np.where(df_results_3['cat_isTopMcMrNearby_known']>60, 1, 0)
df_results_3['TopMcMr_Pick_pred_95or100'] = np.where(df_results_3['TopMcMr_Pick_pred']>60, 1, 0)

In [86]:
#### inspect
df_results_3[0:300]

Unnamed: 0,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100
390,0,0,0,0
391,0,0,0,0
392,0,0,0,0
393,0,0,0,0
394,0,0,0,0
395,0,0,0,0
396,0,0,0,0
397,0,0,0,0
398,0,0,0,0
399,0,0,0,0


#### accuracy if looking at only the labels for 95 and 100 in both known and prediction

In [87]:
accuracy = accuracy_score(df_results_3['cat_isTopMcMrNearby_known_95or100'], df_results_3['TopMcMr_Pick_pred_95or100'])
accuracy

0.76745807790859844

Create more columns for lumped labels

In [88]:
df_results_3['cat_isTopMcMrNearby_known_60or95or100'] = np.where(df_results_3['cat_isTopMcMrNearby_known']>59, 1, 0)
df_results_3['TopMcMr_Pick_pred_60or95or100'] = np.where(df_results_3['TopMcMr_Pick_pred']>59, 1, 0)
df_results_3['cat_isTopMcMrNearby_known_100'] = np.where(df_results_3['cat_isTopMcMrNearby_known']==100, 1, 0)
df_results_3['TopMcMr_Pick_pred_known_100'] = np.where(df_results_3['TopMcMr_Pick_pred']==100, 1, 0)

In [89]:
accuracy = accuracy_score(df_results_3['cat_isTopMcMrNearby_known_60or95or100'], df_results_3['TopMcMr_Pick_pred_60or95or100'])
accuracy

0.87035775680321759

In [90]:
accuracy = accuracy_score(df_results_3['cat_isTopMcMrNearby_known_100'], df_results_3['TopMcMr_Pick_pred_60or95or100'])
accuracy

0.62803671629204372

In [91]:
#### inspecting results manually
df_results_3[7000:9000]

Unnamed: 0,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100,cat_isTopMcMrNearby_known_60or95or100,TopMcMr_Pick_pred_60or95or100,cat_isTopMcMrNearby_known_100,TopMcMr_Pick_pred_known_100
39398,60,100,0,1,1,1,0,1
39399,60,100,0,1,1,1,0,1
39400,60,100,0,1,1,1,0,1
39401,0,60,0,0,0,1,0,0
39402,0,0,0,0,0,0,0,0
39403,0,0,0,0,0,0,0,0
39404,0,0,0,0,0,0,0,0
39405,0,0,0,0,0,0,0,0
39406,0,0,0,0,0,0,0,0
39407,0,0,0,0,0,0,0,0


In [92]:
len(df_results_3)

61662

In [93]:
df_all_Col_preSplit_wTrainTest_ClassBalanced.head()

Unnamed: 0,CALI,COND,DELT,DENS,DEPT,DEPTH,DPHI,DPHI:1,DPHI:2,DT,GR,GR:1,GR:2,IL,ILD,ILD:1,ILD:2,ILM,LITH,LLD,LLS,NPHI,PHID,PHIN,RESD,RHOB,RT,SFL,SFLU,SN,SNP,SP,UWI,SitID,McMurray_Base_HorID,McMurray_Top_HorID,McMurray_Base_DEPTH,McMurray_Top_DEPTH,McMurray_Base_Qual,McMurray_Top_Qual,lat,lng,NN1_McMurray_Top_DEPTH,NN1_McMurray_Base_DEPTH,NN1_thickness,MM_Top_Depth_predBy_NN1thick,HorID,Pick,Quality,HorID_paleoz,Pick_paleoz,Quality_paleoz,diff_TMcM_Pick_v_DEPT,diff_TPal_Pick_v_DEPT,cat_isTopMcMrNearby_known,cat_isTopPalNearby_known,DistFrom_NN1_TopDepth_Abs,NewWell,LastBitWell,TopWellDept,BotWellDept,FromTopWell,FromBotWell,WellThickness,closerToBotOrTop,closTopBotDist,rowsToEdge,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,trainOrTest
0,167.003,,,,149.602,,0.227,,,,102.473,,,,0.0,,,,,,,0.46,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,228.348,235.058,0,0,210.058,True,False,149.602,396.102,0.0,246.5,246.5,FromTopWell,0.0,0,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,train
1,166.675,,,,152.102,,0.269,,,,26.625,,,,30.179,,,,,,,0.355,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,225.848,232.558,0,0,207.558,False,False,149.602,396.102,2.5,244.0,246.5,FromTopWell,2.5,10,25.825,26.625,50.213,26.625,32.768,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,20.262,30.179,30.37,30.179,27.1088,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,train
2,211.701,,,,154.602,,0.339,,,,31.562,,,,21.793,,,,,,,0.428,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,223.348,230.058,0,0,205.058,False,False,149.602,396.102,5.0,241.5,246.5,FromTopWell,5.0,20,23.605,31.562,49.258,31.562,34.0164,31.562,31.562,31.562,22.7,31.562,60.528,31.562,36.187143,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,16.583,21.793,25.975,21.793,21.5708,21.793,21.793,21.793,14.586,21.793,26.774,21.793,21.316286,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,train
3,188.132,,,,157.102,,0.291,,,,51.257,,,,7.449,,,,,,,0.452,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,220.848,227.558,0,0,202.558,False,False,149.602,396.102,7.5,239.0,246.5,FromTopWell,7.5,30,40.739,37.621,72.481,51.257,53.8586,43.6546,43.6546,53.8586,37.621,37.621,87.074,60.965,56.284,47.374143,50.6518,63.1256,37.621,51.257,88.401,51.257,59.979636,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,5.151,7.449,13.402,13.402,8.5374,11.2332,11.2332,8.5374,4.945,7.449,13.402,13.402,8.707571,10.551143,11.5662,10.1714,4.945,7.449,13.402,7.449,8.569,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,train
4,165.135,,,,159.602,,0.275,,,,24.048,,,,28.931,,,,,,,0.384,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,218.348,225.058,0,0,200.058,False,False,149.602,396.102,10.0,236.5,246.5,FromTopWell,10.0,40,24.048,24.048,38.122,71.567,28.8192,43.1398,43.1398,28.8192,24.048,24.048,54.197,88.401,33.679571,55.188,66.9006,37.4412,24.048,24.048,82.216,24.048,43.925455,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,17.23,9.333,28.931,28.931,23.7516,18.6082,18.6082,23.7516,12.412,6.043,28.931,28.931,21.132,15.137571,18.6082,23.7516,6.879,28.931,28.931,28.931,17.155545,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,train


In [94]:
len(df_all_Col_preSplit_wTrainTest_ClassBalanced)

307648

In [95]:
df_all_Col_preSplit_wTrainTest_ClassBalanced_Copy = np.where(df_all_Col_preSplit_wTrainTest_ClassBalanced['trainOrTest'] == 'test')

In [96]:
len(df_all_Col_preSplit_wTrainTest_ClassBalanced_Copy)

1

In [98]:
predictedPickIsExactlyHere = df_results_3[df_results_3['TopMcMr_Pick_pred_known_100'] == 1]
test100 = predictedPickIsExactlyHere['TopMcMr_Pick_pred_known_100']

In [99]:
type(test100)

pandas.core.series.Series

In [100]:
test100.values

array([1, 1, 1, ..., 1, 1, 1])

### More evaluation

In [102]:
df_featPlus_wUWI_testCopy = df_train_featWithHighCount[df_train_featWithHighCount['trainOrTest'] == 'test' ].copy()

In [103]:
df_featPlus_wUWI_testCopy.head()

Unnamed: 0,UWI,trainOrTest,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
390,00/11-19-073-16W4/0,test,0.185,101.752,3.723,0.537,2,23.78,421.84,1,2,183.096,445.994,1.0,208.25,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723
391,00/11-19-073-16W4/0,test,0.212,100.657,2.95,0.516,2,23.78,421.84,1,2,180.596,445.994,3.5,208.25,100.349,100.657,104.476,100.657,101.5134,100.657,100.657,100.657,100.349,100.657,106.802,100.657,102.304429,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,2.95,2.95,3.254,2.95,3.1066,2.95,2.95,2.95,2.95,2.95,3.414,2.95,3.194286,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95
392,00/11-19-073-16W4/0,test,0.175,100.744,3.409,0.532,2,23.78,421.84,1,2,178.096,445.994,6.0,208.25,99.221,100.744,106.397,106.397,102.582,104.5656,104.5656,102.582,99.221,100.744,106.397,100.744,103.294,100.744,100.744,100.744,99.221,100.744,106.729,100.744,103.907273,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,3.299,3.299,3.558,3.493,3.423,3.3906,3.3906,3.423,3.299,3.409,3.632,3.409,3.449143,3.409,3.409,3.409,3.299,3.409,3.632,3.409,3.478455,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409
393,00/11-19-073-16W4/0,test,0.265,91.018,4.864,0.489,2,23.78,421.84,1,2,175.596,445.994,8.5,208.25,67.81,91.018,102.635,102.635,88.5874,98.7966,98.7966,88.5874,58.59,91.018,102.635,105.471,85.816714,100.284,102.356,94.8634,53.847,91.018,102.635,91.018,83.585545,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,3.541,3.298,7.441,4.864,5.2084,3.8452,3.8452,5.2084,3.298,3.298,9.898,4.864,5.605429,3.758,3.934,6.4798,3.298,4.864,10.327,4.864,6.026455,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864
394,00/11-19-073-16W4/0,test,0.298,74.735,7.736,0.426,2,23.78,421.84,1,2,173.096,445.994,11.0,208.25,71.051,69.128,74.946,74.946,73.5338,73.189,73.189,73.5338,70.149,53.847,74.946,74.946,72.959,68.991429,73.189,73.9026,63.148,74.735,74.946,74.735,71.501545,74.735,74.735,74.735,53.847,74.735,97.36,74.735,75.98419,74.735,74.735,74.735,7.182,7.736,8.256,9.347,7.6842,8.4198,8.4198,7.6842,7.111,7.736,8.756,10.327,7.755429,8.909429,9.3252,7.999,7.11,7.736,9.94,7.736,7.981636,7.736,7.736,7.736,4.864,7.736,10.327,7.736,7.779095,7.736,7.736,7.736


In [115]:
len(df_featPlus_wUWI_testCopy)

61662

In [104]:
df_featPlus_wUWI_testCopy_wResults = pd.concat([df_featPlus_wUWI_testCopy, df_results_3], axis=1)
df_featPlus_wUWI_testCopy_wResults.tail()

Unnamed: 0,UWI,trainOrTest,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100,cat_isTopMcMrNearby_known_60or95or100,TopMcMr_Pick_pred_60or95or100,cat_isTopMcMrNearby_known_100,TopMcMr_Pick_pred_known_100
307633,00/16-29-073-05W5/0,test,0.132,69.814,7.565,0.377,1,3.0,612.0,1,1,20.25,595.0,231.75,235.0,58.985,69.814,79.901,69.814,69.5914,69.814,69.814,69.814,58.985,69.814,80.779,69.814,70.854714,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,4.601,7.565,9.848,7.565,7.386,7.565,7.565,7.565,3.528,7.565,9.848,7.565,7.104429,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,95,0,1,0,1,0,0,0
307634,00/06-26-075-21W4/0,test,0.234,52.644,18.299,0.312,3,17.07,562.66,3,3,13.032,594.442,182.5,201.25,39.106,39.106,94.659,101.032,62.0152,65.09,65.09,62.0152,39.106,39.106,94.659,124.887,69.691286,80.616286,95.372,80.077,39.106,39.106,113.977,146.483,77.909727,101.844273,136.172,97.4862,39.106,52.644,146.483,52.644,93.937857,52.644,52.644,52.644,16.342,13.736,18.642,18.642,17.5814,16.528,16.528,17.5814,14.874,10.919,18.642,18.642,16.854857,15.112,16.528,17.5814,12.225,7.541,18.642,18.642,15.585273,12.802636,16.528,17.5814,7.541,18.299,18.642,18.299,13.319762,18.299,18.299,18.299,95,60,1,0,1,1,0,0
307635,00/06-26-075-21W4/0,test,0.211,75.319,17.535,0.336,3,17.07,562.66,3,3,13.282,594.442,182.75,201.25,39.106,39.106,94.659,84.32,71.0342,59.9474,59.9474,71.0342,39.106,39.106,94.659,113.977,69.303857,73.535143,85.4584,79.5346,39.106,39.106,101.032,142.116,74.123182,95.374818,129.6708,91.0124,39.106,75.319,142.116,75.319,91.498381,75.319,75.319,75.319,15.203,14.874,18.642,18.642,17.2042,17.2878,17.2878,17.2042,14.116,12.225,18.642,18.642,16.746571,16.057143,17.2878,17.5814,12.795,8.334,18.642,18.642,15.637091,13.711182,17.2878,17.5814,8.334,17.535,18.642,17.535,13.511762,17.535,17.535,17.535,95,60,1,0,1,1,0,0
307636,00/06-26-075-21W4/0,test,0.183,94.659,16.342,0.361,3,17.07,562.66,3,3,13.532,594.442,183.0,201.25,52.644,39.106,94.659,94.659,79.5346,62.0152,62.0152,79.5346,39.106,39.106,94.659,101.032,72.761429,70.775429,81.5948,83.516,39.106,39.106,94.659,136.765,71.935727,91.060545,121.454,86.2,39.106,94.659,136.765,94.659,89.694,94.659,94.659,94.659,14.116,16.342,18.299,18.642,16.299,17.5814,17.5814,16.299,13.377,13.736,18.642,18.642,16.216286,16.645286,17.5814,17.2042,12.513,9.142,18.642,18.642,15.525909,14.439182,17.5814,17.5814,9.142,16.342,18.642,16.342,13.624762,16.342,16.342,16.342,95,60,1,0,1,1,0,0
307637,00/06-26-075-21W4/0,test,0.206,93.443,15.203,0.391,3,17.07,562.66,3,3,13.782,594.442,183.25,201.25,72.551,39.106,94.659,94.659,83.516,71.0342,71.0342,83.516,52.644,39.106,94.659,94.659,77.507,69.691286,80.077,83.516,39.106,39.106,94.659,130.609,72.447909,87.122182,113.0328,87.3268,39.106,93.443,130.609,93.443,88.627381,93.443,93.443,93.443,13.377,15.203,17.535,18.642,15.3146,17.2042,17.2042,15.3146,12.795,14.874,18.299,18.642,15.381,16.854857,17.5814,16.299,12.351,10.028,18.642,18.642,15.296545,14.990182,17.5814,17.5814,10.028,15.203,18.642,15.203,13.672571,15.203,15.203,15.203,95,60,1,0,1,1,0,0


In [116]:
len(df_featPlus_wUWI_testCopy_wResults)

61662

In [105]:
wells_in_test = df_featPlus_wUWI_testCopy_wResults['UWI'].unique()
len(wells_in_test)

382

limt new dataframe to rows that are less than 1 from actual pick

In [107]:
df_look_at_pred_class_vs_distFromRealLess1 = df_featPlus_wUWI_testCopy_wResults[df_featPlus_wUWI_testCopy_wResults['DistFrom_NN1_TopDepth_Abs'] < 1 ]

In [108]:
df_look_at_pred_class_vs_distFromRealLess1['cat_isTopMcMrNearby_known'].nunique()

4

groupy label and get counts as dataframe using nunique

In [118]:
df_count = df_look_at_pred_class_vs_distFromRealLess1.groupby('TopMcMr_Pick_pred').nunique()

In [119]:
df_count

Unnamed: 0_level_0,UWI,trainOrTest,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100,cat_isTopMcMrNearby_known_60or95or100,TopMcMr_Pick_pred_60or95or100,cat_isTopMcMrNearby_known_100,TopMcMr_Pick_pred_known_100
TopMcMr_Pick_pred,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1
0,17,1,21,22,22,22,3,17,17,3,3,12,17,21,14,18,19,20,22,22,22,22,22,17,20,21,20,22,22,22,22,19,20,20,22,22,22,22,21,20,20,22,20,22,22,21,22,20,17,22,22,22,22,22,22,18,17,22,22,22,22,22,22,17,22,22,22,22,22,22,22,22,22,21,22,22,22,22,22,2,1,1,1,2,1,1,1
60,212,1,260,620,610,247,4,163,202,3,4,300,177,409,135,446,466,492,480,623,623,623,622,413,444,472,439,623,623,596,600,371,399,427,393,623,623,531,567,340,404,371,371,623,622,462,490,543,536,497,522,617,616,614,611,519,487,450,498,617,616,593,564,470,442,398,474,615,616,560,501,418,398,383,467,616,615,532,464,4,1,2,1,2,1,2,1
95,119,1,152,287,269,150,4,104,115,3,4,187,106,257,87,248,247,222,232,287,287,287,287,239,217,194,225,287,287,279,259,217,193,169,225,287,287,269,223,187,199,153,216,287,287,255,210,217,224,254,242,274,274,273,273,201,223,245,207,274,273,253,271,174,224,215,189,274,274,216,260,171,209,175,200,273,274,215,222,4,1,2,1,2,1,2,1
100,69,1,94,137,137,88,4,56,65,3,4,34,58,107,46,126,122,111,113,137,137,137,137,119,112,99,113,137,137,132,132,105,100,95,106,137,137,128,119,91,89,88,99,137,137,126,109,118,125,130,122,137,137,137,137,111,120,125,117,137,137,134,134,102,111,115,100,137,137,118,130,96,109,85,97,137,137,106,103,4,1,2,1,2,1,2,1


In [120]:
total_rows_less_than_1_from_pick = df_count['UWI'].unique().sum()
total_rows_less_than_1_from_pick

417

Why is the number of unique wells less than the number calculated above 382? Where there rows included in the test dataset that didn't have any rows within 1 of the pick for that well?

In [121]:
df_count['UWI']

TopMcMr_Pick_pred
0       17
60     212
95     119
100     69
Name: UWI, dtype: int64

In [132]:
def getPercents(df,total_wells):
    index_list = df.index.values
    index_num = -1
    for Each in df:
        index_num = index_num+1
        print("label is =", index_num," and total instaces of that label =",Each, "and the % is: ",Each/total_wells)

In [133]:
getPercents(df_count['UWI'],total_rows_less_than_1_from_pick)

label is = 0  and total instaces of that label = 17 and the % is:  0.0407673860911
label is = 1  and total instaces of that label = 212 and the % is:  0.508393285372
label is = 2  and total instaces of that label = 119 and the % is:  0.285371702638
label is = 3  and total instaces of that label = 69 and the % is:  0.165467625899


#### The numbers above show the number of rows within 1 of the actual pick in terms of their predicted label.
#### What we see from this is there are very few rows within 1 (foot?) of actual pick that are predicted to be class 0, or more than 5 from the pick. 

In [134]:
df_look_at_pred_class_vs_distFromRealLess1 = df_featPlus_wUWI_testCopy_wResults[df_featPlus_wUWI_testCopy_wResults['DistFrom_NN1_TopDepth_Abs'] < 1 ]

In [135]:
df_count = df_look_at_pred_class_vs_distFromRealLess1.groupby('TopMcMr_Pick_pred').nunique()

In [136]:
total_rows_less_than_1_from_pick = df_count['UWI'].unique().sum()
total_rows_less_than_1_from_pick

417

In [137]:
df_count['UWI']

TopMcMr_Pick_pred
0       17
60     212
95     119
100     69
Name: UWI, dtype: int64

In [138]:
getPercents(df_count['UWI'],total_rows_less_than_1_from_pick)

label is = 0  and total instaces of that label = 17 and the % is:  0.0407673860911
label is = 1  and total instaces of that label = 212 and the % is:  0.508393285372
label is = 2  and total instaces of that label = 119 and the % is:  0.285371702638
label is = 3  and total instaces of that label = 69 and the % is:  0.165467625899


In [139]:
def getStatsOnWithinDistOfPick(df,distOfPick):
    df_look_at_pred_class_vs_distFromRealLessNum = df[df['DistFrom_NN1_TopDepth_Abs'] < distOfPick]
    df_count = df_look_at_pred_class_vs_distFromRealLessNum.groupby('TopMcMr_Pick_pred').nunique()
    total_rows_less_than_Num_from_pick = df_count['UWI'].unique().sum()
    getPercents(df_count['UWI'],total_rows_less_than_Num_from_pick)

In [140]:
getStatsOnWithinDistOfPick(df_featPlus_wUWI_testCopy_wResults,5)

label is = 0  and total instaces of that label = 53 and the % is:  0.0706666666667
label is = 1  and total instaces of that label = 329 and the % is:  0.438666666667
label is = 2  and total instaces of that label = 198 and the % is:  0.264
label is = 3  and total instaces of that label = 170 and the % is:  0.226666666667


### What this tells us is most of the predicted classes at or around the pick are predicted class of at or around the pick, which is good!

----------

# Turning row-by-row classification into single pick value prediction<a name="classificationToPick"></a>

1. Create function that treats depth & classification prediction column like histogram and finds median value (in this case depth)
2. Create widgeted function that changes values of labels to shift how much weight is given to each class (at pick, right by pick, sorta nearby pick, etc.)
3. Visualize step #2
4. Use steps 1,2,3 to create a new prediction that is a depth for each well
5. Calculate average distance between actual pick and predicted pick
6. Plot results of step 5 as simple scatter plot
7. Plot results of step 5 as map