# PreML_ClassRebalence_FeatureSelection_Prep_20181003_vF
(without Dask)
- Machine_Learning_vB2_20170802
- A cleaner version of Machine_Learning_vB in the same folder
- preceded by feature creation notebooks

## What dataframes to get out of this notebook:
- Dataframes of:
    - Dataframe with everything
    - Dataframe with only training data (with class rebalancing & no index values that might leak)
    - Dataframe with only training labels (with class rebalancing & no index values that might leak)
    - Dataframe with only test data (with no class rebalancing & no index values that might leak)
    - Dataframe with only test labels (with no class rebalancing & no index values that might leak)
    - Versions of each of the 4 above that also keep index columns like UWI and DEPT, which are otherwise left out
- Order of creation:
    - Split into train/test dataframes
    - Rebalance only training dataset
    - Split into test data, test label, train data, train label
    - Save versions with ONLY index columns (UWI & Depth)
    - Take off (UWI & Depth & Others)
    - Save versions with only relevant information
    - Do ML
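
The order of creation above can be sketched on a toy dataframe. This is only an illustration under assumptions: `split_features_labels`, `INDEX_COLS`, and the toy values are hypothetical names and data, not part of this notebook, and the rebalancing step is left as a placeholder comment.

```python
import pandas as pd

INDEX_COLS = ["UWI", "DEPT"]                # identifier columns set aside (assumed list)
LABEL_COL = "class_DistFrPick_TopTarget"    # label column used later in this notebook


def split_features_labels(df, label_col=LABEL_COL, index_cols=INDEX_COLS):
    """Return (index-only frame, features without index/label columns, label series)."""
    index_df = df[index_cols].copy()
    labels = df[label_col].copy()
    features = df.drop(columns=index_cols + [label_col])
    return index_df, features, labels


# Toy data: two hypothetical wells, one already marked train and one test.
toy = pd.DataFrame({
    "UWI": ["w1", "w1", "w2", "w2"],
    "DEPT": [100.0, 100.25, 200.0, 200.25],
    "GR": [80.0, 95.0, 60.0, 70.0],
    "trainOrTest": ["train", "train", "test", "test"],
    "class_DistFrPick_TopTarget": [0, 100, 0, 100],
})

# 1. Split into train/test dataframes.
train = toy[toy["trainOrTest"] == "train"].drop(columns="trainOrTest")
test = toy[toy["trainOrTest"] == "test"].drop(columns="trainOrTest")

# 2. Rebalancing would be applied here, to `train` only.

# 3. Split each part into index columns, features, and labels.
train_idx, train_X, train_y = split_features_labels(train)
test_idx, test_X, test_y = split_features_labels(test)
print(list(train_X.columns))
```

Keeping the index-only frames alongside the feature frames makes it possible to join predictions back to a well and depth later without letting those columns into the model.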

## Contents
1. [Read in an HDF5 file from the previous notebook, which creates the features](#readInFirstHDF5)
2. [Add column for train or test based on a split %, like 80%/20%, split based on well UWI](#trainVsTestCol)
3. [Rebalance the classes by throwing out some of the rows away from the pick and duplicating some rows at or near the known pick.](#rebalanceClasses)
4. [Identify which columns to use as training features](#identifyTrainingFeatures)
5. [Identify which columns to use as labels](#identifyLabelCol)
6. [Split single dataframe into 4 for train-features,train-labels,test-features,test-labels](#splitDataframe)
7. [Machine learning using standard XGBoost classifier and not yet Dask](#machineLearningNoDask)
8. [Evaluate the initial results](#ml_evaluation)
9. [Turning row-by-row classification prediction into single well pick depth prediction](#classificationToPick)


In [99]:
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
import welly
from welly import Well
import lasio
import glob
from sklearn import neighbors
import pickle
import math
import dask
import dask.dataframe as dd
from dask.distributed import Client
# import pdvega
# import vega
import random
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import mean_squared_error


In [100]:
print(welly.__version__)
print(dask.__version__)
print(pd.__version__)

0.3.5
0.18.2
0.23.3


In [101]:
%%timeit
import os
env = %env

91.3 µs ± 964 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [102]:
#### Had to change display options to get this to print in full!
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.options.display.max_colwidth = 100000

In [103]:
knn_dir = "../WellsKNN/"
load_dir = "../loadLAS"
features_dir = "../createFeatures/"

## If you open this notebook fresh and jump to a point below where a pick file is read in, you still need to load everything above! 

------------

# Reading in the last hdf5 file<a name="readInFirstHDF5"></a>

In [104]:
h5_to_load = 'df_all_wells_wKNN_DEPTHtoDEPT_KNN1PredTopMcM_20181003.h5'
h5_key = 'df'
df_all_Col_preSplit = pd.read_hdf(features_dir+h5_to_load, h5_key)

In [105]:
df_all_Col_preSplit.head()

Unnamed: 0,CALI,COND,DELT,DEPT,DPHI,DT,GR,ILD,ILM,NPHI,PHID,RHOB,SFL,SFLU,SN,SP,UWI,trainOrTest,SitID,lat,lng,TopHelper_HorID,TopTarget_HorID,TopHelper_DEPTH,TopTarget_DEPTH,TopHelper_HorID_Qual,TopTarget_Qual,NN1_topTarget_DEPTH,NN1_TopHelper_DEPTH,NN1_thickness,topTarget_Depth_predBy_NN1thick,diff_Top_Depth_Real_v_predBy_NN1thick,diff_TopTarget_DEPTH_v_rowDEPT,diff_TopHelper_DEPTH_v_rowDEPT,class_DistFrPick_TopTarget,class_DistFrPick_TopHelper,DistFrom_NN1ThickPredTopDepth_toRowDept,NewWell,LastBitWell,TopWellDept,BotWellDept,FromTopWell,FromBotWell,WellThickness,closerToBotOrTop,closTopBotDist,rowsToEdge,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,IL
D_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,NPHI_min_5winSize_dirAroundMin,NPHI_min_5winSize_dirAboveMin,NPHI_min_5winSize_dirAroundMax,NPHI_min_5winSize_dirAboveMax,NPHI_min_5winSize_dirAroundMean,NPHI_min_5winSize_dirAboveMean,NPHI_min_5winSize_dirAbovenLarge,NPHI_min_5winSize_dirAroundnLarge,NPHI_min_7winSize_dirAroundMin,NPHI_min_7winSize_dirAboveMin,NPHI_min_7winSize_dirAroundMax,NPHI_min_7winSize_dirAboveMax,NPHI_min_7winSize_dirAroundMean,NPHI_min_7winSize_dirAboveMean,NPHI_min_7winSize_dirAbovenLarge,NPHI_min_7winSize_dirAroundnLarge,NPHI_min_11winSize_dirAroundMin,NPHI_min_11winSize_dirAboveMin,NPHI_min_11winSize_dirAroundMax,NPHI_min_11winSize_dirAboveMax,NPHI_min_11winSize_dirAroundMean,NPHI_min_11winSize_dirAboveMean,NPHI_min_11winSize_dirAbovenLarge,NPHI_min_11winSize_dirAroundnLarge,NPHI_min_21winSize_dirAroundMin,NPHI_min_21winSize_dirAboveMin,NPHI_min_21winSize_dirAroundMax,NPHI_min_21winSize_dirAboveMax,NPHI_min_21winSize_dirAroundMean,NPHI_min_21winSize_dirAboveMean,NPHI_min_21winSize_dirAbovenLarge,NPHI_min_21winSize_dirAroundnLarge,DPHI_min_5winSize_dirAroundMin,DPHI_min_5winSize_dirAboveMin,DPHI_min_5winSize_dirAroundMax,DPHI_min_5winSize_dirAboveMax,DPHI_min_5winSize_dirAroundMean,DPHI_min_5winSize_dirAboveMean,DPHI_min_5winSize_dirAbovenLarge,DPHI_min_5winSize_dirAroundnLarge,DPHI_min_7winSize_dirAroundMin,DPHI_min_7winSize_dirAboveMin,DPHI_min_7winSize_dirAroundMax,DPHI_min_7winSize_dirAboveMax,DPHI_min_7winSize_dirAroundMean,DPH
I_min_7winSize_dirAboveMean,DPHI_min_7winSize_dirAbovenLarge,DPHI_min_7winSize_dirAroundnLarge,DPHI_min_11winSize_dirAroundMin,DPHI_min_11winSize_dirAboveMin,DPHI_min_11winSize_dirAroundMax,DPHI_min_11winSize_dirAboveMax,DPHI_min_11winSize_dirAroundMean,DPHI_min_11winSize_dirAboveMean,DPHI_min_11winSize_dirAbovenLarge,DPHI_min_11winSize_dirAroundnLarge,DPHI_min_21winSize_dirAroundMin,DPHI_min_21winSize_dirAboveMin,DPHI_min_21winSize_dirAroundMax,DPHI_min_21winSize_dirAboveMax,DPHI_min_21winSize_dirAroundMean,DPHI_min_21winSize_dirAboveMean,DPHI_min_21winSize_dirAbovenLarge,DPHI_min_21winSize_dirAroundnLarge,diff_DEPT_vs_NN1_topTarget_DEPTH
0,167.003,,,149.602,0.227,,102.473,0.0,,0.46,,,,,,,00/10-32-080-20W4/0,train,112385,55.978836,-113.095365,14000,13000,384.66,377.95,1,3,389.0,414.0,25.0,359.66,18.29,228.348,235.058,0,0,210.058,True,False,149.602,396.102,0.0,246.5,246.5,FromTopWell,0.0,0,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,-239.398
1,199.159,,,149.852,0.263,,122.589,4.202,,0.55,,,,,,,00/10-32-080-20W4/0,train,112385,55.978836,-113.095365,14000,13000,384.66,377.95,1,3,389.0,414.0,25.0,359.66,18.29,228.098,234.808,0,0,209.808,False,False,149.602,396.102,0.25,246.25,246.5,FromTopWell,0.25,1,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,122.589,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,4.202,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.55,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,0.263,-239.148
2,200.496,,,150.102,0.252,,120.196,4.643,,0.537,,,,,,,00/10-32-080-20W4/0,train,112385,55.978836,-113.095365,14000,13000,384.66,377.95,1,3,389.0,414.0,25.0,359.66,18.29,227.848,234.558,0,0,209.558,False,False,149.602,396.102,0.5,246.0,246.5,FromTopWell,0.5,2,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,120.196,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,4.643,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.537,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,0.252,-238.898
3,203.933,,,150.352,0.244,,115.975,5.28,,0.513,,,,,,,00/10-32-080-20W4/0,train,112385,55.978836,-113.095365,14000,13000,384.66,377.95,1,3,389.0,414.0,25.0,359.66,18.29,227.598,234.308,0,0,209.308,False,False,149.602,396.102,0.75,245.75,246.5,FromTopWell,0.75,3,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,115.975,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,5.28,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.513,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,0.244,-238.648
4,203.664,,,150.602,0.24,,109.271,6.592,,0.487,,,,,,,00/10-32-080-20W4/0,train,112385,55.978836,-113.095365,14000,13000,384.66,377.95,1,3,389.0,414.0,25.0,359.66,18.29,227.348,234.058,0,0,209.058,False,False,149.602,396.102,1.0,245.5,246.5,FromTopWell,1.0,4,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,109.271,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,6.592,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.487,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,0.24,-238.398


In [106]:
len(df_all_Col_preSplit.columns)

176

-------------

# Train vs Test Column creation (IGNORED AS DOING THIS BEFORE KNN NOTEBOOK NOW) <a name="trainVsTestCol"></a>

We'll do this based on UWIs, so we don't have any datapoints from train wells in our test dataset. This is closer to reality than sampling train and test rows randomly from the whole dataframe, because a new well arrives with none of its rows already seen.

Get all the UWIs

In [107]:
# UWIs = list(df_all_Col_preSplit['UWI'].unique())

Find the number of wells if you want 80%

In [108]:
# numberOfTrainingWells = math.floor(len(UWIs)*0.8)
# numberOfTrainingWells

Randomly select that number of UWIs for training and the ones left for test

In [109]:
# UWIs_training = random.sample(UWIs, numberOfTrainingWells)

In [110]:
# UWIs_test = [x for x in UWIs if x not in UWIs_training]

In [111]:
# print("train",len(UWIs_training))
# print("test",len(UWIs_test))

In [112]:
# df_all_Col_preSplit_wTrainTest = df_all_Col_preSplit.copy()

In [113]:
# df_all_Col_preSplit_wTrainTest['trainOrTest'] = np.where(df_all_Col_preSplit_wTrainTest['UWI'].isin(UWIs_training), 'train', 'test')

In [114]:
# df_all_Col_preSplit_wTrainTest.tail()
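
The commented-out steps above can be collected into one function. The following is a minimal sketch, not the code that produced the `trainOrTest` column in the loaded file: `add_train_test_column`, the `seed` argument, and the toy wells are assumptions added here so that the split is reproducible.

```python
import random

import numpy as np
import pandas as pd


def add_train_test_column(df, group_col="UWI", train_fraction=0.8, seed=42):
    """Label every row 'train' or 'test' so that each well lands wholly in one set."""
    wells = sorted(df[group_col].unique())
    n_train = int(len(wells) * train_fraction)
    rng = random.Random(seed)  # seeded so the same wells are picked on every run
    train_wells = rng.sample(wells, n_train)
    out = df.copy()
    out["trainOrTest"] = np.where(out[group_col].isin(train_wells), "train", "test")
    return out


# Toy data: 5 hypothetical wells with 3 depth rows each.
toy = pd.DataFrame({
    "UWI": np.repeat(["w1", "w2", "w3", "w4", "w5"], 3),
    "DEPT": np.tile([100.0, 100.25, 100.5], 5),
})
toy_split = add_train_test_column(toy)
print(toy_split.groupby("trainOrTest")["UWI"].nunique())
```

Because the sampling is done on well identifiers rather than rows, the row-level split will only be approximately 80/20 when wells differ in length.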

--------------

## Split into training and test dataframes

In [115]:
#### Create dataframe that is just the train rows
df_all_Col_train_noRebalance = df_all_Col_preSplit.loc[df_all_Col_preSplit['trainOrTest'] == 'train']

In [116]:
#### Create dataframe that is just the test rows
df_all_Col_test = df_all_Col_preSplit.loc[df_all_Col_preSplit['trainOrTest'] == 'test']
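
Since the split is meant to be by well, a quick check that no UWI appears in both dataframes can catch leakage early. This is a sketch on toy data; `wells_in_both_sets` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd


def wells_in_both_sets(df_train, df_test, group_col="UWI"):
    """Return the set of well identifiers present in both dataframes."""
    return set(df_train[group_col]) & set(df_test[group_col])


# Toy data standing in for the full dataframe.
toy = pd.DataFrame({
    "UWI": ["w1", "w1", "w2", "w3", "w3"],
    "trainOrTest": ["train", "train", "train", "test", "test"],
})
toy_train = toy.loc[toy["trainOrTest"] == "train"]
toy_test = toy.loc[toy["trainOrTest"] == "test"]

shared = wells_in_both_sets(toy_train, toy_test)
print("wells in both train and test:", shared)
```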

------------

# Rebalance class, aka label, populations to deal with lopsided class populations<a name="rebalanceClasses"></a>

#### Because we have a lot more rows far away from the pick than exactly at the pick or close to the pick, the class populations are heavily lopsided. This can leave the model with too few examples to identify the sparsely populated classes, like right at the pick.
#### We'll attempt to deal with this problem by throwing out some of the rows far away from the pick and duplicating some of the rows right at or near the pick.

### THIS SHOULD ONLY BE DONE TO THE TRAINING DATA NOT THE TEST DATA OR THAT IS CHEATING!

In [117]:
#### create a copy for the test below to avoid rewriting accidentally
# df_test_5 = df_all_Col_preSplit_wTrainTest.copy()

In [118]:
#### THIS CELL IS ONLY TO BE RUN IF YOU'RE IGNORING THE TRAIN SPLIT THINGS AS YOU DID THEM BEFORE
df_test_5 = df_all_Col_preSplit

In [119]:
def countRowsByClassOfNearPickOrNot(df,arrayOfClass,divisionInt,classToShrink):
    """
    Takes as input a dataframe, an array of classes, an integer that is only used in the printed message, and the class to shrink.
    Prints the number of rows in each class of 'class_DistFrPick_TopTarget'.
    Returns the rows of classToShrink whose index modulo 10 is not 3, i.e. the rows that would be thrown out.
    """
    for eachClass in arrayOfClass:
        print("length of rows with "+str(eachClass)+" in class_DistFrPick_TopTarget:",len(df[df['class_DistFrPick_TopTarget'] == eachClass]))
    # NOTE: the modulus (10) and remainder (3) are hardcoded; divisionInt only appears in the message below.
    df_NearPickZeroSmall = df.loc[(df.index%10 != 3) & (df['class_DistFrPick_TopTarget'] == classToShrink)]
    print("length of rows with 0 in class_DistFrPick_TopTarget and %"+str(divisionInt)+" == 0 is:",len(df_NearPickZeroSmall))
    # NOTE: the denominator is the length of the boolean mask, i.e. every row in df, not only the rows of classToShrink.
    print("% reduction in classs 0 is:", math.floor(len(df_NearPickZeroSmall) / len(df['class_DistFrPick_TopTarget'] == classToShrink) * 100),"%")
    total_after_reduction_in_bigger_class = len(df[df['class_DistFrPick_TopTarget'] == classToShrink]) -len(df_NearPickZeroSmall)
    print("if taken out using this remainder, the total number of 0 class will be: ",total_after_reduction_in_bigger_class)
#     print("ratio between that class away from pick and classes near pick is :":)
    return df_NearPickZeroSmall

In [120]:
class_array_NearPick = [100,95,70,60,0]
test_df_return = countRowsByClassOfNearPickOrNot(df_test_5,class_array_NearPick,2,0)

length of rows with 100 in class_DistFrPick_TopTarget: 973
length of rows with 95 in class_DistFrPick_TopTarget: 4049
length of rows with 70 in class_DistFrPick_TopTarget: 26066
length of rows with 60 in class_DistFrPick_TopTarget: 25654
length of rows with 0 in class_DistFrPick_TopTarget: 1245892
length of rows with 0 in class_DistFrPick_TopTarget and %2 == 0 is: 1121337
% reduction in classs 0 is: 86 %
if taken out using this remainder, the total number of 0 class will be:  124555


In [121]:
def dropsRowsWithMatchClassAndDeptRemainderIsZero(df,Col,RemainderInt,classToShrink):
    """
    Takes as input a dataframe, a column, a remainder integer (currently unused), and a class within the column.
    Returns the dataframe minus the rows that match classToShrink in Col and whose index modulo 10 is not zero,
    so roughly every tenth row of that class is kept. The test is on the dataframe index, not the DEPT column.
    """
    print("original lenght of dataframe = ",len(df))
    df_new = df.drop(df[(df[Col] == classToShrink) & (df.index%10 != 0)].index)
    print("length of new dataframe after dropping rows = ",len(df_new))
    print("number of rows dropped = ",len(df)-len(df_new))
    print("length of 0 class is :",len(df_new[df_new[Col] == classToShrink]))
    return df_new

In [122]:
df_all_Col_preSplit_wTrainTest_ClassBalanced = dropsRowsWithMatchClassAndDeptRemainderIsZero(df_all_Col_train_noRebalance,'class_DistFrPick_TopTarget',7,0)

original lenght of dataframe =  1046749
length of new dataframe after dropping rows =  145772
number of rows dropped =  900977
length of 0 class is : 100148
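
The drop step above hardcodes a modulus of 10 and never reads its `RemainderInt` argument. A parameterized variant might look like the sketch below. It assumes an integer index, as in this dataframe; `downsample_class_by_index` and `keep_every` are hypothetical names, not part of this notebook.

```python
import pandas as pd


def downsample_class_by_index(df, col, class_to_shrink, keep_every):
    """Keep every `keep_every`-th row (by index value) of one class and all rows of other classes."""
    is_class = df[col] == class_to_shrink
    keep = ~is_class | (df.index % keep_every == 0)
    return df.loc[keep]


# Toy data: 20 rows far from the pick (class 0) and 2 rows at the pick (class 100).
toy = pd.DataFrame({"class_DistFrPick_TopTarget": [0] * 20 + [100] * 2})
toy_small = downsample_class_by_index(toy, "class_DistFrPick_TopTarget", 0, keep_every=10)
print(toy_small["class_DistFrPick_TopTarget"].value_counts())
```

Selecting by index value keeps the thinning deterministic, but it also means the kept rows fall at regular spacing down each well rather than being a random sample.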


In [123]:
df_all_Col_preSplit_wTrainTest_ClassBalanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 145772 entries, 0 to 1302630
Columns: 176 entries, CALI to diff_DEPT_vs_NN1_topTarget_DEPTH
dtypes: bool(2), float64(163), int64(8), object(3)
memory usage: 194.9+ MB


In [124]:
def addsRowsToBalanceClasses(df,rangeFor100,rangeFor95):
    """
    Input is a dataframe, the number of extra copies of the class 100 rows, and the number of extra copies of the class 95 rows.
    Copies the rows with labels that don't occur very much so they are a larger part of the dataframe.
    Returns the new dataframe with the additional copies of rows added on.
    """
    df_class100 = df[df['class_DistFrPick_TopTarget'] == 100]
    df_class95 = df[df['class_DistFrPick_TopTarget'] == 95]
    # Stack the original rows, then the extra copies, in one concat (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df] + [df_class100]*rangeFor100 + [df_class95]*rangeFor95, ignore_index=True)
    return df

In [125]:
df_all_Col_preSplit_wTrainTest_ClassBalanced2 = addsRowsToBalanceClasses(df_all_Col_preSplit_wTrainTest_ClassBalanced,50,10)

In [126]:
def findNumberOfEachClass(df,col):
    return df[col].value_counts()

In [127]:
findNumberOfEachClass(df_all_Col_preSplit_wTrainTest_ClassBalanced2,'class_DistFrPick_TopTarget')

0      100148
100     40086
95      35783
70      20947
60      20638
Name: class_DistFrPick_TopTarget, dtype: int64
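
Raw counts can be hard to compare at a glance, so the same check can be expressed as proportions. A small sketch on toy data follows; `class_proportions` is a hypothetical helper.

```python
import pandas as pd


def class_proportions(df, col):
    """Return the share of rows that falls in each class of `col`."""
    return df[col].value_counts(normalize=True)


# Toy data: 6 rows of class 0, 3 of class 100, 1 of class 95.
toy = pd.DataFrame({"class_DistFrPick_TopTarget": [0] * 6 + [100] * 3 + [95]})
props = class_proportions(toy, "class_DistFrPick_TopTarget")
print(props)
```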

In [128]:
len(df_all_Col_preSplit_wTrainTest_ClassBalanced2)

217602

In [129]:
df_all_Col_preSplit_wTrainTest_ClassBalanced2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217602 entries, 0 to 217601
Columns: 176 entries, CALI to diff_DEPT_vs_NN1_topTarget_DEPTH
dtypes: bool(2), float64(163), int64(8), object(3)
memory usage: 289.3+ MB


In [130]:
df_all_Col_preSplit_wTrainTest_ClassBalanced = df_all_Col_preSplit_wTrainTest_ClassBalanced2

# Identify which columns to use as features <a name="identifyTrainingFeatures"></a>

Get a list of columns

In [131]:
col_list = df_all_Col_preSplit_wTrainTest_ClassBalanced.columns
print(col_list)

Index(['CALI', 'COND', 'DELT', 'DEPT', 'DPHI', 'DT', 'GR', 'ILD', 'ILM', 'NPHI',
       ...
       'DPHI_min_11winSize_dirAroundnLarge', 'DPHI_min_21winSize_dirAroundMin', 'DPHI_min_21winSize_dirAboveMin', 'DPHI_min_21winSize_dirAroundMax', 'DPHI_min_21winSize_dirAboveMax', 'DPHI_min_21winSize_dirAroundMean', 'DPHI_min_21winSize_dirAboveMean', 'DPHI_min_21winSize_dirAbovenLarge', 'DPHI_min_21winSize_dirAroundnLarge', 'diff_DEPT_vs_NN1_topTarget_DEPTH'], dtype='object', length=176)


In [132]:
col_list = list(col_list)
col_list

['CALI',
 'COND',
 'DELT',
 'DEPT',
 'DPHI',
 'DT',
 'GR',
 'ILD',
 'ILM',
 'NPHI',
 'PHID',
 'RHOB',
 'SFL',
 'SFLU',
 'SN',
 'SP',
 'UWI',
 'trainOrTest',
 'SitID',
 'lat',
 'lng',
 'TopHelper_HorID',
 'TopTarget_HorID',
 'TopHelper_DEPTH',
 'TopTarget_DEPTH',
 'TopHelper_HorID_Qual',
 'TopTarget_Qual',
 'NN1_topTarget_DEPTH',
 'NN1_TopHelper_DEPTH',
 'NN1_thickness',
 'topTarget_Depth_predBy_NN1thick',
 'diff_Top_Depth_Real_v_predBy_NN1thick',
 'diff_TopTarget_DEPTH_v_rowDEPT',
 'diff_TopHelper_DEPTH_v_rowDEPT',
 'class_DistFrPick_TopTarget',
 'class_DistFrPick_TopHelper',
 'DistFrom_NN1ThickPredTopDepth_toRowDept',
 'NewWell',
 'LastBitWell',
 'TopWellDept',
 'BotWellDept',
 'FromTopWell',
 'FromBotWell',
 'WellThickness',
 'closerToBotOrTop',
 'closTopBotDist',
 'rowsToEdge',
 'GR_min_5winSize_dirAroundMin',
 'GR_min_5winSize_dirAboveMin',
 'GR_min_5winSize_dirAroundMax',
 'GR_min_5winSize_dirAboveMax',
 'GR_min_5winSize_dirAroundMean',
 'GR_min_5winSize_dirAboveMean',
 'GR_min_5w

## Manually copy the list above and take out the columns that are labels or that you don't want to use as training features
- At some point come back and see if I can instead use a standard list of columns to exclude, so the list of feature columns is built more automatically.
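
One possible shape for that more automatic approach is sketched below, under assumptions: `feature_columns` is a hypothetical helper and the toy column names only borrow from the listing above. It also reports excluded names that match no column, which helps catch misspellings in the exclusion list.

```python
import pandas as pd


def feature_columns(df, exclude):
    """Return (columns to keep, excluded names that match no column in df)."""
    exclude = set(exclude)
    unknown = sorted(exclude - set(df.columns))
    keep = [c for c in df.columns if c not in exclude]
    return keep, unknown


# Toy frame with a handful of the column names; it has no rows.
toy = pd.DataFrame(columns=["DEPT", "GR", "lat", "lng", "FromTopWell"])
keep, unknown = feature_columns(toy, ["lat", "lng", "FromTopWel"])
print("keep:", keep)
print("excluded names not found in dataframe:", unknown)
```

A name in `unknown` is not always a mistake, since a shared exclusion list may mention curves that are simply absent from a given dataset, but it is worth a look.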

In [133]:
## NOTE WE ARE LEAVING THE UWI in for now but will take it out after dataframe is split into train/test portions!!!!
#train_feat_bigList = []

In [134]:
#len(train_feat_bigList)

In [135]:
#df_train_feat = df_all_Col_preSplit_wTrainTest_ClassBalanced[train_feat_bigList]

In [136]:
#df_train_feat.info()

Describing the dataframe here to find out which columns are sparsely populated and have a lot of blanks. We'll likely exclude those columns. For now this is done manually.

In [137]:
#df_train_feat.describe()
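
As an alternative to reading the `count` row of `describe()` by eye, the populated fraction of each column can be computed directly. This is a minimal sketch: `sparsely_populated_columns` and the 0.5 threshold are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd


def sparsely_populated_columns(df, min_fraction=0.5):
    """Return the non-null fraction of every column populated in fewer than `min_fraction` of rows."""
    populated = df.notna().mean()
    return populated[populated < min_fraction].sort_values()


# Toy data: GR always present, ILD mostly present, SP mostly missing.
toy = pd.DataFrame({
    "GR": [80.0, 95.0, 60.0, 70.0],
    "SP": [np.nan, np.nan, np.nan, 12.0],
    "ILD": [4.2, np.nan, 5.3, 6.6],
})
sparse = sparsely_populated_columns(toy)
print(sparse)
```

The index of the returned series could then seed a list like `training_feats_w_lowCount` instead of typing the names by hand.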

### Two lists of columns to not use as training features

Columns taken out as they aren't present often enough in the well dataset

In [138]:
training_feats_w_lowCount = ['RHOB','SP','CALI','COND','DELT','DENS','DPHI:1','DPHI:2','DT','GR:1','GR:2','IL','ILD:1','ILD:2','ILM','LITH','LLD','LLS','PHID','PHIN','RESD','RT','SFL','SFLU','SN','SNP','Sp']


Columns taken out as they either contain information probably captured in other columns, are related to labels too closely, or for other reasons.
#### BUT LEAVE IN THE 'class_DistFrPick_TopTarget' column for now as that's a label we'll use in the label df!!!!

In [263]:
takeOutColumnsNotCurvesList = [
    'FromBotWell',
    # NOTE: 'FromTopWel' matches no column name, so 'FromTopWell' stays in the dataframe
    'FromTopWel',
    'rowsToEdge',
    'lat',
    'lng',
    'SitID',
    'TopHelper_HorID',
    'TopTarget_HorID',
    'TopHelper_DEPTH',
    'diff_Top_Depth_Real_v_predBy_NN1thick',
    'diff_TopTarget_DEPTH_v_rowDEPT',
    'diff_TopHelper_DEPTH_v_rowDEPT',
    'class_DistFrPick_TopHelper',
    'NewWell',
    'LastBitWell',
    'TopWellDept',
    'BotWellDept',
    'WellThickness',
    'closTopBotDist',
    'closerToBotOrTop'
]

The next few cells combine the two lists above and take those columns out of the dataframe.

In [264]:
def takeOutColNotNeededInTrainingDF(df,list_allCol,colToTakeOutCurves,colToTakeOutOther):
    print("number of columns in dataframe coming into function",len(df.columns))
    train_feats_minusLowCount = [x for x in list_allCol if x not in colToTakeOutCurves]
    train_feats_minusLowCount = [x for x in train_feats_minusLowCount if x not in colToTakeOutOther]
    df_train_featWithHighCount = df[train_feats_minusLowCount]
    print("number of columns in dataframe leaving function",len(df_train_featWithHighCount.columns))
    return df_train_featWithHighCount

In [265]:
df_train_featWithHighCount = takeOutColNotNeededInTrainingDF(df_all_Col_preSplit_wTrainTest_ClassBalanced,col_list,training_feats_w_lowCount,takeOutColumnsNotCurvesList)


number of columns in dataframe coming into function 176
number of columns in dataframe leaving function 146


In [266]:
list(df_train_featWithHighCount.columns)

['DEPT',
 'DPHI',
 'GR',
 'ILD',
 'NPHI',
 'UWI',
 'trainOrTest',
 'TopTarget_DEPTH',
 'TopHelper_HorID_Qual',
 'TopTarget_Qual',
 'NN1_topTarget_DEPTH',
 'NN1_TopHelper_DEPTH',
 'NN1_thickness',
 'topTarget_Depth_predBy_NN1thick',
 'class_DistFrPick_TopTarget',
 'DistFrom_NN1ThickPredTopDepth_toRowDept',
 'FromTopWell',
 'GR_min_5winSize_dirAroundMin',
 'GR_min_5winSize_dirAboveMin',
 'GR_min_5winSize_dirAroundMax',
 'GR_min_5winSize_dirAboveMax',
 'GR_min_5winSize_dirAroundMean',
 'GR_min_5winSize_dirAboveMean',
 'GR_min_5winSize_dirAbovenLarge',
 'GR_min_5winSize_dirAroundnLarge',
 'GR_min_7winSize_dirAroundMin',
 'GR_min_7winSize_dirAboveMin',
 'GR_min_7winSize_dirAroundMax',
 'GR_min_7winSize_dirAboveMax',
 'GR_min_7winSize_dirAroundMean',
 'GR_min_7winSize_dirAboveMean',
 'GR_min_7winSize_dirAbovenLarge',
 'GR_min_7winSize_dirAroundnLarge',
 'GR_min_11winSize_dirAroundMin',
 'GR_min_11winSize_dirAboveMin',
 'GR_min_11winSize_dirAroundMax',
 'GR_min_11winSize_dirAboveMax',
 'GR_mi

Number of columns for training

In [267]:
len(df_train_featWithHighCount.columns)

146

In [268]:
df_train_featWithHighCount.describe()

Unnamed: 0,DEPT,DPHI,GR,ILD,NPHI,TopTarget_DEPTH,TopHelper_HorID_Qual,TopTarget_Qual,NN1_topTarget_DEPTH,NN1_TopHelper_DEPTH,NN1_thickness,topTarget_Depth_predBy_NN1thick,class_DistFrPick_TopTarget,DistFrom_NN1ThickPredTopDepth_toRowDept,FromTopWell,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_d
irAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,NPHI_min_5winSize_dirAroundMin,NPHI_min_5winSize_dirAboveMin,NPHI_min_5winSize_dirAroundMax,NPHI_min_5winSize_dirAboveMax,NPHI_min_5winSize_dirAroundMean,NPHI_min_5winSize_dirAboveMean,NPHI_min_5winSize_dirAbovenLarge,NPHI_min_5winSize_dirAroundnLarge,NPHI_min_7winSize_dirAroundMin,NPHI_min_7winSize_dirAboveMin,NPHI_min_7winSize_dirAroundMax,NPHI_min_7winSize_dirAboveMax,NPHI_min_7winSize_dirAroundMean,NPHI_min_7winSize_dirAboveMean,NPHI_min_7winSize_dirAbovenLarge,NPHI_min_7winSize_dirAroundnLarge,NPHI_min_11winSize_dirAroundMin,NPHI_min_11winSize_dirAboveMin,NPHI_min_11winSize_dirAroundMax,NPHI_min_11winSize_dirAboveMax,NPHI_min_11winSize_dirAroundMean,NPHI_min_11winSize_dirAboveMean,NPHI_min_11winSize_dirAbovenLarge,NPHI_min_11winSize_dirAroundnLarge,NPHI_min_21winSize_dirAroundMin,NPHI_min_21winSize_dirAboveMin,NPHI_min_21winSize_dirAroundMax,NPHI_min_21winSize_dirAboveMax,NPHI_min_21winSize_dirAroundMean,NPHI_min_21winSize_dirAboveMean,NPHI_min_21winSize_dirAbovenLarge,NPHI_min_21winSize_dirAroundnLarge,DPHI_min_5winSize_dirAroundMin,DPHI_min_5winSize_dirAboveMin,DPHI_min_5winSize_dirAroundMax,DPHI_min_5winSize_dirAboveMax,DPHI_min_5winSize_dirAroundMean,DPHI_min_5winSize_dirAboveMean,DPHI_min_5winSize_dirAbovenLarge,DPHI_min_5winSize_dirAroundnLarge,DPHI_min_7winSize_dirAroundMin,DPHI_min_7winSize_dirAboveMin,DPHI_min_7winSize_dirAroundMax,DPHI_min_7winSize_dirAboveMax,DPHI_min_7winSize_dirAroundMean,DPHI_min_7winSize_dirAboveMean,DPHI_min_7winSize_dirAbovenLarge,DPHI_min_7winSize_dirAroundnLarge,DPHI_min_11winSize_dirAroundMin,DPHI_min_11winSize_dirAboveMin,DPHI_min_11winSize_dirAroundMax,DPHI_min_11winSize_dirAboveMax,DPHI_min_11winSize_dirAroundMean,DPHI_min_11winSize_dirAboveMean,DPHI_min_11winSize_dirAbovenLarge,DPHI_min_11winSize_dirAroundnLarge,DPHI_min_2
1winSize_dirAroundMin,DPHI_min_21winSize_dirAboveMin,DPHI_min_21winSize_dirAroundMax,DPHI_min_21winSize_dirAboveMax,DPHI_min_21winSize_dirAroundMean,DPHI_min_21winSize_dirAboveMean,DPHI_min_21winSize_dirAbovenLarge,DPHI_min_21winSize_dirAroundnLarge,diff_DEPT_vs_NN1_topTarget_DEPTH
count,217602.0,217329.0,217321.0,217526.0,217499.0,217602.0,217602.0,217602.0,217602.0,217602.0,217602.0,217602.0,217602.0,217602.0,217602.0,217321.0,217321.0,217321.0,217321.0,217321.0,217321.0,217321.0,217321.0,217320.0,217321.0,217320.0,217321.0,217320.0,217321.0,217321.0,217321.0,217319.0,217320.0,217319.0,217320.0,217319.0,217320.0,217320.0,217321.0,217320.0,217319.0,217320.0,217319.0,217320.0,217319.0,217319.0,217320.0,217524.0,217525.0,217524.0,217525.0,217524.0,217525.0,217525.0,217526.0,217525.0,217525.0,217525.0,217525.0,217525.0,217525.0,217525.0,217526.0,217524.0,217524.0,217524.0,217524.0,217524.0,217524.0,217524.0,217524.0,217523.0,217522.0,217523.0,217522.0,217523.0,217522.0,217522.0,217523.0,217496.0,217494.0,217496.0,217494.0,217496.0,217494.0,217494.0,217497.0,217495.0,217492.0,217495.0,217492.0,217495.0,217492.0,217492.0,217495.0,217493.0,217489.0,217493.0,217489.0,217493.0,217489.0,217489.0,217494.0,217489.0,217479.0,217489.0,217479.0,217489.0,217479.0,217479.0,217489.0,217327.0,217325.0,217327.0,217325.0,217327.0,217325.0,217325.0,217327.0,217326.0,217323.0,217326.0,217323.0,217326.0,217323.0,217323.0,217326.0,217323.0,217318.0,217323.0,217318.0,217323.0,217318.0,217318.0,217323.0,217316.0,217307.0,217316.0,217307.0,217316.0,217307.0,217307.0,217318.0,217602.0
mean,393.341893,0.239203,80.094933,20.618847,0.402086,415.999299,1.300843,1.782966,414.856429,453.628444,38.772016,416.151216,46.472712,48.478512,140.998789,72.356381,72.810303,86.829881,85.790926,79.743219,79.346494,79.346494,79.73782,69.732759,70.716666,88.744836,87.253236,79.483405,78.917151,81.659142,82.606915,66.226858,68.446547,91.457548,89.482194,79.051345,78.667977,84.796951,86.094624,62.362566,67.246012,95.452723,92.408375,78.766542,79.49167,89.168616,89.486217,16.332263,15.859532,25.02579,24.15127,20.12291,20.129556,20.129556,20.847684,16.079093,14.931199,27.563055,25.451599,21.039623,19.886107,21.65658,22.562182,13.385753,13.809132,32.091318,27.082356,21.157499,19.487424,23.704258,25.787029,11.933964,14.91073,37.207745,28.74365,21.418354,20.319901,26.186134,28.614409,0.375372,0.378079,0.426356,0.424859,0.401113,0.401457,0.401457,0.401154,0.367235,0.371769,0.43294,0.430993,0.400459,0.401334,0.411097,0.411038,0.356646,0.363675,0.441845,0.439756,0.399779,0.402016,0.42325,0.422379,0.342284,0.355916,0.454705,0.452794,0.399687,0.406481,0.440231,0.433451,0.212919,0.21369,0.264053,0.261541,0.239082,0.238077,0.238077,0.239074,0.204783,0.20603,0.271391,0.267279,0.23925,0.238006,0.248429,0.250055,0.192637,0.194484,0.281163,0.274127,0.239675,0.238132,0.259942,0.26254,0.173832,0.179349,0.292791,0.27909,0.240071,0.237548,0.268762,0.272686,-21.514536
std,166.176145,0.075918,24.665701,344.141682,0.080626,147.596456,0.734882,0.789693,146.389958,142.676138,22.917633,147.817413,44.49362,87.038472,70.905311,23.172088,22.42059,25.22837,24.692508,23.521421,23.007484,23.007484,23.569439,22.662699,22.041233,25.403201,24.675567,22.94182,22.43089,23.08015,23.824546,22.057071,21.975689,25.466659,24.825019,22.009678,21.9083,23.291117,24.113221,21.859609,23.518937,25.941669,24.896593,20.911682,22.165427,23.817502,24.26191,137.465207,319.988135,345.625563,349.182961,184.647969,331.763028,331.763028,344.349838,328.961895,323.052987,408.889353,351.041854,345.92008,332.675687,339.491022,345.879234,314.701205,320.740271,469.957064,353.119538,336.667237,329.118392,342.485741,351.169148,317.443341,337.128479,378.860449,354.433124,334.284668,342.277608,350.663016,350.594508,0.081012,0.080423,0.078273,0.078657,0.076524,0.076767,0.076767,0.076708,0.082043,0.081025,0.077933,0.078995,0.075077,0.075708,0.076015,0.075572,0.084053,0.082346,0.078565,0.080012,0.073328,0.074236,0.075976,0.07523,0.087061,0.085988,0.081896,0.084052,0.07105,0.072728,0.079428,0.078246,0.079231,0.079634,0.072322,0.072752,0.071678,0.072642,0.072642,0.071855,0.080295,0.081067,0.072631,0.072664,0.069978,0.071268,0.070699,0.070103,0.082123,0.083381,0.073593,0.073621,0.067495,0.068962,0.069863,0.070146,0.084551,0.087664,0.07687,0.077393,0.064425,0.06722,0.072724,0.073905,102.692427
min,0.0,-0.593,-109.091,0.0,-0.085,1.5,1.0,1.0,1.5,30.5,1.22,-14.0,0.0,0.0,0.0,-109.091,-109.091,-63.467,-63.467,-78.9142,-78.9142,-78.9142,-78.9142,-109.091,-109.091,-3.91,-6.788,-39.840429,-39.840429,-15.835,-15.835,-109.091,-109.091,-3.91,-6.788,-3.91,-6.788,-6.788,-6.788,-109.091,-109.091,-6.788,-109.091,-6.788,-109.091,-109.091,-109.091,-0.62,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.136,-0.136,0.0,0.0,0.0,0.0,0.0,0.0,-0.136,-0.136,0.0,0.0,0.0,0.0,0.0,0.0,-0.1,-0.085,-0.059,-0.085,-0.0812,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.085,-0.601,-0.593,-0.576,-0.593,-0.5764,-0.593,-0.593,-0.593,-0.601,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.601,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-0.593,-355.972
25%,294.304,0.195,64.555,5.472,0.359,333.0,1.0,1.0,330.5,374.9,19.81,335.89,0.0,5.75,90.0,57.403,58.658,71.365,70.808,64.798,64.8066,64.8066,64.752,55.06425,56.805,73.469,72.36,64.923,64.865286,67.3566,67.7652,52.0835,54.602,76.271,74.522,65.019545,65.054068,70.43175,71.6418,48.2415,52.719,81.214,78.5055,65.894595,66.357667,75.4358,75.7265,4.907,4.896,6.155,6.068,5.582,5.511,5.511,5.5752,4.728,4.683,6.477,6.337,5.649,5.555857,5.7886,5.8842,4.458,4.338,6.926,6.606,5.788886,5.609977,6.20555,6.3654,3.984,3.653,7.7005,6.933,5.945333,5.44706,6.5976,6.9568,0.334,0.336,0.382675,0.381,0.3598,0.36,0.36,0.3598,0.326,0.329,0.389,0.387,0.360429,0.360429,0.3696,0.3708,0.316,0.321,0.397,0.394,0.360545,0.362,0.381,0.381,0.301,0.31,0.407,0.403,0.36219,0.366,0.3948,0.389,0.169,0.169,0.222,0.219,0.1966,0.1952,0.1952,0.1968,0.159,0.16,0.231,0.226,0.197571,0.195143,0.2068,0.2092,0.145,0.144,0.242,0.235,0.199364,0.197,0.2206,0.2232,0.12,0.123,0.255,0.24,0.203476,0.198905,0.2308,0.235,-62.8295
50%,417.0,0.244,79.894,8.379,0.403,445.01,1.0,2.0,445.62,481.0,38.71,445.0,60.0,15.694,162.25,71.974,72.445,86.715,85.926,79.438,79.234,79.234,79.438,69.371,70.115,88.447,87.423,79.146571,78.816571,81.7082,82.558,66.043,67.518,91.379,89.914,78.867182,78.600091,85.1917,86.1118,61.527,64.982,95.684,94.06,78.884929,79.582048,90.9484,90.439,7.393,7.392,9.785,9.512,8.5636,8.4454,8.4454,8.5367,7.089,7.076,10.39,9.988,8.707714,8.488857,8.93578,9.162,6.657,6.539,11.523,10.573,8.942091,8.485364,9.6764,10.1361,5.963,5.642,13.204,11.002,9.271238,8.16619,10.2724,11.404,0.379,0.381,0.424,0.422,0.4012,0.401,0.401,0.4014,0.372,0.375,0.429,0.428,0.400286,0.401,0.4098,0.41,0.363,0.368,0.437,0.437,0.399182,0.401545,0.4214,0.4202,0.35,0.361,0.45,0.452,0.399429,0.406381,0.4404,0.431,0.22,0.222,0.267,0.265,0.2422,0.2426,0.2426,0.2422,0.213,0.214,0.274,0.271,0.242571,0.242286,0.252,0.253,0.201,0.203,0.284,0.278,0.243455,0.241636,0.264,0.267,0.182,0.184,0.296,0.285,0.243524,0.238762,0.274,0.2794,-11.524
75%,480.5,0.289,95.026,14.243,0.447,493.0,1.0,2.0,495.0,540.5,53.0,495.3,95.0,65.31,184.25,87.242,87.264,101.332,100.336,94.1538,93.627,93.627,94.1572,84.479,84.722,103.211,101.759,93.561143,93.006143,96.0092,96.8716,80.159,81.746,105.8895,104.197,92.833545,92.505341,99.28597,100.333,75.255,79.852,109.701,107.276,91.27125,92.524405,103.6586,103.8308,11.978,12.0,17.261,16.588,14.59425,14.1942,14.1942,14.5762,11.33,11.37,18.747,17.542,14.763286,14.314571,15.242,15.9002,10.389,10.44325,21.228,18.768,15.102182,14.282568,16.8067,18.01275,9.126,9.37275,24.817,19.55675,15.533333,13.725,18.1104,21.1786,0.422,0.424,0.469,0.468,0.4434,0.445,0.445,0.4436,0.415,0.419,0.475,0.474,0.442143,0.444143,0.4534,0.4524,0.406,0.412,0.484,0.485,0.440818,0.444727,0.4664,0.464,0.393,0.406,0.501,0.505,0.439286,0.448,0.4882,0.478,0.266,0.267,0.309,0.307,0.286,0.2866,0.2866,0.2862,0.259,0.261,0.314,0.312,0.285,0.285857,0.2948,0.2952,0.248,0.251,0.322,0.317,0.283545,0.283636,0.304,0.306,0.232,0.239,0.332,0.324,0.280667,0.27981,0.3134,0.317,11.75
max,2147.21,1.283,634.184,99960.8281,1.237,920.5,4.0,3.0,920.5,957.07,116.0,948.07,100.0,1515.05,855.48,380.478,421.393,634.184,634.184,503.4172,427.1764,427.1764,503.4172,365.486,634.184,1127.0,634.184,484.194286,634.184,634.184,634.184,421.393,634.184,634.184,634.184,461.736727,634.184,634.184,634.184,634.184,634.184,634.184,634.184,634.184,634.184,634.184,634.184,6395.27,99960.8281,100000.0,99960.8281,40020.06118,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,100000.0,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,99960.8281,1.206,1.174,1.278,1.283,1.2308,1.2224,1.2224,1.2308,1.183,1.174,1.283,1.283,1.223,1.231714,1.2554,1.2344,1.165,1.165,1.283,1.283,1.214091,1.222727,1.2606,1.2572,1.133,1.133,1.283,1.283,1.211905,1.211905,1.2628,1.2628,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1.283,1526.03


In [269]:
used_features = list(df_train_featWithHighCount.columns)

In [270]:
used_features

['DEPT',
 'DPHI',
 'GR',
 'ILD',
 'NPHI',
 'UWI',
 'trainOrTest',
 'TopTarget_DEPTH',
 'TopHelper_HorID_Qual',
 'TopTarget_Qual',
 'NN1_topTarget_DEPTH',
 'NN1_TopHelper_DEPTH',
 'NN1_thickness',
 'topTarget_Depth_predBy_NN1thick',
 'class_DistFrPick_TopTarget',
 'DistFrom_NN1ThickPredTopDepth_toRowDept',
 'FromTopWell',
 'GR_min_5winSize_dirAroundMin',
 'GR_min_5winSize_dirAboveMin',
 'GR_min_5winSize_dirAroundMax',
 'GR_min_5winSize_dirAboveMax',
 'GR_min_5winSize_dirAroundMean',
 'GR_min_5winSize_dirAboveMean',
 'GR_min_5winSize_dirAbovenLarge',
 'GR_min_5winSize_dirAroundnLarge',
 'GR_min_7winSize_dirAroundMin',
 'GR_min_7winSize_dirAboveMin',
 'GR_min_7winSize_dirAroundMax',
 'GR_min_7winSize_dirAboveMax',
 'GR_min_7winSize_dirAroundMean',
 'GR_min_7winSize_dirAboveMean',
 'GR_min_7winSize_dirAbovenLarge',
 'GR_min_7winSize_dirAroundnLarge',
 'GR_min_11winSize_dirAroundMin',
 'GR_min_11winSize_dirAboveMin',
 'GR_min_11winSize_dirAroundMax',
 'GR_min_11winSize_dirAboveMax',
 'GR_mi

### Now let's take those same columns out of the test-only dataframe

In [271]:
df_test_featWithHighCount = takeOutColNotNeededInTrainingDF(df_all_Col_test,col_list,training_feats_w_lowCount,takeOutColumnsNotCurvesList)


number of columns in dataframe coming into function 176
number of columns in dataframe leaving function 146
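
One way to double-check that the test dataframe ends up with exactly the training columns is to compare the two column lists directly. The sketch below uses toy frames, and `low_count_feat` is a hypothetical stand-in for a dropped column. It is not the `takeOutColNotNeededInTrainingDF` function itself, which is defined earlier in the notebook.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (column names are illustrative).
toy_train = pd.DataFrame({"DEPT": [100.0, 100.25], "GR": [80.1, 82.3]})
toy_test = pd.DataFrame(
    {"DEPT": [200.0], "GR": [95.0], "low_count_feat": [None]}
)

# Columns present in test but not in train, and the other way round.
extra_in_test = [c for c in toy_test.columns if c not in toy_train.columns]
missing_in_test = [c for c in toy_train.columns if c not in toy_test.columns]

# Select the training columns, in training order, from the test frame.
toy_test_aligned = toy_test[list(toy_train.columns)]

print(extra_in_test)
print(missing_in_test)
print(list(toy_test_aligned.columns))
```

If `missing_in_test` is ever non-empty, the selection step raises a `KeyError`, which is usually the desired signal that the two frames have diverged.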


--------------------

## Now let's combine the rebalanced train df with the unrebalanced test df to make a df we will then split into 4 pieces: train-data, train-labels, test-data, test-labels

In [272]:

df_testPlusRebalTrain_featWithHighCount = pd.concat([df_train_featWithHighCount,df_test_featWithHighCount])

In [273]:
len(df_testPlusRebalTrain_featWithHighCount.columns)

146
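
`pd.concat` stacks rows and keeps each frame's original index labels, so the combined frame has the same columns as its inputs (when they match) but can contain repeated index labels. A toy sketch of that behaviour, with made-up values:

```python
import pandas as pd

# Two small frames with identical columns and overlapping index labels.
toy_train = pd.DataFrame(
    {"DEPT": [1.0, 2.0], "trainOrTest": ["train", "train"]}, index=[0, 1]
)
toy_test = pd.DataFrame(
    {"DEPT": [3.0, 4.0], "trainOrTest": ["test", "test"]}, index=[1, 2]
)

toy_combined = pd.concat([toy_train, toy_test])

print(len(toy_combined.columns))     # column count is unchanged
print(toy_combined.index.tolist())   # original index labels are kept
print(toy_combined.index.is_unique)  # repeated labels are possible
```

This is why the later splits in this notebook filter on the `trainOrTest` column rather than on index values.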

-----------------

## Identify which columns to use as labels<a name="identifyLabelCol"></a>

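#### The column 'class_DistFrPick_TopTarget' is what we'll use as labels. It appears to correspond to the column named 'cat_isTopMcMrNearby_known' in the function shown below.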
- 100 = exactly the Top McMurray Pick
- 95 if the distance between that depth and the Top McMurray Pick is -0.5 < x and x <0.5
- 60 if the distance between that depth and the Top McMurray Pick is -5 < x and x < 5
- 0 = not near the Top McMurray Pick

The function used to create these classes (labels) as a column was:
`df_all_wells_wKNN_DEPTHtoDEPT['cat_isTopMcMrNearby_known']=df_all_wells_wKNN_DEPTHtoDEPT['diff_TMcM_Pick_v_DEPT'].apply(lambda x: 100 if x==0 else ( 95 if (-0.5 < x and x <0.5) else 60 if (-5 < x and x <5) else 0))`

This function yields only the classes 100, 95, 60 and 0; the class 70 that shows up in the `unique()` output is not described here.
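
The nested lambda above can be easier to read as a named function. The sketch below reproduces the same thresholds on a toy Series; the function name `distance_to_class` and the sample distances are illustrative only, and `x` is assumed to be the signed distance stored in `diff_TMcM_Pick_v_DEPT`.

```python
import pandas as pd

def distance_to_class(x):
    """Map a signed distance from the pick to a class label."""
    if x == 0:
        return 100          # exactly at the pick
    if -0.5 < x < 0.5:
        return 95           # within half a depth unit of the pick
    if -5 < x < 5:
        return 60           # within five depth units of the pick
    return 0                # not near the pick

# Made-up distances covering each branch, including the open boundary at 5.
toy_diff = pd.Series([0.0, 0.25, -0.4, 3.0, -4.9, 5.0, 12.0])
toy_classes = toy_diff.apply(distance_to_class)

print(toy_classes.tolist())  # [100, 95, 95, 60, 60, 0, 0]
```

Because the comparisons are strict, a row exactly 5 units from the pick falls into class 0, not class 60.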

In [274]:
df_testPlusRebalTrain_featWithHighCount['class_DistFrPick_TopTarget'].unique()

array([  0,  70,  95,  60, 100])
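
Since only the training rows were rebalanced, the class mix is expected to differ between the train and test subsets. A cross-tabulation is one quick way to see this. The sketch below uses a toy frame with made-up counts, not the real data.

```python
import pandas as pd

# Toy frame: the "train" rows are balanced, the "test" rows are mostly class 0.
toy = pd.DataFrame({
    "trainOrTest": ["train"] * 4 + ["test"] * 4,
    "class_DistFrPick_TopTarget": [0, 60, 95, 100, 0, 0, 0, 60],
})

# Rows = class label, columns = subset, values = number of rows.
class_counts = pd.crosstab(
    toy["class_DistFrPick_TopTarget"], toy["trainOrTest"]
)

print(class_counts)
```

On the real dataframe the same call would take `df_testPlusRebalTrain_featWithHighCount` in place of `toy`.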

In [275]:
labels = df_testPlusRebalTrain_featWithHighCount[['class_DistFrPick_TopTarget','UWI','trainOrTest','TopTarget_DEPTH']]

In [276]:
labels.head()

Unnamed: 0,class_DistFrPick_TopTarget,UWI,trainOrTest,TopTarget_DEPTH
0,0,00/10-32-080-20W4/0,train,377.95
1,0,00/10-32-080-20W4/0,train,377.95
2,0,00/10-32-080-20W4/0,train,377.95
3,0,00/10-32-080-20W4/0,train,377.95
4,0,00/10-32-080-20W4/0,train,377.95


In [277]:
labels.tail()

Unnamed: 0,class_DistFrPick_TopTarget,UWI,trainOrTest,TopTarget_DEPTH
1296773,0,00/10-20-083-20W4/0,test,410.87
1296774,0,00/10-20-083-20W4/0,test,410.87
1296775,0,00/10-20-083-20W4/0,test,410.87
1296776,0,00/10-20-083-20W4/0,test,410.87
1296777,0,00/10-20-083-20W4/0,test,410.87


In [278]:
len(labels)

473487

The lengths of training dataframes and labels dataframes should be the same. We'll take out UWI and trainOrTest further down.
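
A length comparison can be turned into an explicit check that also compares index labels, which catches rows that were shuffled or dropped on only one side. This is a minimal sketch on toy data; the helper name `same_rows` is hypothetical.

```python
import pandas as pd

def same_rows(features, labels):
    """Return True when both objects have equal length and identical index."""
    return len(features) == len(labels) and features.index.equals(labels.index)

# Toy features and labels with a repeated index label, as rebalancing can produce.
toy_X = pd.DataFrame({"GR": [80.0, 85.0, 90.0]}, index=[0, 1, 1])
toy_y = pd.Series([0, 60, 60], index=[0, 1, 1], name="class_DistFrPick_TopTarget")

print(same_rows(toy_X, toy_y))           # True
print(same_rows(toy_X, toy_y.iloc[:2]))  # False: one label row is missing
```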

-----------------

## Now separate into 4 dataframes<a name="splitDataframe"></a>
- train_labels
- train_feat
- test_labels
- test_feat

Then take off the UWI and trainOrTest columns.

### Create label dataframes

In [279]:
# Split based on 'train' in the trainOrTest column
labels_train = labels[labels['trainOrTest'] == 'train']
# Keep only the 'class_DistFrPick_TopTarget' column, so it is now just a Series of labels
labels_train = labels_train['class_DistFrPick_TopTarget']
# Split based on 'test' in the trainOrTest column
labels_test = labels[labels['trainOrTest'] == 'test']
# Keep only the 'class_DistFrPick_TopTarget' column, so it is now just a Series of labels
labels_test = labels_test['class_DistFrPick_TopTarget']

In [200]:
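# Note: this cell has an older execution count than its neighbours. It rebinds
# df_train_featWithHighCount to the combined train+test dataframe.
df_train_featWithHighCount = df_testPlusRebalTrain_featWithHighCount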

### Create training dataframes

In [280]:
# Keep the 'train' rows and drop the identifier, split, label and pick-depth columns
df_train_featWithHighCount_train = df_testPlusRebalTrain_featWithHighCount[df_testPlusRebalTrain_featWithHighCount['trainOrTest'] == 'train'].drop(columns=['UWI', 'trainOrTest','class_DistFrPick_TopTarget','TopTarget_DEPTH'])
# Keep the 'test' rows and drop the same four columns
df_train_featWithHighCount_test = df_testPlusRebalTrain_featWithHighCount[df_testPlusRebalTrain_featWithHighCount['trainOrTest'] == 'test'].drop(columns=['UWI', 'trainOrTest','class_DistFrPick_TopTarget','TopTarget_DEPTH'])

## Create index dataframes for reattaching 'UWI', 'trainOrTest', 'class_DistFrPick_TopTarget', 'TopTarget_DEPTH'

In [283]:
df_train_featWithHighCount_train_indexOnly = df_testPlusRebalTrain_featWithHighCount[df_testPlusRebalTrain_featWithHighCount['trainOrTest'] == 'train'][['UWI', 'trainOrTest','class_DistFrPick_TopTarget','TopTarget_DEPTH']]
df_train_featWithHighCount_test_indexOnly = df_testPlusRebalTrain_featWithHighCount[df_testPlusRebalTrain_featWithHighCount['trainOrTest'] == 'test' ][['UWI', 'trainOrTest','class_DistFrPick_TopTarget','TopTarget_DEPTH']]

In [284]:
df_train_featWithHighCount_train_indexOnly.head()

Unnamed: 0,UWI,trainOrTest,class_DistFrPick_TopTarget,TopTarget_DEPTH
0,00/10-32-080-20W4/0,train,0,377.95
1,00/10-32-080-20W4/0,train,0,377.95
2,00/10-32-080-20W4/0,train,0,377.95
3,00/10-32-080-20W4/0,train,0,377.95
4,00/10-32-080-20W4/0,train,0,377.95


### Rename to avoid overwriting and to stay consistent with previous work

In [285]:
train_X = df_train_featWithHighCount_train
train_y = labels_train
test_X = df_train_featWithHighCount_test
test_y = labels_test
train_index = df_train_featWithHighCount_train_indexOnly
test_index = df_train_featWithHighCount_test_indexOnly 
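
The train and test splits above repeat the same three steps: filter on `trainOrTest`, drop the non-feature columns, and keep those columns separately for later reattachment. As a hedged sketch, the steps could be wrapped in a single helper. The function name `split_subset` and the toy values are hypothetical; the column names are the ones used in the cells above.

```python
import pandas as pd

LABEL_COL = "class_DistFrPick_TopTarget"
INDEX_COLS = ["UWI", "trainOrTest", LABEL_COL, "TopTarget_DEPTH"]

def split_subset(df, subset):
    """Return (features, labels, index_columns) for the 'train' or 'test' rows."""
    rows = df[df["trainOrTest"] == subset]
    features = rows.drop(columns=INDEX_COLS)  # model inputs only
    labels = rows[LABEL_COL]                  # Series of class labels
    index_only = rows[INDEX_COLS]             # kept for reattaching later
    return features, labels, index_only

# Toy combined frame with one feature column (GR) and made-up well names.
toy = pd.DataFrame({
    "UWI": ["well_A", "well_A", "well_B"],
    "trainOrTest": ["train", "train", "test"],
    LABEL_COL: [0, 100, 60],
    "TopTarget_DEPTH": [377.95, 377.95, 410.87],
    "GR": [80.0, 85.0, 90.0],
})

toy_train_X, toy_train_y, toy_train_index = split_subset(toy, "train")
toy_test_X, toy_test_y, toy_test_index = split_subset(toy, "test")

print(list(toy_train_X.columns))
print(toy_train_y.tolist())
print(len(toy_test_X), len(toy_test_y))
```

Because all three outputs come from the same filtered rows, their lengths and index labels match by construction.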

### Inspect to make sure column headers and lengths make sense

In [298]:
print(len(train_X))
train_X.head()

217602


Unnamed: 0,DEPT,DPHI,GR,ILD,NPHI,TopHelper_HorID_Qual,TopTarget_Qual,NN1_topTarget_DEPTH,NN1_TopHelper_DEPTH,NN1_thickness,topTarget_Depth_predBy_NN1thick,DistFrom_NN1ThickPredTopDepth_toRowDept,FromTopWell,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,I
LD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,NPHI_min_5winSize_dirAroundMin,NPHI_min_5winSize_dirAboveMin,NPHI_min_5winSize_dirAroundMax,NPHI_min_5winSize_dirAboveMax,NPHI_min_5winSize_dirAroundMean,NPHI_min_5winSize_dirAboveMean,NPHI_min_5winSize_dirAbovenLarge,NPHI_min_5winSize_dirAroundnLarge,NPHI_min_7winSize_dirAroundMin,NPHI_min_7winSize_dirAboveMin,NPHI_min_7winSize_dirAroundMax,NPHI_min_7winSize_dirAboveMax,NPHI_min_7winSize_dirAroundMean,NPHI_min_7winSize_dirAboveMean,NPHI_min_7winSize_dirAbovenLarge,NPHI_min_7winSize_dirAroundnLarge,NPHI_min_11winSize_dirAroundMin,NPHI_min_11winSize_dirAboveMin,NPHI_min_11winSize_dirAroundMax,NPHI_min_11winSize_dirAboveMax,NPHI_min_11winSize_dirAroundMean,NPHI_min_11winSize_dirAboveMean,NPHI_min_11winSize_dirAbovenLarge,NPHI_min_11winSize_dirAroundnLarge,NPHI_min_21winSize_dirAroundMin,NPHI_min_21winSize_dirAboveMin,NPHI_min_21winSize_dirAroundMax,NPHI_min_21winSize_dirAboveMax,NPHI_min_21winSize_dirAroundMean,NPHI_min_21winSize_dirAboveMean,NPHI_min_21winSize_dirAbovenLarge,NPHI_min_21winSize_dirAroundnLarge,DPHI_min_5winSize_dirAroundMin,DPHI_min_5winSize_dirAboveMin,DPHI_min_5winSize_dirAroundMax,DPHI_min_5winSize_dirAboveMax,DPHI_min_5winSize_dirAroundMean,DPHI_min_5winSize_dirAboveMean,DPHI_min_5winSize_dirAbovenLarge,DPHI_min_5winSize_dirAroundnLarge,DPHI_min_7winSize_dirAroundMin,DPHI_min_7winSize_dirAboveMin,DPHI_min_7winSize_dirAroundMax,DPHI_min_7winSize_dirAboveMax,DPHI_min_7winSize_dirAroundMean,DPHI_min_7winSize_dirAboveMean,DPHI_min_7winSize_dirAbovenLarge,DPHI_min_7winSize_dirAroundnLarge,DPHI_min_11winSize_dirAroundMin,DPHI_min_11winSize_dirAboveMin,DPHI_min_11winSize_dirAroundMax,DPHI_min_11winSize_dirAboveMax,DPHI_min_11winSize_dirAroundMean,DPHI_min_11winSize_dirAboveMean,DPHI_min_11winSize_dirAbovenLarge,DPHI_min_11winSize_dirAroundnLarge,DPHI_min_21winSize_dirAroundMin,DPHI_min_21winSize_di
rAboveMin,DPHI_min_21winSize_dirAroundMax,DPHI_min_21winSize_dirAboveMax,DPHI_min_21winSize_dirAroundMean,DPHI_min_21winSize_dirAboveMean,DPHI_min_21winSize_dirAbovenLarge,DPHI_min_21winSize_dirAroundnLarge,diff_DEPT_vs_NN1_topTarget_DEPTH
0,149.602,0.227,102.473,0.0,0.46,1,3,389.0,414.0,25.0,359.66,210.058,0.0,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.46,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,0.227,-239.398
1,152.102,0.269,26.625,30.179,0.355,1,3,389.0,414.0,25.0,359.66,207.558,2.5,25.825,26.625,50.213,26.625,32.768,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,20.262,30.179,30.37,30.179,27.1088,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,0.353,0.355,0.403,0.355,0.3668,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.355,0.263,0.269,0.277,0.269,0.2682,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,0.269,-236.898
2,154.602,0.339,31.562,21.793,0.428,1,3,389.0,414.0,25.0,359.66,205.058,5.0,23.605,31.562,49.258,31.562,34.0164,31.562,31.562,31.562,22.7,31.562,60.528,31.562,36.187143,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,16.583,21.793,25.975,21.793,21.5708,21.793,21.793,21.793,14.586,21.793,26.774,21.793,21.316286,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,0.415,0.428,0.474,0.428,0.4364,0.428,0.428,0.428,0.391,0.428,0.483,0.428,0.436571,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.428,0.315,0.339,0.343,0.339,0.3302,0.339,0.339,0.339,0.298,0.339,0.355,0.339,0.329143,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,0.339,-234.398
3,157.102,0.291,51.257,7.449,0.452,1,3,389.0,414.0,25.0,359.66,202.558,7.5,40.739,37.621,72.481,51.257,53.8586,43.6546,43.6546,53.8586,37.621,37.621,87.074,60.965,56.284,47.374143,50.6518,63.1256,37.621,51.257,88.401,51.257,59.979636,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,5.151,7.449,13.402,13.402,8.5374,11.2332,11.2332,8.5374,4.945,7.449,13.402,13.402,8.707571,10.551143,11.5662,10.1714,4.945,7.449,13.402,7.449,8.569,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,0.422,0.403,0.479,0.452,0.4514,0.4308,0.4308,0.4514,0.403,0.403,0.516,0.503,0.453714,0.447143,0.461,0.4702,0.403,0.452,0.541,0.452,0.469091,0.452,0.452,0.452,0.452,0.452,0.452,0.452,0.452,0.452,0.452,0.452,0.278,0.279,0.298,0.3,0.2882,0.291,0.291,0.2882,0.268,0.279,0.298,0.333,0.284,0.301143,0.3084,0.2884,0.256,0.291,0.32,0.291,0.285273,0.291,0.291,0.291,0.291,0.291,0.291,0.291,0.291,0.291,0.291,0.291,-231.898
4,159.602,0.275,24.048,28.931,0.384,1,3,389.0,414.0,25.0,359.66,200.058,10.0,24.048,24.048,38.122,71.567,28.8192,43.1398,43.1398,28.8192,24.048,24.048,54.197,88.401,33.679571,55.188,66.9006,37.4412,24.048,24.048,82.216,24.048,43.925455,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,17.23,9.333,28.931,28.931,23.7516,18.6082,18.6082,23.7516,12.412,6.043,28.931,28.931,21.132,15.137571,18.6082,23.7516,6.879,28.931,28.931,28.931,17.155545,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,0.363,0.363,0.442,0.537,0.3882,0.448,0.448,0.3882,0.363,0.363,0.514,0.541,0.405,0.474429,0.5148,0.4202,0.363,0.384,0.54,0.384,0.429818,0.384,0.384,0.384,0.384,0.384,0.384,0.384,0.384,0.384,0.384,0.384,0.275,0.246,0.306,0.281,0.2842,0.2674,0.2674,0.2842,0.26,0.246,0.318,0.281,0.285571,0.266714,0.273,0.2928,0.246,0.275,0.318,0.275,0.283909,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,0.275,-229.398


In [299]:
print(len(train_y))
train_y.head()

217602


0    0
1    0
2    0
3    0
4    0
Name: class_DistFrPick_TopTarget, dtype: int64

In [300]:
print(len(test_X))
test_X.head()

255885


Unnamed: 0,DEPT,DPHI,GR,ILD,NPHI,TopHelper_HorID_Qual,TopTarget_Qual,NN1_topTarget_DEPTH,NN1_TopHelper_DEPTH,NN1_thickness,topTarget_Depth_predBy_NN1thick,DistFrom_NN1ThickPredTopDepth_toRowDept,FromTopWell,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,I
LD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,NPHI_min_5winSize_dirAroundMin,NPHI_min_5winSize_dirAboveMin,NPHI_min_5winSize_dirAroundMax,NPHI_min_5winSize_dirAboveMax,NPHI_min_5winSize_dirAroundMean,NPHI_min_5winSize_dirAboveMean,NPHI_min_5winSize_dirAbovenLarge,NPHI_min_5winSize_dirAroundnLarge,NPHI_min_7winSize_dirAroundMin,NPHI_min_7winSize_dirAboveMin,NPHI_min_7winSize_dirAroundMax,NPHI_min_7winSize_dirAboveMax,NPHI_min_7winSize_dirAroundMean,NPHI_min_7winSize_dirAboveMean,NPHI_min_7winSize_dirAbovenLarge,NPHI_min_7winSize_dirAroundnLarge,NPHI_min_11winSize_dirAroundMin,NPHI_min_11winSize_dirAboveMin,NPHI_min_11winSize_dirAroundMax,NPHI_min_11winSize_dirAboveMax,NPHI_min_11winSize_dirAroundMean,NPHI_min_11winSize_dirAboveMean,NPHI_min_11winSize_dirAbovenLarge,NPHI_min_11winSize_dirAroundnLarge,NPHI_min_21winSize_dirAroundMin,NPHI_min_21winSize_dirAboveMin,NPHI_min_21winSize_dirAroundMax,NPHI_min_21winSize_dirAboveMax,NPHI_min_21winSize_dirAroundMean,NPHI_min_21winSize_dirAboveMean,NPHI_min_21winSize_dirAbovenLarge,NPHI_min_21winSize_dirAroundnLarge,DPHI_min_5winSize_dirAroundMin,DPHI_min_5winSize_dirAboveMin,DPHI_min_5winSize_dirAroundMax,DPHI_min_5winSize_dirAboveMax,DPHI_min_5winSize_dirAroundMean,DPHI_min_5winSize_dirAboveMean,DPHI_min_5winSize_dirAbovenLarge,DPHI_min_5winSize_dirAroundnLarge,DPHI_min_7winSize_dirAroundMin,DPHI_min_7winSize_dirAboveMin,DPHI_min_7winSize_dirAroundMax,DPHI_min_7winSize_dirAboveMax,DPHI_min_7winSize_dirAroundMean,DPHI_min_7winSize_dirAboveMean,DPHI_min_7winSize_dirAbovenLarge,DPHI_min_7winSize_dirAroundnLarge,DPHI_min_11winSize_dirAroundMin,DPHI_min_11winSize_dirAboveMin,DPHI_min_11winSize_dirAroundMax,DPHI_min_11winSize_dirAboveMax,DPHI_min_11winSize_dirAroundMean,DPHI_min_11winSize_dirAboveMean,DPHI_min_11winSize_dirAbovenLarge,DPHI_min_11winSize_dirAroundnLarge,DPHI_min_21winSize_dirAroundMin,DPHI_min_21winSize_di
rAboveMin,DPHI_min_21winSize_dirAroundMax,DPHI_min_21winSize_dirAboveMax,DPHI_min_21winSize_dirAroundMean,DPHI_min_21winSize_dirAboveMean,DPHI_min_21winSize_dirAbovenLarge,DPHI_min_21winSize_dirAroundnLarge,diff_DEPT_vs_NN1_topTarget_DEPTH
11203,268.224,0.238,105.0,4.286,0.588,1,2,452.5,507.0,54.5,445.68,177.456,0.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,105.0,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,4.286,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.588,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,0.238,-184.276
11204,268.474,0.225,106.238,4.287,0.585,1,2,452.5,507.0,54.5,445.68,177.206,0.25,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,106.238,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.585,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,0.225,-184.026
11205,268.724,0.236,108.683,4.289,0.576,1,2,452.5,507.0,54.5,445.68,176.956,0.5,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,108.683,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,4.289,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.576,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,-183.776
11206,268.974,0.236,106.884,4.287,0.593,1,2,452.5,507.0,54.5,445.68,176.706,0.75,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,106.884,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,4.287,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.593,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,0.236,-183.526
11207,269.224,0.237,105.7,4.424,0.595,1,2,452.5,507.0,54.5,445.68,176.456,1.0,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,105.7,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,4.424,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.595,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,0.237,-183.276


In [301]:
print(len(test_y))
test_y.head()

255885


11203    0
11204    0
11205    0
11206    0
11207    0
Name: class_DistFrPick_TopTarget, dtype: int64

-------------------

## Save the dataframes under separate keys in one HDF5 file
- train_X
- train_y
- test_X
- test_y
- train_index and test_index
- & the full dataframe (key `preSplitpreBal`)

### Write pandas dataframes to HDF5

In [302]:
# Write hdf5 to current directory; mode='w' creates the file (overwriting any existing one)
# df = df_testPlusRebalTrain_featWithHighCount
# key = preSplitpreBal
df_testPlusRebalTrain_featWithHighCount.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='preSplitpreBal', mode='w')

In [303]:
# Write hdf5 to current directory; the default mode='a' appends a new key to the same file
# df = train_X
# key = train_X

train_X.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='train_X')

In [304]:
# Write hdf5 to current directory
# df = train_y
# key = train_y

train_y.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='train_y')

In [305]:
# Write hdf5 to current directory
# df = test_X
# key = test_X

test_X.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='test_X')

In [310]:
# Write hdf5 to current directory
# df = test_y
# key = test_y

test_y.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='test_y')

In [307]:
# train_index.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='test_y')

In [308]:
train_index.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='train_index')

In [309]:
test_index.to_hdf('df_all_Col_preSplit_wTrainTest_ClassBalanced_PreML_20181003.h5', key='test_index')

---------------------

# Machine-learning<a name=machineLearningNoDask></a>

In [244]:
seed = 123  # defined here but not passed to XGBClassifier below (its repr shows random_state=0, seed=None)

In [245]:
# .values.ravel()
model = XGBClassifier(
    gamma=0,
    reg_alpha=0.2,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    n_estimators=300,
    learning_rate=0.03,
    min_child_weight=3,
    n_jobs=8,
)
model.fit(train_X, train_y)


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0, learning_rate=0.03, max_delta_step=0,
       max_depth=3, min_child_weight=3, missing=None, n_estimators=300,
       n_jobs=8, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0.2, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.8)

In [246]:
result = model.predict(test_X)
result


array([0, 0, 0, ..., 0, 0, 0])

In [247]:
type(result)

numpy.ndarray

In [248]:
len(result)

255885

In [249]:
test_y[3300:4900]

31605      0
31606      0
31607      0
31608      0
31609      0
31610      0
31611      0
31612      0
31613      0
31614      0
31615      0
31616      0
31617      0
31618      0
31619      0
31620      0
31621      0
31622      0
31623      0
31624      0
31625      0
31626      0
31627      0
31628      0
31629      0
31630      0
31631      0
31632      0
31633      0
31634      0
31635      0
31636      0
31637      0
31638      0
31639      0
31640      0
31641      0
31642      0
31643      0
31644      0
31645      0
31646      0
31647      0
31648      0
31649      0
31650      0
31651      0
31652      0
31653      0
31654      0
31655      0
31656      0
31657      0
31658      0
31659      0
31660      0
31661      0
31662      0
31663      0
31664      0
31665      0
31666      0
31667      0
31668      0
31669      0
31670      0
31671      0
31672      0
31673      0
31674      0
31675      0
31676      0
31677      0
31678      0
31679      0
31680      0
31681      0

In [250]:
test_y_indexValues = test_y.index.values
df_result = pd.DataFrame(result, index=test_y_indexValues, columns=['TopTarget_Pick_pred'])
df_results_2 = pd.concat([test_y, df_result], axis=1)

In [251]:
df_results_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 255885 entries, 11203 to 1296777
Data columns (total 2 columns):
class_DistFrPick_TopTarget    255885 non-null int64
TopTarget_Pick_pred           255885 non-null int64
dtypes: int64(2)
memory usage: 5.9 MB


In [252]:
df_results_2.head()

Unnamed: 0,class_DistFrPick_TopTarget,TopTarget_Pick_pred
11203,0,0
11204,0,0
11205,0,0
11206,0,0
11207,0,0
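
The alignment step above (attaching predictions to the index of `test_y`) can be sketched on toy data. This is a minimal illustration, not the notebook's actual data: the names `test_y_demo`, `result_demo`, and `df_results_demo` and all values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_y and model.predict(test_X); values are illustrative only.
test_y_demo = pd.Series(
    [0, 0, 60, 100],
    index=[11203, 11204, 11205, 11206],
    name="class_DistFrPick_TopTarget",
)
result_demo = np.array([0, 60, 60, 100])

# Reusing the label index keeps each prediction attached to the row it was made for.
pred_demo = pd.Series(result_demo, index=test_y_demo.index, name="TopTarget_Pick_pred")

# Column-wise concat aligns on the index; NaN values would signal a misalignment.
df_results_demo = pd.concat([test_y_demo, pred_demo], axis=1)
assert not df_results_demo.isna().any().any()
print(df_results_demo)
```

Because `predict` returns a bare array, the pairing with `test_y` relies on row order being unchanged between `test_X` and `test_y`.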


In [253]:
# test_df_pred = test_y.copy()
# test_df_pred['Pick_pred'] = result
# test_df_pred.head()

# Examination of first-level results<a name="ml_evaluation"></a>

In [254]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# make predictions for test data
# y_pred = model.predict(X_test)
# predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(df_results_2['class_DistFrPick_TopTarget'], df_results_2['TopTarget_Pick_pred'])

#### Accuracy where only exact label matches count, row by row (so 60 = 60, 100 = 100)

In [255]:
accuracy

0.8886296578541142
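
Since most test rows are class 0, a single accuracy number can hide how the rarer near-pick classes are doing. The sketch below, on made-up labels (`known` and `pred` are hypothetical stand-ins for the two result columns), shows one way to look at a confusion table and per-class recall using pandas only.

```python
import pandas as pd

# Toy labels, dominated by class 0 like the real test set; values are illustrative only.
known = pd.Series([0, 0, 0, 0, 0, 0, 60, 60, 95, 100], name="known")
pred = pd.Series([0, 0, 0, 0, 0, 0, 0, 60, 60, 100], name="pred")

# Overall accuracy: share of rows whose predicted label equals the known label.
overall_accuracy = (known == pred).mean()

# Confusion table: rows are known labels, columns are predicted labels.
confusion = pd.crosstab(known, pred)

# Per-class recall: share of each known class that was predicted exactly.
per_class_recall = (known == pred).groupby(known).mean()

print(overall_accuracy)
print(confusion)
print(per_class_recall)
```

In this toy case the overall accuracy is high even though one class is never predicted correctly, which is the pattern worth checking for in the real results.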

#### Make another dataframe with columns that lump classes together, to compare against other groupings of prediction classes

In [257]:
df_results_3 = df_results_2.copy()

In [258]:
df_results_3[0:500]

Unnamed: 0,class_DistFrPick_TopTarget,TopTarget_Pick_pred
11203,0,0
11204,0,0
11205,0,0
11206,0,0
11207,0,0
11208,0,0
11209,0,0
11210,0,0
11211,0,0
11212,0,0


In [259]:
df_results_3['class_DistFrPick_TopTarget_95or100'] = np.where(df_results_3['class_DistFrPick_TopTarget']>70, 1, 0)
df_results_3['TopTarget_Pick_pred_95or100'] = np.where(df_results_3['TopTarget_Pick_pred']>70, 1, 0)

In [261]:
#### inspect
df_results_3[500:800]

Unnamed: 0,class_DistFrPick_TopTarget,TopTarget_Pick_pred,class_DistFrPick_TopTarget_95or100,TopTarget_Pick_pred_95or100
11703,0,0,0,0
11704,0,0,0,0
11705,0,0,0,0
11706,0,0,0,0
11707,0,0,0,0
11708,0,0,0,0
11709,0,0,0,0
11710,0,0,0,0
11711,0,0,0,0
11712,0,0,0,0


#### Accuracy when looking only at labels 95 and 100 (lumped into one class) in both known and predicted

In [262]:
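# NOTE: the first argument is the raw multi-class label, not the binarized
# class_DistFrPick_TopTarget_95or100 column created above, so the score below
# compares a multi-class column against a binary one.
accuracy = accuracy_score(df_results_3['class_DistFrPick_TopTarget'], df_results_3['TopTarget_Pick_pred_95or100'])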
accuracy

0.9021669890771244
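
A lumped-class accuracy is only meaningful when both columns are binarized with the same rule. The sketch below uses made-up labels and assumes the label values 0, 60, 95, and 100 with the same `> 70` threshold as above; `df_demo` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy known/predicted labels; values are illustrative only.
df_demo = pd.DataFrame({"known": [0, 60, 95, 100, 0], "pred": [0, 95, 95, 60, 0]})

threshold = 70  # labels above this (95 and 100) are lumped into class 1
df_demo["known_95or100"] = np.where(df_demo["known"] > threshold, 1, 0)
df_demo["pred_95or100"] = np.where(df_demo["pred"] > threshold, 1, 0)

# Like-for-like comparison: binary known vs binary predicted.
binary_accuracy = (df_demo["known_95or100"] == df_demo["pred_95or100"]).mean()

# Mixed comparison: raw multi-class known vs binary predicted.
# Any non-zero known label (60, 95, 100) can never equal 0 or 1, so it always counts as a miss.
mixed_accuracy = (df_demo["known"] == df_demo["pred_95or100"]).mean()

print(binary_accuracy, mixed_accuracy)
```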

Create more columns for lumped labels

In [88]:
df_results_3['cat_isTopMcMrNearby_known_60or95or100'] = np.where(df_results_3['cat_isTopMcMrNearby_known']>59, 1, 0)
df_results_3['TopMcMr_Pick_pred_60or95or100'] = np.where(df_results_3['TopMcMr_Pick_pred']>59, 1, 0)
df_results_3['cat_isTopMcMrNearby_known_100'] = np.where(df_results_3['cat_isTopMcMrNearby_known']==100, 1, 0)
df_results_3['TopMcMr_Pick_pred_known_100'] = np.where(df_results_3['TopMcMr_Pick_pred']==100, 1, 0)

In [89]:
accuracy = accuracy_score(df_results_3['cat_isTopMcMrNearby_known_60or95or100'], df_results_3['TopMcMr_Pick_pred_60or95or100'])
accuracy

0.87035775680321759

In [90]:
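# NOTE: this scores the "exactly 100" known column against the lumped 60/95/100 prediction column;
# the prediction column with the matching definition is TopMcMr_Pick_pred_known_100.
accuracy = accuracy_score(df_results_3['cat_isTopMcMrNearby_known_100'], df_results_3['TopMcMr_Pick_pred_60or95or100'])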
accuracy

0.62803671629204372

In [91]:
#### inspecting results manually
df_results_3[7000:9000]

Unnamed: 0,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100,cat_isTopMcMrNearby_known_60or95or100,TopMcMr_Pick_pred_60or95or100,cat_isTopMcMrNearby_known_100,TopMcMr_Pick_pred_known_100
39398,60,100,0,1,1,1,0,1
39399,60,100,0,1,1,1,0,1
39400,60,100,0,1,1,1,0,1
39401,0,60,0,0,0,1,0,0
39402,0,0,0,0,0,0,0,0
39403,0,0,0,0,0,0,0,0
39404,0,0,0,0,0,0,0,0
39405,0,0,0,0,0,0,0,0
39406,0,0,0,0,0,0,0,0
39407,0,0,0,0,0,0,0,0


In [92]:
len(df_results_3)

61662

In [93]:
df_all_Col_preSplit_wTrainTest_ClassBalanced.head()

Unnamed: 0,CALI,COND,DELT,DENS,DEPT,DEPTH,DPHI,DPHI:1,DPHI:2,DT,GR,GR:1,GR:2,IL,ILD,ILD:1,ILD:2,ILM,LITH,LLD,LLS,NPHI,PHID,PHIN,RESD,RHOB,RT,SFL,SFLU,SN,SNP,SP,UWI,SitID,McMurray_Base_HorID,McMurray_Top_HorID,McMurray_Base_DEPTH,McMurray_Top_DEPTH,McMurray_Base_Qual,McMurray_Top_Qual,lat,lng,NN1_McMurray_Top_DEPTH,NN1_McMurray_Base_DEPTH,NN1_thickness,MM_Top_Depth_predBy_NN1thick,HorID,Pick,Quality,HorID_paleoz,Pick_paleoz,Quality_paleoz,diff_TMcM_Pick_v_DEPT,diff_TPal_Pick_v_DEPT,cat_isTopMcMrNearby_known,cat_isTopPalNearby_known,DistFrom_NN1_TopDepth_Abs,NewWell,LastBitWell,TopWellDept,BotWellDept,FromTopWell,FromBotWell,WellThickness,closerToBotOrTop,closTopBotDist,rowsToEdge,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,trainOrTest
0,167.003,,,,149.602,,0.227,,,,102.473,,,,0.0,,,,,,,0.46,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,228.348,235.058,0,0,210.058,True,False,149.602,396.102,0.0,246.5,246.5,FromTopWell,0.0,0,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,102.473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,train
1,166.675,,,,152.102,,0.269,,,,26.625,,,,30.179,,,,,,,0.355,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,225.848,232.558,0,0,207.558,False,False,149.602,396.102,2.5,244.0,246.5,FromTopWell,2.5,10,25.825,26.625,50.213,26.625,32.768,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,26.625,20.262,30.179,30.37,30.179,27.1088,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,30.179,train
2,211.701,,,,154.602,,0.339,,,,31.562,,,,21.793,,,,,,,0.428,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,223.348,230.058,0,0,205.058,False,False,149.602,396.102,5.0,241.5,246.5,FromTopWell,5.0,20,23.605,31.562,49.258,31.562,34.0164,31.562,31.562,31.562,22.7,31.562,60.528,31.562,36.187143,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,31.562,16.583,21.793,25.975,21.793,21.5708,21.793,21.793,21.793,14.586,21.793,26.774,21.793,21.316286,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,21.793,train
3,188.132,,,,157.102,,0.291,,,,51.257,,,,7.449,,,,,,,0.452,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,220.848,227.558,0,0,202.558,False,False,149.602,396.102,7.5,239.0,246.5,FromTopWell,7.5,30,40.739,37.621,72.481,51.257,53.8586,43.6546,43.6546,53.8586,37.621,37.621,87.074,60.965,56.284,47.374143,50.6518,63.1256,37.621,51.257,88.401,51.257,59.979636,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,51.257,5.151,7.449,13.402,13.402,8.5374,11.2332,11.2332,8.5374,4.945,7.449,13.402,13.402,8.707571,10.551143,11.5662,10.1714,4.945,7.449,13.402,7.449,8.569,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,7.449,train
4,165.135,,,,159.602,,0.275,,,,24.048,,,,28.931,,,,,,,0.384,,,,,,,,,,,00/10-32-080-20W4/0,112385,14000,13000,384.66,377.95,1,3,55.978836,-113.095365,389.0,414.0,25.0,359.66,13000,377.95,3,14000,384.66,1,218.348,225.058,0,0,200.058,False,False,149.602,396.102,10.0,236.5,246.5,FromTopWell,10.0,40,24.048,24.048,38.122,71.567,28.8192,43.1398,43.1398,28.8192,24.048,24.048,54.197,88.401,33.679571,55.188,66.9006,37.4412,24.048,24.048,82.216,24.048,43.925455,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,24.048,17.23,9.333,28.931,28.931,23.7516,18.6082,18.6082,23.7516,12.412,6.043,28.931,28.931,21.132,15.137571,18.6082,23.7516,6.879,28.931,28.931,28.931,17.155545,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,28.931,train


In [94]:
len(df_all_Col_preSplit_wTrainTest_ClassBalanced)

307648

In [95]:
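# NOTE: np.where called with only a condition returns a tuple of index arrays
# (one array per dimension), not the filtered rows, which is why len() below is 1.
df_all_Col_preSplit_wTrainTest_ClassBalanced_Copy = np.where(df_all_Col_preSplit_wTrainTest_ClassBalanced['trainOrTest'] == 'test')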

In [96]:
len(df_all_Col_preSplit_wTrainTest_ClassBalanced_Copy)

1
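
The length of 1 above comes from how `np.where` behaves when given only a condition. A small sketch on a made-up frame (`df_demo` is hypothetical) contrasts that with boolean-mask filtering, which is the form used for the test split further down.

```python
import numpy as np
import pandas as pd

# Toy frame with the same trainOrTest column; values are illustrative only.
df_demo = pd.DataFrame({
    "UWI": ["a", "a", "b", "c"],
    "trainOrTest": ["train", "train", "test", "test"],
})

is_test = df_demo["trainOrTest"] == "test"

# np.where with a single argument returns a 1-element tuple holding the matching positions.
positions = np.where(is_test)
print(len(positions), positions[0])

# Boolean-mask indexing returns the matching rows themselves.
test_rows = df_demo[is_test].copy()
print(len(test_rows))
```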

In [98]:
predictedPickIsExactlyHere = df_results_3[df_results_3['TopMcMr_Pick_pred_known_100'] == 1]
test100 = predictedPickIsExactlyHere['TopMcMr_Pick_pred_known_100']

In [99]:
type(test100)

pandas.core.series.Series

In [100]:
test100.values

array([1, 1, 1, ..., 1, 1, 1])

### More evaluation

In [102]:
df_featPlus_wUWI_testCopy = df_train_featWithHighCount[df_train_featWithHighCount['trainOrTest'] == 'test' ].copy()

In [103]:
df_featPlus_wUWI_testCopy.head()

Unnamed: 0,UWI,trainOrTest,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge
390,00/11-19-073-16W4/0,test,0.185,101.752,3.723,0.537,2,23.78,421.84,1,2,183.096,445.994,1.0,208.25,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,101.752,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723,3.723
391,00/11-19-073-16W4/0,test,0.212,100.657,2.95,0.516,2,23.78,421.84,1,2,180.596,445.994,3.5,208.25,100.349,100.657,104.476,100.657,101.5134,100.657,100.657,100.657,100.349,100.657,106.802,100.657,102.304429,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,100.657,2.95,2.95,3.254,2.95,3.1066,2.95,2.95,2.95,2.95,2.95,3.414,2.95,3.194286,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95,2.95
392,00/11-19-073-16W4/0,test,0.175,100.744,3.409,0.532,2,23.78,421.84,1,2,178.096,445.994,6.0,208.25,99.221,100.744,106.397,106.397,102.582,104.5656,104.5656,102.582,99.221,100.744,106.397,100.744,103.294,100.744,100.744,100.744,99.221,100.744,106.729,100.744,103.907273,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,100.744,3.299,3.299,3.558,3.493,3.423,3.3906,3.3906,3.423,3.299,3.409,3.632,3.409,3.449143,3.409,3.409,3.409,3.299,3.409,3.632,3.409,3.478455,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409,3.409
393,00/11-19-073-16W4/0,test,0.265,91.018,4.864,0.489,2,23.78,421.84,1,2,175.596,445.994,8.5,208.25,67.81,91.018,102.635,102.635,88.5874,98.7966,98.7966,88.5874,58.59,91.018,102.635,105.471,85.816714,100.284,102.356,94.8634,53.847,91.018,102.635,91.018,83.585545,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,91.018,3.541,3.298,7.441,4.864,5.2084,3.8452,3.8452,5.2084,3.298,3.298,9.898,4.864,5.605429,3.758,3.934,6.4798,3.298,4.864,10.327,4.864,6.026455,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864,4.864
394,00/11-19-073-16W4/0,test,0.298,74.735,7.736,0.426,2,23.78,421.84,1,2,173.096,445.994,11.0,208.25,71.051,69.128,74.946,74.946,73.5338,73.189,73.189,73.5338,70.149,53.847,74.946,74.946,72.959,68.991429,73.189,73.9026,63.148,74.735,74.946,74.735,71.501545,74.735,74.735,74.735,53.847,74.735,97.36,74.735,75.98419,74.735,74.735,74.735,7.182,7.736,8.256,9.347,7.6842,8.4198,8.4198,7.6842,7.111,7.736,8.756,10.327,7.755429,8.909429,9.3252,7.999,7.11,7.736,9.94,7.736,7.981636,7.736,7.736,7.736,4.864,7.736,10.327,7.736,7.779095,7.736,7.736,7.736


In [115]:
len(df_featPlus_wUWI_testCopy)

61662

In [104]:
df_featPlus_wUWI_testCopy_wResults = pd.concat([df_featPlus_wUWI_testCopy, df_results_3], axis=1)
df_featPlus_wUWI_testCopy_wResults.tail()

Unnamed: 0,UWI,trainOrTest,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100,cat_isTopMcMrNearby_known_60or95or100,TopMcMr_Pick_pred_60or95or100,cat_isTopMcMrNearby_known_100,TopMcMr_Pick_pred_known_100
307633,00/16-29-073-05W5/0,test,0.132,69.814,7.565,0.377,1,3.0,612.0,1,1,20.25,595.0,231.75,235.0,58.985,69.814,79.901,69.814,69.5914,69.814,69.814,69.814,58.985,69.814,80.779,69.814,70.854714,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,69.814,4.601,7.565,9.848,7.565,7.386,7.565,7.565,7.565,3.528,7.565,9.848,7.565,7.104429,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,7.565,95,0,1,0,1,0,0,0
307634,00/06-26-075-21W4/0,test,0.234,52.644,18.299,0.312,3,17.07,562.66,3,3,13.032,594.442,182.5,201.25,39.106,39.106,94.659,101.032,62.0152,65.09,65.09,62.0152,39.106,39.106,94.659,124.887,69.691286,80.616286,95.372,80.077,39.106,39.106,113.977,146.483,77.909727,101.844273,136.172,97.4862,39.106,52.644,146.483,52.644,93.937857,52.644,52.644,52.644,16.342,13.736,18.642,18.642,17.5814,16.528,16.528,17.5814,14.874,10.919,18.642,18.642,16.854857,15.112,16.528,17.5814,12.225,7.541,18.642,18.642,15.585273,12.802636,16.528,17.5814,7.541,18.299,18.642,18.299,13.319762,18.299,18.299,18.299,95,60,1,0,1,1,0,0
307635,00/06-26-075-21W4/0,test,0.211,75.319,17.535,0.336,3,17.07,562.66,3,3,13.282,594.442,182.75,201.25,39.106,39.106,94.659,84.32,71.0342,59.9474,59.9474,71.0342,39.106,39.106,94.659,113.977,69.303857,73.535143,85.4584,79.5346,39.106,39.106,101.032,142.116,74.123182,95.374818,129.6708,91.0124,39.106,75.319,142.116,75.319,91.498381,75.319,75.319,75.319,15.203,14.874,18.642,18.642,17.2042,17.2878,17.2878,17.2042,14.116,12.225,18.642,18.642,16.746571,16.057143,17.2878,17.5814,12.795,8.334,18.642,18.642,15.637091,13.711182,17.2878,17.5814,8.334,17.535,18.642,17.535,13.511762,17.535,17.535,17.535,95,60,1,0,1,1,0,0
307636,00/06-26-075-21W4/0,test,0.183,94.659,16.342,0.361,3,17.07,562.66,3,3,13.532,594.442,183.0,201.25,52.644,39.106,94.659,94.659,79.5346,62.0152,62.0152,79.5346,39.106,39.106,94.659,101.032,72.761429,70.775429,81.5948,83.516,39.106,39.106,94.659,136.765,71.935727,91.060545,121.454,86.2,39.106,94.659,136.765,94.659,89.694,94.659,94.659,94.659,14.116,16.342,18.299,18.642,16.299,17.5814,17.5814,16.299,13.377,13.736,18.642,18.642,16.216286,16.645286,17.5814,17.2042,12.513,9.142,18.642,18.642,15.525909,14.439182,17.5814,17.5814,9.142,16.342,18.642,16.342,13.624762,16.342,16.342,16.342,95,60,1,0,1,1,0,0
307637,00/06-26-075-21W4/0,test,0.206,93.443,15.203,0.391,3,17.07,562.66,3,3,13.782,594.442,183.25,201.25,72.551,39.106,94.659,94.659,83.516,71.0342,71.0342,83.516,52.644,39.106,94.659,94.659,77.507,69.691286,80.077,83.516,39.106,39.106,94.659,130.609,72.447909,87.122182,113.0328,87.3268,39.106,93.443,130.609,93.443,88.627381,93.443,93.443,93.443,13.377,15.203,17.535,18.642,15.3146,17.2042,17.2042,15.3146,12.795,14.874,18.299,18.642,15.381,16.854857,17.5814,16.299,12.351,10.028,18.642,18.642,15.296545,14.990182,17.5814,17.5814,10.028,15.203,18.642,15.203,13.672571,15.203,15.203,15.203,95,60,1,0,1,1,0,0


In [116]:
len(df_featPlus_wUWI_testCopy_wResults)

61662

In [105]:
wells_in_test = df_featPlus_wUWI_testCopy_wResults['UWI'].unique()
len(wells_in_test)

382

Limit the new dataframe to rows that are less than 1 from the actual pick

In [107]:
df_look_at_pred_class_vs_distFromRealLess1 = df_featPlus_wUWI_testCopy_wResults[df_featPlus_wUWI_testCopy_wResults['DistFrom_NN1_TopDepth_Abs'] < 1 ]

In [108]:
df_look_at_pred_class_vs_distFromRealLess1['cat_isTopMcMrNearby_known'].nunique()

4

Group by predicted label and get counts as a dataframe using nunique

In [118]:
df_count = df_look_at_pred_class_vs_distFromRealLess1.groupby('TopMcMr_Pick_pred').nunique()

In [119]:
df_count

Unnamed: 0_level_0,UWI,trainOrTest,DPHI,GR,ILD,NPHI,McMurray_Base_Qual,NN1_thickness,MM_Top_Depth_predBy_NN1thick,Quality,Quality_paleoz,DistFrom_NN1_TopDepth_Abs,BotWellDept,FromTopWell,WellThickness,GR_min_5winSize_dirAroundMin,GR_min_5winSize_dirAboveMin,GR_min_5winSize_dirAroundMax,GR_min_5winSize_dirAboveMax,GR_min_5winSize_dirAroundMean,GR_min_5winSize_dirAboveMean,GR_min_5winSize_dirAbovenLarge,GR_min_5winSize_dirAroundnLarge,GR_min_7winSize_dirAroundMin,GR_min_7winSize_dirAboveMin,GR_min_7winSize_dirAroundMax,GR_min_7winSize_dirAboveMax,GR_min_7winSize_dirAroundMean,GR_min_7winSize_dirAboveMean,GR_min_7winSize_dirAbovenLarge,GR_min_7winSize_dirAroundnLarge,GR_min_11winSize_dirAroundMin,GR_min_11winSize_dirAboveMin,GR_min_11winSize_dirAroundMax,GR_min_11winSize_dirAboveMax,GR_min_11winSize_dirAroundMean,GR_min_11winSize_dirAboveMean,GR_min_11winSize_dirAbovenLarge,GR_min_11winSize_dirAroundnLarge,GR_min_21winSize_dirAroundMin,GR_min_21winSize_dirAboveMin,GR_min_21winSize_dirAroundMax,GR_min_21winSize_dirAboveMax,GR_min_21winSize_dirAroundMean,GR_min_21winSize_dirAboveMean,GR_min_21winSize_dirAbovenLarge,GR_min_21winSize_dirAroundnLarge,ILD_min_5winSize_dirAroundMin,ILD_min_5winSize_dirAboveMin,ILD_min_5winSize_dirAroundMax,ILD_min_5winSize_dirAboveMax,ILD_min_5winSize_dirAroundMean,ILD_min_5winSize_dirAboveMean,ILD_min_5winSize_dirAbovenLarge,ILD_min_5winSize_dirAroundnLarge,ILD_min_7winSize_dirAroundMin,ILD_min_7winSize_dirAboveMin,ILD_min_7winSize_dirAroundMax,ILD_min_7winSize_dirAboveMax,ILD_min_7winSize_dirAroundMean,ILD_min_7winSize_dirAboveMean,ILD_min_7winSize_dirAbovenLarge,ILD_min_7winSize_dirAroundnLarge,ILD_min_11winSize_dirAroundMin,ILD_min_11winSize_dirAboveMin,ILD_min_11winSize_dirAroundMax,ILD_min_11winSize_dirAboveMax,ILD_min_11winSize_dirAroundMean,ILD_min_11winSize_dirAboveMean,ILD_min_11winSize_dirAbovenLarge,ILD_min_11winSize_dirAroundnLarge,ILD_min_21winSize_dirAroundMin,ILD_min_21winSize_dirAboveMin,ILD_min_21winSize_dirAroundMax,ILD_min_21winSize_dirAboveMax,ILD_min_21winSize_dirAroundMean,ILD_min_21winSize_dirAboveMean,ILD_min_21winSize_dirAbovenLarge,ILD_min_21winSize_dirAroundnLarge,cat_isTopMcMrNearby_known,TopMcMr_Pick_pred,cat_isTopMcMrNearby_known_95or100,TopMcMr_Pick_pred_95or100,cat_isTopMcMrNearby_known_60or95or100,TopMcMr_Pick_pred_60or95or100,cat_isTopMcMrNearby_known_100,TopMcMr_Pick_pred_known_100
TopMcMr_Pick_pred,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1
0,17,1,21,22,22,22,3,17,17,3,3,12,17,21,14,18,19,20,22,22,22,22,22,17,20,21,20,22,22,22,22,19,20,20,22,22,22,22,21,20,20,22,20,22,22,21,22,20,17,22,22,22,22,22,22,18,17,22,22,22,22,22,22,17,22,22,22,22,22,22,22,22,22,21,22,22,22,22,22,2,1,1,1,2,1,1,1
60,212,1,260,620,610,247,4,163,202,3,4,300,177,409,135,446,466,492,480,623,623,623,622,413,444,472,439,623,623,596,600,371,399,427,393,623,623,531,567,340,404,371,371,623,622,462,490,543,536,497,522,617,616,614,611,519,487,450,498,617,616,593,564,470,442,398,474,615,616,560,501,418,398,383,467,616,615,532,464,4,1,2,1,2,1,2,1
95,119,1,152,287,269,150,4,104,115,3,4,187,106,257,87,248,247,222,232,287,287,287,287,239,217,194,225,287,287,279,259,217,193,169,225,287,287,269,223,187,199,153,216,287,287,255,210,217,224,254,242,274,274,273,273,201,223,245,207,274,273,253,271,174,224,215,189,274,274,216,260,171,209,175,200,273,274,215,222,4,1,2,1,2,1,2,1
100,69,1,94,137,137,88,4,56,65,3,4,34,58,107,46,126,122,111,113,137,137,137,137,119,112,99,113,137,137,132,132,105,100,95,106,137,137,128,119,91,89,88,99,137,137,126,109,118,125,130,122,137,137,137,137,111,120,125,117,137,137,134,134,102,111,115,100,137,137,118,130,96,109,85,97,137,137,106,103,4,1,2,1,2,1,2,1


In [120]:
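# Sum the per-label unique-well counts. (.unique() is dropped: it would discard repeated counts before summing.)
total_rows_less_than_1_from_pick = df_count['UWI'].sum()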
total_rows_less_than_1_from_pick

417

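Why does this total (417) differ from the 382 unique test wells counted above? `nunique` counts wells separately for each predicted label, so a well whose rows within 1 of the pick received more than one predicted label is counted once per label, and the sum can exceed the number of wells. Were there also wells in the test dataset that had no rows within 1 of the pick?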

In [121]:
df_count['UWI']

TopMcMr_Pick_pred
0       17
60     212
95     119
100     69
Name: UWI, dtype: int64
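
The per-label well counts above need not add up to the number of distinct wells. The toy sketch below (the frame `df_demo` and its values are made up) shows a well being counted under two predicted labels, and also shows what happens when `.unique()` is applied to the counts before summing.

```python
import pandas as pd

# Toy rows near the pick: well w1 has rows predicted as both 60 and 95.
df_demo = pd.DataFrame({
    "UWI": ["w1", "w1", "w1", "w2", "w2", "w3"],
    "pred": [60, 95, 95, 100, 100, 0],
})

# Unique wells per predicted label (one count per label).
wells_per_label = df_demo.groupby("pred")["UWI"].nunique()

# Summing the per-label counts counts w1 twice.
sum_over_labels = wells_per_label.sum()

# Distinct wells overall.
distinct_wells = df_demo["UWI"].nunique()

# Applying .unique() to the counts first collapses equal counts into one value.
collapsed = wells_per_label.unique().sum()

print(wells_per_label)
print(sum_over_labels, distinct_wells, collapsed)
```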

In [132]:
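def getPercents(df,total_wells):
    # df is a Series of counts indexed by predicted label.
    # The printed "label" is the position in that Series (0, 1, 2, ...), not the label value itself,
    # and the printed "%" is a fraction of total_wells (0 to 1).
    index_num = -1
    for Each in df:
        index_num = index_num+1
        print("label is =", index_num," and total instaces of that label =",Each, "and the % is: ",Each/total_wells)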

In [133]:
getPercents(df_count['UWI'],total_rows_less_than_1_from_pick)

label is = 0  and total instaces of that label = 17 and the % is:  0.0407673860911
label is = 1  and total instaces of that label = 212 and the % is:  0.508393285372
label is = 2  and total instaces of that label = 119 and the % is:  0.285371702638
label is = 3  and total instaces of that label = 69 and the % is:  0.165467625899
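
The same shares can be tabulated against the actual predicted labels rather than their positions. This sketch re-enters the four counts printed above by hand; the names `counts`, `shares`, and `summary` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Per-label unique-well counts copied from the output above.
counts = pd.Series(
    [17, 212, 119, 69],
    index=pd.Index([0, 60, 95, 100], name="TopMcMr_Pick_pred"),
    name="UWI",
)

# Fraction of the summed counts that falls under each predicted label.
shares = counts / counts.sum()

# One table keyed by the real label values.
summary = pd.DataFrame({"count": counts, "share": shares.round(4)})

# Share of counts predicted as any at-or-near-pick class (60, 95, or 100).
near_pick_share = shares.drop(0).sum()

print(summary)
print(near_pick_share)
```

Keeping the label values as the index avoids having to remember that positions 0 to 3 stand for labels 0, 60, 95, and 100.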


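#### The numbers above show, for each predicted label, how many unique wells (UWI) have at least one row within 1 of the actual pick. The printed labels 0 to 3 are positions that correspond to predicted labels 0, 60, 95, and 100.
#### Very few of these (17 of 417, about 4%) fall under predicted class 0, the class for rows more than 5 from the pick. The depth unit (feet?) is not confirmed here.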

In [134]:
df_look_at_pred_class_vs_distFromRealLess1 = df_featPlus_wUWI_testCopy_wResults[df_featPlus_wUWI_testCopy_wResults['DistFrom_NN1_TopDepth_Abs'] < 1 ]

In [135]:
df_count = df_look_at_pred_class_vs_distFromRealLess1.groupby('TopMcMr_Pick_pred').nunique()

In [136]:
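# Sum the per-class unique-well counts (.unique() is omitted because it would drop classes that share the same count)
total_rows_less_than_1_from_pick = df_count['UWI'].sum()
total_rows_less_than_1_from_pick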

417

In [137]:
df_count['UWI']

TopMcMr_Pick_pred
0       17
60     212
95     119
100     69
Name: UWI, dtype: int64

In [138]:
getPercents(df_count['UWI'],total_rows_less_than_1_from_pick)

label is = 0  and total instaces of that label = 17 and the % is:  0.0407673860911
label is = 1  and total instaces of that label = 212 and the % is:  0.508393285372
label is = 2  and total instaces of that label = 119 and the % is:  0.285371702638
label is = 3  and total instaces of that label = 69 and the % is:  0.165467625899


In [139]:
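def getStatsOnWithinDistOfPick(df,distOfPick):
    # Keep only rows closer than distOfPick to the actual pick
    df_look_at_pred_class_vs_distFromRealLessNum = df[df['DistFrom_NN1_TopDepth_Abs'] < distOfPick]
    # Count unique values per predicted class; the 'UWI' column then holds unique wells per class
    df_count = df_look_at_pred_class_vs_distFromRealLessNum.groupby('TopMcMr_Pick_pred').nunique()
    # Plain sum of the per-class counts (.unique() would drop classes that share the same count)
    total_rows_less_than_Num_from_pick = df_count['UWI'].sum()
    getPercents(df_count['UWI'],total_rows_less_than_Num_from_pick)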

In [140]:
getStatsOnWithinDistOfPick(df_featPlus_wUWI_testCopy_wResults,5)

label is = 0  and total instaces of that label = 53 and the % is:  0.0706666666667
label is = 1  and total instaces of that label = 329 and the % is:  0.438666666667
label is = 2  and total instaces of that label = 198 and the % is:  0.264
label is = 3  and total instaces of that label = 170 and the % is:  0.226666666667


### What this tells us is that most rows at or near the actual pick are also predicted to be in one of the at-or-near-pick classes (only about 7% within 5 of the pick are predicted as class 0), which is good!

----------

# Turning row-by-row classification into single pick value prediction<a name="classificationToPick"></a>

1. Create function that treats depth & classification prediction column like histogram and finds median value (in this case depth)
2. Create widgeted function that changes values of labels to shift how much weight is given to each class (at pick, right by pick, sorta nearby pick, etc.)
3. Visualize step #2
4. Use steps 1,2,3 to create a new prediction that is a depth for each well
5. Calculate average distance between actual pick and predicted pick
6. Plot results of step 5 as simple scatter plot
7. Plot results of step 5 as map
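
Steps 1 and 5 above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function names `weighted_median_depth` and `mean_abs_pick_error`, the class weights, and the example depths are all assumptions made up for this sketch. Only the class values 0, 60, 95, and 100 come from the predictions above. Each row's predicted class is turned into a weight, the weights are treated as histogram counts along depth, and the depth where the cumulative weight first reaches half the total is taken as the pick.

```python
def weighted_median_depth(depths, pred_classes, class_weights):
    """Return the depth where cumulative class weight first reaches half the total."""
    # Pair each depth with the weight of its predicted class and sort by depth
    pairs = sorted((d, class_weights.get(c, 0.0)) for d, c in zip(depths, pred_classes))
    total = sum(w for _, w in pairs)
    if total <= 0:
        return None  # no row carried any weight, so no pick can be estimated
    running = 0.0
    for depth, weight in pairs:
        running += weight
        if running >= total / 2.0:
            return depth


def mean_abs_pick_error(predicted_picks, actual_picks):
    """Average absolute distance between predicted and actual picks (step 5)."""
    diffs = [abs(p - a) for p, a in zip(predicted_picks, actual_picks)]
    return sum(diffs) / len(diffs)


# Hypothetical weights: classes closer to the pick get more influence
example_weights = {0: 0.0, 60: 1.0, 95: 3.0, 100: 6.0}

# Hypothetical rows for a single well
depths = [500.0, 500.5, 501.0, 501.5, 502.0, 502.5, 503.0]
preds = [0, 60, 95, 100, 95, 60, 0]

predicted_pick = weighted_median_depth(depths, preds, example_weights)
print("predicted pick depth:", predicted_pick)
```

In the notebook this would be applied once per well, for example by grouping the results dataframe on `UWI`. The weights are the knob that the widget in step 2 would adjust.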