# Conoco Phillips Challenge
---
A data set has been provided that has documented failure events that occurred on surface equipment and down-hole equipment. For each failure event, data has been collected from over 107 sensors that collect a variety of physical information both on the surface and below the ground.

Using this data, can we predict failures that occur both on the surface and below the ground? Using this information, how can we minimize costs associated with failures?

The goal of this challenge will be to predict surface and down-hole failures using the data set provided. This information can be used to send crews out to a well location to fix equipment on the surface or send a workover rig to the well to pull down-hole equipment and address the failure.

In [1]:
import pandas as pd
import numpy as np
print('Setup complete.')

Setup complete.


In [2]:
data = pd.read_csv('equipfails/equip_failures_training_set.csv', na_values=['na'], dtype=np.float64)

FileNotFoundError: [Errno 2] File b'equipfails/equip_failures_training_set.csv' does not exist: b'equipfails/equip_failures_training_set.csv'

## Data Pre-Processing
---

In this cell, we filter out the data using three criteria: 

    1) Remove all columns with a histogram bin for a sensor in them

    2) Remove all columns that have more than X% NaNs in them

    3) Remove all columns that have more than X% zeroes in them

In [513]:
threshold = 0.10 * len(data[col])
cols = []

for col in data.columns:
    if 'histogram' in col:
        cols.append(col)
        
    elif data[col].isna().sum() > threshold:
        cols.append(col)
        
    elif data[col].isin([0]).sum(axis=0) > threshold and col != 'target':
        cols.append(col)
        
data.drop(columns=cols, inplace=True, axis=1)
data.fillna(value=0.0)
data.describe()

Unnamed: 0,id,target,sensor1_measure,sensor8_measure,sensor14_measure,sensor15_measure,sensor16_measure,sensor17_measure,sensor27_measure,sensor29_measure,...,sensor59_measure,sensor61_measure,sensor67_measure,sensor79_measure,sensor80_measure,sensor89_measure,sensor94_measure,sensor95_measure,sensor96_measure,sensor97_measure
count,60000.0,60000.0,60000.0,59355.0,59358.0,59411.0,59358.0,59411.0,59355.0,57273.0,...,59662.0,59662.0,59309.0,57497.0,57276.0,59309.0,57273.0,57273.0,57274.0,57274.0
mean,30000.5,0.016667,59336.5,1809931.0,3461037.0,3002440.0,1004160.0,442404.5,4526177.0,921.775461,...,3481204.0,714342.7,4515325.0,3123.961911,375.147112,33745.45,89655.0,15403.35467,4058712.0,593835.0
std,17320.652413,0.12802,145430.1,4185740.0,7790350.0,6819518.0,3088457.0,1262469.0,10886740.0,4833.065431,...,8355997.0,2180714.0,10859900.0,9516.675102,1689.062059,97337.19,208201.6,33801.022975,11567770.0,2082998.0
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,15000.75,0.0,834.0,29733.0,73238.5,65585.0,25189.0,4161.0,105601.0,8.0,...,48249.84,14586.72,105444.0,132.0,66.0,660.0,684.0,150.0,5380.0,742.0
50%,30000.5,0.0,30776.0,1002420.0,1918629.0,1643556.0,357281.0,178792.0,2360728.0,66.0,...,1858641.0,250267.2,2359656.0,1354.0,144.0,14330.0,47940.0,8316.0,185400.0,30592.0
75%,45000.25,0.0,48668.0,1601366.0,3128416.0,2675796.0,724660.5,376900.0,3868370.0,438.0,...,2947266.0,549352.3,3863322.0,2678.0,296.0,27340.0,99202.0,17630.0,3472540.0,534655.5
max,60000.0,1.0,2746564.0,74247320.0,140861800.0,122201800.0,77934940.0,25562650.0,192871500.0,306452.0,...,140986100.0,55428670.0,192871500.0,445142.0,176176.0,2924584.0,4970962.0,656432.0,460207600.0,127034500.0


## Build the model


In [516]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(32,)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer='adam',
             loss='categorical_crossentropy',
             metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_94 (Dense)             (None, 64)                2112      
_________________________________________________________________
dropout_19 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_95 (Dense)             (None, 64)                4160      
_________________________________________________________________
dropout_20 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_96 (Dense)             (None, 2)                 130       
Total params: 6,402
Trainable params: 6,402
Non-trainable params: 0
_________________________________________________________________


In [517]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.utils import to_categorical

X = data.loc[:, 'sensor1_measure'::]
y = data['target']

scaler = StandardScaler()
X = scaler.fit_transform(X)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

y_binary = to_categorical(y_train)
model.fit(x_train, y_binary, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a9128e3d0>

Evaluate the model

In [518]:
y_test_bin = to_categorical(y_test)
model.evaluate(x_test, y_test_bin)



[nan, 0.9846666666666667]

In [506]:
from sklearn.metrics import f1_score
y_pred = model.predict(x_test)
print(y_pred)
# f1_score(y_pred, y_test_bin)

[[nan nan]
 [nan nan]
 [nan nan]
 ...
 [nan nan]
 [nan nan]
 [nan nan]]
