# ConocoPhillips Challenge
---
The provided data set documents failure events on surface and down-hole equipment. For each failure event, data has been collected from over 107 sensors that record a variety of physical measurements both on the surface and below the ground.

Using this data, can we predict failures that occur both on the surface and below the ground? Using this information, how can we minimize costs associated with failures?

The goal of this challenge will be to predict surface and down-hole failures using the data set provided. This information can be used to send crews out to a well location to fix equipment on the surface or send a workover rig to the well to pull down-hole equipment and address the failure.

In [609]:
import pandas as pd
import numpy as np
print('Setup complete.')

Setup complete.


In [610]:
data = pd.read_csv('equip_failures_training_set.csv', na_values=['na'], dtype=np.float64)

## Data Pre-Processing
---

In the next cell, we drop every column that meets any of three criteria, using a threshold of 10% of the rows, and then fill the remaining NaNs with 0:

1. The column is a histogram bin for a sensor.
2. More than 10% of the column's values are NaN.
3. More than 10% of the column's values are zero (the `target` column is exempt).

In [611]:
# Drop a column when more than 10% of its rows are NaN or zero.
threshold = 0.10 * len(data)
cols = []

for col in data.columns:
    if 'histogram' in col:
        # Histogram-bin columns are always dropped.
        cols.append(col)
    elif data[col].isna().sum() > threshold:
        cols.append(col)
    elif col != 'target' and (data[col] == 0).sum() > threshold:
        # 'target' is mostly zeros by design, so it is exempt.
        cols.append(col)

data.drop(columns=cols, inplace=True)
data.fillna(value=0.0, inplace=True)  # remaining NaNs become 0
data.describe()

| | id | target | sensor1_measure | sensor8_measure | sensor14_measure | sensor15_measure | sensor16_measure | sensor17_measure | sensor27_measure | sensor29_measure | ... | sensor59_measure | sensor61_measure | sensor67_measure | sensor79_measure | sensor80_measure | sensor89_measure | sensor94_measure | sensor95_measure | sensor96_measure | sensor97_measure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | ... | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 | 60000.0 |
| mean | 30000.5 | 0.016667 | 59336.5 | 1790474.0 | 3424004.0 | 2972966.0 | 993415.0 | 438061.5 | 4477521.0 | 879.880767 | ... | 3461593.0 | 710318.6 | 4463324.0 | 2993.640633 | 358.115433 | 33356.82 | 85580.18 | 14703.2722 | 3874311.0 | 566855.1 |
| std | 17320.652413 | 0.12802 | 145430.1 | 4167363.0 | 7756737.0 | 6792416.0 | 3073626.0 | 1257014.0 | 10838120.0 | 4725.856739 | ... | 8336500.0 | 2175220.0 | 10807930.0 | 9336.970915 | 1652.121315 | 96842.01 | 204270.5 | 33179.435232 | 11333490.0 | 2038882.0 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 15000.75 | 0.0 | 834.0 | 27549.0 | 67934.0 | 60874.0 | 23700.5 | 3906.0 | 98863.5 | 6.0 | ... | 45442.32 | 14074.32 | 98247.5 | 108.0 | 60.0 | 624.0 | 418.0 | 96.0 | 3160.0 | 486.0 |
| 50% | 30000.5 | 0.0 | 30776.0 | 989741.0 | 1892068.0 | 1620136.0 | 350133.0 | 175051.0 | 2332741.0 | 56.0 | ... | 1846821.0 | 247503.4 | 2329541.0 | 1240.0 | 134.0 | 14050.0 | 39492.0 | 6908.0 | 142190.0 | 23115.0 |
| 75% | 45000.25 | 0.0 | 48668.0 | 1585356.0 | 3101668.0 | 2649132.0 | 718323.0 | 373568.5 | 3837839.0 | 408.0 | ... | 2934873.0 | 546440.6 | 3831318.0 | 2594.0 | 284.0 | 27072.5 | 96420.0 | 17080.0 | 3095990.0 | 491102.0 |
| max | 60000.0 | 1.0 | 2746564.0 | 74247320.0 | 140861800.0 | 122201800.0 | 77934940.0 | 25562650.0 | 192871500.0 | 306452.0 | ... | 140986100.0 | 55428670.0 | 192871500.0 | 445142.0 | 176176.0 | 2924584.0 | 4970962.0 | 656432.0 | 460207600.0 | 127034500.0 |


## PCA
---


## Imbalanced Data
---
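
The `describe()` output above reports a `target` mean of 0.016667, so failures make up roughly 1.7% of the 60,000 rows. One common way to compensate is to weight each class inversely to its frequency. The sketch below computes such weights with NumPy on a made-up label vector with about the same proportions. The `balanced_class_weights` helper is hypothetical and not part of the notebook.

```python
import numpy as np

def balanced_class_weights(y):
    """Hypothetical helper: weight_c = n_samples / (n_classes * count_c)."""
    y = np.asarray(y).astype(int)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Made-up labels with roughly the same failure rate as the training set.
y_demo = np.array([0] * 59000 + [1] * 1000)
class_weight = balanced_class_weights(y_demo)
print(class_weight)
```

A dictionary of this form could be passed to Keras as `model.fit(..., class_weight=class_weight)`, so that errors on the rare failure class count more in the loss.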

## Build the model
---


In [613]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

# One hidden layer with dropout, then a two-way softmax (no failure / failure).
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(32,)))  # 32 input features
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

sgd = SGD(lr=1.0, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_137 (Dense)            (None, 64)                2112      
_________________________________________________________________
dropout_48 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_138 (Dense)            (None, 2)                 130       
Total params: 2,242
Trainable params: 2,242
Non-trainable params: 0
_________________________________________________________________


In [614]:
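from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.utils import to_categorical

# Features are every column from sensor1_measure onward; the label is 'target'.
X = data.loc[:, 'sensor1_measure':]
y = data['target']

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Alternative: fit on the training split only.
# y_binary = to_categorical(y_train)
# model.fit(x_train, y_binary, epochs=10, batch_size=32)

# Fit on the full data set, with labels one-hot encoded for the softmax output.
y_binary = to_categorical(y)
model.fit(X, y_binary, epochs=5, batch_size=32)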

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a960c48d0>
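
The cell above fits the `StandardScaler` on all rows before splitting and then trains on the full `X`, so the rows in `x_test` are not unseen when the model is evaluated. A minimal sketch of the alternative ordering is shown below on random data: split first, then compute the scaling statistics from the training rows only. It uses plain NumPy so that it is self-contained; the array names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # made-up features
y_demo = (rng.random(1000) < 0.05).astype(int)           # made-up labels

# 1) Split first: shuffle the row indices and hold out 20%.
idx = rng.permutation(len(X_demo))
n_test = int(0.20 * len(X_demo))
test_idx, train_idx = idx[:n_test], idx[n_test:]
x_train, x_test = X_demo[train_idx], X_demo[test_idx]
y_train, y_test = y_demo[train_idx], y_demo[test_idx]

# 2) Compute the scaling statistics from the training rows only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns

# 3) Apply the same statistics to both splits.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std
```

With scikit-learn, this corresponds to calling `fit_transform` on the training split and `transform` on the test split, and then fitting the model on the training split alone.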

## Evaluate the model
---

In [615]:
y_test_bin = to_categorical(y_test)
model.evaluate(x_test, y_test_bin)



[0.2471442198753357, 0.9846666666666667]
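
The second number returned by `evaluate` is the accuracy, about 98.5%. Since the `target` mean reported earlier is 0.016667, a classifier that always predicts "no failure" would already be right on roughly 98.3% of rows, so accuracy alone says little here. The arithmetic can be checked with a short sketch; the label vector is made up to match that failure rate.

```python
import numpy as np

# Made-up labels: 1,000 failures in 60,000 rows (a mean of about 0.016667).
y_demo = np.array([0] * 59000 + [1] * 1000)

# A trivial baseline that never predicts a failure.
always_zero = np.zeros_like(y_demo)

baseline_accuracy = (always_zero == y_demo).mean()
print(round(baseline_accuracy, 4))  # 0.9833
```

Metrics that focus on the rare class, such as precision, recall, or F1, are more informative for data this imbalanced.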

In [640]:
from sklearn.metrics import f1_score

# Column i of the softmax output is the probability of class i,
# so the predicted label is the index of the larger column.
y_pred = model.predict(x_test)
y_guess = np.argmax(y_pred, axis=1)

f1_score(y_test, y_guess)
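
With labels encoded by `to_categorical`, column 0 of the softmax output is the probability of class 0 and column 1 is the probability of class 1, so the predicted label is the index of the larger column. The sketch below decodes a few made-up probability rows with `np.argmax` and computes F1 by hand from the confusion counts; the arrays are illustrative and not taken from the model.

```python
import numpy as np

# Made-up softmax outputs: each row is [P(class 0), P(class 1)].
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.8, 0.2],
])
y_true_demo = np.array([0, 1, 1, 1, 0])

# The predicted label is the column with the larger probability.
y_hat = np.argmax(probs, axis=1)  # array([0, 1, 0, 1, 0])

# F1 for the positive class (1) from true positives, false positives,
# and false negatives.
tp = np.sum((y_hat == 1) & (y_true_demo == 1))
fp = np.sum((y_hat == 1) & (y_true_demo == 0))
fn = np.sum((y_hat == 0) & (y_true_demo == 1))
f1 = 2 * tp / (2 * tp + fp + fn)
print(f1)  # 0.8
```

For binary labels, scikit-learn's `f1_score(y_true, y_pred)` computes the same quantity with its default settings.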

## Test Set
---

In [631]:
test_data = pd.read_csv('equipfails/equip_failures_test_set.csv', na_values=['na'], dtype=np.float64)
# Copy the slice so that fillna below modifies an independent frame.
X = test_data.loc[:, 'sensor1_measure':].copy()

In [632]:
X.fillna(value=0.0, inplace=True)
X.head()


| | sensor1_measure | sensor2_measure | sensor3_measure | sensor4_measure | sensor5_measure | sensor6_measure | sensor7_histogram_bin0 | sensor7_histogram_bin1 | sensor7_histogram_bin2 | sensor7_histogram_bin3 | ... | sensor105_histogram_bin2 | sensor105_histogram_bin3 | sensor105_histogram_bin4 | sensor105_histogram_bin5 | sensor105_histogram_bin6 | sensor105_histogram_bin7 | sensor105_histogram_bin8 | sensor105_histogram_bin9 | sensor106_measure | sensor107_measure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66888.0 | 0.0 | 2130706000.0 | 332.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 544762.0 | 504820.0 | 1597028.0 | 631494.0 | 5644.0 | 5448.0 | 11096.0 | 1982.0 | 0.0 | 0.0 |
| 1 | 91122.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 696774.0 | 345742.0 | 939332.0 | 943744.0 | 504048.0 | 203698.0 | 287374.0 | 36566.0 | 0.0 | 0.0 |
| 2 | 218924.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 280.0 | 119070.0 | 1236386.0 | ... | 1032974.0 | 866000.0 | 1645644.0 | 1154924.0 | 3549128.0 | 1550716.0 | 15900.0 | 0.0 | 0.0 | 0.0 |
| 3 | 16.0 | 0.0 | 30.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 70.0 | 24.0 | 40.0 | 12.0 | 56.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 39084.0 | 0.0 | 1054.0 | 1032.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 276304.0 | 123720.0 | 225722.0 | 281462.0 | 295244.0 | 256146.0 | 241074.0 | 2372.0 | 0.0 | 0.0 |
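
The test frame shown above still contains the histogram-bin columns and other columns that were dropped from the training data, while the model was built for the 32 features that survived filtering. A minimal sketch of aligning a test frame to the training columns with `DataFrame.reindex` follows; the tiny frames and their values are made up, and only the column-naming pattern is borrowed from the data set.

```python
import pandas as pd

# Made-up training frame after filtering: only two sensor columns survive.
train = pd.DataFrame({
    'id': [1.0, 2.0],
    'target': [0.0, 1.0],
    'sensor1_measure': [10.0, 20.0],
    'sensor8_measure': [5.0, 7.0],
})
feature_cols = train.loc[:, 'sensor1_measure':].columns

# Made-up raw test frame: extra columns and a missing value.
test = pd.DataFrame({
    'id': [1.0, 2.0],
    'sensor1_measure': [30.0, None],
    'sensor2_measure': [0.0, 0.0],
    'sensor7_histogram_bin0': [1.0, 2.0],
    'sensor8_measure': [9.0, 11.0],
})

# Keep exactly the training features, in the same order, then fill NaNs.
X_test = test.reindex(columns=feature_cols).fillna(0.0)
print(list(X_test.columns))  # ['sensor1_measure', 'sensor8_measure']
```

The scaler fitted on the training features would then be applied with `transform`, not refit, before calling `model.predict`.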


In [None]:
y_true = 