### Problem Statement

The high-level problem statement is given on the competition’s description page. The task is to predict which customers are high-value for the business from operational interaction data, so that the company can prioritize resources effectively, generate more business, and serve its customers better.

Let’s look at the problem statement from a more business-centric view, starting with the organization itself. Red Hat is an American multinational software company that provides open source software products to the enterprise community. Its primary product is Red Hat Enterprise Linux, one of the most popular enterprise distributions of Linux, used by many large enterprises. Through its services, it helps organizations align their IT strategies by providing enterprise-grade solutions through an open business model and an affordable, predictable subscription model. Subscriptions from large enterprise customers make up a substantial part of its revenue, so it is of paramount importance for the company to understand its valuable customers and to serve them better by prioritizing resources and strategies that drive business value.

### How Can We Identify a Potential Customer?

Red Hat has been in business for over 25 years. Over that time, it has
captured a vast amount of data on customer interactions, along with the
customers' descriptive attributes. This rich source of data could be a
gold mine: the historical patterns in the interaction data can help in
identifying a potential high-value customer.

With the ever-growing popularity and prowess of DL, we can develop
a DNN that learns from historical customer attributes and operational
interaction data and predicts whether a new customer is likely to be a
high-value customer for various business services.

We will therefore develop and train a DNN to estimate the probability that
a customer is a potential high-value customer, using the customer
attributes and the operational interaction attributes.

In [29]:
import numpy as np
import pandas as pd
from keras.layers import Dense, Activation
from keras.optimizers import SGD
from keras.callbacks import ModelCheckpoint

In [30]:
act_train = pd.read_csv('data2/act_train.csv')
act_test = pd.read_csv('data2/act_test.csv')
people = pd.read_csv('data2/people.csv')


In [31]:
# Save the test IDs for Kaggle submission
test_ids = act_test['activity_id']

def preprocess_acts(data, train_set=True):

    # Drop the date feature and the activity ID for now
    data = data.drop(['date', 'activity_id'], axis=1)
    if train_set:
        data = data.drop(['outcome'], axis=1)

    # Strip the 'ppl_' prefix from people_id and convert it to an int
    data['people_id'] = data['people_id'].apply(lambda x: x.split('_')[1])
    data['people_id'] = pd.to_numeric(data['people_id']).astype(int)

    columns = list(data.columns)

    # Convert 'type N' strings to the int N; missing values become 0
    for col in columns[1:]:
        data[col] = data[col].fillna('type 0')
        data[col] = data[col].apply(lambda x: x.split(' ')[1])
        data[col] = pd.to_numeric(data[col]).astype(int)
    return data
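
To make the string-to-integer conversion concrete, here is a minimal sketch on a toy frame. The `toy` name and its values are made up for illustration; only the `'ppl_N'` and `'type N'` formats are taken from the function above.

```python
import pandas as pd

# Toy data (hypothetical values) in the same 'ppl_N' / 'type N' formats
toy = pd.DataFrame({
    'people_id': ['ppl_100', 'ppl_2500'],
    'activity_category': ['type 4', 'type 2'],
    'char_1': ['type 3', None],   # a missing categorical value
})

# 'ppl_100' -> 100
toy['people_id'] = toy['people_id'].str.split('_').str[1].astype(int)

# 'type 4' -> 4; missing values go through the 'type 0' placeholder and become 0
for col in ['activity_category', 'char_1']:
    toy[col] = toy[col].fillna('type 0').str.split(' ').str[1].astype(int)

print(toy)
```

Note that this encoding treats category codes as ordinary integers, so the network sees `type 4` as "twice" `type 2`. That is a simplification of this approach rather than a property of the data.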

In [32]:
def preprocess_people(data):
    
    # TODO refactor this duplication
    data = data.drop(['date'], axis=1)
    data['people_id'] = data['people_id'].apply(lambda x: x.split('_')[1])
    data['people_id'] = pd.to_numeric(data['people_id']).astype(int)
    
    # The remaining columns are 'type N'-style strings (columns[1:11])
    # followed by Booleans and the numeric char_38 (columns[11:])
    columns = list(data.columns)
    bools = columns[11:]
    strings = columns[1:11]
    
    for col in bools:
        data[col] = pd.to_numeric(data[col]).astype(int)        
    for col in strings:
        data[col] = data[col].fillna('type 0')
        data[col] = data[col].apply(lambda x: x.split(' ')[1])
        data[col] = pd.to_numeric(data[col]).astype(int)
    return data

In [33]:
# Preprocess each df
peeps = preprocess_people(people)
actions_train = preprocess_acts(act_train)
actions_test = preprocess_acts(act_test, train_set=False)

# Merge into a unified table

# Training 
features = actions_train.merge(peeps, how='left', on='people_id')
labels = act_train['outcome']


In [34]:
# Testing
test = actions_test.merge(peeps, how='left', on='people_id')

# Check it out...
features.sample(10)

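| | people_id | activity_category | char_1_x | char_2_x | char_3_x | char_4_x | char_5_x | char_6_x | char_7_x | char_8_x | ... | char_29 | char_30 | char_31 | char_32 | char_33 | char_34 | char_35 | char_36 | char_37 | char_38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 79634 | 105776 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1739874 | 381274 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 46 |
| 21887 | 103828 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 70 |
| 1311241 | 315216 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 94 |
| 1222139 | 299252 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 346863 | 154083 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 72 |
| 714117 | 220439 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1276568 | 308846 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1026388 | 273835 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 83 |
| 1669654 | 370270 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |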
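
The `_x` suffixes in the sample above come from the merge. Both tables contain columns named `char_1`, `char_2`, and so on, and pandas by default renames overlapping non-key columns with `_x` (left table) and `_y` (right table). A small sketch with made-up miniature tables (`acts` and `ppl` are hypothetical names):

```python
import pandas as pd

# Hypothetical miniature versions of the activity and people tables
acts = pd.DataFrame({'people_id': [1, 1, 2], 'char_1': [5, 0, 7]})
ppl = pd.DataFrame({'people_id': [1, 2], 'char_1': [2, 9], 'char_38': [36, 0]})

# A left join keeps every activity row and attaches the matching person's attributes
merged = acts.merge(ppl, how='left', on='people_id')

print(merged.columns.tolist())
# ['people_id', 'char_1_x', 'char_1_y', 'char_38']
```

Passing `suffixes=('_act', '_ppl')` to `merge` would give more descriptive column names, if that is preferred.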


In [38]:
from keras.models import Sequential


In [39]:
def create_model_v1(input_dim):
    nb_classes = 1  # a single sigmoid output for binary classification

    model = Sequential()

    # Two fully connected hidden layers of 100 ReLU units each
    model.add(Dense(100, input_dim=input_dim, activation='relu'))
    model.add(Dense(100, activation='relu'))

    # Output layer: probability that outcome == 1
    model.add(Dense(nb_classes))
    model.add(Activation('sigmoid'))

    # SGD with Nesterov momentum and no learning-rate decay
    sgd = SGD(learning_rate=0.05, momentum=0.95, nesterov=True)
    #sgd = SGD(lr=1e-2, decay=1e-6)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model
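
As a rough illustration of what this architecture computes, the following NumPy sketch runs one forward pass through two ReLU layers of 100 units and a sigmoid output. The weights are random stand-ins for trained parameters, and `input_dim = 51` is an assumed feature count chosen only for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim = 51                                # assumed feature count, illustration only
batch = rng.normal(size=(4, input_dim))       # four synthetic, already-scaled rows

# Randomly initialised weights standing in for trained parameters
W1, b1 = rng.normal(scale=0.1, size=(input_dim, 100)), np.zeros(100)
W2, b2 = rng.normal(scale=0.1, size=(100, 100)), np.zeros(100)
W3, b3 = rng.normal(scale=0.1, size=(100, 1)), np.zeros(1)

h1 = relu(batch @ W1 + b1)        # first Dense(100, relu)
h2 = relu(h1 @ W2 + b2)           # second Dense(100, relu)
prob = sigmoid(h2 @ W3 + b3)      # Dense(1) followed by sigmoid

# Total number of trainable parameters for this layout
n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(prob.shape, n_params)
```

Each row of `prob` is a value strictly between 0 and 1, which is why binary cross-entropy is a natural loss for this output.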


In [42]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Standardize every feature to zero mean and unit variance
features = features.values
scaler = preprocessing.StandardScaler().fit(features)
features = scaler.transform(features)

# Hold out 20% of the labelled rows for validation
num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(
    features, labels.values, test_size=num_test, random_state=1337)

# Save the model whenever the validation loss improves
model_checkpoint = ModelCheckpoint('redhat1.hdf5', monitor='val_loss', save_best_only=True)

input_dim = X_train.shape[1]

model = create_model_v1(input_dim)

print("Start fitting the model")



Start fitting the model
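
`StandardScaler` subtracts each column's mean and divides by its standard deviation. The cell above fits the scaler on every labelled row before splitting. A common alternative, sketched here in plain NumPy on synthetic data, is to estimate the statistics from the training split only, so that the validation rows do not influence them. All names and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1337)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))   # synthetic features

# Hold out 20% of the rows, mirroring test_size=0.20
idx = rng.permutation(len(X))
n_val = int(0.20 * len(X))
val_idx, train_idx = idx[:n_val], idx[n_val:]
X_tr, X_val = X[train_idx], X[val_idx]

# Estimate the statistics on the training rows only ...
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)            # population std (ddof=0), as StandardScaler uses

# ... then apply the same transform to both splits
X_tr_scaled = (X_tr - mean) / std
X_val_scaled = (X_val - mean) / std

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training split ends up with zero mean and unit variance per column, while the validation split is only approximately standardized, which is the expected behaviour.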


In [43]:
model.fit(X_train, y_train, batch_size=100, epochs=1, validation_data=(X_test, y_test),
          verbose=1, shuffle=True, callbacks=[model_checkpoint])

# Apply the already-fitted scaler to the Kaggle test set
test = scaler.transform(test.values)

# Reload the best checkpoint, then score the validation split and the test set
model.load_weights('redhat1.hdf5')
proba = model.predict(X_test, verbose=1)
test_proba = model.predict(test, verbose=1)

print(np.shape(proba))

  


Train on 1757832 samples, validate on 439459 samples
Epoch 1/1




(439459, 1)


In [46]:
## Out-of-the-box random forest baseline (left commented out)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
print("start predicting")
#clf = RandomForestClassifier()
#clf.fit(X_train, y_train)


start predicting


In [47]:
# Evaluate the DNN's predicted probabilities on the validation split
preds = proba.ravel()
score = roc_auc_score(y_test, preds)
print("Area under ROC {0}".format(score))


#test_proba = clf.predict_proba(test)
test_preds = test_proba.flatten()

print(np.shape(test_preds))
# Format for submission
output = pd.DataFrame({'activity_id': test_ids, 'outcome': test_preds})

output.to_csv('redhat.csv', index=False)

Area under ROC 0.9576760872130916
(498687,)
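
For intuition about the metric, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The helper below, `pairwise_auc`, is a hypothetical name and not part of scikit-learn. It computes that quantity directly on made-up labels and scores.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]      # every positive score minus every negative score
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Made-up labels and predicted probabilities
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y, p))   # 0.75
```

This pairwise version needs memory proportional to the number of positive-negative pairs, so it is only suitable for small examples. For the 439,459 validation rows used here, `roc_auc_score` is the practical choice.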
