# Insurance Fraud Claim Prediction - Step 4: Modeling
We are now ready to build our regression model to predict fraud. This notebook reuses the earlier work (cleaning, feature prep) to quickly produce the engineered features we need to train our ML model.

In [54]:
# Basic setup
%run config.ipynb

In [56]:
# Connect to Cortex 5 and create a Builder instance
cortex = Cortex.client()
builder = cortex.builder()

### Training Data
We will start with the training dataset from our previous steps and run the _features_ pipeline to get cleaned and prepared data.

In [59]:
train_ds = cortex.dataset('claims-fraud/motorinsurancefraud')

In [61]:
pipeline = train_ds.pipeline('features')
train_df = pipeline.run()

running pipeline [clean] for dataset [claims-fraud/motorinsurancefraud]:
> drop_unused 
> fill_na_none 
running pipeline [features] for dataset [claims-fraud/motorinsurancefraud]:


### Feature Framing
We now need to split out our target variable from the training data and convert our categorical values into _dummies_.
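
As a quick illustration of what converting categorical values into dummies means, here is a minimal sketch on a toy frame. The column names and values are hypothetical and are not taken from the claims dataset; only `pd.get_dummies` itself is used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; these columns are illustrative only.
toy = pd.DataFrame({
    "claim_amount": [1200, 560, 8900],
    "injury_type": ["Soft Tissue", "Back", "Soft Tissue"],
})

# Numeric columns pass through unchanged; each distinct category
# value becomes its own indicator column.
dummies = pd.get_dummies(toy)
dummy_columns = sorted(dummies.columns)
print(dummy_columns)
```

This expansion of each categorical column into several indicator columns is why the engineered frame ends up wider than the input.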

In [64]:
y = train_df['Fraud Flag']

In [66]:
from sklearn import preprocessing
from sklearn import utils

lab_enc = preprocessing.LabelEncoder()
y_encoded = lab_enc.fit_transform(y)

In [68]:
train_df.shape

(500, 13)

In [72]:
def drop_target(pipeline, df):
    # Pass axis by keyword; the positional form is not accepted by recent pandas versions
    df.drop('Fraud Flag', axis=1, inplace=True)
    
def get_dummies(pipeline, df):
    return pd.get_dummies(df)


pipeline = train_ds.pipeline('engineer', depends=['features'], clear_cache=True)
pipeline.reset()
pipeline.add_step(drop_target)
pipeline.add_step(get_dummies)

# Run the feature engineering pipeline to prepare for model training
train_df = pipeline.run()

# Remember the full set of engineered columns we need to produce for the model
pipeline.set_context('columns', train_df.columns.tolist())

print('\nTrain shape: (%d, %d)' % train_df.shape)

running pipeline [clean] for dataset [claims-fraud/motorinsurancefraud]:
> drop_unused 
> fill_na_none 
running pipeline [features] for dataset [claims-fraud/motorinsurancefraud]:
running pipeline [engineer] for dataset [claims-fraud/motorinsurancefraud]:
> drop_target 
> get_dummies 

Train shape: (500, 24)
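
The stored column list matters because `pd.get_dummies` only creates columns for the categories present in the frame it is given, so new data may produce fewer (or different) columns than the training set. One way to use a saved column list, sketched here with hypothetical toy data and plain pandas rather than the Cortex pipeline context, is to reindex the new frame against it:

```python
import pandas as pd

# Hypothetical training and scoring frames; values are illustrative only.
train_raw = pd.DataFrame({"injury_type": ["Back", "Soft Tissue", "Broken Limb"]})
new_raw = pd.DataFrame({"injury_type": ["Back", "Back"]})

# Column list captured at training time.
train_cols = pd.get_dummies(train_raw).columns.tolist()

# Align new data to the training columns: missing indicators are
# added and filled with 0, and the column order is made to match.
aligned = pd.get_dummies(new_raw).reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())
```

How the saved `columns` context is actually consumed at prediction time is not shown in this notebook, so treat this as one plausible approach rather than the pipeline's own behavior.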


## Model Training, Validation, and Experimentation
We are going to try a variety of algorithms and parameters to find the best-performing model. This is an iterative process, and Cortex 5 helps us track and reproduce it later by recording the data pipeline used, the model parameters, the model metrics, and the model artifacts in Experiments.

In [75]:
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [77]:
def train(x, y, **kwargs):
    alphas = kwargs.get('alphas', [1, 0.1, 0.001, 0.0001])

    # Select algorithm
    mtype = kwargs.get('model_type')
    if mtype == 'Lasso':
        model = LassoCV(alphas=alphas)
    elif mtype == 'Ridge':
        model = RidgeCV(alphas=alphas)
    elif mtype == 'ElasticNet':
        model = ElasticNetCV(alphas=alphas)
    else:
        model = LogisticRegression()

    # Train model
    model.fit(x, y)
    
    return model
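
The `*CV` estimators used in `train` pick one value from `alphas` by internal cross-validation, and the chosen value is exposed afterwards as `alpha_`. The sketch below shows this on synthetic data from `make_regression`; the data is an assumption for illustration and has nothing to do with the claims dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

alphas = [1, 0.1, 0.001, 0.0001]

# Synthetic regression problem, used only to demonstrate alpha selection.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas).fit(X, y)

# Each fitted estimator records the regularization strength it selected.
print('Ridge alpha:', ridge.alpha_)
print('Lasso alpha:', lasso.alpha_)
```

Logging `alpha_` alongside the candidate list could make a run easier to reproduce, since the candidate list alone does not say which value was used.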

In [79]:
def predict_and_score(model, x, y):
    predictions = model.predict(x)
    # scikit-learn's signature is (y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y, predictions))
    return [predictions, rmse]

In [81]:
X_train, X_test, y_train, y_test = train_test_split(train_df, y.values, test_size=0.20, random_state=10)
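
Fraud labels are often imbalanced, in which case a plain random split can leave the test set with a different fraud rate than the training set. A possible variation, not used in this notebook, is to pass `stratify` so both splits keep the same class proportions. The 10% fraud rate below is a made-up assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 500 rows with an assumed 10% positive rate.
X = np.arange(500).reshape(-1, 1)
y = np.array([1] * 50 + [0] * 450)

# stratify=y keeps the class ratio identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=10, stratify=y
)
print(len(X_tr), len(X_te), y_tr.sum(), y_te.sum())
```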

### Experiment Management
We are ready to run our train and validation loop and select the optimal model.  As we run our experiment, Cortex will track each run and record the key params, metrics, and artifacts needed to reproduce and/or deploy the model later.

In [84]:
%%time

best_model = None
best_model_type = None
best_rmse = float('inf')  # any real RMSE beats this, so best_model is always set

exp = cortex.experiment('claims-fraud/motorinsurancefraud-regression')
# exp.reset()
exp.set_pipeline(pipeline)
exp.set_meta('style', 'supervised')
exp.set_meta('function', 'regression')

with exp.start_run() as run:
    alphas = [1, 0.1, 0.001, 0.0001]
    for model_type in ['Logistic', 'Lasso', 'Ridge', 'ElasticNet']:
        print('---'*30)
        print('Training model using {} regression algorithm'.format(model_type))
        model = train(X_train, y_train, model_type=model_type, alphas=alphas)
        [predictions, rmse] = predict_and_score(model, X_train, y_train)
        print('Training error:', rmse)
        [predictions, rmse] = predict_and_score(model, X_test, y_test)
        print('Testing error:', rmse)
        
        if rmse < best_rmse:
            best_rmse = rmse
            best_model = model
            best_model_type = model_type
    
    # Note: score() returns R^2 for the regressors but mean accuracy for LogisticRegression
    r2 = best_model.score(X_test, y_test)
    run.log_metric('r2', r2)
    run.log_metric('rmse', best_rmse)
    run.log_param('model_type', best_model_type)
    run.log_param('alphas', alphas)
    run.log_artifact('model', best_model)

print('---'*30)

------------------------------------------------------------------------------------------
Training model using Logistic regression algorithm
Training error: 0.0
Testing error: 0.1
------------------------------------------------------------------------------------------
Training model using Lasso regression algorithm
Training error: 0.4059424519410545
Testing error: 0.41364074527261824
------------------------------------------------------------------------------------------
Training model using Ridge regression algorithm
Training error: 0.40390040669216887
Testing error: 0.41248003101742137
------------------------------------------------------------------------------------------
Training model using ElasticNet regression algorithm
Training error: 0.4030527924585826
Testing error: 0.4113655141350472
------------------------------------------------------------------------------------------
CPU times: user 90 ms, sys: 110 ms, total: 200 ms
Wall time: 131 ms


In [86]:
print('Best model: ' + best_model_type)
print('Best testing error: %.6f' % best_rmse)
print('R2 score: %.6f' % r2)

Best model: Logistic
Best testing error: 0.100000
R2 score: 0.990000
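
One caveat when reading these numbers: scikit-learn's `LogisticRegression.score` returns mean accuracy rather than R², so the value printed as "R2 score" for the Logistic model is an accuracy. For hard 0/1 predictions against 0/1 labels, the squared RMSE equals the misclassification rate, which would explain why 0.1 and 0.99 appear together. The sketch below reproduces that relationship on made-up labels, assuming a 0/1 target and a 100-row test set (20% of 500 rows).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels: 100 rows with exactly one misclassification.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = y_true.copy()
y_pred[0] = 1

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
accuracy = accuracy_score(y_true, y_pred)

# For 0/1 labels and predictions: rmse ** 2 == 1 - accuracy
print('RMSE: %.2f, accuracy: %.2f' % (rmse, accuracy))
```

Because the regressors emit continuous scores while the classifier emits class labels, comparing them on RMSE alone favors neither fairly; a classification metric computed the same way for every candidate would likely give a cleaner comparison.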


In [88]:
exp

| ID | Date | Took | Params: alphas | Params: model_type | Metrics: r2 | Metrics: rmse |
|---|---|---|---|---|---|---|
| koff7hn | Mon, 24 Sep 2018 19:40:05 GMT | 0.00 s | - | - | 0.0 | 0.0 |
| 8ygf7t0 | Mon, 24 Sep 2018 19:41:28 GMT | 0.00 s | - | - | 0.0 | 0.0 |
| gwhf77v | Mon, 24 Sep 2018 20:29:29 GMT | 0.00 s | - | - | 0.0 | 0.0 |
| doif7rc | Mon, 24 Sep 2018 20:31:12 GMT | 0.00 s | - | - | 0.0 | 0.0 |
| dmjf79a | Mon, 24 Sep 2018 20:34:12 GMT | 0.00 s | - | - | 0.0 | 0.0 |
| 2gkf7g5 | Mon, 24 Sep 2018 20:40:09 GMT | 0.03 s | [1, 0.1, 0.001, 0.0001] | Logistic | 0.99 | 0.1 |
| iflf7gi | Mon, 24 Sep 2018 20:41:48 GMT | 0.04 s | [1, 0.1, 0.001, 0.0001] | Logistic | 0.99 | 0.1 |
| jn03y1r | Mon, 24 Sep 2018 23:21:13 GMT | 0.04 s | [1, 0.1, 0.001, 0.0001] | Logistic | 0.99 | 0.1 |
| 4593yrn | Mon, 24 Sep 2018 23:55:59 GMT | 0.05 s | [1, 0.1, 0.001, 0.0001] | Logistic | 0.99 | 0.1 |
