## Predicting if water pumps need repair



This code is a solution to the [Pump it Up: Data Mining the Water Table](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/) competition. The goal of this competition is to take a rich dataset provided by [Taarifa](http://taarifa.org/) and the [Tanzanian Ministry of Water](http://maji.go.tz/) and predict which water pumps are functional, which need repair, and which don't work at all.

While this file contains all the code you need to build a good offline model for this problem, it doesn't really explain how I got to this point. That could be a whole other post. The short version is that the process is much less linear than it looks here.

In the post it looks pretty direct: load -> clean -> model -> tune -> done, drink a beer!
But in reality it's more like: load -> sample a tiny bit of data -> model -> clean -> model -> clean -> try with more data -> tune hyper params -> model -> clean -> and so on and so forth. So if it doesn't seem immediately obvious how to get to this point, it's because it isn't. Don't give up!

Author's note: You could spend days or even weeks trying to squeeze that last .001 percent of accuracy out of your model, but most of the useful learning happens before you start squeezing the dry orange. It's best to make your way to the top 1% of the competition, then move on and learn from a new competition.

### Sections
* data cleaning
* fitting a basic model
* tuning

### Imports

In [2]:
from IPython.display import display  # Use this function to display more than one thing per cell.
import pandas as pd
import seaborn as sns
import numpy as np
pd.options.display.max_columns = None  # Show all the columns when we display a dataframe.

### Make a tool to reduce the dimensionality of categorical data
We do this so the one-hot encoded data can fit in memory without throwing the high-cardinality categorical columns out entirely.

In [3]:
REDUCE_DEFAULT = "OTHER"

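class FactorReducer():
    """Reduce a categorical column to a maximum number of factors. We keep the most commonly occurring values."""
    def __init__(self):
        self.factors_to_keep = []
        self.reduce_to = REDUCE_DEFAULT
        
    def fit(self, series, max_factors, reduce_to=REDUCE_DEFAULT):
        # value_counts() sorts by frequency, so the first max_factors keys are the most common values.
        self.factors_to_keep = series.value_counts().keys()[0:max_factors]
        self.reduce_to = reduce_to  # Honor the caller's replacement label instead of always using the default.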
        
    def transform(self, series):
        new_column = series.copy(deep=True)
        indices_for_other = np.logical_not(series.isin(self.factors_to_keep))
        new_column.loc[indices_for_other] = self.reduce_to
        return new_column
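
To see what this kind of reduction does, here is a minimal standalone sketch on a toy column. The function name `reduce_factors` and the sample funder values are made up for illustration; it folds the fit and transform steps above into a single call.

```python
import pandas as pd

def reduce_factors(series, max_factors, reduce_to="OTHER"):
    """Keep the max_factors most common values; replace the rest with reduce_to."""
    keep = series.value_counts().index[:max_factors]
    return series.where(series.isin(keep), reduce_to)

# Hypothetical funder column: two common values and two rare ones.
funders = pd.Series(["gov", "gov", "gov", "unicef", "unicef", "church", "private"])
reduced = reduce_factors(funders, max_factors=2)
print(reduced.value_counts())
```

The rare values collapse into a single `OTHER` bucket, so one-hot encoding this column later produces three columns instead of four.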

### Load up the data

In [4]:
train_features = pd.read_csv("train_features.csv")
train_target = pd.read_csv("train_target.csv").fillna("Empty")
test_features = pd.read_csv("test_features.csv").fillna("Empty")
train = train_features.set_index("id").join(train_target.set_index("id"))  # Join into one training DataFrame

### Choose our features

In [5]:
TARGET = "status_group"

NUMERIC_FEATURES = [
    'amount_tsh',
    'gps_height',
    'longitude',
    'latitude',
    'population'
]

CATEGORICAL_FEATURES = [
    'construction_year',
    'funder',
    'installer',
    'basin',
    'region',
    'public_meeting',
    'scheme_management',
    'permit',
    'extraction_type',
    'extraction_type_group',
    'extraction_type_class',
    'management',
    'management_group',
    'payment',
    'payment_type',
    'water_quality',
    'quality_group',
    'quantity',
    'quantity_group',
    'source',
    'source_type',
    'source_class',
    'waterpoint_type',
    'waterpoint_type_group'
]

DROP_COLUMNS = [
    "wpt_name",
    "num_private",
    "subvillage",
    "region_code",
    "district_code",
    "lga",
    "ward",
    "recorded_by",
    "scheme_name",
]

CATEGORY_CAPS = {
    "funder": 55,
    "installer": 45,
    "construction_year": 9
}

In [6]:
train = train.drop(columns=DROP_COLUMNS)  # drop() returns a new frame, so keep the result.
test_features = test_features.drop(columns=DROP_COLUMNS)

# Make it so we can fit categorical data into memory.
categorical_reducers = {}
for feature, max_factors in CATEGORY_CAPS.items():
    reducer = FactorReducer()
    reducer.fit(train[feature], max_factors)
    categorical_reducers[feature] = reducer

def build_cleaned_frame(input_frame, numerics, categoricals, categorical_reducers):
    # Copy the selection so the reducers write to a fresh frame rather than a slice of input_frame.
    category_frame = input_frame[categoricals].copy()
    for feature, reducer in categorical_reducers.items():
        category_frame[feature] = reducer.transform(category_frame[feature])
    cleaned_frame = input_frame[numerics]
    # One-hot encode every categorical column; columns come out named like "basin=Pangani".
    dummy_category_frame = pd.get_dummies(category_frame, prefix=categoricals, prefix_sep="=")
    return cleaned_frame.join(dummy_category_frame)

cleaned_train = build_cleaned_frame(train, NUMERIC_FEATURES, CATEGORICAL_FEATURES, categorical_reducers)
cleaned_test = build_cleaned_frame(test_features, NUMERIC_FEATURES, CATEGORICAL_FEATURES, categorical_reducers)
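
Because the train and test frames are encoded separately, they can end up with different dummy columns. The sketch below uses two tiny made-up frames to show how `pd.get_dummies` names its columns with `prefix_sep="="` and how `np.intersect1d` (which the notebook uses later to build `cols`) keeps only the columns both frames share.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames with one categorical column.
train_cats = pd.DataFrame({"basin": ["Pangani", "Rufiji", "Pangani"]})
test_cats = pd.DataFrame({"basin": ["Rufiji", "Internal"]})

train_dummies = pd.get_dummies(train_cats, prefix=["basin"], prefix_sep="=")
test_dummies = pd.get_dummies(test_cats, prefix=["basin"], prefix_sep="=")

# Only columns present in both frames are safe to feed to a fitted model.
shared = np.intersect1d(train_dummies.columns, test_dummies.columns)
print(list(train_dummies.columns))
print(list(test_dummies.columns))
print(list(shared))
```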



### Convert date features into a more usable format

In [7]:
def get_date_frame(df):
    ret_frame = pd.DataFrame(index=df.index)

    # Parse the dates once; the .dt accessor keeps the results aligned with df's index.
    dates = pd.to_datetime(df['date_recorded'])
    ret_frame["year_recorded"] = dates.dt.year
    ret_frame["month_recorded"] = dates.dt.month

    # Note: this feature depends on the day the notebook is run.
    ret_frame["recorded_days_ago"] = (pd.Timestamp.today().normalize() - dates).dt.days
    return ret_frame

train_date_frame = get_date_frame(train)
test_date_frame = get_date_frame(test_features)
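
As a small illustration of the same idea, the sketch below derives the three date features for two made-up records. It uses a fixed, hypothetical reference date instead of today's date so the result does not change from run to run.

```python
import pandas as pd

# Two made-up records in the same YYYY-MM-DD format as date_recorded.
records = pd.DataFrame({"date_recorded": ["2011-03-14", "2013-02-25"]})
dates = pd.to_datetime(records["date_recorded"])
reference = pd.Timestamp("2014-01-01")  # hypothetical fixed reference date

features = pd.DataFrame({
    "year_recorded": dates.dt.year,
    "month_recorded": dates.dt.month,
    "recorded_days_ago": (reference - dates).dt.days,
})
print(features)
```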

In [14]:
cleaned_train = cleaned_train.join(train_date_frame)
cleaned_train.to_csv('cleaned_train.csv', index=False)
cleaned_test = cleaned_test.join(test_date_frame)
cleaned_test.to_csv('cleaned_test.csv', index=False)
train[TARGET].to_csv('YS.csv', index=False, header=True)
cols = np.intersect1d(cleaned_train.columns, cleaned_test.columns) # ONLY use the columns that exist for both

In [9]:
## Just a quick sanity check we didn't let any nonexistent data slip through
cleaned_test[cols].isnull().sum()

amount_tsh                                     0
basin=Internal                                 0
basin=Lake Nyasa                               0
basin=Lake Rukwa                               0
basin=Lake Tanganyika                          0
basin=Lake Victoria                            0
basin=Pangani                                  0
basin=Rufiji                                   0
basin=Ruvuma / Southern Coast                  0
basin=Wami / Ruvu                              0
construction_year=0                            0
construction_year=2000                         0
construction_year=2003                         0
construction_year=2006                         0
construction_year=2007                         0
construction_year=2008                         0
construction_year=2009                         0
construction_year=2010                         0
construction_year=2011                         0
construction_year=OTHER                        0
extraction_type=afri

## Build a Model

We have clean data now, so let's run a decent, fast classification model on it and see what kind of accuracy we can get.

In [37]:
%%time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
SAMPLE_SIZE = 10000

sample = cleaned_train.sample(SAMPLE_SIZE)
sample_target = train[TARGET].loc[sample.index]

rfc = RandomForestClassifier()
scores = cross_val_score(rfc, sample, sample_target, cv=5)  # (1)
print("all scores: ", scores)
print("mean: ", scores.mean())

all scores:  [ 0.75574426  0.763       0.763       0.75737869  0.77088544]
mean:  0.762001677562
CPU times: user 1.25 s, sys: 141 ms, total: 1.39 s
Wall time: 1.42 s


(1) `cross_val_score` splits our data into several cross-validation folds.
For each fold, the data is divided into a training portion and a testing portion.
As the names imply, we train the model on the training portion, then see how well it does on
data it hasn't seen before using the testing portion. More on [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html).
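
To make the fold mechanics concrete, here is a hand-rolled sketch of k-fold cross-validation using only NumPy. It is not how `cross_val_score` is implemented: the data is synthetic, the "model" is a toy nearest-centroid classifier, and the folds are shuffled but not stratified. The function name `k_fold_accuracy` is made up for this example.

```python
import numpy as np

def k_fold_accuracy(X, y, k=5, seed=0):
    """Hand-rolled k-fold CV around a toy nearest-centroid classifier."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training portion.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        classes = np.unique(y[train_idx])
        # "Training": one centroid per class, computed from the training portion only.
        centroids = np.array([X[train_idx][y[train_idx] == c].mean(axis=0) for c in classes])
        # "Testing": assign each held-out row to its nearest centroid.
        dists = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
        predicted = classes[dists.argmin(axis=1)]
        scores.append((predicted == y[test_idx]).mean())
    return np.array(scores)

# Synthetic two-class data: two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
scores = k_fold_accuracy(X, y, k=5)
print("all scores: ", scores)
print("mean: ", scores.mean())
```

Each row is used for testing exactly once, which is why the mean of the fold scores is a fairer estimate than a single train/test split.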

## What if we run an rfc on our whole dataset?

In [22]:
%%time

rfc = RandomForestClassifier()
scores = cross_val_score(rfc, cleaned_train, train[TARGET], cv=5)
print("all scores: ", scores)
print("mean: ", scores.mean())

all scores:  [ 0.79808097  0.79277839  0.79452862  0.79234007  0.79264186]
mean:  0.794073980208
CPU times: user 8.75 s, sys: 914 ms, total: 9.66 s
Wall time: 9.69 s


These are pretty solid results for not doing any model tuning: we just took our cleaned data and dropped a stock sklearn model on it. We could use XGBoost, but it runs way slower on my machine. Maybe the faster run time will let us get bigger tuning wins.

The best model in the competition is at .8285.

There are a few cool things we can get from our `RandomForestClassifier`, like a peek at feature importance:

In [35]:
rfc.fit(cleaned_train[cols], train[TARGET])
pd.DataFrame({
        'name': cols,
        'importance': rfc.feature_importances_}).sort_values(by="importance", ascending=False)

     importance  name
160    0.111561  latitude
161    0.111560  longitude
205    0.057815  quantity=dry
113    0.056570  gps_height
215    0.056486  recorded_days_ago
196    0.039908  population
210    0.031158  quantity_group=dry
40     0.023556  extraction_type_class=other
29     0.020316  extraction_type=other
179    0.018073  month_recorded


Cool, now we can quickly tell very interesting things, like the fact that the location of a well is a pretty good indicator of how likely it is to break down.

## XGBoost
How could we not go there? Have you ever seen a Kaggle forum?

The random forest classifier is a pretty good model to use while we're still cleaning, exploring, and playing with our dataset, because it runs fast. However, a well-tuned XGBoost usually performs better. Since we already have all the features we want set up, we can move on to performance tuning.

In [20]:
%%time

from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

SAMPLE_SIZE = 10000

sample = cleaned_train.sample(SAMPLE_SIZE)
sample_target = train[TARGET].loc[sample.index]

xgb = XGBClassifier()
scores = cross_val_score(xgb, sample, sample_target, cv=5)  # (1)
print("all scores: ", scores)
print("mean: ", scores.mean())

all scores:  [ 0.72677323  0.747       0.733       0.73636818  0.74437219]
mean:  0.737502719392
CPU times: user 1min 6s, sys: 477 ms, total: 1min 7s
Wall time: 1min 7s


With no tuning, XGBoost actually does worse than our random forest classifier. Let's tune this bad boy up a little bit.

### Initialize xgboost and run it

In [None]:
#  colsample_bytree |     gamma |   max_depth |   scale_pos_weight
#            0.3124 |    1.7012 |     18.4428 |            20.0331
xgb = XGBClassifier(max_depth=18, colsample_bytree=0.31, gamma=1.7)  # gamma matches the 1.7012 in the table above
# xgb = XGBClassifier(max_depth=20, colsample_bytree=0.2)
scores = cross_val_score(xgb, cleaned_train, train[TARGET])
# scores = cross_val_score(xgb, sample, sample_target)
print(scores.mean())

### Set up Bayesian Optimization

In [27]:
from bayes_opt import bayesian_optimization

def rfc_eval(criterion, max_features, max_depth):
    # The optimizer only proposes floats, so decode them into valid parameters first.
    max_depth = int(max_depth)
    crit = "gini" if criterion > .5 else "entropy"
    rfc = RandomForestClassifier(
        criterion=crit,
        max_features=max_features, 
        max_depth=max_depth 
    )
    scores = cross_val_score(rfc, cleaned_train, train[TARGET], n_jobs=-1)
    return scores.mean()

rfcBO = bayesian_optimization.BayesianOptimization(rfc_eval, {
        'criterion': (0,1),
        'max_features': (.05, .95),
        'max_depth':(3,50)
    })

rfcBO.maximize(init_points=5, n_iter=100)

Initialization
--------------------------------------------------------------------------
 Step |   Time |      Value |   criterion |   max_depth |   max_features | 
    1 | 00m05s |    0.79160 |      0.9572 |     38.7470 |         0.2406 | 
    2 | 00m03s |    0.70389 |      0.8487 |      4.0443 |         0.7075 | 
    3 | 00m07s |    0.79357 |      0.0031 |     38.2414 |         0.4993 | 
    4 | 00m02s |    0.73229 |      0.9330 |      7.3668 |         0.1477 | 
    5 | 00m07s |    0.79140 |      0.4597 |     34.0428 |         0.4425 | 
Bayesian Optimization
--------------------------------------------------------------------------
 Step |   Time |      Value |   criterion |   max_depth |   max_features | 
    6 | 00m09s |    0.79056 |      0.6137 |     49.3645 |         0.1554 | 
    7 | 00m05s |    0.79281 |      0.0000 |     25.1771 |         0.0500 | 
    8 | 00m06s |    0



   15 | 00m16s |    0.79109 |      0.0000 |     50.0000 |         0.9500 | 
   16 | 00m07s |    0.79354 |      1.0000 |     26.3839 |         0.0500 | 
   17 | 00m08s |    0.79458 |      0.6302 |     22.8174 |         0.0574 | 
   18 | 00m09s |    0.79074 |      0.0739 |     16.6335 |         0.2439 | 
   19 | 00m12s |    0.79322 |      0.1203 |     47.0269 |         0.3362 | 
   20 | 00m07s |    0.78397 |      1.0000 |     16.4412 |         0.0607 | 
   21 | 00m11s |    0.79332 |      0.1074 |     39.2539 |         0.2444 | 
   22 | 00m18s |    0.79052 |      0.9029 |     46.1422 |         0.8308 | 
   23 | 00m16s |    0.79424 |      0.0779 |     20.5891 |         0.8428 | 
   24 | 00m09s |    0.79604 |      0.1281 |     22.6534 |         0.3791 | 
   25 | 00m16s |    0.79444 |      0.0878 |     21.8017 |         0.9203 | 
   26 | 00m08s |    0.79542 |      0.0460 |     22.2877 |         0.2397 | 
   27 | 00m09s |    0.79234 |      0.8450 |     32.6161 |         0.2621 | 
   28 | 00m17s |    0.78891 |      0.8600 |     28.3482 |         0.9371 | 
   29 | 00m10s |    0.79473 |      0.1704 |     18.8833 |         0.4143 | 
   30 | 00m08s |    0.79557 |      0.1094 |     23.1440 |         0.1852 | 
   31 | 00m16s |    0.79227 |      0.0244 |     45.1178 |         0.8984 | 
   32 | 00m09s |    0.79382 |      0.0863 |     26.7159 |         0.2230 | 
   33 | 00m14s |    0.79175 |      0.0065 |     47.4707 |         0.8602 | 
   34 | 00m15s |    0.78926 |      0.4172 |     17.7807 |         0.9489 | 
   35 | 00m09s |    0.79557 |      0.0756 |     19.2232 |         0.3028 | 



In [None]:
%%time
from xgboost import XGBClassifier

xgb_out = XGBClassifier(colsample_bytree=0.309, gamma=0.795, max_depth=18)
xgb_out.fit(cleaned_train[cols], train[TARGET])

## Wait, how did you get those params?
I just ran Bayesian optimization on smaller samples of the data.
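
One detail worth spelling out: an optimizer like this proposes real-valued points inside the bounds you give it, so each proposal has to be decoded into valid model parameters (for example, an integer `max_depth`) before a model is built. The sketch below shows only that decoding step in plain Python. The bounds are hypothetical, and `propose` is a uniform random stand-in for the optimizer, not Bayesian optimization itself.

```python
import random

# Hypothetical search bounds, in the same (low, high) style the notebook
# passes to BayesianOptimization. These are not the bounds actually used.
BOUNDS = {"colsample_bytree": (0.05, 0.95), "gamma": (0.0, 5.0), "max_depth": (3, 25)}

def decode(point):
    """Turn a real-valued proposal into valid model keyword arguments."""
    return {
        "colsample_bytree": round(point["colsample_bytree"], 3),
        "gamma": round(point["gamma"], 3),
        "max_depth": int(point["max_depth"]),  # tree depth must be an integer
    }

def propose(rng):
    """Stand-in for the optimizer: draw a uniform random point inside BOUNDS."""
    return {name: rng.uniform(low, high) for name, (low, high) in BOUNDS.items()}

rng = random.Random(0)
candidates = [decode(propose(rng)) for _ in range(3)]
for params in candidates:
    print(params)
```

Each decoded dictionary could then be unpacked into a classifier, scored with cross-validation on a sample of the data, and the best-scoring settings carried over to the full dataset.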

In [None]:
centered = cleaned_train[cols] - cleaned_train[cols].mean()
centered = centered / centered.std()

xgb_cross_val = XGBClassifier(colsample_bytree=0.309, gamma=0.795, max_depth=24)
# scores = cross_val_score(xgb_cross_val, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)
scores = cross_val_score(xgb_cross_val, centered, train[TARGET], cv=5, n_jobs=-1)

print(scores.mean())

"""0.815336592481
"""

In [None]:
predictions = xgb_out.predict(cleaned_test[cols])
out_frame = pd.DataFrame({
        "id": test_features["id"],
        "status_group": predictions
    })
out_frame.to_csv("bayes_opt_xgb_out.csv", index=False)

In [None]:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from xgboost import XGBClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=22)
xgb = XGBClassifier(colsample_bytree=0.309, gamma=0.795, max_depth=18)

voter = VotingClassifier(estimators=[
        ('xgb', xgb),
        ('rfc', rfc),
    ], voting="soft")
scores = cross_val_score(voter, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)
print(scores.mean())
voter.fit(cleaned_train[cols], train[TARGET])
predictions = voter.predict(cleaned_test[cols])  # keep the voter's predictions for the submission file
out_frame = pd.DataFrame({
        "id": test_features["id"],
        "status_group": predictions
    })
out_frame.to_csv("voter_bayes_opt_xgb_rfc2.csv", index=False)

"""
0.814865190752
"""

In [None]:
predictions = voter.fit(cleaned_train[cols], train[TARGET]).predict(cleaned_test[cols])  # fit() returns the estimator, not predictions
out_frame = pd.DataFrame({
        "id": test_features["id"],
        "status_group": predictions
    })
out_frame.to_csv("voter_bayes_opt_xgb_rfc.csv", index=False)

### Let's make a voter

In [None]:
from xgboost import XGBClassifier
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Make a bunch of xgboosts
COL = 'colsample_bytree'
GAM = "gamma"
MD  = 'max_depth'
xgb_settings = [
    {
        COL:0.309,
        GAM: 0.795,
        MD: 18
    },
    {
        COL:0.8892,
        GAM: 0.3534,
        MD: 12
    },
    {
        COL:0.3356,
        GAM: 2.6913,
        MD: 9
    },
    {
        COL:0.8892,
        GAM: 0.3534,
        MD: 12
    },
    {
        COL:0.3575,
        GAM: 0.9201,
        MD: 19
    }
]



rfc_most_deep = RandomForestClassifier(n_estimators=100, max_depth=30)
rfc_even_more_deep = RandomForestClassifier(n_estimators=100, max_depth=26)
rfc_more_deep = RandomForestClassifier(n_estimators=100, max_depth=22)
rfc_deep = RandomForestClassifier(n_estimators=100, max_depth=20)
rfc_shallow = RandomForestClassifier(n_estimators=100, max_depth=18)

rfc_estimators = [
        ('shallow_rfc', rfc_shallow),
        ('deep_rfc', rfc_deep),
        ('more_deep_rfc', rfc_more_deep),
        ('more_even_more_rfc', rfc_even_more_deep),
        ('more_most_rfc', rfc_most_deep),
    ]

# VotingClassifier expects (name, estimator) pairs with string names.
xgb_estimators = []
for ii, xgb_setting in enumerate(xgb_settings):
    xgb_estimators.append(("xgb_%d" % ii, XGBClassifier(colsample_bytree=xgb_setting[COL], gamma=xgb_setting[GAM], max_depth=xgb_setting[MD])))
    
# put them together with a voting classifier
xgb_voter = VotingClassifier(estimators=xgb_estimators, voting='soft')
mixed_estimators = xgb_estimators + rfc_estimators
mixed_voter = VotingClassifier(estimators=mixed_estimators, voting="soft")

xgb_scores = cross_val_score(xgb_voter, cleaned_train[cols], train[TARGET], n_jobs=-1)
mixed_scores = cross_val_score(mixed_voter, cleaned_train[cols], train[TARGET], n_jobs=-1)

print("xgb voter: %s" % xgb_scores.mean())
print("mixed voter: %s" % mixed_scores.mean())

"""
xgb voter: 0.808097643098
mixed voter: 0.809646464646
"""


In [None]:
%%time
mixed_voter.fit(cleaned_train[cols], train[TARGET])
predictions = mixed_voter.predict(cleaned_test[cols])  # predict with the voter we just fit
output = pd.DataFrame({
        'id': test_features['id'],
        'status_group': predictions
    })
output.to_csv('mixed_voter_xgb_rfc.csv', index=False)
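
For readers unfamiliar with `voting="soft"`, the idea is to average each model's predicted class probabilities and pick the class with the highest average. The sketch below works through that arithmetic with NumPy on made-up probabilities for two pumps; the numbers are hypothetical and are not outputs of the models above.

```python
import numpy as np

classes = np.array(["functional", "functional needs repair", "non functional"])

# Hypothetical predict_proba outputs from two models for the same two pumps.
proba_xgb = np.array([[0.60, 0.10, 0.30],
                      [0.40, 0.15, 0.45]])
proba_rfc = np.array([[0.50, 0.20, 0.30],
                      [0.55, 0.10, 0.35]])

# Soft voting: average the class probabilities, then take the most likely class.
mean_proba = np.mean([proba_xgb, proba_rfc], axis=0)
soft_votes = classes[mean_proba.argmax(axis=1)]
print(soft_votes)
```

Note that for the second pump the first model alone would pick "non functional", but the more confident second model pulls the average toward "functional". That is the main difference from hard (majority) voting, which ignores confidence.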

In [None]:
cleaned_test.isnull().sum()

In [None]:
rfc_most_deep = RandomForestClassifier(n_estimators=100, max_depth=26)
rfc_even_more_deep = RandomForestClassifier(n_estimators=100, max_depth=24)
rfc_more_deep = RandomForestClassifier(n_estimators=100, max_depth=22)
rfc_deep = RandomForestClassifier(n_estimators=100, max_depth=20)
rfc_shallow = RandomForestClassifier(n_estimators=100, max_depth=18)
voter = VotingClassifier(estimators=[
        ('shallow_rfc', rfc_shallow),
        ('deep_rfc', rfc_deep),
        ('more_deep_rfc', rfc_more_deep),
        ('more_even_more_rfc', rfc_even_more_deep),
        ('more_most_rfc', rfc_most_deep),
    ], voting="soft")


In [None]:
%%time
# 40 0.747087542088 md=10
# 100 0.794511784512 md=20
# 40 0.798097643098 md=20
# 30 0.795757575758 md=30
from sklearn.ensemble import RandomForestClassifier

rfc_most_deep = RandomForestClassifier(n_estimators=100, max_depth=30)
rfc_even_more_deep = RandomForestClassifier(n_estimators=100, max_depth=26)
rfc_more_deep = RandomForestClassifier(n_estimators=100, max_depth=22)
rfc_deep = RandomForestClassifier(n_estimators=100, max_depth=20)
rfc_shallow = RandomForestClassifier(n_estimators=100, max_depth=18)

scores_most_deep = cross_val_score(rfc_most_deep, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)
scores_even_more_deep = cross_val_score(rfc_even_more_deep, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)
scores_more_deep = cross_val_score(rfc_more_deep, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)
scores_deep = cross_val_score(rfc_deep, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)
scores_shallow = cross_val_score(rfc_shallow, cleaned_train[cols], train[TARGET], cv=5, n_jobs=-1)

scores_voter = cross_val_score(voter, cleaned_train[cols], train[TARGET], n_jobs=-1, cv=5)
print("shallow: %s" % scores_shallow.mean())
print("deep: %s" % scores_deep.mean())
print("more deep: %s" % scores_more_deep.mean())
print("even more deep: %s" % scores_even_more_deep.mean())
print("most deep: %s" % scores_most_deep.mean())
print("voter: %s" % scores_voter.mean())

In [None]:
from bayes_opt import bayesian_optimization

def eval_rfc(n_years, n_funders, n_installers):
    # Refit the reducers with the proposed caps. build_cleaned_frame works on copies,
    # so the global train and test_features frames are left untouched.
    category_caps = {
        "funder": int(n_funders),
        "installer": int(n_installers),
        "construction_year": int(n_years)
    }
    categorical_reducers = {}
    for feature, max_factors in category_caps.items():
        reducer = FactorReducer()
        reducer.fit(train[feature], max_factors)
        categorical_reducers[feature] = reducer
        
    cleaned_train = build_cleaned_frame(train, NUMERIC_FEATURES, CATEGORICAL_FEATURES, categorical_reducers)
    cleaned_test = build_cleaned_frame(test_features, NUMERIC_FEATURES, CATEGORICAL_FEATURES, categorical_reducers)
    cleaned_train = cleaned_train.join(train_date_frame)
    cleaned_test = cleaned_test.join(test_date_frame)
    # Only score on columns that exist in both frames.
    shared_cols = np.intersect1d(cleaned_train.columns, cleaned_test.columns)
    
    rfc = RandomForestClassifier(n_estimators=20, max_depth=20)
    scores = cross_val_score(rfc, cleaned_train[shared_cols], train[TARGET])
    return scores.mean()

bopt = bayesian_optimization.BayesianOptimization(eval_rfc, {
        'n_years': (1, 30),
        'n_funders': (1, 100),
        'n_installers': (1, 100)
    })

bopt.maximize(init_points=5, n_iter=100)

In [None]:
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train[TARGET])
encoded_Y = le.transform(train[TARGET])
dummy_y = np_utils.to_categorical(encoded_Y)

In [None]:
train[TARGET].value_counts()

In [None]:
centered = cleaned_train[cols] - cleaned_train[cols].mean()
centered = centered / centered.std()
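
One caveat with this kind of standardization: a column that never varies has a standard deviation of zero, so dividing by it produces NaN. The sketch below uses a small made-up frame to show one way to guard against that; the column values and the `constant_flag` column are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({
    "gps_height": [1390.0, 686.0, 263.0, 0.0],
    "population": [109.0, 250.0, 58.0, 1.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],  # hypothetical column that never varies
})

centered = frame - frame.mean()
std = centered.std()
# A constant column has zero standard deviation; dividing by it gives NaN,
# so this sketch divides by 1 instead and leaves such columns at zero.
standardized = centered / std.replace(0, 1)
print(standardized.round(3))
```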

In [None]:
centered.head(5)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(centered, train[TARGET], test_size=0.2, random_state=0)

In [None]:
class_weights = pd.get_dummies(y_train).sum().max() / pd.get_dummies(y_train).sum()
cw = dict(zip(range(3), class_weights.values))
cw
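
The class weights above scale each class by how rare it is relative to the most common class. Here is the same arithmetic on a tiny made-up target column, using `value_counts` instead of `get_dummies` as an equivalent way to count the classes.

```python
import pandas as pd

# Made-up target: 6 functional, 3 non functional, 1 needing repair.
y = pd.Series(["functional"] * 6 + ["non functional"] * 3 + ["functional needs repair"] * 1)

counts = y.value_counts().sort_index()   # alphabetical, like pd.get_dummies columns
class_weights = counts.max() / counts    # the most common class gets weight 1.0
cw = dict(zip(range(len(counts)), class_weights.values))
print(cw)
```

The rarest class ends up with the largest weight, so mistakes on it cost the network more during training.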

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Flatten
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers.convolutional import Convolution1D
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from keras.optimizers import SGD, Adam
from keras.regularizers import WeightRegularizer
from sklearn.metrics import accuracy_score

def keras_model():
#     adam = Adam(lr=0.01)
    model = Sequential() # make a model in a linear stack of layers
    model.add(Dense(64, input_dim=290, activation="relu"))
    model.add(Dropout(.5))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(.5))
    model.add(Dense(output_dim=3, activation="softmax"))
    model.compile(loss='categorical_crossentropy', optimizer="adagrad", metrics=['accuracy'])
    return model

kc = Sequential()
kc.add(Dense(10, input_shape=(290,), activation="relu"))
kc.add(Dropout(.5))
kc.add(Dense(3, activation='softmax'))
kc.compile(optimizer='rmsprop', loss='categorical_crossentropy',metrics=['accuracy'])

# kc = KerasClassifier(keras_model)
# kc = keras_model()
kc.fit(X_train.as_matrix(), pd.get_dummies(y_train).as_matrix(), validation_split=0.2, nb_epoch=50, batch_size=32, class_weight=cw)

predicted = kc.predict_classes(X_test.as_matrix())
actual = y_test.map({
        "functional": 0,
        "functional needs repair": 1,
        "non functional": 2
    }).values


print(pd.Series(predicted).value_counts())
accuracy_score(actual, predicted)  # scikit-learn's signature is (y_true, y_pred)
## Your input to Keras needs to be arrays, not DataFrames.
# scores = cross_val_score(kc, cleaned_train[cols].as_matrix(), pd.get_dummies(train[TARGET]).as_matrix(), cv=5, fit_params = {'nb_epoch': 3})
# scores

# train_test_split(cleaned_train[cols], pd.get_dummies(train[TARGET]), test_size=0.33)


In [None]:
predictions2 = kc.predict_classes(cleaned_train[cols].as_matrix())

In [None]:
predictions2

In [None]:
"""
1    34314
2    20514
0     4572
"""

"""
2    53609
0     5791
"""

pd.Series(predictions2).value_counts()

In [None]:
pd.Series(np.argmax(predictions, axis=1)).value_counts()

In [None]:
pd.Series(np.argmax(predictions2, axis=1)).value_counts()

In [None]:
pd.DataFrame(predictions).sum()

In [None]:
import tflearn
# Classification
tflearn.init_graph(num_cores=8, gpu_memory_fraction=0.5)

net = tflearn.input_data(shape=[None, 290])
net = tflearn.fully_connected(net, 64)
net = tflearn.dropout(net, 0.5)
net = tflearn.fully_connected(net, 3, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.DNN(net)
# Train the tflearn model defined above rather than the earlier Keras model.
model.fit(cleaned_train[cols].as_matrix(), pd.get_dummies(train[TARGET]).as_matrix(), validation_set=0.33, n_epoch=3, batch_size=10)

In [23]:
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.ensemble import BaggingClassifier
logistic = LogisticRegression()
print(cross_val_score(logistic, cleaned_train[cols], train[TARGET], n_jobs=-1).mean())
n_estimators=10
bagging = BaggingClassifier(logistic, max_samples=.2, n_estimators=n_estimators)
cross_val_score(bagging, cleaned_train[cols], train[TARGET], n_jobs=-1).mean()

0.744225589226




0.74473063973063969


In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier

n_estimators=100
svm = SVC(verbose=True, degree=5, kernel='poly')
bagging = BaggingClassifier(svm, max_samples=1.0 / n_estimators, n_estimators=n_estimators)
scores = cross_val_score(bagging, cleaned_train[cols], train[TARGET], n_jobs=-1)
scores