# ML Search

This notebook searches for the best ML algorithm and the best combination of features for the dataset. This differs from scikit-learn's Grid Search, which searches over the parameters of a single algorithm only.

The basic flow here is:

- create all combinations of the column groups
- for each combination
    - read the appropriate data
    - run a series of machine learning algorithms
    - order the algorithms by their results, best first
    - save the predictions of the 2 best algorithms
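
The first step of the flow above can be sketched with `itertools.combinations`. This is only an illustration: the group names are the ones used in this notebook, while the variable names `all_subsets` and `proper_subsets` are made up for the example. It shows how many combinations there are with and without the full six-group set, since whether to include the full set is a choice to make explicitly.

```python
from itertools import combinations

groups = ['balance', 'location', 'nu_info', 'personal', 'raw_scores', 'scores_class']

# Every non-empty subset, from single groups up to all six together.
all_subsets = [c for size in range(1, len(groups) + 1)
               for c in combinations(groups, size)]

# The same enumeration without the full six-group set.
proper_subsets = [c for c in all_subsets if len(c) < len(groups)]

print(len(all_subsets), len(proper_subsets))
print(all_subsets[:3])
```

With six groups there are 2^6 - 1 = 63 non-empty subsets, and 62 once the full set is left out.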

In [1]:
%pylab inline
import sklearn
import pandas as pd
import mylib.utils as mu
from itertools import combinations 
import seaborn as sns
from mylib.utils import print_time

Populating the interactive namespace from numpy and matplotlib


In [2]:
groups = ['balance', 'location', 'nu_info', 'personal', 'raw_scores', 'scores_class']
# every combination of 1 to len(groups) - 1 groups; the full six-group set is not generated
all_combinations = [p for i in range(1, len(groups)) for p in combinations(groups, i)]
target = mu.load_target_data()

### Creating a benchmark
Check what the result would be if the predictions were all 0, all 1, or random guesses.

In [3]:
# split the target to get the validation labels
X_train, X_val, y_train, y_val = mu.split_data(target, target)
print_time('All Zeros')
mu.classification_report_matrix(y_val, [0]*len(y_val))
print_time('All Ones')
mu.classification_report_matrix(y_val, [1]*len(y_val))
print_time('Random Guess')
# average F1 over 100 random guesses
print_time(mu.np.mean([mu.f1_scorer(y_val, mu.np.random.randint(2, size=len(y_val))) for _ in range(100)]))

02:51:49 10/08/15 BRT - All Zeros
02:51:49 10/08/15 BRT - 
              precision    recall  f1-score   support

          0       0.48      1.00      0.65        31
          1       0.00      0.00      0.00        33

avg / total       0.23      0.48      0.32        64

02:51:49 10/08/15 BRT - 
 [[31  0]
 [33  0]]
02:51:49 10/08/15 BRT - All Ones
02:51:49 10/08/15 BRT - 
              precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.52      1.00      0.68        33

avg / total       0.27      0.52      0.35        64

02:51:49 10/08/15 BRT - 
 [[ 0 31]
 [ 0 33]]
02:51:49 10/08/15 BRT - Random Guess
02:51:49 10/08/15 BRT - 0.515750165829
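
The "avg / total" F1 values in the two reports above (0.32 and 0.35) can be reproduced by hand as a support-weighted average of the per-class F1 scores. The sketch below is plain Python and does not use `mylib`; the helper `weighted_f1` is a hypothetical name, and treating `mu.f1_scorer` as this same weighted F1 is an assumption, not something the notebook confirms.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores."""
    total = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * support / len(y_true)
    return total

# Same class balance as the validation set above: 31 zeros and 33 ones.
y_val = [0] * 31 + [1] * 33
all_zeros = weighted_f1(y_val, [0] * len(y_val))
all_ones = weighted_f1(y_val, [1] * len(y_val))
print(round(all_zeros, 2), round(all_ones, 2))
```

Any model that scores close to these constant predictors, or to the random-guess average, has learned little from the features.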




### Search over all combinations

In [4]:
proc_name = 'combination'
best_score = 0
best_run = None
best_models = {}
for comb in all_combinations:
    # load only the data from the combination
    print_time('Processing combination {}'.format(comb))
    data = mu.load_data(comb)
    results = mu.train_regression(data, mu.np.ravel(target), scorer=mu.f1_scorer)
    
    # take the best model for this combination and its validation score
    run = results[0][1]
    run['comb'] = comb
    score_test = run['score_val']
    print_time('Test score {}'.format(score_test))
    
    # keep track of the best run so far
    if best_score < score_test:
        best_score = score_test
        best_run = run

    # keep every run, keyed by validation score (runs with equal scores overwrite each other)
    best_models[score_test] = run

02:51:49 10/08/15 BRT - Processing combination ('balance',)
02:51:49 10/08/15 BRT - Created train and validation
02:51:49 10/08/15 BRT - Size train: (570, 9) test:(64, 9)
02:51:49 10/08/15 BRT - Starting to train models
02:51:49 10/08/15 BRT - Took 0.287822961807 seconds
02:51:49 10/08/15 BRT - Test score 0.533333333333
02:51:49 10/08/15 BRT - Processing combination ('location',)
02:51:49 10/08/15 BRT - Created train and validation
02:51:49 10/08/15 BRT - Size train: (570, 246) test:(64, 246)
02:51:49 10/08/15 BRT - Starting to train models
02:51:50 10/08/15 BRT - Took 1.08130288124 seconds
02:51:50 10/08/15 BRT - Test score 0.49098621421
02:51:50 10/08/15 BRT - Processing combination ('nu_info',)
02:51:50 10/08/15 BRT - Created train and validation
02:51:50 10/08/15 BRT - Size train: (570, 8) test:(64, 8)
02:51:50 10/08/15 BRT - Starting to train models
02:51:51 10/08/15 BRT - Took 0.290830135345 seconds
02:51:51 10/08/15 BRT - Test score 0.59748427673
02:51:51 10/08/15 BRT - Processi
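
Because `best_models` is keyed by score, two combinations that reach exactly the same validation score cannot both be kept. One possible alternative, sketched here under assumptions, is to collect the runs in a list and sort it afterwards. The helper `top_runs` and the toy runs are hypothetical and only illustrate the idea; the keys `comb` and `score_val` are the ones used in this notebook.

```python
def top_runs(runs, k):
    """Return the k runs with the highest validation score, best first."""
    return sorted(runs, key=lambda run: run['score_val'], reverse=True)[:k]

# Made-up runs for illustration only; two of them tie on purpose.
runs = [
    {'comb': ('a',), 'score_val': 0.60},
    {'comb': ('b',), 'score_val': 0.66},
    {'comb': ('a', 'b'), 'score_val': 0.60},
    {'comb': ('c',), 'score_val': 0.55},
]

best = top_runs(runs, 3)
print([run['comb'] for run in best])

# A dict keyed by score keeps only one of the two tied runs.
by_score = {run['score_val']: run for run in runs}
print(len(runs), len(by_score))
```

Python's sort is stable, so tied runs keep the order in which they were produced.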



In [5]:
print_time('Best run is:')
print_time('Comb: {}'.format(best_run['comb']))
print_time('Model: {}'.format(best_run['model']))
print_time('Scores: train - {} :: validation - {}'.
           format(best_run['score_train'], best_run['score_val']))
mu.classification_report_matrix(best_run['y_val'], best_run['pred_val'])
mu.save_predictions_from_model(best_run, 'mlsearch_{}'.format(best_run['comb']))

02:52:50 10/08/15 BRT - Best run is:
02:52:50 10/08/15 BRT - Comb: ('location', 'nu_info', 'personal', 'raw_scores', 'scores_class')
02:52:50 10/08/15 BRT - Model: LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', refit=True,
           scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
02:52:50 10/08/15 BRT - Scores: train - 0.568329650493 :: validation - 0.661886792453
02:52:50 10/08/15 BRT - 
              precision    recall  f1-score   support

          0       0.61      0.87      0.72        31
          1       0.80      0.48      0.60        33

avg / total       0.71      0.67      0.66        64

02:52:50 10/08/15 BRT - 
 [[27  4]
 [17 16]]
02:52:50 10/08/15 BRT - Saved ./Output/pred_mlsearch_('location', 'nu_info', 'personal', 'raw_scores', 'scores_class')_661_LogisticRegressionCV.csv


In [6]:
num_best = 3
# scores are sorted ascending; the slice skips the two highest and takes the next num_best
for score in sorted(best_models.keys())[-num_best-2:-2]:
    model = best_models[score]
    print_time('Comb: {}'.format(model['comb']))
    print_time('Model: {}'.format(model['model']))
    print_time('Scores: train - {} :: validation - {}'.
               format(model['score_train'], model['score_val']))
    mu.classification_report_matrix(model['y_val'], model['pred_val'])
    mu.save_predictions_from_model(model, 'mlsearch_{}'.format(model['comb']))

02:52:50 10/08/15 BRT - Comb: ('location', 'nu_info', 'personal')
02:52:50 10/08/15 BRT - Model: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=56909,
  shrinking=True, tol=0.001, verbose=False)
02:52:50 10/08/15 BRT - Scores: train - 0.623556647588 :: validation - 0.625159154571
02:52:50 10/08/15 BRT - 
              precision    recall  f1-score   support

          0       0.59      0.87      0.70        31
          1       0.78      0.42      0.55        33

avg / total       0.69      0.64      0.62        64

02:52:50 10/08/15 BRT - 
 [[27  4]
 [19 14]]
02:52:50 10/08/15 BRT - Saved ./Output/pred_mlsearch_('location', 'nu_info', 'personal')_625_SVC.csv
02:52:50 10/08/15 BRT - Comb: ('balance', 'nu_info', 'personal', 'raw_scores', 'scores_class')
02:52:50 10/08/15 BRT - Model: LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scali

### Dimensionality reduction
The main table has more than 260 columns for only 634 rows, which makes the models prone to overfitting. Applying PCA may bring a beneficial reduction in the number of columns.

In [7]:
from sklearn.decomposition import PCA, RandomizedPCA 

In [8]:
data = mu.load_data(['balance', 'location', 'nu_info', 'personal', 'raw_scores', 'scores_class'])
pca = RandomizedPCA()

02:52:50 10/08/15 BRT - Loaded info group location, current shape of data (634, 256)
02:52:51 10/08/15 BRT - Loaded info group nu_info, current shape of data (634, 264)
02:52:51 10/08/15 BRT - Loaded info group personal, current shape of data (634, 286)
02:52:51 10/08/15 BRT - Loaded info group raw_scores, current shape of data (634, 291)
02:52:51 10/08/15 BRT - Loaded info group scores_class, current shape of data (634, 295)


In [9]:
pca_train = pca.fit_transform(data)

In [10]:
results = mu.train_regression(pca_train, ravel(target))
mu.print_best(results, 2)

02:52:51 10/08/15 BRT - Created train and validation
02:52:51 10/08/15 BRT - Size train: (570, 294) test:(64, 294)
02:52:51 10/08/15 BRT - Starting to train models
02:52:55 10/08/15 BRT - Took 4.56168603897 seconds
------------------------------------------------- 
0 Model - Score: val - 0.640625 :: train - 0.621053
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', refit=True,
           scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
------------------------------------------------- 
------------------------------------------------- 
1 Model - Score: val - 0.593750 :: train - 0.710526
RidgeClassifierCV(alphas=array([  0.1,   1. ,  10. ]), class_weight=None,
         cv=None, fit_intercept=True, normalize=False, scoring=None)
------------------------------------------------- 
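
In the run above the training matrix still has 294 columns after the transformation, so with default settings the PCA step drops almost nothing. A common way to get an actual reduction is to keep only the components needed to explain a chosen share of the variance. The sketch below shows this with plain NumPy on synthetic data. The helper `pca_reduce`, the 95% threshold, and the synthetic matrix are all assumptions made for illustration; it does not use the notebook's data.

```python
import numpy as np

def pca_reduce(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components that explain
    at least `variance_to_keep` of the total variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; S holds the singular values.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    cumulative = np.cumsum(explained)
    # First position where the cumulative share reaches the threshold.
    n_components = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    n_components = min(n_components, len(S))
    return np.dot(X_centered, Vt[:n_components].T), n_components

# Synthetic data: 634 rows, 50 columns driven by 5 latent factors plus small noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(634, 5))
mixing = rng.normal(size=(5, 50))
X = np.dot(latent, mixing) + 0.01 * rng.normal(size=(634, 50))

X_reduced, k = pca_reduce(X)
print(X.shape, '->', X_reduced.shape)
```

To keep the validation score honest, the projection would ideally be fitted on the training rows only and then applied to the validation rows.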
