# Model Selection

Exhaustively fine-tuning every algorithm can be very time-consuming. Still, the question remains: "Which model should I choose?" We can begin to answer it by first understanding the data.

**Prerequisites**

* Is online learning required? In other words, will the model update in real time from a sequential stream of incoming data, or will learning happen at the batch level, with all training done in one go?
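
To make the distinction concrete, here is a minimal sketch in plain Python, not a production implementation. The class `OnlineLogisticRegression`, the toy generator `make_stream`, and the learning rate and epoch counts are all hypothetical choices made for illustration. The same SGD update is used in both modes; the only difference is whether examples are consumed once as they arrive or replayed from a stored dataset.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticRegression:
    """Logistic regression trained with per-example SGD updates (illustrative)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def partial_fit(self, x, y):
        """Online step: update the weights from a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

    def fit(self, X, y, epochs=20):
        """Batch-style training: repeated passes over a stored dataset."""
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                self.partial_fit(xi, yi)
        return self


def make_stream(n, seed=0):
    """Hypothetical toy stream: the label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, int(x[0] + x[1] > 0)


def accuracy(model, samples):
    hits = sum((model.predict_proba(x) > 0.5) == bool(y) for x, y in samples)
    return hits / len(samples)


# Online: each example is seen exactly once, as it arrives.
online_model = OnlineLogisticRegression(n_features=2)
for x, y in make_stream(2000):
    online_model.partial_fit(x, y)

# Batch: the whole dataset is collected first, then trained on in one go.
data = list(make_stream(2000))
X = [x for x, _ in data]
y = [label for _, label in data]
batch_model = OnlineLogisticRegression(n_features=2).fit(X, y)

test_set = list(make_stream(500, seed=1))
online_acc = accuracy(online_model, test_set)
batch_acc = accuracy(batch_model, test_set)
print(f"online: {online_acc:.3f}  batch: {batch_acc:.3f}")
```

If the answer to the prerequisite question is "yes", the candidate models are limited to those that support an incremental update like `partial_fit` above.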


**Exploratory Data Analysis**

* What is the size and dimension (shape) of the training dataset?
* Is the data linearly separable?
* Are the features independent?
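
The three questions above can be probed with a few lines of code. The following is a rough sketch in plain Python under stated assumptions: the helper names (`shape`, `correlated_pairs`, `perceptron_separates`), the 0.8 correlation threshold, and the toy dataset are all hypothetical. Pearson correlation only detects linear dependence between features, and a perceptron that fails to converge within `max_epochs` is suggestive rather than conclusive evidence that the data is not linearly separable.

```python
import math


def shape(X):
    """Return (n_samples, n_features) for a list-of-rows dataset."""
    n_samples = len(X)
    n_features = len(X[0]) if X else 0
    return n_samples, n_features


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def correlated_pairs(X, threshold=0.8):
    """List (i, j, r) for feature pairs whose |r| exceeds the threshold."""
    columns = list(zip(*X))
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            if abs(r) > threshold:
                pairs.append((i, j, round(r, 3)))
    return pairs


def perceptron_separates(X, y, max_epochs=1000):
    """Return True if a perceptron finds a separating hyperplane.

    Convergence proves linear separability; reaching max_epochs without
    converging is only a hint that the data is not separable.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            target = 1 if yi == 1 else -1
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if target * activation <= 0:
                w = [wj + target * xj for wj, xj in zip(w, xi)]
                b += target
                mistakes += 1
        if mistakes == 0:
            return True
    return False


# Toy data: the third feature is exactly twice the second one.
X = [[0, 0, 0], [0, 1, 2], [1, 0, 0], [1, 1, 2]]
y_and = [0, 0, 0, 1]  # AND of the first two features: separable
y_xor = [0, 1, 1, 0]  # XOR of the first two features: not separable

print(shape(X))
print(correlated_pairs(X))
print(perceptron_separates(X, y_and))
print(perceptron_separates(X, y_xor, max_epochs=200))
```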


## Comparison of all Models

In [11]:
import pandas as pd
import textwrap
from collections import OrderedDict as d
from tabulate import tabulate


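def wrap(row, width=30):
    """Wrap every value of ``row`` to ``width`` characters, in place.

    textwrap.wrap() collapses existing whitespace, so newlines already in a
    value become spaces before the text is re-wrapped.
    """
    for k, v in row.items():
        row[k] = '\n'.join(textwrap.wrap(v, width=width))
    return row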


model_comp = [wrap(d({'model': 'naive bayes',
                      'use when': 'small data with independent features\n'
                                  'large data without feature independence',
                      'advantages': 'fast\n'
                                    'low variance\n',
                      'disadvantages': 'high bias'})),
              wrap(d({'model': 'logistic regression',
                      'use when': 'data is ~ linearly separable, or separable after transformation',
                      'advantages': 'scalable with SGD optimization\n'
                                    'low bias\n',
                      'disadvantages': 'high variance\n'
                                       "doesn't perform well with highly dimensional dataset"})),
              wrap(d({'model': 'svm',
                      'use when': 'high dimensional dataset (equip with nonlinear kernel such as RBF)',
                      'advantages': 'on par with logistic regression with a linear kernel\n',
                      'disadvantages': 'parameter tuning can be successful but will require a lot of computational power and memory'})),
              wrap(d({'model': 'random forest',
                      'use when': 'linear separability is an issue with other algorithms\n',
                      'advantages': 'categorical features do not require encoding\n'
                                    'easy to explain to nonpractitioners',
                      'disadvantages': ''})),
              wrap(d({'model': 'neural networks',
                      'use when': 'you have a lot of data and you know what you are doing',
                      'advantages': 'provably powerful (esp. deep learning)',
                      'disadvantages': 'finding the correct topology (layer structure) is difficult\n'
                                       'computationally expensive and time consuming\n'})),
              ]


model_comp_df = pd.DataFrame(model_comp)
print(tabulate(model_comp_df, headers='keys', tablefmt='pipe'))

|    | model               | use when                       | advantages                     | disadvantages                  |
|---:|:--------------------|:-------------------------------|:-------------------------------|:-------------------------------|
|  0 | naive bayes         | small data with independent    | fast low variance              | high bias                      |
|    |                     | features large data without    |                                |                                |
|    |                     | feature independence           |                                |                                |
|  1 | logistic regression | data is ~ linearly separable,  | scalable with SGD optimization | high variance doesn't perform  |
|    |                     | or separable after             | low bias                       | well with highly dimensional   |
|    |                     | transformation                 |                                | dataset                        |
|  2 | svm                 | high dimensional dataset       | on par with logistic           | parameter tuning can be        |
|    |                     | (equip with nonlinear kernel   | regression with a linear       | successful but will require a  |
|    |                     | such as RBF)                   | kernel                         | lot of computational power and |
|    |                     |                                |                                | memory                         |
|  3 | random forest       | linear separability is an      | categorical features do not    |                                |
|    |                     | issue with other algorithms    | require encoding easy to       |                                |
|    |                     |                                | explain to nonpractitioners    |                                |
|  4 | neural networks     | you have a lot of data and you | provably powerful (esp. deep   | finding the correct topology   |
|    |                     | know what you are doing        | learning)                      | (layer structure) is difficult |
|    |                     |                                |                                | computationally expensive and  |
|    |                     |                                |                                | time consuming                 |
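
The "use when" column of the comparison table can be read as a set of rules of thumb. The sketch below encodes them as a small function, purely as an illustration. The function name `suggest_models` and every numeric threshold (what counts as "small", "large", or "high dimensional") are assumptions made here, not values taken from the table, and the result is a shortlist to evaluate rather than a final choice.

```python
def suggest_models(n_samples, n_features, linearly_separable,
                   features_independent, small_data=1_000,
                   large_data=100_000, high_dim=1_000):
    """Shortlist models using the table's 'use when' column.

    All thresholds are hypothetical defaults; tune them to your setting.
    """
    small = n_samples <= small_data
    large = n_samples >= large_data
    high_dimensional = n_features >= high_dim

    candidates = []
    # Naive Bayes: small data with independent features,
    # or large data without feature independence.
    if (small and features_independent) or (large and not features_independent):
        candidates.append('naive bayes')
    # Logistic regression: roughly linearly separable, not high dimensional.
    if linearly_separable and not high_dimensional:
        candidates.append('logistic regression')
    # SVM: high-dimensional data.
    if high_dimensional:
        candidates.append('svm')
    # Random forest: linear separability is a problem for other models.
    if not linearly_separable:
        candidates.append('random forest')
    # Neural networks: a lot of data is available.
    if large:
        candidates.append('neural networks')
    return candidates


small_case = suggest_models(500, 20, linearly_separable=True,
                            features_independent=True)
large_case = suggest_models(1_000_000, 5_000, linearly_separable=False,
                            features_independent=False)
print(small_case)
print(large_case)
```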
