$$ \huge{\text{  Module 3 GridSearch and Pipelines}}$$

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

$ \huge{\text{ Overview}}$

<font size =4> Our goal in this section is to run many models in a single search and find out which processing steps and parameters produce the best model.  In our first example we will run 24 models in one grid search.  We will introduce two new procedures, __PCA__ and __Neural Networks__, which I describe only briefly here and cover in greater detail in __Module 4 and Module 5.__ 

# Pipeline
<font size=4>
    
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html 
    
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.
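The naming rule described above can be tried on a small pipeline. This is a minimal sketch: the step names `scale`, `pca`, and `log_reg` are labels chosen for illustration, and the parameter values are arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names are arbitrary labels; they become the prefixes used below.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("log_reg", LogisticRegression()),
])

# <step name>__<parameter name> reaches a parameter of one step.
pipe.set_params(pca__n_components=0.9, log_reg__C=2)
pca_setting = pipe.get_params()["pca__n_components"]
c_setting = pipe.named_steps["log_reg"].C

# Using the step name alone replaces the whole step;
# "passthrough" makes the pipeline skip that transformer.
pipe.set_params(pca="passthrough")
pca_step = pipe.named_steps["pca"]

print(pca_setting, c_setting, pca_step)
```

The same `step__parameter` keys are what a parameter grid uses, which is why the double underscore shows up again in the grid search examples.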




# GridSearch 
<font size=4>

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=grid%20searchcv

Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
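As a rough illustration of the description above, the sketch below searches a single parameter on a small data set. The iris data, the three `C` values, and the variable names are assumptions made for this example only and are not used elsewhere in this module.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = load_iris(return_X_y=True)

# One parameter with three candidate values gives three candidate models.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])

# With the default refit=True, predict uses the best candidate
# refit on all of the data passed to fit.
predictions = search.predict(X_demo[:5])

print(n_candidates, search.best_params_, search.best_score_)
```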

# Example [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)

<font size=4>Let's look at scikit-learn's PCA and Logistic Regression; specifically, let's look at all of their parameters.
    
<font color=blue>__[PCA Parameters](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)__</font>
    
<dt><strong>n_components</strong><span class="classifier">int, float or ‘mle’, default=None</span></dt><dd><p>Number of components to keep.
if n_components is not set all components are kept:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">n_components</span> <span class="o">==</span> <span class="nb">min</span><span class="p">(</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">n_features</span><span class="p">)</span>
</pre></div>
</div>
<p>If <code class="docutils literal notranslate"><span class="pre">n_components</span> <span class="pre">==</span> <span class="pre">'mle'</span></code> and <code class="docutils literal notranslate"><span class="pre">svd_solver</span> <span class="pre">==</span> <span class="pre">'full'</span></code>, Minka’s
MLE is used to guess the dimension. Use of <code class="docutils literal notranslate"><span class="pre">n_components</span> <span class="pre">==</span> <span class="pre">'mle'</span></code>
will interpret <code class="docutils literal notranslate"><span class="pre">svd_solver</span> <span class="pre">==</span> <span class="pre">'auto'</span></code> as <code class="docutils literal notranslate"><span class="pre">svd_solver</span> <span class="pre">==</span> <span class="pre">'full'</span></code>.</p>
<p>If <code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">&lt;</span> <span class="pre">n_components</span> <span class="pre">&lt;</span> <span class="pre">1</span></code> and <code class="docutils literal notranslate"><span class="pre">svd_solver</span> <span class="pre">==</span> <span class="pre">'full'</span></code>, select the
number of components such that the amount of variance that needs to be
explained is greater than the percentage specified by n_components.</p>
<p>If <code class="docutils literal notranslate"><span class="pre">svd_solver</span> <span class="pre">==</span> <span class="pre">'arpack'</span></code>, the number of components must be
strictly less than the minimum of n_features and n_samples.</p>
<p>Hence, the None case results in:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">n_components</span> <span class="o">==</span> <span class="nb">min</span><span class="p">(</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">n_features</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
</pre></div>
</div>
</dd>
<dt><strong>copy</strong><span class="classifier">bool, default=True</span></dt><dd><p>If False, data passed to fit are overwritten and running
fit(X).transform(X) will not yield the expected results,
use fit_transform(X) instead.</p>
</dd>
<dt><strong>whiten</strong><span class="classifier">bool, default=False</span></dt><dd><p>When True (False by default) the <code class="docutils literal notranslate"><span class="pre">components_</span></code> vectors are multiplied
by the square root of n_samples and then divided by the singular values
to ensure uncorrelated outputs with unit component-wise variances.</p>
<p>Whitening will remove some information from the transformed signal
(the relative variance scales of the components) but can sometime
improve the predictive accuracy of the downstream estimators by
making their data respect some hard-wired assumptions.</p>
</dd>
<dt><strong>svd_solver</strong><span class="classifier">{‘auto’, ‘full’, ‘arpack’, ‘randomized’}, default=’auto’</span></dt><dd><dl class="simple">
<dt>If auto :</dt><dd><p>The solver is selected by a default policy based on <code class="docutils literal notranslate"><span class="pre">X.shape</span></code> and
<code class="docutils literal notranslate"><span class="pre">n_components</span></code>: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient ‘randomized’
method is enabled. Otherwise the exact full SVD is computed and
optionally truncated afterwards.</p>
</dd>
<dt>If full :</dt><dd><p>run exact full SVD calling the standard LAPACK solver via
<code class="docutils literal notranslate"><span class="pre">scipy.linalg.svd</span></code> and select the components by postprocessing</p>
</dd>
<dt>If arpack :</dt><dd><p>run SVD truncated to n_components calling ARPACK solver via
<code class="docutils literal notranslate"><span class="pre">scipy.sparse.linalg.svds</span></code>. It requires strictly
0 &lt; n_components &lt; min(X.shape)</p>
</dd>
<dt>If randomized :</dt><dd><p>run randomized SVD by the method of Halko et al.</p>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.18.0.</span></p>
</div>
</dd>
<dt><strong>tol</strong><span class="classifier">float, default=0.0</span></dt><dd><p>Tolerance for singular values computed by svd_solver == ‘arpack’.
Must be of range [0.0, infinity).</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.18.0.</span></p>
</div>
</dd>
<dt><strong>iterated_power</strong><span class="classifier">int or ‘auto’, default=’auto’</span></dt><dd><p>Number of iterations for the power method computed by
svd_solver == ‘randomized’.
Must be of range [0, infinity).</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.18.0.</span></p>
</div>
</dd>
<dt><strong>random_state</strong><span class="classifier">int, RandomState instance or None, default=None</span></dt><dd><p>Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int
for reproducible results across multiple function calls.
See <a class="reference internal" href="../../glossary.html#term-random_state"><span class="xref std std-term">Glossary</span></a>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.18.0.</span></p>
</div>
</dd>
</dl>
</dd>

    
    
    
<font color=blue>__[Logistic Regression Parameters](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)__</font>
    
<dd class="field-odd"><dl>
<dt><strong>penalty</strong><span class="classifier">{‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’</span></dt><dd><p>Used to specify the norm used in the penalization. The ‘newton-cg’,
‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is
only supported by the ‘saga’ solver. If ‘none’ (not supported by the
liblinear solver), no regularization is applied.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.19: </span>l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)</p>
</div>
</dd>
<dt><strong>dual</strong><span class="classifier">bool, default=False</span></dt><dd><p>Dual or primal formulation. Dual formulation is only implemented for
l2 penalty with liblinear solver. Prefer dual=False when
n_samples &gt; n_features.</p>
</dd>
<dt><strong>tol</strong><span class="classifier">float, default=1e-4</span></dt><dd><p>Tolerance for stopping criteria.</p>
</dd>
<dt><strong>C</strong><span class="classifier">float, default=1.0</span></dt><dd><p>Inverse of regularization strength; must be a positive float.
Like in support vector machines, smaller values specify stronger
regularization.</p>
</dd>
<dt><strong>fit_intercept</strong><span class="classifier">bool, default=True</span></dt><dd><p>Specifies if a constant (a.k.a. bias or intercept) should be
added to the decision function.</p>
</dd>
<dt><strong>intercept_scaling</strong><span class="classifier">float, default=1</span></dt><dd><p>Useful only when the solver ‘liblinear’ is used
and self.fit_intercept is set to True. In this case, x becomes
[x, self.intercept_scaling],
i.e. a “synthetic” feature with constant value equal to
intercept_scaling is appended to the instance vector.
The intercept becomes <code class="docutils literal notranslate"><span class="pre">intercept_scaling</span> <span class="pre">*</span> <span class="pre">synthetic_feature_weight</span></code>.</p>
<p>Note! the synthetic feature weight is subject to l1/l2 regularization
as all other features.
To lessen the effect of regularization on synthetic feature weight
(and therefore on the intercept) intercept_scaling has to be increased.</p>
</dd>
<dt><strong>class_weight</strong><span class="classifier">dict or ‘balanced’, default=None</span></dt><dd><p>Weights associated with classes in the form <code class="docutils literal notranslate"><span class="pre">{class_label:</span> <span class="pre">weight}</span></code>.
If not given, all classes are supposed to have weight one.</p>
<p>The “balanced” mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as <code class="docutils literal notranslate"><span class="pre">n_samples</span> <span class="pre">/</span> <span class="pre">(n_classes</span> <span class="pre">*</span> <span class="pre">np.bincount(y))</span></code>.</p>
<p>Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.17: </span><em>class_weight=’balanced’</em></p>
</div>
</dd>
<dt><strong>random_state</strong><span class="classifier">int, RandomState instance, default=None</span></dt><dd><p>Used when <code class="docutils literal notranslate"><span class="pre">solver</span></code> == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the
data. See <a class="reference internal" href="../../glossary.html#term-random-state"><span class="xref std std-term">Glossary</span></a> for details.</p>
</dd>
<dt><strong>solver</strong><span class="classifier">{‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’},             default=’lbfgs’</span></dt><dd><p>Algorithm to use in the optimization problem.</p>
<ul class="simple">
<li><p>For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and
‘saga’ are faster for large ones.</p></li>
<li><p>For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’
handle multinomial loss; ‘liblinear’ is limited to one-versus-rest
schemes.</p></li>
<li><p>‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty</p></li>
<li><p>‘liblinear’ and ‘saga’ also handle L1 penalty</p></li>
<li><p>‘saga’ also supports ‘elasticnet’ penalty</p></li>
<li><p>‘liblinear’ does not support setting <code class="docutils literal notranslate"><span class="pre">penalty='none'</span></code></p></li>
</ul>
<p>Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on
features with approximately the same scale. You can
preprocess the data with a scaler from sklearn.preprocessing.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.17: </span>Stochastic Average Gradient descent solver.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.19: </span>SAGA solver.</p>
</div>
<div class="versionchanged">
<p><span class="versionmodified changed">Changed in version 0.22: </span>The default solver changed from ‘liblinear’ to ‘lbfgs’ in 0.22.</p>
</div>
</dd>
<dt><strong>max_iter</strong><span class="classifier">int, default=100</span></dt><dd><p>Maximum number of iterations taken for the solvers to converge.</p>
</dd>
<dt><strong>multi_class</strong><span class="classifier">{‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’</span></dt><dd><p>If the option chosen is ‘ovr’, then a binary problem is fit for each
label. For ‘multinomial’ the loss minimised is the multinomial loss fit
across the entire probability distribution, <em>even when the data is
binary</em>. ‘multinomial’ is unavailable when solver=’liblinear’.
‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’,
and otherwise selects ‘multinomial’.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.18: </span>Stochastic Average Gradient descent solver for ‘multinomial’ case.</p>
</div>
<div class="versionchanged">
<p><span class="versionmodified changed">Changed in version 0.22: </span>Default changed from ‘ovr’ to ‘auto’ in 0.22.</p>
</div>
</dd>
<dt><strong>verbose</strong><span class="classifier">int, default=0</span></dt><dd><p>For the liblinear and lbfgs solvers set verbose to any positive
number for verbosity.</p>
</dd>
<dt><strong>warm_start</strong><span class="classifier">bool, default=False</span></dt><dd><p>When set to True, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution.
Useless for liblinear solver. See <a class="reference internal" href="../../glossary.html#term-warm-start"><span class="xref std std-term">the Glossary</span></a>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 0.17: </span><em>warm_start</em> to support <em>lbfgs</em>, <em>newton-cg</em>, <em>sag</em>, <em>saga</em> solvers.</p>
</div>
</dd>
<dt><strong>n_jobs</strong><span class="classifier">int, default=None</span></dt><dd><p>Number of CPU cores used when parallelizing over classes if
multi_class=’ovr’”. This parameter is ignored when the <code class="docutils literal notranslate"><span class="pre">solver</span></code> is
set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or
not. <code class="docutils literal notranslate"><span class="pre">None</span></code> means 1 unless in a <a class="reference external" href="https://joblib.readthedocs.io/en/latest/parallel.html#joblib.parallel_backend" title="(in joblib v0.17.0.dev0)"><code class="xref py py-obj docutils literal notranslate"><span class="pre">joblib.parallel_backend</span></code></a>
context. <code class="docutils literal notranslate"><span class="pre">-1</span></code> means using all processors.
See <a class="reference internal" href="../../glossary.html#term-n-jobs"><span class="xref std std-term">Glossary</span></a> for more details.</p>
</dd>
<dt><strong>l1_ratio</strong><span class="classifier">float, default=None</span></dt><dd><p>The Elastic-Net mixing parameter, with <code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">&lt;=</span> <span class="pre">l1_ratio</span> <span class="pre">&lt;=</span> <span class="pre">1</span></code>. Only
used if <code class="docutils literal notranslate"><span class="pre">penalty='elasticnet'</span></code>. Setting <code class="docutils literal notranslate"><span class="pre">l1_ratio=0</span></code> is equivalent
to using <code class="docutils literal notranslate"><span class="pre">penalty='l2'</span></code>, while setting <code class="docutils literal notranslate"><span class="pre">l1_ratio=1</span></code> is equivalent
to using <code class="docutils literal notranslate"><span class="pre">penalty='l1'</span></code>. For <code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">&lt;</span> <span class="pre">l1_ratio</span> <span class="pre">&lt;1</span></code>, the penalty is a
combination of L1 and L2.</p>
</dd>
</dl>
</dd>
    
    

<font size=4 color =red>__That is a lot of parameters to choose from to fine tune our models__</font>   

<font size=4> Let's just pick a couple of parameters from each and give them several different values.
    
__From PCA__
    
   n_components [.85,.9,.95] 
  
    
__From Logistic Regression__
    
    C [1,2,3,4]
    solver ['lbfgs','newton-cg']
    
    
<font color=brown>  How many models is that total?  $ 3\cdot 4 \cdot 2$
    
    
<font color=red> __It is subtle, but notice the double underscores in the code that creates the parameter grid: they come after the step names pca and log_reg and separate the step name from the parameter name.__
    
    clf = Pipeline([("scale",StandardScaler()),
        ("pca", PCA()),
        ("log_reg", LogisticRegression())
        ])

    param_grid = [{
        "pca__n_components":[.85,.9,.95] ,
        "log_reg__C":[1,2,3,4],
        "log_reg__solver":['lbfgs','newton-cg']
    }]

In [35]:
3*4*2

24
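The count can also be checked by letting scikit-learn expand the grid. This sketch uses `ParameterGrid` from `sklearn.model_selection`; the variable names `n_models`, `n_fits`, and `first` are chosen here for illustration, and the 3 folds are an assumption matching a search run with `cv=3`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{
    "pca__n_components": [.85, .9, .95],
    "log_reg__C": [1, 2, 3, 4],
    "log_reg__solver": ['lbfgs', 'newton-cg'],
}]

grid = ParameterGrid(param_grid)
n_models = len(grid)      # every combination of the listed values
first = list(grid)[0]     # one combination, as a dict of settings

# Each candidate is fit once per fold, so 3-fold CV multiplies the work by 3.
n_fits = n_models * 3

print(n_models, n_fits)
print(first)
```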

In [3]:
from sklearn import datasets
np.random.seed(5)
data = datasets.load_digits()
X=data.data
y=data.target

<font size=4>  Let's take a quick look at what [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) does before we introduce the pipeline.  In this Module Exercise we use PCA as a dimension reduction technique.  Our digits data set has 1797 samples and 64 features (8 $\times$ 8 pixels). __Dimension reduction reduces the number of features of each sample.__ Let's look at a PCA where we keep 85% of the variance.  

In [8]:
from sklearn.decomposition import PCA
pca=PCA(n_components=.85)#change to .99
X_c=pca.fit_transform(X)
X.shape,X_c.shape

((1797, 64), (1797, 17))
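To see where the number of kept features comes from, one can look at the cumulative explained variance of a full PCA. The sketch below is an illustration, not part of the original notebook run: the names `cumulative`, `k`, and `reduced` are chosen here, and the counting rule (first component count whose cumulative share exceeds the target) is my reading of how a fractional `n_components` is applied.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data

# Fit a PCA that keeps every component, then accumulate the variance shares.
full = PCA().fit(X_digits)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 85%.
k = int(np.searchsorted(cumulative, 0.85, side="right") + 1)

# Let PCA make the same choice from a fractional n_components.
reduced = PCA(n_components=0.85).fit(X_digits)

print(k, reduced.n_components_)
```

Raising the fraction (for example to .99) moves the cutoff further along the same cumulative curve, so more components are kept.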

<font size=4>  Our reduced data set now has 17 features for each sample. Let's see how it performs with Logistic Regression.  

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler



clf = Pipeline([
        ("pca", PCA()),
        ("scale", StandardScaler()),
        ("log_reg", LogisticRegression())
        ])

param_grid = [{
        "pca__n_components":[.85,.9,.95] ,
        "log_reg__C":[1,2,3,4],
        "log_reg__solver":['lbfgs','newton-cg']
    }]
grid_search = GridSearchCV(clf, param_grid, cv=3,return_train_score=True,scoring='accuracy')
grid_search.fit(X, y)

print("Best Model",grid_search.best_params_)
print("Best Score", grid_search.best_score_)

Best Model {'log_reg__C': 1, 'log_reg__solver': 'lbfgs', 'pca__n_components': 0.9}
Best Score 0.9154145798553145


<font size=4> Let's look at all of the results

In [11]:
DF=pd.DataFrame(grid_search.cv_results_)
DF=DF.sort_values(by=['rank_test_score'])
Summary=DF[[ "param_pca__n_components",
            "param_log_reg__C","param_log_reg__solver","mean_test_score","rank_test_score"]]
Summary.head(20)


| | param_pca__n_components | param_log_reg__C | param_log_reg__solver | mean_test_score | rank_test_score |
|---|---|---|---|---|---|
| 1 | 0.9 | 1 | lbfgs | 0.915415 | 1 |
| 4 | 0.9 | 1 | newton-cg | 0.915415 | 1 |
| 7 | 0.9 | 2 | lbfgs | 0.913745 | 3 |
| 10 | 0.9 | 2 | newton-cg | 0.913745 | 3 |
| 11 | 0.95 | 2 | newton-cg | 0.912632 | 5 |
| 2 | 0.95 | 1 | lbfgs | 0.912632 | 5 |
| 5 | 0.95 | 1 | newton-cg | 0.912632 | 5 |
| 8 | 0.95 | 2 | lbfgs | 0.912632 | 5 |
| 17 | 0.95 | 3 | newton-cg | 0.912076 | 9 |
| 14 | 0.95 | 3 | lbfgs | 0.912076 | 9 |


## <font color=blue> Exercise:  Compare this with a pipeline that does not use PCA.  I had to raise max_iter to 500. Discuss the results: did we lose much accuracy by doing PCA?  What did we gain?

In [13]:
#clf = Pipeline([ ("scale",StandardScaler()),
#        ("log_reg", LogisticRegression())])
clf = Pipeline([ ("scale",StandardScaler()),
       ("log_reg", LogisticRegression(max_iter=500))])

param_grid = [{"log_reg__C":[1,2,3,4],
        "log_reg__solver":['lbfgs','newton-cg']}]
grid_search = GridSearchCV(clf, param_grid, cv=3,return_train_score=True,scoring='accuracy')
grid_search.fit(X, y)

print("Best Model",grid_search.best_params_)
print("Best Score", grid_search.best_score_)

Best Model {'log_reg__C': 1, 'log_reg__solver': 'lbfgs'}
Best Score 0.9298831385642737


## <font color=blue> Exercise:  Does scaling really help?  Let's take it away. 

In [15]:
clf = Pipeline([ 
       ("log_reg", LogisticRegression(max_iter=5000))])

param_grid = [{"log_reg__C":[1,2,3,4],
        "log_reg__solver":['lbfgs','newton-cg']}]
grid_search = GridSearchCV(clf, param_grid, cv=3,return_train_score=True,scoring='accuracy')
grid_search.fit(X, y)

print("Best Model",grid_search.best_params_)
print("Best Score", grid_search.best_score_)

Best Model {'log_reg__C': 2, 'log_reg__solver': 'lbfgs'}
Best Score 0.9293266555370061


<font size =5 color =red> __I don't like this one: Why?__

# [Neural nets](https://en.wikipedia.org/wiki/Neural_network)

<font size=4> 
    
Let's look at Neural Nets and some of their parameters from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).  Let's build a pipeline with a standard scaler and a Neural Net for the wine data set.  We will use 
    
    param_grid = [{
        "mlp_clf__hidden_layer_sizes":[(16,16,16,16),(256,256,256,256)],
        "mlp_clf__solver":['lbfgs','sgd','adam'],
        "mlp_clf__activation":['identity', 'logistic', 'tanh', 'relu']
        }]    

In [17]:
from sklearn import datasets
wine = datasets.load_wine()
features=np.array(wine.feature_names)
features
X=wine.data
y=wine.target
X.shape

(178, 13)

In [18]:
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


clf = Pipeline([ ("scale",StandardScaler()),("mlp_clf", MLPClassifier(max_iter=1000))])


param_grid = [{
        "mlp_clf__hidden_layer_sizes":[(16,16,16,16),(256,256,256,256)],
        "mlp_clf__solver":['lbfgs','sgd','adam'],
        "mlp_clf__activation":['identity', 'logistic', 'tanh', 'relu']
        }]

In [19]:
grid_search = GridSearchCV(clf, param_grid, cv=4,return_train_score=True,scoring='accuracy')
grid_search.fit(X, y)

print("Best Model",grid_search.best_params_)
print("Best Score", grid_search.best_score_)

Best Model {'mlp_clf__activation': 'relu', 'mlp_clf__hidden_layer_sizes': (256, 256, 256, 256), 'mlp_clf__solver': 'sgd'}
Best Score 0.9833333333333334


<font size =4>  Let's look at the shape of the matrices.

In [62]:
grid_search.best_estimator_.steps[1][1].coefs_
M1=grid_search.best_estimator_.named_steps['mlp_clf'].coefs_[0]
M2=grid_search.best_estimator_.named_steps['mlp_clf'].coefs_[1]
np.shape(M1), np.shape(M2)

((13, 256), (256, 256))
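Each entry of `coefs_` is the weight matrix between two consecutive layers, with shape (units in, units out). The sketch below fits the smaller `(16, 16, 16, 16)` network directly, without a grid search, to list every shape. The names `net`, `weight_shapes`, and `bias_shapes` and the `random_state=0` setting are assumptions made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)  # 13 features, 3 classes

net = Pipeline([
    ("scale", StandardScaler()),
    ("mlp_clf", MLPClassifier(hidden_layer_sizes=(16, 16, 16, 16),
                              max_iter=1000, random_state=0)),
])
net.fit(X_wine, y_wine)

mlp = net.named_steps["mlp_clf"]
# One weight matrix per pair of consecutive layers:
# input -> hidden 1, three hidden -> hidden, last hidden -> output.
weight_shapes = [W.shape for W in mlp.coefs_]
# One bias vector per non-input layer.
bias_shapes = [b.shape for b in mlp.intercepts_]

print(weight_shapes)
print(bias_shapes)
```

With four hidden layers there are five weight matrices: the first has 13 rows (one per input feature) and the last has 3 columns (one per wine class).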

In [20]:
DF=pd.DataFrame(grid_search.cv_results_)
#DF

In [61]:
DF=DF.sort_values(by=['rank_test_score'])
Summary=DF[["param_mlp_clf__activation","param_mlp_clf__hidden_layer_sizes","param_mlp_clf__solver","mean_test_score","rank_test_score"]]
Summary.head(10)

| | param_mlp_clf__activation | param_mlp_clf__hidden_layer_sizes | param_mlp_clf__solver | mean_test_score | rank_test_score |
|---|---|---|---|---|---|
| 22 | relu | (256, 256, 256, 256) | sgd | 0.983333 | 1 |
| 17 | tanh | (256, 256, 256, 256) | adam | 0.983207 | 2 |
| 11 | logistic | (256, 256, 256, 256) | adam | 0.977652 | 3 |
| 16 | tanh | (256, 256, 256, 256) | sgd | 0.977652 | 3 |
| 21 | relu | (256, 256, 256, 256) | lbfgs | 0.972222 | 5 |
| 4 | identity | (256, 256, 256, 256) | sgd | 0.972096 | 6 |
| 13 | tanh | (16, 16, 16, 16) | sgd | 0.972096 | 6 |
| 23 | relu | (256, 256, 256, 256) | adam | 0.972096 | 8 |
| 6 | logistic | (16, 16, 16, 16) | lbfgs | 0.972096 | 8 |
| 5 | identity | (256, 256, 256, 256) | adam | 0.972096 | 8 |


## <font color=blue> Exercise:  Compare a Neural Net and a Logistic Regression on the wine data set in a pipeline.  Use the same parameter grids as above for both the MLP and the Logistic Regression. Do not do any PCA. 