This script contains a variety of additional features that are relevant when using scikit-learn. It is meant as an example and should provide inspiration for finding the best classifier.

---
## List of Classifiers 
scikit-learn contains a lot of classifiers. For a list, you could examine [this thread](https://stackoverflow.com/questions/41844311/list-of-all-classification-algorithms) on Stack Overflow, although it does not contain all classifiers. Classifiers from external packages, such as XGBoost, are not included. For future-proofing, I have also added the list at the end of the document.

---
## Pipelines
scikit-learn allows you to make pipelines, which become very convenient when testing out multiple classifiers. For instance:


In [3]:
from classification import read_imdb

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


imdb = read_imdb()
X_train, X_test, y_train, y_test = train_test_split(imdb.text, imdb.tag)
text_clf = Pipeline([('vect', CountVectorizer()), 
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)
predictions = text_clf.predict(X_test)

acc = sum(predictions == y_test)/len(y_test)
print(f"Our model obtained a performance accuracy of {acc}")


Our model obtained a performance accuracy of 0.84
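
To illustrate why pipelines are convenient when testing multiple classifiers, here is a minimal sketch that swaps only the final step while keeping the vectorizer. The toy sentences and tags are made up and only stand in for the IMDB data; `toy_clf`, `nb_acc` and `lr_acc` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up toy data standing in for the IMDB reviews
train_texts = ["great movie", "loved it", "wonderful acting",
               "terrible movie", "hated it", "awful acting"]
train_tags = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["great acting", "awful movie"]
test_tags = ["pos", "neg"]

toy_clf = Pipeline([("vect", CountVectorizer()),
                    ("clf", MultinomialNB())])
toy_clf.fit(train_texts, train_tags)
nb_acc = accuracy_score(test_tags, toy_clf.predict(test_texts))

# replace only the step named 'clf'; the vectorizer step is kept as is
toy_clf.set_params(clf=LogisticRegression())
toy_clf.fit(train_texts, train_tags)
lr_acc = accuracy_score(test_tags, toy_clf.predict(test_texts))

print(f"Naive Bayes: {nb_acc}, logistic regression: {lr_acc}")
```

Because the vectorizer is fitted inside the pipeline, it only ever sees the training texts, which avoids leaking information from the test set.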


---
## Cross-Validation
If you compare the performance of the pipeline classification with the classification in the script `classification.py`, you are likely to see that the two performances aren't the same and can vary quite a bit, since `train_test_split` draws a different random split each time it is called without a fixed `random_state`. To get a more stable estimate we can use cross-validation. Cognitive Science bachelor students should be familiar with cross-validation, but if you aren't, please see [this video](https://www.youtube.com/watch?v=fSytzGwwBVw).

In [6]:
import numpy as np
from sklearn.model_selection import cross_val_score

# perform a 5-fold cross-validation
scores = cross_val_score(text_clf, X=imdb.text, y=imdb.tag, cv=5, scoring='accuracy')
print(scores)
print("mean:", np.mean(scores), "\n SD:", np.std(scores))

[0.785  0.83   0.7975 0.8375 0.8225]
mean: 0.8145 
 SD: 0.019962464777677123
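
The variation between single splits can be made visible with a small experiment. This is only a sketch on synthetic data from `make_classification`, not on the IMDB reviews, and the model and seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in data (not the IMDB reviews)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# accuracy from five different single train/test splits
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print("single splits:", np.round(split_scores, 3))

# 5-fold cross-validation summarises the same variation as a mean and an SD
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("5-fold CV mean:", cv_scores.mean(), "SD:", cv_scores.std())
```

Reporting the mean together with the SD gives a more honest picture than a single number from one lucky or unlucky split.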


---
## Grid Search

Many ML methods require you to specify some hyperparameters. If you have only a weak idea of how to set them, you can use a grid search, which simply goes through all possible combinations of the specified hyperparameter values.

**Note:** Remember that the computation time increases quite a lot. To calculate by how much, multiply together the lengths of the lists of hyperparameter values you are searching over. E.g. for the example below we have 3 values for C and 2 for kernel and thus a $2\cdot3=6$ times increase in computation (this does not include the computational cost of the specific hyperparameters). Each combination is also fitted once per cross-validation fold.
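
The arithmetic in the note can be checked before starting an expensive search. The sketch below uses `ParameterGrid` from `sklearn.model_selection` on the same grid as this section's example; `n_folds` is assumed to match the `cv` value passed to the search.

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the grid search example of this section
param_grid = {"svm__kernel": ("linear", "rbf"),
              "svm__C": [0.1, 1, 10]}

n_combinations = len(ParameterGrid(param_grid))  # 2 kernels * 3 values of C
n_folds = 5                                      # assumed to match cv=5
n_fits = n_combinations * n_folds

print(f"{n_combinations} combinations, {n_fits} model fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit at the end, when the best combination is refitted on all the data.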

You can naturally also create a loop and loop through all the different classifiers. It will probably take quite a while though ;) 

In [23]:
from sklearn.svm import SVC  # this is another type of classifier called a support vector machine
from sklearn.model_selection import GridSearchCV

# hyperparameters to search over
param_grid = {'svm__kernel':('linear', 'rbf'),  # change the kernel
              'svm__C': [0.1, 1, 10]}           # change the C hyperparameter
text_clf = Pipeline([('vect', CountVectorizer()), 
                     ('svm', SVC())])


search = GridSearchCV(text_clf, param_grid, 
                      cv=5,       # 5 fold cross-validation for each
                      n_jobs=-1)  # indicate that you want to run on all the cores of the computer 
search.fit(X=imdb.text, y=imdb.tag)
print(f"The cross-validated score was:\n{round(search.best_score_, 4)}")
print("\nThe best parameters were:\n", search.best_params_)

The cross-validated score was:
0.8335

The best parameters were:
 {'svm__C': 0.1, 'svm__kernel': 'linear'}
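
The loop over different classifiers mentioned earlier could be sketched as follows. This is a minimal illustration on synthetic data, with three arbitrarily chosen classifiers at their default settings; `candidates`, `results` and `best_name` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (not the IMDB reviews)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# arbitrary selection of classifiers to compare
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# mean 5-fold cross-validated accuracy for each classifier
results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")

best_name = max(results, key=results.get)
print("best:", best_name)
```

For text data, each classifier would be wrapped in a pipeline with a vectorizer, and the loop could be combined with a grid search per classifier, at a correspondingly higher computational cost.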


## The not quite full list of classifiers for scikit-learn

Missing candidates include (maybe more):
* XGBoost


```
# import paths updated to the public modules used by recent scikit-learn versions
from sklearn.tree import ExtraTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import OneClassSVM
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OutputCodeClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestCentroid
from sklearn.svm import NuSVC
from sklearn.linear_model import Perceptron
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture
# DPGMM, GMM and VBGMM were removed from sklearn.mixture in scikit-learn 0.20;
# GaussianMixture and BayesianGaussianMixture are their replacements
from sklearn.mixture import BayesianGaussianMixture
```