<a href="https://colab.research.google.com/github/LennartKeller/TextklassifikationsProjekt2019/blob/master/HyperparamOptimization_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from typing import *

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.metrics import f1_score
from tqdm import tqdm


class PeriodEstimatorWrapper(BaseEstimator):

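    def __init__(self, clf: Type[BaseEstimator], **params):
        # `clf` is an estimator class (not an instance); it is instantiated with `params`.
        self.clf = clf(**params)
        # Always define `verbose` so the predict methods can rely on it.
        self.verbose = params.get('verbose', False)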

    def fit(self, X_train: Union[csr_matrix, np.ndarray], y_train: np.array):
        """
        Fits the estimator.

        :param X_train: normal feature matrix e.g. shape (n_samples, n_features)
        :param y_train: label vector shape (n_samples,)
        :return: fitted instance of itself
        """

        self.clf.fit(X_train, y_train)
        self.fitted_ = True

        return self

    def predict(self, X_test: List[Union[csr_matrix, np.ndarray]]):
        """
        Predicts classes for n periods
        :param X_test: list of feature matrices (n_samples, n_features) to predict (one for each period)
        :return: list of predicted label vectors
        """

        if not getattr(self, 'fitted_', False):
            raise NotFittedError

        result = []
        if self.verbose:
            iterator = tqdm(X_test, desc='Predicting classes for periods')
        else:
            iterator = X_test

        for X in iterator:
            result.append(self.clf.predict(X))

        return result

    def predict_proba(self, X_test: List[Union[csr_matrix, np.ndarray]]):
        """
        Predicts probabilities for n periods
        :param X_test: list of feature matrices (n_samples, n_features) to predict (one for each period)
        :return: list of class probability matrices (n_samples, n_classes), one per period
        """
        if not hasattr(self.clf, 'predict_proba'):
            raise Exception(f"Method predict_proba is not implemented in {self.clf.__class__.__name__}")

        if not getattr(self, 'fitted_', False):
            raise NotFittedError

        result = []
        if self.verbose:
            iterator = tqdm(X_test, desc='Predicting probabilities for periods')
        else:
            iterator = X_test

        for X in iterator:
            result.append(self.clf.predict_proba(X))

        return result

    def decision_function(self, X_test: List[Union[csr_matrix, np.ndarray]]):
        """
        Predicts decision scores for n periods
        :param X_test: list of feature matrices (n_samples, n_features) to predict (one for each period)
        :return: list of decision score arrays, one per period
        """
        if not hasattr(self.clf, 'decision_function'):
            raise Exception(f"Method decision_function is not implemented in {self.clf.__class__.__name__}")

        if not getattr(self, 'fitted_', False):
            raise NotFittedError

        result = []
        if self.verbose:
            iterator = tqdm(X_test, desc='Computing decision scores for periods')
        else:
            iterator = X_test

        for X in iterator:
            # Call decision_function here, not predict_proba.
            result.append(self.clf.decision_function(X))

        return result

    def score(self,
              X_test: List[Union[csr_matrix, np.ndarray]],
              y_true: List[np.array],
              scoring_func: callable = lambda y_true, y_pred: f1_score(y_true, y_pred, average='macro'),
              pooling_func: callable = np.mean):

        if not getattr(self, 'fitted_', False):
            raise NotFittedError

        scores = []
        for X, y in zip(X_test, y_true):
            y_pred = self.clf.predict(X)
            score = scoring_func(y, y_pred)
            scores.append(score)

        return pooling_func(scores)
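
To make the default behaviour of `score` concrete, the following minimal sketch reproduces what it computes in plain Python: one macro-averaged F1 score per period, pooled with the mean. The helper `macro_f1` and the toy labels are assumptions made up for illustration; the class itself delegates to `sklearn.metrics.f1_score`.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Guard against an empty denominator.
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return mean(f1_per_class)


# Toy labels for two periods (made up for illustration).
periods_true = [["A", "A", "B", "B"], ["A", "B", "B", "B"]]
periods_pred = [["A", "A", "B", "B"], ["A", "A", "B", "B"]]

# One score per period, then a single pooled value.
per_period = [macro_f1(t, p) for t, p in zip(periods_true, periods_pred)]
pooled = mean(per_period)
```

Passing a different `pooling_func` (for example `np.min`) would report the worst period instead of the average.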


### Problem: How do we tune the hyperparameters?

Our idea is to train a model on all genres within one "period" and apply it to all other periods, in order to estimate how much the genres change over time. This raises the question of how to optimize the models' hyperparameters in a way that is both valid and efficient.

* Option 1:
    * Grid search on the source period
    * Advantages:
        * Probably the most valid approach
    * Disadvantages:
        * Our dataset is too small to do this meaningfully for individual periods
* Option 2:
    * Grid search on all data
    * Advantages:
        * Large amount of data
        * The model would be tuned to the peculiarities of all periods (although this is arguably a disadvantage)
    * Disadvantage:
        * Data used later for testing would be used for optimization
* Option 3:
    * Use a ParamDict (parameter grid) to run the actual experiment (training on one period and testing on all others) with every hyperparameter combination, with a custom evaluation (e.g. the mean of the F1 scores across the periods)
    * Advantages:
        * Clear separation of test and training data
        * More data for the optimization than with Option 1
    * Disadvantages:
        * No cross-validation
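
The procedure described in Option 3 can be sketched without any data as follows. This is only an illustration under stated assumptions: `expand_grid`, `tune_across_periods`, `toy_fit` and `toy_evaluate` are hypothetical names, the "model" is just its parameter dict, and the score is a made-up function. In the notebook, fitting and scoring are done by a scikit-learn pipeline and `f1_score`.

```python
from itertools import product
from statistics import mean


def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def tune_across_periods(param_grid, fit, evaluate, train_period, test_periods):
    """Fit once per combination on train_period, average the score over test_periods."""
    results = []
    for params in expand_grid(param_grid):
        model = fit(train_period, params)
        scores = [evaluate(model, period) for period in test_periods]
        results.append({"avg_score": mean(scores), "params": params})
    # Best combination first.
    return sorted(results, key=lambda r: r["avg_score"], reverse=True)


# Stand-ins so the sketch runs on its own: the "model" is just its parameters
# and the "score" is a made-up function of them.
def toy_fit(train_period, params):
    return params


def toy_evaluate(model, period):
    return 1.0 / (1.0 + abs(model["C"] - 3)) - 0.01 * period


ranking = tune_across_periods(
    {"C": [0.1, 1, 3, 10]}, toy_fit, toy_evaluate, train_period=1, test_periods=[2, 3]
)
best = ranking[0]
```

Each combination is fitted exactly once, so the cost is one fit per candidate rather than one fit per candidate and fold.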

In [0]:
import pandas as pd

In [3]:
!pip install stop_words

Collecting stop_words
  Downloading https://files.pythonhosted.org/packages/1c/cb/d58290804b7a4c5daa42abbbe2a93c477ae53e45541b1825e86f0dfaaf63/stop-words-2018.7.23.tar.gz
Building wheels for collected packages: stop-words
  Building wheel for stop-words (setup.py) ... done
  Created wheel for stop-words: filename=stop_words-2018.7.23-cp36-none-any.whl size=32916 sha256=5074ecbf21f20e99ebd3d92422d9655f7ec51458e6caa935f16c419bc351a14e
  Stored in directory: /root/.cache/pip/wheels/75/37/6a/2b295e03bd07290f0da95c3adb9a74ba95fbc333aa8b0c7c78
Successfully built stop-words
Installing collected packages: stop-words
Successfully installed stop-words-2018.7.23


In [38]:
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [39]:
!ls /content/gdrive/My\ Drive/

'Colab Notebooks'   full_dataset.csv	     smv_tuning1.csv
 data		    full_taggeddataset.csv


In [0]:
df = pd.read_csv('/content/gdrive/My Drive/full_taggeddataset.csv')

In [0]:
# remove news genre

df = df[df.genre != 'NEWS']
df = df[df.genre != 'NEWS-P4']
df.region = df.region.str.upper()
df = df[df.year != 'GesetzsammlungThÅringen']
df = df[df.year != 'GesetzsammlungThüringen']
df = df[df.year != '1851-54']

In [0]:
df_p1 = df.loc[df['period'] == 'P1']
df_rest = df.loc[df['period'] != 'P1']

# Feature Extraction

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from stop_words import get_stop_words

tfidf = TfidfVectorizer()

# Building the pipeline

In [0]:
from sklearn.pipeline import make_pipeline, make_union, Pipeline

In [0]:
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

In [12]:
pipe_svm = Pipeline([('tfidf', tfidf), ('linearsvc', LinearSVC())])
pipe_svm

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('linearsvc',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
         

In [0]:
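pipe_svm_params = {
    'tfidf__max_features': [300, 1000, 5000,  10000, 20000],
    'tfidf__ngram_range': [(1,1), (1,3), (1,5)],
    # Step name and parameter name are joined by a double underscore.
    'tfidf__lowercase': [True, False],
    'tfidf__stop_words': [None, get_stop_words('de')],
    'linearsvc__C': [0.1, 0.5, 1, 3, 7, 10],
    # penalty='l1' needs dual=False; with dual=True (the default in older
    # scikit-learn versions) these candidates fail and are scored as NaN.
    'linearsvc__penalty': ['l2', 'l1']
}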

# Option 1: Grid search on the training period

In [0]:
from sklearn.model_selection import GridSearchCV

gridsearch = GridSearchCV(
    pipe_svm,
    pipe_svm_params,
    scoring='f1_macro',
    verbose=1,
    n_jobs=-1)

In [0]:
gridsearch.fit(df_p1.text, df_p1.genre.to_numpy())

Fitting 5 folds for each of 900 candidates, totalling 4500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed: 12.8min
[Parallel(n_jobs=-1)]: Done 796 tasks      | elapsed: 23.4min
[Parallel(n_jobs=-1)]: Done 1246 tasks      | elapsed: 36.4min
[Parallel(n_jobs=-1)]: Done 1796 tasks      | elapsed: 51.8min
[Parallel(n_jobs=-1)]: Done 2446 tasks      | elapsed: 71.9min
[Parallel(n_jobs=-1)]: Done 3196 tasks      | elapsed: 92.3min
[Parallel(n_jobs=-1)]: Done 4046 tasks      | elapsed: 117.0min
[Parallel(n_jobs=-1)]: Done 4500 out of 4500 | elapsed: 130.0min finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [0]:
gridsearch.best_params_, gridsearch.best_score_

({'linearsvc__C': 3,
  'linearsvc__penalty': 'l2',
  'tfidf__analyzer': 'word',
  'tfidf__max_features': 15000,
  'tfidf__ngram_range': (1, 1)},
 0.7830952380952381)

In [0]:
svm_results = pd.DataFrame.from_dict(gridsearch.cv_results_)
svm_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_linearsvc__C,param_linearsvc__penalty,param_tfidf__analyzer,param_tfidf__max_features,param_tfidf__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.322687,0.003073,0.052285,0.002864,1,l2,word,1000,"(1, 1)","{'linearsvc__C': 1, 'linearsvc__penalty': 'l2'...",0.708730,0.659524,0.703968,0.626984,0.706349,0.681111,0.032613,187
1,2.463731,0.034002,0.145152,0.010937,1,l2,word,1000,"(1, 3)","{'linearsvc__C': 1, 'linearsvc__penalty': 'l2'...",0.646825,0.659524,0.515873,0.637302,0.659524,0.623810,0.054613,299
2,5.083001,0.029989,0.256598,0.006509,1,l2,word,1000,"(1, 5)","{'linearsvc__C': 1, 'linearsvc__penalty': 'l2'...",0.646825,0.659524,0.515873,0.637302,0.659524,0.623810,0.054613,299
3,0.301410,0.012273,0.045825,0.001982,1,l2,word,5000,"(1, 1)","{'linearsvc__C': 1, 'linearsvc__penalty': 'l2'...",0.706746,0.716667,0.609524,0.626984,0.811905,0.694365,0.072394,173
4,2.401519,0.025374,0.145544,0.006093,1,l2,word,5000,"(1, 3)","{'linearsvc__C': 1, 'linearsvc__penalty': 'l2'...",0.706746,0.817857,0.527778,0.637302,0.811905,0.700317,0.109590,169
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
895,2.596360,0.023152,0.000000,0.000000,10,l1,char_wb,15000,"(1, 3)","{'linearsvc__C': 10, 'linearsvc__penalty': 'l1...",,,,,,,,458
896,4.748061,0.059630,0.000000,0.000000,10,l1,char_wb,15000,"(1, 5)","{'linearsvc__C': 10, 'linearsvc__penalty': 'l1...",,,,,,,,457
897,0.862964,0.042816,0.000000,0.000000,10,l1,char_wb,20000,"(1, 1)","{'linearsvc__C': 10, 'linearsvc__penalty': 'l1...",,,,,,,,456
898,2.596670,0.018641,0.000000,0.000000,10,l1,char_wb,20000,"(1, 3)","{'linearsvc__C': 10, 'linearsvc__penalty': 'l1...",,,,,,,,777


In [0]:
svm_results.to_csv('/content/gdrive/My Drive/smv_tuning1.csv')

# Option 2: Grid search on all data

In [0]:
gridsearch_full_svm = GridSearchCV(
    pipe_svm,
    pipe_svm_params,
    scoring='f1_macro',
    verbose=1,
    n_jobs=-1)

In [0]:
gridsearch_full_svm.fit(df.text, df.genre.to_numpy())

In [0]:
gridsearch_full_svm.best_params_, gridsearch_full_svm.best_score_

In [0]:
svm_full_results = pd.DataFrame.from_dict(gridsearch_full_svm.cv_results_)
svm_full_results

# Option 3: Test all parameter combinations with regular runs

In [0]:
from typing import *
from tqdm import tqdm_notebook
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid
import numpy as np
from pprint import pprint
import pickle
import time

def tune(pipe: Pipeline, params: Dict[str, Any], df_train: pd.DataFrame, test_dfs: List[pd.DataFrame]) -> List[Dict[str, Any]]:
  """Fit `pipe` on `df_train` for every parameter combination and average the macro F1 over `test_dfs`."""
  paramdict = ParameterGrid(params)

  results = []
  # Checkpoint file, named after the last pipeline step and the start time
  result_fn = f'{pipe.steps[-1][0]}_tuning_{time.time()}.h'
  for current_params in tqdm_notebook(paramdict):
    try:
      pipe.set_params(**current_params)
      pipe.fit(df_train.lemmas, df_train.genre)

      scores = []
      for df_test in test_dfs:
        y_pred = pipe.predict(df_test.lemmas)
        scores.append(f1_score(df_test.genre, y_pred, average='macro'))
      result = {'avg_score': np.mean(scores), 'params': current_params}
      results.append(result)
      pprint(result['avg_score'])

    except Exception as e:
      # Invalid combinations are recorded with a score of 0.0
      result = {'avg_score': 0.0, 'params': current_params}
      results.append(result)
      print(f'Error on {current_params}')
      print(e)

    finally:
      # save results and go on (use the grid passed in, not the global one)
      with open(result_fn, 'wb') as f:
        pickle.dump([params, results], f)
  return results
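
Because `tune` rewrites its checkpoint file after every combination, an interrupted run can still be inspected afterwards. The sketch below shows one way to do that, assuming the file holds the two-element list written by `pickle.dump` above (the parameter grid and the results so far). The helper `load_ranked`, the demo file name and the toy results are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_ranked(path, top=5):
    """Load a checkpoint of the form [param_grid, results] and rank the results."""
    with open(path, "rb") as f:
        param_grid, results = pickle.load(f)
    ranked = sorted(results, key=lambda r: r["avg_score"], reverse=True)
    return param_grid, ranked[:top]


# Round trip with made-up results so the sketch runs on its own.
toy_grid = {"linearsvc__C": [1, 3]}
toy_results = [
    {"avg_score": 0.61, "params": {"linearsvc__C": 1}},
    {"avg_score": 0.74, "params": {"linearsvc__C": 3}},
]
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "linearsvc_tuning_demo.h"
    with open(checkpoint, "wb") as f:
        pickle.dump([toy_grid, toy_results], f)
    grid, top_results = load_ranked(checkpoint)
```

Combinations that raised an error are stored with a score of 0.0 and therefore end up at the bottom of the ranking.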

In [0]:
from sklearn.model_selection import ParameterGrid

In [0]:
df_train = df[df['period'] == 'P1']
test_dfs = [df[df['period'] == p] for p in df.period.unique() if p != 'P1']

# 1. Tune all clfs

# LinearSVC Tuning

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline, make_pipeline
from stop_words import get_stop_words

pipe_svm = make_pipeline(TfidfVectorizer(), LinearSVC())

pipe_svm_params = {
    'tfidfvectorizer__max_features': [300, 1000, 5000,  10000, 20000],
    'tfidfvectorizer__ngram_range': [(1,1), (1,3), (1,5)],
    'tfidfvectorizer__lowercase': [True, False],
    'tfidfvectorizer__stop_words': [None, get_stop_words('de')],
    'linearsvc__C': [0.1, 0.5, 1, 3, 7, 10],
    'linearsvc__penalty': ['l2', 'l1']
    
}
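
As a quick sanity check on the runtime, the size of this grid can be worked out by multiplying the number of values per parameter. The counts below are copied by hand from `pipe_svm_params`; the dictionary `svm_grid_sizes` is only an illustrative helper.

```python
from math import prod

# Number of values per parameter, copied from pipe_svm_params.
svm_grid_sizes = {
    "tfidfvectorizer__max_features": 5,
    "tfidfvectorizer__ngram_range": 3,
    "tfidfvectorizer__lowercase": 2,
    "tfidfvectorizer__stop_words": 2,
    "linearsvc__C": 6,
    "linearsvc__penalty": 2,
}

n_candidates = prod(svm_grid_sizes.values())  # 5 * 3 * 2 * 2 * 6 * 2
```

This gives 720 candidates, which matches the `max=720` shown by the progress bar of the `tune` run. Each candidate means one fit plus one prediction pass per test period.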

In [50]:
svm_results = tune(pipe_svm, pipe_svm_params, df_train, test_dfs)

HBox(children=(IntProgress(value=0, max=720), HTML(value='')))

0.5680636408496424
0.5878308086474031


KeyboardInterrupt: ignored

In [32]:
list(sorted(svm_results, key=lambda x: x['avg_score'], reverse=True))[:5]

[{'avg_score': 0.7605820888258503,
  'params': {'linearsvc__C': 7,
   'linearsvc__penalty': 'l2',
   'tfidf__lowercase': False,
   'tfidf__max_features': 5000,
   'tfidf__ngram_range': (1, 1),
   'tfidf__stop_words': None}},
 {'avg_score': 0.7585329099282458,
  'params': {'linearsvc__C': 10,
   'linearsvc__penalty': 'l2',
   'tfidf__lowercase': False,
   'tfidf__max_features': 5000,
   'tfidf__ngram_range': (1, 1),
   'tfidf__stop_words': None}},
 {'avg_score': 0.7483915458612229,
  'params': {'linearsvc__C': 3,
   'linearsvc__penalty': 'l2',
   'tfidf__lowercase': False,
   'tfidf__max_features': 5000,
   'tfidf__ngram_range': (1, 1),
   'tfidf__stop_words': None}},
 {'avg_score': 0.7428161816614313,
  'params': {'linearsvc__C': 7,
   'linearsvc__penalty': 'l2',
   'tfidf__lowercase': False,
   'tfidf__max_features': 10000,
   'tfidf__ngram_range': (1, 5),
   'tfidf__stop_words': None}},
 {'avg_score': 0.7427645509959899,
  'params': {'linearsvc__C': 10,
   'linearsvc__penalty': 'l2',

# Logistic Regression Tuning

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from stop_words import get_stop_words

pipe_logreg = make_pipeline(TfidfVectorizer(), LogisticRegression(n_jobs=-1))

pipe_logreg_params = {
    'tfidfvectorizer__max_features': [300, 1000, 5000,  10000, 20000],
    'tfidfvectorizer__ngram_range': [(1,1), (1,3), (1,5)],
    'tfidfvectorizer__lowercase': [True, False],
    'tfidfvectorizer__stop_words': [None, get_stop_words('de')],
    'logisticregression__C': [0.1, 0.5, 1, 3, 7, 10],
    'logisticregression__penalty': ['l2', 'l1'],
    'logisticregression__solver': ['liblinear', 'lbfgs']
    
}

In [0]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore") # suppress the UserWarning caused by n_jobs=-1
    logreg_results = tune(pipe_logreg, pipe_logreg_params, df_train, test_dfs)

HBox(children=(IntProgress(value=0, max=1440), HTML(value='')))

0.5681571308025674
0.5619733230380801
0.5234789798974424
0.5511693537236442
0.5234789798974424
0.5511693537236442
0.6085933379035096
0.6685011082356263
0.5776770704753034
0.662712380269575
0.5756671056295106
0.6618077756600995
0.39719408329762895
0.675912379439507
0.6045431382371502
0.6901857346504827
0.6076526367464804
0.6864019493414873
0.41439827138769064
0.6784412444289848
0.3954671690168422
0.6864365812409304
0.3917838764014329
0.6906176829755775
0.4258986815674824
0.6737986257262061
0.3995989641422435
0.682013119279915
0.40060220817829506
0.6830178829388382
0.5846939397194013
0.5379976211728084
0.5549081098204216
0.5271952682763841
0.5549081098204216
0.5329890754427403
0.6235850457116561
0.6683230928616942
0.6051767527822681
0.6622358342426446
0.6032307342939218
0.6550046711242655
0.6279845852636485
0.6808627854093856
0.634082577705127
0.6875183586201323
0.6341159415425403
0.6864151205463783
0.6254285086987375
0.6776138683437128
0.6404005666364831
0.6822379044669824
0.64119626228

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import make_pipeline

from stop_words import get_stop_words

pipe_logreg = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe_logreg_params = {

    'logisticregression__C': [0.1, 0.5, 1, 3, 7, 10],
    'logisticregression__penalty': ['l2', 'l1'],
    'logisticregression__solver': ['liblinear', 'lbfgs']
    
}

logreg = [pipe_logreg, pipe_logreg_params]

pipe_svm = make_pipeline(TfidfVectorizer(), LinearSVC())
pipe_svm_params = {
    'linearsvc__C': [0.1, 0.5, 1, 3, 7, 10],
    'linearsvc__penalty': ['l2', 'l1']
    
}

svm = [pipe_svm, pipe_svm_params]

pipe_naivebayes = make_pipeline(TfidfVectorizer(), MultinomialNB())


pipe_dectree = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier())
pipe_dectree_params = {
    'decisiontreeclassifier__max_features': ["auto", "sqrt", "log2"],
    'decisiontreeclassifier__max_depth': [None, 100, 125, 150, 175, 200],
    'decisiontreeclassifier__min_samples_split': [2, 5, 10, 20], 
}

dectree = [pipe_dectree, pipe_dectree_params]


pipe_randomforest = make_pipeline(TfidfVectorizer(), RandomForestClassifier())
pipe_randomforest_params = {
    'randomforestclassifier__n_estimators': [5, 100, 400, 1000],
}

randomforest = [pipe_randomforest, pipe_randomforest_params]


all_pipes_params = [logreg, svm, dectree, randomforest]

In [0]:
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm_notebook
import pandas as pd
gs_results = {}


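def tune_hyperparams(pipe, params, df):
  """Run a separate grid search on each period and return {period: cv_results_}."""
  period_results = {}

  for period in tqdm_notebook(sorted(df.period.unique())):
    df_train = df[df.period == period]
    
    gs = GridSearchCV(pipe,
                      param_grid=params,
                      n_jobs=-1,
                      verbose=1,
                      scoring='f1_macro')
    
    gs.fit(df_train.lemmas, df_train.genre)

    period_results[period] = gs.cv_results_

    # One CSV per classifier and period
    result_df = pd.DataFrame.from_dict(gs.cv_results_)
    result_df.to_csv(f'{pipe.steps[-1][0]}_{period}.csv')

  # Return the results so the caller does not collect None
  return period_results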



In [0]:
results = []

for pipe, params in all_pipes_params:
  result = tune_hyperparams(pipe, params, df)
  results.append(result)

HBox(children=(IntProgress(value=0, max=6), HTML(value='')))

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   10.7s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   32.0s finished


Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   29.3s finished


Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    9.0s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   29.8s finished


Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   42.5s finished


Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   12.1s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   40.0s finished


Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   12.2s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   40.4s finished


HBox(children=(IntProgress(value=0, max=6), HTML(value='')))

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    9.1s finished


Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    9.2s finished


Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    8.8s finished


Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    9.6s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:   12.4s finished


Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    9.3s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:   12.1s finished


Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


In [0]:
tfidf_param = {
    'tfidfvectorizer__max_features': [300, 1000, 5000,  10000, 20000],
    'tfidfvectorizer__ngram_range': [(1,1), (1,3), (1,5)],
    'tfidfvectorizer__lowercase': [True, False],
    'tfidfvectorizer__stop_words': [None, get_stop_words('de')],
}