# **Neocase Case Category Hyperparameter Optimization**

In [1]:
%matplotlib inline

####  Pipeline for text feature extraction and evaluation

The dataset used here is the Neocase case dataset. It is read from a local `document` directory, in which each sub-folder holds the cases of one category, and is used for the document classification example.

You can restrict the categories by passing their names to the dataset loader, or set them to None to load all 12.
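As a rough illustration of how the loader's `categories` argument behaves, here is a minimal sketch that runs `load_files` on a throwaway corpus. The temporary directory, file names and sample texts are hypothetical and stand in for the real `document` folder.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny throwaway corpus: one sub-folder per category (hypothetical data).
root = Path(tempfile.mkdtemp())
samples = {
    "ABSENCES": ["demande de conge maladie"],
    "FORMATION": ["inscription a une formation"],
    "OUTILS RH": ["acces au portail RH"],
}
for category, texts in samples.items():
    folder = root / category
    folder.mkdir()
    for i, text in enumerate(texts):
        (folder / f"case_{i}.txt").write_text(text, encoding="utf-8")

# Restrict loading to two of the three categories...
subset = load_files(str(root), categories=["ABSENCES", "FORMATION"],
                    encoding="utf-8")
# ...or pass None to load every sub-folder found under the root.
everything = load_files(str(root), categories=None, encoding="utf-8")

print(subset.target_names)
print(everything.target_names)
```

Passing `encoding` makes `load_files` return decoded strings rather than raw bytes. The notebook itself omits it and leaves decoding to the vectorizer.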

Here is sample output from a run on a quad-core machine:

Loading 12 Category dataset for categories:
['ABSENCES','AVANTAGES SOCIETE GENERALE',"DEPART DE L'ENTREPRISE",'OUTILS RH','PAIE, REMUNERATION ET FRAIS PROFESSIONNELS','THEME AUTRE','DONNEES PERSONNELLES','FORMATION','INTERIM  STAGES  EXTERNES','MUTUELLE  PREVOYANCE','PARCOURS PROFESSIONNEL','TEMPS DE TRAVAIL']
556 documents
12 categories

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07), 'clf__max_iter': (10, 50, 80), 'clf__penalty': ('l2', 'elasticnet'), 'tfidf__use_idf': (True, False), 'vect__max_n': (1, 2), 'vect__max_df': (0.5, 0.75, 1.0), 'vect__max_features': (None, 5000, 10000, 50000)}
done in 1737.030s

Best score: 1.000
Best parameters set:
	clf__alpha: 1e-05
	clf__max_iter: 5
	clf__penalty: 'elasticnet'
	tfidf__use_idf: True
	vect__max_n: 2
	vect__max_df: 1.0
	vect__max_features: 50000

In [2]:
from __future__ import print_function

import sklearn
from pprint import pprint
from time import time
import logging
from sklearn import datasets  # makes sklearn.datasets.load_files available
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [3]:
print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


Automatically created module for IPython interactive environment


In [4]:
# #############################################################################
# Load some categories from the training set
categories = [
    'ABSENCES',
    'AVANTAGES SOCIETE GENERALE',
    "DEPART DE L'ENTREPRISE",
    'OUTILS RH',
    'PAIE, REMUNERATION ET FRAIS PROFESSIONNELS',
    'THEME AUTRE',
    'DONNEES PERSONNELLES',
    'FORMATION',
    'INTERIM  STAGES  EXTERNES',
    'MUTUELLE  PREVOYANCE',
    'PARCOURS PROFESSIONNEL',
    'TEMPS DE TRAVAIL',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

In [5]:
print("Loading neocase Case dataset for categories:")
print(categories)

Loading neocase Case dataset for categories:
['ABSENCES', 'AVANTAGES SOCIETE GENERALE', "DEPART DE L'ENTREPRISE", 'OUTILS RH', 'PAIE, REMUNERATION ET FRAIS PROFESSIONNELS', 'THEME AUTRE', 'DONNEES PERSONNELLES', 'FORMATION', 'INTERIM  STAGES  EXTERNES', 'MUTUELLE  PREVOYANCE', 'PARCOURS PROFESSIONNEL', 'TEMPS DE TRAVAIL']


In [7]:
# Each sub-folder of "document" is one category; the files inside it are the cases.
docs_to_train = sklearn.datasets.load_files("document", categories=categories)


print("%d documents" % len(docs_to_train.filenames))
print("%d categories" % len(docs_to_train.target_names))
print()

556 documents
12 categories



In [8]:
# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
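To make the role of the step names concrete, the following sketch fits the same three-step pipeline on a few made-up sentences and sets parameters through the `<step>__<parameter>` naming convention that the grid search relies on. The toy texts, labels and the `toy_pipeline` name are assumptions introduced only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter of a single step.
toy_pipeline.set_params(vect__ngram_range=(1, 2), clf__alpha=1e-5)

# Hypothetical toy data: label 0 = absence-related, label 1 = pay-related.
texts = ["conge maladie", "arret maladie",
         "bulletin de paie", "prime sur la paie"]
labels = [0, 0, 1, 1]
toy_pipeline.fit(texts, labels)

print(toy_pipeline.named_steps["vect"].ngram_range)
print(toy_pipeline.predict(["question sur ma paie"]))
```

The step names (`vect`, `tfidf`, `clf`) are therefore what the keys of a parameter grid must start with.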

In [9]:
# Uncommenting more parameters widens the search, but the number of
# candidates (and so the processing time) grows combinatorially.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (5,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}
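The combinatorial growth can be checked before launching the search. This sketch uses scikit-learn's `ParameterGrid` to count the candidates in the grid above and, assuming 5-fold cross-validation, the resulting number of fits. The `grid`, `n_candidates`, `n_folds` and `n_fits` names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same active entries as the grid above (commented-out lines left out).
grid = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (5,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

# One candidate per combination of values: 3 * 2 * 1 * 2 * 2.
n_candidates = len(ParameterGrid(grid))
n_folds = 5  # assumed number of cross-validation folds
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 24 120
```

Uncommenting `'vect__max_features'` with its four values would multiply both numbers by four.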


In [10]:
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, cv=5,
                               n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    print(parameters)
    t0 = time()
    grid_search.fit(docs_to_train.data, docs_to_train.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'vect__max_df': (0.5, 0.75, 1.0), 'vect__ngram_range': ((1, 1), (1, 2)), 'clf__max_iter': (5,), 'clf__alpha': (1e-05, 1e-06), 'clf__penalty': ('l2', 'elasticnet')}
Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 19.7min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 42.3min finished


done in 2557.651s

Best score: 1.000
Best parameters set:
	clf__alpha: 1e-05
	clf__max_iter: 5
	clf__penalty: 'elasticnet'
	vect__max_df: 1.0
	vect__ngram_range: (1, 1)
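The best score and parameter set only describe the winning candidate. To see how close the other candidates came, the `cv_results_` attribute of a fitted `GridSearchCV` can be inspected. The sketch below does this on a small made-up corpus with a reduced grid. The texts, labels and the `search` name are hypothetical, and the numbers it prints say nothing about the Neocase data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to make the sketch runnable.
texts = ["conge maladie", "arret maladie", "absence pour maladie",
         "jour de conge", "conge parental", "absence injustifiee",
         "bulletin de paie", "prime sur la paie", "erreur de paie",
         "note de frais", "remboursement de frais", "salaire et prime"]
labels = [0] * 6 + [1] * 6

search = GridSearchCV(
    Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(max_iter=50, random_state=0)),
    ]),
    {"vect__ngram_range": ((1, 1), (1, 2)), "clf__alpha": (1e-5, 1e-6)},
    cv=3,
)
search.fit(texts, labels)

# List every candidate, best rank first, with mean and spread across folds.
results = search.cv_results_
rows = zip(results["rank_test_score"], results["mean_test_score"],
           results["std_test_score"], results["params"])
for rank, mean, std, params in sorted(rows, key=lambda row: row[0]):
    print("rank %d: %.3f (+/- %.3f) %r" % (rank, mean, std, params))
```

A large gap between candidates, or a high standard deviation across folds, is a hint that the reported best score should be confirmed on data held out from the search.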
