## Using wrappers for the scikit-learn API

This tutorial shows how to use gensim models as part of your scikit-learn workflow with the help of the wrappers found in ```gensim.sklearn_integration```.

The wrappers currently available are:
* LdaModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel.SklLdaModel```), which implements gensim's ```LDA Model``` in a scikit-learn interface

* LsiModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel.SklLsiModel```), which implements gensim's ```LSI Model``` in a scikit-learn interface

* RpModel (```gensim.sklearn_integration.sklearn_wrapper_gensim_rpmodel.SklRpModel```), which implements gensim's ```Random Projections Model``` in a scikit-learn interface

* LDASeq Model (```gensim.sklearn_integration.sklearn_wrapper_gensim_ldaseqmodel.SklLdaSeqModel```), which implements gensim's ```LdaSeqModel``` in a scikit-learn interface

### LDA Model

To use LdaModel, begin by importing the LdaModel wrapper:

In [None]:
from gensim.sklearn_integration import SklLdaModel

Next, we create a dummy set of texts and convert it into a corpus:

In [None]:
from gensim.corpora import Dictionary
texts = [
    ['compiler', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
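
To make the corpus format concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for illustration, not gensim functions, and gensim's own `Dictionary` may assign ids in a different order. The point is only that each document becomes a list of `(token_id, count)` pairs.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_texts = [['graph', 'flow', 'network', 'graph'], ['graph', 'trees']]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_texts]
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1)]]
```

The repeated token `'graph'` in the first document shows up as a count of 2 rather than as two entries.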

Then fit the LdaModel wrapper on the corpus and transform it:

In [None]:
model = SklLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)
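
A scikit-learn style `transform` is expected to return one fixed-width row per document, here one column per topic. The sketch below assumes the topic weights start out as sparse `(topic_id, weight)` pairs and shows how such pairs can be expanded into dense rows. The function `densify` and the numbers are hypothetical, and this is not a copy of the wrapper's internals.

```python
def densify(sparse_rows, num_topics):
    """Expand lists of (topic_id, weight) pairs into fixed-width rows."""
    dense = []
    for row in sparse_rows:
        full = [0.0] * num_topics
        for topic_id, weight in row:
            full[topic_id] = weight
        dense.append(full)
    return dense

# Hypothetical topic weights for two documents and two topics.
sparse_topics = [[(0, 0.75), (1, 0.25)], [(1, 1.0)]]
dense_topics = densify(sparse_topics, num_topics=2)
print(dense_topics)  # [[0.75, 0.25], [0.0, 1.0]]
```

Fixed-width rows are what allow the output to be passed to a downstream scikit-learn estimator.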

#### Integration with Sklearn

To show how the wrapper fits into a scikit-learn workflow, let's use scikit-learn's CountVectorizer. For this example we use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/), restricted to the categories rec.sport.baseball and sci.crypt, and generate topics from it.

In [None]:
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel

In [None]:
rand = np.random.RandomState(1)  # fixed random state so the shuffle is reproducible
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True, random_state=rand)

Next, we use CountVectorizer to convert the collection of text documents to a matrix of token counts.

In [None]:
vec = CountVectorizer(min_df=10, stop_words='english')

X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocab to be converted to id2word

id2word = dict(enumerate(vocab))
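
The cell above does two things: it drops rare terms via `min_df` and it maps column indices back to words. Here is a simplified pure-Python sketch of both steps. The helpers `build_id2word` and `count_row` are hypothetical and use plain whitespace splitting, so they do not reproduce CountVectorizer's tokenization or stop-word handling.

```python
from collections import Counter

def build_id2word(docs, min_df=1):
    """Keep tokens found in at least min_df documents; ids follow sorted order."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_df)
    return dict(enumerate(vocab))

def count_row(doc, id2word):
    """Count occurrences of each kept token, in id order."""
    counts = Counter(doc.lower().split())
    return [counts[id2word[i]] for i in range(len(id2word))]

toy_docs = ['the pitcher threw the ball', 'the cipher uses a key', 'a ball and a key']
toy_id2word = build_id2word(toy_docs, min_df=2)
toy_X = [count_row(doc, toy_id2word) for doc in toy_docs]
print(toy_id2word)  # {0: 'a', 1: 'ball', 2: 'key', 3: 'the'}
print(toy_X)        # [[0, 1, 0, 2], [1, 0, 1, 1], [2, 1, 1, 0]]
```

Column `i` of the count matrix always refers to `id2word[i]`, which is why the mapping has to be built from the same vectorizer that produced the matrix.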

Next, we pass id2word to our Lda wrapper and fit it on X.

In [None]:
obj = SklLdaModel(id2word=id2word, num_topics=5, passes=20)
lda = obj.fit(X)

#### Example for Using Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
from gensim.models.coherencemodel import CoherenceModel

In [None]:
def scorer(estimator, X, y=None):
    goodcm = CoherenceModel(model=estimator, texts=texts, dictionary=estimator.id2word, coherence='c_v')
    return goodcm.get_coherence()

In [None]:
obj = SklLdaModel(id2word=dictionary, num_topics=5, passes=20)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
model = GridSearchCV(obj, parameters, scoring=scorer, cv=5)
model.fit(corpus)

In [None]:
model.best_params_
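
To get a feel for how much work this grid search does, the sketch below enumerates the parameter combinations with `itertools.product`. It only illustrates how the grid expands and is not how GridSearchCV is implemented. The variable names `names` and `candidates` are introduced here for illustration.

```python
from itertools import product

parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
cv = 5

# Build one dict per combination of parameter values.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

print(len(candidates))       # 12 parameter combinations
print(len(candidates) * cv)  # 60 cross-validation fits
print(candidates[0])         # {'iterations': 1, 'num_topics': 2}
```

With the default `refit=True`, GridSearchCV fits the best combination once more on the full data, so widening the grid quickly becomes expensive for slow models.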

#### Example of Using Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn import linear_model

def print_features_pipe(clf, vocab, n=10):
    ''' Better printing for sorted list '''
    vocab = list(vocab)  # make the vocabulary indexable (dict views are not on Python 3)
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))

In [None]:
id2word = Dictionary([_.split() for _ in data.data])
corpus = [id2word.doc2bow(i.split()) for i in data.data]

In [None]:
model = SklLdaModel(num_topics=15, id2word=id2word, iterations=50, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))
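
The pipeline above chains a transformer (the topic model) with a final classifier. The pure-Python sketch below shows that chaining idea with two toy classes. `ScaleByTwo`, `ThresholdClassifier`, `fit_pipeline`, and `predict_pipeline` are all hypothetical names for illustration and are far simpler than scikit-learn's `Pipeline`.

```python
class ScaleByTwo:
    """Hypothetical transformer standing in for a topic-model wrapper."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[2 * v for v in row] for row in X]

class ThresholdClassifier:
    """Hypothetical classifier: predicts 1 above the training mean of feature 0."""
    def fit(self, X, y):
        self.cut_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [int(row[0] > self.cut_) for row in X]

def fit_pipeline(steps, X, y):
    """Fit each transformer in turn and feed its output to the next step."""
    for _, step in steps[:-1]:
        X = step.fit(X, y).transform(X)
    steps[-1][1].fit(X, y)
    return steps

def predict_pipeline(steps, X):
    """Apply the fitted transformers, then predict with the final step."""
    for _, step in steps[:-1]:
        X = step.transform(X)
    return steps[-1][1].predict(X)

toy_steps = [('features', ScaleByTwo()), ('classifier', ThresholdClassifier())]
fit_pipeline(toy_steps, [[1.0], [3.0]], [0, 1])
toy_pred = predict_pipeline(toy_steps, [[0.5], [4.0]])
print(toy_pred)  # [0, 1]
```

The classifier never sees the raw input, only the features produced by the preceding step. That is why its coefficients line up with the transformer's output columns rather than with the original vocabulary.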

### LSI Model

To use LsiModel, begin by importing the LsiModel wrapper:

In [None]:
from gensim.sklearn_integration import SklLsiModel

#### Example of Using Pipeline

In [None]:
model = SklLsiModel(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))

### Random Projections Model

To use RpModel, begin by importing the RpModel wrapper:

In [None]:
from gensim.sklearn_integration import SklRpModel

#### Example of Using Pipeline

In [None]:
model = SklRpModel(num_topics=2)
np.random.seed(1)  # seed NumPy's global generator for getting same result
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, data.target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))
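
For intuition about what a random projection does and why seeding matters, here is a small pure-Python sketch. It projects vectors onto random +1/-1 directions and uses a private seeded generator. The function `random_projection` is hypothetical and simplified, and is not gensim's implementation.

```python
import random

def random_projection(vectors, num_topics, seed):
    """Project dense vectors onto num_topics random +1/-1 directions."""
    rng = random.Random(seed)  # private generator: output depends only on seed
    dim = len(vectors[0])
    matrix = [[rng.choice((-1.0, 1.0)) for _ in range(dim)]
              for _ in range(num_topics)]
    scale = 1.0 / (num_topics ** 0.5)
    return [[scale * sum(m * v for m, v in zip(row, vec)) for row in matrix]
            for vec in vectors]

toy_vectors = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]
first = random_projection(toy_vectors, num_topics=2, seed=1)
second = random_projection(toy_vectors, num_topics=2, seed=1)
print(first == second)            # True: same seed, same projection
print(len(first), len(first[0]))  # 2 2
```

Because the projection matrix is random, two runs without a fixed seed generally produce different features, and therefore different classifier coefficients and scores.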

### LDASeq Model

To use LdaSeqModel, begin by importing the LdaSeqModel wrapper:

In [None]:
from gensim.sklearn_integration import SklLdaSeqModel

#### Example of Using Pipeline

In [None]:
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word = Dictionary(map(lambda x: x.split(), test_data))
corpus = [id2word.doc2bow(i.split()) for i in test_data]

model = SklLdaSeqModel(id2word=id2word, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(corpus, test_target)
print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, test_target))