# Text Mining: doc2vec
In the second Text Mining lecture you learned about some more advanced models and techniques for analyzing text: n-grams and word2vec/doc2vec. In this instruction we will see an example of how to train a doc2vec model and use it for text classification.

For this we are going to use the 20newsgroups corpus again, where the documents are newsgroups posts and the label is the newsgroup the post was published in (and thus the topic).

Let's first fetch the dataset:

In [1]:
# Loading the training and test parts of 20newsgroups
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

The first entry looks like this:

In [2]:
print(twenty_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







Notice that, as shown below, targets are not strings but numbers. The `target_names` attribute gives us the list of labels: each target is an index into this list.

In [3]:
print(twenty_train.target[0])

7


In [4]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [5]:
twenty_train.target_names[twenty_train.target[0]]

'rec.autos'

## Preprocessing
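Let's preprocess the text. For this test, we are not going to normalize the text; we will only tokenize it. The gensim function `gensim.utils.simple_preprocess` tokenizes a text, puts everything in lowercase, and eliminates punctuation and digits. With its default settings it also discards tokens shorter than 2 or longer than 15 characters, which is why the single-letter word "I" does not appear in the tokenized output below.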
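
As a rough illustration of what this tokenization does, the sketch below approximates `simple_preprocess` with the standard library only. The helper name `approx_simple_preprocess` is hypothetical, and the length limits are assumed to match gensim's defaults (`min_len=2`, `max_len=15`). It is meant to show the idea, not to replace the real function.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Stdlib approximation of gensim.utils.simple_preprocess (illustration only)."""
    # Lowercase, then keep runs of letters/underscores; digits and punctuation split tokens
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short or too long, or that start with an underscore
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


sample = "WHAT car is this!? It was a 2-door sports car from rac3.wam.umd.edu"
tokens = approx_simple_preprocess(sample)
print(tokens)
```

Note how `2-door` becomes `door`, `rac3` becomes `rac`, and the one-letter word `a` disappears, which is the same pattern visible in the tokenized newsgroup post.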

Gensim's doc2vec expects a list of `TaggedDocument` objects as input. A `TaggedDocument` is created with two explicit parameters: `words`, which has to be a list of strings (tokens), and `tags`, which has to be a list of labels (strings in our case). Each of our documents has exactly one label, so we use a list with just one element (`tags` is a list because a `TaggedDocument` can carry more than one tag). Using the syntax above, we fetch the string label of each document and create the `TaggedDocument` objects.
We do this for both the training and the test set.

In [13]:
# Tokenizing and creating lists of TaggedDocument objects
import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess


def tag_documents(dataset):
    # One TaggedDocument per post: the tokens as words, the newsgroup name as the only tag
    return [
        TaggedDocument(words=simple_preprocess(text), tags=[dataset.target_names[target]])
        for text, target in zip(dataset.data, dataset.target)
    ]


twenty_train_tagged = tag_documents(twenty_train)
twenty_test_tagged = tag_documents(twenty_test)

# print(repr(twenty_train_tagged[0]))
twenty_train_tagged

[TaggedDocument(words=['from', 'lerxst', 'wam', 'umd', 'edu', 'where', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neig
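
To make the structure of these objects concrete, here is a minimal sketch that uses a plain `namedtuple` as a stand-in for gensim's `TaggedDocument`. The name `TaggedDoc` and the toy data are assumptions made for illustration; only the two field names, `words` and `tags`, come from the text above.

```python
from collections import namedtuple

# Stand-in with the same two fields as a TaggedDocument (illustration only)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

target_names = ["rec.autos", "sci.space"]  # toy list of labels
texts = [["what", "car", "is", "this"], ["orbit", "launch", "window"]]
targets = [0, 1]  # indexes into target_names

# Each document gets its tokens and a one-element list holding its string label
tagged = [
    TaggedDoc(words=tokens, tags=[target_names[t]])
    for tokens, t in zip(texts, targets)
]

print(tagged[0].words)
print(tagged[0].tags)
```

Because `tags` is a list, the single label is later read back with `tags[0]`.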

To speed up the calculations a bit, let's fetch the number of cores the machine has:

In [10]:
import multiprocessing

cores = multiprocessing.cpu_count()

cores

12

## Creating a vocabulary

At this point, we are ready to train our doc2vec model. The first step is to build the vocabulary, which determines the sizes of the input and output layers and assigns each token the index used for its one-hot encoding:

In [11]:
# Building the vocabulary
from gensim.models import Doc2Vec
from tqdm import tqdm

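# dm=0 selects the PV-DBOW (distributed bag of words) algorithm;
# vector_size is the length of each document vector
#doc2vec_model = Doc2Vec(dm=0, vector_size=40, min_count=2, workers=cores)
doc2vec_model = Doc2Vec(dm=0, vector_size=40, workers=cores)
doc2vec_model.build_vocab([x for x in tqdm(twenty_train_tagged)])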

100%|██████████| 11314/11314 [00:00<00:00, 3031646.04it/s]
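
Conceptually, building a vocabulary means counting tokens, discarding the ones that are too rare, and giving each remaining token an integer index. The sketch below shows that idea with the standard library. The function `build_vocab_sketch` is hypothetical and much simpler than what gensim does internally; ordering tokens by descending frequency is an assumption made here for readability.

```python
from collections import Counter


def build_vocab_sketch(tokenized_docs, min_count=1):
    """Count tokens, drop rare ones, and give each remaining token an integer index."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Most frequent tokens get the lowest indexes; ties are broken alphabetically
    kept = sorted(
        (t for t, c in counts.items() if c >= min_count),
        key=lambda t: (-counts[t], t),
    )
    return {token: index for index, token in enumerate(kept)}, counts


docs = [["car", "engine", "car"], ["engine", "orbit"], ["car"]]
vocab, counts = build_vocab_sketch(docs, min_count=2)
print(vocab)  # 'orbit' occurs only once, so it is left out
```

The index of a token is the position of the single 1 in its one-hot vector, so the vocabulary size fixes the width of the network's input.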


## Obtaining a document embedding
Once the model object and the vocabulary have been created, it is time to train the neural network that will provide the representation. The hyperparameters are the usual ones for neural networks, such as the number of epochs and the learning rate (`alpha`).

In [12]:
# Training the doc2vec model
# A single train() call runs all 30 epochs and lets gensim decay the learning rate
# from alpha down to min_alpha. Avoid calling train() once per epoch with a
# hand-decremented alpha: a fixed step can push the learning rate below zero.
doc2vec_model.train(twenty_train_tagged, total_examples=doc2vec_model.corpus_count, epochs=30)
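
To get a feeling for the learning-rate hyperparameter, the sketch below computes a linearly decaying schedule over 30 epochs. It is a simplified model of what happens during training, not gensim's actual code: the function `alpha_at` is hypothetical, and the start and end values (0.025 and 0.0001) are assumed to match gensim's default `alpha` and `min_alpha`.

```python
def alpha_at(progress, start_alpha=0.025, end_alpha=0.0001):
    """Linearly interpolated learning rate; progress runs from 0.0 to 1.0."""
    return start_alpha - (start_alpha - end_alpha) * progress


epochs = 30
# Learning rate at the start of each epoch, plus the final value at the end of training
schedule = [alpha_at(epoch / epochs) for epoch in range(epochs + 1)]

print(f"start: {schedule[0]:.4f}, end: {schedule[-1]:.4f}")
print("always positive:", min(schedule) > 0)
```

The point of such a schedule is that the updates become smaller as training progresses while the learning rate never drops to zero or below.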

## Creating the document features vector space
Once the doc2vec model has been trained, we can use it to convert documents into fixed-length vectors and then use these vectors in a classifier. The `infer_vector` method can be used for that:

In [None]:
# Building the feature vector for the classifier
def vec_for_learning(model, docs):
    doc2vec_vectors = [model.infer_vector(doc.words) for doc in docs]
    targets = [doc.tags[0] for doc in docs]
    return doc2vec_vectors, targets

In [None]:
# Translating docs into vectors for training and test set
X_train, y_train = vec_for_learning(doc2vec_model, twenty_train_tagged)
X_test, y_test = vec_for_learning(doc2vec_model, twenty_test_tagged)

## Creating and training a classifier
Finally, we can create a classifier with the usual syntax, and evaluate the results using the usual performance metrics.

In [None]:
# Training a classification model
# C is the inverse regularization strength, so C=1e5 means very weak regularization
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(n_jobs=1, C=1e5)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

In [None]:
# Classification performance metrics
from sklearn.metrics import accuracy_score, f1_score

print(f"Testing accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Testing F1 score: {f1_score(y_test, y_pred, average='weighted')}")
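
To clarify what these two numbers measure, here is a small standard-library sketch that computes accuracy and a weighted F1 score by hand on a toy example. The helper functions `accuracy` and `weighted_f1` and the toy labels are assumptions for illustration; in practice the scikit-learn functions above should be used.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


y_true = ["rec.autos", "rec.autos", "sci.space", "sci.med"]
y_pred = ["rec.autos", "sci.space", "sci.space", "sci.med"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

With `average='weighted'`, classes with more test documents contribute more to the final score, which matters when the newsgroups are not all the same size.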