# Text Mining
In the first Text Mining lecture you saw how to preprocess the data, how to build a Bag of Words model, how to apply tf-idf weighting, and how the model can be used in conjunction with other Data Science techniques.

We are going to see how to put that into practice with Python, by solving a document classification problem using the techniques seen in the lecture.

The packages that we are going to use are `scikit-learn` and `nltk`. The first is one of the most complete Data Science libraries in Python, and contains quite a lot of utilities for Text Mining, while the second is specialized in natural language processing and contains more advanced algorithms. In the second instruction we are going to see some advanced functionalities of `nltk`.

## Loading the data

The first step is loading the corpus, which is `20newsgroups`, a collection of newsgroup posts on 20 different topics. It can be downloaded directly through `scikit-learn`.

In [1]:
# Loading the training and test subsets of 20newsgroups
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Then we can see the topics - the target attribute.

In [2]:
# Visualize the categories (target attribute)
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
# Fragment of document

print("\n".join(twenty_train.data[0].split("\n")[:8]))

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/


## Tokenization and Bag of Words model
The `CountVectorizer` class can be used to directly transform the dataset into a BoW model. It first tokenizes the text into words, and then creates a vector space with one dimension for every word in the dictionary. Finally, it translates the documents in the corpus into count vectors of this space.

In [4]:
# Tokenization and construction of the Bag of Words model

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)
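
With more than 130,000 dimensions, the matrix above is hard to inspect by eye. The following minimal sketch repeats the same steps on a hypothetical three-sentence corpus (the sentences and the `toy_` names are made up for illustration), so that the vocabulary and the count vectors can be read directly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)  # sparse document-term matrix

# vocabulary_ maps each token to its column index; sort the tokens by that index
vocabulary = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocabulary)

# Dense view: one row per document, one count per vocabulary entry
print(toy_counts.toarray())
```

Each row is a document and each column is a word, so the first row contains a 2 in the column of "the", which occurs twice in the first sentence.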

We can then build our first classifier using the count vector space. Notice that we are using the `Pipeline` class of `scikit-learn`, which lets us specify a sequence of operations to be performed on the data. In this case, we apply the `CountVectorizer` and then a linear SVM classifier trained with stochastic gradient descent (`SGDClassifier` with hinge loss).

In [5]:
# Text Mining pipeline v1: tokenization, BoW model, classification with SVM (linear kernel)

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import numpy as np

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=3, random_state=42)),
])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)



0.7007434944237918
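
To see what the pipeline does behind the scenes, the sketch below chains the same two steps by hand on a hypothetical four-document dataset and checks that the predictions match. The documents, labels and variable names are illustrative assumptions, and the classifier uses default settings apart from the loss and the seed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = cars, label 1 = space
toy_docs = [
    "the engine of this car is loud",
    "new tires for my sports car",
    "the rocket reached orbit today",
    "a satellite in low earth orbit",
]
toy_labels = np.array([0, 0, 1, 1])
new_docs = ["which car has the best engine", "orbit of the new satellite"]

# Pipeline version: one object handles vectorization and classification
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=42)),
])
pipe.fit(toy_docs, toy_labels)
pipe_pred = pipe.predict(new_docs)

# Manual version: the same two steps chained by hand
vect = CountVectorizer()
clf = SGDClassifier(loss="hinge", random_state=42)
clf.fit(vect.fit_transform(toy_docs), toy_labels)    # training: fit_transform, then fit
manual_pred = clf.predict(vect.transform(new_docs))  # prediction: transform only, then predict

print(pipe_pred, manual_pred)
```

The important detail is that new documents are only transformed with the vocabulary learned on the training data, never refitted; the pipeline takes care of this automatically.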

## tf-idf scoring
Next, we can try to improve our results by introducing a tf-idf scoring step in the pipeline. We can use `TfidfTransformer` to convert the values of the vector from simple counts (tf) to tf-idf scores.

The tf-idf calculation is done like this:

In [6]:
# Construction of the tf-idf score matrix
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
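
To make the weighting concrete, the following minimal sketch recomputes the scores by hand on a small hypothetical count matrix and compares them with `TfidfTransformer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`), under which the idf of a term is ln((1 + n) / (1 + df)) + 1 and each document vector is then scaled to unit length. The matrix and the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 1, 1],
])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (assumed default settings)
weighted = counts * idf                    # tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

# Same computation done by scikit-learn
sk = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
matches = np.allclose(manual, sk)
print(matches)
```

Terms that appear in few documents get a larger idf, so they weigh more in the final vector than terms that appear almost everywhere.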

We can add it directly to our pipeline, which now looks like this:

In [7]:
# Text Mining pipeline v2: tokenization, BoW model, tf-idf scoring, classification with SVM (linear kernel)

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tf-idf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=3, random_state=42)),
])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)



0.8244822092405736
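
As a side note, `scikit-learn` also offers `TfidfVectorizer`, which combines counting and tf-idf weighting in a single step. The sketch below, on a hypothetical toy corpus, checks that with default settings it produces the same matrix as the two separate steps used in the pipeline above.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Two steps, as in the pipeline: counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(toy_corpus))

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(toy_corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Keeping the two steps separate, as done here, makes it easier to compare the pipeline with and without tf-idf weighting.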

## Stopword removal
The next step is to remove the stopwords. `CountVectorizer` has an integrated English stoplist, and we can add an option to remove the stopwords as we are building the vector space. We simply add the `stop_words='english'` argument, and our pipeline now includes stopword removal and looks like this:

In [8]:
# Text Mining pipeline v3: tokenization, stopword removal, BoW model, tf-idf scoring, classification with SVM (linear kernel)

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tf-idf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=3, random_state=42)),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)



0.8234200743494424
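
In this run the accuracy barely changes (about 0.823 against 0.824 without stopword removal), which is plausible because tf-idf already gives a low weight to very frequent words. To see which words the option actually removes, the following sketch compares the vocabularies built with and without the stoplist on two hypothetical sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, used only for illustration
toy_corpus = [
    "this is the car that I saw on the other day",
    "the engine of the car is very loud",
]

plain = CountVectorizer().fit(toy_corpus)
filtered = CountVectorizer(stop_words="english").fit(toy_corpus)

# Words present without the stoplist but missing with it
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
kept = sorted(filtered.vocabulary_)

print("removed:", removed)
print("kept:", kept)
```

The full stoplist used by a vectorizer can be inspected with `filtered.get_stop_words()`.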

## Stemming
Stemming consists of "chopping off" the suffix of a word in order to obtain its root. We are not integrating stemming in the pipeline (it is quite heavy for the dataset and classifier we have here). Examples of stemmers can be found in the `nltk` package: we are going to see the `Snowball` stemmer.

In [13]:
# Stemming

# The stopword list is required by ignore_stopwords=True: uncomment on the first run
# import nltk
# nltk.download('stopwords')

from nltk.stem.snowball import SnowballStemmer

snowball_stemmer = SnowballStemmer('english', ignore_stopwords=True)

example = 'Process mining is a family of techniques in the field of process management that support the analysis of business processes based on event logs. During process mining, specialized data mining algorithms are applied to event log data in order to identify trends, patterns and details contained in event logs recorded by an information system. Process mining aims to improve process efficiency and understanding of processes.'
wordlist = example.split(' ')
print(example)
print()
print(' '.join([snowball_stemmer.stem(word) for word in wordlist]))

Process mining is a family of techniques in the field of process management that support the analysis of business processes based on event logs. During process mining, specialized data mining algorithms are applied to event log data in order to identify trends, patterns and details contained in event logs recorded by an information system. Process mining aims to improve process efficiency and understanding of processes.

process mine is a famili of techniqu in the field of process manag that support the analysi of busi process base on event logs. during process mining, special data mine algorithm are appli to event log data in order to identifi trends, pattern and detail contain in event log record by an inform system. process mine aim to improv process effici and understand of processes.


As you can see, stemming does not always simply chop off the end of the word. Modern stemmers also slightly modify the root of the word in a further normalization step (this cannot be considered lemmatization, because converging to the lemma is not the goal here).
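
Notice also that in the output above tokens such as "mining," and "logs." are left untouched: splitting on spaces keeps the punctuation attached to the word, and the stemmer no longer recognizes the suffix. A minimal sketch of one possible fix is to extract the alphabetic tokens with a regular expression before stemming. The pattern and the variable names are illustrative assumptions, and a proper tokenizer would handle more cases.

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Without ignore_stopwords, no additional nltk data is needed
stemmer = SnowballStemmer("english")

sentence = "During process mining, specialized data mining algorithms are applied to event logs."

# Splitting on spaces leaves punctuation attached to some words
naive = [stemmer.stem(word) for word in sentence.split(" ")]

# Extracting alphabetic runs first gives the stemmer clean tokens
tokens = re.findall(r"[A-Za-z]+", sentence)
clean = [stemmer.stem(token) for token in tokens]

print(naive)
print(clean)
```
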

## Lemmatization
You have also seen lemmatization: transforming a token into its lemma (base form). In `nltk` you can use the `WordNet` lemmatizer. Let's see the results of lemmatization as opposed to stemming, and compare them with the original text.

In [None]:
# Lemmatization

# import nltk
# nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

example = 'Process mining is a family of techniques in the field of process management that support the analysis of business processes based on event logs. During process mining, specialized data mining algorithms are applied to event log data in order to identify trends, patterns and details contained in event logs recorded by an information system. Process mining aims to improve process efficiency and understanding of processes.'
wordlist = example.split(' ')
print(example)
print()
print(' '.join([snowball_stemmer.stem(word) for word in wordlist]))
print()
print(' '.join([lemmatizer.lemmatize(word) for word in wordlist]))

Lemmatization is a considerably harder problem than stemming. As a result, as you can see, most lemmatizers are far more conservative than stemmers, in order to avoid introducing errors.

The `WordNet` lemmatizer, however, accepts some indication of what the word really is. It is possible to pass a part-of-speech tag as a parameter, and depending on how a word is interpreted, the lemma is different. The default behaviour of the `WordNet` lemmatizer is to consider everything a `NOUN`. Let's see how the result changes when passing a `VERB` tag:

In [None]:
print(lemmatizer.lemmatize('loving'))
print(lemmatizer.lemmatize('loving', 'v'))
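
In practice the part-of-speech tags usually come from a tagger, which typically produces Penn Treebank tags such as `NNS` or `VBD`, while the lemmatizer expects the one-letter WordNet codes (`n`, `v`, `a`, `r`). The following is a minimal sketch of a conversion helper. The function `penn_to_wordnet` and the pre-tagged token list are hypothetical and not part of `nltk`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code used by WordNet."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, like the lemmatizer's default

# Hypothetical pre-tagged tokens, in the (word, tag) format a POS tagger would return
tagged = [("processes", "NNS"), ("applied", "VBN"), ("recorded", "VBN"), ("specialized", "JJ")]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)

# With the WordNet data installed, each pair can then be passed to the lemmatizer:
# lemmas = [lemmatizer.lemmatize(word, pos) for word, pos in pairs]
```

Falling back to noun for unknown tags mirrors the default behaviour described above, so untagged words are treated exactly as before.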