# **Working with Text Data**

This notebook is based on the scikit-learn tutorial "Working with Text Data". The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroup posts) on twenty different topics.
We will see how to: load the file contents and the categories, extract feature vectors suitable for machine learning, train a linear model to perform categorization, and use a grid search strategy to find a good configuration of both the feature extraction components and the classifier.

# **1. Required installations**

In [1]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

# **2. Dataset**
The 20 newsgroups dataset, loaded here with `fetch_20newsgroups`, is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.



In [3]:
# In order to get faster execution times for this first example, we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [4]:
# We can now load the list of files matching those categories as follows:
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)

In [5]:
# Dataset length
len(newsgroups_train.data) # len(newsgroups.filenames)

2257

In [6]:
# Categories
newsgroups_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [7]:
# Print the first one
newsgroups_train.data[0].split("\n")

['From: sd345@city.ac.uk (Michael Collier)',
 'Subject: Converting images to HP LaserJet III?',
 'Nntp-Posting-Host: hampton',
 'Organization: The City University',
 'Lines: 14',
 '',
 'Does anyone know of a good way (standard PC application/PD utility) to',
 'convert tif/img/tga files into LaserJet III format.  We would also like to',
 'do the same, converting to HPGL (HP plotter) files.',
 '',
 'Please email any response.',
 '',
 'Is this the correct group?',
 '',
 'Thanks in advance.  Michael.',
 '-- ',
 'Michael Collier (Programmer)                 The Computer Unit,',
 'Email: M.P.Collier@uk.ac.city                The City University,',
 'Tel: 071 477-8000 x3769                      London,',
 'Fax: 071 477-8565                            EC1V 0HB.',
 '']

Supervised learning algorithms require a category label for each document in the training set.

For speed and memory efficiency, scikit-learn loads the `target` attribute as an array of integers that correspond to the index of the category name in the `target_names` list. The integer category id of each sample is stored in the `target` attribute:

In [8]:
# To get category names
for t in newsgroups_train.target[:10]:
  print(newsgroups_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


In [9]:
print("category of the first example: ", newsgroups_train.target[0])
print("categories in our subset: ", newsgroups_train.target_names)
print("category name of the first example: ", newsgroups_train.target_names[newsgroups_train.target[0]])

category of the first example:  1
categories in our subset:  ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
category name of the first example:  comp.graphics
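
The integer-to-name lookup can also be done for many samples at once. The sketch below uses toy stand-ins for `newsgroups_train.target` and `newsgroups_train.target_names` (the ten labels mirror the ones printed above; the variable names are illustrative only) so that it runs without downloading the dataset.

```python
import numpy as np

# Toy stand-ins for newsgroups_train.target_names / newsgroups_train.target
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
target = np.array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

# Vectorised lookup: index an array of names with the integer labels
labels = np.array(target_names)[target]

# Number of samples per category (categories with no sample get 0)
counts = np.bincount(target, minlength=len(target_names))
per_category = dict(zip(target_names, counts.tolist()))

print(labels[:3])
print(per_category)
```

The same two lines should work unchanged on the real `target` array, which is a convenient way to check how balanced the chosen categories are.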


**Exercise: choose another category and explore its examples.**

In [10]:
### write your code here

# **3. Extracting features from text files**
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

## **3.1 Bag of Words (BOW)**
The most intuitive way to do so is to use a bag-of-words representation:

1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
2. For each document `#i`, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.
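
The two steps above can be sketched in plain Python. This is a minimal illustration on a hypothetical two-document corpus, using whitespace tokenization; it is not how `CountVectorizer` is implemented, which also lowercases, applies a token pattern, and stores the result as a sparse matrix.

```python
# Toy corpus (hypothetical documents, not taken from 20 newsgroups)
docs = ["the cat sat", "the cat saw the dog"]

# Step 1: assign a fixed integer id to each distinct word
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: X[i][j] counts the occurrences of word j in document i
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)
print(X)
```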

The bag-of-words representation implies that `n_features` is the number of distinct words in the corpus: this number is typically larger than 100,000.

If `n_samples == 10000`, storing `X` as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4 GB of RAM, which is barely manageable on today’s computers.

Fortunately, most values in `X` will be zeros, since a given document uses fewer than a few thousand distinct words. For this reason we say that bag-of-words datasets **are typically high-dimensional and sparse**. We can save a lot of memory by storing only the non-zero parts of the feature vectors.

`scipy.sparse` matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
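
To make the memory argument concrete, the sketch below redoes the 4 GB arithmetic and then compares dense and sparse storage on a much smaller synthetic matrix. The matrix size (1000 x 5000) and the figure of ten non-zero entries per row are arbitrary assumptions chosen so the example runs quickly.

```python
import numpy as np
from scipy import sparse

# Dense cost of the example in the text: 10000 x 100000 float32 values
n_samples, n_features = 10_000, 100_000
dense_bytes = n_samples * n_features * np.dtype(np.float32).itemsize
print(dense_bytes)  # 4000000000 bytes, i.e. 4 GB

# Smaller synthetic matrix: 1000 rows, 5000 columns, at most 10 non-zeros per row
rng = np.random.default_rng(0)
rows = np.repeat(np.arange(1000), 10)
cols = rng.integers(0, 5000, size=rows.size)
data = np.ones(rows.size, dtype=np.float32)
X = sparse.csr_matrix((data, (rows, cols)), shape=(1000, 5000))

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
small_dense_bytes = X.shape[0] * X.shape[1] * 4

print(sparse_bytes, small_dense_bytes)
```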

Text preprocessing, tokenizing and filtering of stopwords are all included in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors:

In [11]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newsgroups_train.data)

print("Shape of the bag-of-words matrix:", X_train_counts.shape)

Shape of the bag-of-words matrix: (2257, 35788)


CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:


In [12]:
count_vect.vocabulary_.get(u'algorithm')

4690

The index of a word in the vocabulary identifies the column of the count matrix that stores that word's occurrence counts across the training corpus.
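
As a small check of how feature indices and n-grams behave, the sketch below fits two vectorizers on a hypothetical toy corpus. In current scikit-learn releases the fitted vocabulary is sorted, so indices follow the alphabetical order of the terms; treat that as an observation about the library's present behaviour rather than a guarantee.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical documents)
docs = ["the cat sat", "the cat saw the dog"]

# Unigrams only (the default): one feature per distinct word
unigrams = CountVectorizer().fit(docs)
print(sorted(unigrams.vocabulary_.items(), key=lambda kv: kv[1]))

# Unigrams and bigrams: word pairs such as "the cat" become features too
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))
```

Adding bigrams doubles the vocabulary even on this tiny corpus, which is why larger `ngram_range` values quickly increase the number of features.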

## **3.2 Using TF-IDF Transformer**

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
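
The weighting can be reproduced by hand on a small count matrix. The sketch below assumes the default settings of `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), for which the library documents idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; the count matrix itself is made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 6 documents, 3 terms
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency

tfidf = counts * idf                                    # weight raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize each document

expected = TfidfTransformer().fit_transform(sparse.csr_matrix(counts)).toarray()
print(np.allclose(tfidf, expected))
```

The first term occurs in every document, so its idf is the minimum value of 1, while the rarer terms receive larger weights.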

Both tf and tf–idf can be computed as follows using TfidfTransformer:

In [13]:
# fit a term-frequency-only transformer (idf disabled) to the count matrix
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)

# transform the count matrix into a tf representation
X_train_tf = tf_transformer.transform(X_train_counts)

print("Shape of the TF matrix:", X_train_tf.shape)

Shape of the TF matrix: (2257, 35788)


In [14]:
# equivalently, fit and transform in a single step, this time with idf enabled
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

# **4. Training a classifier**
Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier, and the one most suitable for word counts is the multinomial variant:

In [15]:
clf = MultinomialNB().fit(X_train_tfidf, newsgroups_train.target)

To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call `transform` instead of `fit_transform` on the transformers, since they have already been fit to the training set:

In [16]:
docs_new = ['Medicine is advancing fast', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'Medicine is advancing fast' => sci.med
'OpenGL on the GPU is fast' => comp.graphics
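
The difference between `fit_transform` and `transform` shows up clearly with a word the vectorizer has never seen. In the sketch below (toy documents, chosen for illustration), the vocabulary is frozen after fitting, so the unseen word "zebra" is silently ignored and the new matrix keeps the training number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a toy training set; the vocabulary is cat, dog, ran, sat, the
train_docs = ["the cat sat", "the dog ran"]
vect = CountVectorizer().fit(train_docs)

# transform() reuses that vocabulary: "zebra" has no column and is dropped
new_counts = vect.transform(["the zebra sat"])

print(new_counts.shape)
print(new_counts.toarray())
```

Calling `fit_transform` on the new documents instead would build a different vocabulary, and the resulting columns would no longer line up with the features the classifier was trained on.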


# **5. Building a pipeline**

In order to make the vectorizer => transformer => classifier chain easier to work with, `scikit-learn` provides a `Pipeline` class that behaves like a compound classifier:

In [17]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [18]:
text_clf.fit(newsgroups_train.data, newsgroups_train.target)

In [19]:
# same behaviour, now going through the pipeline on the raw documents
predicted = text_clf.predict(docs_new)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, newsgroups_train.target_names[category]))

'Medicine is advancing fast' => sci.med
'OpenGL on the GPU is fast' => comp.graphics
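
To see that a pipeline is just the manual chain bundled into one object, the sketch below trains both versions on a tiny hypothetical corpus (four made-up documents with two classes) and compares their predictions on raw text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data: class 0 is graphics-like, class 1 is medicine-like
docs = ["graphics card renders images", "doctor treats patient illness",
        "gpu renders images fast", "patient needs medicine"]
y = [0, 1, 0, 1]
new_docs = ["gpu renders graphics", "doctor gives medicine"]

# Manual chain: every step is fitted and applied explicitly
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB().fit(tfidf.fit_transform(vect.fit_transform(docs)), y)
manual_pred = clf.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline: the same steps, fitted and applied with a single call each
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
]).fit(docs, y)
pipe_pred = pipe.predict(new_docs)

print(manual_pred, pipe_pred)
```

The step names (`vect`, `tfidf`, `clf`) are arbitrary labels; they matter later because grid search addresses parameters through them.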


# **6. Model evaluation**
Evaluating the predictive accuracy of the model is equally easy:

In [20]:
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
docs_test = newsgroups_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == newsgroups_test.target)

0.8348868175765646
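
The expression `np.mean(predicted == target)` is the fraction of correct predictions, which is the same quantity that `metrics.accuracy_score` returns. A quick check on made-up labels:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels (5 of 6 are correct)
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

manual = np.mean(y_pred == y_true)              # mean of a boolean array
library = metrics.accuracy_score(y_true, y_pred)

print(manual, library)
```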

Let's try a different model:

In [21]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

text_clf.fit(newsgroups_train.data, newsgroups_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == newsgroups_test.target)

0.9101198402130493

In [22]:
# Metrics to evaluate the results
from sklearn import metrics

print("classification_report: \n")
print(metrics.classification_report(newsgroups_test.target, predicted,
    target_names=newsgroups_test.target_names))

print("confusion_matrix: \n")
metrics.confusion_matrix(newsgroups_test.target, predicted)

classification_report: 

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.80      0.87       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.94      0.89      0.91       396
soc.religion.christian       0.90      0.95      0.93       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502

confusion_matrix: 



array([[256,  11,  16,  36],
       [  4, 380,   3,   2],
       [  5,  35, 353,   3],
       [  5,  11,   4, 378]], dtype=int64)
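
The classification report can be recovered from the confusion matrix, which helps when reading both outputs together. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and recomputes recall, precision, and accuracy with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[256,  11,  16,  36],
               [  4, 380,   3,   2],
               [  5,  35, 353,   3],
               [  5,  11,   4, 378]])

correct = np.diag(cm)                  # correctly classified documents per class
recall = correct / cm.sum(axis=1)      # divide by the number of true members (row sums)
precision = correct / cm.sum(axis=0)   # divide by the number of predictions (column sums)
accuracy = correct.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(accuracy)
```

The first row shows where `alt.atheism` loses recall: 36 of its 319 test posts are predicted as `soc.religion.christian`.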

# **7. Hyperparameter Tuning**
We’ve already encountered some parameters such as use_idf in the `TfidfTransformer`. Classifiers tend to have many parameters as well; e.g., `MultinomialNB` includes a smoothing parameter alpha and `SGDClassifier` has a penalty parameter alpha and configurable loss and penalty terms in the objective function (see the module documentation, or use the Python help function to get a description of these).

Instead of tweaking the parameters of the various components of the chain by hand, it is possible to run an exhaustive search for the best parameters on a grid of possible values. We try the classifier on either words or bigrams, with or without idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:

In [41]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
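
Each key in the grid has the form `<step name>__<parameter name>`, where the step name is the label given in the `Pipeline`. To get a feel for the cost of the search before running it, the sketch below enumerates the grid with `ParameterGrid`; the fold count of 5 is an assumption that matches the `cv=5` used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# Every combination of the listed values is one candidate configuration
grid = list(ParameterGrid(parameters))
n_candidates = len(grid)

# With 5-fold cross-validation, each candidate is fitted 5 times
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
print(grid[0])
```

The number of candidates is the product of the list lengths, so adding one more parameter with three values would triple the work.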

In [42]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

In [49]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)

In [37]:
# For time reasons, only the first 400 examples are considered
gs_clf = gs_clf.fit(newsgroups_train.data[:400], newsgroups_train.target[:400])

In [40]:
newsgroups_train.target_names[gs_clf.predict(['Lighning is manufactured by Adobe'])[0]]

'comp.graphics'

The object’s `best_score_` and `best_params_` attributes store the best mean cross-validation score and the parameter setting that produced it:

In [28]:
gs_clf.best_score_

0.9175000000000001

In [29]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)


In [30]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [31]:
import pandas as pd
pd.DataFrame(gs_clf.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__alpha | param_tfidf__use_idf | param_vect__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.302126 | 0.035317 | 0.044225 | 0.006861 | 0.01 | True | (1, 1) | {'clf__alpha': 0.01, 'tfidf__use_idf': True, '... | 0.9 | 0.8625 | 0.8875 | 0.9125 | 0.9 | 0.8925 | 0.016956 | 4 |
| 1 | 0.990567 | 0.08568 | 0.129244 | 0.012035 | 0.01 | True | (1, 2) | {'clf__alpha': 0.01, 'tfidf__use_idf': True, '... | 0.8875 | 0.85 | 0.9 | 0.925 | 0.925 | 0.8975 | 0.027839 | 3 |
| 2 | 0.29391 | 0.034979 | 0.070258 | 0.032508 | 0.01 | False | (1, 1) | {'clf__alpha': 0.01, 'tfidf__use_idf': False, ... | 0.7 | 0.7125 | 0.775 | 0.8 | 0.825 | 0.7625 | 0.048734 | 8 |
| 3 | 0.926672 | 0.091797 | 0.105911 | 0.012893 | 0.01 | False | (1, 2) | {'clf__alpha': 0.01, 'tfidf__use_idf': False, ... | 0.6875 | 0.7 | 0.8 | 0.8125 | 0.8375 | 0.7675 | 0.061543 | 7 |
| 4 | 0.309937 | 0.050483 | 0.054092 | 0.016545 | 0.001 | True | (1, 1) | {'clf__alpha': 0.001, 'tfidf__use_idf': True, ... | 0.9 | 0.9 | 0.925 | 0.925 | 0.9375 | 0.9175 | 0.015 | 1 |
| 5 | 0.800466 | 0.067817 | 0.143174 | 0.060161 | 0.001 | True | (1, 2) | {'clf__alpha': 0.001, 'tfidf__use_idf': True, ... | 0.9125 | 0.8875 | 0.8875 | 0.925 | 0.95 | 0.9125 | 0.023717 | 2 |
| 6 | 0.233976 | 0.01074 | 0.046231 | 0.007493 | 0.001 | False | (1, 1) | {'clf__alpha': 0.001, 'tfidf__use_idf': False,... | 0.825 | 0.7625 | 0.75 | 0.8 | 0.825 | 0.7925 | 0.031225 | 6 |
| 7 | 0.826545 | 0.084812 | 0.069015 | 0.015922 | 0.001 | False | (1, 2) | {'clf__alpha': 0.001, 'tfidf__use_idf': False,... | 0.8 | 0.8 | 0.8 | 0.85 | 0.9 | 0.83 | 0.04 | 5 |
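
A table like this is easier to read once it is sorted or aggregated. The sketch below rebuilds a reduced copy of the results (only the three parameter columns and `mean_test_score`, with values copied from the output above) and summarises it with pandas; on the real object the same calls could be applied to `pd.DataFrame(gs_clf.cv_results_)`.

```python
import pandas as pd

# Reduced copy of the cross-validation results shown above
results = pd.DataFrame({
    'param_clf__alpha': [0.01, 0.01, 0.01, 0.01, 0.001, 0.001, 0.001, 0.001],
    'param_tfidf__use_idf': [True, True, False, False, True, True, False, False],
    'param_vect__ngram_range': [(1, 1), (1, 2)] * 4,
    'mean_test_score': [0.8925, 0.8975, 0.7625, 0.7675, 0.9175, 0.9125, 0.7925, 0.83],
})

# Best configurations first
ranked = results.sort_values('mean_test_score', ascending=False)
best = ranked.iloc[0]

# Average score with and without idf, across the other parameters
by_idf = results.groupby('param_tfidf__use_idf')['mean_test_score'].mean()

print(ranked.head(3))
print(by_idf)
```

In this run, enabling idf matters far more than the choice of `alpha` or `ngram_range`: the four idf configurations average about 0.905 against about 0.788 without it.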


# **8. Exercises**

### **Exercise 1: Load different categories**
Modify the category list to load `sci.space` together with another category of your choice. Train the model and test its performance.

In [32]:
## write your code here

### **Exercise 2: Test feature extraction**
Test different values of `ngram_range` in `CountVectorizer` and check the impact on the model's performance.

In [33]:
## write your code here

### **Exercise 3: Test a different classifier**
Replace `MultinomialNB` with `SVC` (Support Vector Classifier). Compare the results.


In [34]:
## write your code here