# Text Categorization Demo
In addition to scikit-learn, this demo uses NLTK, one of the main Python libraries for working with language data.

You can learn more about NLTK here: https://www.nltk.org/

In [1]:
import nltk

First we download the necessary NLTK data: the Reuters collection of news articles and NLTK's list of stopwords, i.e. words like 'the' and 'of' that don't, by themselves, convey much about what documents are about. 

In [2]:
nltk.download('reuters')
nltk.download('stopwords')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\jmmck\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jmmck\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Having made sure that the data is downloaded, we can go ahead and import the reuters and stopwords corpus objects, which make the data easier to work with.

In [3]:
from nltk.corpus import reuters
from nltk.corpus import stopwords

Load Reuters categories and file IDs

In [4]:
categories = reuters.categories()
fileids = reuters.fileids()

In [5]:
print(categories[:10])  # Show the first ten categories
len(categories)

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee']


90

In [6]:
print(fileids[:5])  # Show the first 5 file ids

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833']


How many documents do we have?

In [7]:
len(fileids)

10788

Show the raw text of the first file.

In [8]:
reuters.raw('test/14826')



Let's see what category it has been assigned:

In [9]:
reuters.categories('test/14826')

['trade']

Extract documents and their respective categories. This gives us a (text, category) tuple for every news article.

In [10]:
documents = [(reuters.raw(fileid), reuters.categories(fileid)[0])
             for fileid in fileids]

Let's see the first 5 (text, category) tuples:

In [11]:
documents[:5]

  'trade'),
 ("CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STOCKS\n  A survey of 19 provinces and seven cities\n  showed vermin consume between seven and 12 pct of China's grain\n  stocks, the China Daily said.\n      It also said that each year 1.575 mln tonnes, or 25 pct, of\n  China's fruit output are left to rot, and 2.1 mln tonnes, or up\n  to 30 pct, of its vegetables. The paper blamed the waste on\n  inadequate storage and bad preservation methods.\n      It said the government had launched a national programme to\n  reduce waste, calling for improved technology in storage and\n  preservation, and greater production of additives. The paper\n  gave no further details.\n  \n\n",
  'grain'),
 ("JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS\n  The Ministry of International Trade and\n  Industry (MITI) will revise its long-term energy supply/demand\n  outlook by August to meet a forecast downtrend in Japanese\n  energy demand, ministry officials said.\n      MITI is expected to lo

Check that we still have the same number of documents.

In [12]:
len(documents)

10788

In [13]:
print(documents[0][0],
      documents[0][1])

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on April 17, in
  retaliation for Japan's alleged failure to stick to a pact not
  to sell semiconductors on world markets at below cost.
      Unofficial Japanese estimates put the impact of the tariffs
  at 10 billion dlrs and spokesmen for major electronics firms
  said they would virtually halt exports

Split into texts and labels

In [14]:
texts, labels = zip(*documents)

# Using zip above is equivalent to:
# texts = [document[0] for document in documents]
# labels = [document[1] for document in documents]
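
As a small, self-contained illustration of the unpacking idiom used above, the sketch below applies `zip(*...)` to a made-up list of pairs (the `toy_documents` values are invented for illustration). One detail worth knowing is that `zip` yields tuples rather than lists, so the list comprehensions in the comment give the same items in a different container type.

```python
# Hypothetical (text, category) pairs standing in for the Reuters documents.
toy_documents = [
    ("vermin eat grain stocks", "grain"),
    ("exporters fear trade rift", "trade"),
    ("energy demand revised", "crude"),
]

# The * operator passes each pair to zip as a separate argument.
# zip then groups all first elements together and all second elements together.
toy_texts, toy_labels = zip(*toy_documents)

print(toy_texts)
print(toy_labels)
print(type(toy_texts).__name__)  # the results are tuples, not lists
```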

In [15]:
print(texts[0])
print(labels[0])

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on April 17, in
  retaliation for Japan's alleged failure to stick to a pact not
  to sell semiconductors on world markets at below cost.
      Unofficial Japanese estimates put the impact of the tariffs
  at 10 billion dlrs and spokesmen for major electronics firms
  said they would virtually halt exports

## Preprocess texts
In this case, we will remove stopwords because they are very common but contribute little to text categorization. In other NLP tasks, such as text understanding, the stopwords may be crucial.

Punkt is a pre-trained model for splitting text into sentences; NLTK's word_tokenize uses it to find sentence boundaries before breaking each sentence into words. Tokenization involves breaking down a text into smaller pieces, such as words or sentences, which can then be analyzed or processed further.

In [16]:
#nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\jmmck\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Let's see how the first news article gets tokenized.

In [17]:
# Tokens correspond roughly (but not exactly) to words.
for token in nltk.word_tokenize(texts[0]):
    print(token)

ASIAN
EXPORTERS
FEAR
DAMAGE
FROM
U.S.-JAPAN
RIFT
Mounting
trade
friction
between
the
U.S.
And
Japan
has
raised
fears
among
many
of
Asia
's
exporting
nations
that
the
row
could
inflict
far-reaching
economic
damage
,
businessmen
and
officials
said
.
They
told
Reuter
correspondents
in
Asian
capitals
a
U.S.
Move
against
Japan
might
boost
protectionist
sentiment
in
the
U.S.
And
lead
to
curbs
on
American
imports
of
their
products
.
But
some
exporters
said
that
while
the
conflict
would
hurt
them
in
the
long-run
,
in
the
short-term
Tokyo
's
loss
might
be
their
gain
.
The
U.S.
Has
said
it
will
impose
300
mln
dlrs
of
tariffs
on
imports
of
Japanese
electronics
goods
on
April
17
,
in
retaliation
for
Japan
's
alleged
failure
to
stick
to
a
pact
not
to
sell
semiconductors
on
world
markets
at
below
cost
.
Unofficial
Japanese
estimates
put
the
impact
of
the
tariffs
at
10
billion
dlrs
and
spokesmen
for
major
electronics
firms
said
they
would
virtually
halt
exports
of
products
hit
by
the
new
taxes
.
``
W

We will use the English language stopwords for stopword removal.

In [18]:
stop_words = set(stopwords.words('english'))

Preprocess texts by:
- Converting them to lower case
- Rejecting stopwords
- Rejecting words that are not alphanumeric

In [19]:
preprocessed_texts = [
    ' '.join(word
             for word in nltk.word_tokenize(text.lower())
             if word.isalnum() and word not in stop_words)
    for text in texts
]
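
To see what the two filters in the cell above actually keep, here is a minimal sketch on a handful of tokens copied from the first article. The `toy_stop_words` set is a tiny stand-in for NLTK's English list, assumed here so that the snippet runs without any downloads. Note that `isalnum()` rejects anything containing punctuation, so hyphenated words and abbreviations such as "u.s." are dropped along with commas, while numbers are kept.

```python
# Lower-cased tokens taken from the tokenizer output shown earlier.
tokens = ["the", "u.s.", "has", "raised", "far-reaching",
          "fears", "300", "mln", ",", "'s"]

# Hypothetical miniature stopword list (the real one is much longer).
toy_stop_words = {"the", "has"}

# Same two conditions as the preprocessing cell: alphanumeric and not a stopword.
kept = [t for t in tokens if t.isalnum() and t not in toy_stop_words]
dropped = [t for t in tokens if t not in kept]

print("kept:   ", kept)
print("dropped:", dropped)
```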

We should still have the same number of texts

In [20]:
len(preprocessed_texts)

10788

We can now see how our preprocessing has affected one of the texts

In [21]:
preprocessed_texts[0]



Notice in the example that we have the word 'fear' and the word 'fears'. We could use stemming or lemmatization to reduce words to their roots.

### Stemming
Stemming chops off the ends of words to reduce them to their root form, which might not always be a real word. For example:

- "running" → "run"
- "runner" → "run"
- "happiness" → "happi"

Stemming uses simple, rule-based methods (like removing common suffixes such as "-ing," "-ed," or "-es") without understanding the meaning of the word.

Stemming is fast and straightforward but can sometimes produce incorrect results because it doesn’t consider the context or grammatical rules. For instance:

- "design" and "designation" might both be stemmed to "design," which is an over-generalization.
- "better" might be stemmed to "bett," which isn’t meaningful, because a stemming rule removes "er" endings.
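
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm or any stemmer shipped with NLTK; the function `naive_stem`, its suffix list, and the minimum stem length are all assumptions made for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s"), min_stem_length=3):
    """Strip the first matching suffix, with no knowledge of meaning or grammar."""
    for suffix in suffixes:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem_length:
            return word[:stem_length]
    return word

for word in ["fear", "fears", "raised", "running", "exports"]:
    print(word, "->", naive_stem(word))
```

With these toy rules, "fear" and "fears" collapse to the same stem, which is the effect we want for categorization. The same rules also turn "running" into "runn" and "raised" into "rais", which shows why real stemmers need more carefully designed rules.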

### Lemmatization
Lemmatization reduces words to their lemma, which is the base or dictionary form of the word. Unlike stemming, the result is always a valid word. For example:

- "running" → "run"
- "was" → "be"
- "better" → "good"

Lemmatization uses a vocabulary and grammar rules to understand the context and part of speech of a word. This makes it more accurate than stemming, but also more complex and slower.

Because lemmatization relies on linguistic knowledge, it is better at handling irregular words and meanings. For example, it knows that "better" is the comparative form of "good" and will normalize it correctly.
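
The following sketch illustrates, in a very reduced form, why lemmatization needs both a vocabulary and the part of speech. The `TOY_LEMMAS` dictionary and the one-letter part-of-speech codes are hypothetical and stand in for the much larger linguistic resources that a real lemmatizer consults.

```python
# Hypothetical mini-vocabulary keyed by (word, part of speech).
# "v" = verb, "a" = adjective, "n" = noun.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("was", "v"): "be",
    ("better", "a"): "good",
    ("fears", "n"): "fear",
    ("fears", "v"): "fear",
}

def toy_lemmatize(word, pos):
    """Look the word up; fall back to the lower-cased word if it is unknown."""
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("better", "a"))  # found in the vocabulary as an adjective
print(toy_lemmatize("better", "n"))  # not listed as a noun, so left unchanged
print(toy_lemmatize("Was", "v"))
```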

## Convert texts to feature vectors

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_texts)

We now have a TF-IDF matrix with one row for every document, and one column for every word. TF stands for Term Frequency and IDF stands for Inverse Document Frequency. A term's TF-IDF value tends to increase the more often the term appears in a document, and tends to decrease the more documents the term appears in. The highest values would be for a term that occurs frequently in a document but does not occur in many documents.
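
The calculation can be sketched by hand on three tiny documents. The formula below (raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, then scaling each row to unit length) is assumed to correspond to the default settings of TfidfVectorizer; the toy documents and the helper name `tfidf_row` are invented for illustration.

```python
import math
from collections import Counter

# Three already-tokenized toy documents.
docs = [
    ["grain", "stocks", "grain"],
    ["trade", "tariffs"],
    ["trade", "grain"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Return {term: weight} for one document, scaled to unit (L2) length."""
    tf = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for i, doc in enumerate(docs):
    print(i, tfidf_row(doc))
```

In the second document both terms occur once, but "tariffs" receives the higher weight because it appears in fewer documents than "trade".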

X is represented as a sparse matrix (one in which most of the values are zeros).

When we print it, the format is:<br>(document number, term number), TF-IDF value
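
The idea behind this printed format can be sketched with plain lists: only the non-zero entries are kept, each with its row and column position. This is a simplified picture; the matrix returned by scikit-learn is a SciPy sparse matrix whose internal layout differs in detail, and the `dense` values here are made up.

```python
# A small dense matrix in which most entries are zero.
dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
    [0.3, 0.0, 0.0, 0.0],
]

# Keep only the non-zero entries as (row, column, value) triples.
triples = [
    (i, j, value)
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0.0
]

for i, j, value in triples:
    print(f"  ({i}, {j})\t{value}")

total = len(dense) * len(dense[0])
print(f"stored {len(triples)} of {total} values")
```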

In [24]:
print(X)

  (0, 9344)	0.024370226024915873
  (0, 9087)	0.036280507355199325
  (0, 28275)	0.022781195002879544
  (0, 28173)	0.031226219565106267
  (0, 16545)	0.0313061015713933
  (0, 8798)	0.02417835268429823
  (0, 17020)	0.045696049474906124
  (0, 13939)	0.02196857515962073
  (0, 14952)	0.05452042525481219
  (0, 16030)	0.05452042525481219
  (0, 24189)	0.041287184377645496
  (0, 16746)	0.03884555290629168
  (0, 21995)	0.036513993943071914
  (0, 7851)	0.07671010637141087
  (0, 20552)	0.029595167816311944
  (0, 21616)	0.03937586487344196
  (0, 10590)	0.02921465716318497
  (0, 3107)	0.06270964846104592
  (0, 17554)	0.04265400271943684
  (0, 28863)	0.04312675778563214
  (0, 20433)	0.03195329370050054
  (0, 7916)	0.030741780975692117
  (0, 24964)	0.03884555290629168
  (0, 16492)	0.037137029869158816
  (0, 9262)	0.04092726249587176
  :	:
  (10786, 23811)	0.12517266933486068
  (10786, 24027)	0.08601425309085554
  (10786, 17798)	0.11496455402115062
  (10786, 164)	0.10794184692695007
  (10786, 18526)	0.13

In [25]:
# Get the terms
feature_names = vectorizer.get_feature_names_out()

# Show every thousandth term 
feature_names[::1000]

array(['000', '780', 'allegiance', 'austrian', 'bnl', 'carvalho',
       'colony', 'credibly', 'develops', 'economic', 'explosives',
       'franchiser', 'grumman', 'huashan', 'intractable', 'labrador',
       'maintain', 'missile', 'nonbroadcast', 'padding', 'policymaker',
       'quicken', 'reprimanded', 'sardinia', 'sir', 'stockmarket',
       'telequest', 'tylan', 'voluntarily', 'zoran'], dtype=object)

In [26]:
no_of_docs, no_of_features = X.shape

no_of_docs, no_of_features

(10788, 29013)

In [27]:
feature_names[9340]

'encouraging'

## Train a Text Classifier

First we prepare our training data and test data

In [28]:
from sklearn.model_selection import train_test_split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=42)

Check how many labels are in our training set

In [30]:
len(y_train)

7551

We should have a corresponding number of rows and one feature per term in our X_train

In [31]:
X_train.shape

(7551, 29013)
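
The row count of 7551 follows from the 70/30 split of the 10788 documents. A short check of the arithmetic is sketched below; it assumes that the test set size is rounded up to a whole number of documents and that the training set receives the remainder, which is consistent with the figures shown above.

```python
import math

n_samples = 10788   # number of documents, as shown by len(documents)
test_size = 0.3     # fraction passed to train_test_split

# Assumption: the test share is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("training documents:", n_train)
print("test documents:    ", n_test)
```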

Here we train a Naive Bayes classifier, commonly used in text classification demos

In [32]:
from sklearn.naive_bayes import MultinomialNB

In [33]:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
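
For readers curious about what a multinomial Naive Bayes classifier computes, here is a minimal sketch in plain Python on word counts. It is not scikit-learn's implementation: the functions `train_nb` and `predict_nb`, the toy documents, and the use of add-one smoothing (`alpha=1.0`) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {term for doc in docs for term in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    model = {}
    for label, n_class_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        log_prior = math.log(n_class_docs / len(labels))
        log_likelihood = {
            term: math.log((term_counts[label][term] + alpha)
                           / (total_terms + alpha * len(vocab)))
            for term in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Terms never seen in training are ignored.
        return log_prior + sum(log_likelihood[t] for t in doc if t in log_likelihood)
    return max(model, key=score)

toy_docs = [
    ["grain", "wheat", "tonnes"],
    ["grain", "corn"],
    ["tariffs", "trade", "japan"],
    ["trade", "exports"],
]
toy_labels = ["grain", "grain", "trade", "trade"]

model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, ["wheat", "exports", "grain"]))
print(predict_nb(model, ["japan", "tariffs"]))
```

The log prior favours classes with many training documents, which may help to explain why a classifier of this kind leans towards large categories when other categories have very few examples.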

## Test the Naive Bayes Classifier

In [34]:
from sklearn.metrics import accuracy_score, classification_report

In [35]:
y_pred = nb_classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.651220265678097
Classification Report:
                 precision    recall  f1-score   support

            acq       0.56      0.94      0.70       711
           alum       0.00      0.00      0.00        14
         barley       0.00      0.00      0.00        10
            bop       0.00      0.00      0.00        35
        carcass       0.00      0.00      0.00        18
     castor-oil       0.00      0.00      0.00         2
          cocoa       0.00      0.00      0.00        24
    coconut-oil       0.00      0.00      0.00         1
         coffee       1.00      0.03      0.05        39
         copper       0.00      0.00      0.00        19
           corn       1.00      0.16      0.28        75
         cotton       0.00      0.00      0.00        12
            cpi       0.00      0.00      0.00        31
            cpu       0.00      0.00      0.00         2
          crude       0.94      0.68      0.79       149
            dfl       0.00      0.00



### Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positives.

Formula: Precision = TP / (TP + FP)

- TP (True Positives): The number of correct positive predictions.
- FP (False Positives): The number of incorrect positive predictions.

Precision answers the question, "Of all the documents that were classified as a particular category, how many were actually in that category?" High precision indicates a low false positive rate.

### Recall
Recall (or sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class.

Formula: Recall = TP / (TP + FN) 

- TP (True Positives): The number of correct positive predictions.
- FN (False Negatives): The number of actual positives that were not correctly identified.

Recall answers the question, "Of all the documents that actually belong to a particular category, how many were correctly identified as such?" High recall indicates a low false negative rate.

### F1-Score
The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both concerns.

Formula: F1-score = 2 × [ Precision × Recall / (Precision + Recall) ]
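
The three formulas can be applied by hand to a small set of predictions. The sketch below counts TP, FP and FN for one category at a time; the labels in `y_true_toy` and `y_pred_toy` and the function name `precision_recall_f1` are made up for illustration. Where a denominator is zero, the sketch returns 0.0, which appears to be how the 0.00 rows in the reports above arise for categories that are never predicted.

```python
def precision_recall_f1(y_true, y_pred, target):
    """Compute precision, recall and F1 for a single target category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true_toy = ["acq", "acq", "acq", "trade", "trade", "grain"]
y_pred_toy = ["acq", "acq", "trade", "acq", "trade", "acq"]

for category in ["acq", "trade", "grain"]:
    p, r, f1 = precision_recall_f1(y_true_toy, y_pred_toy, category)
    print(f"{category:>6}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```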

The F1-score is useful when you need a balance between precision and recall. It is especially important when you have an uneven class distribution. It is also possible to weight precision or recall more heavily than the other if you have a preference. For example, in most web searches you might weight precision more highly if you're satisfied with a single document that answers your question. When searching for legal precedents, you might value recall more highly because you don't want to miss something that could be relevant to a case. The Wikipedia article gives more details on how to modify the formula: https://en.wikipedia.org/wiki/Precision_and_recall
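
One common way of weighting the two measures is the F-beta score, F_beta = (1 + beta²) × Precision × Recall / (beta² × Precision + Recall). Values of beta below 1 favour precision, values above 1 favour recall, and beta = 1 gives the ordinary F1-score. The sketch below applies it to the rounded precision and recall reported for the acq category by the Naive Bayes classifier; the function name `f_beta` is an assumption for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Rounded values for 'acq' from the Naive Bayes classification report.
precision, recall = 0.56, 0.94

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(precision, recall, beta):.2f}")
```

Because recall is higher than precision for this category, the score rises as beta increases and more weight is placed on recall.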

### Support
Support is the number of documents in the corresponding class. It is useful for understanding the context of these metrics.

## Training a Support Vector Machine (SVM) Classifier
SVMs generally perform very well for text categorization with TF-IDF vectors.

In [36]:
from sklearn.svm import SVC

In [37]:
svm_classifier = SVC()
svm_classifier.fit(X_train, y_train)

## Test the SVM classifier

In [38]:
svm_y_pred = svm_classifier.predict(X_test)
print("Support Vector Machine Classifier")
print("Accuracy:", accuracy_score(y_test, svm_y_pred))
print("Classification Report:")
print(classification_report(y_test, svm_y_pred))

Support Vector Machine Classifier
Accuracy: 0.8554216867469879
Classification Report:
                 precision    recall  f1-score   support

            acq       0.76      0.98      0.85       711
           alum       1.00      0.36      0.53        14
         barley       0.80      0.80      0.80        10
            bop       0.83      0.57      0.68        35
        carcass       1.00      0.39      0.56        18
     castor-oil       0.00      0.00      0.00         2
          cocoa       0.96      0.92      0.94        24
    coconut-oil       0.00      0.00      0.00         1
         coffee       0.97      0.87      0.92        39
         copper       1.00      0.74      0.85        19
           corn       0.96      0.64      0.77        75
         cotton       1.00      0.67      0.80        12
            cpi       0.67      0.58      0.62        31
            cpu       0.00      0.00      0.00         2
          crude       0.80      0.91      0.85       149
 



## Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of random variables (our Xs) under consideration. It can simplify the dataset without losing significant information, making it easier to visualize and process. It can also improve classification accuracy.

For this example we will use Singular Value Decomposition (SVD). This article is an accessible and helpful introduction to SVD:
https://www.ibm.com/think/topics/latent-semantic-analysis
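
To give a feel for what a truncated SVD does, the sketch below reduces a tiny document-term matrix from four columns to a single component using power iteration in plain Python. This is only an illustration of the underlying idea under simplifying assumptions: the matrix `A` is made up, only one component is extracted, and scikit-learn's TruncatedSVD uses a different and far more efficient solver.

```python
import math

def top_component(rows, iterations=100):
    """Approximate the first right singular vector of the matrix by power iteration."""
    n_cols = len(rows[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # Project every row onto the current direction: scores = A v
        scores = [sum(a * b for a, b in zip(row, v)) for row in rows]
        # Map back to term space: w = A^T scores
        w = [sum(row[j] * s for row, s in zip(rows, scores)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy document-term matrix: documents 0 and 1 share terms 0 and 1,
# document 2 uses terms 2 and 3 only.
A = [
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

v = top_component(A)

# One number per document instead of four: the reduced representation.
reduced = [sum(a * b for a, b in zip(row, v)) for row in A]

print([round(x, 3) for x in v])
print([round(x, 3) for x in reduced])
```

The single component found here combines the two terms that co-occur most strongly, so the first two documents are summarized well by one number while the third is not. Keeping more components, as in the cell below with 100, retains more of the original information.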


In [39]:
from sklearn.decomposition import TruncatedSVD

In [40]:
svd = TruncatedSVD(
    n_components=100,   # Reduce from 29013 to 100 dimensions
    random_state=42,
    )
X_train_svd = svd.fit_transform(X_train)
X_test_svd = svd.transform(X_test)

Let's check that we do indeed now have 100 dimensions.

In [41]:
X_train_svd.shape

(7551, 100)

Notice that our matrix is no longer sparse.

In [42]:
X_train_svd

array([[ 6.78064090e-01,  2.43919400e-02, -2.22954311e-02, ...,
         1.22081615e-02,  2.60899294e-02,  3.50220243e-02],
       [ 5.18981407e-02,  1.03080549e-01, -2.73910852e-02, ...,
         3.30937290e-02, -1.46573520e-04,  2.40227293e-03],
       [ 7.86173867e-02,  1.82453179e-01,  3.38059313e-02, ...,
        -4.61500910e-03, -2.74544211e-02,  1.81255110e-02],
       ...,
       [ 3.83281277e-02,  1.35469987e-01,  3.06066221e-03, ...,
        -1.45224719e-02, -5.46362361e-02, -2.98572894e-02],
       [ 5.58475473e-01, -6.93406136e-02,  5.78420991e-02, ...,
         2.06017080e-03, -4.45040568e-03,  6.93878547e-03],
       [ 7.00182357e-01, -1.41622861e-01, -5.86343373e-02, ...,
         1.36972786e-02,  3.86051943e-02,  1.65160860e-02]])
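
A dense matrix is affordable here because it has so few columns. The rough arithmetic below compares the memory needed for the reduced training matrix with what the original training matrix would need if it were stored densely. It assumes 8 bytes per value (64-bit floats) and ignores any overhead.

```python
n_rows = 7551            # training documents
n_terms = 29013          # columns before reduction
n_components = 100       # columns after reduction
bytes_per_value = 8      # assumption: 64-bit floating point values

reduced_bytes = n_rows * n_components * bytes_per_value
full_dense_bytes = n_rows * n_terms * bytes_per_value

print(f"reduced matrix, dense:  {reduced_bytes / 1e6:.1f} MB")
print(f"original matrix, dense: {full_dense_bytes / 1e9:.2f} GB")
```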

We'll now train a new support vector machine (SVM) classifier with our dataset reduced using singular value decomposition (SVD).

In [43]:
svm_classifier_svd = SVC()
svm_classifier_svd.fit(X_train_svd, y_train)

Finally, let's evaluate our new classifier.

In [44]:
svm_y_pred_svd = svm_classifier_svd.predict(X_test_svd)
print("Support Vector Machine Classifier with SVD")
print("Accuracy:", accuracy_score(y_test, svm_y_pred_svd))
print("Classification Report:")
print(classification_report(y_test, svm_y_pred_svd))

Support Vector Machine Classifier with SVD
Accuracy: 0.8631448872412728
Classification Report:
                 precision    recall  f1-score   support

            acq       0.88      0.97      0.92       711
           alum       0.31      0.29      0.30        14
         barley       0.58      0.70      0.64        10
            bop       0.66      0.54      0.59        35
        carcass       0.60      0.33      0.43        18
     castor-oil       0.00      0.00      0.00         2
          cocoa       0.92      1.00      0.96        24
    coconut-oil       0.00      0.00      0.00         1
         coffee       0.88      0.90      0.89        39
         copper       0.79      0.79      0.79        19
           corn       0.88      0.65      0.75        75
         cotton       1.00      0.50      0.67        12
            cpi       0.53      0.65      0.58        31
            cpu       0.00      0.00      0.00         2
          crude       0.78      0.92      0.84   



As well as training much faster, the classifier trained on the SVD-reduced data shows a marginal improvement in accuracy (about 0.863 compared with about 0.855).

To begin exploring this large and advanced topic further, see: https://scikit-learn.org/stable/modules/decomposition.html