In [8]:
import nltk
from nltk.corpus import reuters

The ```raw``` method returns a string of raw text from an NLTK corpus.  Let's get the raw text for the first article in the Reuters corpus.

In [9]:
rawtext = reuters.raw('test/14826')
print(rawtext)

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on April 17, in
  retaliation for Japan's alleged failure to stick to a pact not
  to sell semiconductors on world markets at below cost.
      Unofficial Japanese estimates put the impact of the tariffs
  at 10 billion dlrs and spokesmen for major electronics firms
  said they would virtually halt exports

The ```words``` method does tokenization for an NLTK corpus.

In [10]:
words_simplistic = reuters.words('test/14826')
len(words_simplistic)

899

However, ```words``` uses ```WordPunctTokenizer```, which is a simplistic tokenizer that splits on punctuation.  Notice below how it breaks ```U.S.``` and ```Asia's``` into several pieces.

In [11]:
print(words_simplistic[:100])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.', 'They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.', 'But', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', '-']
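To see where the split tokens above come from, here is a minimal sketch that imitates punctuation-based splitting with a plain regular expression. The pattern is assumed to mirror what ```WordPunctTokenizer``` does (runs of word characters, or runs of non-space punctuation); the sample string and variable names are made up for illustration.

```python
import re

# Assumed pattern: a token is either a run of word characters
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

sample = "U.S.-JAPAN RIFT hurts Asia's far-reaching trade"
tokens = re.findall(WORD_PUNCT_PATTERN, sample)
print(tokens)
```

Every period, apostrophe and hyphen becomes a token boundary, which is why abbreviations and possessives fall apart.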


It is preferable to use the ```word_tokenize``` function, which uses a more sophisticated Treebank-style tokenizer that keeps tokens such as ```U.S.``` and ```far-reaching``` intact.

In [12]:
words = nltk.word_tokenize(rawtext)
len(words)

816

In [13]:
print(words[:100])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U.S.-JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U.S.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'s", 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far-reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.', 'They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U.S.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U.S.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.', 'But', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long-run', ',', 'in', 'the', 'short-term', 'Tokyo', "'s", 'loss', 'might', 'be', 'their', 'gain', '.', 'The', 'U.S.', 'Has', 'said', 'it']


Now let's build a dictionary of words to record the count of each word that is used in this article.  Python dictionaries are (behind the scenes) implemented using hash tables.

In [14]:
wordFreqDict = {}
type(wordFreqDict)

dict

In [15]:
for word in words:
    if word in wordFreqDict:
        wordFreqDict[word] += 1
    else:
        wordFreqDict[word] = 1

In [16]:
wordFreqDict['Japan']

12

In [17]:
len(wordFreqDict)

387

A collection of objects and their associated counts is called a ***multiset*** or a ***bag***.  When applied to the count of words in a document, this is called the ***bag of words*** model for a text document.  The bag of words ignores the order in which the words appear and keeps only their frequencies.  NLP tasks such as document retrieval and classification are commonly implemented using a bag of words model.
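As a small illustration of what "ignoring order" means, the sketch below builds two made-up token lists that contain the same words in a different order. The sentences and variable names are invented for this example.

```python
from collections import Counter

# Two toy documents with the same words in a different order.
doc_a = "japan exports more than china exports".split()
doc_b = "china exports more than japan exports".split()

bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(doc_a == doc_b)  # compares the token sequences, order included
print(bag_a == bag_b)  # compares only the counts of each word
```

The two sentences say opposite things, yet their bags are identical. That is the price the model pays for its simplicity.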

What happens if we try to query a term that was not in the article?

In [18]:
wordFreqDict['volcano']

KeyError: 'volcano'

There is a subclass of the ```dict``` type called ```Counter``` that more naturally handles multisets.

In [19]:
from collections import Counter
issubclass(Counter,dict)

True

In [20]:
wordFreqCounter = Counter(words)
wordFreqCounter['Japan']

12

In [21]:
wordFreqCounter['volcano']

0
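The following sketch, on a made-up token list, contrasts the two ways of handling a missing term and shows one more convenience of ```Counter```. The token list and variable names are assumptions for illustration only.

```python
from collections import Counter

tokens = ["trade", "japan", "trade", "tariffs", "japan", "trade"]
counts = Counter(tokens)

# A plain dict can avoid the KeyError by using .get with a default value.
plain = dict(counts)
missing_from_dict = plain.get("volcano", 0)

# A Counter returns 0 for unseen items and does not store them.
missing_from_counter = counts["volcano"]

# most_common(n) lists the n largest counts in descending order.
top_two = counts.most_common(2)
print(missing_from_dict, missing_from_counter, top_two)
```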

You might recall that NLTK has its own class called ```FreqDist``` that also implements term counts.

In [23]:
wordFreqNLTK = nltk.FreqDist(words)
type(wordFreqNLTK)

nltk.probability.FreqDist

In fact the ```nltk.probability.FreqDist``` type is a subclass of ```Counter```.

In [25]:
issubclass(nltk.probability.FreqDist, Counter)

True

In [26]:
wordFreqNLTK

FreqDist({'the': 32, '.': 31, 'of': 30, ',': 29, 'to': 26, 'said': 16, 'a': 14, 'trade': 13, 'U.S.': 13, 'in': 13, ...})

**Scikit-learn** has its own tools for extracting word frequencies from a corpus of texts. The tools are powerful but take some getting used to.<br>

```CountVectorizer``` acts on a corpus (a list of raw text documents), does its own tokenization, and then returns a matrix of word counts.  Each row represents a document and each column represents a word in the vocabulary of the corpus.<br>

However, since such a matrix is usually quite sparse, the matrix gets represented as a ***sparse matrix*** by storing a list of only the nonzero entries.<br>
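To make the idea of storing only the nonzero entries concrete, here is a minimal pure-Python sketch of a compressed-sparse-row layout. The helper ```to_csr``` and the toy matrix are hypothetical and written only for this illustration; it is not how Scikit-learn or SciPy implement the format internally.

```python
def to_csr(dense_rows):
    """Convert a list of equal-length rows to (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the nonzero count itself
                indices.append(col)  # the column it lives in
        indptr.append(len(data))     # offset where the next row starts
    return data, indices, indptr


dense = [
    [0, 2, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [3, 0, 0, 4, 0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

Row ```i``` owns the slice ```data[indptr[i]:indptr[i + 1]]```, so an all-zero row costs only one extra integer.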

In the following example, we'll just pass a list of only one document to ```CountVectorizer```, so we'll get back a sparse matrix with only one row.<br>

The link below is a good reference:<br>
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction


In [27]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
wordSKcount = count_vect.fit_transform([rawtext])
wordSKcount

<1x351 sparse matrix of type '<class 'numpy.int64'>'
	with 351 stored elements in Compressed Sparse Row format>

In [28]:
print(wordSKcount)

  (0, 97)	1
  (0, 93)	1
  (0, 334)	1
  (0, 332)	1
  (0, 189)	1
  (0, 90)	1
  (0, 194)	1
  (0, 151)	1
  (0, 161)	1
  (0, 175)	1
  (0, 280)	1
  (0, 190)	1
  (0, 255)	1
  (0, 80)	2
  (0, 240)	1
  (0, 251)	1
  (0, 117)	1
  (0, 32)	1
  (0, 203)	1
  (0, 348)	1
  (0, 235)	1
  (0, 81)	1
  (0, 291)	1
  (0, 187)	1
  (0, 96)	1
  :	:
  (0, 71)	1
  (0, 261)	1
  (0, 305)	5
  (0, 205)	1
  (0, 104)	1
  (0, 25)	1
  (0, 209)	30
  (0, 179)	1
  (0, 17)	2
  (0, 110)	1
  (0, 248)	1
  (0, 133)	3
  (0, 20)	16
  (0, 306)	37
  (0, 41)	4
  (0, 120)	1
  (0, 320)	15
  (0, 198)	1
  (0, 260)	1
  (0, 155)	13
  (0, 122)	3
  (0, 76)	2
  (0, 109)	1
  (0, 103)	3
  (0, 26)	2


The last entry means that the cell in row 0 (representing the first document) and column 26 (representing one of the words in the vocabulary) has a count of 2.  What word does column 26 represent?

In [29]:
count_vect.get_feature_names()[26]  # in scikit-learn >= 1.0, use get_feature_names_out() instead

'asian'

And how do we go from the vocabulary word to its column index?

In [30]:
print(count_vect.vocabulary_.get('asian'))  # Indexed by lower-case

26


In [31]:
print(count_vect.vocabulary_.get('u.s.')) # 'u.s.' is missing from the vocabulary

None
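Why is ```'u.s.'``` missing? A plausible explanation is the vectorizer's default tokenization. The sketch below assumes the default ```token_pattern``` is ```(?u)\b\w\w+\b``` (words of two or more word characters) applied after lower-casing, and reproduces it with the standard ```re``` module on a made-up sentence.

```python
import re

# Assumed default token pattern: two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample = "Asian exporters fear a U.S.-Japan rift"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample.lower())
print(tokens)
```

The single letters ```u``` and ```s``` (and the article ```a```) are too short to match, so they never reach the vocabulary.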


Let's print the entire dictionary of vocabulary words and their column indices.  (These are not word counts!)

In [32]:
print(count_vect.vocabulary_)



We can use Scikit-learn's ```CountVectorizer``` to process an entire list of documents.  Let's first build a list of raw text documents from the Reuters corpus.

In [33]:
rawtextList = [reuters.raw(fileid) for fileid in reuters.fileids()]  # avoid shadowing the built-in id()
len(rawtextList)

10788

Recall the first article was about U.S.-Japan trade frictions.  What is the next article about?

In [34]:
print(rawtextList[1])

CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STOCKS
  A survey of 19 provinces and seven cities
  showed vermin consume between seven and 12 pct of China's grain
  stocks, the China Daily said.
      It also said that each year 1.575 mln tonnes, or 25 pct, of
  China's fruit output are left to rot, and 2.1 mln tonnes, or up
  to 30 pct, of its vegetables. The paper blamed the waste on
  inadequate storage and bad preservation methods.
      It said the government had launched a national programme to
  reduce waste, calling for improved technology in storage and
  preservation, and greater production of additives. The paper
  gave no further details.
  




Now we apply ```CountVectorizer``` to the entire corpus of 10,788 documents (articles).

In [35]:
count_vect = CountVectorizer()
countMatrix = count_vect.fit_transform(rawtextList)
countMatrix

<10788x30916 sparse matrix of type '<class 'numpy.int64'>'
	with 785208 stored elements in Compressed Sparse Row format>

We have 10,788 rows, one per document, and 30,916 columns, one per word or token in the vocabulary, with a total of 785,208 nonzero entries in the frequency matrix.
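A quick back-of-the-envelope calculation shows how sparse this matrix is. The numbers are copied from the output above; the variable names are just for illustration.

```python
n_docs, n_terms, n_stored = 10788, 30916, 785208

n_cells = n_docs * n_terms             # size of the equivalent dense matrix
density = n_stored / n_cells           # fraction of cells that are nonzero
avg_terms_per_doc = n_stored / n_docs  # distinct vocabulary terms per article

print(f"{n_cells:,} cells in total")
print(f"{density:.3%} of cells are nonzero")
print(f"{avg_terms_per_doc:.1f} distinct terms per document on average")
```

Fewer than one cell in four hundred is nonzero, which is why the sparse representation is worthwhile.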

Let's import Scikit-learn's function for cosine similarity.

In [36]:
from sklearn.metrics.pairwise import cosine_similarity

We can find the cosine similarity between the first article on U.S.-Japan trade relations and the second article on rats eating Chinese grain and rotting fruit.  A cosine similarity of **1.0** would be as close as possible, i.e. the word count vectors would point in the same direction (even if of different magnitude), i.e. the articles would have the same relative frequency of words.  A cosine similarity of **0.0** would mean that the word count vectors are orthogonal, i.e. the articles would have no overlapping terms.  (The cosine *distance* is one minus the cosine similarity, so smaller distances mean closer documents.)

In [37]:
cosine_similarity(countMatrix[0],countMatrix[1])

array([[0.56441791]])
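For intuition about the number returned by ```cosine_similarity```, here is a minimal pure-Python sketch of the same quantity: the dot product divided by the product of the vector lengths. The function name ```cosine``` and the toy vectors are made up for this illustration.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


doc = [2, 1, 0, 0]       # toy word counts
doubled = [4, 2, 0, 0]   # same proportions, twice as long
disjoint = [0, 0, 3, 5]  # shares no terms with doc

print(cosine(doc, doubled))
print(cosine(doc, disjoint))
```

The first value is 1 up to floating-point rounding, and the second is exactly 0 because the two vectors have no nonzero column in common.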

Let's compute the cosine similarity of every article to the first article on U.S.-Japan trade relations.

In [38]:
cosine_similarity(countMatrix[0],countMatrix)

array([[1.        , 0.56441791, 0.68430926, ..., 0.01616928, 0.01901003,
        0.0155811 ]])

We extract the first row of the array and convert it to a list.

In [39]:
cosList = cosine_similarity(countMatrix[0],countMatrix)[0].tolist()

```enumerate``` is a useful object for iterating over lists and their indices.

In [50]:
L = ["apple","banana","orange"]
list(enumerate(L))

[(0, 'apple'), (1, 'banana'), (2, 'orange')]

Now let's sort in descending order of cosine similarity and also print out the indices of the corresponding documents.

In [40]:
sorted(((e,i) for i,e in enumerate(cosList)), reverse=True)[:7]

[(1.0000000000000018, 0),
 (0.8441797447249986, 3605),
 (0.8255584044323317, 3311),
 (0.8226145598212536, 3781),
 (0.8225794792829114, 8967),
 (0.8189882935052504, 3244),
 (0.8179586571941766, 3464)]
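Sorting the whole list just to read off the top few entries is fine at this scale. As an alternative sketch, the standard library's ```heapq.nlargest``` picks the top items without a full sort. The scores below are made up, and note that this version yields ```(index, score)``` pairs rather than ```(score, index)```.

```python
import heapq

# Toy similarity scores; the position in the list plays the role of the document index.
scores = [1.0, 0.56, 0.68, 0.02, 0.84, 0.31]

# Keep the three (index, score) pairs with the largest scores.
top3 = heapq.nlargest(3, enumerate(scores), key=lambda pair: pair[1])
print(top3)
```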

This is the article with the highest cosine similarity to the first article (other than the first article itself).

In [41]:
print(rawtextList[3605])

ECONOMIC SPOTLIGHT - U.S. CONGRESS RAPS JAPAN
  The U.S. Congress is making Japan,
  with its enormous worldwide trade surplus, the symbol of the
  U.S. trade crisis and the focus of its efforts to turn around
  America's record trade deficit.
      "Japan has come to symbolize what we fear most in trade: the
  challenge to our high technology industries, the threat of
  government nutured competition, and the multitude of barriers
  to our exports," Senate Democratic Leader Robert Byrd said.
      "If we can find a way to come to terms with Japan over trade
  problems, we can manage our difficulties with other countries,"
  the West Virginia Democrat said at a Senate Finance Committee
  hearing on the trade bill.
      Byrd and House Speaker Jim Wright, a Texas Democrat, have
  made trade legislation a priority this year and a wide-ranging
  bill is being readied for probable House approval next month.
      Japan's bilateral trade surplus jumped from 12 billion dlrs
  in 1980 to 62 b

Let's repeat this exercise for the second article concerning Chinese grain.

In [42]:
cosList2 = cosine_similarity(countMatrix[1],countMatrix)[0].tolist()
sorted(((e,i) for i,e in enumerate(cosList2)), reverse=True)[:7]

[(1.0, 1),
 (0.6550537786773223, 8111),
 (0.6188024107937812, 10312),
 (0.6148202576210995, 6256),
 (0.6130022570999106, 4162),
 (0.6092796480128674, 1441),
 (0.6068330093175274, 219)]

In [43]:
print(rawtextList[8111])

CHINA TO IMPORT MORE GRAIN IN 1987
  China's grain imports will rise
  in 1987 because of a serious drought and increasing demand, but
  will be not be as large as in the past, Chinese officials and
  Japanese traders told Reuters.
      They said foreign exchange constraints and national policy
  would not allow a return to large-scale imports, which peaked
  at 16.15 mln tonnes in 1982.
      An agricultural official of the Shanghai government put
  maximum imports at about 10 mln tonnes this year, against 7.73
  mln in 1986 and 5.97 mln in 1985.
      Officials said grain imports rose in 1986 because of a poor
  harvest and rising domestic demand, but remained below exports,
  which rose to 9.42 mln tonnes from 9.33 mln in 1985.
      "China is short of foreign exchange," the Shanghai official
  said. "We cannot rely on imports, even at current low world
  prices. Only if there is a major disaster will we become a
  major importer."
      A Japanese trader in Peking said Chinese gra

One disadvantage of using word counts for assessing the similarity of documents is that it tends to overemphasize very frequent words like "the" and "of", and underemphasize less common words that may have more *semantic* (i.e. meaningful) content.  We can correct for this by multiplying each count by a weight that shrinks as the word appears in more documents, effectively reducing the weight of the most frequent terms.  The resulting statistic is called the ***term frequency - inverse document frequency***, or ***tf-idf***.<br>

Let $\operatorname{tf}(t,d)$ be the raw count of term $t$ in document $d$, and let $\operatorname{df}(t)$ be the number of documents that contain term $t$.  Then a typical choice (and ```sklearn```'s default) is to let the inverse document frequency be<br>
$$\operatorname{idf}(t) = \log \frac{N_d + 1}{\operatorname{df}(t) + 1} + 1$$
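where $N_d$ is the total number of documents and $\log$ is the natural logarithm.  The $+1$ terms inside the logarithm act as smoothing, as if one extra document contained every term, which prevents division by zero.  The final $+1$ ensures that a term appearing in every document is not weighted all the way down to zero.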
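Here is a minimal sketch that evaluates this smoothed formula by hand on a made-up three-document corpus. The documents and the helper ```idf``` are hypothetical and exist only to show the arithmetic.

```python
import math

# A toy corpus: each document is a list of tokens.
docs = [
    ["the", "asian", "exporters"],
    ["the", "grain", "stocks"],
    ["the", "grain", "imports"],
]
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((n_docs + 1) / (df + 1)) + 1


idf_the = idf("the")      # appears in all 3 documents
idf_grain = idf("grain")  # appears in 2 documents
idf_asian = idf("asian")  # appears in 1 document
print(idf_the, idf_grain, idf_asian)
```

The word that appears everywhere gets the minimum weight of 1, and the weight grows as the word becomes rarer.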

We can use ```TfidfVectorizer``` to create a feature matrix of *tf-idf* statistics.

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
tfidfMatrix = tfidf_vect.fit_transform(rawtextList)
tfidfMatrix

<10788x30916 sparse matrix of type '<class 'numpy.float64'>'
	with 785208 stored elements in Compressed Sparse Row format>

In [45]:
print(count_vect.vocabulary_.get('asian'))
print(count_vect.vocabulary_.get('the'))
print(tfidf_vect.vocabulary_.get('asian'))
print(tfidf_vect.vocabulary_.get('the'))


3384
27902
3384
27902


In the first article, the term *the* appears 18.5 times more often than the term *asian*.  But after applying the inverse document frequency weighting, the tf-idf statistic for *the* is only 4.2 times the statistic for *asian*, because *the* is such a common word throughout the entire corpus.

In [46]:
print(countMatrix[0,27902]/countMatrix[0,3384])
print(tfidfMatrix[0,27902]/tfidfMatrix[0,3384])

18.5
4.203716670286382
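The two ratios printed above can be tied together with a short calculation. This assumes that each tf-idf entry is the product $\operatorname{tf}(t,d)\cdot\operatorname{idf}(t)$ and that each row is then rescaled to unit length (```TfidfVectorizer```'s default L2 normalization). Within a single document $d$ the rescaling factor is the same for every term, so it cancels in the ratio:

$$\frac{\operatorname{tfidf}(\textit{the},d)}{\operatorname{tfidf}(\textit{asian},d)} = \frac{\operatorname{tf}(\textit{the},d)}{\operatorname{tf}(\textit{asian},d)} \cdot \frac{\operatorname{idf}(\textit{the})}{\operatorname{idf}(\textit{asian})} \quad\Longrightarrow\quad \frac{\operatorname{idf}(\textit{the})}{\operatorname{idf}(\textit{asian})} \approx \frac{4.204}{18.5} \approx 0.227$$

In other words, under these assumptions the inverse document frequency of *asian* is roughly 4.4 times that of *the*.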


Now we can compare the similarity ranking to the first article using raw term counts versus the tf-idf statistics.

In [47]:
sorted(((e,i) for i,e in enumerate(cosList)), reverse=True)[:7]  # This uses raw term counts.

[(1.0000000000000018, 0),
 (0.8441797447249986, 3605),
 (0.8255584044323317, 3311),
 (0.8226145598212536, 3781),
 (0.8225794792829114, 8967),
 (0.8189882935052504, 3244),
 (0.8179586571941766, 3464)]

In [48]:
cosList_tfidf = cosine_similarity(tfidfMatrix[0],tfidfMatrix)[0].tolist()
sorted(((e,i) for i,e in enumerate(cosList_tfidf)), reverse=True)[:7]

[(0.9999999999999998, 0),
 (0.5413540058297731, 3464),
 (0.5310550431375045, 3605),
 (0.5048206799420962, 8967),
 (0.4960254505076737, 3781),
 (0.49107228481411275, 1118),
 (0.4845059876298639, 4635)]