# Text mining (NLP) with Python

**Author:** Ties de Kok ([Personal Website](http://www.tiesdekok.com))  
**Last updated:** 24 Oct 2017  
**Python version:** Python 3.5  
**License:** MIT License  

**Note:** Some features (like the ToC) will only work if you run it locally, use Binder, or use nbviewer by clicking this link: 

# *Introduction*

This notebook contains code examples to get you started with Natural Language Processing (NLP) / Text Mining for Research and Data Science purposes.  

In the grand scheme of things there are roughly 4 steps:  

1. Identify a data source  
2. Gather the data  
3. Process the data  
4. Analyze the data  

This notebook only discusses steps 3 and 4. If you want to learn more about step 2, see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch). 

# *Elements / topics that are discussed in this notebook:*


<img style="float: left" src="https://i.imgur.com/c3aCZLA.png" width="40%" /> 

# *Table of Contents*  <a id='toc'></a>

* [Primer on NLP tools](#tool_primer)     
* [Process + Clean text](#proc_clean)   
    * [Normalization](#normalization)
        * [Deal with unwanted characters](#unwanted_char)
        * [Sentence segmentation](#sentence_seg)   
        * [Word tokenization](#word_token)
        * [Lemmatization & Stemming](#lem_and_stem) 
    * [Language modeling](#lang_model) 
        * [Part-of-Speech tagging](#pos_tagging) 
        * [Uni-Gram & N-Grams](#n_grams) 
        * [Stop words](#stop_words) 
* [Direct feature extraction](#feature_extract) 
    * [Feature search](#feature_search) 
        * [Entity recognition](#entity_recognition) 
        * [Pattern search](#pattern_search) 
    * [Text evaluation](#text_eval) 
        * [Language](#language) 
        * [Dictionary counting](#dict_counting) 
        * [Readability](#readability) 
* [Represent text numerically](#text_numerical) 
    * [Bag of Words](#bows) 
        * [TF-IDF](#tfidf) 
    * [Word Embeddings](#word_embed) 
        * [Word2Vec](#Word2Vec) 
        * [Doc2Vec](#Doc2Vec) 
        * [FastText](#FastText) 
* [Statistical models](#stat_models) 
    * ["Traditional" machine learning](#trad_ml) 
        * [Supervised](#trad_ml_supervised) 
            * [Naïve Bayes](#trad_ml_supervised_nb) 
            * [Support Vector Machines (SVM)](#trad_ml_supervised_svm) 
        * [Unsupervised](#trad_ml_unsupervised) 
            * [Latent Dirichlet Allocation (LDA)](#trad_ml_unsupervised_lda) 
            * [pyLDAvis](#trad_ml_unsupervised_pyLDAvis) 
* [Model Selection and Evaluation](#trad_ml_eval) 
* [Neural Networks](#nn_ml)

# <span style="text-decoration: underline;">Primer on NLP tools</span><a id='tool_primer'></a> [(to top)](#toc)

There are many tools available for NLP purposes.  
The code examples below are based on what I personally like to use; they are not intended to be a comprehensive overview.  

Besides built-in Python functionality I will use / demonstrate the following packages:

**Standard NLP libraries**:
1. `Spacy` and the higher-level wrapper `Textacy` 
2. `NLTK` and the higher-level wrapper `TextBlob`

*Note: besides installing the above packages you often also have to download (model) data. Make sure to check the documentation!*

**Standard machine learning library**:

1. `scikit-learn`

**Topic modelling libraries**:

1. `Gensim` 
2. `FastText`

**Specific task libraries**:

There are many, just a couple of examples:

1. `pyLDAvis` for visualizing LDA
2. `langdetect` for detecting languages
3. `fuzzywuzzy` for fuzzy text matching
4. `textstat` to calculate readability statistics

**Neural Network (+ Deep Learning) libraries**:

1. `Tensorflow`  
2. `PyTorch`  
3. `Keras`

# <span style="text-decoration: underline;">Get some example data</span><a id='example_data'></a> [(to top)](#toc)

There are many example datasets available to play around with, see for example this great repository:  
https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table

The data that I will use for most of the examples is the "Reuter_50_50 Data Set" that is used for author identification experiments. 

See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50  

### Download and load the data

Can't follow what I am doing here? Please see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch).

In [1]:
import requests, zipfile, io, os

*Download and extract the zip file with the data*

In [2]:
if not os.path.exists('C50test'):
    r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip")
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

*Load the data into memory*

In [3]:
folder_dict = {'test' : 'C50test', 'train' : 'C50train'}
text_dict = {'test' : {}, 'train' : {}}

In [4]:
for label, folder in folder_dict.items():
    authors = os.listdir(folder)
    for author in authors:
        text_files = os.listdir(os.path.join(folder, author))
        for file in text_files:
            with open(os.path.join(folder, author, file), 'r') as text_file:
                text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))

*Note: the text comes pre-split per sentence; for the sake of example I undo this by doing `' '.join(text_file.readlines())`*
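The loading loop above can also be wrapped in a reusable function. The following is a minimal sketch rather than the code used in the rest of this notebook: the function name `load_corpus` and the explicit `utf-8` encoding with `errors='replace'` are assumptions (the loop above relies on the platform default encoding).

```python
from pathlib import Path


def load_corpus(folder):
    """Return {author: [text, ...]} for a folder with one sub-folder per author."""
    corpus = {}
    for author_dir in sorted(Path(folder).iterdir()):
        if not author_dir.is_dir():
            continue
        for text_path in sorted(author_dir.iterdir()):
            if not text_path.is_file():
                continue
            # Assumption: files are UTF-8; undecodable bytes are replaced, not raised
            with text_path.open('r', encoding='utf-8', errors='replace') as text_file:
                # Same trick as above: undo the one-sentence-per-line layout
                text = ' '.join(text_file.readlines())
            corpus.setdefault(author_dir.name, []).append(text)
    return corpus
```

With the folders from above this could be used as `text_dict = {label: load_corpus(folder) for label, folder in folder_dict.items()}`.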

In [5]:
text_dict['test']['TimFarrand'][0]

'Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997. The shares fell 6p to 781p on the news.\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  \n Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment but until we get a decision they will be held back."\n Whatever the MMC decides many analysts expect Lang to defer a decision until after the next general election which will be called by May 22.\n "They will probably try to defer the decision until after the election. I don\'t think they want the negative PR of having a large number of people fired," said Wakley.  \n If the deal does not go throu

# <span style="text-decoration: underline;">Process + Clean text</span><a id='proc_clean'></a> [(to top)](#toc)

## Convert the text into a NLP representation

We can use the text directly, but if we want to use packages like `spacy` and `nltk` / `textblob` we first have to convert the text into a corresponding object.  

### Spacy

In [6]:
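# Note: `spacy.en` is the spaCy 1.x import path; spaCy 2.0 and later load a model with `spacy.load()` instead
from spacy.en import English
parser = English()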

Convert all text in the "test" sample to a `spacy` `doc` object using `parser()`:

In [7]:
spacy_text = {}
for author, text_list in text_dict['test'].items():
    spacy_text[author] = [parser(text) for text in text_list]

In [8]:
type(spacy_text['TimFarrand'][0])

spacy.tokens.doc.Doc

### NLTK

In [9]:
import nltk

We can apply basic `nltk` operations directly to the text so we don't need to convert first. 

### TextBlob

In [10]:
from textblob import TextBlob

Convert all text in the "test" sample to a `TextBlob` object using `TextBlob()`:

In [11]:
textblob_text = {}
for author, text_list in text_dict['test'].items():
    textblob_text[author] = [TextBlob(text) for text in text_list]

In [12]:
type(textblob_text['TimFarrand'][0])

textblob.blob.TextBlob

## <span style="text-decoration: underline;">Normalization</span><a id='normalization'></a> [(to top)](#toc)

**Text normalization** is the task of transforming the text into a different form.  

This can imply many things; I will show a couple of them below:

### <span style="text-decoration: underline;">Deal with unwanted characters</span><a id='unwanted_char'></a> [(to top)](#toc)

You will often notice that there are characters that you don't want in your text.  

Let's look at this sentence for example:

> "Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

You will notice that there are some `\` and `\n` in there. These are used to define how a string should be displayed. If we print this text we get:  

In [13]:
text_dict['test']['TimFarrand'][0][:298]

"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

In [14]:
print(text_dict['test']['TimFarrand'][0][:298])

Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers


If we want to analyze text we often don't care about the visual representation. These characters might actually cause problems!  

**So how do we remove them?**

In many cases it is sufficient to simply use the `.replace()` function:

In [15]:
text_dict['test']['TimFarrand'][0][:298].replace('\n', '').replace('\\', '')

"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts. Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

Sometimes, however, the problem arises because of encoding / decoding issues.  

In those cases you can usually do something like:  

In [16]:
problem_sentence = 'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))

b"This is some  text that has to be cleaned! it's annoying!"
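Dropping every non-ASCII character also throws away accented letters. A possible alternative, sketched here under the assumption that plain ASCII output is what you want, is to first decompose characters with `unicodedata.normalize` and then collapse the leftover whitespace. The function name `clean_text` is made up for this example.

```python
import re
import unicodedata


def clean_text(text):
    # Decompose accented characters, e.g. "é" becomes "e" plus a combining accent
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop whatever still cannot be represented in ASCII (such as the accent itself)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Collapse newlines, tabs and repeated spaces into a single space
    return re.sub(r'\s+', ' ', ascii_only).strip()


print(clean_text('Caf\u00e9 sales rose.\n  Analysts\twere   surprised.'))
```

Characters without an ASCII decomposition (such as `π`) are still removed, just like in the example above.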


### <span style="text-decoration: underline;">Sentence segmentation</span><a id='sentence_seg'></a> [(to top)](#toc)

Sentence segmentation is the task of splitting a piece of text into sentences.  

You could do this by splitting on the `.` symbol, but dots are used in many other cases as well so it is not very robust:

In [22]:
text_dict['test']['TimFarrand'][0][:550].split('.')

["Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts",
 '\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997',
 ' The shares fell 6p to 781p on the news',
 '\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers',
 '  \n Dermott Carr, an analyst at Nikko said, "the mark']
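A slightly less naive rule, shown here only as a sketch using the built-in `re` library, is to split on `.`, `!` or `?` only when it is followed by whitespace and then an uppercase letter or an opening quote. The pattern and the name `split_sentences` are illustrative assumptions; abbreviations such as "Inc. Shares" would still be split incorrectly.

```python
import re

# Split after ., ! or ? only when whitespace and then an uppercase letter
# or an opening double quote follows
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"])')


def split_sentences(text):
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


sample = ('The shares fell 6p to 781p on the news.\n '
          '"The stock is probably dead in the water until March," said John Wakley. '
          'Profit rose 1.5 percent.')
for sentence in split_sentences(sample):
    print(sentence)
```

The decimal point in `1.5` is left alone because it is not followed by whitespace.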

It is better to use a more sophisticated implementation such as the one by `Spacy`:

In [18]:
example_paragraph = spacy_text['TimFarrand'][0]

In [28]:
sentence_list = [s for s in example_paragraph.sents]
sentence_list[:5]

[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ,
 "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  
  ,
 Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment but until we get a decision they will be held back."
  ]

Notice that the returned object is still a `spacy` object:

In [29]:
type(sentence_list[0])

spacy.tokens.span.Span

Apply to all texts (for use later on):

In [30]:
spacy_sentences = {}
for author, text_list in spacy_text.items():
    spacy_sentences[author] = [[s for s in text.sents] for text in text_list]

In [32]:
spacy_sentences['TimFarrand'][0][:3]

[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ]

### <span style="text-decoration: underline;">Word tokenization</span><a id='word_token'></a> [(to top)](#toc)

Word tokenization is the task of splitting a sentence (or text) into words.

In [34]:
example_sentence = spacy_sentences['TimFarrand'][0][0]
example_sentence

Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
 

A word is called a `token` in this context (hence `tokenization`). Using `spacy`:

In [73]:
token_list = [token for token in example_sentence]
token_list[0:15]

[Shares,
 in,
 brewing,
 -,
 to,
 -,
 leisure,
 group,
 Bass,
 Plc,
 are,
 likely,
 to,
 be,
 held]
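If you only need a rough tokenization and don't want to load a language model, a regular expression gets you quite far. This is a minimal sketch (the name `simple_tokenize` is made up) and it is not how `spacy` tokenizes internally; for example, it splits `Britain's` into three tokens.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of letters / digits, [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize('Shares in brewing-to-leisure group Bass Plc'))
print(simple_tokenize("Britain's Trade"))
```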

### <span style="text-decoration: underline;">Lemmatization & Stemming</span><a id='lem_and_stem'></a> [(to top)](#toc)

In some cases you want to convert a word (i.e. token) into a more general representation.  

For example: convert "car", "cars", "car's", "cars'" all into the word `car`.

This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal).  

**Spacy**

Spacy offers built-in functionality for lemmatization:

In [74]:
lemmatized = [token.lemma_ for token in example_sentence]
lemmatized[0:15]

['share',
 'in',
 'brewing',
 '-',
 'to',
 '-',
 'leisure',
 'group',
 'bass',
 'plc',
 'be',
 'likely',
 'to',
 'be',
 'hold']

**NLTK**

Using the NLTK library we can also use the more aggressive Porter Stemmer:

In [49]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [75]:
stemmed = [stemmer.stem(token.text) for token in example_sentence]
stemmed[0:15]

['share',
 'in',
 'brew',
 '-',
 'to',
 '-',
 'leisur',
 'group',
 'bass',
 'plc',
 'are',
 'like',
 'to',
 'be',
 'held']

**Compare**:

In [79]:
for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):
    print(original, ' | ', lemma, ' | ', stem)

Shares  |  share  |  share
in  |  in  |  in
brewing  |  brewing  |  brew
-  |  -  |  -
to  |  to  |  to
-  |  -  |  -
leisure  |  leisure  |  leisur
group  |  group  |  group
Bass  |  bass  |  bass
Plc  |  plc  |  plc
are  |  be  |  are
likely  |  likely  |  like
to  |  to  |  to
be  |  be  |  be
held  |  hold  |  held


As you can see, the Porter Stemmer is more aggressive than the lemmatizer. In my experience it is usually best to use lemmatization instead of a stemmer. 
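To get a feel for why a stemmer can produce non-words, here is a deliberately naive suffix stripper. It is **not** the Porter algorithm (which uses many more rules and conditions); the suffix list and the name `naive_stem` are assumptions made purely for illustration.

```python
SUFFIXES = ('ing', 'ly', 'ed', 's')


def naive_stem(word, min_stem_length=3):
    word = word.lower()
    # Leave words such as "bass" alone instead of stripping the final "s"
    if word.endswith('ss'):
        return word
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        # Only strip the suffix if a reasonably long stem remains
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for word in ['Shares', 'brewing', 'likely', 'proposed', 'held', 'Bass']:
    print(word, ' | ', naive_stem(word))
```

A rule-based stemmer only looks at the characters of a word, so it cannot know that "held" belongs to "hold"; a lemmatizer can, because it relies on a vocabulary.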

## <span style="text-decoration: underline;">Language modeling</span><a id='lang_model'></a> [(to top)](#toc)

Text is inherently structured in complex ways, and we can often use some of this underlying structure. 

### <span style="text-decoration: underline;">Part-of-Speech tagging</span><a id='pos_tagging'></a> [(to top)](#toc)

Part of speech tagging refers to the identification of words as nouns, verbs, adjectives, etc. 

Using `Spacy`:

In [83]:
pos_list = [(token, token.pos_) for token in example_sentence]
pos_list[0:10]

[(Shares, 'NOUN'),
 (in, 'ADP'),
 (brewing, 'NOUN'),
 (-, 'PUNCT'),
 (to, 'ADP'),
 (-, 'PUNCT'),
 (leisure, 'NOUN'),
 (group, 'NOUN'),
 (Bass, 'PROPN'),
 (Plc, 'PROPN')]

### <span style="text-decoration: underline;">Uni-Gram & N-Grams</span><a id='n_grams'></a> [(to top)](#toc)

A sentence is not a random collection of words: the sequence of words has information value.  

A simple way to incorporate some of this sequence is by using what are called `n-grams`.  
An `n-gram` is nothing more than a combination of `N` words into one "word" (a uni-gram is just one word).  

So we can convert `"Sentence about flying cars"` into a list of bigrams:

> Sentence-about, about-flying, flying-cars

Using `NLTK`:

In [90]:
bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]
bigram_list[10:15]

['are-likely', 'likely-to', 'to-be', 'be-held', 'held-back']
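The same idea works for any `N` and without `NLTK`. The sketch below uses only built-in Python; the helper name `ngrams` is chosen for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, which yields exactly the complete n-grams
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = 'Sentence about flying cars'.split()
print(['-'.join(gram) for gram in ngrams(tokens, 2)])
print(['-'.join(gram) for gram in ngrams(tokens, 3)])
```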

### <span style="text-decoration: underline;">Stop words</span><a id='stop_words'></a> [(to top)](#toc)

Depending on what you are trying to do, it is possible that there are many words that don't add any information value to the sentence.  

The primary example is stop words.  

Sometimes you can improve the accuracy of your model by removing stop words.

Using `Spacy`:

In [95]:
no_stop_words = [token for token in example_sentence if not token.is_stop]

In [100]:
no_stop_words[:10]

[Shares, brewing, -, -, leisure, group, Bass, Plc, likely, held]

In [99]:
token_list[:10]

[Shares, in, brewing, -, to, -, leisure, group, Bass, Plc]
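Under the hood, stop word removal is just a membership test against a list of words. A minimal sketch with a hypothetical, deliberately tiny stop word list (real lists, like the one that ships with `spacy`, contain a few hundred words):

```python
# Hypothetical mini stop word list, for illustration only
STOP_WORDS = {'a', 'an', 'and', 'are', 'be', 'in', 'of', 'the', 'to'}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "The" and "the" are treated alike
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ['Shares', 'in', 'brewing', '-', 'to', '-', 'leisure', 'group', 'Bass', 'Plc']
print(remove_stop_words(tokens))
```

Using your own set like this makes it easy to add domain-specific words that carry no information in your setting.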

## Use the awesome `textacy` package to save time

`Textacy` is a high-level wrapper built on `spaCy` and it has many cool features!

See their GitHub page for details: https://github.com/chartbeat-labs/textacy

In [102]:
import textacy

### We can almost completely wrap all of our pre-processing into one `textacy` command:

http://textacy.readthedocs.io/en/latest/api_reference.html#module-textacy.preprocess

In [104]:
example_text = text_dict['test']['TimFarrand'][0]

In [136]:
cleaned_text = textacy.preprocess_text(example_text, lowercase=True, fix_unicode=True, no_punct=True)

**Let's use the above + textacy to do some basic preprocessing**

1. Split into sentences
2. Apply lemmatizer
3. Clean up the sentence using `textacy`

Note: using `textacy` does slow down the processing, but it is usually worth the wait.

In [144]:
spacy_text_clean = {}
for author, text_list in text_dict['test'].items():
    lst = []
    for text in text_list:
        sentences = [' '.join([token.lemma_ for token in s]) for s in parser(text).sents]
        sentences_cleaned = [textacy.preprocess_text(sen, lowercase=True, fix_unicode=True, no_punct=True) for sen in sentences]
        lst.append([parser(sen) for sen in sentences_cleaned])
    spacy_text_clean[author] = lst

In [145]:
spacy_text_clean['TimFarrand'][0][:3]

[share in brewing to leisure group bass plc be likely to be hold back until britain s trade and industry secretary ian lang decide whether to allow pron propose merge with brewer carlsberg tetley say analyst,
 earlier lang announce the bass deal would be refer to the monoplies and mergers commission which be due to report before march 24 1997,
 the share fall 6p to 781p on the news]
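If the `textacy` step is too slow for your data, the lowercasing and punctuation removal can be approximated with the standard library. This is a rough sketch and not what `textacy` does internally: it handles ASCII punctuation only and does not fix unicode. The name `basic_preprocess` is made up.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans({char: ' ' for char in string.punctuation})


def basic_preprocess(text):
    # Lowercase, replace punctuation by spaces, then collapse repeated whitespace
    text = text.lower().translate(PUNCT_TO_SPACE)
    return re.sub(r'\s+', ' ', text).strip()


print(basic_preprocess("Britain's Trade and Industry secretary, said analysts.\n"))
```

Replacing punctuation by a space (rather than deleting it) is why `Britain's` ends up as `britain s`, which matches the output shown above.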

# <span style="text-decoration: underline;">Direct feature extraction</span><a id='feature_extract'></a> [(to top)](#toc)

We have now pre-processed our text into something that we can use.  

The easiest thing we can do now is what I label "direct feature extraction".

## <span style="text-decoration: underline;">Feature search</span><a id='feature_search'></a> [(to top)](#toc)

There are many features that you can extract from the text.

### <span style="text-decoration: underline;">Entity recognition</span><a id='entity_recognition'></a> [(to top)](#toc)

It is often useful / relevant to extract entities that are mentioned in a piece of text. 

Example using `spacy` on the text that we cleaned with `textacy`:

In [165]:
example_sentence = spacy_text_clean['TimFarrand'][0][8]
example_sentence

if the deal do not go through analyst calculate the maximum loss to bass of 60 million with most sum centre on the 30 40 million range

Extract the named entities (mostly numeric ones in this text):

In [202]:
for sen in spacy_text_clean['TimFarrand'][4]:
    print([(token.text, token.ent_type_) for token in sen if token.ent_type_ != ''])

[('a', 'PERCENT'), ('48', 'PERCENT'), ('percent', 'PERCENT'), ('1487', 'CARDINAL'), ('million', 'CARDINAL'), ('2464', 'CARDINAL'), ('million', 'CARDINAL'), ('november', 'DATE'), ('1995', 'DATE')]
[('the', 'DATE'), ('year', 'DATE')]
[('the', 'DATE'), ('current', 'DATE'), ('year', 'DATE')]
[('the', 'DATE'), ('year', 'DATE'), ('end', 'DATE'), ('27', 'CARDINAL'), ('1996', 'DATE'), ('312', 'CARDINAL'), ('million', 'CARDINAL'), ('239', 'CARDINAL'), ('million', 'CARDINAL')]
[('107', 'PERCENT'), ('percent', 'PERCENT'), ('1113', 'CARDINAL'), ('million', 'CARDINAL')]
[('471', 'MONEY'), ('million', 'MONEY'), ('the', 'DATE'), ('11', 'DATE'), ('month', 'DATE'), ('18', 'MONEY'), ('million', 'MONEY')]
[('145', 'CARDINAL'), ('150', 'CARDINAL'), ('million', 'CARDINAL')]
[('597p', 'CARDINAL'), ('pron', 'ORG')]
[('one', 'CARDINAL')]
[('1997', 'DATE')]
[]
[]
[('britain', 'GPE'), ('annual', 'DATE'), ('december', 'DATE')]
[('170', 'CARDINAL'), ('million', 'CARDINAL'), ('199798', 'DATE'), ('about', 'CARDINAL

### <span style="text-decoration: underline;">Pattern search</span><a id='pattern_search'></a> [(to top)](#toc)

Using the built-in `re` (regular expression) library you can pattern match anything you want.  

I will not go into details about regular expressions but see here for a tutorial:  
https://regexone.com/references/python  

In [206]:
import re

**Example:**  

If a sentence contains the word 'million', print it; otherwise skip it.

In [208]:
TERM = 'million'
for sen in spacy_text_clean['TimFarrand'][2]:
    # Print every sentence that contains the search term
    if re.search(TERM, sen.text):
        print(sen)

analyst forecast for pretax profit range from 218 to 232 million stg after restructure cost up from 206 million last time
a restructure cost of some 35 million be anticipate with the bulk of pron or about 25 million stem from the closure of pron small production plant in france
cadbury s us drink business should turn in about 112 million stg in trading profit against 59 million in the first half of 1995 due entirely to the contribution of dr pepper
campbell estimate uk beverage will contribute 47 million stg in operating profit down from 50 million last time
broadly analyst expect a pretty flat performance from the group s confectionery business with a consensus forecast of 110 million stg for operate profit
on average analyst calculate beverage will chip in trading profit of 150 million
after the sale of pron 51 percent stake in the coca cola amp schweppes beverages ccsb operation to coca cola enterprises in june for 620 million stg many analyst want to see a clear statement of strate
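Regular expressions can also extract the matched values instead of only testing for their presence. The sketch below pulls out the numbers that directly precede the word "million"; the pattern is an assumption tailored to the cleaned text above (no decimal points or thousands separators left).

```python
import re

# One or more digits, whitespace, then the whole word "million"
AMOUNT_PATTERN = re.compile(r'\b(\d+)\s+million\b')

sentence = ('analyst forecast for pretax profit range from 218 to 232 million stg '
            'after restructure cost up from 206 million last time')
amounts = [int(value) for value in AMOUNT_PATTERN.findall(sentence)]
print(amounts)
```

Note that `218` is not captured because it is not directly followed by "million"; ranges like "218 to 232 million" would need their own pattern.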

## <span style="text-decoration: underline;">Text evaluation</span><a id='text_eval'></a> [(to top)](#toc)

Besides searching for features there are also many ways to analyze the text as a whole.  

Let's, for example, evaluate the following paragraph:

In [215]:
example_paragraph = ' '.join([x.text for x in spacy_text_clean['TimFarrand'][2]])
example_paragraph[:500]

'soft drink and confectionery group cadbury schweppes plc expect to report a solid eight percent rise in first half profit on wednesday face question about the performance of pron 7up soft drink in the us one of the main question will be the success or otherwise of the relaunch of the 7up brand say mark duffy food manufacturing analyst at sbc warburg competitor sprite own by coca cola have see an agressive marketing push and be rank the fast grow brand in the us after cadbury s dr pepper analyst '

### <span style="text-decoration: underline;">Language</span><a id='language'></a> [(to top)](#toc)

Using the `langdetect` package it is easy to detect the language of a piece of text:

In [216]:
from langdetect import detect

In [217]:
detect(example_paragraph)

'en'

### <span style="text-decoration: underline;">Readability</span><a id='readability'></a> [(to top)](#toc)

Using the `textstat` package we can compute various readability metrics:

https://github.com/shivam5992/textstat

In [218]:
from textstat.textstat import textstat

In [221]:
print(textstat.flesch_reading_ease(example_paragraph))
print(textstat.smog_index(example_paragraph))
print(textstat.flesch_kincaid_grade(example_paragraph))
print(textstat.coleman_liau_index(example_paragraph))
print(textstat.automated_readability_index(example_paragraph))
print(textstat.dale_chall_readability_score(example_paragraph))
print(textstat.difficult_words(example_paragraph))
print(textstat.linsear_write_formula(example_paragraph))
print(textstat.gunning_fog(example_paragraph))
print(textstat.text_standard(example_paragraph))

-610.94
0
269.6
10.42
344.4
40.86
133
54.0
285.3213352685051
269th and 270th grade
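Most of these metrics depend on the number of words per sentence. The example paragraph has had its punctuation removed, so it looks like one extremely long sentence, which explains the extreme scores above. To illustrate, here is the standard Flesch Reading Ease formula computed from raw counts. The counts are hypothetical, and `textstat` uses its own heuristics for counting sentences and syllables, so this sketch will not reproduce its numbers exactly.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease from raw counts (higher means easier to read)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts: the same 600 words and 900 syllables ...
with_punctuation = flesch_reading_ease(600, 30, 900)    # ... split over 30 sentences
without_punctuation = flesch_reading_ease(600, 1, 900)  # ... seen as a single sentence
print(with_punctuation, without_punctuation)
```

In practice it is therefore better to compute readability metrics on the raw text, before punctuation is stripped.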


### <span style="text-decoration: underline;">Dictionary counting</span><a id='dict_counting'></a> [(to top)](#toc)

One of the most common techniques that researchers currently use (at least in Accounting research) is to compute simple metrics based on counting words in a dictionary.  
This technique is, for example, very prevalent in sentiment analysis (counting positive and negative words).  

In essence this technique is very simple to program:

In [229]:
word_dictionary = ['soft', 'first', 'most', 'be']

In [230]:
for word in word_dictionary:
    print(word, example_paragraph.count(word))

soft 3
first 3
most 1
be 25
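One caveat: `str.count()` counts substrings, so `be` is also counted inside words such as `beverage`. If you want whole-word counts, a possible approach (a sketch, with the made-up helper name `count_dictionary_words`) is to tokenize first and then count the tokens:

```python
import re
from collections import Counter


def count_dictionary_words(text, dictionary):
    # Split into lowercase word tokens so that only whole words are counted
    counts = Counter(re.findall(r'\w+', text.lower()))
    return {word: counts[word] for word in dictionary}


sample = 'soft drink sales will be strong and beverage profit will be solid'
print(count_dictionary_words(sample, ['soft', 'be']))
print(sample.count('be'))  # substring count, also matches "beverage"
```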


Getting the total number of words is also easy:

In [236]:
len(parser(example_paragraph))

689

# <span style="text-decoration: underline;">Represent text numerically</span><a id='text_numerical'></a> [(to top)](#toc)

Blabla

## <span style="text-decoration: underline;">Bag of Words</span><a id='bows'></a> [(to top)](#toc)

Blabla

### <span style="text-decoration: underline;">TF-IDF</span><a id='tfidf'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">Word Embeddings</span><a id='word_embed'></a> [(to top)](#toc)

Blabla

### <span style="text-decoration: underline;">Word2Vec</span><a id='Word2Vec'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Doc2Vec</span><a id='Doc2Vec'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">FastText</span><a id='FastText'></a> [(to top)](#toc)

# <span style="text-decoration: underline;">Statistical models</span><a id='stat_models'></a> [(to top)](#toc)

Blabla

## <span style="text-decoration: underline;">"Traditional" machine learning</span><a id='trad_ml'></a> [(to top)](#toc)

Blabla

## <span>Supervised</span><a id='trad_ml_supervised'></a> [(to top)](#toc)

Blabla

### <span>Naïve Bayes</span><a id='trad_ml_supervised_nb'></a> [(to top)](#toc)

### <span>Support Vector Machines (SVM)</span><a id='trad_ml_supervised_svm'></a> [(to top)](#toc)

## <span>Unsupervised</span><a id='trad_ml_unsupervised'></a> [(to top)](#toc)

Blabla

### <span>Latent Dirichlet Allocation (LDA)</span><a id='trad_ml_unsupervised_lda'></a> [(to top)](#toc)

### <span>pyLDAvis</span><a id='trad_ml_unsupervised_pyLDAvis'></a> [(to top)](#toc)

## <span>Model Selection and Evaluation</span><a id='trad_ml_eval'></a> [(to top)](#toc)

Blabla

## <span style="text-decoration: underline;">Neural Networks</span><a id='nn_ml'></a> [(to top)](#toc)

Refer to the `tensorflow` and `pytorch` notebooks.