After reading this, you’ll know some basic techniques for extracting features from text, so you can use these features as input for machine learning models.

# What is NLP (Natural Language Processing)?
NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.
For example, we can use NLP to create systems like speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing and so on.
Nowadays, most of us have smartphones with speech recognition. These smartphones use NLP to understand what is said. Also, many people use laptops whose operating system has built-in speech recognition.

In [1]:
!pip install nltk



# Introduction to the NLTK library for Python
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to many corpora and lexical resources. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.
We’ll use this toolkit to show some basics of the natural language processing field. For the examples below, I’ll assume that we have imported the NLTK toolkit. We can do this with: import nltk.

# The Basics of NLP for Text
In this article, we’ll cover the following topics:
1. Sentence Tokenization
2. Word Tokenization
3. Text Lemmatization and Stemming
4. Stop Words
5. Regex
6. Bag-of-Words
7. TF-IDF

# 1. Sentence Tokenization
Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences. The idea here looks very simple. In English and some other languages, we can split the text into sentences whenever we see a sentence-ending punctuation mark such as a full stop.
However, even in English, this problem is not trivial, because the full stop character is also used for abbreviations. When processing plain text, tables of abbreviations that contain periods can help us prevent incorrect assignment of sentence boundaries. In many cases, we use libraries to do that job for us, so don’t worry too much about the details for now.
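
To make the abbreviation problem concrete, here is a minimal sketch that uses only Python's standard library. The sample text, the `naive_split` and `split_with_abbreviations` helpers, and the tiny abbreviation table are all illustrative assumptions; this is not how NLTK does it internally.

```python
import re

sample = "Dr. Smith moved to Washington. He works at Acme Inc. in the city."

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

# A tiny, hypothetical table of abbreviations that end with a period.
ABBREVIATIONS = {"Dr.", "Inc.", "Mr.", "Mrs."}

def split_with_abbreviations(text):
    """Like naive_split, but glue a piece back onto the previous one
    when the previous piece ended with a known abbreviation."""
    sentences = []
    for piece in naive_split(text):
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(naive_split(sample))               # four pieces, two of them wrong
print(split_with_abbreviations(sample))  # two sentences
```

Even this version is fragile: it would wrongly merge two sentences when an abbreviation really does end a sentence, which is one reason to rely on a library.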

## Example:
Let’s look at a piece of text about a famous board game called backgammon.
> Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.

In [2]:
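import nltk
nltk.download('punkt')  # tokenizer models used by sent_tokenize; recent NLTK releases may ask for 'punkt_tab' instead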
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.

It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.



# 2. Word Tokenization
Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages using some form of Latin alphabet, space is a good approximation of a word divider.
However, splitting only on spaces does not always give the results we want. Some English compound nouns are written in more than one way, and sometimes they contain a space. In most cases, we use a library to get the results we want, so again don’t worry too much about the details.
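
As a quick illustration of why splitting on spaces falls short, the sketch below compares `str.split` with a small regular expression. The sample sentence and the `TOKEN_PATTERN` name are assumptions made for this example, and the pattern is far simpler than what a real tokenizer does.

```python
import re

sentence = "It is a two player game, isn't it?"

# Splitting on whitespace leaves punctuation attached to the words.
print(sentence.split())

# A simple alternative: runs of word characters (optionally joined by an
# apostrophe or hyphen), or any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")
print(TOKEN_PATTERN.findall(sentence))
```

The first call keeps tokens such as "game," and "it?", while the second separates the punctuation marks into their own tokens.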
# Example:
Let’s use the sentences from the previous step and see how we can apply word tokenization on them. We can use the nltk.word_tokenize function.

In [3]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']

['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']



# 3. Text Lemmatization and Stemming
For grammatical reasons, documents can contain different forms of a word such as drive, drives, driving. Also, sometimes we have related words with a similar meaning, such as nation, national, nationality.
> The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

# Examples:
* am, are, is => be
* dog, dogs, dog’s, dogs’ => dog

The result of this mapping applied on a text will be something like this:

* the boy’s dogs are different sizes => the boy dog be differ size

Stemming and lemmatization are special cases of normalization. However, they are different from each other.

> Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The difference is that a stemmer operates without knowledge of the context, and therefore cannot tell apart words that have different meanings depending on their part of speech. However, stemmers also have some advantages: they are easier to implement and usually run faster. Also, the reduced “accuracy” may not matter for some applications.
# Examples
1. The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
2. The word “play” is the base form for the word “playing”, and hence this is matched in both stemming and lemmatization.
3. The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
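
The three cases above can be mimicked with a toy example. Everything here is hypothetical and deliberately tiny: `toy_stem` strips a few suffixes with no dictionary, and `toy_lemmatize` looks words up in a hand-written table keyed by part of speech. Real stemmers and lemmatizers are far more elaborate.

```python
def toy_stem(word):
    """Crude suffix stripping: no dictionary and no context."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("better", "adjective"): "good",
    ("playing", "verb"): "play",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_lemmatize(word, pos):
    """Dictionary look-up that falls back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("better", "adjective"), ("playing", "verb"),
                  ("meeting", "noun"), ("meeting", "verb")]:
    print(word, pos, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The stemmer returns the same result for "meeting" regardless of context, while the lookup distinguishes the noun from the verb.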

Now that we know the difference, let’s see some examples using NLTK.

In [4]:
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mahendra.chouhan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive



# 4. Stop Words
![img](https://miro.medium.com/max/513/1*kMf7dZW4jTyaq1hxjA0pgg.png)

Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.
Stop words usually refer to the most common words in a language, such as “and”, “the”, and “a”, but there is no single universal list of stop words. The list can change depending on your application.
NLTK has a predefined list of stop words that covers the most common words. If you are using it for the first time, you need to download the stop words using this code: nltk.download("stopwords"). Once the download completes, we can import the stopwords package from nltk.corpus and use it to load the stop words.


In [5]:
import nltk
nltk.download('stopwords')
  
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mahendra.chouhan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Let’s see how we can remove the stop words from a sentence.


In [6]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']


If you’re not familiar with [list comprehensions in Python](https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40), here is another way to achieve the same result.


In [7]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = []
for word in words:
    if word not in stop_words:
        without_stop_words.append(word)

print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']


However, keep in mind that list comprehensions are usually somewhat faster than the equivalent for loop, because the interpreter can build the list without looking up and calling the append method on every iteration.
You might wonder why we convert our list into a set. A set is a data type that stores unique values in no particular order. Checking whether a value is in a set takes roughly constant time on average, whereas a list has to be scanned element by element. For a small number of words, there is no big difference, but if you have a large number of words it’s highly recommended to use the set type.
If you want to learn more about the time complexity of the different operations on the different data structures, you can look at this awesome cheat sheet.
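
To get a feel for the difference, here is a rough timing sketch using the standard `timeit` module. The vocabulary of 10,000 made-up words and the `target` string are assumptions for illustration, and the exact numbers will vary from machine to machine.

```python
import timeit

# Hypothetical vocabulary of 10,000 distinct "words".
words_list = [f"word{i}" for i in range(10_000)]
words_set = set(words_list)

# A missing word is the worst case for a list: every element is checked.
target = "not-in-vocabulary"

list_time = timeit.timeit(lambda: target in words_list, number=1_000)
set_time = timeit.timeit(lambda: target in words_set, number=1_000)

print(f"list: {list_time:.4f} s, set: {set_time:.6f} s")
```

On a typical machine the set lookups finish orders of magnitude sooner, although both give the same answer.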

# 5. Regex
![img](https://miro.medium.com/max/658/1*l_EB11yQfbZsKLFr8ZckuQ.jpeg)
A regular expression, regex, or regexp is a sequence of characters that defines a search pattern. Let’s see some basics.
* . - match any character except a newline
* \w - match a word character (letter, digit, or underscore)
* \d - match a digit
* \s - match a whitespace character
* \W - match a non-word character
* \D - match a non-digit character
* \S - match a non-whitespace character
* [abc] - match any one of a, b, or c
* [^abc] - match any character except a, b, or c
* [a-g] - match any character from a to g
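
A few of these building blocks can be tried out with `re.findall`, which returns all matches of a pattern. The sample text is made up for this sketch, and the `+` quantifier (one or more repetitions), which is not in the list above, is used to match runs of characters.

```python
import re

text = "Order 66 shipped on 2019-05-04!"

print(re.findall(r"\d", text))       # every single digit
print(re.findall(r"\d+", text))      # runs of digits
print(re.findall(r"\w+", text))      # runs of word characters
print(re.findall(r"\s", text))       # whitespace characters
print(re.findall(r"[^\w\s]", text))  # neither word characters nor whitespace
print(re.findall(r"[a-g]", text))    # lowercase letters from a to g
```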

> Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually, patterns will be expressed in Python code using this raw string notation.

Source: https://docs.python.org/3/library/re.html?highlight=regex
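
The point about raw strings can be checked directly. In this small sketch the `path` variable is just an illustrative string that contains a single backslash.

```python
import re

print(len("\n"))   # a regular string: one newline character
print(len(r"\n"))  # a raw string: a backslash followed by 'n'

path = "C:\\temp"  # the string C:\temp, with one backslash

# Both patterns match a literal backslash; the raw string is easier to read.
print(re.findall("\\\\", path))
print(re.findall(r"\\", path))
```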

We can use regex to apply additional filtering to our text. For example, we can remove all the non-word characters. In many cases, we don’t need the punctuation marks and it’s easy to remove them with regex.
In Python, the re module provides regular expression matching operations similar to those in Perl. We can use the re.sub function to replace the matches for a pattern with a replacement string. Let’s see an example where we replace every non-word character with a space.

In [8]:

import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]"
print(re.sub(pattern, " ", sentence))


The development of snowboarding was inspired by skateboarding  sledding  surfing and skiing 


A regular expression is a powerful tool, and we can create much more complex patterns. If you want to learn more about regex, I recommend trying these two web apps: [regexr](https://regexr.com/), [regex101](https://regex101.com/).

# 6. Bag-of-Words
![img](https://miro.medium.com/max/320/1*RPezKXGUUwla-JP52OnZxA.png)
Machine learning algorithms cannot work with raw text directly, so we need to convert the text into vectors of numbers. This is called feature extraction.
The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.
To use this model, we need to:
1. Design a vocabulary of known words (also called tokens)
2. Choose a measure of the presence of known words

Any information about the order or structure of words is discarded. That’s why it’s called a bag of words. The model only records whether a known word occurs in a document, not where in the document it occurs.

The intuition is that similar documents have similar contents. Also, from the content alone, we can learn something about the meaning of the document.
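
The two steps listed above (design a vocabulary, then score each document) can be sketched in plain Python. The two toy documents and the `tokenize` helper are assumptions made for this illustration.

```python
import re

toy_documents = ["The cat sat on the mat.", "The dog sat."]

def tokenize(text):
    """Lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Step 1: design the vocabulary (sorted for a stable column order).
vocabulary = sorted({token for doc in toy_documents for token in tokenize(doc)})

# Step 2: score each document, here with 1 for presence and 0 for absence.
vectors = [[1 if word in tokenize(doc) else 0 for word in vocabulary]
           for doc in toy_documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector with one position per vocabulary word, and the word order of the original sentence is lost.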

# Example
Let’s go through the steps to create a bag-of-words model. In this example, we’ll use only four sentences to see how this model works. In real-world problems, you’ll work with much bigger amounts of data.
## 1. Load the data
![img](https://miro.medium.com/max/320/1*JTi6Bnodv2sui50F96v7-Q.png)
Let’s say that this is our data and we want to load it as an array.


I like this movie, it's funny.

I hate this movie.

This was awesome! I like it.

Nice one. I love it.


In [9]:
# with open("simple movie reviews.txt", "r") as file:
#     documents = file.read().splitlines()
documents = ["I like this movie, it's funny.",
             'I hate this movie.',
             'This was awesome! I like it.',
             'Nice one. I love it.']    
print(documents)

["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome! I like it.', 'Nice one. I love it.']


## 2. Design the Vocabulary

Let’s get all the unique words from the four loaded sentences ignoring the case, punctuation, and one-character tokens. These words will be our vocabulary (known words).
We can use the CountVectorizer class from the sklearn library to design our vocabulary. We’ll see how to use it after the next step.
## 3. Create the Document Vectors

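Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of words with 1 for present and 0 for absent.
Now, let’s see how we can create a bag-of-words model using the CountVectorizer class mentioned above. Note that, by default, CountVectorizer counts how many times each word occurs rather than only marking presence; since no word repeats within any of our four sentences, the result contains only 0s and 1s.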


In [10]:
# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 2. Design the Vocabulary
# The default token pattern removes tokens of a single character. That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

# Step 3. Create the Bag-of-Words Model
bag_of_words = count_vectorizer.fit_transform(documents)

# Show the Bag-of-Words Model as a pandas DataFrame
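# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
feature_names = count_vectorizer.get_feature_names_out()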
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)

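| | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |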


# Output
![img](https://miro.medium.com/max/566/1*f5e9vn4EZB8zNSLWO0dn-A.png)
Here are our sentences. Now we can see how the bag-of-words model works.
![img](https://miro.medium.com/max/453/1*LtMJ1qSiIuEzZDqB-RQbjw.png)
# Additional Notes on the Bag of Words Model
![img](https://miro.medium.com/max/320/1*JvmcnIYVAzxHYdrxtmMa3Q.png)
The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words (tokens) and how to score the presence of known words.
# Designing the Vocabulary
When the vocabulary size increases, the length of the document vectors also increases. In the example above, the length of the document vector is equal to the number of known words.

In some cases, we can have a huge amount of data, and in these cases the length of the vector that represents a document might be thousands or millions of elements. Furthermore, each document may contain only a few of the known words in the vocabulary.
Therefore, the vector representations will have a lot of zeros. Vectors that consist mostly of zeros are called sparse vectors. Stored as full-length arrays, they waste memory and computational resources on the zero entries.
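
One common way to cope with sparse vectors is to store only the non-zero entries. The sketch below uses a plain dictionary for that; the example vector and the `to_dense` helper are hypothetical.

```python
# A hypothetical document vector over a 10-word vocabulary.
dense_vector = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]

# Sparse form: keep only the positions that hold a non-zero score.
sparse_vector = {i: v for i, v in enumerate(dense_vector) if v != 0}

def to_dense(sparse, length):
    """Rebuild the full-length vector from its sparse form."""
    return [sparse.get(i, 0) for i in range(length)]

print(sparse_vector)
print(to_dense(sparse_vector, len(dense_vector)) == dense_vector)
```

The scikit-learn vectorizers return a SciPy sparse matrix for the same reason, which is why the examples in this article call toarray() before building a DataFrame.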

We can decrease the number of the known words when using a bag-of-words model to decrease the required memory and computational resources. We can use the text cleaning techniques we’ve already seen in this article before we create our bag-of-words model:

* Ignoring the case of the words
* Ignoring punctuation
* Removing the stop words from our documents
* Reducing the words to their base form (Text Lemmatization and Stemming)
* Fixing misspelled words

Another more complex way to create a vocabulary is to use grouped words. This changes the scope of the vocabulary and allows the bag-of-words model to get more details about the document. This approach is called n-grams.

An n-gram is a sequence of n items (words, letters, digits, etc.). In the context of text corpora, n-grams typically refer to a sequence of words. A unigram is one word, a bigram is a sequence of two words, a trigram is a sequence of three words, and so on. The “n” in “n-gram” refers to the number of grouped words. Only the n-grams that appear in the corpus are modeled, not all possible n-grams.

# Example
Let’s look at all the bigrams for the following sentence:

> The office building is open today

All the bigrams are:

* the office
* office building
* building is
* is open
* open today

A bag of bigrams can capture more context than a bag of single words, because some of the word order is preserved, at the cost of a larger vocabulary.
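
The list of bigrams above can be generated with a few lines of Python. The `ngrams` helper below is a hypothetical name for this sketch; it slides a window of n tokens over the sentence.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The office building is open today".lower().split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

With scikit-learn, a similar effect can be obtained through the ngram_range parameter of CountVectorizer, for example ngram_range=(2, 2) for bigrams only.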

# Scoring Words
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We saw one very simple approach: the binary approach (1 for presence, 0 for absence).
Some additional scoring methods are:
* Counts. Count the number of times each word appears in a document.
* Frequencies. Calculate how often each word appears in a document relative to the total number of words in that document.
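
Both scoring methods are easy to sketch with `collections.Counter`. The sample sentence is made up for this example.

```python
from collections import Counter

tokens = "the cat saw the other cat near the mat".split()

# Counts: how many times each word appears in the document.
counts = Counter(tokens)

# Frequencies: each count divided by the total number of words.
total = len(tokens)
frequencies = {word: count / total for word, count in counts.items()}

print(counts.most_common(2))
print(round(frequencies["the"], 3))
```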

# 7. TF-IDF
One problem with scoring word frequency is that the most frequent words in the document start to have the highest scores. These frequent words may not contain as much “informational gain” to the model compared with some rarer and domain-specific words. One approach to fix that problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF.

TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.
The TF-IDF scoring value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word.
Let’s see the formula used to calculate a TF-IDF score for a given term x within a document y.

![img](https://miro.medium.com/max/875/1*V9ac4hLVyms79jl65Ym_Bw.png)

Now, let’s split this formula a little bit and see how the different parts of the formula work.

## Term Frequency (TF): 
a scoring of the frequency of the word in the current document.
![img](https://miro.medium.com/max/579/1*V3qfsHl0t-bV5kA0mlnsjQ.png)
## Inverse Document Frequency (IDF): 
a scoring of how rare the word is across documents.
![img](https://miro.medium.com/max/556/1*wvPGL02y36QL7-tdG1BT1A.png)

Finally, we can use the previous formulas to calculate the TF-IDF score for a given term like this:
![img](https://miro.medium.com/max/368/1*D2UA6xj9KqcH6amzVj5Y5g.png)
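
As a rough sketch, the score can be computed by hand in plain Python. This assumes the common textbook form, term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents that contain the term; the formulas in the images may differ in details such as normalization. The `tokenize`, `tf`, `idf`, and `tf_idf` helpers are hypothetical names.

```python
import math
import re

docs = ["I like this movie, it's funny.",
        "I hate this movie.",
        "This was awesome! I like it.",
        "Nice one. I love it."]

def tokenize(text):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized_docs = [tokenize(doc) for doc in docs]

def tf(term, tokens):
    """Share of the document's tokens that are the given term."""
    return tokens.count(term) / len(tokens)

def idf(term, all_docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / df)

def tf_idf(term, tokens, all_docs):
    return tf(term, tokens) * idf(term, all_docs)

# "funny" occurs in one document, "this" in three of the four.
print(tf_idf("funny", tokenized_docs[0], tokenized_docs))
print(tf_idf("this", tokenized_docs[0], tokenized_docs))
```

Both words appear once in the first sentence, yet the rarer word "funny" gets the higher score.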


# Example
In Python, we can use the TfidfVectorizer class from the sklearn library to calculate the TF-IDF scores for given documents. Let’s use the same sentences that we have used with the bag-of-words example.



In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(documents)

# Show the Model as a pandas DataFrame
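# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
feature_names = tfidf_vectorizer.get_feature_names_out()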
pd.DataFrame(values.toarray(), columns = feature_names)

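| | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.571848 | 0.0 | 0.365003 | 0.450852 | 0.0 | 0.450852 | 0.0 | 0.0 | 0.365003 | 0.0 |
| 1 | 0.0 | 0.0 | 0.702035 | 0.0 | 0.0 | 0.0 | 0.553492 | 0.0 | 0.0 | 0.4481 | 0.0 |
| 2 | 0.539445 | 0.0 | 0.0 | 0.344321 | 0.425305 | 0.0 | 0.0 | 0.0 | 0.0 | 0.344321 | 0.539445 |
| 3 | 0.0 | 0.0 | 0.0 | 0.345783 | 0.0 | 0.541736 | 0.0 | 0.541736 | 0.541736 | 0.0 | 0.0 |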
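
The numbers in this table can be reproduced by hand, which helps to see what TfidfVectorizer does with its default settings. As far as the scikit-learn documentation describes it, the defaults use raw counts as term frequency, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scale each row to unit Euclidean length. The helper names in this sketch are hypothetical.

```python
import math
import re

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

def tokenize(text):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Count times IDF for every vocabulary word, scaled to unit length."""
    raw = [tokens.count(term) * smooth_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

first_row = dict(zip(vocabulary, tfidf_row(tokenized[0])))
print(round(first_row["funny"], 6), round(first_row["it"], 6))
```

The printed values should agree with the "funny" and "it" columns of row 0 in the table above.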


# Output
![img](https://miro.medium.com/max/875/1*dPXb0hL5GluQCf9jMY1YDQ.png)

Again, I’ll add the sentences here for an easy comparison and better understanding of how this approach is working.

![img](https://miro.medium.com/max/453/1*LtMJ1qSiIuEzZDqB-RQbjw.png)

# Summary
* NLP is used to apply machine learning algorithms to text and speech.
* NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.
* Sentence tokenization is the problem of dividing a string of written language into its component sentences.
* Word tokenization is the problem of dividing a string of written language into its component words.
* The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
* Stop words are words which are filtered out before or after processing of text. They usually refer to the most common words in a language.
* A regular expression is a sequence of characters that define a search pattern.
* The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.
* TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

