# <div align="center">Text Preprocessing with NLTK</div>

Import nltk package

In [1]:
import nltk

Download the NLTK corpora and models

In [2]:
# Uncomment and run this cell to download the NLTK data if you don't already have it
# nltk.download()

Let's import the libraries we will use for text processing

In [3]:
from nltk.tokenize import word_tokenize, sent_tokenize # splitting a string into substrings
from nltk.corpus import wordnet # for synonyms
from nltk.corpus import stopwords # for removing stop words
from nltk.stem import PorterStemmer # for stemming a word
from nltk import WordNetLemmatizer
from nltk import pos_tag
from nltk.chunk import ne_chunk
from nltk.tree import Tree
import string # to remove punctuation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
import pandas as pd

We will start with the following passage.

In [4]:
text="""Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, 
and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to 
process and analyze large amounts of natural language data. The history of natural language processing (NLP) generally started 
in the 1950s, although work can be found from earlier periods. In 1950's, Alan Turing published an article titled 
"Computing Machinery and Intelligence "which proposed what is now called the Turing test as a criterion of @intelligence goes"""
text

'Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, \nand artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to \nprocess and analyze large amounts of natural language data. The history of natural language processing (NLP) generally started \nin the 1950s, although work can be found from earlier periods. In 1950\'s, Alan Turing published an article titled \n"Computing Machinery and Intelligence "which proposed what is now called the Turing test as a criterion of @intelligence goes'

##### <div align="center">Counting the number of characters in a text</div>

In [5]:
len(text)

628

In [6]:
# Returns the characters at indices 0 to 9 (the end index 10 is excluded)
text[0:10]

'Natural la'

In [7]:
# Selects the character at index 9, i.e. the 10th character
text[9]

'a'

##### <div align="center">Removing Punctuation</div>

Print available punctuation recognized by Python

In [8]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Remove punctuation

In [9]:
"".join([t for t in text if t not in string.punctuation])

'Natural language processing NLP is a subfield of linguistics computer science information engineering \nand artificial intelligence concerned with the interactions between computers and human natural languages in particular how to program computers to \nprocess and analyze large amounts of natural language data The history of natural language processing NLP generally started \nin the 1950s although work can be found from earlier periods In 1950s Alan Turing published an article titled \nComputing Machinery and Intelligence which proposed what is now called the Turing test as a criterion of intelligence goes'
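
The same result can also be obtained with a translation table, which avoids testing each character in a Python loop. The sketch below is a minimal illustration on a hypothetical sample sentence (`sample` is not part of the notebook); it relies only on `str.maketrans` and `str.translate` from the standard library.

```python
import string

# Hypothetical sample sentence, not the notebook's `text` variable
sample = "Hello, world! (NLP) is fun: isn't it?"

# Translation table that deletes every character listed in string.punctuation
remove_punct = str.maketrans("", "", string.punctuation)

cleaned = sample.translate(remove_punct)
print(cleaned)
```

Note that the apostrophe is punctuation too, so contractions such as "isn't" lose it.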

Converting text to lower case

In [10]:
text=text.lower()
print(text)

natural language processing (nlp) is a subfield of linguistics, computer science, information engineering, 
and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to 
process and analyze large amounts of natural language data. the history of natural language processing (nlp) generally started 
in the 1950s, although work can be found from earlier periods. in 1950's, alan turing published an article titled 
"computing machinery and intelligence "which proposed what is now called the turing test as a criterion of @intelligence goes


##### <div align="center">Tokenization</div>
Tokenization is the process of splitting a text into its constituent substrings (tokens). We can split a text by sentence or by word.

Tokenize text into words.
The function signature is word_tokenize(text, language='english', preserve_line=False)

In [11]:
tokenized_text=word_tokenize(text,language='english', preserve_line=False)
print(tokenized_text) # prints every token; use tokenized_text[0:20] to show only the first 20

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'the', 'history', 'of', 'natural', 'language', 'processing', '(', 'nlp', ')', 'generally', 'started', 'in', 'the', '1950s', ',', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', '.', 'in', '1950', "'s", ',', 'alan', 'turing', 'published', 'an', 'article', 'titled', "''", 'computing', 'machinery', 'and', 'intelligence', '``', 'which', 'proposed', 'what', 'is', 'now', 'called', 'the', 'turing', 'test', 'as', 'a', 'criterion', 'of', '@', 'intelligence', 'goes']


Tokenize text into sentences.
The function signature is sent_tokenize(text, language='english')

In [12]:
sent_tokenize(text,language='english')

['natural language processing (nlp) is a subfield of linguistics, computer science, information engineering, \nand artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to \nprocess and analyze large amounts of natural language data.',
 'the history of natural language processing (nlp) generally started \nin the 1950s, although work can be found from earlier periods.',
 'in 1950\'s, alan turing published an article titled \n"computing machinery and intelligence "which proposed what is now called the turing test as a criterion of @intelligence goes']

Tokenize the words in each sentence

In [13]:
[word_tokenize(t)[0:5] for t in sent_tokenize(text,language='english')] #  display only the first 5 words in each sentence using [0:5]

[['natural', 'language', 'processing', '(', 'nlp'],
 ['the', 'history', 'of', 'natural', 'language'],
 ['in', '1950', "'s", ',', 'alan']]

##### <div align="center">Removing stop words</div>
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words.

Show the first 10 English stop words that ship with NLTK

In [14]:
stopwords.words("english")[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Remove stop words from the entire text

In [15]:
stop_words=set(stopwords.words('english')) # build the set once instead of reloading the list for every token
text=[w for w in tokenized_text if w not in stop_words]
print(text)

['natural', 'language', 'processing', '(', 'nlp', ')', 'subfield', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', '(', 'natural', ')', 'languages', ',', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', '.', 'history', 'natural', 'language', 'processing', '(', 'nlp', ')', 'generally', 'started', '1950s', ',', 'although', 'work', 'found', 'earlier', 'periods', '.', '1950', "'s", ',', 'alan', 'turing', 'published', 'article', 'titled', "''", 'computing', 'machinery', 'intelligence', '``', 'proposed', 'called', 'turing', 'test', 'criterion', '@', 'intelligence', 'goes']


##### <div align="center">Text Normalization</div>
##### 1. Stemming
Stemming is the process of reducing a word to its root/base form by stripping affixes, e.g. sleeping to sleep, eating to eat. The resulting stem is not always a dictionary word.

Stemming a single word

In [16]:
PorterStemmer().stem('Natural')

'natur'

In [17]:
print(PorterStemmer().stem('goes'))
print(PorterStemmer().stem('going'))
print(PorterStemmer().stem('go'))

goe
go
go


In [18]:
stemmed_text = [PorterStemmer().stem(word) for word in text]
print(stemmed_text)

['natur', 'languag', 'process', '(', 'nlp', ')', 'subfield', 'linguist', ',', 'comput', 'scienc', ',', 'inform', 'engin', ',', 'artifici', 'intellig', 'concern', 'interact', 'comput', 'human', '(', 'natur', ')', 'languag', ',', 'particular', 'program', 'comput', 'process', 'analyz', 'larg', 'amount', 'natur', 'languag', 'data', '.', 'histori', 'natur', 'languag', 'process', '(', 'nlp', ')', 'gener', 'start', '1950', ',', 'although', 'work', 'found', 'earlier', 'period', '.', '1950', "'s", ',', 'alan', 'ture', 'publish', 'articl', 'titl', "''", 'comput', 'machineri', 'intellig', '``', 'propos', 'call', 'ture', 'test', 'criterion', '@', 'intellig', 'goe']


##### 2. Lemmatization
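Lemmatization is the process of reducing a word to its base/root form while taking the morphological analysis of the word into account, unlike stemming, which simply cuts off characters at the start or end of the word. WordNetLemmatizer.lemmatize treats a word as a noun unless a pos argument is given, so verb forms such as 'going' are returned unchanged by default.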

Lemmatize a single word

In [19]:
WordNetLemmatizer().lemmatize('Natural')

'Natural'

In [20]:
print(WordNetLemmatizer().lemmatize('goes'))
print(WordNetLemmatizer().lemmatize('go'))
print(WordNetLemmatizer().lemmatize('going'))

go
go
going


Lemmatize the entire text

In [21]:
lemmatized_text=[WordNetLemmatizer().lemmatize(t) for t in text]
print(lemmatized_text)

['natural', 'language', 'processing', '(', 'nlp', ')', 'subfield', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'artificial', 'intelligence', 'concerned', 'interaction', 'computer', 'human', '(', 'natural', ')', 'language', ',', 'particular', 'program', 'computer', 'process', 'analyze', 'large', 'amount', 'natural', 'language', 'data', '.', 'history', 'natural', 'language', 'processing', '(', 'nlp', ')', 'generally', 'started', '1950s', ',', 'although', 'work', 'found', 'earlier', 'period', '.', '1950', "'s", ',', 'alan', 'turing', 'published', 'article', 'titled', "''", 'computing', 'machinery', 'intelligence', '``', 'proposed', 'called', 'turing', 'test', 'criterion', '@', 'intelligence', 'go']


##### <div align="center">Synonyms and Antonyms</div>

Synonyms<br>
A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in the same language; for example, shut is a synonym of close.

In [22]:
syn=wordnet.synsets('country')
syn # Returns the list of synsets (groups of synonymous words) for 'country'

[Synset('state.n.04'),
 Synset('country.n.02'),
 Synset('nation.n.02'),
 Synset('country.n.04'),
 Synset('area.n.01')]

Return the name of the first synset

In [23]:
syn[0].name()

'state.n.04'

Definition of the first synset

In [24]:
syn[0].definition()

'a politically organized body of people under a single government'

Examples of how the words in the synset are used in a sentence

In [25]:
syn[0].examples()

['the state has elected a new president',
 'African nations',
 "students who had come to the nation's capitol",
 "the country's largest manufacturer",
 'an industrialized land']

Get the synonym itself using the lemmas() method

In [26]:
syn[0].lemmas()[0].name()

'state'

Antonyms<br/>
An antonym is a word opposite in meaning to another (e.g. bad and good).

In [27]:
synonyms = [] 
antonyms = [] 
  
for syn in wordnet.synsets("happy"): 
    for l in syn.lemmas(): 
        synonyms.append(l.name()) 
        if l.antonyms(): 
            antonyms.append(l.antonyms()[0].name()) 
            
print('Synonyms for the word happy\n',synonyms,'\nAntonyms for the synonyms\n',antonyms)


Synonyms for the word happy
 ['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen'] 
Antonyms for the synonyms
 ['unhappy']


##### <div align="center">Part of Speech Tagging (POS Tagging)</div>
<br>POS tagging is the process of assigning a grammatical class (tag) to each word in a text.
Techniques for POS tagging:
1. Rule-based tagging. Uses hand-written rules, often to resolve ambiguous words that could take more than one tag.
2. Stochastic tagging. A probabilistic approach in which the tag is chosen from the probability of the word occurring with each word category.
3. Hidden Markov Model (HMM). Assigns tags to words by treating the tags as hidden states of the text.
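
To make the stochastic idea concrete, here is a minimal sketch of the simplest probabilistic tagger: each word receives the tag it was seen with most often in a tagged training set. The training sentences, the `most_frequent_tag` helper, and the fallback tag `NN` are all hypothetical choices made for illustration; this is not how `pos_tag` works internally.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data (Penn Treebank style tags)
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("bark", "NN"), ("is", "VBZ"), ("rough", "JJ")],
    [("dogs", "NNS"), ("bark", "VBP")],
    [("the", "DT"), ("bark", "NN"), ("peels", "VBZ")],
]

# Count how often each word appears with each tag
tag_counts = defaultdict(Counter)
for sentence in tagged_sentences:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Return the tag seen most often for `word`, or `default` if the word is unseen."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default

tagged = [(w, most_frequent_tag(w)) for w in ["the", "bark", "is", "loud"]]
print(tagged)
```

Here "bark" is tagged `NN` because it was seen twice as a noun and once as a verb, and the unseen word "loud" falls back to the default tag. A tagger like this ignores context, which is the gap that sequence models such as HMMs try to close.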

In [28]:
pos=pos_tag(tokenized_text)
pos[0:10]

[('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('(', '('),
 ('nlp', 'JJ'),
 (')', ')'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('subfield', 'NN'),
 ('of', 'IN')]

##### <div align="center">Named Entity Recognition (NER)</div>
<br>This is the process of identifying spans of text and classifying them into categories such as person, location, organization or quantity. There are various techniques for NER, such as chunking and the Stanford NER library.

NER using chunking technique

In [29]:
ner_text="Donald Trump is the president of the USA"

Getting NER Tree

In [30]:
ner_text=word_tokenize(ner_text)
ner_text=pos_tag(ner_text)
named_entities = ne_chunk(ner_text)
print(named_entities)

(S
  (PERSON Donald/NNP)
  (ORGANIZATION Trump/NNP)
  is/VBZ
  the/DT
  president/NN
  of/IN
  the/DT
  (ORGANIZATION USA/NNP))


Iterating through the NER Tree 

In [31]:
entities=[]
for ne in named_entities.subtrees():
    # To keep only one entity class (e.g. ORGANIZATION, PERSON, LOCATION), add a check such as:
    # if ne.label()=='ORGANIZATION':
    entities.append(ne)
entities

[Tree('S', [Tree('PERSON', [('Donald', 'NNP')]), Tree('ORGANIZATION', [('Trump', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('president', 'NN'), ('of', 'IN'), ('the', 'DT'), Tree('ORGANIZATION', [('USA', 'NNP')])]),
 Tree('PERSON', [('Donald', 'NNP')]),
 Tree('ORGANIZATION', [('Trump', 'NNP')]),
 Tree('ORGANIZATION', [('USA', 'NNP')])]

Flatten the tree into a list of (entity, label) pairs for easier manipulation

In [32]:
# Parse named entities from tree
ne = []
for subtree in named_entities:
    if isinstance(subtree, Tree):
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne.append((ne_string, ne_label))
            
ne

[('Donald', 'PERSON'), ('Trump', 'ORGANIZATION'), ('USA', 'ORGANIZATION')]

##### <div align='center'>Vectorization</div>
This is the process of encoding text data as numerical vectors that a computer model can understand and process.
Techniques for text vectorization include:
1. Bag of Words - Count Vectorization
2. Term Frequency - Inverse Document Frequency (TF-IDF)
3. Hashing Vectorization
4. Word Embedding - Word2Vec
5. Sentence Embedding - Sent2Vec
6. Document Embedding - Doc2Vec
7. Character Embedding - Char2Vec

--- 

##### <div align='center'>Document Term Matrix with Count Vectorization</div>

A document-term matrix is a mathematical matrix that describes the frequency of the terms that occur in a collection of documents. Each row represents a document and each column represents a term.<br><br>
For count vectorization, the result is a sparse matrix holding the number of occurrences of each word/term in each document.
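
To show what such a matrix contains, the sketch below builds one by hand with the standard library only. The two-sentence corpus `docs` and the `tokenize` helper are hypothetical; the regular expression keeps lowercase tokens of two or more word characters, which is meant to mimic the default tokenization of `CountVectorizer` rather than reproduce it exactly.

```python
import re
from collections import Counter

# Hypothetical two-document corpus
docs = ["The cat sat on the mat.", "The dog sat."]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of term frequencies per document
counts = [Counter(tokenize(d)) for d in docs]

# Columns: the sorted vocabulary; rows: one list of counts per document
vocabulary = sorted(set().union(*counts))
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
print(matrix)
```

A real vectorizer stores the rows in a sparse format, since most entries are zero once the vocabulary grows.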

In [33]:
corpus=['My name is Sammy.',
       'Donald is the President of USA.',
       'London is a city in England.' # no comma here, so Python joins this literal with the next one: the corpus has 3 documents
       'It\'s winter in London.']

vectorizer=CountVectorizer()
x=vectorizer.fit_transform(corpus)

In [34]:
# Get unique words in the corpus
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())

['city', 'donald', 'england', 'in', 'is', 'it', 'london', 'my', 'name', 'of', 'president', 'sammy', 'the', 'usa', 'winter']


In [35]:
# Convert the sparse document-term matrix to a dense array and print it
print(x.toarray())

[[0 0 0 0 1 0 0 1 1 0 0 1 0 0 0]
 [0 1 0 0 1 0 0 0 0 1 1 0 1 1 0]
 [1 0 1 2 1 1 2 0 0 0 0 0 0 0 1]]


In [36]:
# Save the document-term matrix to a pandas DataFrame
count_vectorizer_df=pd.DataFrame(x.toarray(),columns=vectorizer.get_feature_names())
count_vectorizer_df.head()

# The term "city" occurs once in the third document, while "in" and "london" each occur twice in it

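| | city | donald | england | in | is | it | london | my | name | of | president | sammy | the | usa | winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 2 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |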


In [37]:
# Fit the vectorizer on bigrams (pairs of adjacent words) by setting ngram_range=(2,2)
cv2=CountVectorizer(ngram_range=(2,2))
x2=cv2.fit_transform(corpus)
print(cv2.get_feature_names())
print(x2.toarray())

['city in', 'donald is', 'england it', 'in england', 'in london', 'is city', 'is sammy', 'is the', 'it winter', 'london is', 'my name', 'name is', 'of usa', 'president of', 'the president', 'winter in']
[[0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0]
 [0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0]
 [1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 1]]
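
A bigram is simply a pair of adjacent tokens, so the feature names above can be understood with a few lines of plain Python. The sketch below is an illustration on a hypothetical token list, not the vectorizer's own code.

```python
# Hypothetical, already lowercased and tokenized sentence
tokens = "london is a city in england".split()

# Pair each token with its right-hand neighbour
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
```

This yields 'is a' and 'a city', whereas the vectorizer output above contains 'is city': with its default settings `CountVectorizer` drops single-character tokens such as "a" before the n-grams are formed.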


##### <div align='center'>Document Term Matrix with Term Frequency Inverse Document Frequency (tf-idf)</div><br>
Term frequency–inverse document frequency (tf-idf) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

In [38]:
corpus=['My name is Sammy.',
       'Donald is the President of USA.',
       'London is a city in England.' # no comma here, so Python joins this literal with the next one: the corpus has 3 documents
       'It\'s winter in London.']

tf_idf_vectorizer=TfidfVectorizer() # Initialize tf-idf
x=tf_idf_vectorizer.fit_transform(corpus)
x

<3x15 sparse matrix of type '<class 'numpy.float64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [39]:
# Get unique words in the corpus
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
print(tf_idf_vectorizer.get_feature_names())

['city', 'donald', 'england', 'in', 'is', 'it', 'london', 'my', 'name', 'of', 'president', 'sammy', 'the', 'usa', 'winter']


In [40]:
# Convert the sparse matrix to a dense array. tf-idf returns a weighted value for each term, unlike CountVectorizer, which returns raw term counts
print(x.toarray())

[[0.         0.         0.         0.         0.32274454 0.
  0.         0.54645401 0.54645401 0.         0.         0.54645401
  0.         0.         0.        ]
 [0.         0.43238509 0.         0.         0.2553736  0.
  0.         0.         0.         0.43238509 0.43238509 0.
  0.43238509 0.43238509 0.        ]
 [0.28456871 0.         0.28456871 0.56913741 0.16807086 0.28456871
  0.56913741 0.         0.         0.         0.         0.
  0.         0.         0.28456871]]


In [41]:
# Create a DataFrame from the tf-idf document-term matrix
tf_idf_df=pd.DataFrame(x.toarray(),columns=tf_idf_vectorizer.get_feature_names())
tf_idf_df.head()

# Each term is assigned a weight. Terms that occur in many documents get lower weights (less important), while rare terms get higher weights (more important).

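| | city | donald | england | in | is | it | london | my | name | of | president | sammy | the | usa | winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.322745 | 0.0 | 0.0 | 0.546454 | 0.546454 | 0.0 | 0.0 | 0.546454 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.432385 | 0.0 | 0.0 | 0.255374 | 0.0 | 0.0 | 0.0 | 0.0 | 0.432385 | 0.432385 | 0.0 | 0.432385 | 0.432385 | 0.0 |
| 2 | 0.284569 | 0.0 | 0.284569 | 0.569137 | 0.168071 | 0.284569 | 0.569137 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.284569 |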
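
The weights above can be reproduced by hand, which helps show where they come from. The sketch below assumes the default `TfidfVectorizer` settings as documented by scikit-learn: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The `tokenize` and `tfidf` helpers are hypothetical names, and the third document is written as one string because that is how the corpus above is actually parsed.

```python
import math
import re
from collections import Counter

docs = [
    "My name is Sammy.",
    "Donald is the President of USA.",
    "London is a city in England. It's winter in London.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

term_counts = [Counter(tokenize(d)) for d in docs]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for counts in term_counts for term in counts)

def tfidf(counts):
    # Raw weight = term count * smoothed idf
    raw = {
        t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, tf in counts.items()
    }
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(c) for c in term_counts]
print(round(weights[0]["is"], 6), round(weights[0]["name"], 6))
```

The term "is" appears in all three documents, so its idf is the minimum value of 1, which is why it ends up with the smallest weight in every row.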


##### <div align='center'>Document Term Matrix with Hashing Vectorization</div><br>
This approach converts tokens into a sparse matrix by hashing each token to a column index. It needs little memory because no vocabulary is stored, but collisions can occur when several tokens are hashed to the same column. This approach is useful in streaming and parallel pipelines.
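
The hashing idea can be sketched in a few lines of standard-library Python. This is an illustration only: it swaps in `hashlib.md5` as the hash function and uses a deliberately tiny, hypothetical table size `N_FEATURES = 16`, whereas scikit-learn's `HashingVectorizer` is documented as using a different hash function, a much larger default table and normalized values. The `bucket` and `hash_vectorize` helpers are hypothetical names.

```python
import hashlib
import re

N_FEATURES = 16  # hypothetical, deliberately small so that collisions are likely

def bucket(token):
    # Map a token to a column index using a stable hash
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_FEATURES

def hash_vectorize(doc):
    # Count tokens per hashed column; no vocabulary is ever stored
    vector = [0] * N_FEATURES
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        vector[bucket(token)] += 1
    return vector

vec = hash_vectorize("London is a city in England")
print(vec)
```

Because only the column index is kept, two different tokens that land in the same column become indistinguishable, and the original terms cannot be recovered from the matrix.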

In [42]:
corpus=['My name is Sammy.',
       'Donald is the President of USA.',
       'London is a city in England.' # no comma here, so Python joins this literal with the next one: the corpus has 3 documents
       'It\'s winter in London.']

hashing_vectorizer=HashingVectorizer() 
x=hashing_vectorizer.fit_transform(corpus)
x

<3x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [43]:
# Convert the sparse matrix to a dense array and print it
print(x.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [44]:
# Create a DataFrame from the hashed document-term matrix
hashing_df=pd.DataFrame(x.toarray())
hashing_df.head()

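| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1048566 | 1048567 | 1048568 | 1048569 | 1048570 | 1048571 | 1048572 | 1048573 | 1048574 | 1048575 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |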
