# <div align="center">Text Preprocessing with NLTK</div>

Import nltk package

In [39]:
import nltk

Download the NLTK corpora and models

In [40]:
# Uncomment and run this cell to download the NLTK data if you do not already have it
# nltk.download()

Let's import the libraries we will use for text processing

In [41]:
from nltk.tokenize import word_tokenize, sent_tokenize # splitting a string into substrings
from nltk.corpus import wordnet # for synonyms
from nltk.corpus import stopwords # for removing stop words
from nltk.stem import PorterStemmer # for stemming a word
from nltk import WordNetLemmatizer # for lemmatizing a word
import string # to remove punctuation

We will start with the following passage.

In [42]:
text="""Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, 
and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to 
process and analyze large amounts of natural language data. The history of natural language processing (NLP) generally started 
in the 1950s, although work can be found from earlier periods. In 1950's, Alan Turing published an article titled 
"Computing Machinery and Intelligence "which proposed what is now called the Turing test as a criterion of @intelligence goes"""
text

'Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, \nand artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to \nprocess and analyze large amounts of natural language data. The history of natural language processing (NLP) generally started \nin the 1950s, although work can be found from earlier periods. In 1950\'s, Alan Turing published an article titled \n"Computing Machinery and Intelligence "which proposed what is now called the Turing test as a criterion of @intelligence goes'

##### <div align="center">Counting the number of characters in a text</div>

In [43]:
len(text)

628

In [44]:
# Returns the characters at indices 0 through 9 (the end index 10 is excluded)
text[0:10]

'Natural la'

In [45]:
# Selects the single character at index 9 (the 10th character)
text[9]

'a'

##### <div align="center">Removing Punctuation</div>

Print available punctuation recognized by Python

In [46]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Remove punctuation

In [47]:
"".join([t for t in text if t not in string.punctuation])

'Natural language processing NLP is a subfield of linguistics computer science information engineering \nand artificial intelligence concerned with the interactions between computers and human natural languages in particular how to program computers to \nprocess and analyze large amounts of natural language data The history of natural language processing NLP generally started \nin the 1950s although work can be found from earlier periods In 1950s Alan Turing published an article titled \nComputing Machinery and Intelligence which proposed what is now called the Turing test as a criterion of intelligence goes'
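
The list comprehension above tests every character against string.punctuation. As a minimal alternative sketch, assuming only the standard library, the same effect can be obtained with str.translate and a deletion table. The `sample` string is a hypothetical shortened input, not the notebook's `text` variable.

```python
import string

# Hypothetical short sample, used here instead of the full notebook text.
sample = 'In 1950\'s, Alan Turing published "Computing Machinery and Intelligence".'

# The third argument of str.maketrans lists characters to delete.
remove_punct = str.maketrans("", "", string.punctuation)

cleaned = sample.translate(remove_punct)
print(cleaned)
```

Both approaches only remove the ASCII punctuation listed in string.punctuation; typographic quotes or dashes would pass through unchanged.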

Converting text to lower case

In [48]:
text = text.lower() # str.lower() lowercases the whole string in one call
print(text)

natural language processing (nlp) is a subfield of linguistics, computer science, information engineering, 
and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to 
process and analyze large amounts of natural language data. the history of natural language processing (nlp) generally started 
in the 1950s, although work can be found from earlier periods. in 1950's, alan turing published an article titled 
"computing machinery and intelligence "which proposed what is now called the turing test as a criterion of @intelligence goes


##### <div align="center">Tokenization</div>
Tokenization is the process of splitting a text into its constituent substrings (tokens). We can split a text into sentences or into words.

Tokenize text to words.
The function has the signature word_tokenize(text, language='english', preserve_line=False)

In [67]:
tokenized_text=word_tokenize(text,language='english', preserve_line=False)
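tokenized_text # the full list of word tokens; slice with [0:20] to show only the first 20

TypeError: expected string or bytes-like object

Note: this error was recorded when the cell was re-run (In [67]) after text had been rebound to a list of tokens in the stop-word removal step further down. word_tokenize expects a string, so run this cell before that step.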

Tokenize text to sentences.
The function has the signature sent_tokenize(text, language='english')

In [50]:
sent_tokenize(text,language='english')

['natural language processing (nlp) is a subfield of linguistics, computer science, information engineering, \nand artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to \nprocess and analyze large amounts of natural language data.',
 'the history of natural language processing (nlp) generally started \nin the 1950s, although work can be found from earlier periods.',
 'in 1950\'s, alan turing published an article titled \n"computing machinery and intelligence "which proposed what is now called the turing test as a criterion of @intelligence goes']

Tokenize the words in each sentence

In [51]:
[word_tokenize(t)[0:5] for t in sent_tokenize(text,language='english')] #  display only the first 5 words in each sentence using [0:5]

[['natural', 'language', 'processing', '(', 'nlp'],
 ['the', 'history', 'of', 'natural', 'language'],
 ['in', '1950', "'s", ',', 'alan']]
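
To make the idea of sentence and word tokenization concrete, here is a rough sketch built on regular expressions only. It is a simplified stand-in and not NLTK's algorithm: `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helper names, and the splitting rules are naive assumptions (for instance, an abbreviation such as "Dr." would wrongly end a sentence).

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return runs of word characters, and each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical sample text.
sample = "the history of nlp started in the 1950s. alan turing proposed a test."

sentences = simple_sent_tokenize(sample)
tokens = [simple_word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Unlike word_tokenize, this sketch does not split contractions or possessives in any special way; it is only meant to show the shape of the result, a list of sentences and a list of tokens per sentence.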

##### <div align="center">Removing stop words</div>
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words.

Show the first 10 English stop words in NLTK's list

In [52]:
stopwords.words("english")[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Remove stop words from the entire text

In [53]:
stop_words = set(stopwords.words('english')) # build the set once for fast membership tests
text = [w for w in tokenized_text if w not in stop_words]
print(text)

['natural', 'language', 'processing', '(', 'nlp', ')', 'subfield', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', '(', 'natural', ')', 'languages', ',', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', '.', 'history', 'natural', 'language', 'processing', '(', 'nlp', ')', 'generally', 'started', '1950s', ',', 'although', 'work', 'found', 'earlier', 'periods', '.', '1950', "'s", ',', 'alan', 'turing', 'published', 'article', 'titled', "''", 'computing', 'machinery', 'intelligence', '``', 'proposed', 'called', 'turing', 'test', 'criterion', '@', 'intelligence', 'goes']
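
The filtered list above still contains punctuation tokens such as '(' and ','. The following sketch removes stop words and pure-punctuation tokens in a single pass. It assumes a tiny hand-written stop-word set and a hypothetical token list so that it runs without the NLTK corpora; in the notebook, stopwords.words('english') would take the place of `stop_words`.

```python
import string

# Hypothetical, tiny stop-word set for illustration; NLTK's English list is much longer.
stop_words = {"is", "a", "of", "the", "and", "in"}

# Hypothetical tokens, shaped like the output of a word tokenizer.
tokens = ["natural", "language", "processing", "(", "nlp", ")", "is", "a",
          "subfield", "of", "linguistics", ","]

filtered = [
    tok for tok in tokens
    if tok not in stop_words                                # drop stop words
    and not all(ch in string.punctuation for ch in tok)     # drop pure-punctuation tokens
]
print(filtered)
```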


##### <div align="center">Text Normalization</div>
##### 1. Stemming
Stemming is the process of reducing a word to its root/base form, e.g. sleeping to sleep, eating to eat.

Stemming a single word

In [54]:
PorterStemmer().stem('Natural')

'natur'

In [55]:
print(PorterStemmer().stem('goes'))
print(PorterStemmer().stem('going'))
print(PorterStemmer().stem('go'))

goe
go
go


In [56]:
stemmer = PorterStemmer() # create the stemmer once and reuse it
stemmed_text = [stemmer.stem(word) for word in text]
print(stemmed_text)

['natur', 'languag', 'process', '(', 'nlp', ')', 'subfield', 'linguist', ',', 'comput', 'scienc', ',', 'inform', 'engin', ',', 'artifici', 'intellig', 'concern', 'interact', 'comput', 'human', '(', 'natur', ')', 'languag', ',', 'particular', 'program', 'comput', 'process', 'analyz', 'larg', 'amount', 'natur', 'languag', 'data', '.', 'histori', 'natur', 'languag', 'process', '(', 'nlp', ')', 'gener', 'start', '1950', ',', 'although', 'work', 'found', 'earlier', 'period', '.', '1950', "'s", ',', 'alan', 'ture', 'publish', 'articl', 'titl', "''", 'comput', 'machineri', 'intellig', '``', 'propos', 'call', 'ture', 'test', 'criterion', '@', 'intellig', 'goe']
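
Stems such as 'natur' and 'goe' above are not dictionary words because a stemmer applies mechanical suffix rules. The toy function below illustrates that idea only. It is not the Porter algorithm, and `toy_stem` with its suffix list is a hypothetical simplification.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical sample words.
words = ["sleeping", "eating", "started", "computers", "goes"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The Porter stemmer uses a much larger, ordered rule set, so its results differ from this sketch for many words (for example it reduces 'computers' to 'comput', as in the output above).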


##### 2. Lemmatization
Lemmatization is the process of reducing a word to its base/root form (lemma) while taking the morphological analysis of the word into consideration, unlike stemming, which simply cuts off the ending or starting characters of the word.

Lemmatize a single word

In [57]:
WordNetLemmatizer().lemmatize('Natural')

'Natural'

In [58]:
print(WordNetLemmatizer().lemmatize('goes'))
print(WordNetLemmatizer().lemmatize('go'))
print(WordNetLemmatizer().lemmatize('going'))

go
go
going
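
The result 'going' above follows from the fact that WordNetLemmatizer.lemmatize treats its input as a noun unless a pos argument is given (the default is pos='n'). The sketch below shows why the part of speech matters, using a hypothetical miniature lookup table in place of WordNet; `LEMMA_TABLE` and `toy_lemmatize` are illustrative names, not NLTK APIs.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech);
# WordNet plays this role, at a much larger scale, in NLTK.
LEMMA_TABLE = {
    ("going", "v"): "go",
    ("goes", "v"): "go",
    ("went", "v"): "go",
    ("going", "n"): "going",  # as a noun, e.g. "the going was tough"
    ("computers", "n"): "computer",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("going"))           # treated as a noun by default
print(toy_lemmatize("going", pos="v"))  # treated as a verb
print(toy_lemmatize("went", pos="v"))   # irregular form that suffix stripping cannot handle
```

In NLTK, the analogous call is WordNetLemmatizer().lemmatize('going', pos='v'), which asks for the verb reading of the word.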


Lemmatize the entire text

In [59]:
lemmatizer = WordNetLemmatizer() # create the lemmatizer once and reuse it
lemmatized_text = [lemmatizer.lemmatize(t) for t in text]
print(lemmatized_text)

['natural', 'language', 'processing', '(', 'nlp', ')', 'subfield', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'artificial', 'intelligence', 'concerned', 'interaction', 'computer', 'human', '(', 'natural', ')', 'language', ',', 'particular', 'program', 'computer', 'process', 'analyze', 'large', 'amount', 'natural', 'language', 'data', '.', 'history', 'natural', 'language', 'processing', '(', 'nlp', ')', 'generally', 'started', '1950s', ',', 'although', 'work', 'found', 'earlier', 'period', '.', '1950', "'s", ',', 'alan', 'turing', 'published', 'article', 'titled', "''", 'computing', 'machinery', 'intelligence', '``', 'proposed', 'called', 'turing', 'test', 'criterion', '@', 'intelligence', 'go']


##### <div align="center">Synonyms and Antonyms</div>

Synonyms<br>
A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in the same language; for example, shut is a synonym of close.

In [60]:
syn=wordnet.synsets('country')
syn # Returns the list of synsets (groups of synonymous lemmas, one per word sense) for the word

[Synset('state.n.04'),
 Synset('country.n.02'),
 Synset('nation.n.02'),
 Synset('country.n.04'),
 Synset('area.n.01')]

Return the name of the first synset

In [61]:
syn[0].name()

'state.n.04'

Definition of the first synset

In [62]:
syn[0].definition()

'a politically organized body of people under a single government'

Examples of how the words in the synset are used in sentences

In [63]:
syn[0].examples()

['the state has elected a new president',
 'African nations',
 "students who had come to the nation's capitol",
 "the country's largest manufacturer",
 'an industrialized land']

Get the first synonym (lemma name) using the lemmas() method

In [64]:
syn[0].lemmas()[0].name()

'state'

Antonyms<br/>
An antonym is a word opposite in meaning to another (e.g. bad and good).

In [65]:
synonyms = [] 
antonyms = [] 
  
for syn in wordnet.synsets("happy"): 
    for l in syn.lemmas(): 
        synonyms.append(l.name()) 
        if l.antonyms(): 
            antonyms.append(l.antonyms()[0].name()) 
            
print('Synonyms for the word happy\n',synonyms,'\nAntonyms for the synonyms\n',antonyms)


Synonyms for the word happy
 ['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen'] 
Antonyms for the synonyms
 ['unhappy']
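
The synonym list above repeats 'happy' because the same lemma appears in several synsets. A small sketch for removing the duplicates while keeping the original order, assuming Python 3.7+ where dictionaries preserve insertion order:

```python
# Synonym list copied from the output above.
synonyms = ['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen']

# Dictionary keys are unique and keep insertion order, so this drops repeats
# without reordering the remaining items.
unique_synonyms = list(dict.fromkeys(synonyms))
print(unique_synonyms)
```

Using set(synonyms) would also remove duplicates, but it does not guarantee any particular order.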


##### Vectorization
Vectorization is the process of encoding text data as numeric vectors that a computer model can understand and process.
Techniques for text vectorization include:
1. Count Vectorization - Document-Term Matrix
2. Term Frequency - Inverse Document Frequency (TF-IDF)
3. Bag of Words (BoW)
4. Continuous Bag of Words (CBoW)
5. Skip-Gram
6. N-grams
7. Word Embedding - Word2Vec
8. Sentence Embedding - Sent2Vec
9. Document Embedding - Doc2Vec
10. Character Embedding - Char2Vec
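
As a first taste of two items from the list, the sketch below builds a small document-term count matrix and extracts bigrams using only the standard library. The corpus, `count_vector`, and `ngrams` are hypothetical names chosen for illustration; libraries such as scikit-learn provide production versions of count vectorization.

```python
from collections import Counter

# Hypothetical toy corpus of already-lowercased documents.
docs = ["natural language processing", "language data processing language"]

# Whitespace splitting stands in for a real tokenizer here.
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique tokens; each term gets a fixed column index.
vocab = sorted({tok for doc in tokenized for tok in doc})

def count_vector(tokens):
    """Return the number of occurrences of each vocabulary term in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

matrix = [count_vector(doc) for doc in tokenized]  # one row per document
bigrams = ngrams(tokenized[0], 2)

print(vocab)
print(matrix)
print(bigrams)
```

Each row of `matrix` is the vector for one document, and each column corresponds to the term at the same position in `vocab`.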

--- 
Each of these techniques will be covered in detail in a separate session.