# Text Preprocessing
1. [Noise Removal](#NoiseRemoval)
2. [Tokenization](#Tokenization)
3. [Normalization](#Normalization)
    * Lower-case Letters
    * Stopword Removal
    * Stemming
    * Lemmatization
    * Part-of-Speech (POS) Tagging
    * Lemmatization with POS

In [1]:
# text 
nlp_wiki = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and 
artificial intelligence concerned with the interactions between computers and human language, 
in particular how to program computers to process and analyze large amounts of natural language data. 
The goal is a computer capable of "understanding" the contents of documents, including the contextual 
nuances of the language within them. The technology can then accurately extract information 
and insights contained in the documents as well as categorize and organize the documents themselves.
I am, as well as we are, the best!
"""

<a name="NoiseRemoval"> </a>
## Noise removal
```sub()``` from the ```re``` library  

It takes three arguments:
1. **Pattern** *a regex. Prefix the string with ```r``` to make it a raw string, in which ```\``` is treated as a literal character rather than the start of an escape sequence*
2. **Replacement Text** *replaces all the matches in the input string*
3. **Input** *the string that will be edited*

Tasks:
1. Remove **punctuation**
2. Remove **extra whitespace**
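
The two tasks can be tried on a short string first. This is a minimal sketch: the `sample` string is made up for illustration, and the whitespace pattern `\s+` is an assumption about the intent (it collapses a run of any length, newlines included).

```python
import re

# hypothetical sample string, not taken from the notebook's text
sample = 'Hello,   world! (This is "NLP".)'

# 1. remove punctuation: every character in the class is replaced by ''
no_punc = re.sub(r'[.,"()!]', '', sample)

# 2. collapse each run of whitespace into one space, then trim both ends
clean = re.sub(r'\s+', ' ', no_punc).strip()

print(no_punc)  # Hello   world This is NLP
print(clean)    # Hello world This is NLP
```

Inside a character class most punctuation needs no backslash, so `[.,"()!]` matches the same characters as the escaped form.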

In [2]:
# noise removal
import re

# remove punctuation
nlp_no_punc = re.sub(r'[\.\,\"\(\)\!]', '', nlp_wiki)

# collapse each run of whitespace (including newlines) into a single space
nlp_no_punc_whitespace = re.sub(r'\s+', ' ', nlp_no_punc).strip()

# print text without punctuation
print(nlp_no_punc)
print("")
# print text without punc and whitespace
print(nlp_no_punc_whitespace)


Natural language processing NLP is a subfield of linguistics computer science and 
artificial intelligence concerned with the interactions between computers and human language 
in particular how to program computers to process and analyze large amounts of natural language data 
The goal is a computer capable of understanding the contents of documents including the contextual 
nuances of the language within them The technology can then accurately extract information 
and insights contained in the documents as well as categorize and organize the documents themselves
I am as well as we are the best



Natural language processing NLP is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data The goal is a computer capable of understanding the contents of documents including the contextual nuances of the lan

In [3]:
# import nltk to Jupyter Notebook
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\10inm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

<a name="Tokenization"> </a>
## Tokenization
```word_tokenize()``` from ```nltk.tokenize```

Convert the string into a **list of tokens** (individual words).
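
For intuition, here is a rough standard-library approximation of tokenization. It is not how ```word_tokenize()``` works internally (that function also handles punctuation, contractions, and similar cases). It only shows what "a list of tokens" looks like once punctuation has already been removed. The `text` value is a shortened sample.

```python
import re

# shortened sample, punctuation already removed
text = "Natural language processing NLP is a subfield of linguistics"

# whitespace split: adequate once punctuation is gone
tokens_split = text.split()

# regex alternative: pull out runs of word characters
tokens_regex = re.findall(r"\w+", text)

print(tokens_split)
print(tokens_split == tokens_regex)  # True for this punctuation-free sample
```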

In [4]:
# nltk.download()
from nltk.tokenize import word_tokenize

# split the text into individual words
nlp_wiki_tokenized = word_tokenize(nlp_no_punc_whitespace)

# print the list of words
print(nlp_wiki_tokenized)

['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', 'understanding', 'the', 'contents', 'of', 'documents', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', 'I', 'am', 'as', 'well', 'as', 'we', 'are', 'the', 'best']


<a name="Normalization"> </a>
## Normalization

### Lower-case words
```lower()```

In [5]:
# lowercase every token in the list
nlp_wiki_lower = [word.lower() for word in nlp_wiki_tokenized]

# print the new list
print(nlp_wiki_lower) 

['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', 'the', 'goal', 'is', 'a', 'computer', 'capable', 'of', 'understanding', 'the', 'contents', 'of', 'documents', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', 'the', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', 'i', 'am', 'as', 'well', 'as', 'we', 'are', 'the', 'best']


### Remove stopwords
```stopwords``` from ```nltk.corpus```

In [6]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\10inm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# remove stopwords
from nltk.corpus import stopwords

# define set of English stopwords
stop_words = set(stopwords.words('english'))

# remove stopwords from text
nlp_wiki_no_stop = [word for word in nlp_wiki_lower
                    if word not in stop_words]

# print list of words
print(nlp_wiki_no_stop)

['natural', 'language', 'processing', 'nlp', 'subfield', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', 'language', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', 'goal', 'computer', 'capable', 'understanding', 'contents', 'documents', 'including', 'contextual', 'nuances', 'language', 'within', 'technology', 'accurately', 'extract', 'information', 'insights', 'contained', 'documents', 'well', 'categorize', 'organize', 'documents', 'well', 'best']
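
The same filtering pattern can be sketched without downloading any NLTK data. The stopword set below is a small hand-picked subset chosen for illustration (an assumption, not NLTK's actual list, which is much longer). The tokens are the last sentence of the text after lower-casing.

```python
# hypothetical, hand-picked subset of English stopwords
stop_words = {"i", "am", "as", "we", "are", "the", "is", "a", "of", "and"}

# last sentence of the text, already tokenized and lower-cased
tokens = ["i", "am", "as", "well", "as", "we", "are", "the", "best"]

# keep only the tokens that are not stopwords
filtered = [word for word in tokens if word not in stop_words]

print(filtered)  # ['well', 'best']
```

Storing the stopwords in a `set` rather than a list keeps each `not in` check at constant average cost, which matters when the token list is long.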


### Stemming 
```PorterStemmer``` from ```nltk.stem```

*Removing word affixes*

In [8]:
from nltk.stem import PorterStemmer

# instantiate PorterStemmer
stemmer = PorterStemmer()

# stem the words
nlp_wiki_stemmed = [stemmer.stem(token) for token in nlp_wiki_no_stop]

# print the new list
print(nlp_wiki_stemmed)

['natur', 'languag', 'process', 'nlp', 'subfield', 'linguist', 'comput', 'scienc', 'artifici', 'intellig', 'concern', 'interact', 'comput', 'human', 'languag', 'particular', 'program', 'comput', 'process', 'analyz', 'larg', 'amount', 'natur', 'languag', 'data', 'goal', 'comput', 'capabl', 'understand', 'content', 'document', 'includ', 'contextu', 'nuanc', 'languag', 'within', 'technolog', 'accur', 'extract', 'inform', 'insight', 'contain', 'document', 'well', 'categor', 'organ', 'document', 'well', 'best']
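
One way to see what stemming buys is to group the original tokens by their stem. The sketch below hard-codes a few (token, stem) pairs read off the stopword-free list and the stemmed list printed above. It does not call the stemmer itself.

```python
from collections import defaultdict

# (token, stem) pairs copied from the two printed lists above
pairs = [
    ("computer", "comput"),
    ("computers", "comput"),
    ("processing", "process"),
    ("process", "process"),
    ("documents", "document"),
]

# map each stem to the set of surface forms that produced it
groups = defaultdict(set)
for token, stem in pairs:
    groups[stem].add(token)

for stem in sorted(groups):
    print(stem, sorted(groups[stem]))
```

Different inflections collapse onto one key, so later counts treat them as the same term. The price is that stems such as `comput` are not real words.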


### Lemmatization
```WordNetLemmatizer``` from ```nltk.stem```

*Reducing words to their dictionary form (lemma)*

In [9]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\10inm\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [10]:
from nltk.stem import WordNetLemmatizer

# instantiate the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize the words
nlp_wiki_lemmatized = [lemmatizer.lemmatize(token) 
                       for token in nlp_wiki_no_stop]

# print the new list
print(nlp_wiki_lemmatized)

['natural', 'language', 'processing', 'nlp', 'subfield', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'concerned', 'interaction', 'computer', 'human', 'language', 'particular', 'program', 'computer', 'process', 'analyze', 'large', 'amount', 'natural', 'language', 'data', 'goal', 'computer', 'capable', 'understanding', 'content', 'document', 'including', 'contextual', 'nuance', 'language', 'within', 'technology', 'accurately', 'extract', 'information', 'insight', 'contained', 'document', 'well', 'categorize', 'organize', 'document', 'well', 'best']


By default, ```lemmatize``` treats every word as a **noun**!

To get the full benefit of lemmatization, we need to **perform POS tagging first!**

### Part-of-Speech (POS) Tagging
```wordnet``` from ```nltk.corpus``` *a lexical database that groups words into sets of synonyms (synsets), each tagged with a part of speech*  
```Counter``` from ```collections``` *a dictionary subclass that stores elements as keys and their counts as values*

In [11]:
from nltk.corpus import wordnet
from collections import Counter

def pos_tagging(token):
    """Return the most probable part of speech of a word as a WordNet tag ('n', 'v', 'a' or 'r')."""

    # get the WordNet synsets (synonym sets, each tagged with a POS) for the word
    probable_pos = wordnet.synsets(token)
    
    # instantiate Counter
    pos_counts = Counter()
    
    # count the number of nouns in the syn set
    pos_counts['n'] = len([item for item in probable_pos if item.pos() == 'n'])
    # count the number of verbs in the syn set
    pos_counts['v'] = len([item for item in probable_pos if item.pos() == 'v'])
    # count the number of adjectives in the syn set (satellite adjectives, tagged 's', are not counted)
    pos_counts['a'] = len([item for item in probable_pos if item.pos() == 'a'])
    # count the number of adverbs in the syn set
    pos_counts['r'] = len([item for item in probable_pos if item.pos() == 'r'])
    
    most_likely_pos = pos_counts.most_common(1)[0][0]
    
    return most_likely_pos
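
One property of the function above is worth seeing in isolation: what happens when a token has no synsets at all. The sketch below reproduces only the counting logic with the standard library. It assumes that `wordnet.synsets` returns an empty list for unknown tokens, and that `Counter.most_common` orders ties by first insertion (documented for Python 3.7+). The count of 3 is a made-up value.

```python
from collections import Counter

# mimic pos_tagging for a token with no synsets: every count is zero
pos_counts = Counter()
for tag in ("n", "v", "a", "r"):
    pos_counts[tag] = 0

# all counts tie, so the first inserted key wins
fallback = pos_counts.most_common(1)[0][0]

# with a clear winner, that tag is returned instead (hypothetical count)
pos_counts["v"] = 3
winner = pos_counts.most_common(1)[0][0]

print(fallback, winner)  # n v
```

Unknown tokens therefore fall back to the noun tag, which is also the lemmatizer's default, so they are lemmatized exactly as they would be without POS tagging.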

In [12]:
# perform lemmatization along with POS tagging
nlp_wiki_pos_lemmatized = [lemmatizer.lemmatize(token, pos_tagging(token)) for token in nlp_wiki_no_stop]

# compare lemmatization without and with POS tagging
print(nlp_wiki_lemmatized)
print("")
print(nlp_wiki_pos_lemmatized)

['natural', 'language', 'processing', 'nlp', 'subfield', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'concerned', 'interaction', 'computer', 'human', 'language', 'particular', 'program', 'computer', 'process', 'analyze', 'large', 'amount', 'natural', 'language', 'data', 'goal', 'computer', 'capable', 'understanding', 'content', 'document', 'including', 'contextual', 'nuance', 'language', 'within', 'technology', 'accurately', 'extract', 'information', 'insight', 'contained', 'document', 'well', 'categorize', 'organize', 'document', 'well', 'best']

['natural', 'language', 'process', 'nlp', 'subfield', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'concern', 'interaction', 'computer', 'human', 'language', 'particular', 'program', 'computer', 'process', 'analyze', 'large', 'amount', 'natural', 'language', 'data', 'goal', 'computer', 'capable', 'understand', 'content', 'document', 'include', 'contextual', 'nuance', 'language', 'within', 'techno