# Text Preprocessing

Suppose we have textual data available. Before a machine learning algorithm can use it, we need to apply a number of preprocessing steps that transform the words into numerical features.

The preprocessing steps to apply depend mainly on the domain and on the problem itself. We don't need to apply every step to every problem.

Here, we'll walk through text preprocessing in Python using the NLTK (Natural Language Toolkit) library.
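Since not every step is needed for every problem, one way to keep the choice explicit is to chain only the steps a task requires. The following is a minimal sketch using only the standard library; `build_pipeline` and the three step functions are hypothetical helpers written for this illustration, not part of NLTK.

```python
import re
import string


def lowercase(text):
    return text.lower()


def strip_digits(text):
    return re.sub(r"\d+", "", text)


def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def build_pipeline(*steps):
    """Return a function that applies the given steps from left to right."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


# Hypothetical task: keep digits, but normalise case and punctuation.
prep = build_pipeline(lowercase, strip_punctuation)
print(prep("Rain is HIGH, Today!!"))  # rain is high today
```

Adding or dropping a step then means changing only the arguments passed to `build_pipeline`.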

In [1]:
# Import necessary libraries
import nltk
import string
import re

### Text Lowercase

We lowercase the text to reduce the size of the vocabulary of our text data.

In [2]:
def lowercase_text(text):
    return text.lower()


input_str = "Weather is too Cloudy.Possiblity of Rain is High,Today!!"
lowercase_text(input_str)

'weather is too cloudy.possiblity of rain is high,today!!'
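To see the effect on vocabulary size, we can count the distinct tokens before and after lowercasing. This is a small sketch with a made-up sentence and a plain whitespace split; `vocabulary` is a hypothetical helper.

```python
def vocabulary(tokens):
    """Return the set of distinct tokens."""
    return set(tokens)


tokens = "The weather is cloudy and the Weather is cold".split()

raw_vocab = vocabulary(tokens)
lower_vocab = vocabulary(token.lower() for token in tokens)

# 'The'/'the' and 'weather'/'Weather' collapse into one entry each.
print(len(raw_vocab), len(lower_vocab))  # 8 6
```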

### Remove numbers

We should either remove numbers or convert them into their textual representations. We use regular expressions (the re module) to remove numbers.

In [3]:
# For Removing Numbers
def remove_num(text):
    result = re.sub(r'\d+', '', text)
    return result


input_str = "You bought 6 candies from shop, and 4 candies are in home."
remove_num(input_str)

'You bought  candies from shop, and  candies are in home.'
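Notice that removing the digits leaves two consecutive spaces where each number used to be. If that matters for later steps, one option is to collapse runs of whitespace afterwards. A minimal sketch follows; `remove_num_clean` is a hypothetical variant of the function above.

```python
import re


def remove_num_clean(text):
    # Drop every run of digits, then squeeze repeated whitespace
    # into a single space and trim the ends.
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()


input_str = "You bought 6 candies from shop, and 4 candies are in home."
print(remove_num_clean(input_str))
# You bought candies from shop, and candies are in home.
```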

As mentioned above, we can also convert the numbers into words. This can be done with the inflect library.

In [4]:
!pip install inflect

In [5]:
# Import the Library
import inflect
q = inflect.engine()

# convert numbers into words
def convert_num(text):
    # split the string into a list of words
    temp_string = text.split()
    # initialise an empty list for the result
    new_str = []

    for word in temp_string:
        # if the word consists only of digits, convert it
        # to words and append it to the new_str list
        if word.isdigit():
            temp = q.number_to_words(word)
            new_str.append(temp)
        # otherwise, append the word as it is
        else:
            new_str.append(word)
    # join the words in new_str to form a string
    temp_str = ' '.join(new_str)
    return temp_str


input_str = "You bought 6 candies from shop, and 4 candies are in home."
convert_num(input_str)

'You bought six candies from shop, and four candies are in home.'
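Because the function splits on whitespace and checks `isdigit()`, a number attached to punctuation (for example `6,`) is left unchanged. A possible alternative is to substitute each run of digits with a regular expression. In the sketch below, `convert_num_regex` is a hypothetical helper, and `small_number_to_words` is only a stand-in that reads digits one by one so the example runs without extra packages; in practice one could pass `q.number_to_words` from the cell above instead.

```python
import re

# Stand-in for a real number-to-words converter (assumption for this sketch).
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def small_number_to_words(number):
    # Reads the number digit by digit, e.g. "42" -> "four two".
    return " ".join(DIGIT_WORDS[digit] for digit in number)


def convert_num_regex(text, to_words=small_number_to_words):
    # Replace every run of digits, even when punctuation is attached.
    return re.sub(r"\d+", lambda match: to_words(match.group()), text)


print(convert_num_regex("You bought 6, then 4."))
# You bought six, then four.
```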

### Remove Punctuation

In [6]:
def rem_punct(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = "Hey, Are you excited??, After a week, we will be in Shimla!!!"
rem_punct(input_str)

'Hey Are you excited After a week we will be in Shimla'

In [7]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
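Deleting punctuation outright fuses words that were separated only by a punctuation mark, such as `Cloudy.Possiblity` in the lowercase example earlier. One option is to replace each punctuation character with a space and then normalise the whitespace. A minimal sketch follows; `rem_punct_spaced` is a hypothetical variant of `rem_punct`.

```python
import string


def rem_punct_spaced(text):
    # Map every punctuation character to a space instead of deleting it.
    translator = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() + join() collapses the extra spaces this creates.
    return " ".join(text.translate(translator).split())


input_str = "Weather is too Cloudy.Possiblity of Rain is High,Today!!"
print(rem_punct_spaced(input_str))
# Weather is too Cloudy Possiblity of Rain is High Today
```

The trade-off is that tokens such as `A.I` or `don't` are split into pieces, so which variant fits depends on the text.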

### Remove default stopwords

Stopwords are common words that contribute little to the meaning of a sentence, so they can usually be removed without changing that meaning much. The NLTK library provides a list of stopwords, and we can use it to remove stopwords from our text and return a list of word tokens.

In [8]:
# importing the nltk modules we need
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [9]:
# remove stopwords function
def rem_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in string.punctuation]
    filtered_text = [word for word in filtered_text if word not in stop_words]
    return filtered_text


ex_text = "Data is the new oil. A.I is the last invention"
rem_stopwords(ex_text)

['Data', 'new', 'oil', 'A.I', 'last', 'invention']
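NLTK's English stopword list is written in lowercase, so a capitalised stopword such as `The` would pass the `not in stop_words` check unless the text was lowercased first. The sketch below compares tokens case-insensitively. To stay self-contained it uses a tiny hand-written set, `SAMPLE_STOP_WORDS`, as a stand-in for `stopwords.words('english')`; both that set and `rem_stopwords_ci` are assumptions made for this illustration.

```python
# Tiny stand-in for the NLTK list (assumption for this sketch).
SAMPLE_STOP_WORDS = {"is", "the", "a", "of"}


def rem_stopwords_ci(tokens, stop_words=SAMPLE_STOP_WORDS):
    # Compare in lowercase but keep the original spelling of kept tokens.
    return [token for token in tokens if token.lower() not in stop_words]


print(rem_stopwords_ci(["The", "data", "is", "A", "new", "oil"]))
# ['data', 'new', 'oil']
```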

### Stemming

Stemming is the process of getting the root form of a word. The root, or stem, is the part to which inflectional affixes (such as -ed, -ize, etc.) are added. We obtain the stem by removing the suffix or prefix of a word, so stemming a word may not result in an actual word.

Boys ----> Boy

going ---> go

If our sentences are not yet split into tokens, we first need to tokenize them. After converting the strings of text into tokens, we can convert those word tokens into their root form. Several stemming algorithms are available, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the one we usually use.

In [10]:
# importing nltk's porter stemmer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
steml = PorterStemmer()

# stem words in the list of tokenised words
def s_words(text):
    word_tokens = word_tokenize(text)
    stems = [steml.stem(word) for word in word_tokens]
    return stems

text = 'Data is the new revolution in the word, in a day one individual would generate terabytes of data'
s_words(text)

['data',
 'is',
 'the',
 'new',
 'revolut',
 'in',
 'the',
 'word',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individu',
 'would',
 'gener',
 'terabyt',
 'of',
 'data']
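Stems such as `revolut` and `gener` are not dictionary words because a stemmer only strips affixes by rule. The toy function below makes that visible. It is a deliberately naive suffix stripper written for this illustration, not the Porter algorithm, and `SUFFIXES` and `toy_stem` are hypothetical names.

```python
# Suffixes tried in order; the first match is removed (toy rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least two characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["Boys", "going", "generates", "is"]])
# ['boy', 'go', 'generat', 'is']
```

Real stemmers apply many more rules and conditions, but the principle of rule-based affix removal is the same.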

### Lemmatization

Like stemming, lemmatization reduces a word to its root form. The difference is that lemmatization ensures that the root word belongs to the language, so we always get a valid word. In NLTK, we use WordNetLemmatizer to get the lemmas of words. The lemmatizer also needs context for each word, so we pass the part of speech (pos) as a parameter.

In [16]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...


True

In [17]:
from nltk.stem import wordnet
from nltk.tokenize import word_tokenize
lemma = wordnet.WordNetLemmatizer()
nltk.download('wordnet')
# lemmatize the words in a string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context i.e. part of speech(pos)
    lemmas = [lemma.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas


text = 'Data is the new revolution in the word, in a day one individual would generate terabytes of data'
lemmatize_word(text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Data',
 'be',
 'the',
 'new',
 'revolution',
 'in',
 'the',
 'word',
 ',',
 'in',
 'a',
 'day',
 'one',
 'individual',
 'would',
 'generate',
 'terabytes',
 'of',
 'data']
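Passing `pos='v'` treats every token as a verb, which is why `is` becomes `be` while the plural noun `terabytes` is left as it is. A common approach is to choose the pos per token from its Penn Treebank tag (POS tagging is covered in the next section). The sketch below shows only the tag mapping; `penn_to_wordnet` is a hypothetical helper, and it returns the single-letter codes `'a'`, `'v'`, `'r'` and `'n'` that the WordNet lemmatizer accepts for adjectives, verbs, adverbs and nouns.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS
        return "r"
    return "n"                # default to noun for everything else


print([penn_to_wordnet(t) for t in ["VBZ", "NNS", "JJ", "RB", "DT"]])
# ['v', 'n', 'a', 'r', 'n']

# Possible use with the lemmatizer defined above:
# lemma.lemmatize(word, pos=penn_to_wordnet(tag))
```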

### Part of Speech(POS) Tagging

A part-of-speech (POS) tag tells us how a word is used in a sentence. Within a sentence, the same word can appear in different contexts with different meanings. Basic natural language processing (NLP) models such as bag-of-words (BoW) fail to capture these relationships between words. POS tagging addresses this by marking each word with a tag based on its context in the data. POS tags are also used to extract relationships between words.
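To illustrate the limitation of bag-of-words, the short sketch below builds word counts for two sentences with opposite meanings. It uses only `collections.Counter` and a whitespace split; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(text):
    # Count each lowercase token; word order is discarded.
    return Counter(text.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Identical counts, even though subject and object are swapped.
print(first == second)  # True
```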

In [24]:
# importing the tokenizer and the POS tagger
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

# convert text into word_token with their tags
def pos_tagg(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

pos_tagg('Are you afraid of something?')

In the above example, NNP stands for proper noun, PRP stands for personal pronoun, and IN stands for preposition. We can get the details of all POS tags using the Penn Treebank tagset.

In [22]:
# downloading the tagset
nltk.download('tagsets')

# extract information about the tag
nltk.help.upenn_tagset('PRP')

### Chunking

Chunking is the process of extracting phrases from unstructured text and giving them more structure. It is also called shallow parsing. It is done on top of POS tagging and groups words into chunks, mainly noun phrases. We define the chunks using regular expressions over POS tags.

In [25]:
# importing libraries
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# here we define a chunking function that takes the text and a regular
# expression representing the grammar as parameters
def chunking(text, grammar):
    word_tokens = word_tokenize(text)

    # label words with pos
    word_pos = pos_tag(word_tokens)

    # create chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)

    # test it on the list of word tokens with tagged pos
    tree = chunkParser.parse(word_pos)

    for subtree in tree.subtrees():
        print(subtree)
    # tree.draw()


sentence = 'the little red parrot is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)
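The grammar `NP: {<DT>?<JJ>*<NN>}` reads as: an optional determiner, followed by any number of adjectives, followed by a noun. To make the pattern concrete without the tagger, here is a plain-Python sketch that applies the same rule to tokens that are already tagged. The tags in `tagged` are assigned by hand for this illustration, and `np_chunks` is a hypothetical helper, not an NLTK function.

```python
def np_chunks(tagged):
    """Collect spans matching an optional DT, any JJs, then one NN."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":      # <DT>?
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # <JJ>*
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":      # <NN>
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1
    return chunks


# Hand-assigned tags (assumption), not output from pos_tag.
tagged = [("the", "DT"), ("little", "JJ"), ("red", "JJ"), ("parrot", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]

print(np_chunks(tagged))
# [['the', 'little', 'red', 'parrot'], ['the', 'sky']]
```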


