We'll use the NLTK (Natural Language Toolkit) library here.

In [1]:
# import necessary libraries
import nltk
import string
import re

### 1. Text Lowercase

In [2]:
def lowercase_text(text):
    return text.lower()

input_str = "Weather in Singapore is just too Hot!"
lowercase_text(input_str)

'weather in singapore is just too hot!'

### 2. Remove Numbers

In [3]:
def remove_num(text):
    result = re.sub(r'\d+', '', text)
    return result

input_str = "Today is first day of 2024!!"
remove_num(input_str)

'Today is first day of !!'
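The pattern above removes runs of digits only, so a decimal such as `3.5` would leave a stray `.` behind, and removed numbers leave double spaces. A minimal sketch of one way to handle both, assuming that behaviour is wanted (`remove_num_clean` is a hypothetical helper name, not part of the original notebook):

```python
import re

def remove_num_clean(text):
    # drop integers and simple decimals such as "3.5"
    without_numbers = re.sub(r'\d+(?:\.\d+)?', '', text)
    # collapse the runs of whitespace left behind by the removed numbers
    return re.sub(r'\s{2,}', ' ', without_numbers).strip()

print(remove_num_clean("It rained 3.5 cm on 2 days in 2024!!"))
# prints: It rained cm on days in !!
```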

Another approach is to convert numbers into words instead of removing them. This can be done with the inflect library.

In [7]:
import inflect
q = inflect.engine()

# convert number into text
def convert_num(text):
    # split strings into list of texts
    temp_string = text.split()
    # initialize empty list
    new_str = []
    
    for word in temp_string:
        if word.isdigit():
            temp = q.number_to_words(word)
            new_str.append(temp)
        else:
            new_str.append(word)
            
    # join the texts of new_str to form a string
    temp_str = " ".join(new_str)
    return temp_str


input_str = "Today I bought 5 packets of rice, 2 packets of biscuit, 1 full trolly of snacks."
convert_num(input_str)

'Today I bought five packets of rice, two packets of biscuit, one full trolly of snacks.'
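Because `convert_num` splits on whitespace and tests each piece with `isdigit()`, a number with attached punctuation such as `5,` is left unchanged. The sketch below shows one possible alternative that substitutes digits in place with a regular expression. To keep it runnable without third-party packages it uses a tiny lookup table (`SMALL_NUMBERS`, a hypothetical stand-in for `inflect`); the assumption is that a real converter such as `q.number_to_words` could be passed as `to_words` instead.

```python
import re

# tiny illustrative stand-in for a number-to-words converter (assumption)
SMALL_NUMBERS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def lookup_small_number(digits):
    # fall back to the original digits when the number is not in the table
    return SMALL_NUMBERS.get(digits, digits)

def convert_num_in_place(text, to_words=lookup_small_number):
    # \d+ matches each run of digits, even when punctuation is attached
    return re.sub(r'\d+', lambda match: to_words(match.group()), text)

print(convert_num_in_place("I bought 5, then 2 more."))
# prints: I bought five, then two more.
```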

### 3. Remove Punctuation

In [11]:
def remove_punct(text):
    translator = str.maketrans('','',string.punctuation)
    return text.translate(translator)

input_str = "Hey, are you excited??? I am really looking forward to go to Japan!!!"
remove_punct(input_str)

'Hey are you excited I am really looking forward to go to Japan'
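`string.punctuation` contains ASCII characters only, so curly quotes or an ellipsis character would pass through `remove_punct` untouched. A rough sketch of a Unicode-aware variant, using the standard `unicodedata` module (`remove_punct_unicode` is a hypothetical name introduced here):

```python
import unicodedata

def remove_punct_unicode(text):
    # Unicode general categories starting with "P" are the punctuation classes
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

# \u201c and \u201d are curly quotes, \u2026 is the ellipsis character
sample = "\u201cHey\u201d, are you excited\u2026 or not?"
print(remove_punct_unicode(sample))
# prints: Hey are you excited or not
```

Note that characters such as `$` and `+` are classified as symbols rather than punctuation in Unicode, so this variant keeps them, whereas `string.punctuation` includes them.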

### 4. Remove Stopwords
**`Stopwords`** are common words that contribute little to the meaning of a sentence. Hence, they can usually be removed without changing the meaning much. The NLTK (Natural Language Toolkit) library ships a list of stopwords, which we can use to remove stopwords from our text and return a list of word tokens.

Examples of stop words in English are “a,” “the,” “is,” and “are.”

In [13]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

def rem_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text


input_str = "I likes A.I and Machine Learning."
rem_stopwords(input_str)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['I', 'likes', 'A.I', 'Machine', 'Learning', '.']
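In the output above, `and` is removed but `I` survives. The likely reason is that the stopword list is lowercase while the comparison is case-sensitive. The sketch below illustrates a case-insensitive filter; `SAMPLE_STOP_WORDS` is a small hand-picked stand-in for NLTK's list (an assumption, so that the example runs without downloading any data).

```python
# small hand-picked stand-in for NLTK's English stopword list (assumption)
SAMPLE_STOP_WORDS = {"i", "a", "and", "the", "is", "are"}

def filter_stopwords(tokens, stop_words=SAMPLE_STOP_WORDS):
    # compare the lowercased token so "I" and "The" are caught as well
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["I", "likes", "A.I", "and", "Machine", "Learning", "."]
print(filter_stopwords(tokens))
# prints: ['likes', 'A.I', 'Machine', 'Learning', '.']
```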

### 5. Stemming
Stemming is the process of reducing a word to its root form. The root, or stem, is the part to which affixes (like -ed, -ize, etc.) are added. A stemmer produces the stem by stripping the prefix or suffix of a word, so stemming a word may not result in an actual word.

**Example:** 
            
            Mangoes --> Mango

            Boys    --> Boy 

            going   --> go

If the text is not yet tokenized, we first need to convert it into tokens. Once the strings of text are split into tokens, we can reduce those word tokens to their root form. NLTK provides several stemmers for this: the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer. The Porter Stemmer is the one most commonly used.

In [16]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
stem1 = PorterStemmer()

def stem_words(text):
    word_token = word_tokenize(text)
    stems = [stem1.stem(word) for word in word_token]
    return stems

input_str = "Wishing everyone has a great great new year!"
stem_words(input_str)

['wish', 'everyon', 'ha', 'a', 'great', 'great', 'new', 'year', '!']
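To make the idea of suffix stripping concrete, here is a deliberately naive toy stemmer. It is not the Porter algorithm (which applies many ordered rules with conditions); `toy_stem` and its suffix list are assumptions made purely for illustration. It shows why a stem such as `hous` need not be a real word.

```python
def toy_stem(word, suffixes=("ing", "es", "ed", "s")):
    # lowercase first, since stems are usually compared case-insensitively
    word = word.lower()
    for suffix in suffixes:
        # strip the first matching suffix, keeping at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["Wishing", "mangoes", "boys", "houses", "has"]])
# prints: ['wish', 'mango', 'boy', 'hous', 'has']
```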

### 6. Lemmatization
Like stemming, lemmatization reduces a word to a base form, but it ensures that the result (the lemma) is a valid word in the language. In NLTK (Natural Language Toolkit), we use the `WordNetLemmatizer` to get the lemmas of words. The lemmatizer also needs context, so we pass the part of speech as the `pos` parameter.


In [19]:
from nltk.stem import wordnet
from nltk.tokenize import word_tokenize

lemma = wordnet.WordNetLemmatizer()
nltk.download('wordnet')

# lemmatize string 
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context i.e. part-of-speech(pos)
    lemmas = [lemma.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas

input_str = "Wishing everyone has a great great new year!"
lemmatize_word(input_str)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Wishing', 'everyone', 'have', 'a', 'great', 'great', 'new', 'year', '!']
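The function above treats every token as a verb (`pos='v'`). A common refinement is to choose the `pos` value per token from its Penn Treebank tag, such as those returned by `nltk.pos_tag`. The mapping below is a minimal sketch of that idea; `penn_to_wordnet_pos` is a hypothetical helper, and defaulting to noun is an assumption.

```python
def penn_to_wordnet_pos(penn_tag, default="n"):
    # WordNet uses single letters: n (noun), v (verb), a (adjective), r (adverb)
    first_letter_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return first_letter_map.get(penn_tag[:1], default)

tagged = [("Wishing", "VBG"), ("everyone", "NN"), ("has", "VBZ"), ("great", "JJ")]
print([(word, penn_to_wordnet_pos(tag)) for word, tag in tagged])
# prints: [('Wishing', 'v'), ('everyone', 'n'), ('has', 'v'), ('great', 'a')]
```

The returned letter could then be passed as the `pos` argument of `lemma.lemmatize(word, pos=...)` for each token.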

### 7. Part of Speech (POS)
A part-of-speech (POS) tag describes how a word is used in a sentence. The same word can play different grammatical roles and carry different meanings depending on context. Basic natural language processing (NLP) models such as bag-of-words (BOW) fail to capture these relations between words. POS tagging addresses this by labelling each word with a POS tag based on its context. POS tags are also used to extract relationships between words.

In [20]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

def pos_taggg(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

input_str = "Wishing everyone has a great great new year!"
pos_taggg(input_str)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Wishing', 'VBG'),
 ('everyone', 'NN'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('great', 'JJ'),
 ('new', 'JJ'),
 ('year', 'NN'),
 ('!', '.')]

* VBG --> Verb, gerund or present participle (judging)
* NN --> Noun, common, singular or mass
* VBZ --> Verb, present tense, 3rd person singular (has)
* DT --> Determiner
* JJ --> Adjective (large)
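Since the tagger returns a plain list of `(word, tag)` tuples, it can be summarised with ordinary Python. The sketch below counts tags and picks out the adjectives; the `tagged` list is copied from the output shown above rather than recomputed.

```python
from collections import Counter

# tagged tokens copied from the tagger output shown above
tagged = [("Wishing", "VBG"), ("everyone", "NN"), ("has", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("great", "JJ"), ("new", "JJ"), ("year", "NN"), ("!", ".")]

# count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)
# keep only the words tagged as adjectives (JJ, JJR, JJS)
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]

print(tag_counts.most_common(2))  # prints: [('JJ', 3), ('NN', 2)]
print(adjectives)                 # prints: ['great', 'great', 'new']
```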

In [21]:
nltk.download('tagsets')

# extract information about the tag
nltk.help.upenn_tagset('PRP')

PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [25]:
# extract information about the tag
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


### 8. Chunking
Chunking is the process of extracting phrases from unstructured text to give it more structure. It is also called shallow parsing. It works on top of POS tagging and groups words into chunks, mainly noun phrases. Here we define the chunks with a regular-expression grammar.

In [28]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

def chuncking(text, grammar):
    word_tokens = word_tokenize(text)
    
    # label words with pos
    word_pos = pos_tag(word_tokens)
    
    # create chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)
    
    # test it on the list of word tokens with tagged POS
    tree = chunkParser.parse(word_pos)
    
    for subtree in tree.subtrees():
        print(subtree)
        
sentence = "The little red parrot is flying in the sky"
grammar = "NP:{<DT>?<JJ>*<NN>}"

chuncking(sentence,grammar )

(S
  (NP The/DT little/JJ red/JJ parrot/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP The/DT little/JJ red/JJ parrot/NN)
(NP the/DT sky/NN)


In the example above, we defined the grammar using a regular-expression rule. This rule says that an NP (Noun Phrase) chunk should be formed whenever the chunker finds an optional **determiner (DT)** followed by any **number of adjectives (JJ)** and then a **noun (NN)**.
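To see how such a tag pattern can be matched, the sketch below applies the same rule with Python's `re` module over the tag sequence. It is an illustration of the idea only, not a description of how `nltk.RegexpParser` is implemented; `np_chunks` is a hypothetical helper, and the tagged tokens are copied from the output above.

```python
import re

def np_chunks(tagged_tokens):
    # encode the tag sequence as "<DT><JJ><JJ><NN>..." so a plain regex can scan it
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    # optional determiner, any number of adjectives, then a noun
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        # convert character offsets back to token indices by counting "<"
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged_tokens[start:end]])
    return chunks

tagged = [("The", "DT"), ("little", "JJ"), ("red", "JJ"), ("parrot", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"), ("the", "DT"), ("sky", "NN")]
print(np_chunks(tagged))
# prints: [['The', 'little', 'red', 'parrot'], ['the', 'sky']]
```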

### 9. Named Entity Recognition (NER)
NER extracts information from unstructured text. It classifies the entities present in the text into categories such as person, organization, event, place, etc. This gives you detailed knowledge about the text and the relationships between the different entities.

In [38]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

def ner(text):
    word_tokens = word_tokenize(text)
    
    # pos-tagging of words
    word_pos = pos_tag(word_tokens)
    
    # tree of word entities
    print(ne_chunk(word_pos))
    
    
input_str = "Brain Lara scored the highest 400 runs in a test match which played in between WI and England!!"
ner(input_str)

(S
  (PERSON Brain/NNP)
  (PERSON Lara/NNP)
  scored/VBD
  the/DT
  highest/JJS
  400/CD
  runs/NNS
  in/IN
  a/DT
  test/NN
  match/NN
  which/WDT
  played/VBD
  in/IN
  between/IN
  (ORGANIZATION WI/NNP)
  and/CC
  (GPE England/NNP)
  !/.
  !/.)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
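In the tree above, the two name tokens are tagged as two separate `PERSON` chunks. One simple post-processing step is to merge adjacent tokens that share a label. The sketch below works on `(label, token)` pairs copied by hand from part of that output (with `None` for non-entity tokens); `merge_adjacent_entities` is a hypothetical helper, and the rule would wrongly merge two different people named back to back.

```python
# (label, token) pairs copied by hand from the tree above; None marks a non-entity token
labelled = [("PERSON", "Brain"), ("PERSON", "Lara"), (None, "scored"),
            ("ORGANIZATION", "WI"), (None, "and"), ("GPE", "England")]

def merge_adjacent_entities(labelled_tokens):
    merged = []
    previous_label = None
    for label, token in labelled_tokens:
        if label is not None and label == previous_label:
            # same label as the previous token: extend the last entity
            merged[-1] = (label, merged[-1][1] + " " + token)
        elif label is not None:
            # a new entity starts here
            merged.append((label, token))
        previous_label = label
    return merged

print(merge_adjacent_entities(labelled))
# prints: [('PERSON', 'Brain Lara'), ('ORGANIZATION', 'WI'), ('GPE', 'England')]
```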


### 10. Understand Regex