We'll use the NLTK (Natural Language Toolkit) library here.

In [1]:
# import necessary libraries
import nltk
import string
import re

### 1. Text Lowercase

In [2]:
def lowercase_text(text):
    return text.lower()

input_str = "Weather in Singapore is just too Hot!"
lowercase_text(input_str)

'weather in singapore is just too hot!'

### 2. Remove Numbers

In [3]:
def remove_num(text):
    # match every run of digits, wherever it appears in the text
    result = re.sub(r'\d+', '', text)
    return result

input_str = "Today is first day of 2024!!"
remove_num(input_str)

'Today is first day of !!'
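
Removing digits from the middle of a sentence can leave doubled or trailing spaces behind. The following is a minimal sketch of one way to tidy that up; `remove_num_clean` is a hypothetical helper name, not part of the original notebook.

```python
import re

def remove_num_clean(text):
    # Hypothetical helper: drop every run of digits, wherever it appears.
    no_digits = re.sub(r"\d+", "", text)
    # Collapse the whitespace left behind and trim both ends.
    return re.sub(r"\s+", " ", no_digits).strip()

print(remove_num_clean("I bought 5 apples and 12 pears in 2024"))
```

Whether the extra whitespace matters depends on the tokenizer used later; tokenizers that split on any run of whitespace would not be affected.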

Another approach is to convert numbers into words, which can be done with the inflect library.

In [4]:
import inflect
q = inflect.engine()

# convert number into text
def convert_num(text):
    # split strings into list of texts
    temp_string = text.split()
    # initialize empty list
    new_str = []
    
    for word in temp_string:
        if word.isdigit():
            temp = q.number_to_words(word)
            new_str.append(temp)
        else:
            new_str.append(word)
            
    # join the texts of new_str to form a string
    temp_str = " ".join(new_str)
    return temp_str


input_str = "Today I bought 5 packets of rice, 2 packets of biscuit, 1 full trolly of snacks."
convert_num(input_str)

'Today I bought five packets of rice, two packets of biscuit, one full trolly of snacks.'

### 3. Remove Punctuation

In [5]:
def remove_punct(text):
    translator = str.maketrans('','',string.punctuation)
    return text.translate(translator)

input_str = "Hey, are you excited??? I am really looking forward to go to Japan!!!"
remove_punct(input_str)

'Hey are you excited I am really looking forward to go to Japan'
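
Note that `string.punctuation` only contains ASCII punctuation, so characters such as curly quotes or the ellipsis character pass through `remove_punct` untouched. One possible alternative, sketched below, is to filter on the Unicode category of each character. The helper name `remove_punct_unicode` is an assumption made for illustration. Symbols such as `$` or `+` belong to the Unicode symbol categories rather than punctuation, so this variant keeps them.

```python
import unicodedata

def remove_punct_unicode(text):
    # Keep only characters whose Unicode category is not punctuation ("P*").
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

sample = "“Hey”, are you excited…"
print(remove_punct_unicode(sample))
```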

### 4. Remove Stopwords
**`Stopwords`** are words that contribute little to the meaning of a sentence. Hence, they can usually be removed safely without changing its meaning. The NLTK (Natural Language Toolkit) library ships a set of stopwords that we can use to remove stopwords from our text and return a list of word tokens.

Examples of stop words in English are “a,” “the,” “is,” and “are.”

In [6]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

def rem_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text


input_str = "I likes A.I and Machine Learning."
rem_stopwords(input_str)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['I', 'likes', 'A.I', 'Machine', 'Learning', '.']
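
In the output above, 'I' survives even though it is a common stop word, because the comparison is case-sensitive and NLTK's English stopword list is stored in lowercase. A minimal sketch of a case-insensitive filter follows. To keep it self-contained, it uses a tiny hand-picked stopword set (`SAMPLE_STOP_WORDS`, an assumption for illustration) and a pre-tokenized list instead of the NLTK corpus and tokenizer.

```python
# A tiny hand-picked stopword set, used only for illustration;
# in the notebook this would be set(stopwords.words("english")).
SAMPLE_STOP_WORDS = {"i", "a", "the", "is", "are", "and"}

def rem_stopwords_ci(tokens, stop_words=SAMPLE_STOP_WORDS):
    # Compare the lowercased token so "I" and "The" are caught too,
    # while the surviving tokens keep their original casing.
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["I", "likes", "A.I", "and", "Machine", "Learning", "."]
print(rem_stopwords_ci(tokens))
```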

### 5. Stemming
Stemming is the process of getting the root form of a word. The root, or stem, is the part to which inflectional affixes (like -ed, -ize, etc.) are added. We create the stem by removing the prefix or suffix of a word, so stemming a word may not result in an actual word.

**Example:** 
            
            Mangoes --> Mango

            Boys    --> Boy 

            going   --> go

If the sentences are not tokenized, we first need to convert them into tokens. Once the text is split into tokens, we can convert those word tokens into their root form. NLTK provides several stemmers for this: the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer. The Porter Stemmer is the one we usually use.

In [7]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
stem1 = PorterStemmer()

def stem_words(text):
    word_token = word_tokenize(text)
    stems = [stem1.stem(word) for word in word_token]
    return stems

input_str = "Wishing everyone has a great great new year!"
stem_words(input_str)

['wish', 'everyon', 'ha', 'a', 'great', 'great', 'new', 'year', '!']

### 6. Lemmatization
Like stemming, lemmatization reduces a word to its root form. The difference is that lemmatization ensures the root word belongs to the language, so we get valid words. In NLTK (Natural Language Toolkit), we use the WordNetLemmatizer to get the lemmas of words. We also need to provide a context for the lemmatization, so we pass pos (part-of-speech) as a parameter.


In [8]:
from nltk.stem import wordnet
from nltk.tokenize import word_tokenize

lemma = wordnet.WordNetLemmatizer()
nltk.download('wordnet')

# lemmatize string 
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context i.e. part-of-speech(pos)
    lemmas = [lemma.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas

input_str = "Wishing everyone has a great great new year!"
lemmatize_word(input_str)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Wishing', 'everyone', 'have', 'a', 'great', 'great', 'new', 'year', '!']

### 7. Part of Speech (POS)
POS (part-of-speech) tags explain how a word is used in a sentence. Within a sentence, a word can have different contexts and semantic meanings. Basic natural language processing (NLP) models like bag-of-words (BOW) fail to identify these relations between words. For that reason, we use POS tagging to mark each word with its POS tag based on its context in the data. POS tags are also used to extract relationships between words.

In [9]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

def pos_taggg(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

input_str = "Wishing everyone has a great great new year!"
pos_taggg(input_str)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Wishing', 'VBG'),
 ('everyone', 'NN'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('great', 'JJ'),
 ('new', 'JJ'),
 ('year', 'NN'),
 ('!', '.')]

* VBG --> Verb, gerund or present participle (judging)
* NN --> Noun, singular or mass
* VBZ --> Verb, present tense, 3rd person singular (has)
* DT --> Determiner
* JJ --> Adjective (large)

In [10]:
nltk.download('tagsets')

# extract information about the tag
nltk.help.upenn_tagset('PRP')

PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [11]:
# extract information about the tag
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


### 8. Chunking
Chunking is the process of extracting phrases from unstructured text and giving them more structure. It is also called shallow parsing. It is done on top of POS tagging and groups words into chunks, mainly noun phrases. We define the chunks using regular expressions.

In [12]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

def chuncking(text, grammar):
    word_tokens = word_tokenize(text)
    
    # label words with pos
    word_pos = pos_tag(word_tokens)
    
    # create chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)
    
    # test it on the list of word tokens tagged with pos
    tree = chunkParser.parse(word_pos)
    
    for subtree in tree.subtrees():
        print(subtree)
        
sentence = "The little red parrot is flying in the sky"
grammar = "NP:{<DT>?<JJ>*<NN>}"

chuncking(sentence,grammar )

(S
  (NP The/DT little/JJ red/JJ parrot/NN)
  is/VBZ
  flying/VBG
  in/IN
  (NP the/DT sky/NN))
(NP The/DT little/JJ red/JJ parrot/NN)
(NP the/DT sky/NN)


In the example above, we defined the grammar using a regular expression rule. The rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional **determiner (DT)** followed by any **number of adjectives (JJ)** and then a **noun (NN)**.

### 9. Named Entity Recognition (NER)
NER is used to extract information from unstructured text. It classifies the entities present in the text into categories such as person, organization, event, place, etc. This gives you detailed knowledge about the text and the relationships between the different entities.

In [13]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

def ner(text):
    word_tokens = word_tokenize(text)
    
    # pos-tagging of words
    word_pos = pos_tag(word_tokens)
    
    # tree of word entities
    print(ne_chunk(word_pos))
    
    
input_str = "Brain Lara scored the highest 400 runs in a test match which played in between WI and England!!"
ner(input_str)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\6917\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


(S
  (PERSON Brain/NNP)
  (PERSON Lara/NNP)
  scored/VBD
  the/DT
  highest/JJS
  400/CD
  runs/NNS
  in/IN
  a/DT
  test/NN
  match/NN
  which/WDT
  played/VBD
  in/IN
  between/IN
  (ORGANIZATION WI/NNP)
  and/CC
  (GPE England/NNP)
  !/.
  !/.)


### 10. Understand Regex
* The re library in Python is used for string searching and manipulation
* It is frequently used for web scraping

**Example of \w+ and ^ Expression:**
* **^**  : This expression matches the start of a string
* **\w+** : This expression matches one or more alphanumeric (word) characters in the string


In [14]:
import re

sent = "2024 is stating now!"
r2 = re.findall(r"^\w+", sent)
r3 = re.findall(r"^\w", sent)
print(r2)
print("\n")
print(r3)

['2024']


['2']


The pattern **r"^\w+"** looks for one or more word characters at the beginning of the string.

When the + sign is removed from \w, the output gives only the first character of the first word.

##### 10.1 re.split() Function
* "\s" : This expression matches a whitespace character, so splitting on it breaks the string at each space

In [15]:
import re 

print(re.split(r'\s', 'We splitted this sentence.'))

['We', 'splitted', 'this', 'sentence.']


In [16]:
print(re.split(r's', 'We splitted this sentence.'))

['We ', 'plitted thi', ' ', 'entence.']


If the '\' is removed from '\s', the pattern matches the literal letter 's', so the sentence is split at every `'s'` and those letters are dropped from the result.
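
A related detail is that `\s` matches exactly one whitespace character, so consecutive spaces produce empty strings in the result. The short sketch below compares `\s` with `\s+` on a sample sentence chosen here for illustration.

```python
import re

text = "We  split   this sentence."

# One whitespace character per split: empty strings appear between them.
single = re.split(r"\s", text)
# One split per run of whitespace: no empty strings.
runs = re.split(r"\s+", text)

print(single)
print(runs)
```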

##### 10.2 RegEx Methods

###### 10.2.1 Using re.match()

The re.match() function tries to match a RegEx pattern at the beginning of a string, with optional flags. In the pattern below, each group "(i\w+)" matches a word that starts with "i", and "\W" matches the single non-word character between the two words. Strings that do not begin with such a pair of words are not matched. To check each element of the list, we run a for loop.

In [17]:
import re

lists = ['icecream images', 'i immitated', 'inner space', 'Single inferno']

for i in lists:
    # raw string so that \w and \W reach the regex engine unchanged
    match = re.match(r"(i\w+)\W(i\w+)", i)
    
    if match: 
        print(match.groups())

('icecream', 'images')


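The pattern requires the first word to start with the letter 'i' followed by one or more word characters (\w+), then a single non-word character (\W), and finally another word starting with 'i' followed by one or more word characters (\w+). Only 'icecream images' meets all of these conditions: in 'i immitated' the first 'i' is not followed by a word character, in 'inner space' the second word does not start with 'i', and 'Single inferno' does not start with 'i'.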

###### 10.2.2 Using re.search()
The search() function takes a pattern and a text, scans the whole string for the first location where the pattern matches, and returns a match object if the pattern is found, or None otherwise.
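
As a minimal sketch of how re.search() differs from re.match(), the snippet below applies the same pattern with both functions. The sample text and variable names are chosen here for illustration.

```python
import re

text = "Single inferno"
pattern = r"i\w+"

# re.match() only looks at the start of the string, so this is None.
at_start = re.match(pattern, text)
print(at_start)

# re.search() scans the whole string and stops at the first match.
found = re.search(pattern, text)
if found:
    print(found.group(), found.span())
```

The match object exposes the matched text through group() and its position through span().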

###### 10.2.3 Using re.findall()
The re.findall() function returns all non-overlapping matches of a pattern as a list in one go, which is handy when you want to **iterate over the lines of a file**. In the example below, we fetch all the email addresses from a string with re.findall().

In [18]:
import re

emails = 'cksckscks1223@gmail.com, maxlim 2412@hotmail.com, josephHong@gmail.com, felicia05@hotmail.com'

# avoid the name "format", which would shadow the Python built-in
found_emails = re.findall(r'[\w\.-]+@[\w\.-]+', emails)

for e in found_emails:
    print(e)

cksckscks1223@gmail.com
2412@hotmail.com
josephHong@gmail.com
felicia05@hotmail.com


In [19]:
import re 

aa = """Today is 
Firstday of 
2024"""


q1 = re.findall(r"^\w" , aa)
q2 = re.findall(r"^\w" , aa, re.MULTILINE)
print(q1)
print(q2)

['T']
['T', 'F', '2']


### 11. Text Normalization
Text normalization is the process of transforming text into a canonical (or standard) form. For example, "ok" and "k" can be transformed to "okay", their canonical form. Another example is mapping near-identical words such as "preprocessing", "pre-processing" and "pre processing" to just "preprocessing".

Text normalization is especially useful for noisy texts such as social media comments, comments on blog posts, and text messages, where abbreviations, misspellings, and out-of-vocabulary (OOV) words are prevalent.
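
One simple way to approach this is a lookup table that maps known variants to their canonical form. The sketch below is only an illustration: `NORMALIZATION_MAP` and `normalize_text` are hypothetical names, and a real table would have to be built for the target domain. Tokens with punctuation attached (such as "ok,") would not be matched by this version.

```python
import re

# Hypothetical lookup table, built from the examples in the text above.
NORMALIZATION_MAP = {
    "ok": "okay",
    "k": "okay",
    "pre-processing": "preprocessing",
}

def normalize_text(text, mapping=NORMALIZATION_MAP):
    # Handle the two-word variant first, since it spans two tokens.
    text = re.sub(r"\bpre processing\b", "preprocessing", text,
                  flags=re.IGNORECASE)
    # Replace each whitespace-separated token that has a canonical form.
    return " ".join(mapping.get(word.lower(), word) for word in text.split())

print(normalize_text("k the pre processing and pre-processing steps look ok"))
```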

### 12. Word Count
Word frequency tells us how many times each token appears in the text. When talking about word frequency, we distinguish between types and tokens.
* **Types** are the distinct words in a corpus
* **Tokens** are the words including repeats

First, tokenize the sentence using a tokenizer that splits on whitespace.

In [20]:
from nltk.tokenize.regexp import WhitespaceTokenizer
m = "There is no need to panic. We need to work together, take small yet important measure to conquer it"

In [21]:
tokens = WhitespaceTokenizer().tokenize(m)
print(tokens)

['There', 'is', 'no', 'need', 'to', 'panic.', 'We', 'need', 'to', 'work', 'together,', 'take', 'small', 'yet', 'important', 'measure', 'to', 'conquer', 'it']


In [22]:
print(len(tokens))

19


In [23]:
# use set() here because a set removes duplicate values
my_vocab = set(tokens)
print(len(my_vocab))

16


In [24]:
# different tokenizer 
from nltk.tokenize.regexp import WordPunctTokenizer

In [26]:
word = WordPunctTokenizer().tokenize(m)
print(len(word))

21


In [27]:
print(word)

['There', 'is', 'no', 'need', 'to', 'panic', '.', 'We', 'need', 'to', 'work', 'together', ',', 'take', 'small', 'yet', 'important', 'measure', 'to', 'conquer', 'it']


### 13. Frequency Distribution


In [28]:
import nltk
text = "There is no need to panic. We need to work together, take small yet important measure to conquer it"
freqDist = nltk.FreqDist(word_tokenize(text))
print(freqDist)

<FreqDist with 18 samples and 21 outcomes>


The FreqDist class works like a dictionary where the keys are the words in the text and the values are the counts associated with those words.

In [29]:
print(freqDist['need'])

2


In [33]:
Keys = freqDist.keys()
print(Keys)
print('\n')
print(type(Keys))

dict_keys(['There', 'is', 'no', 'need', 'to', 'panic', '.', 'We', 'work', 'together', ',', 'take', 'small', 'yet', 'important', 'measure', 'conquer', 'it'])


<class 'dict_keys'>
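
The same dictionary-like counting can be sketched with `collections.Counter` from the standard library, which FreqDist is built on. The token list below is copied from the tokenized output shown earlier so that the example does not need NLTK data; the variable names are chosen here for illustration.

```python
from collections import Counter

tokens = ['There', 'is', 'no', 'need', 'to', 'panic', '.', 'We', 'need',
          'to', 'work', 'together', ',', 'take', 'small', 'yet',
          'important', 'measure', 'to', 'conquer', 'it']

counts = Counter(tokens)

# Look up a single word, as with freqDist['need'].
print(counts['need'])
# The two most frequent tokens with their counts.
print(counts.most_common(2))
# Number of types (distinct tokens) and number of tokens.
print(len(counts), sum(counts.values()))
```

The last line gives the same two figures that FreqDist reports as samples and outcomes.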
