# ***Natural Language Processing***
*Natural Language Processing (NLP) teaches computers to understand and produce human language. It lets them read, write, and make sense of text and speech, so they can interact with us more naturally.*


***Corpus:***
*A large collection of documents used for training NLP models. A document is a single text in the corpus, and words are the units that make up a document.*

In [6]:
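# Example: a corpus is a list of documents
corpus = ['Iam Lithikhaa', 'Iam a student at the moment']
print('The corpus: ', corpus)

# A document is one text from the corpus
document = 'Iam Lithikhaa'
print('The Document: ', document)

# Words are the units that make up a document
words = ['Interested', 'in', 'machinelearning', 'field']
print('The Words: ', words)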


The corpus:  ['Iam Lithikhaa', 'Iam a student at the moment']
The Document:  Iam Lithikhaa
The Words:  ['Interested', 'in', 'machinelearning', 'field']
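
The relationship between corpus, document, and words can be sketched in plain Python. The sample sentences are made up for illustration, and `str.split` is only a crude stand-in for a real tokenizer.

```python
# A corpus is a list of documents; each document is a string of words.
# The sentences below are made-up sample data.
corpus = [
    "I am a student",
    "I am interested in machine learning",
]

# Split every document into words with str.split (a stand-in for a tokenizer).
words_per_document = [document.split() for document in corpus]

for document, words in zip(corpus, words_per_document):
    print(f"{len(words)} words in document: {document!r}")

total_words = sum(len(words) for words in words_per_document)
print("Documents in corpus:", len(corpus))
print("Words in corpus:", total_words)
```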


***Tokens***

*Before text can be processed, it is split into smaller pieces (tokens), such as words and punctuation marks, so the content can be analyzed and understood more easily.*

In [8]:
!pip install nltk



***punkt***

*A pre-trained model for sentence boundary detection, also known as sentence tokenization. `word_tokenize` relies on it to split text into sentences before splitting each sentence into words.*

In [10]:
import nltk
from nltk.tokenize import word_tokenize

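# Recent NLTK releases may ask for the 'punkt_tab' resource instead;
# download whichever resource the error message names.
nltk.download('punkt')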

sentence = "Iam a student at the moment!!"
tokens = word_tokenize(sentence)
print("Tokens: ", tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokens:  ['Iam', 'a', 'student', 'at', 'the', 'moment', '!', '!']


***Vocabulary***

*The set of unique tokens found in a corpus. This represents the words and symbols that the model or algorithm has been trained to recognize and understand.*

In [14]:
sentence = "Iam a student at the moment in a college!!"

tokens = word_tokenize(sentence)
print("Tokens: ", tokens)

# A set keeps only the unique tokens
vocabulary = set(tokens)
print("Vocabulary: ", vocabulary)



Tokens:  ['Iam', 'a', 'student', 'at', 'the', 'moment', 'in', 'a', 'college', '!', '!']
Vocabulary:  {'Iam', 'a', 'in', 'moment', 'college', '!', 'student', 'at', 'the'}
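
A set has no guaranteed order, which is why the vocabulary above is printed in a different order than the tokens. The following is a minimal sketch, assuming a model needs a reproducible token-to-id mapping and token counts; the names `token_counts` and `token_to_id` are illustrative and not part of NLTK.

```python
from collections import Counter

# Tokens copied from the output above.
tokens = ['Iam', 'a', 'student', 'at', 'the', 'moment', 'in', 'a', 'college', '!', '!']

# Count how often each token occurs; the keys form the vocabulary.
token_counts = Counter(tokens)

# Sort the vocabulary so the token-to-id mapping is the same on every run.
vocabulary = sorted(token_counts)
token_to_id = {token: index for index, token in enumerate(vocabulary)}

print("Vocabulary size:", len(vocabulary))
print("Most common tokens:", token_counts.most_common(2))
print("Token ids:", [token_to_id[token] for token in tokens])
```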


In [17]:
from nltk.tokenize import wordpunct_tokenize

line = """ Sometimes we get bored:( ,so
just do whatever makes you happy."""

wordpunct_tokenize(line)

['Sometimes',
 'we',
 'get',
 'bored',
 ':(',
 ',',
 'so',
 'just',
 'do',
 'whatever',
 'makes',
 'you',
 'happy',
 '.']

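*Difference between wordpunct_tokenize and word_tokenize: wordpunct_tokenize keeps a run of punctuation together as one token (for example ':('), while word_tokenize splits it into separate tokens (':' and '(').*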

In [20]:
text = """ Sometimes    we get bored:( ,so
just do whatever makes you happy."""

# Using wordpunct_tokenize
tokens_wp = wordpunct_tokenize(text)
print("wordpunct_tokenize:", tokens_wp)


# Using word_tokenize
tokens_wt = word_tokenize(text)
print("word_tokenize:", tokens_wt)

wordpunct_tokenize: ['Sometimes', 'we', 'get', 'bored', ':(', ',', 'so', 'just', 'do', 'whatever', 'makes', 'you', 'happy', '.']
word_tokenize: ['Sometimes', 'we', 'get', 'bored', ':', '(', ',', 'so', 'just', 'do', 'whatever', 'makes', 'you', 'happy', '.']
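
The grouping of `:(` into one token follows from how `wordpunct_tokenize` works: according to the NLTK documentation it is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below reproduces that behavior with the standard `re` module only; it is an illustration, not NLTK's own code.

```python
import re

text = """ Sometimes    we get bored:( ,so
just do whatever makes you happy."""

# \w+      matches a run of letters, digits, or underscores
# [^\w\s]+ matches a run of characters that are neither word characters nor whitespace
wordpunct_pattern = r"\w+|[^\w\s]+"
regex_tokens = re.findall(wordpunct_pattern, text)
print(regex_tokens)
```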


***Sentence Tokenization***

*Splits text into individual sentences based on punctuation marks like periods, exclamation points, or question marks*

In [27]:
text = """ Sometimes    we get bored:( ,so
just do whatever makes you happy."""

sentences = nltk.sent_tokenize(text)
print("sent_tokenize:", sentences)

text = " Sometimes we get bored!. so just do whatever makes you happy."

sentences = nltk.sent_tokenize(text)
print("sent_tokenize:", sentences)



sent_tokenize: [' Sometimes    we get bored:( ,so\njust do whatever makes you happy.']
sent_tokenize: [' Sometimes we get bored!.', 'so just do whatever makes you happy.']
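
Splitting on punctuation alone is not always enough, which is why a trained model such as Punkt is used. The following is a minimal rule-based sketch; `naive_sent_split` is a hypothetical helper written for this illustration and the sample sentences are made up. It breaks after every `.`, `!`, or `?` and therefore also splits after the abbreviation "Dr.".

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (rule-based sketch)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]

simple = naive_sent_split("We get bored sometimes. Just do whatever makes you happy!")
print(simple)

# The abbreviation "Dr." is wrongly treated as the end of a sentence.
with_abbreviation = naive_sent_split("Dr. Smith arrived late. The lecture had started.")
print(with_abbreviation)
```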


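***Whitespace Tokenization***

*Splits text on whitespace only, so punctuation stays attached to the neighboring word.*

In [32]: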

from nltk.tokenize import WhitespaceTokenizer

text = " Sometimes we get bored  !. so just do whatever makes     you happy."
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_tokens = whitespace_tokenizer.tokenize(text)

print("Whitespace Tokenization: ", whitespace_tokens)

Whitespace Tokenization:  ['Sometimes', 'we', 'get', 'bored', '!.', 'so', 'just', 'do', 'whatever', 'makes', 'you', 'happy.']
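
For comparison, the same result can be obtained with Python's built-in `str.split`, which the NLTK documentation itself suggests for this case. A short sketch that also shows why the argument-free form matters:

```python
text = " Sometimes we get bored  !. so just do whatever makes     you happy."

# With no argument, str.split splits on any run of whitespace and drops empty strings.
split_tokens = text.split()
print(split_tokens)

# Splitting on a single space keeps an empty string for every extra space.
single_space_tokens = text.split(" ")
print(len(split_tokens), len(single_space_tokens))
```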


***NGram Tokenization***

*Creates tokens by combining N consecutive tokens (words or punctuation marks) from the text.*

In [38]:
from nltk.util import ngrams

text = "Sometimes we get bored!. so just do whatever makes you happy. "

word_tokens = word_tokenize(text)
n = 3
ngram_tokens = list(ngrams(word_tokens, n))
print("N-Gram Tokenization : ", ngram_tokens)

N-Gram Tokenization :  [('Sometimes', 'we', 'get'), ('we', 'get', 'bored'), ('get', 'bored', '!'), ('bored', '!', '.'), ('!', '.', 'so'), ('.', 'so', 'just'), ('so', 'just', 'do'), ('just', 'do', 'whatever'), ('do', 'whatever', 'makes'), ('whatever', 'makes', 'you'), ('makes', 'you', 'happy'), ('you', 'happy', '.')]
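
The sliding window behind n-grams can be sketched without NLTK. `make_ngrams` is a hypothetical helper for illustration: it zips the token list with shifted copies of itself, so a list of `len(tokens)` tokens yields `len(tokens) - n + 1` n-grams.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped position by position.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Sometimes', 'we', 'get', 'bored']
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```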


***Custom Tokenization***

*Tailors tokenization rules based on specific requirements or domain-specific knowledge.*


**re.IGNORECASE**: This flag makes the search case-insensitive, so the pattern matches "Heart", "HEART", and "heart" equally.

**re.findall**: Returns a list of all non-overlapping matches of the pattern in the text.

**Literal characters**: Characters that match themselves exactly as written. In the string "Hello, World!", H, e, l, l, o, the comma, the space, W, o, r, l, d, and ! are all literal characters.

**r**:
r stands for "raw string".
In regular expressions, the backslash (\) introduces special sequences such as \d for a digit or \b for a word boundary. Python string literals also use the backslash for escapes such as \n for a newline, so in a regular string every regex backslash has to be doubled, which makes patterns hard to read and write.

In a raw string (written r'...'), Python leaves backslashes as they are and does not interpret them as escape sequences.

**For example**:

without_r = "This is a newline character: \\n"

with_r = r"This is a newline character: \n"

Both assignments produce the same text, ending in a backslash followed by the letter n.
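
The effect of the `r` prefix can be checked directly. A small sketch with made-up sample text:

```python
import re

escaped = "\\d+"   # regular string: the backslash must be doubled
raw = r"\d+"       # raw string: the backslash is written once

# Both variables hold the same three characters: backslash, d, plus.
same_pattern = escaped == raw
print(same_pattern)

# "\n" is a single newline character; r"\n" is a backslash followed by 'n'.
print(len("\n"), len(r"\n"))

numbers = re.findall(raw, "Room 12, floor 3")
print(numbers)
```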

In [43]:
# Example: match a string that includes either "cat" or "dog" followed by
# "house", without capturing "cat" or "dog" separately.

import re
pattern = r'(?:cat|dog) house'
text = 'I have a cat house and a dog house.'

matches = re.findall(pattern, text)
print(matches)


# Example: Suppose you want to match the word "cat" but not "catalog".
pattern = r'\bcat\b'
text = 'The cat is on the catalog.'

matches = re.findall(pattern, text)
print(matches)



['cat house', 'dog house']
['cat']
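
The `(?:...)` syntax marks a non-capturing group, and it changes what `re.findall` returns. The sketch below compares it with an ordinary capturing group on the same text; the variable names are illustrative.

```python
import re

text = 'I have a cat house and a dog house.'

# Non-capturing group: findall returns the whole match.
whole_matches = re.findall(r'(?:cat|dog) house', text)
print(whole_matches)

# Capturing group: findall returns only what the group captured.
captured_only = re.findall(r'(cat|dog) house', text)
print(captured_only)
```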


In [44]:
import re

def custom_tokenize(text):
    # Match whole words only (\b): body parts or medical abbreviations
    medical_terms_pattern = r'\b(?:heart|lung|brain|COPD|MRI|ECG)\b'
    return re.findall(medical_terms_pattern, text, flags=re.IGNORECASE)


medical_text = "The patient underwent an MRI scan to examine the brain. COPD is a chronic lung disease."
custom_tokens = custom_tokenize(medical_text)
print("Custom Tokenization (Medical Terms and Abbreviations):  ", custom_tokens)

Custom Tokenization (Medical Terms and Abbreviations):   ['MRI', 'brain', 'COPD', 'lung']
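
When the position of each term matters, for example to highlight it in the original text, `re.finditer` can be used instead of `re.findall`. This is a minimal sketch under that assumption; `tokenize_with_spans` and `MEDICAL_TERMS` are hypothetical names introduced for illustration.

```python
import re

# Same term list as above, compiled once for reuse.
MEDICAL_TERMS = re.compile(r'\b(?:heart|lung|brain|COPD|MRI|ECG)\b', flags=re.IGNORECASE)

def tokenize_with_spans(text):
    """Return (token, start, end) for every medical term found in the text."""
    return [(m.group(), m.start(), m.end()) for m in MEDICAL_TERMS.finditer(text)]

medical_text = "The patient underwent an MRI scan to examine the brain. COPD is a chronic lung disease."
spans = tokenize_with_spans(medical_text)
for token, start, end in spans:
    print(f"{token!r} at characters {start}-{end}")
```

Slicing the text with a returned span, `medical_text[start:end]`, gives back the token itself.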
