# NLP

Natural Language Processing (NLP) is a field that combines computer science, artificial intelligence and language studies. It helps computers understand, process and create human language in a way that makes sense and is useful. With the growing amount of text data from social media, websites and other sources, NLP is becoming a key tool to gain insights and automate tasks like analyzing text or translating languages.

## NLP Techniques

### 1. Text Processing and Preprocessing 
a) Tokenization: Dividing text into smaller units, such as words or sentences. \
b) Stemming and Lemmatization: Reducing words to their base or root forms. \
c) Stopword Removal: Removing common words (like "and", "the", "is") that may not carry significant meaning. \
d) Text Normalization: Standardizing text, including case normalization, removing punctuation and correcting spelling errors.
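
The normalization and stopword steps listed above can be sketched with the standard library alone. This is a minimal illustration rather than a complete preprocessing pipeline: the function name `normalize` and the three-word stopword set are assumptions made for the example, and spelling correction is left out.

```python
import string

# Hypothetical, deliberately tiny stopword set used only for this sketch
STOPWORDS = {"and", "the", "is"}

def normalize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The cat is sleeping, and the dog is barking!"))
# ['cat', 'sleeping', 'dog', 'barking']
```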

## Tokenization in NLP

Tokenization is a fundamental step in Natural Language Processing (NLP). It involves dividing a text input into smaller units known as tokens. These tokens can be words, characters, sub-words, or sentences. Breaking text into tokens gives models discrete units that they can count, index, and map to numbers. Let's look at how tokenization works.

![image.png](attachment:image.png)

### Types of Tokenization

#### 1. Word Tokenization
Word tokenization is the most commonly used method where text is divided into individual words. It works well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating" becomes:

Input before tokenization: ["Machine learning is fascinating"]

Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]

#### 2. Character Tokenization
In character tokenization, the text is split into a sequence of individual characters. This is beneficial for tasks that require fine-grained analysis, such as spelling correction, and for languages without clear word boundaries. It is also useful for character-level language modelling.

Example

Input before tokenization: ["You are helpful"]

Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]

#### 3. Sub-word Tokenization
This strikes a balance between word and character tokenization by breaking down text into units that are larger than a single character but smaller than a full word. This is useful when dealing with morphologically rich languages or rare words.

Example

"Timetable" becomes ["Time", "table"] \
"Raincoat" becomes ["Rain", "coat"] \
"Gracefully" becomes ["Grace", "fully"] \
"Runway" becomes ["Run", "way"]

Sub-word tokenization helps handle out-of-vocabulary words in NLP tasks, and it suits languages that form words by combining smaller units.
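
One simple way to produce such splits is a greedy longest-match lookup against a fixed vocabulary. The sketch below is only an illustration of that idea: the vocabulary `VOCAB` and the function `subword_tokenize` are hypothetical, and real sub-word tokenizers such as BPE or WordPiece learn their vocabulary from a corpus instead of using a hand-written set.

```python
# Hypothetical toy vocabulary; real sub-word vocabularies are learned from data
VOCAB = {"time", "table", "rain", "coat", "grace", "fully", "run", "way"}

def subword_tokenize(word, vocab=VOCAB):
    """Greedy longest-match-first split.

    Characters not covered by the vocabulary become single-character tokens.
    """
    word = word.lower()
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

for w in ["Timetable", "Raincoat", "Gracefully", "Runway"]:
    print(w, "->", subword_tokenize(w))
```

Because the lookup lowercases its input, the pieces come back in lower case. An unknown word such as "xyz" falls back to single characters, which is how this sketch avoids out-of-vocabulary failures.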

#### 4. Sentence Tokenization
Sentence tokenization divides a paragraph or a larger body of text into individual sentences, each treated as one token. This is useful for tasks that analyze or process one sentence at a time.

Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is fascinating. Computer Vision handles images. "]

Output when tokenized by sentences: ["Artificial Intelligence is an emerging technology.", "Machine learning is fascinating.", "Computer Vision handles images."]
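
A rough version of this split can be written with a regular expression. The helper `naive_sent_tokenize` below is a hypothetical name used for illustration. It splits after `.`, `!` or `?` when whitespace follows, so it will wrongly break on abbreviations such as "Dr."; trained tokenizers handle those cases better.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("Artificial Intelligence is an emerging technology. "
        "Machine learning is fascinating. Computer Vision handles images. ")
print(naive_sent_tokenize(text))
```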

#### 5. N-gram Tokenization
N-gram tokenization slides a window of size n over a sequence of tokens and emits every run of n consecutive tokens. With n = 2, the units are bigrams.

Input before tokenization: ["Machine learning is powerful"]

Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
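
The bigram output above can be reproduced with a short sliding-window function. This is a minimal sketch: `ngrams` is a name chosen for the example, and whitespace splitting stands in for a proper word tokenizer.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Machine learning is powerful".split()
print(ngrams(tokens, 2))
# [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
```

If n is larger than the number of tokens, the function returns an empty list.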

### Need for Tokenization
Tokenization is an essential step in text processing and natural language processing (NLP) for several reasons:

Effective Text Processing: Breaks raw text into manageable units, which makes statistical and computational analysis simpler and more efficient.\
Feature Extraction: Tokens serve as features in ML models, so text can be represented numerically.\
Information Retrieval: Indexing and search systems rely on tokenization to store and retrieve documents by word or phrase.\
Text Analysis: Tasks such as sentiment analysis and named entity recognition use tokens to determine the function and context of individual words in a sentence.\
Vocabulary Management: Produces the list of distinct tokens that makes up a corpus's vocabulary.\
Task-Specific Adaptation: The tokenization scheme can be chosen to suit a particular NLP task, such as summarization or machine translation.
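
The vocabulary-management and feature-extraction points can be made concrete with a small bag-of-words sketch. The two example documents are invented for illustration, and whitespace splitting stands in for whichever tokenizer is actually used.

```python
from collections import Counter

docs = [
    "machine learning is fascinating",
    "machine translation is useful",
]

# Whitespace tokenization keeps the sketch short; any tokenizer could be used
tokenized = [d.split() for d in docs]

# Vocabulary management: sorted list of the distinct tokens in the corpus
vocab = sorted({tok for doc in tokenized for tok in doc})

# Feature extraction: one count vector per document, aligned with vocab
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[tok] for tok in vocab])

print(vocab)
# ['fascinating', 'is', 'learning', 'machine', 'translation', 'useful']
print(vectors)
# [[1, 1, 1, 1, 0, 0], [0, 1, 0, 1, 1, 1]]
```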

## Coding

In [1]:
!pip install nltk






In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
## Tokenize sentence

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# Split the text into sentences
text = "Hello everyone. Welcome to the workshop. You are studying NLP article."
sentences = sent_tokenize(text)

# Output
print(sentences)


['Hello everyone.', 'Welcome to the workshop.', 'You are studying NLP article.']


In [12]:
import nltk.data

# Loading PunktSentenceTokenizer using English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)

['Hello everyone.',
 'Welcome to the workshop.',
 'You are studying NLP article.']

In [13]:
import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

['Hola amigo.', 'Estoy bien.']

In [14]:
## Word Tokenization

In [15]:
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to the workshop."
word_tokenize(text)

['Hello', 'everyone', '.', 'Welcome', 'to', 'the', 'workshop', '.']

In [16]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

In [17]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)

['Let', 's', 'see', 'how', 'it', 's', 'working']

In [19]:
! pip install -U pip setuptools wheel



In [20]:
! pip install -U spacy



In [22]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')





In [23]:
import spacy

# Create a blank English pipeline (tokenizer only)
# and tokenize the sentence into words
nlp = spacy.blank("en")

# Adjacent string literals are joined; the trailing space keeps "stop" and "learning" apart
doc = nlp("GeeksforGeeks is a one stop "
          "learning destination for geeks.")

for token in doc:
    print(token)

GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.


In [24]:
import spacy

# loading modules to the pipeline.
nlp = spacy.load("en_core_web_sm")

# Initialising doc with a sentence.
doc = nlp("If you want to be an excellent programmer \
, be consistent to practice daily on GFG.")

# Using properties of token i.e. Part of Speech and Lemmatization
for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)

If  |  subordinating conjunction  |  if
you  |  pronoun  |  you
want  |  verb  |  want
to  |  particle  |  to
be  |  auxiliary  |  be
an  |  determiner  |  an
excellent  |  adjective  |  excellent
programmer  |  noun  |  programmer
,  |  punctuation  |  ,
be  |  auxiliary  |  be
consistent  |  adjective  |  consistent
to  |  particle  |  to
practice  |  verb  |  practice
daily  |  adverb  |  daily
on  |  adposition  |  on
GFG  |  proper noun  |  GFG
.  |  punctuation  |  .


## Lemmatization vs. Stemming

### What is Lemmatization? 
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. This technique considers the context and the meaning of the words, ensuring that the base form belongs to the language's dictionary. For example, the words "running," "ran," and "runs" are all lemmatized to the lemma "run."

### How Lemmatization Works
Lemmatization involves several steps:

Part-of-Speech (POS) Tagging: Identifying the grammatical category of each word (e.g., noun, verb, adjective).\
Morphological Analysis: Analyzing the structure of the word to understand its root form.\
Dictionary Lookup: Using a predefined vocabulary to find the lemma of the word.

For example, the word "better" would be lemmatized to "good" if it is identified as an adjective, whereas "running" would be lemmatized to "run" if identified as a verb.

### Techniques in Lemmatization
Rule-Based Lemmatization: Uses predefined grammatical rules to transform words. For instance, removing the "-ed" suffix from regular past tense verbs.\
Dictionary-Based Lemmatization: Looks up words in a dictionary to find their base forms.\
Machine Learning-Based Lemmatization: Employs machine learning models trained on annotated corpora to predict the lemma of a word.
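
The first two techniques can be combined in a toy example: look the word up in a dictionary keyed by part of speech, and fall back to a suffix rule when the lookup fails. Everything here is hypothetical and chosen for illustration, including the three-entry `LEMMA_DICT`, the POS labels, and the single "-ed" rule; real lemmatizers rely on resources such as WordNet.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_DICT = {
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
    ("feet", "NOUN"): "foot",
}

def toy_lemmatize(word, pos):
    """Dictionary lookup first, then one fallback rule for regular '-ed' verbs."""
    word = word.lower()
    if (word, pos) in LEMMA_DICT:
        return LEMMA_DICT[(word, pos)]
    # Rule-based fallback: strip '-ed' from verbs
    # (ignores spelling changes such as 'stopped' -> 'stop')
    if pos == "VERB" and word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

print(toy_lemmatize("better", "ADJ"))   # good
print(toy_lemmatize("better", "ADV"))   # better
print(toy_lemmatize("walked", "VERB"))  # walk
```

The two calls with "better" show why the part of speech matters: the same surface form maps to different lemmas depending on its tag.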

### What is Stemming?
Stemming is a simpler process that strips affixes, in practice mostly suffixes, to reduce a word to a root form. This root form, known as the stem, may not be a valid word in the language. For example, the words "running," "runner," and "runs" might all be stemmed to "run" or "runn," depending on the stemming algorithm used.

### How Stemming Works
Stemming algorithms apply a series of rules to strip affixes from words. The most common stemming algorithms include:

Porter Stemmer: Uses a set of heuristic rules to iteratively remove suffixes.\
Snowball Stemmer: An extension of the Porter Stemmer with more robust rules.\
Lancaster Stemmer: A more aggressive stemmer that can sometimes over-stem words.

For example, the words "running", "runner", and "runs" might all be reduced to "run" by a stemming algorithm, but sometimes it might also reduce "arguing" to "argu".
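
The idea of rule-based suffix stripping can be shown with a toy stemmer. This sketch is not the Porter, Snowball, or Lancaster rule set: the suffix list `SUFFIXES` and the function `toy_stem` are assumptions made for the example, and the rules are far cruder than those of any real stemmer.

```python
# Hypothetical suffix list, checked in this order; not the Porter rule set
SUFFIXES = ("ing", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # 'runn' -> 'run': drop one of two identical trailing consonants
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runner", "runs", "arguing"]:
    print(w, "->", toy_stem(w))
# running -> run
# runner -> run
# runs -> run
# arguing -> argu
```

Note that "argu" is not a dictionary word, which is the trade-off stemming makes for speed and simplicity.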

In [1]:
### Practical Implementation: Lemmatization with NLTK in Python

In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Function to get POS tag
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
word = "running"
lemma = lemmatizer.lemmatize(word, get_wordnet_pos(word))
print(f"Lemmatized word: {lemma}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Lemmatized word: run


In [4]:
## Stemming with NLTK in Python

In [3]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stem = stemmer.stem(word)
print(f"Stemmed word: {stem}")

Stemmed word: run


In [None]:
## Natural Language Processing with Tokenization and Lemmatization

In [5]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "The striped bats are hanging on their feet for best"

# Tokenize the text
words = nltk.word_tokenize(text)

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

# Function to get the part of speech tag for lemmatization
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

# Print results
print("Original Text: ", text)
print("Tokenized Words: ", words)
print("Stemmed Words: ", stemmed_words)
print("Lemmatized Words: ", lemmatized_words)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Original Text:  The striped bats are hanging on their feet for best
Tokenized Words:  ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
Stemmed Words:  ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmatized Words:  ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


## Removing stop words with NLTK in Python

A stop word is a commonly used word (such as "the", "a", "an", or "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 

### Types of Stopwords
Stopwords are frequently occurring words that are often omitted from natural language processing (NLP) tasks because they contribute little to the meaning of a text. The exact list of stopwords varies with the language and the context. Broad categories of stopwords include:

1) Common Stopwords: These are the most frequently occurring words in a language and are often removed during text preprocessing. Examples include "the," "is," "in," "for," "where," "when," "to," "at," etc.\
2) Custom Stopwords: Depending on the specific task or domain, additional words may be considered as stopwords. These could be domain-specific terms that don't contribute much to the overall meaning. For example, in a medical context, words like "patient" or "treatment" might be considered as custom stopwords.\
3) Numerical Stopwords: Numbers and numeric characters may be treated as stopwords in certain cases, especially when the analysis is focused on the meaning of the text rather than specific numerical values.\
4) Single-Character Stopwords: Single characters, such as "a," "I," "s," or "x," may be considered stopwords, particularly in cases where they don't convey much meaning on their own.\
5) Contextual Stopwords: Words that are stopwords in one context but meaningful in another may be considered as contextual stopwords. For instance, the word "will" might be a stopword in the context of general language processing but could be important in predicting future events.

In [None]:
## Removing stop words with NLTK

In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
# Case-insensitive filtering: lowercase each token before checking it
# against stop_words (this would also remove "This")
filtered_lower = [w for w in word_tokens if w.lower() not in stop_words]

# Case-sensitive filtering: "This" is kept because the stop word list is lowercase
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [None]:
## Removing stop words with SpaCy

In [7]:
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "There is a pen on the table"

# Process the text using spaCy
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", clean_text)

Original Text: There is a pen on the table
Text after Stopword Removal: pen table


In [None]:
## Common stopwords can also be removed with the gensim and scikit-learn packages

In [None]:
## Custom stopwords code

In [8]:
text = "The patient was taken for treatment after being admitted to the hospital."
tokens = word_tokenize(text.lower())

custom_stopwords = {"patient", "treatment", "admitted", "hospital"}
filtered_custom = [word for word in tokens if word not in custom_stopwords and word.isalnum()]

print("Custom stopwords removed:")
print(filtered_custom)


Custom stopwords removed:
['the', 'was', 'taken', 'for', 'after', 'being', 'to', 'the']


In [None]:
## Numerical stopwords

In [9]:
text = "In 2023, 1200 patients were treated in batch 5."
tokens = word_tokenize(text.lower())

numerical_stopwords = {token for token in tokens if token.isdigit()}
filtered_numerical = [word for word in tokens if word not in numerical_stopwords and word.isalnum()]

print("Numerical stopwords removed:")
print(filtered_numerical)


Numerical stopwords removed:
['in', 'patients', 'were', 'treated', 'in', 'batch']


In [None]:
## Single character stopwords

In [12]:
text = "x is a variable, and I am learning NLP."
tokens = word_tokenize(text.lower())

single_char_stopwords = {token for token in tokens if len(token) == 1 and token.isalpha()}
filtered_single_char = [word for word in tokens if word not in single_char_stopwords and word.isalnum()]

print("Single-character stopwords removed:")
print(filtered_single_char)


Single-character stopwords removed:
['is', 'variable', 'and', 'am', 'learning', 'nlp']


In [14]:
## Contextual Stopwords

In [13]:
text = "Will you go to the event? The will was signed yesterday."
tokens = word_tokenize(text.lower())

# Here we treat "will" as a stopword. A plain set lookup cannot tell the modal verb
# from the noun, so both occurrences are removed, including the legal "will".
contextual_stopwords = {"will"}
filtered_contextual = [word for word in tokens if word not in contextual_stopwords and word.isalnum()]

print("Contextual stopwords removed (context-dependent):")
print(filtered_contextual)


Contextual stopwords removed (context-dependent):
['you', 'go', 'to', 'the', 'event', 'the', 'was', 'signed', 'yesterday']
