# Tokenization, Stemming, and Lemmatization in NLP Using the NLTK Library

# Tokenization
- `Tokenization` is the process of splitting a paragraph or sentence into smaller units called `tokens`, typically sentences or words (punctuation marks usually become separate tokens).
- The result is a Python `list` of strings, where each element is one token.
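
As a rough illustration of what a tokenizer does, the sketch below splits a sentence into word and punctuation tokens using only Python's `re` module. The `simple_tokenize` helper is hypothetical and much cruder than NLTK's tokenizers; it only shows the idea of turning a string into a list of tokens.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration; this is not NLTK's algorithm.
    """
    # \w+ matches runs of letters, digits, and underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("AI systems perceive data, learn from it, and adapt.")
print(tokens)
# ['AI', 'systems', 'perceive', 'data', ',', 'learn', 'from', 'it', ',', 'and', 'adapt', '.']
```

Note that the commas and the final period come out as tokens of their own, which is also how `word_tokenize` treats them.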

In [1]:
# Import NLTK and download the Punkt models used by its tokenizers:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anast\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

- Treat a `paragraph` as a string of characters that may contain spaces, punctuation, and capital letters.
- The `paragraph` variable below holds the sample text; tokenizing it makes tasks such as counting its words straightforward.

In [5]:
paragraph = """Artificial Intelligence is a branch of computer science focused on creating machines or software that can perform tasks traditionally requiring human intelligence. These tasks include learning from experience, reasoning, recognizing patterns, understanding natural language, and making decisions. Unlike traditional software that follows explicit rules programmed by humans, AI systems are designed to perceive data, learn from it, and adapt their behavior autonomously. This ability makes AI a fascinating and powerful tool shaping the future.
The concept of AI was first formally proposed in the 1950s, with pioneers like Alan Turing and John von Neumann laying the groundwork. Alan Turing's question, "Can machines think?" and his formulation of the Turing Test became foundational ideas. Since then, AI research has progressed from simple rule-based systems to complex neural networks inspired by the human brain.
Machine Learning enables computers to learn from data without explicit programming for every task. It relies on mathematical models including probability, statistics, linear algebra, and optimization theory. Algorithms optimize a cost or loss function by adjusting parameters to minimize the error in predictions. Techniques like supervised learning (learning from labeled data), unsupervised learning (finding patterns in unlabeled data), and reinforcement learning (learning through rewards and punishments) are fundamental.
Deep Learning, a subset of machine learning, uses artificial neural networks with many layers to model increasingly abstract features from data. The core mathematical operation in neural networks is the weighted sum followed by a nonlinear activation function, enabling the model to approximate complex functions. Training deep networks requires backpropagation, a method using calculus and chain rule to compute gradients of the loss with respect to all weights, enabling efficient optimization via gradient descent.
Natural Language Processing uses statistical and neural models to enable computers to understand, interpret, and generate human language. Techniques involve sequence modeling with recurrent neural networks (RNNs), transformers, and attention mechanisms.
Computer Vision enables machines to interpret visual data through convolutional neural networks (CNNs), which apply convolutional filters to detect edges, textures, and objects."""

- To split the paragraph into sentences, use the `sent_tokenize` function from the `nltk` library.
- `nltk.sent_tokenize` returns a list of sentence strings.
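
To see why sentence splitting is harder than it looks, here is a naive sketch that splits after `.`, `!`, or `?` when followed by whitespace. `naive_sent_split` is a hypothetical helper and not what `sent_tokenize` does internally; NLTK's Punkt tokenizer relies on a trained model that also copes with abbreviations.

```python
import re

def naive_sent_split(text):
    """Split text after ., !, or ? when followed by whitespace.

    Hypothetical helper for illustration; it mishandles abbreviations
    such as "Dr." that a trained tokenizer is designed to cope with.
    """
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "AI is a branch of computer science. It was proposed in the 1950s. Can machines think?"
for s in naive_sent_split(sample):
    print(s)

# The naive rule breaks on abbreviations:
print(naive_sent_split("Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```

Even a trained tokenizer is not perfect: in the sentence output further down, the line about Alan Turing's question is split in two right after the quoted question mark.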

In [6]:
# Tokenization of the paragraph into sentences:
sentences = nltk.sent_tokenize(paragraph)

- To split the `paragraph` into `words`, use the `nltk` library's `word_tokenize()` function, which breaks the text into individual word and punctuation tokens.
- `nltk.word_tokenize` returns a list of token strings.

In [7]:
words = nltk.word_tokenize(paragraph)

In [9]:
# Print all the sentences one by one:
for sentence in sentences:
    print(sentence)

Artificial Intelligence is a branch of computer science focused on creating machines or software that can perform tasks traditionally requiring human intelligence.
These tasks include learning from experience, reasoning, recognizing patterns, understanding natural language, and making decisions.
Unlike traditional software that follows explicit rules programmed by humans, AI systems are designed to perceive data, learn from it, and adapt their behavior autonomously.
This ability makes AI a fascinating and powerful tool shaping the future.
The concept of AI was first formally proposed in the 1950s, with pioneers like Alan Turing and John von Neumann laying the groundwork.
Alan Turing's question, "Can machines think?"
and his formulation of the Turing Test became foundational ideas.
Since then, AI research has progressed from simple rule-based systems to complex neural networks inspired by the human brain.
Machine Learning enables computers to learn from data without explicit programming

In [10]:
# Print all the words:
print("All the words are: ", words)

All the words are:  ['Artificial', 'Intelligence', 'is', 'a', 'branch', 'of', 'computer', 'science', 'focused', 'on', 'creating', 'machines', 'or', 'software', 'that', 'can', 'perform', 'tasks', 'traditionally', 'requiring', 'human', 'intelligence', '.', 'These', 'tasks', 'include', 'learning', 'from', 'experience', ',', 'reasoning', ',', 'recognizing', 'patterns', ',', 'understanding', 'natural', 'language', ',', 'and', 'making', 'decisions', '.', 'Unlike', 'traditional', 'software', 'that', 'follows', 'explicit', 'rules', 'programmed', 'by', 'humans', ',', 'AI', 'systems', 'are', 'designed', 'to', 'perceive', 'data', ',', 'learn', 'from', 'it', ',', 'and', 'adapt', 'their', 'behavior', 'autonomously', '.', 'This', 'ability', 'makes', 'AI', 'a', 'fascinating', 'and', 'powerful', 'tool', 'shaping', 'the', 'future', '.', 'The', 'concept', 'of', 'AI', 'was', 'first', 'formally', 'proposed', 'in', 'the', '1950s', ',', 'with', 'pioneers', 'like', 'Alan', 'Turing', 'and', 'John', 'von', 'Ne
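
Once text is tokenized, simple statistics are easy to compute. The sketch below counts word tokens while skipping punctuation. The token list is copied from the second sentence of the output above, and `count_words` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

def count_words(tokens):
    """Count tokens that contain at least one letter or digit.

    Hypothetical helper; punctuation-only tokens such as ',' are skipped.
    """
    return Counter(t.lower() for t in tokens if any(ch.isalnum() for ch in t))

tokens = ['These', 'tasks', 'include', 'learning', 'from', 'experience', ',',
          'reasoning', ',', 'recognizing', 'patterns', ',', 'understanding',
          'natural', 'language', ',', 'and', 'making', 'decisions', '.']
counts = count_words(tokens)
print(len(tokens))           # 20 tokens, including punctuation
print(sum(counts.values()))  # 15 word tokens
```

Lowercasing before counting is a design choice: it merges "These" and "these" into one entry, which is usually what a word-frequency count wants.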

# Stemming
- Stemming is the process of reducing words to a base or root form by stripping suffixes with simple rules. It is fast, but the resulting stem is not always a valid dictionary word.
- The Porter stemmer is a popular stemming algorithm that is widely used in natural language processing tasks.
- The NLTK library provides a built-in `PorterStemmer` class for stemming words.
- For example, the Porter stemmer reduces "running" to "run".
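
The idea of rule-based suffix stripping can be sketched in a few lines. This is a deliberately tiny toy, not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. The `toy_stem` function and its suffix list are made up for illustration.

```python
def toy_stem(word):
    """Strip one common suffix from a word (toy example, not Porter)."""
    word = word.lower()
    # Try longer suffixes first so "ing" is checked before "s".
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "learning", "machines", "graduated"]:
    print(w, "->", toy_stem(w))
# running -> run
# learning -> learn
# machines -> machin
# graduated -> graduat
```

Stems such as "machin" and "graduat" are not dictionary words, which is the typical trade-off of stemming: speed and simplicity in exchange for readability.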

### Stopwords:

- `stopwords`: a list of very common words, such as "the", "and", and "a", that carry little meaning on their own and are usually removed before analysis.
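
A minimal sketch of stopword filtering, using a small hand-written set rather than NLTK's full English list (the `STOP_WORDS` set and the `remove_stopwords` helper are assumptions for illustration). It also shows that a case-sensitive membership test lets capitalized words such as "My" slip through unless tokens are lowercased first.

```python
# A tiny, hand-picked stopword set; NLTK's English list is much longer.
STOP_WORDS = {"my", "is", "in", "from", "and", "with", "the", "a"}

def remove_stopwords(tokens, lowercase=False):
    """Drop tokens found in STOP_WORDS.

    With lowercase=False the comparison is case-sensitive, so "My" is kept.
    """
    kept = []
    for token in tokens:
        key = token.lower() if lowercase else token
        if key not in STOP_WORDS:
            kept.append(token)
    return kept

tokens = ["My", "name", "is", "Muhammad", "Anas", "."]
print(remove_stopwords(tokens))                  # ['My', 'name', 'Muhammad', 'Anas', '.']
print(remove_stopwords(tokens, lowercase=True))  # ['name', 'Muhammad', 'Anas', '.']
```

Storing the stopwords in a `set` keeps each membership test fast, which matters when filtering long documents.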

In [11]:
# Import the PorterStemmer:
from nltk.stem import PorterStemmer
# Import the stopwords corpus (requires a one-time nltk.download('stopwords')):
from nltk.corpus import stopwords

In [18]:
para = "My name is Muhammad Anas. Graduated in Artificial Intelligence from IUB. My experties in Machine learning and Deep learning with python."

In [20]:
sent = nltk.sent_tokenize(para)

In [21]:
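# Apply the stemming algorithm to each sentence:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stopword set once
for i, sentence in enumerate(sent):
    words = nltk.word_tokenize(sentence)
    # The stopword check is case-sensitive, so "My" is kept; the stemmer then lowercases it.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sent[i] = ' '.join(words)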

In [22]:
# Print the stemmed sentences:
for sen in sent:
    print(sen)

my name muhammad ana .
graduat artifici intellig iub .
my experti machin learn deep learn python .


# Lemmatization
In **Natural Language Processing (NLP)**, **lemmatization** is the process of reducing a word to its **base or dictionary form** (called the *lemma*), while ensuring that the resulting form is a valid word in the language.

Unlike **stemming**, which just chops off word endings based on simple rules, lemmatization uses **linguistic knowledge** (like vocabulary and morphological analysis) to consider the word’s **part of speech** and produce a meaningful root form.

**Example:**

* *running* → **run**
* *better* → **good**
* *mice* → **mouse**

**Key Points:**

* Requires understanding of **context** (e.g., *"saw"* could be the verb *see* or the noun *saw*; the lemma depends on which one is meant).
* Often uses a **lexicon** or **WordNet** for reference.
* More accurate than stemming but computationally heavier.
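
The role of part of speech can be illustrated with a toy lookup table. The `TOY_LEXICON` dictionary and the `toy_lemmatize` function are hypothetical stand-ins for a real lexicon such as WordNet, and they contain only the examples discussed above.

```python
# (word, part of speech) -> lemma; a hand-written stand-in for a real lexicon.
TOY_LEXICON = {
    ("running", "verb"): "run",
    ("better", "adjective"): "good",
    ("mice", "noun"): "mouse",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("saw", "verb"))   # see
print(toy_lemmatize("saw", "noun"))   # saw
print(toy_lemmatize("mice", "noun"))  # mouse
```

The same surface form "saw" maps to two different lemmas, which is why a lemmatizer needs the part of speech and a stemmer, which only looks at the characters, cannot make this distinction.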



In [29]:
# Import the lemmatizer (requires a one-time nltk.download('wordnet')):
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [30]:
text = """Data Analysis is the process of examining data sets to draw conclusions about the information they contain.Python libraries like seaborn, matplotlib and plotly mostly used for this purpose."""

In [33]:
# Tokenize the text:
tokens = nltk.word_tokenize(text)

In [34]:
# Apply the lemmatization process:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stopword set once

# Drop stopwords and lemmatize the remaining tokens.
# lemmatize() treats every word as a noun by default (pos='n'),
# so verb forms such as "examining" and "used" are left unchanged.
tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

In [35]:
for token in tokens:
    print(token)

Data
Analysis
process
examining
data
set
draw
conclusion
information
contain.Python
library
like
seaborn
,
matplotlib
plotly
mostly
used
purpose
.
