# <p style="text-align: center; color: purple; font-weight: bold;">📚✨ What is NLTK Library in Python? ✨📚</p>

![image.png](attachment:image.png)

# What is the NLTK Library in Python?

1. NLTK (Natural Language Toolkit) is a Python library for natural language processing (NLP) tasks.

2. It offers tools for tasks like tokenization, stemming, part-of-speech tagging, and named entity recognition.

3. NLTK provides access to corpora and lexical resources like WordNet.

4. It supports machine learning-based tasks such as text classification and sentiment analysis.

5. NLTK's modular design and documentation make it suitable for beginners and advanced users alike.

6. It is widely used in academia, industry, and research for text-processing work in applications like chatbots, search engines, and recommendation systems.

Example:

Chatbots: NLTK can handle the text preprocessing (tokenization, stemming, tagging) behind customer-service chatbots and virtual assistants, helping them interpret user queries before a response is generated.

Chatbots can be built using various NLP libraries and platforms, such as:
1) NLTK (Natural Language Toolkit)
2) spaCy
3) Rasa NLU (Natural Language Understanding)
4) Transformers (e.g., Hugging Face's library)
5) Dialogflow (formerly API.AI), a hosted service from Google

In [6]:
# Install NLTK package using pip
!pip install nltk

# Import NLTK package
import nltk

# Download the WordNet corpus from NLTK
nltk.download('wordnet')




[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# What is Tokenization?

1. Tokenization is the process of breaking text into smaller units called tokens.

2. Tokens are typically words, but they can also be sentences, subwords, or individual characters, depending on the level of granularity.

3. It helps computers understand and process text by segmenting it into meaningful units.

4. Common tokenization techniques include splitting text into words, sentences, or even characters.

5. Tokenization is essential in natural language processing tasks like text analysis, classification, and language modeling.

6. By breaking text into tokens, computers can perform various operations such as counting words, analyzing sentence structure, and extracting important information from text data.
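The granularity levels and the word-counting use case above can be sketched with the standard library alone. This is a rough illustration, not how NLTK tokenizes: the regular expressions and the sample sentence are assumptions chosen for the example.

```python
import re
from collections import Counter

text = "NLTK is a toolkit. NLTK is written in Python."

# Sentence-level tokens: split after sentence-ending punctuation followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word-level tokens: runs of letters, digits, or underscores
words = re.findall(r"\w+", text)

# Character-level tokens for the first word
characters = list(words[0])

# Counting how often each word-level token occurs
word_counts = Counter(words)

print(sentences)
print(words)
print(characters)
print(word_counts.most_common(2))
```

A regex split like this breaks on abbreviations such as "e.g.", which is one reason to prefer a trained tokenizer for real text.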

![image.png](attachment:image.png)

# Algorithm Steps:

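A simple tokenizer can follow the steps below. NLTK's `word_tokenize`, used in the next code cell, behaves differently: it preserves case and keeps punctuation marks as separate tokens.

1. Convert all characters in the text to lowercase.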

2. Remove any punctuation marks.

3. Split the processed text into tokens by whitespace characters.

4. Compile the tokens into a list and return them.
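The four steps above can be sketched as a small function. `simple_tokenize` is a hypothetical helper written for this illustration, and it only strips ASCII punctuation as defined by `string.punctuation`.

```python
import string


def simple_tokenize(text):
    """Hypothetical helper: lowercase, drop ASCII punctuation, split on whitespace."""
    # Step 1: convert all characters to lowercase
    lowered = text.lower()
    # Step 2: remove punctuation marks
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Steps 3 and 4: split on whitespace and return the tokens as a list
    return no_punct.split()


tokens = simple_tokenize("NLTK is a leading platform for building Python programs.")
print(tokens)
```

Compared with `word_tokenize`, this sketch lowercases "NLTK" and discards the final period instead of returning it as its own token.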

# Purpose: 

1. Segment text data into smaller units called tokens.

2. Enable various natural language processing tasks, including sentiment analysis, text classification, and information retrieval.

3. Facilitate the analysis and processing of text data by breaking it down into tokens.

In [7]:
# Tokenization 

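# Importing the word_tokenize function from the nltk.tokenize module
# Note: word_tokenize relies on the Punkt tokenizer models; if they are missing,
# run nltk.download('punkt') first (newer NLTK releases ask for 'punkt_tab')
from nltk.tokenize import word_tokenize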

# Defining a string containing the text to be tokenized
text = "NLTK is a leading platform for building python programs to work with human language data."

# Tokenizing the text using the word_tokenize function
tokens = word_tokenize(text)

# Printing the resulting tokens
print(tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


![image.png](attachment:image.png)

# What is Stemming?

1. Stemming is a text processing technique used to reduce words to their base or root form.

2. It simplifies word variations by stripping affixes (the Porter stemmer used here removes suffixes only), allowing computers to treat related words as the same term.

3. For example, "running" and "runs" are both reduced to the stem "run". Rule-based stemmers do not catch every variant: the Porter stemmer leaves "runner" unchanged.

4. Stemming is commonly used in natural language processing tasks like text analysis, search engines, and information retrieval.

5. By mapping variants to a shared stem, stemming shrinks the vocabulary and can improve recall in search and retrieval, although aggressive stemming may also conflate unrelated words.

6. While stemming may not always produce valid words, it is useful for tasks where word variations are less important than word meaning or frequency.
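To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm: `toy_stem`, its suffix list, and the `min_stem_length` parameter are all assumptions made up for this illustration.

```python
# Toy suffix list for illustration only; the Porter algorithm uses a much
# larger, ordered rule set with conditions on the remaining stem
SUFFIXES = ("ing", "ers", "er", "es", "s")


def toy_stem(word, min_stem_length=3):
    """Hypothetical helper: strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ["program", "programs", "programming", "studies", "bus"]:
    print(word, "->", toy_stem(word))
```

Stems such as "programm" and "studi" are not dictionary words, which is the trade-off described in point 6.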

# Algorithm:

1. Import the PorterStemmer class from the nltk.stem module.

2. Create an instance of the PorterStemmer class, initializing the stemming object.

3. Define a list of words to be stemmed.

4. Iterate through each word in the list.

5. Apply stemming to each word using the stem() method of the stemming object.

6. Print the stemmed version of each word.


# Purpose:

1. Demonstrate the Porter stemming algorithm on a word list.

2. Show how stemming reduces words to a base form.

3. Highlight the importance of stemming for text analysis and retrieval.

4. Emphasize its efficiency in handling word variations.

# Real life Example:

Imagine you're baking cookies with your friends, and the words "bake", "baking", "bakes", and "baker" keep coming up. To simplify these words to a basic form, we'd use stemming. The Porter stemmer reduces "bake", "baking", and "bakes" to the same stem, "bake". It's like having different shapes of cookies that are all still cookies. The stemmer leaves "baker" as it is, which shows that rule-based stemming does not merge every related word.

In this example:

"Bake" is the stem.
"Baking" and "bakes" are variations that reduce to it, while "baker" keeps its own form under the Porter rules.
Stemming helps us treat the matching variations as the same thing, just like how all the different shapes of cookies are still cookies.

In [8]:
#Stemming

# Importing the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer

# Creating an instance of the PorterStemmer class
stemmer = PorterStemmer()

# List of words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]

# Iterating through each word in the list
for word in words:
    
    # Stemming each word using the stemmer object
    print(stemmer.stem(word))

program
program
programm
program
programm


# What is Lemmatization?

1. Lemmatization is a text processing technique used to reduce words to their base or dictionary form.

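2. It's similar to stemming but produces valid dictionary words by looking each word up in a lexical resource, using its part of speech (and, in more advanced lemmatizers, its context).

3. For example, "running", "runs", and "ran" would all be lemmatized to the base form "run" when they are treated as verbs.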

4. Lemmatization helps computers understand the meaning of words in a text better.

5. It's commonly used in natural language processing tasks like text analysis, search engines, and information retrieval.

6. By reducing words to their base forms, lemmatization improves the accuracy of text processing algorithms.

7. Because it relies on a dictionary and part-of-speech information rather than suffix rules alone, lemmatization is more accurate than stemming in many cases.


# Algorithm Steps:

1. Tokenize the text into individual words or tokens.

2. Determine the part of speech of each token where possible; `WordNetLemmatizer` treats a word as a noun unless told otherwise.

3. Apply lemmatization to reduce each word to its base or dictionary form.

4. Output the lemmatized text, where each word is represented by its base form.

The normalized words that result support more accurate text analysis and comparison in applications such as search engines.

# Purpose: 

1. Normalize words for text analysis and comparison.

2. Improve accuracy in language processing tasks.

3. Enhance understanding by reducing word variations.

4. Applied in various language processing applications.

5. Facilitate semantic analysis for better comprehension.


# Real Life Example: 

Let's say you're writing an essay about different forms of transportation. In your essay, you mention words like "car", "cars", "carrying", and "carried". Now, if you want to find the base or dictionary form of these words, you would use lemmatization.

In this scenario:

The lemma (base form) of "car" remains "car".
The lemma of "cars" is also "car".
The lemma of "carrying" is "carry" (when treated as a verb).
The lemma of "carried" is also "carry" (when treated as a verb).
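The scenario above can be mimicked with a tiny lookup table. This is only a sketch: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the hand-written table stands in for a real lexical resource such as WordNet. The default of `pos="n"` is chosen to mirror `WordNetLemmatizer`, which also assumes a noun when no part of speech is given.

```python
# Tiny hand-written lookup table standing in for a real lexical resource such as WordNet
LEMMA_TABLE = {
    ("cars", "n"): "car",
    ("carrying", "v"): "carry",
    ("carried", "v"): "carry",
}


def toy_lemmatize(word, pos="n"):
    """Hypothetical helper: return the table entry for (word, pos), else the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("car"))
print(toy_lemmatize("cars"))
print(toy_lemmatize("carrying", pos="v"))
print(toy_lemmatize("carried", pos="v"))
print(toy_lemmatize("carried"))  # looked up as a noun, so no entry matches
```

The last call shows why the part of speech matters: without `pos="v"`, "carried" comes back unchanged.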

# Difference between Stemming and Lemmatization

Stemming and lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root form, but they differ in several ways:

1. Output: 

Stemming usually chops suffixes (and, in some stemmers, prefixes) off words to produce a stem, whereas lemmatization returns the base or dictionary form of a word, known as the lemma.

2. Accuracy: 

Stemming may result in the root form that is not an actual word or may not be a valid word in a given context, while lemmatization ensures that the resulting lemma is a valid word.

3. Complexity: 

Stemming algorithms are typically simpler and faster compared to lemmatization, which involves more complex linguistic rules and may require access to dictionaries or lexicons.

4. Use Cases:

Stemming is suitable for applications where speed and simplicity are prioritized over precision, such as information retrieval or indexing in search engines. Lemmatization, on the other hand, is preferred for tasks where accuracy and grammatical correctness are crucial, such as text analysis or natural language understanding.
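The contrast can be seen side by side in a small sketch. Both functions are toy stand-ins invented for this illustration (a two-suffix stripper and a three-entry dictionary), not NLTK's `PorterStemmer` or `WordNetLemmatizer`.

```python
def strip_suffix(word):
    """Toy stemmer (an assumption for illustration): remove a trailing "es" or "s"."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary (an assumption for illustration), keyed by surface form
LEMMAS = {"studies": "study", "was": "be", "mice": "mouse"}


def lookup_lemma(word):
    """Toy lemmatizer: return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)


for word in ["studies", "was", "mice"]:
    print(f"{word}: stem={strip_suffix(word)!r}, lemma={lookup_lemma(word)!r}")
```

The rule-based function is cheap but yields the non-word "studi" and cannot relate irregular forms like "was" or "mice" to their lemmas, while the dictionary lookup handles them at the cost of needing a lexicon.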








In [11]:
#Lemmatization

# Importing the WordNetLemmatizer class from the nltk.stem module
from nltk.stem import WordNetLemmatizer

# Creating an instance of the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

# List of words to be lemmatized
words = ["am", "is", "are", "was", "were"]

# Iterating through each word in the list
for word in words:
    # Lemmatizing each word using the lemmatizer object
    # 'pos' parameter specifies the part of speech (POS) of the word
    # 'v' indicates that the word is a verb
    print(lemmatizer.lemmatize(word, pos='v'))

be
be
be
be
be


## 🎉 Congratulations! 🎉

We have successfully completed our exploration of the NLTK library in Python! Through this project, we learned how to process and analyze text data, including tokenization, stemming, and lemmatization. You've done an amazing job understanding these concepts and applying them to real-world text processing. Keep up the great work!
