<a href="https://colab.research.google.com/github/AliArabi55/NLP/blob/main/1_Tokenization_and_Text_Preprocessing_with_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Toolkit (NLTK)

Natural Language Toolkit (NLTK) is a Python library for natural language processing (NLP). It was originally developed at the University of Pennsylvania and has become one of the most widely used libraries in NLP.

NLTK provides tools for text processing tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis. It also gives access to corpora, lexical resources, and pre-trained models to help you get started quickly.

To use NLTK in your Python environment, install it first with pip, the standard package manager for Python. Open your command prompt or terminal and run `pip install nltk`. In a notebook, run `!pip install nltk` in a code cell instead.

# Tasks

Q1: Tokenize the following text into words using space as a delimiter:

Text: "Natural Language Processing is an exciting field of AI."

Expected Output:
['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', 'of', 'AI.']

Q2: Tokenize the following text into sentences:

Text: "Tokenization is the first step in preprocessing. It helps break text into smaller pieces."

Expected Output:
['Tokenization is the first step in preprocessing.', 'It helps break text into smaller pieces.']

Q3: Perform part-of-speech tagging on the following sentence using the NLTK library:

Sentence: "The quick brown fox jumps over the lazy dog."

Q4: Write a function that accepts a sentence as input and returns the number of nouns and verbs in the sentence after performing POS tagging.

In [25]:
#!pip install nltk

In [26]:
import nltk

After importing NLTK, you may want to download additional resources like corpora or models depending on your requirements. NLTK provides a convenient way to download these resources using the nltk.download() function.

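To download all the available resources at once, you can run nltk.download('all'), which is shown commented out in the cell below.
Note: Alternatively, you can download only specific resources by replacing 'all' with their respective identifiers, for example 'punkt' or 'stopwords'.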

In [27]:
#nltk.download('all')

# Tokenization and Text Preprocessing with NLTK and Python

NLTK’s word tokenization allows you to split text into individual words or tokens. This process is essential for analyzing the linguistic structure of a sentence and extracting meaningful information from it. NLTK provides different tokenization methods, including the default word_tokenize() function and alternative options like TreebankWordTokenizer and RegexpTokenizer.
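The alternative tokenizers named above can be compared on a single sentence. This is a minimal sketch, assuming NLTK is installed; both classes are rule-based, so no resource download should be needed. The variable names (`sample`, `treebank_tokens`, `regexp_tokens`) and the `\w+` pattern are illustrative choices, not part of the original notebook.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer

sample = "Natural Language Processing is an exciting field of AI."

# Treebank-style tokenization: the final period becomes its own token.
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

# Regex tokenization: keep only runs of word characters, dropping punctuation.
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(sample)

print(treebank_tokens)
print(regexp_tokens)
```

The regex variant is convenient when punctuation should be discarded during tokenization, while the Treebank variant keeps punctuation as separate tokens for later steps such as POS tagging.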

In [28]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

# Download the NLTK resources used below (only required once).
# Note: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


# Task 1

In [47]:
# Task 1: sample text for word tokenization
text = "Natural Language Processing is an exciting field of AI."

In [48]:

# Tokenization - Word Tokenization
tokens = word_tokenize(text)
print("Word Tokens:")
print(tokens)
print()

Word Tokens:
['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', 'of', 'AI', '.']
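Q1 asks for a split on spaces, and its expected output keeps the final period attached to the last word ('AI.'), whereas word_tokenize() separates it into its own token. A minimal sketch of the literal space-delimited version, using only Python's built-in `str.split` (the name `space_tokens` is illustrative):

```python
text = "Natural Language Processing is an exciting field of AI."

# Split on single spaces only; punctuation stays attached to its word.
space_tokens = text.split(" ")
print(space_tokens)
# ['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', 'of', 'AI.']
```

Splitting on spaces is simple and fast, but it treats 'AI.' and 'AI' as different tokens, which is one reason a linguistic tokenizer is usually preferred for later analysis.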



# Task 2

In [49]:

# Task 2: sample text for sentence tokenization
text = "Tokenization is the first step in preprocessing. It helps break text into smaller pieces."

In [50]:
# Tokenization - Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:")
print(sentences)
print()



Sentences:
['Tokenization is the first step in preprocessing.', 'It helps break text into smaller pieces.']
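To see why a trained sentence tokenizer is used here rather than a plain split, consider a naive rule that breaks after every '.', '!' or '?' followed by whitespace. This is only an illustrative sketch with Python's `re` module, not what sent_tokenize() does internally; the second sentence pair is a made-up example.

```python
import re

text = "Tokenization is the first step in preprocessing. It helps break text into smaller pieces."

# Naive rule: split on whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)

# The same rule wrongly splits after an abbreviation such as "Dr."
abbrev_sentences = re.split(r"(?<=[.!?])\s+", "Dr. Smith teaches NLP. She uses NLTK.")
print(abbrev_sentences)
# ['Dr.', 'Smith teaches NLP.', 'She uses NLTK.']
```

The naive rule reproduces the expected output for Task 2 but breaks on abbreviations, which is the kind of case a trained model is designed to handle.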



In [51]:
# Text Preprocessing - Removing Stop Words
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.casefold() not in stop_words]
print("Tokens after removing stop words:")
print(filtered_tokens)
print()



Tokens after removing stop words:
['Natural', 'Language', 'Processing', 'exciting', 'field', 'AI', '.']



In [52]:
# Text Preprocessing - Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("Stemmed Tokens:")
print(stemmed_tokens)
print()



Stemmed Tokens:
['natur', 'languag', 'process', 'excit', 'field', 'ai', '.']
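The stems above are not always dictionary words ('natur', 'languag') because the Porter algorithm strips suffixes by rule rather than looking words up. What it buys you is that inflected forms of one word collapse to the same stem. A small sketch, assuming NLTK is installed; the word list is an illustrative input, not taken from the notebook:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of one word family (illustrative input).
word_family = ["process", "processes", "processed", "Processing"]

# By default the stemmer also lowercases its input.
stems = [stemmer.stem(word) for word in word_family]
print(stems)
# ['process', 'process', 'process', 'process']
```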



In [53]:
# Text Preprocessing - Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatized Tokens:")
print(lemmatized_tokens)
print()



Lemmatized Tokens:
['Natural', 'Language', 'Processing', 'exciting', 'field', 'AI', '.']



In [54]:
# Text Preprocessing - Handling Special Characters
special_chars = set(string.punctuation)
filtered_tokens = [token for token in tokens if token not in special_chars]
print("Tokens after handling special characters:")
print(filtered_tokens)

Tokens after handling special characters:
['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', 'of', 'AI']
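The membership test above removes only tokens that are exactly one punctuation character. Tokens with punctuation attached (such as 'AI.' from a space split) or multi-character tokens such as '...' pass through unchanged. A hedged sketch of one way to handle those cases with `str.strip`; the token list is made up for illustration:

```python
import string

# Illustrative tokens, e.g. from splitting on spaces.
raw_tokens = ["field", "of", "AI.", "(NLP)", "state-of-the-art", "..."]

# Strip leading/trailing punctuation from each token,
# then drop tokens that became empty.
stripped = [token.strip(string.punctuation) for token in raw_tokens]
cleaned_tokens = [token for token in stripped if token]
print(cleaned_tokens)
# ['field', 'of', 'AI', 'NLP', 'state-of-the-art']
```

Because only the ends of each token are stripped, internal hyphens such as those in 'state-of-the-art' are preserved.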


# Task 3



### Q3: Perform part-of-speech tagging on the following sentence using the NLTK library:
### Sentence: "The quick brown fox jumps over the lazy dog."

# Part-of-Speech Tagging

In [55]:
import nltk
from nltk.tokenize import word_tokenize

# Download necessary NLTK resources (only required once)
nltk.download('averaged_perceptron_tagger')

# Task 3: sentence to tag
text = "The quick brown fox jumps over the lazy dog."

# Alternative input (commented out): the Task 4 prompt itself
# text = "Write a function that accepts a sentence as input and returns the number of nouns and verbs in the sentence after performing POS tagging."

# Tokenize the text into words
tokens = word_tokenize(text)

# Perform POS tagging; the result is a list of (token, tag) pairs
pos_tags = nltk.pos_tag(tokens)

# Print each token with its tag
for token, pos_tag in pos_tags:
    print(f"{token}: {pos_tag}")

The: DT
quick: JJ
brown: NN
fox: NN
jumps: VBZ
over: IN
the: DT
lazy: JJ
dog: NN
.: .


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
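The tags printed above are Penn Treebank abbreviations. The sketch below attaches a short gloss to each tag that appears in this sentence; the tagged pairs are copied from the output above, and the `TAG_GLOSSES` dictionary is a hand-written helper for illustration, not an NLTK API. Note that the tagger labels 'brown' as NN even though it functions as an adjective here, a reminder that statistical taggers can make mistakes.

```python
# Tagged pairs copied from the Task 3 output.
pos_tags = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

# Short glosses for the Penn Treebank tags seen in this sentence
# (hypothetical helper; covers only these six tags).
TAG_GLOSSES = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

for token, tag in pos_tags:
    print(f"{token:<6} {tag:<4} {TAG_GLOSSES.get(tag, 'unknown tag')}")
```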


# Task 4

In [46]:
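def nouns_verb_retriever(text):
    """Return (noun_count, verb_count) for a sentence after POS tagging."""
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    # Noun tags (NN, NNS, NNP, NNPS) start with 'N'; verb tags (VB, VBD, ...) start with 'V'
    nouns = sum(1 for word, pos in pos_tags if pos.startswith('N'))
    verbs = sum(1 for word, pos in pos_tags if pos.startswith('V'))
    return nouns, verbs

# `text` is the sentence defined in the Part-of-Speech Tagging cell
nouns, verbs = nouns_verb_retriever(text)
print(f"nouns = {nouns}, verbs = {verbs}")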


nouns = 3, verbs = 1
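The counting step can be checked independently of the tagger by running it on the tagged pairs from the Task 3 output. This is a minimal sketch using `collections.Counter`; the `coarse_category` helper is hypothetical and simply groups Penn Treebank tags by prefix.

```python
from collections import Counter

# Tagged pairs copied from the Task 3 output.
pos_tags = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

def coarse_category(tag):
    """Map a Penn Treebank tag to 'noun', 'verb', or 'other' (hypothetical helper)."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    return "other"

counts = Counter(coarse_category(tag) for _, tag in pos_tags)
print(f"nouns = {counts['noun']}, verbs = {counts['verb']}")
# nouns = 3, verbs = 1
```

A single pass with a counter also makes it easy to report further categories later, such as adjectives, without scanning the tag list again.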
