# Natural Language Processing

## Core Concepts and Techniques

## Tokenization

Tokenization splits a stream of text into smaller units called tokens, usually words, punctuation marks, or subwords. Tokens are the basic units for all later analysis.

In [None]:
# Download the tokenizer models that word_tokenize relies on
import nltk
nltk.download('punkt')

In [1]:
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fascinating."
tokens = word_tokenize(text)
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
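
To make the idea concrete without any downloads, here is a minimal sketch of a regular-expression tokenizer built on the standard library. The function name `simple_tokenize` and the pattern are assumptions chosen for illustration; this is not how `word_tokenize` is implemented, only a rough approximation of its behavior on simple sentences.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Natural Language Processing is fascinating.")
print(tokens)
# ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
```

A sketch like this breaks down on contractions: it splits "don't" into `don`, `'`, `t`, whereas `word_tokenize` returns `do` and `n't`. That difference is one reason to prefer a trained tokenizer for real text.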


## Stop Words Removal

Stop words are common words (like "and", "the", "in") that are often removed from text data during preprocessing to reduce noise and focus on more meaningful words.

In [2]:
# Download resources
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\win\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\win\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# Import library
from nltk.corpus import stopwords

# Sample sentence
sentence = "This is a simple example showing how stop words are removed from text."

# Tokenize the sentence
words = word_tokenize(sentence)

# Get English stop words
stop_words = set(stopwords.words('english'))

# Filter out stop words
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original:", words)
print("Without Stop Words:", filtered_words)

Original: ['This', 'is', 'a', 'simple', 'example', 'showing', 'how', 'stop', 'words', 'are', 'removed', 'from', 'text', '.']
Without Stop Words: ['simple', 'example', 'showing', 'stop', 'words', 'removed', 'text', '.']


In [4]:
# Sample usage
sentence = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenize
tokens = word_tokenize(sentence)

# Filter stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Filtered Tokens:", filtered_tokens)

Filtered Tokens: ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
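
The "noise" that stop words add is easiest to see in word counts. The following sketch uses only the standard library; the `STOP_WORDS` set and the sample sentence are hypothetical, hand-picked for illustration, and much smaller than NLTK's English list.

```python
from collections import Counter

# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "on", "and", "in"}

tokens = "the cat is on the mat and the dog is in the garden".split()

# Count every token, then count only the tokens that are not stop words.
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(all_counts.most_common(2))   # [('the', 4), ('is', 2)]
print(sorted(content_counts))      # ['cat', 'dog', 'garden', 'mat']
```

Before filtering, the most frequent tokens say nothing about the topic. After filtering, only the content words remain.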


## Stemming and Lemmatization

Stemming and lemmatization both normalize text by reducing inflected words to a base form, so that variants such as "studies" and "study" are treated as the same term in text analysis and machine learning models.

* ***Stemming***

Stemming chops off word endings using heuristic suffix-stripping rules. It is fast, but the result is not always a dictionary word (for example, "studies" becomes "studi").

* ***Lemmatization***

Lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word, known as the lemma. It takes the word's part of speech (POS) into account, which usually gives more accurate results than stemming.
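
The contrast between the two approaches can be sketched with a toy example. Everything below is an assumption made for illustration: `SUFFIX_RULES`, `toy_stem`, `LEMMA_TABLE`, and `toy_lemmatize` are hypothetical names, the rules are not the Porter algorithm, and the lookup table is not WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]  # tried in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# A lemmatizer looks the word up instead, and the answer can depend on the POS.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",    # adjective reading
    ("better", "v"): "better",  # verb reading
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_stem("studies"))           # studi  (not a dictionary word)
print(toy_stem("eating"))            # eat
print(toy_lemmatize("studies", "n")) # study
print(toy_lemmatize("better", "a"))  # good
```

The rule-based function needs no vocabulary but produces "studi", while the lookup returns real words and gives a different answer for "better" depending on the POS it is told to assume.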

In [5]:
# Download resources
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\win\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\win\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [6]:
# Import library
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Sample words to test
words = ["running", "flies", "better", "eating", "cats", "studies"]

# Compare stemming and lemmatization
print(f"{'Word':<10} {'Stemmed':<10} {'Lemmatized':<12}")
print("-" * 35)
for word in words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word, pos="v")  # pos="v" treats every word as a verb, including "cats" and "better"
    print(f"{word:<10} {stemmed:<10} {lemmatized:<12}")

Word       Stemmed    Lemmatized  
-----------------------------------
running    run        run         
flies      fli        fly         
better     better     better      
eating     eat        eat         
cats       cat        cat         
studies    studi      study       


## Part-of-Speech (POS) Tagging

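POS tagging assigns a part-of-speech label (noun, verb, adjective, etc.) to each token, which is the basis for syntactic analysis. NLTK's default English tagger uses the Penn Treebank tag set. The tags that appear in the examples in this section are:

* DT: Determiner
* IN: Preposition or subordinating conjunction
* JJ: Adjective
* NN: Noun, singular or mass
* NNS: Noun, plural
* NNP: Proper noun, singular
* PRP$: Possessive pronoun
* VBZ: Verb, 3rd person singular present
* VBG: Verb, gerund or present participle
* . : Sentence-final punctuation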

In [16]:
# Downloads
nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and tag
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

# Output
print(pos_tags)

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\win\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
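
Once tokens are tagged, a common next step is to select words by tag. The sketch below copies the tagged tokens printed above into a literal list so that it runs without NLTK; the variable names `nouns` and `adjectives` are illustrative.

```python
# Tagged tokens copied from the printed output of nltk.pos_tag above.
pos_tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
            ('dog', 'NN'), ('.', '.')]

# All Penn Treebank noun tags start with "NN" and all adjective tags with "JJ".
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
adjectives = [word for word, tag in pos_tags if tag.startswith('JJ')]

print(nouns)       # ['brown', 'fox', 'dog']
print(adjectives)  # ['quick', 'lazy']
```

Note that "brown" lands in the noun list because the tagger labeled it `NN` in this sentence. Any downstream step inherits the tagger's mistakes, so it is worth checking tags on a sample of your own data.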


In [18]:
# Sample sentence
sentence = "John is playing football in the park with his friends."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Get part-of-speech tags
pos_tags = nltk.pos_tag(tokens)

# Print the results
print("POS Tags:")
for word, tag in pos_tags:
    print(f"{word:10} → {tag}")

POS Tags:
John       → NNP
is         → VBZ
playing    → VBG
football   → NN
in         → IN
the        → DT
park       → NN
with       → IN
his        → PRP$
friends    → NNS
.          → .
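
POS tags can also feed back into lemmatization, which expects a WordNet-style POS letter rather than a Penn Treebank tag. The helper below is a minimal sketch of that mapping; `penn_to_wordnet` is a hypothetical name, and falling back to noun for every other tag is an assumption that mirrors the default `pos` of `WordNetLemmatizer.lemmatize`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter: 'a', 'v', 'r', or 'n'."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # assumption: treat everything else as a noun

# A few (word, tag) pairs copied from the output above.
tagged = [('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'),
          ('football', 'NN'), ('friends', 'NNS')]

for word, tag in tagged:
    print(f"{word:10} {tag:5} {penn_to_wordnet(tag)}")
```

The returned letter can then be passed as the `pos` argument, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, so that each word is lemmatized according to its own tag instead of a single fixed POS.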
