<span style="font-size:16px; font-weight:bold">Welcome to Natural language processing (NLP) in Python</span><br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

<span style="font-size: 16px; font-weight: bold">Outline:</span><br/>
1️⃣ `Introduction to NLP`<br/>
▪ Overview of Natural Language Processing<br/>
▪ Key areas: Text Analysis, Language Generation, Speech Processing, Semantic Understanding<br/>
▪ NLP Challenges<br/>
▪ Introduction to NLTK and spaCy libraries and their capabilities<br/>
▪ NLTK & spaCy Setup<br/>
2️⃣ `Understanding Text Dataset Structure in NLP`<br/>
▪ Corpora<br/>
▪ Corpus<br/>
▪ Document<br/>
▪ Token<br/>
3️⃣ `Semantic, Syntactic, and Sentiment Analysis in NLP`<br/>
▪ Semantic Analysis<br/>
▪ Syntactic Analysis<br/>
▪ Sentiment Analysis<br/>
4️⃣ `Regular Expressions in NLP`<br/>
5️⃣ `Stopwords Removal`<br/>
6️⃣ `Chunking and Parsing in NLP`<br/>

<span style="font-size: 16px; font-weight: bold; color:rgb(255, 251, 18)">1️⃣ Introduction to NLP</span><br/>
NLP is a branch of artificial intelligence that allows computers to understand, interpret, and generate human language by combining linguistics, computer science, and machine learning.<br/> 

<span style="font-size:16px; font-weight:bold">Key Areas of NLP:</span><br/>
`Text Analysis:`<br/>
▪ Tokenization: Breaking text into words or sentences.<br/>
▪ Part-of-Speech (POS) Tagging: Identifying grammatical components (e.g., nouns, verbs).<br/>
▪ Named Entity Recognition (NER): Extracting entities like names, dates, or organizations.<br/>
▪ Sentiment Analysis: Determining the emotional tone (positive, negative, neutral).<br/>

`Language Generation:`<br/>
▪ Text Summarization: Condensing long texts into shorter summaries.<br/>
▪ Machine Translation: Converting text between languages (e.g., Google Translate).<br/>
▪ Text Generation: Creating human-like text (e.g., chatbots, story generators).<br/>

`Speech Processing:`<br/>
▪ Speech Recognition: Converting spoken words to text (e.g., Siri, Alexa).<br/>
▪ Text-to-Speech (TTS): Generating spoken language from text.<br/>
▪ Voice Assistants: Combining speech recognition and NLP for interactive systems.<br/>

`Semantic Understanding:`<br/>
▪ Word Embeddings: Representing words as vectors (e.g., Word2Vec, BERT).<br/>
▪ Question Answering: Providing precise answers to user queries.<br/>
▪ Dialogue Systems: Enabling conversational agents to maintain context.<br/>

<span style="font-size:16px; font-weight:bold">NLP Challenges:</span><br/>
▪ `Ambiguity & Context Sensitivity:`<br/>
Human language is often ambiguous, meaning that the same word or sentence can have multiple meanings depending on the context. For example, the word "bank" could refer to a financial institution or the side of a river. NLP systems must use context to resolve these ambiguities and understand the intended meaning.<br/>
▪ `Cultural & Linguistic Nuances:`<br/>
Language varies greatly across different cultures, regions, and social groups. Idioms, slang, humor, and cultural references can be difficult for NLP systems to interpret correctly. Additionally, languages have unique grammatical structures and vocabulary that require specialized handling.<br/>
▪ `Handling Massive Datasets:`<br/>
Modern NLP applications often need to process and analyze huge volumes of text data, such as social media posts, news articles, or customer reviews. Efficient algorithms and scalable infrastructure are necessary to manage, store, and analyze this data in a reasonable amount of time.<br/>
▪ `Continuous Innovation for Greater Accuracy & Relevance:`<br/>
The field of NLP is rapidly evolving, with new models and techniques being developed to improve accuracy and relevance. Staying up-to-date with the latest research, adapting to new data sources, and refining models are ongoing challenges to ensure NLP systems remain effective and reliable.<br/>

<span style="font-size:16px; font-weight:bold;">NLTK and spaCy libraries</span><br/>
▪ [NLTK](https://www.nltk.org/) is a comprehensive open-source Python library designed for *educational* and *research* purposes in natural language processing. It provides robust tools for tasks like tokenization, stemming, lemmatization, parsing, and more, backed by extensive corpora and lexical resources.<br/> 
▪ [spaCy](https://spacy.io/) is an *industrial-strength* NLP library in Python tailored for real-world applications, emphasizing speed and accuracy in tasks such as tokenization, named entity recognition, and dependency parsing. Its modern API and efficient design make it ideal for processing large-scale text data.<br/> 

NLTK's toolkit covers the following areas:<br/>
• `Classification:` Assigning predefined categories or labels to text, such as spam detection, sentiment analysis, or topic categorization.<br/>
• `Tokenization:` Breaking text into smaller units (words, sentences, or phrases).<br/>
• `Stemming:` Reducing words to their root form by removing suffixes/prefixes.<br/>
• `Tagging:` Assigning grammatical categories (POS tags) to words.<br/>
• `Parsing:` Analyzing sentence structure and grammatical relationships.<br/>
• `Semantic Reasoning:` Understanding meaning and relationships between words/concepts.<br/>
• `Wrappers:` Interface layers that connect NLTK to external NLP tools such as Stanford CoreNLP.<br/>

<span style="font-size:16px; font-weight:bold">NLTK & spaCy Setup:</span><br/>
• `NLTK's Punkt` is an unsupervised sentence tokenizer that segments text into sentences by learning punctuation and abbreviation patterns from data. In contrast, spaCy's trained pipelines such as `en_core_web_sm` set sentence boundaries from the dependency parse by default; a rule-based `sentencizer` component is available when no parser is loaded.<br/>
• `en_core_web_sm` is a small, efficient English spaCy model supporting tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. For greater accuracy and more features, you can use larger models like `en_core_web_md`, `en_core_web_lg`, or the transformer-based `en_core_web_trf`.<br/>
• `WordNet` is a large lexical database of English, accessible through NLTK, that organizes words into synonym sets (synsets) to facilitate semantic analysis; alternatives like BabelNet or ConceptNet extend these capabilities to multilingual and broader semantic relationships.<br/>
• The `Gutenberg` corpus in NLTK is a curated collection of classic literary texts sourced from Project Gutenberg, offering diverse works for exploring historical language patterns and stylistic nuances; alternatives like the Brown Corpus or Reuters Corpus provide additional perspectives for varied text analysis.<br/>

In [None]:
! pip install spacy
! pip install nltk
! pip install regex==2023.10.3
! pip install spacy-wordnet
! python -m spacy download en_core_web_sm


<span style="font-size: 16px; color: rgb(11, 7, 241); font-weight: bold">2️⃣ Understanding Text Dataset Structure in NLP</span><br/>
▪ `Corpora:` The plural of *corpus*: several collections of text considered together, such as the separate corpora that ship with NLTK (multiple datasets).<br/>
▪ `Corpus:` A large, structured collection of texts used for linguistic analysis or for training NLP models. A single corpus can contain thousands or millions of documents (one dataset).<br/>
▪ `Document:` A document is an individual piece of text within a corpus, such as an article, a book, a tweet, or an email (one text).<br/>
▪ `Token:` A token is the smallest unit of text, typically a word, punctuation mark, or symbol, obtained after tokenization. Tokenization is the process of splitting text into these basic units (word, punctuation, etc.).<br/>
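
To make the corpus → document → token hierarchy concrete, here is a minimal sketch that uses only the standard library. The file names, the review sentences, and the `tokenize` helper are invented for illustration; the regex tokenizer is a crude stand-in and not NLTK's `word_tokenize`.

```python
import re

# A toy corpus: one collection that maps document IDs to raw text.
# The file names and sentences are made up for this illustration.
corpus = {
    "review_1.txt": "Great phone. Battery lasts long!",
    "review_2.txt": "Screen cracked after a week.",
}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Each document is broken down into its list of tokens.
tokens_per_doc = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

for doc_id, tokens in tokens_per_doc.items():
    print(f"{doc_id}: {len(tokens)} tokens -> {tokens}")
```

Each key of `corpus` plays the role of a document, and each list in `tokens_per_doc` holds that document's tokens.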

In [1]:
from nltk.corpus import webtext # Provides access to the Webtext corpus, useful for training and testing tokenizers

corpus = webtext # The entire webtext corpus
documents = corpus.fileids()   # List all documents (fileids) in the corpus
print("Documents in the webtext corpus:")
for doc in documents:
    print(f"• {doc}")

Documents in the webtext corpus:
• firefox.txt
• grail.txt
• overheard.txt
• pirates.txt
• singles.txt
• wine.txt


In [3]:
document_id = documents[0]  # Select the first document ('firefox.txt')
print(f"Selected Document: {document_id}")

raw_text = corpus.raw(document_id)  # Get the raw text of the document
print(f"\nFirst 300 characters of the document:\n{raw_text[:300]}")

Selected Document: firefox.txt

First 300 characters of the document:
Cookie Manager: "Don't allow sites that set removed cookies to set future cookies" should stay checked
When in full screen mode
Pressing Ctrl-N should open a new browser when only download dialog is left open
add icons to context menu
So called "tab bar" should be made a proper toolbar or given 


In [10]:
from nltk.tokenize import sent_tokenize  # Function for splitting text into sentences using a pre-trained model

sentences = sent_tokenize(raw_text)      # Tokenize the document into sentences
print(f"Number of sentences in the document: {len(sentences)}")
print(f"\nFirst sentence:\n{sentences[0]}")

Number of sentences in the document: 1142

First sentence:
Cookie Manager: "Don't allow sites that set removed cookies to set future cookies" should stay checked
When in full screen mode
Pressing Ctrl-N should open a new browser when only download dialog is left open
add icons to context menu
So called "tab bar" should be made a proper toolbar or given the ability collapse / expand.


In [9]:
from nltk.tokenize import word_tokenize
  
tokens = word_tokenize(raw_text)    # Tokenize the document into words
print(f"Number of tokens in the document: {len(tokens)}")
print(f"First 10 tokens:\n{tokens[:10]}")

Number of tokens in the document: 96120
First 10 tokens:
['Cookie', 'Manager', ':', '``', 'Do', "n't", 'allow', 'sites', 'that', 'set']


<span style="font-size: 16px;  color:rgb(38, 255, 18); font-weight: bold">3️⃣ Semantic, Syntactic, and Sentiment Analysis in NLP</span><br/>
▪ `Semantic Analysis:` Understanding the meaning of words, phrases, and sentences in context. Semantic analysis helps computers grasp what a text is actually talking about, such as identifying relationships between entities or resolving ambiguity in meaning.<br/>
▪ `Syntactic Analysis:` Examining the grammatical structure of sentences. Syntactic analysis (or parsing) helps determine how words are related to each other in a sentence, such as identifying subjects, verbs, and objects, and ensuring the sentence follows the rules of grammar.<br/>
▪ `Sentiment Analysis:` Determining the emotional tone behind a body of text. Sentiment analysis is used to identify whether the sentiment expressed is positive, negative, or neutral, which is especially useful in applications like social media monitoring or customer feedback analysis.<br/>

In [8]:
from nltk.corpus import wordnet  # Provides access to the WordNet lexical database for lemmatization and semantic analysis

word = 'bank'  # Define the word to analyze semantically
synsets = wordnet.synsets(word)  # Retrieve all synsets (senses) for the word from WordNet

print(f"Semantic Analysis: Synsets for '{word}':") 
for syn in synsets:  # Iterate over each synset for the word
    print(f"• {syn.name()}: {syn.definition()}")  # Print the synset name and its definition

Semantic Analysis: Synsets for 'bank':
• bank.n.01: sloping land (especially the slope beside a body of water)
• depository_financial_institution.n.01: a financial institution that accepts deposits and channels the money into lending activities
• bank.n.03: a long ridge or pile
• bank.n.04: an arrangement of similar objects in a row or in tiers
• bank.n.05: a supply or stock held in reserve for future use (especially in emergencies)
• bank.n.06: the funds held by a gambling house or the dealer in some gambling games
• bank.n.07: a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
• savings_bank.n.02: a container (usually with a slot in the top) for keeping money at home
• bank.n.09: a building in which the business of banking transacted
• bank.n.10: a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
• bank.v.01: tip laterally
• bank.v.02: enclose with a bank
• bank.

In [21]:
import nltk
from nltk.tokenize import word_tokenize   # Function for splitting text into words (and punctuation)

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)  # Tokenize the sentence into words

pos_tags = nltk.pos_tag(tokens)  # Perform part-of-speech tagging on the tokens
print(f"Syntactic Analysis: POS Tags for the sentence:\n{pos_tags}")

Syntactic Analysis: POS Tags for the sentence:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]


In [12]:
from nltk.corpus import gutenberg  # import the Gutenberg corpus from NLTK
from nltk.sentiment import SentimentIntensityAnalyzer  # import the SentimentIntensityAnalyzer for sentiment analysis

sia = SentimentIntensityAnalyzer()  # create an instance of SentimentIntensityAnalyzer
text = "I love natural language processing! It's amazing and fun."  # sample text for sentiment analysis
sentiment_scores = sia.polarity_scores(text)  # get sentiment polarity scores for the text
print(f"Sentiment Analysis: Sentiment scores for the text:\n{sentiment_scores}")  # print the sentiment scores


# Get synsets (sets of cognitive synonyms) for the word 'great' from WordNet
synsets = wordnet.synsets('great')  # retrieve synsets for the word 'great' from WordNet
print(f"\nWordNet Synsets for 'great':\n{synsets}")  # print the synsets for 'great'

# list available files in the Gutenberg corpus
print(f"\nGutenberg Files:\n{gutenberg.fileids()}")  # print the list of available files in the Gutenberg corpus

Sentiment Analysis: Sentiment scores for the text:
{'neg': 0.0, 'neu': 0.221, 'pos': 0.779, 'compound': 0.9336}

WordNet Synsets for 'great':
[Synset('great.n.01'), Synset('great.s.01'), Synset('great.s.02'), Synset('great.s.03'), Synset('bang-up.s.01'), Synset('capital.s.03'), Synset('big.s.13')]

Gutenberg Files:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


<span style="font-size: 16px; color: rgb(237, 4, 245); font-weight: bold">4️⃣ Regular Expressions in NLP</span><br/>
Regular expressions (regex) are powerful tools used to identify, extract, and manipulate specific patterns within text data.<br/>
▪ `Pattern Matching:` Find and match specific sequences of characters (e.g., email addresses, dates, phone numbers) in text.<br/>
▪ `Substring Extraction:` Pull out relevant parts of text that match a defined pattern.<br/>
▪ `Pattern Replacement/Removal:` Substitute or eliminate text patterns, such as removing unwanted symbols or correcting formats.<br/>
▪ `Noise Filtering:` Clean text by filtering out irrelevant or noisy data, making it more suitable for analysis.<br/>
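
The four uses above can be sketched with Python's built-in `re` module. The sample string and the variable names are invented for illustration, and the patterns are deliberately simple rather than production-grade.

```python
import re

# Invented sample text containing a URL, punctuation runs, and symbols.
raw = "Check https://example.com NOW!!! Price: $20 :)"

# Pattern matching: does the text contain a URL at all?
has_url = re.search(r"https?://\S+", raw) is not None

# Substring extraction: pull out dollar amounts.
prices = re.findall(r"\$\d+", raw)

# Pattern removal: drop URLs entirely.
no_urls = re.sub(r"https?://\S+", "", raw)

# Noise filtering: keep only letters, digits, and whitespace,
# then collapse repeated whitespace into single spaces.
no_symbols = re.sub(r"[^A-Za-z0-9\s]", "", no_urls)
clean = re.sub(r"\s+", " ", no_symbols).strip()

print(has_url)
print(prices)
print(clean)
```

The order matters in this sketch: removing the URL first avoids leaving fragments such as `httpsexamplecom` behind once punctuation is stripped.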

<span style="font-size: 15.5px; ; font-weight: bold">Character Ranges and Quantifiers:</span><br/> 
▪ `[A-Za-z]` matches any uppercase or lowercase letter.<br/> 
▪ `{2}` means exactly 2 occurrences of the preceding pattern.<br/> 
▪ `\d{3}` matches exactly 3 digits (where `\d` is any digit from 0 to 9).<br/> 
▪  Square brackets `[]` define a set or range of characters to match.<br/> 
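
A short sketch of these building blocks with the standard `re` module follows. The sample strings are invented, and `\b` (a word boundary) is added here so that `{2}` and `{3}` match whole tokens rather than parts of longer ones.

```python
import re

# [A-Za-z]+ : runs of one or more letters
letters = re.findall(r"[A-Za-z]+", "fox123dog")

# [A-Z]{2} : exactly two uppercase letters, bounded on both sides
codes = re.findall(r"\b[A-Z]{2}\b", "Offices in NY, LA and Boston")

# \d{3} : exactly three digits, bounded on both sides
three_digit = re.findall(r"\b\d{3}\b", "Room 101, code 2024, ext 555")

print(letters)
print(codes)
print(three_digit)
```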

<span style="font-size: 15.5px; ; font-weight: bold">Example:</span><br/> 
`^[\w\.-]+@([\w-]+\.)+[\w-]{2,4}$`

`^[\w\.-]+`: Matches the start of the string and then one or more word characters (`\w`), dots (`.`), or hyphens (`-`). This is the username part of the email.<br/>
`@` : Matches the literal '@' symbol.<br/> 
`([\w-]+\.)+` : Matches one or more groups of word characters or hyphens followed by a dot. This covers subdomains and the main domain.<br/> 
`[\w-]{2,4}$` : Matches the top-level domain (TLD) at the end, which must be 2 to 4 word characters or hyphens.<br/>
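
As a quick check of the pattern described above, the sketch below compiles it with Python's `re` module and tries it on a few invented addresses. The name `EMAIL_RE` and the sample strings are assumptions made for this example.

```python
import re

# The pattern from the example, anchored at both ends with ^ and $.
EMAIL_RE = re.compile(r"^[\w\.-]+@([\w-]+\.)+[\w-]{2,4}$")

candidates = [
    "admin.support_34@example.com",
    "sales-dep@mail.company.org",   # subdomain handled by the repeated group
    "not-an-email",                 # no '@' at all
    "user@example.museum",          # TLD longer than 4 characters
]

results = {c: bool(EMAIL_RE.match(c)) for c in candidates}
for address, ok in results.items():
    print(f"{address}: {'match' if ok else 'no match'}")
```

The last case shows a limitation of this simplified pattern: the `{2,4}` quantifier rejects valid top-level domains longer than four characters.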

In [13]:
from nltk.tokenize import RegexpTokenizer  # Import RegexpTokenizer from NLTK

example_text = "The quick brown fox jumps over the lazy dog, 123-252 times!"  # Example text to tokenize
tokenizer = RegexpTokenizer(r'\w+')  # Create a tokenizer that matches words (alphanumeric sequences)
tokens = tokenizer.tokenize(example_text)  # Tokenize the example text using the regex tokenizer
print("NLTK Regex Tokens:", tokens)

NLTK Regex Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '123', '252', 'times']


In [14]:
import re  # Import the regular expressions module
text_emails = "Contact us at admin.support_34@example.com or sales-dep@company.org for inquiries."  # Example text containing email addresses

# Define a regular expression pattern to match email addresses:
# [a-zA-Z0-9._%+-]+   : Matches one or more allowed characters in the username part (letters, digits, dot, underscore, percent, plus, hyphen)
# @                   : Matches the '@' symbol
# [a-zA-Z0-9.-]+      : Matches one or more allowed characters in the domain name (letters, digits, dot, hyphen)
# \.                  : Matches a literal dot before the domain extension
# [a-zA-Z]{2,}        : Matches the domain extension (at least two letters)
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

emails = re.findall(email_pattern, text_emails)  # Find all email addresses in the text using the regex pattern
print("Detected Emails:", emails)

Detected Emails: ['admin.support_34@example.com', 'sales-dep@company.org']


In [15]:
import spacy  # Import the spaCy library for advanced NLP tasks
from spacy.matcher import Matcher  # Import the Matcher class for pattern matching in spaCy

nlp = spacy.load("en_core_web_sm")  # Load the small English model
matcher = Matcher(nlp.vocab)        # Create a Matcher object

# Define a pattern to match words that start with a capital letter
pattern = [{"TEXT": {"REGEX": "^[A-Z][a-z]+"}}]
matcher.add("CAPITALIZED_WORD", [pattern])

text = "Alice and Bob went to New York City last Friday." # Example sentence
doc = nlp(text)  # Process the text
matches = matcher(doc)  # Apply the matcher to the doc

# Print the matched tokens
print("Capitalized words found in the sentence:")
for match_id, start, end in matches:
    span = doc[start:end]
    print("•", span.text)

Capitalized words found in the sentence:
• Alice
• Bob
• New
• York
• City
• Friday


<span style="font-size: 16px; color: rgb(6, 168, 243); font-weight: bold">5️⃣ Stopwords Removal</span><br/>
Stopwords are common words (such as "the", "is", "in", "and") that are often removed from text before processing, as they typically do not carry significant meaning. Filtering out stopwords helps reduce noise and improves the efficiency of text analysis.

In [16]:
from nltk.corpus import stopwords

text = "This is an example sentence showing off the stop words filtration."
words = nltk.word_tokenize(text)  # tokenize the text into words using nltk
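stop_words = set(stopwords.words('english'))  # build the stopword set once instead of reloading the list for every word
filtered_words = [word for word in words if word.lower() not in stop_words]  # filter out stopwords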
print("Filtered Words:", filtered_words)

Filtered Words: ['example', 'sentence', 'showing', 'stop', 'words', 'filtration', '.']


In [17]:
nlp = spacy.load("en_core_web_sm")  # Load the small English model
doc = nlp(text)  # Process the text with spaCy

# Remove stopwords
filtered_tokens = [token.text for token in doc if not token.is_stop]  # Create a list of tokens that are not stopwords

print("Original Text:")
print(text)
print("\nAfter Stopwords Removal:")  # Print the label for filtered text
print(" ".join(filtered_tokens))  # Print the filtered text without stopwords


Original Text:
This is an example sentence showing off the stop words filtration.

After Stopwords Removal:
example sentence showing stop words filtration .


<span style="font-size: 16px; color: rgb(12, 238, 125); font-weight: bold">Part-of-Speech (POS) Tagging in NLP</span><br/>
POS tagging is the process of labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. This helps computers understand the grammatical structure and meaning of text.<br/>

<span style="font-size: 15.5px; font-weight: bold">Common POS Tags and Examples:</span><br/>
▪ `Noun (N):` Names of people, places, things, or ideas (dog, city, Alice, freedom).<br/>
▪ `Verb (V):` Words that express actions or states (run, is, think, jumps).<br/>
▪ `Adjective (ADJ):` Words that describe nouns (quick, lazy, blue, happy).<br/>
▪ `Adverb (ADV):` Words that modify verbs, adjectives, or other adverbs (quickly, very, often, well).<br/>
▪ `Preposition (P):` Words that show relationships between nouns or pronouns and other words (at, on, in, from, with, near, between, about, under).<br/>
▪ `Conjunction (CON):` Words that connect clauses, sentences, or words (and, or, but, because, so, yet, unless, since, if).<br/>
▪ `Pronoun (PRO):` Words that replace nouns (you, I, we, they, he, she, it, me, us, them, him, her, this).<br/>
▪ `Interjection (INT):` Words or phrases that express emotion or exclamation (Ouch! Wow! Great! Help! Oh! Hey! Hi!).<br/>
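
Taggers rarely print these category names directly: NLTK's default tagger emits Penn Treebank tags (DT, JJ, NN, VBZ, ...), while spaCy's `pos_` attribute uses Universal POS labels (DET, ADJ, NOUN, VERB, ...). The sketch below maps Penn Treebank tags onto the coarse categories listed above. The `PTB_TO_COARSE` table and the `coarse_tag` helper are hand-written assumptions for illustration, cover only a subset of tags, and are not part of NLTK.

```python
# Hand-written, partial mapping from Penn Treebank tag prefixes to coarse
# categories (an illustrative assumption, not an NLTK API).
PTB_TO_COARSE = {
    "NN": "Noun",
    "VB": "Verb",
    "JJ": "Adjective",
    "RB": "Adverb",
    "IN": "Preposition",   # simplification: IN also covers subordinating conjunctions
    "CC": "Conjunction",
    "PRP": "Pronoun",
    "UH": "Interjection",
    "DT": "Determiner",    # not in the list above, added so articles are not left as "Other"
}

def coarse_tag(ptb_tag):
    """Map a Penn Treebank tag to a coarse category via its longest matching prefix."""
    for prefix in sorted(PTB_TO_COARSE, key=len, reverse=True):
        if ptb_tag.startswith(prefix):
            return PTB_TO_COARSE[prefix]
    return "Other"

# Tagged tokens written out by hand in the (word, tag) format that nltk.pos_tag returns.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), (".", ".")]

coarse = [(word, coarse_tag(tag)) for word, tag in tagged]
print(coarse)
```

Prefix matching lets one entry cover a family of tags, for example `NN`, `NNS`, and `NNP` all map to Noun.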

In [22]:
sentence = "The quick brown fox jumps over the lazy dog."

tokens = nltk.word_tokenize(sentence) # Tokenize the sentence
pos_tags = nltk.pos_tag(tokens) # POS tagging using NLTK
print(f"POS Tags using NLTK:\n{pos_tags}")

doc = nlp(sentence) # POS tagging using spaCy
spacy_pos_tags = [(token.text, token.pos_) for token in doc]
print(f"\nPOS Tags using spaCy:\n{spacy_pos_tags}")

POS Tags using NLTK:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]

POS Tags using spaCy:
[('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', 'PUNCT')]


<span style="font-size: 16px; color: rgb(171, 12, 245); font-weight: bold">6️⃣ Chunking and Parsing in NLP</span><br/>
Chunking and parsing are essential techniques in NLP for understanding the structure and meaning of sentences.<br/>
▪ `Group words into meaningful chunks:` Chunking segments sentences into groups of words (such as noun or verb phrases) that function together as a unit.<br/>
▪ `Identify phrase boundaries:` These methods help determine where phrases begin and end, making it easier to extract relevant information.<br/>
▪ `Analyze grammatical structure:` Parsing examines the grammatical relationships between words and phrases, revealing how sentences are constructed.<br/>
▪ `Reveal sentence hierarchy:` Parsing uncovers the hierarchical structure of a sentence, showing how smaller chunks combine to form larger grammatical units.<br/>

In [23]:
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and POS tag the sentence
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

# Define a chunk grammar for noun phrases (NP)
# The pattern "NP: {<DT>?<JJ>*<NN>}" can be broken down as follows:
# - NP: This is the label for the chunk (Noun Phrase).
# - { ... }: The curly braces enclose the pattern to match.
# - <DT>? : An optional determiner (e.g., "the", "a"). The question mark means it may appear 0 or 1 time.
# - <JJ>* : Zero or more adjectives (e.g., "quick", "brown"). The asterisk means any number of adjectives can appear.
# - <NN>  : A singular noun (e.g., "fox", "dog"). This is required at the end of the pattern.
# This grammar matches sequences like "the lazy dog". Because the tagger labels "brown" as NN here,
# "The quick brown" and "fox" come out as two separate NP chunks instead of one.
grammar = "NP: {<DT>?<JJ>*<NN>}"

cp = nltk.RegexpParser(grammar)   # Create a RegexpParser from the chunk grammar

# Parse the POS-tagged sentence to get chunks
tree = cp.parse(pos_tags)

# Print the chunked structure
print("Chunked Sentence Structure:")
print(tree)

tree.pretty_print()

Chunked Sentence Structure:
(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)
                                S                                          
     ___________________________|_______________________________            
    |        |     |            NP               NP             NP         
    |        |     |     _______|________        |       _______|______     
jumps/VBZ over/IN ./. The/DT quick/JJ brown/NN fox/NN the/DT lazy/JJ dog/NN



In [24]:
sen = nlp(sentence)  # Process the sentence using spaCy's NLP pipeline
spacy_noun_chunks = [chunk.text for chunk in sen.noun_chunks]  # Extract noun chunks from the processed sentence using spaCy
print(f"spaCy Noun Chunks List:\n{spacy_noun_chunks}")

# Create a list of noun chunks using NLTK (from the previous chunked tree)
nltk_noun_chunks = []  # Initialize an empty list to store NLTK noun chunks
for subtree in tree.subtrees():  # Iterate over all subtrees in the chunked tree
    if subtree.label() == 'NP':  # Check if the subtree is labeled as a noun phrase (NP)
        chunk = " ".join(word for word, pos in subtree.leaves())  # Join the words in the noun phrase chunk
        nltk_noun_chunks.append(chunk)  # Add the noun phrase chunk to the list
print(f"\nNLTK Noun Chunks List:\n{nltk_noun_chunks}")

spaCy Noun Chunks List:
['The quick brown fox', 'the lazy dog']

NLTK Noun Chunks List:
['The quick brown', 'fox', 'the lazy dog']
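
The two lists differ because the NLTK tagger labelled "brown" as NN and the grammar accepts only a single `<NN>` per chunk. The sketch below shows, in plain Python, what happens when the rule allows one or more nouns. The `chunk_noun_phrases` helper is a hypothetical, simplified matcher written for this example and is not how `RegexpParser` is implemented; the equivalent change in the NLTK grammar would be `NP: {<DT>?<JJ>*<NN>+}`.

```python
def chunk_noun_phrases(tagged):
    """Greedy matcher for: optional DT, any number of JJ, one or more NN."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":       # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":    # one or more nouns
            k += 1
        if k > j:                                # at least one noun was found
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:                                    # no noun phrase starts here
            i += 1
    return chunks

# POS tags copied from the NLTK output shown earlier in this notebook.
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

noun_phrases = chunk_noun_phrases(tagged)
print(noun_phrases)
```

With consecutive nouns merged, the result lines up with spaCy's noun chunks for this sentence.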


<span style="font-size: 16px; color: #eb5e28; font-weight: bold">Hypernyms and Hyponyms in NLP</span><br/>
▪ `Hypernyms` are words that denote a broad category or general class (e.g., "animal" is a hypernym of "dog").<br/>
▪ `Hyponyms` are words that represent a more specific instance within a category (e.g., "poodle" is a hyponym of "dog").<br/>
Recognizing hypernyms and hyponyms helps build a *semantic hierarchy*, allowing NLP systems to understand relationships between general and specific terms.<br/>
This enhances *lexical organization* and improves tasks like information retrieval, question answering, and semantic search by enabling systems to group, relate, and infer meaning from word relationships.<br/>

In [25]:
word = "dog"  # Define the word to look up in WordNet
synsets = wordnet.synsets(word, pos=wordnet.NOUN)  # Get all noun synsets for the word

if synsets:  # Check if any synsets were found
    syn = synsets[0]  # Take the first synset as an example
    print(f"Synset for '{word}': {syn.name()} - {syn.definition()}")  # Print the synset name and definition
    
    # Get hypernyms (more general terms)
    hypernyms = syn.hypernyms()  # Retrieve hypernyms for the synset
    print("\nHypernyms:")
    for h in hypernyms:  # Iterate over each hypernym
        print(f"• {h.name()} - {h.definition()}")  # Print the hypernym name and definition
    
    # Get hyponyms (more specific terms)
    hyponyms = syn.hyponyms()  # Retrieve hyponyms for the synset
    print("\nHyponyms:")
    for h in hyponyms[:5]:  # Show only first 5 hyponyms for brevity
        print(f"• {h.name()} - {h.definition()}")  # Print the hyponym name and definition
else:  # If no synsets were found
    print(f"No synsets found for '{word}'.")  # Print a message indicating no synsets found

Synset for 'dog': dog.n.01 - a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds

Hypernyms:
• canine.n.02 - any of various fissiped mammals with nonretractile claws and typically long muzzles
• domestic_animal.n.01 - any of various animals that have been tamed and made fit for a human environment

Hyponyms:
• basenji.n.01 - small smooth-haired breed of African origin having a tightly curled tail and the inability to bark
• corgi.n.01 - either of two Welsh breeds of long-bodied short-legged dogs with erect ears and a fox-like head
• cur.n.01 - an inferior dog or one of mixed breed
• dalmatian.n.02 - a large breed having a smooth white coat with black or brown spots; originated in Dalmatia
• great_pyrenees.n.01 - bred of large heavy-coated white dogs resembling the Newfoundland


<span style="font-size: 16px; color: #cbaf89; font-weight: bold">Named Entity Recognition (NER)</span><br/>
NER is a fundamental task in NLP that involves identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, and more.<br/>
NER helps extract structured information from unstructured text, enabling applications like information extraction, question answering, and knowledge graph construction.<br/>
Modern NLP libraries like spaCy and NLTK provide built-in tools for performing NER efficiently.<br/>

In [26]:
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."

nlp = spacy.load("en_core_web_sm")  # Load the small English model
doc = nlp(text)

print("spaCy NER Results:")
for ent in doc.ents:
    print(f"• {ent.text} ({ent.label_})")

spaCy NER Results:
• Barack Obama (PERSON)
• Hawaii (GPE)
• 44th (ORDINAL)
• the United States (GPE)


<span style="font-size: 16px;font-weight:bold"> Stemming & Lemmatization</span><br/>
Stemming and Lemmatization are two fundamental techniques in NLP used to reduce words to their root or base forms.<br/>

**Stemming:**<br/>
▪ Stemming is the process of removing suffixes (and sometimes prefixes) from words to obtain their stem or root form.<br/>
▪ The resulting stem may not always be a valid word in the language, but it helps group together words with similar meanings (e.g., "playing", "played", "plays" → "play").<br/>
▪ Stemming algorithms are typically rule-based and fast, but can be less accurate.<br/>

**Lemmatization:**<br/>
▪ Lemmatization reduces words to their base or dictionary form, known as the lemma.<br/>
▪ Unlike stemming, lemmatization considers the context and part of speech of a word, ensuring that the root form is a valid word (e.g., "better" → "good", "running" → "run").<br/>
▪ Lemmatization is generally more accurate but may require more computational resources and linguistic knowledge.<br/>

**Workflow for Stemming and Lemmatization:**<br/>
▪ `Lowercasing:` Convert all text to lowercase for consistency.<br/>
▪ `Tokenization:` Split text into individual words (tokens).<br/>
▪ `Stemming/Lemmatization:` Apply stemming or lemmatization to each token to obtain root forms.<br/>
▪ `Reconstruction (Optional):` Reconstruct the processed tokens back into text for further analysis.<br/>

These techniques are commonly used in text preprocessing to normalize words, improve search results, and enhance the performance of NLP models.<br/>
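
The workflow above can be sketched end to end with the standard library. The `toy_stem` function is a deliberately naive suffix stripper invented for this example; it is not the Porter or Lancaster algorithm, and it only handles a few suffixes.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # toy suffix list, checked in this order

def toy_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    lowered = text.lower()                     # 1. lowercasing
    tokens = re.findall(r"[a-z]+", lowered)    # 2. tokenization (letters only)
    stems = [toy_stem(t) for t in tokens]      # 3. stemming
    return " ".join(stems)                     # 4. reconstruction

normalized = normalize("Playing, played, plays!")
print(normalized)
print(toy_stem("running"))  # naive stripping leaves a doubled consonant
```

The second print illustrates why real stemmers carry extra rules: plain suffix removal turns "running" into "runn", which is not the stem a reader would expect.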

**Difference between Stemming and Lemmatization (with Lancaster Stemmer):**<br/>

The main difference between stemming and lemmatization is that stemming crudely removes word suffixes to arrive at a root form, which may not be a valid word, while lemmatization reduces words to their dictionary form (lemma), considering context and part of speech.<br/>

**Stemming (Lancaster):**<br/>
▪ The Lancaster stemmer is more aggressive than the Porter stemmer, often producing shorter stems.<br/>
▪ Example: "playing", "played", and "plays" all become "play" under both stemmers, but "better" stays "better" with Porter and is cut down to "bet" with Lancaster.<br/>

**Lemmatization:**<br/>
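▪ Lemmatization returns a dictionary word (lemma), but the result depends on the part of speech supplied.<br/>
▪ Example: "better" → "good" when tagged as an adjective, "running" → "run" when tagged as a verb. NLTK's `WordNetLemmatizer` treats words as nouns by default, which is why the table below leaves both unchanged.<br/>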

**Comparison Table:**<br/>
| Word      | Lancaster Stem | Porter Stem | Lemma (WordNet, default noun POS) |
|-----------|:--------------|:------------|:-----------|
| playing   | play           | play        | playing    |
| played    | play           | play        | played     |
| plays     | play           | play        | play       |
| better    | bet            | better      | better     |
| running   | run            | run         | running    |

The Lancaster stemmer can be too aggressive for some applications, while lemmatization is more accurate but slower.
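
The Lemma column in the table leaves "running" and "better" unchanged because the lemmatizer was called without a part of speech and therefore treated each word as a noun. The sketch below imitates that behaviour with a tiny lookup table keyed by (word, POS). `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names with hand-picked entries; this is not how WordNet is implemented. With NLTK, the analogous calls would be `lemmatizer.lemmatize("running", pos="v")` and `lemmatizer.lemmatize("better", pos="a")`.

```python
# Tiny hand-made lemma dictionary keyed by (word, POS); entries are illustrative only.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("played", "v"): "play",
    ("feet", "n"): "foot",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)

default_running = toy_lemmatize("running")        # treated as a noun: no entry, unchanged
verb_running = toy_lemmatize("running", "v")      # treated as a verb
adjective_better = toy_lemmatize("better", "a")   # treated as an adjective

print(default_running, verb_running, adjective_better)
```

In practice the POS argument usually comes from a tagger run on the full sentence, which is what makes lemmatization context-aware.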


In [27]:
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

documents = [
    "Cats are running",
    "Dogs played outside",
]

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()  # Create a stemmer object using the Porter algorithm
lemmatizer = WordNetLemmatizer() # Create a lemmatizer object using WordNet

# Tokenize, Stem, and Lemmatize each document
results = []  # Initialize an empty list to store results for each document
for doc in documents:  # Iterate over each document in the documents list
    tokens = word_tokenize(doc.lower())  # Tokenize the document after converting it to lowercase
    stems = [stemmer.stem(token) for token in tokens]  # Apply stemming to each token
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]  # Apply lemmatization to each token (no POS given, so each token is treated as a noun)
    results.append({  # Append a dictionary with original text, tokens, stems, and lemmas to the results list
        "original": doc,  # Store the original document text
        "tokens": tokens,  # Store the list of tokens
        "stems": stems,  # Store the list of stemmed tokens
        "lemmas": lemmas  # Store the list of lemmatized tokens
    })

# Display results in a DataFrame
df = pd.DataFrame(results)
df.head()

Unnamed: 0,original,tokens,stems,lemmas
0,Cats are running,"[cats, are, running]","[cat, are, run]","[cat, are, running]"
1,Dogs played outside,"[dogs, played, outside]","[dog, play, outsid]","[dog, played, outside]"


In [28]:
# Compare stemming and lemmatization for a few words
sample_words = ["playing", "played", "plays", "better", "running", "feet"]
comparison = []
for word in sample_words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word)
    comparison.append({"word": word, "stem": stem, "lemma": lemma})

print("\nStemming vs Lemmatization Comparison:")
df = pd.DataFrame(comparison)
df.head()  # head() shows only the first five rows, so "feet" is not displayed


Stemming vs Lemmatization Comparison:


Unnamed: 0,word,stem,lemma
0,playing,play,playing
1,played,play,played
2,plays,play,play
3,better,better,better
4,running,run,running


In [30]:
# Load the small English model in spaCy
nlp = spacy.load("en_core_web_sm")

# Example sentences
spacy_docs = [
    "Cats are running",
    "Dogs played outside",
]

# Process each document with spaCy
spacy_results = []
for doc in spacy_docs:
    spacy_doc = nlp(doc)
    tokens = [token.text for token in spacy_doc]
    lemmas = [token.lemma_ for token in spacy_doc]
    pos = [token.pos_ for token in spacy_doc]
    spacy_results.append({
        "original": doc,
        "tokens": tokens,
        "lemmas": lemmas,
        "pos": pos
    })

# Display results in a DataFrame
spacy_df = pd.DataFrame(spacy_results)
print("\nspaCy Lemmatization and POS Tagging:")
spacy_df


spaCy Lemmatization and POS Tagging:


Unnamed: 0,original,tokens,lemmas,pos
0,Cats are running,"[Cats, are, running]","[cat, be, run]","[NOUN, AUX, VERB]"
1,Dogs played outside,"[Dogs, played, outside]","[dog, play, outside]","[NOUN, VERB, ADV]"


<span style="color:rgb(255, 0, 157); font-size: 16.5px; font-weight: bold">Tokenization Concepts</span><br/>
Tokenization is the process of breaking down text into smaller units called tokens. It's a fundamental step in NLP that helps computers understand and process human language by converting text into a format they can work with.<br/>

<span style="font-size: 16.5px; font-weight: bold">Types of tokenization:</span><br/>
▪ `Sentence Tokenization:` Splits text into individual sentences, useful for document-level analysis<br/>
▪ `Word Tokenization:` Breaks text into individual words, essential for word-level processing<br/>
▪ `Regex Tokenization:` Uses regular expressions to extract specific patterns from text<br/>
▪ `Treebank Tokenization:` Follows Penn Treebank conventions for standardized word tokenization<br/>
▪ `WordPunct Tokenization:` Separates words and punctuation into distinct tokens<br/>
▪ `Whitespace Tokenization:` Splits text based on spaces, the simplest form of tokenization<br/>
▪ `Character Tokenization:` Breaks text into individual characters, useful for character-level analysis<br/>
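
Several of the tokenization types listed above come down to a single regular expression or built-in call. The sketch below uses only Python's standard library to show the idea. The pattern `\w+|[^\w\s]+` is, to my understanding, the one NLTK's `WordPunctTokenizer` is built on; treat that as an assumption rather than a statement about NLTK's internals.

```python
import re

sample = "I'm ready to start coding."

# Words only: runs of letters, digits, and underscores (punctuation is dropped)
words_only = re.findall(r"\w+", sample)

# Words and punctuation as separate tokens
words_and_punct = re.findall(r"\w+|[^\w\s]+", sample)

# Whitespace and character tokenization need no regex at all
whitespace = sample.split()
characters = list(sample)

print(words_only)       # ['I', 'm', 'ready', 'to', 'start', 'coding']
print(words_and_punct)  # ['I', "'", 'm', 'ready', 'to', 'start', 'coding', '.']
print(whitespace)       # ["I'm", 'ready', 'to', 'start', 'coding.']
print(len(characters))  # number of characters, spaces included
```

Note how the contraction "I'm" is handled differently by each approach: split into two word pieces, split into three pieces around the apostrophe, or kept whole.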

The [Webtext Corpus](https://paperswithcode.com/dataset/webtext) is a collection of informal web text that can be used to train custom tokenizers. Its varied samples expose a tokenizer to abbreviations, punctuation habits, and sentence boundaries that differ from edited prose, which helps the trained model handle a wider range of text patterns and formats.

In [31]:
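import nltk
from nltk.tokenize import (
    RegexpTokenizer,       # Tokenizer driven by a regular expression
    WordPunctTokenizer,    # Splits text into word and punctuation tokens separately
    sent_tokenize,         # Sentence splitter based on the pre-trained Punkt model
    word_tokenize,         # Standard word tokenizer
)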

txt = "I am learning Natural Language Processing. I'm learning Python programming. It is very user friendly. I'm ready to start coding."

# Using sent_tokenize to split text into sentences
# This is useful when you need to process text at the sentence level
sent_tok = sent_tokenize(txt)
print(f"Sentence tokenization:\n{sent_tok}")

# Using word_tokenize to split text into individual words
# This is useful for word-level analysis and processing
word_tok = word_tokenize(txt)
print(f"\nWord tokenization:\n{word_tok}")

# Using RegexpTokenizer to extract only word characters
# This is useful when you want to remove punctuation and keep only alphanumeric characters
# The pattern r"\w+" matches sequences of alphanumeric characters (letters, digits, and underscores).
# It is used here to extract only "word" tokens, ignoring punctuation and spaces.
tok = RegexpTokenizer(r"\w+")
print(f"\nRegex tokenization (words only):\n{tok.tokenize(txt)}")

# Using TreebankWordTokenizer for standard word tokenization
# This follows the Penn Treebank tokenization conventions
tree_tok = nltk.TreebankWordTokenizer()
print(f"\nTreebankWordTokenizer:\n{tree_tok.tokenize(txt)}")

# Using WordPunctTokenizer to split text into words and punctuation
# This is useful when you need to preserve punctuation as separate tokens
wordpunct_tok = WordPunctTokenizer()  # Named after the class; "Punkt" refers to the sentence tokenizer
print(f"\nWordPunctTokenizer:\n{wordpunct_tok.tokenize(txt)}")

# Using simple whitespace tokenization
# This is the most basic form of tokenization, splitting on spaces
print(f"\nWhitespace tokenization:\n{txt.split()}")

# Using character-level tokenization
# This is useful for character-level analysis or when working with non-standard text
print(f"\nCharacter tokenization:\n{list(txt)}")


Sentence tokenization:
['I am learning Natural Language Processing.', "I'm learning Python programming.", 'It is very user friendly.', "I'm ready to start coding."]

Word tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.', 'I', "'m", 'learning', 'Python', 'programming', '.', 'It', 'is', 'very', 'user', 'friendly', '.', 'I', "'m", 'ready', 'to', 'start', 'coding', '.']

Regex tokenization (words only):
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'I', 'm', 'learning', 'Python', 'programming', 'It', 'is', 'very', 'user', 'friendly', 'I', 'm', 'ready', 'to', 'start', 'coding']

TreebankWordTokenizer:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing.', 'I', "'m", 'learning', 'Python', 'programming.', 'It', 'is', 'very', 'user', 'friendly.', 'I', "'m", 'ready', 'to', 'start', 'coding', '.']

WordPunctTokenizer:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.', 'I', "'", 'm', 'learning', 'Python', 'programming', '.', 'It',

<span style="font-size: 16.5px; font-weight: bold; color:rgb(7, 213, 240)">Custom Tokenizer Training</span><br/>

In [32]:
import nltk.data                 # Used for loading NLTK resources and models

# Load the pre-trained English Punkt tokenizer model
punkt_tok = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

# Read the sample text file (absolute path; adjust it to your project structure).
# The with-statement closes the file automatically once the text has been read.
with open("D:/Natural-language-processing/Data/sample_text.txt", mode='r', encoding='utf-8') as txt_file:
    txt_read = txt_file.read()
print(txt_read)

# Tokenize the text using the loaded Punkt tokenizer
tok = punkt_tok.tokenize(txt_read)
tok

Hello! Mr reza. How are you today? I can't stand this weather.
The sun is too bright and the temperature is unbearable.
I don't know how people can work in these conditions.
Maybe we should move to a cooler place.
What do you think about that?


['Hello!',
 'Mr reza.',
 'How are you today?',
 "I can't stand this weather.",
 'The sun is too bright and the temperature is unbearable.',
 "I don't know how people can work in these conditions.",
 'Maybe we should move to a cooler place.',
 'What do you think about that?']

In [34]:
from nltk.corpus import webtext  # NLTK's web text corpus reader

# Load raw text data from the 'overheard.txt' file in the webtext corpus
text_parameter = webtext.raw('overheard.txt')
# print(text_parameter)

In [35]:
from nltk.tokenize import PunktSentenceTokenizer # Class for sentence tokenization, can be trained on custom data for better accuracy

# The PunktSentenceTokenizer is an unsupervised machine learning sentence boundary detection algorithm.
# By creating a new instance and passing our own text (text_parameter) to it, we are training the tokenizer
# on the specific writing style, abbreviations, and sentence boundaries present in the 'overheard.txt' file.
# This allows the tokenizer to better adapt to the nuances of our dataset, potentially improving sentence splitting accuracy
# compared to the default pre-trained model.
my_tok = PunktSentenceTokenizer(text_parameter)
type(my_tok)

nltk.tokenize.punkt.PunktSentenceTokenizer

In [36]:
# Tokenize the text using the pre-trained sent_tokenize function
pre_token = sent_tokenize(text_parameter)

# Tokenize the text using our custom trained tokenizer
our_token = my_tok.tokenize(text_parameter)

print(f"pre_token[0]: {pre_token[0]}")

print(f"our_token[0]: {our_token[0]}")

pre_token[0]: White guy: So, do you have any plans for this evening?
our_token[0]: White guy: So, do you have any plans for this evening?
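
Because the first sentence is identical under both tokenizers, this check alone does not show whether training changed anything. One way to look further is to find the first position at which two sentence lists diverge. The sketch below uses only the standard library. `first_difference` is a hypothetical helper, and the two sample lists are invented for illustration; they are not output from the tokenizers above.

```python
from itertools import zip_longest

def first_difference(a, b):
    """Return (index, item_a, item_b) for the first mismatch, or None if equal."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, x, y
    return None

# Invented stand-in lists; in the notebook these would be pre_token and our_token
default_split = ["Hello!", "Mr reza.", "How are you today?"]
custom_split = ["Hello!", "Mr reza. How are you today?"]

print(first_difference(default_split, custom_split))
# (1, 'Mr reza.', 'Mr reza. How are you today?')
```

Calling `first_difference(pre_token, our_token)` in the notebook would return `None` if the two tokenizers agree on every sentence, and otherwise the first sentence pair on which they disagree.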


In [37]:
text = "Apple is looking at buying U.K. startup for $1 billion."

# Process the text with spaCy
proc = spacy.load("en_core_web_sm")
doc = proc(text)

token_info = []
for token in doc:
    info = {
        "Token": token.text,
        "Lemma": token.lemma_,
        "Sentence": token.sent.text,
        "POS": token.pos_,
        "Tag": token.tag_,
        "Dep": token.dep_,
        "Shape": token.shape_,
        "Is alpha": token.is_alpha,
        "Is stop": token.is_stop,
        "Is punctuation": token.is_punct,
        "Head": token.head.text,
        "Children": [child.text for child in token.children]
    }
    token_info.append(info)
df = pd.DataFrame(token_info)
df

Unnamed: 0,Token,Lemma,Sentence,POS,Tag,Dep,Shape,Is alpha,Is stop,Is punctuation,Head,Children
0,Apple,Apple,Apple is looking at buying U.K. startup for $1...,PROPN,NNP,nsubj,Xxxxx,True,False,False,looking,[]
1,is,be,Apple is looking at buying U.K. startup for $1...,AUX,VBZ,aux,xx,True,True,False,looking,[]
2,looking,look,Apple is looking at buying U.K. startup for $1...,VERB,VBG,ROOT,xxxx,True,False,False,looking,"[Apple, is, at, .]"
3,at,at,Apple is looking at buying U.K. startup for $1...,ADP,IN,prep,xx,True,True,False,looking,[buying]
4,buying,buy,Apple is looking at buying U.K. startup for $1...,VERB,VBG,pcomp,xxxx,True,False,False,at,[startup]
5,U.K.,U.K.,Apple is looking at buying U.K. startup for $1...,PROPN,NNP,nsubj,X.X.,False,False,False,startup,[]
6,startup,startup,Apple is looking at buying U.K. startup for $1...,VERB,VBD,ccomp,xxxx,True,False,False,buying,"[U.K., for]"
7,for,for,Apple is looking at buying U.K. startup for $1...,ADP,IN,prep,xxx,True,True,False,startup,[billion]
8,$,$,Apple is looking at buying U.K. startup for $1...,SYM,$,quantmod,$,False,False,False,billion,[]
9,1,1,Apple is looking at buying U.K. startup for $1...,NUM,CD,compound,d,False,False,False,billion,[]


**Preprocessing:**<br/>
Preprocessing in NLP typically involves several key steps to clean and standardize text data before analysis or modeling. The main preprocessing steps are:<br/>
▪ `Lowercasing:` Convert all text to lowercase to ensure uniformity (e.g., "Hello" and "hello" become the same token).<br/>
▪ `Punctuation & Special Character Removal:` Remove punctuation marks, non-alphanumeric symbols, and special characters to focus on meaningful words.<br/>
▪ `Tokenization:` Divide text into smaller units—such as sentences or words—to simplify processing and analysis. This step impacts model size, training efficiency, and how well models interpret language.<br/>
▪ `Stop-Word Removal:` Remove common words (like "the", "is", "and") that do not add significant meaning.<br/>
▪ `Stemming and Lemmatization:` Reduce words to their root or base form (e.g., "running" → "run").<br/> 

In [39]:
# Select a text file from Gutenberg (e.g., 'shakespeare-hamlet.txt')
file_id = "shakespeare-hamlet.txt"
raw_text = gutenberg.raw(file_id)

# Step 1: Text Cleaning (Removing Gutenberg Header/Footer)
def clean_text(text):
    lines = text.split("\n")  # Split the raw text into lines
    start_idx, end_idx = 0, len(lines)

    # Locate the Gutenberg start/end markers, if present, and keep only the lines between them
    for i, line in enumerate(lines):
        if "START OF THIS PROJECT GUTENBERG" in line:
            start_idx = i + 1
        if "END OF THIS PROJECT GUTENBERG" in line:
            end_idx = i
            break

    cleaned_lines = lines[start_idx:end_idx]
    cleaned_text = " ".join(cleaned_lines)
    return cleaned_text

text = clean_text(raw_text)

# Step 2: Lowercase
text = text.lower()

# Step 3: Tokenization
tokens = word_tokenize(text)

# Step 4: Remove Punctuation & Stopwords
stop_words = set(stopwords.words("english"))
tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Step 5: Stemming & Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_tokens = [stemmer.stem(word) for word in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

# Step 6: Convert back to text
stemmed_text = " ".join(stemmed_tokens)
lemmatized_text = " ".join(lemmatized_tokens)

# Output Results
print("Cleaned, Lowercased Text (First 500 characters):\n", text[:500])
print("\nStemmed Text (First 500 characters):\n", stemmed_text[:500])
print("\nLemmatized Text (First 500 characters):\n", lemmatized_text[:500])


Cleaned, Lowercased Text (First 500 characters):
 [the tragedie of hamlet by william shakespeare 1599]   actus primus. scoena prima.  enter barnardo and francisco two centinels.    barnardo. who's there?   fran. nay answer me: stand & vnfold your selfe     bar. long liue the king     fran. barnardo?   bar. he     fran. you come most carefully vpon your houre     bar. 'tis now strook twelue, get thee to bed francisco     fran. for this releefe much thankes: 'tis bitter cold, and i am sicke at heart     barn. haue you had quiet guard?   fran. not

Stemmed Text (First 500 characters):
 tragedi hamlet william shakespear 1599 actu primu scoena prima enter barnardo francisco two centinel barnardo fran nay answer stand vnfold self bar long liue king fran barnardo bar fran come care vpon hour bar strook twelu get thee bed francisco fran releef much thank bitter cold sick heart barn haue quiet guard fran mous stir barn well goodnight meet horatio marcellu riual watch bid make hast enter horatio marcellu f