<span style="font-size:16px; font-weight:bold">Welcome to Natural Language Processing (NLP) in Python</span><br/>
Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

<span style="font-size:16px; font-weight:bold">Outline</span><br/>
▪ Introduction to NLP<br/>
▪ Text Dataset Structure<br/>
▪ Text Preprocessing<br/>
▪ Tokenization Concepts<br/>
▪ Stopwords Removal<br/>
▪ Regular Expressions in NLP<br/>
▪ Part-of-Speech (POS) Tagging<br/>
▪ Semantic, Syntactic, and Sentiment Analysis<br/>
▪ Chunking and Parsing<br/>
▪ Named Entity Recognition (NER)<br/>
▪ Hypernyms and Hyponyms<br/>
▪ Stemming and Lemmatization<br/>
▪ Custom Tokenizer Training<br/>

<span style="font-size:16px; font-weight:bold; color:rgb(255, 251, 18)">Introduction to NLP</span><br/>
NLP is a branch of artificial intelligence that allows computers to understand, interpret, and generate human language by combining linguistics, computer science, and machine learning.<br/>

<span style="font-size:16px; font-weight:bold">Key Areas of NLP:</span><br/>
`Text Analysis:`<br/>
▪ Tokenization: Breaking text into words or sentences.<br/>
▪ Part-of-Speech (POS) Tagging: Identifying grammatical components (e.g., nouns, verbs).<br/>
▪ Named Entity Recognition (NER): Extracting entities like names, dates, or organizations.<br/>
▪ Sentiment Analysis: Determining the emotional tone (positive, negative, neutral).<br/>

`Language Generation:`<br/>
▪ Text Summarization: Condensing long texts into shorter summaries.<br/>
▪ Machine Translation: Converting text between languages (e.g., Google Translate).<br/>
▪ Text Generation: Creating human-like text (e.g., chatbots, story generators).<br/>

`Speech Processing:`<br/>
▪ Speech Recognition: Converting spoken words to text (e.g., Siri, Alexa).<br/>
▪ Text-to-Speech (TTS): Generating spoken language from text.<br/>
▪ Voice Assistants: Combining speech recognition and NLP for interactive systems.<br/>

`Semantic Understanding:`<br/>
▪ Word Embeddings: Representing words as vectors (e.g., Word2Vec, BERT).<br/>
▪ Question Answering: Providing precise answers to user queries.<br/>
▪ Dialogue Systems: Enabling conversational agents to maintain context.<br/>

<span style="font-size:16px; font-weight:bold">NLP Challenges:</span><br/>
▪ `Ambiguity & Context Sensitivity:` Words or sentences with multiple meanings.<br/>
▪ `Cultural & Linguistic Nuances:` Variations in idioms, slang, and grammar.<br/>
▪ `Handling Massive Datasets:` Processing large volumes of text efficiently.<br/>
▪ `Continuous Innovation:` Keeping up with evolving models and data sources.<br/>

<span style="font-size:16px; font-weight:bold">NLTK and spaCy Libraries</span><br/>
▪ [NLTK](https://www.nltk.org/): Open-source Python library for educational and research purposes, offering tools for tokenization, stemming, lemmatization, and more.<br/>
▪ [spaCy](https://spacy.io/): Industrial-strength NLP library for real-world applications, optimized for speed and accuracy.<br/>

<span style="font-size:16px; font-weight:bold">NLTK & spaCy Setup</span><br/>
▪ `NLTK's Punkt:` Unsupervised sentence tokenizer.<br/>
▪ `spaCy's en_core_web_sm:` Small English model for tokenization, POS tagging, and NER.<br/>
▪ `WordNet:` Lexical database for semantic analysis.<br/>
▪ `Gutenberg Corpus:` Collection of classic literary texts.<br/>

In [None]:
! pip install spacy
! pip install nltk
! pip install regex==2023.10.3
! pip install spacy-wordnet
! python -m spacy download en_core_web_sm

In [11]:
import nltk
import spacy
import re
import pandas as pd
from nltk.corpus import webtext  # Webtext corpus, useful for training and testing tokenizers
from nltk.corpus import gutenberg, wordnet, stopwords
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer, TreebankWordTokenizer, RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer  # Used in the sentiment analysis cell (needs 'vader_lexicon')
from spacy.matcher import Matcher  # Used in the regular-expression matching cell

# nltk.download('punkt')      # Download the pre-trained sentence tokenizer model ('punkt')
# nltk.download('punkt_tab')  # Download additional tokenizer resources for handling special cases ('punkt_tab')
# nltk.download('wordnet')    # Download the WordNet lexical database for lemmatization
# nltk.download('webtext')    # Download the Webtext corpus containing diverse text samples
# nltk.download('gutenberg')  # Download the Gutenberg corpus for text samples
# nltk.download('stopwords')  # Download the list of common stopwords
# nltk.download('vader_lexicon')  # Download the VADER sentiment analysis lexicon
# nltk.download('averaged_perceptron_tagger_eng')  # Download the averaged perceptron tagger for POS tagging
# nltk.download('averaged_perceptron_tagger', quiet=True)  # Download the averaged perceptron tagger for POS tagging
# !python -m spacy download en_core_web_sm  # Download the small English model for spaCy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

<span style="font-size:16px; font-weight:bold; color:rgb(11, 7, 241)">Text Dataset Structure</span><br/>
▪ `Corpora:` Plural of corpus; several large, structured text collections used for linguistic analysis or model training.<br/>
▪ `Corpus:` A single collection of documents (e.g., the Webtext corpus).<br/>
▪ `Document:` An individual piece of text (e.g., article, tweet).<br/>
▪ `Token:` Smallest unit of text (e.g., word, punctuation) after tokenization.<br/>

In [3]:
corpus = webtext # The entire webtext corpus
documents = corpus.fileids()   # List all documents (fileids) in the corpus
print("Documents in the webtext corpus:")
for doc in documents:
    print(f"• {doc}")

Documents in the webtext corpus:
• firefox.txt
• grail.txt
• overheard.txt
• pirates.txt
• singles.txt
• wine.txt


In [4]:
document_id = documents[0]  # Select the first document ('firefox.txt')
print(f"Selected Document: {document_id}")

raw_text = corpus.raw(document_id)  # Get the raw text of the document
print(f"\nFirst 300 characters of the document:\n{raw_text[:300]}")

Selected Document: firefox.txt

First 300 characters of the document:
Cookie Manager: "Don't allow sites that set removed cookies to set future cookies" should stay checked
When in full screen mode
Pressing Ctrl-N should open a new browser when only download dialog is left open
add icons to context menu
So called "tab bar" should be made a proper toolbar or given 


<span style="font-size:16px; font-weight:bold; color:rgb(255, 165, 0)">Text Preprocessing</span><br/>
Preprocessing involves cleaning and standardizing text data for analysis or modeling.<br/>
▪ `Lowercasing:` Convert text to lowercase for uniformity.<br/>
▪ `Punctuation & Special Character Removal:` Remove non-alphanumeric symbols.<br/>
▪ `Tokenization:` Split text into sentences or words.<br/>
▪ `Stop-Word Removal:` Remove common, low-meaning words.<br/>
▪ `Stemming/Lemmatization:` Reduce words to their root or base form.<br/>
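
The first four steps above can be sketched as one small function using only the standard library. This is a minimal illustration, not the NLTK pipeline: the `STOP_WORDS` set and the `preprocess` name are hypothetical, and the stop-word list is far shorter than NLTK's English list.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    text = text.lower()                        # 1. lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # 2. punctuation / special characters -> space
    tokens = text.split()                      # 3. whitespace tokenization
    return [tok for tok in tokens if tok not in stop_words]  # 4. stop-word removal

print(preprocess("The King is dead, and the Queen weeps in silence!"))
```

Stemming or lemmatization would then be applied to the returned tokens as a final step.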

In [12]:
file_id = 'shakespeare-hamlet.txt'
raw_text = gutenberg.raw(file_id)

def clean_text(text):
    lines = text.split('\n')
    start_idx, end_idx = 0, len(lines)
    for i, line in enumerate(lines):
        if 'START OF THIS PROJECT GUTENBERG' in line:
            start_idx = i + 1
        if 'END OF THIS PROJECT GUTENBERG' in line:
            end_idx = i
            break
    cleaned_lines = lines[start_idx:end_idx]
    cleaned_text = ' '.join(cleaned_lines)
    return cleaned_text

text = clean_text(raw_text)
text = text.lower()
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stemmed_tokens = [stemmer.stem(word) for word in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
stemmed_text = ' '.join(stemmed_tokens)
lemmatized_text = ' '.join(lemmatized_tokens)

print('Original Text (First 500 characters):\n', text[:500])
print('\nStemmed Text (First 500 characters):\n', stemmed_text[:500])
print('\nLemmatized Text (First 500 characters):\n', lemmatized_text[:500])

Original Text (First 500 characters):
 [the tragedie of hamlet by william shakespeare 1599]   actus primus. scoena prima.  enter barnardo and francisco two centinels.    barnardo. who's there?   fran. nay answer me: stand & vnfold your selfe     bar. long liue the king     fran. barnardo?   bar. he     fran. you come most carefully vpon your houre     bar. 'tis now strook twelue, get thee to bed francisco     fran. for this releefe much thankes: 'tis bitter cold, and i am sicke at heart     barn. haue you had quiet guard?   fran. not

Stemmed Text (First 500 characters):
 tragedi hamlet william shakespear 1599 actu primu scoena prima enter barnardo francisco two centinel barnardo fran nay answer stand vnfold self bar long liue king fran barnardo bar fran come care vpon hour bar strook twelu get thee bed francisco fran releef much thank bitter cold sick heart barn haue quiet guard fran mous stir barn well goodnight meet horatio marcellu riual watch bid make hast enter horatio marcellu f

<span style="font-size:16px; font-weight:bold; color:rgb(255, 0, 157)">4️⃣ Tokenization Concepts</span><br/>
Tokenization breaks text into smaller units (tokens) for processing.<br/>
▪ Sentence Tokenization: Splits text into sentences.<br/>
▪ Word Tokenization: Breaks text into words.<br/>
▪ Regex Tokenization: Extracts patterns using regular expressions.<br/>
▪ Treebank Tokenization: Follows Penn Treebank conventions.<br/>
▪ WordPunct Tokenization: Separates words and punctuation.<br/>
▪ Whitespace Tokenization: Splits on spaces.<br/>
▪ Character Tokenization: Breaks into individual characters.<br/>
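
To see how the strategies differ on contractions and punctuation, the sketch below approximates four of them with the standard library. The `\w+|[^\w\s]+` pattern mirrors the one `WordPunctTokenizer` is built on; the sample sentence is an arbitrary example.

```python
import re

sample = "I'm ready, aren't you?"

whitespace_tokens = sample.split()                      # punctuation stays glued to words
word_only_tokens = re.findall(r"\w+", sample)           # punctuation dropped, contractions split
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample)  # words and punctuation kept as separate tokens
char_tokens = list(sample)                              # every character, including spaces

print(whitespace_tokens)
print(word_only_tokens)
print(wordpunct_tokens)
print(len(char_tokens))
```

Which variant is appropriate depends on the task: punctuation matters for sentiment or parsing, but is usually noise for bag-of-words models.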

In [None]:
sentences = sent_tokenize(raw_text)
print(f'Number of sentences in the document: {len(sentences)}')
print(f'\nFirst sentence:\n{sentences[0]}')

In [None]:
tokens = word_tokenize(raw_text)
print(f'Number of tokens in the document: {len(tokens)}')
print(f'First 10 tokens:\n{tokens[:10]}')

In [None]:
txt = "I am learning Natural Language Processing. I'm learning Python programming. It is very user friendly. I'm ready to start coding."
print(f'Sentence tokenization:\n{sent_tokenize(txt)}')
print(f'\nWord tokenization:\n{word_tokenize(txt)}')
tok = RegexpTokenizer(r'\w+')
print(f'\nRegex tokenization (words only):\n{tok.tokenize(txt)}')
tree_tok = TreebankWordTokenizer()
print(f'\nTreebankWordTokenizer:\n{tree_tok.tokenize(txt)}')
wordpunct_tok = WordPunctTokenizer()  # Splits on word characters vs. punctuation (not the Punkt sentence tokenizer)
print(f'\nWordPunctTokenizer:\n{wordpunct_tok.tokenize(txt)}')
print(f'\nWhitespace tokenization:\n{txt.split()}')
print(f'\nCharacter tokenization:\n{list(txt)}')

<span style="font-size:16px; font-weight:bold; color:rgb(6, 168, 243)">5️⃣ Stopwords Removal</span><br/>
Stopwords are common words (e.g., 'the', 'is', 'and') removed to reduce noise and improve analysis efficiency.<br/>
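
Stop-word removal is not always harmless: generic lists often contain negations, and dropping them can change the meaning of a sentence. The sketch below shows one possible workaround, a keep-list. The `STOP_WORDS` and `NEGATIONS` sets and the `remove_stopwords` helper are hypothetical names used only for illustration.

```python
# Hypothetical mini stop-word list; real lists are much longer.
STOP_WORDS = {"this", "is", "a", "the", "not", "was"}
# Words to keep even if they appear in the stop-word list.
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words unless they are also in keep."""
    return [t for t in tokens if t.lower() not in stop_words or t.lower() in keep]

tokens = "The movie was not good".split()
plain = remove_stopwords(tokens, STOP_WORDS)
negation_aware = remove_stopwords(tokens, STOP_WORDS, keep=NEGATIONS)

print(plain)           # the negation is lost
print(negation_aware)  # the negation is preserved
```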

In [None]:
text = 'This is an example sentence showing off the stop words filtration.'
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))  # Build the set once instead of reloading the list for every word
filtered_words = [word for word in words if word.lower() not in stop_words]
print('Filtered Words:', filtered_words)

In [None]:
doc = nlp(text)
filtered_tokens = [token.text for token in doc if not token.is_stop]
print('Original Text:')
print(text)
print('\nAfter Stopwords Removal:')
print(' '.join(filtered_tokens))

<span style="font-size:16px; font-weight:bold; color:rgb(237, 4, 245)">6️⃣ Regular Expressions in NLP</span><br/>
Regular expressions (regex) identify, extract, and manipulate text patterns.<br/>
▪ Pattern Matching: Find sequences like emails or dates.<br/>
▪ Substring Extraction: Extract specific text parts.<br/>
▪ Pattern Replacement/Removal: Substitute or remove patterns.<br/>
▪ Noise Filtering: Clean irrelevant data.<br/>
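
The four uses above map onto a handful of functions in Python's built-in `re` module. The snippet is a small sketch; the sample string, the order-number format, and the `<EMAIL>` placeholder are made up for the example.

```python
import re

text = "Order #A-1042 shipped on 2024-03-15!!!   Contact: help@example.com"

# Pattern matching: find every ISO-style date.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Substring extraction: capture the order id that follows '#'.
match = re.search(r"#([A-Z]-\d+)", text)
order_id = match.group(1) if match else None

# Pattern replacement: mask e-mail addresses.
masked = re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<EMAIL>", text)

# Noise filtering: collapse repeated '!' and runs of whitespace.
cleaned = re.sub(r"\s+", " ", re.sub(r"!{2,}", "!", masked))

print(dates, order_id)
print(cleaned)
```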

<span style="font-size:15.5px; font-weight:bold">Character Ranges and Quantifiers:</span><br/>
▪ `[A-Za-z]`: Matches any single ASCII letter.<br/>
▪ `{2}`: Exactly 2 occurrences of the preceding element.<br/>
▪ `\d{3}`: Exactly 3 digits.<br/>
▪ `[]`: Defines a character set or range.<br/>
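
A few quick checks with the `re` module make these building blocks concrete. The test strings are arbitrary; the last two lines show that a quantifier alone does not stop a match from starting inside a longer run of digits, which is why word boundaries (`\b`) are often added.

```python
import re

one_letter = re.fullmatch(r"[A-Za-z]", "q") is not None        # a single letter matches
digit_as_letter = re.fullmatch(r"[A-Za-z]", "7") is not None   # a digit is outside the set
two_letters = re.fullmatch(r"[A-Za-z]{2}", "US") is not None   # exactly two letters
three_letters = re.fullmatch(r"[A-Za-z]{2}", "USA") is not None  # three letters is too many

loose = re.findall(r"\d{3}", "call 555-0199")       # also matches inside the 4-digit run
strict = re.findall(r"\b\d{3}\b", "call 555-0199")  # only stand-alone 3-digit groups

print(one_letter, digit_as_letter, two_letters, three_letters)
print(loose, strict)
```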

<span style="font-size:15.5px; font-weight:bold">Example Email Regex:</span><br/>
`^[\w\.-]+@([\w-]+\.)+[\w-]{2,4}$`<br/>
▪ `^[\w\.-]+`: Username (word characters, dots, hyphens).<br/>
▪ `@`: Literal '@' symbol.<br/>
▪ `([\w-]+\.)+`: One or more domain labels, each followed by a dot.<br/>
▪ `[\w-]{2,4}$`: Top-level domain (2-4 characters, so longer TLDs such as .museum are rejected).<br/>
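
The pattern can be tried against a few candidate strings to see what it accepts. This is a quick sketch; the addresses are invented, and the last one illustrates the 2-4 character limit on the top-level domain.

```python
import re

EMAIL_RE = re.compile(r"^[\w\.-]+@([\w-]+\.)+[\w-]{2,4}$")

candidates = [
    "admin.support_34@example.com",  # plain address
    "sales-dep@mail.company.org",    # subdomain
    "no-at-sign.example.com",        # missing '@'
    "curator@gallery.museum",        # 6-character top-level domain
]

results = {address: bool(EMAIL_RE.match(address)) for address in candidates}
for address, ok in results.items():
    print(f"{address}: {ok}")
```

For real validation a more permissive TLD length (for example `{2,}`) is usually a safer choice.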

In [None]:
example_text = 'The quick brown fox jumps over the lazy dog, 123-252 times!'
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(example_text)
print('NLTK Regex Tokens:', tokens)

In [None]:
text_emails = 'Contact us at admin.support_34@example.com or sales-dep@company.org for inquiries.'
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text_emails)
print('Detected Emails:', emails)

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': {'REGEX': '^[A-Z][a-z]+'}}]
matcher.add('CAPITALIZED_WORD', [pattern])
text = 'Alice and Bob went to New York City last Friday.'
doc = nlp(text)
matches = matcher(doc)
print('Capitalized words found in the sentence:')
for match_id, start, end in matches:
    span = doc[start:end]
    print('•', span.text)

<span style="font-size:16px; font-weight:bold; color:rgb(12, 238, 125)">7️⃣ Part-of-Speech (POS) Tagging</span><br/>
POS tagging labels words with their grammatical roles (e.g., noun, verb).<br/>
▪ Noun (N): Names of people, places, things.<br/>
▪ Verb (V): Actions or states.<br/>
▪ Adjective (ADJ): Describes nouns.<br/>
▪ Adverb (ADV): Modifies verbs, adjectives, or adverbs.<br/>
▪ Preposition (P): Shows relationships (e.g., at, on, in).<br/>
▪ Conjunction (CON): Connects clauses/words (e.g., and, or).<br/>
▪ Pronoun (PRO): Replaces nouns (e.g., you, I).<br/>
▪ Interjection (INT): Expresses emotion (e.g., Wow!, Oh!).<br/>
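
NLTK's default tagger returns Penn Treebank tags (e.g., `NN`, `VBD`) rather than the coarse categories listed above. The sketch below maps tag prefixes to those categories. The `coarse_category` helper and the hand-tagged sentence are illustrative assumptions, not tagger output.

```python
# Penn Treebank tag prefix -> coarse category from the list above.
# Note: 'IN' also covers subordinating conjunctions in the Penn tag set.
PREFIX_TO_CATEGORY = [
    ("NN", "Noun"),          # NN, NNS, NNP, NNPS
    ("VB", "Verb"),          # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "Adjective"),     # JJ, JJR, JJS
    ("RB", "Adverb"),        # RB, RBR, RBS
    ("IN", "Preposition"),
    ("CC", "Conjunction"),
    ("PRP", "Pronoun"),      # PRP, PRP$
    ("UH", "Interjection"),
]

def coarse_category(tag):
    """Return the coarse category for a Penn Treebank tag, or 'Other'."""
    for prefix, category in PREFIX_TO_CATEGORY:
        if tag.startswith(prefix):
            return category
    return "Other"

# Hand-tagged example sentence (tags assigned manually for illustration).
tagged = [("Wow", "UH"), ("she", "PRP"), ("quickly", "RB"), ("fixed", "VBD"),
          ("the", "DT"), ("old", "JJ"), ("bike", "NN"), ("and", "CC"),
          ("rode", "VBD"), ("into", "IN"), ("town", "NN")]

categories = [(word, coarse_category(tag)) for word, tag in tagged]
print(categories)
```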

In [None]:
sentence = 'The quick brown fox jumps over the lazy dog.'
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(f'POS Tags using NLTK:\n{pos_tags}')
doc = nlp(sentence)
spacy_pos_tags = [(token.text, token.pos_) for token in doc]
print(f'\nPOS Tags using spaCy:\n{spacy_pos_tags}')

<span style="font-size:16px; font-weight:bold; color:rgb(38, 255, 18)">8️⃣ Semantic, Syntactic, and Sentiment Analysis</span><br/>
▪ Semantic Analysis: Understands meaning in context.<br/>
▪ Syntactic Analysis: Examines grammatical structure.<br/>
▪ Sentiment Analysis: Determines emotional tone (positive, negative, neutral).<br/>
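
Lexicon-based sentiment analysis can be illustrated with a toy scorer: each known word carries a polarity score, and the sum decides the label. The `LEXICON` values and the `sentiment` function below are invented for illustration; real lexicons are far larger and also handle negation, intensifiers, and punctuation.

```python
import re

# Hypothetical polarity lexicon, for illustration only.
LEXICON = {"love": 2.0, "amazing": 2.0, "fun": 1.0, "boring": -1.5, "hate": -2.0}

def sentiment(text):
    """Sum word polarities and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

print(sentiment("I love natural language processing! It's amazing and fun."))
print(sentiment("The lecture was boring."))
print(sentiment("It rained."))
```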

In [None]:
word = 'bank'
synsets = wordnet.synsets(word)
print(f'Semantic Analysis: Synsets for \'{word}\':')
for syn in synsets:
    print(f'• {syn.name()}: {syn.definition()}')

In [None]:
sentence = 'The quick brown fox jumps over the lazy dog.'
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(f'Syntactic Analysis: POS Tags for the sentence:\n{pos_tags}')

In [None]:
sia = SentimentIntensityAnalyzer()
text = 'I love natural language processing! It\'s amazing and fun.'
sentiment_scores = sia.polarity_scores(text)
print(f'Sentiment Analysis: Sentiment scores for the text:\n{sentiment_scores}')

<span style="font-size:16px; font-weight:bold; color:rgb(171, 12, 245)">9️⃣ Chunking and Parsing</span><br/>
▪ Chunking: Groups words into meaningful phrases (e.g., noun phrases).<br/>
▪ Parsing: Analyzes grammatical structure and sentence hierarchy.<br/>
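
A noun-phrase chunk rule such as "optional determiner, any number of adjectives, then a noun" can be implemented by hand to show what a chunker does with POS tags. The function below is a simplified greedy matcher written for this illustration, and the tagged sentence is hand-labelled rather than produced by a tagger.

```python
def chunk_noun_phrases(tagged):
    """Collect spans matching: optional DT, zero or more JJ, then one NN."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # any number of adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":      # required noun closes the chunk
            chunks.append(" ".join(word for word, _ in tagged[i:j + 1]))
            i = j + 1
        else:
            i += 1                              # no chunk starts here; move on
    return chunks

# Hand-tagged sentence (tags assigned manually for illustration).
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

print(chunk_noun_phrases(tagged))
```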

In [None]:
sentence = 'The quick brown fox jumps over the lazy dog.'
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
grammar = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(grammar)
tree = cp.parse(pos_tags)
print('Chunked Sentence Structure:')
print(tree)
tree.pretty_print()

In [None]:
sen = nlp(sentence)
spacy_noun_chunks = [chunk.text for chunk in sen.noun_chunks]
print(f'spaCy Noun Chunks List:\n{spacy_noun_chunks}')
nltk_noun_chunks = []
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        chunk = ' '.join(word for word, pos in subtree.leaves())
        nltk_noun_chunks.append(chunk)
print(f'\nNLTK Noun Chunks List:\n{nltk_noun_chunks}')

<span style="font-size:16px; font-weight:bold; color:rgb(171, 12, 245)">🔟 Named Entity Recognition (NER)</span><br/>
NER identifies and classifies entities (e.g., person, organization, location) in text.<br/>

In [None]:
text = 'Barack Obama was born in Hawaii and served as the 44th President of the United States.'
doc = nlp(text)
print('spaCy NER Results:')
for ent in doc.ents:
    print(f'• {ent.text} ({ent.label_})')

<span style="font-size:16px; font-weight:bold; color:#eb5e28">1️⃣1️⃣ Hypernyms and Hyponyms</span><br/>
▪ Hypernyms: General category terms (e.g., 'animal' for 'dog').<br/>
▪ Hyponyms: Specific instances (e.g., 'poodle' for 'dog').<br/>
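
The two relations are inverse directions in the same tree. The sketch below uses a tiny hand-made taxonomy (the `HYPERNYM_OF` dictionary is a simplified assumption, not WordNet data) to walk up to more general terms and down to more specific ones.

```python
# Hypothetical, heavily simplified taxonomy: child -> parent (hypernym).
HYPERNYM_OF = {
    "poodle": "dog", "beagle": "dog",
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal",
}

def hypernym_chain(word):
    """Follow parent links upward and return the increasingly general terms."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain

def hyponyms(word):
    """Return the direct children (more specific terms) of a word."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == word)

print(hypernym_chain("poodle"))
print(hyponyms("dog"))
```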

In [None]:
word = 'dog'
synsets = wordnet.synsets(word, pos=wordnet.NOUN)
if synsets:
    syn = synsets[0]
    print(f'Synset for \'{word}\': {syn.name()} - {syn.definition()}')
    hypernyms = syn.hypernyms()
    print('\nHypernyms:')
    for h in hypernyms:
        print(f'• {h.name()} - {h.definition()}')
    hyponyms = syn.hyponyms()
    print('\nHyponyms:')
    for h in hyponyms[:5]:
        print(f'• {h.name()} - {h.definition()}')
else:
    print(f'No synsets found for \'{word}\'.')

<span style="font-size:16px; font-weight:bold; color:#cbaf89">1️⃣2️⃣ Stemming and Lemmatization</span><br/>
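▪ Stemming: Strips suffixes with heuristic rules to get a root form (e.g., 'playing' → 'play'); the result is not always a dictionary word (e.g., 'tragedie' → 'tragedi').<br/>
▪ Lemmatization: Reduces a word to its dictionary form using vocabulary and part of speech (e.g., 'better' → 'good' when treated as an adjective).<br/>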

<span style="font-size:15.5px; font-weight:bold">Comparison Table:</span><br/>
| Word      | Lancaster Stem | Porter Stem | Lemma      |
|-----------|:--------------|:------------|:-----------|
| playing   | play           | play        | play       |
| played    | play           | play        | play       |
| plays     | play           | play        | play       |
| better    | bet            | better      | good       |
| running   | run            | run         | run        |


In [None]:
documents = ['Cats are running', 'Dogs played outside']
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
results = []
for doc in documents:
    tokens = word_tokenize(doc.lower())
    stems = [stemmer.stem(token) for token in tokens]
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    results.append({'original': doc, 'tokens': tokens, 'stems': stems, 'lemmas': lemmas})
df = pd.DataFrame(results)
df.head()

In [None]:
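# WordNetLemmatizer treats every word as a noun unless a POS is given,
# so each sample word is paired with its WordNet POS ('v' = verb, 'a' = adjective).
sample_words = [('playing', 'v'), ('played', 'v'), ('plays', 'v'), ('better', 'a'), ('running', 'v')]
comparison = []
for word, pos in sample_words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    comparison.append({'word': word, 'pos': pos, 'stem': stem, 'lemma': lemma})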
print('\nStemming vs Lemmatization Comparison:')
df = pd.DataFrame(comparison)
df.head()

In [None]:
spacy_docs = ['Cats are running', 'Dogs played outside']
spacy_results = []
for doc in spacy_docs:
    spacy_doc = nlp(doc)
    tokens = [token.text for token in spacy_doc]
    lemmas = [token.lemma_ for token in spacy_doc]
    pos = [token.pos_ for token in spacy_doc]
    spacy_results.append({'original': doc, 'tokens': tokens, 'lemmas': lemmas, 'pos': pos})
spacy_df = pd.DataFrame(spacy_results)
print('\nspaCy Lemmatization and POS Tagging:')
spacy_df

<span style="font-size:16px; font-weight:bold; color:rgb(7, 213, 240)">1️⃣3️⃣ Custom Tokenizer Training</span><br/>
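Training a tokenizer on text from the target domain can improve sentence-boundary detection when that text differs from the data the default model was trained on (e.g., chat logs or transcripts).<br/>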
Note: The following cell is commented out as it requires a local file. Replace the file path with your own text file or use an available corpus.<br/>
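
Abbreviations are a common reason sentence splitting goes wrong: a period after "Dr." does not end a sentence. The sketch below contrasts a naive regex splitter with one that consults an abbreviation list. Here the list is hard-coded as an assumption; a trained tokenizer would instead derive such information from the training text. Both function names are hypothetical.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def abbreviation_aware_split(text, abbreviations):
    """Re-join naive pieces that end in a known abbreviation."""
    sentences, current = [], []
    for piece in naive_split(text):
        current.append(piece)
        last_word = piece.split()[-1].lower()
        if last_word not in abbreviations:   # a real sentence end
            sentences.append(" ".join(current))
            current = []
    if current:                              # trailing text ending in an abbreviation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith met Mr. Jones at 5 p.m. sharp. They talked for an hour."
ABBREVIATIONS = {"dr.", "mr.", "p.m."}  # hard-coded for illustration

print(naive_split(text))
print(abbreviation_aware_split(text, ABBREVIATIONS))
```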

In [None]:
# import nltk.data
# punkt_tok = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
# txt_file = open('sample_text.txt', mode='r', encoding='utf-8')
# txt_read = txt_file.read()
# print(txt_read)
# tok = punkt_tok.tokenize(txt_read)
# txt_file.close()
# tok

In [None]:
from nltk.tokenize import PunktSentenceTokenizer

text_parameter = webtext.raw('overheard.txt')
my_tok = PunktSentenceTokenizer(text_parameter)  # Train a Punkt tokenizer on the raw text
pre_token = sent_tokenize(text_parameter)
our_token = my_tok.tokenize(text_parameter)
print(f'pre_token[0]: {pre_token[0]}')
print(f'our_token[0]: {our_token[0]}')

In [None]:
text = 'Apple is looking at buying U.K. startup for $1 billion.'
doc = nlp(text)
results = []
for token in doc:
    results.append({
        'Token': token.text,
        'Lemma': token.lemma_,
        'Sentence': str(token.sent),
        'POS': token.pos_,
        'Tag': token.tag_,
        'Dep': token.dep_,
        'Shape': token.shape_,
        'Is alpha': token.is_alpha,
        'Is stop': token.is_stop,
        'Is punctuation': token.is_punct,
        'Head': token.head.text,
        'Children': [child.text for child in token.children]
    })
df = pd.DataFrame(results)
df