**Welcome to Natural Language Processing (NLP) in Python**<br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

**Outline:**<br/>
1️⃣ Introduction to NLP<br/>
▪ Overview of Natural Language Processing<br/>
▪ Key areas: Text Analysis, Language Generation, Speech Processing, Semantic Understanding<br/>
▪ Introduction to NLTK library and its capabilities<br/>

2️⃣ Tokenization Concepts<br/>
▪ Definition and importance of tokenization<br/>
▪ Types of tokenization: Sentence, Word, Regex, Treebank, WordPunct, Whitespace, Character<br/>
▪ Introduction to Webtext Corpus for training tokenizers<br/>

3️⃣ Tokenization Examples<br/>
▪ Demonstrating different tokenization techniques on sample text<br/>
▪ Comparing outputs of various tokenizers (sent_tokenize, word_tokenize, RegexpTokenizer, TreebankWordTokenizer, WordPunctTokenizer, whitespace, character)<br/>

4️⃣ Working with External Text Files<br/>
▪ Reading and processing text from a file<br/>
▪ Analyzing text length and basic properties<br/>

5️⃣ Custom Tokenizer Training<br/>
▪ Loading and using pre-trained Punkt tokenizer<br/>
▪ Training a custom PunktSentenceTokenizer with Webtext corpus data<br/>
▪ Comparing pre-trained and custom tokenizer outputs<br/>

6️⃣ Practical Applications<br/>
▪ Applying tokenization to real-world text data<br/>
▪ Understanding the role of tokenization in NLP pipelines<br/>

<span style="color:#ee0b0b; font-size:20px; font-weight:bold;">1️⃣ Introduction to NLP</span><br/>
NLP is a branch of artificial intelligence that allows computers to understand, interpret, and generate human language by combining linguistics, computer science, and machine learning.<br/>

**Key Areas of NLP:**<br/>
`Text Analysis:`<br/>
▪ Tokenization: Breaking text into words or sentences.<br/>
▪ Part-of-Speech (POS) Tagging: Identifying grammatical components (e.g., nouns, verbs).<br/>
▪ Named Entity Recognition (NER): Extracting entities like names, dates, or organizations.<br/>
▪ Sentiment Analysis: Determining the emotional tone (positive, negative, neutral).<br/>

`Language Generation:`<br/>
▪ Text Summarization: Condensing long texts into shorter summaries.<br/>
▪ Machine Translation: Converting text between languages (e.g., Google Translate).<br/>
▪ Text Generation: Creating human-like text (e.g., chatbots, story generators).<br/>

`Speech Processing:`<br/>
▪ Speech Recognition: Converting spoken words to text (e.g., Siri, Alexa).<br/>
▪ Text-to-Speech (TTS): Generating spoken language from text.<br/>
▪ Voice Assistants: Combining speech recognition and NLP for interactive systems.<br/>

`Semantic Understanding:`<br/>
▪ Word Embeddings: Representing words as vectors (e.g., static Word2Vec embeddings or contextual BERT embeddings).<br/>
▪ Question Answering: Providing precise answers to user queries.<br/>
▪ Dialogue Systems: Enabling conversational agents to maintain context.<br/>

**[NLTK](https://www.nltk.org/)**<br/>
▪ `Tokenization:` Breaking text into smaller units (words, sentences, or phrases)<br/>
▪ `Stemming:` Reducing words to a stem by stripping affixes (mostly suffixes); the stem need not be a dictionary word<br/>
▪ `Tagging:` Assigning grammatical categories (POS tags) to words<br/>
▪ `Parsing:` Analyzing sentence structure and grammatical relationships<br/>
▪ `Semantic Reasoning:` Understanding meaning and relationships between words/concepts<br/>
▪ `Wrappers:` Interfaces to external NLP toolkits such as Stanford CoreNLP<br/>
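
As a small illustration of one capability from the list above, the sketch below stems a few words with NLTK's `PorterStemmer`, which needs no downloaded data. The word list is a hypothetical example chosen for this notebook, and the Porter algorithm is only one of several stemmers NLTK provides.

```python
from nltk.stem import PorterStemmer

# Hypothetical example words; any English words can be used here.
words = ["learning", "running", "processed", "tokens"]

stemmer = PorterStemmer()

# stem() lowercases the word and strips common suffixes.
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Stems are not guaranteed to be real words, so they are best treated as normalized keys (for search or counting) rather than as text to show to users.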

**Custom Tokenizer Training**<br/>
We can train our own sentence tokenizer using custom text data. This allows the tokenizer to learn patterns specific to our domain.<br/>

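**Webtext Corpus**<br/>
NLTK's `webtext` corpus (`nltk.corpus.webtext`) is a small collection of informal web text, including forum posts, overheard conversations, and reviews; this notebook uses its `overheard.txt` file. It is not the same dataset as OpenAI's much larger web-scraped [WebText](https://paperswithcode.com/dataset/webtext). Because the text is informal and conversational, it is a useful sample for training a sentence tokenizer on non-standard writing.<br/>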

<span style="color:rgb(15, 12, 226); font-size:20px; font-weight:bold;">Importing libraries</span>

In [1]:
# !pip install nltk
import nltk
import nltk.data
from nltk.corpus import webtext
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer, PunktSentenceTokenizer, WordPunctTokenizer

In [None]:
# Download the NLTK resources used in this notebook:
# - 'punkt': Pre-trained Punkt sentence tokenizer models (pickle format, used by older NLTK releases)
# - 'punkt_tab': The Punkt models in tab-separated format, used by newer NLTK releases
# - 'webtext': Small corpus of informal web text, used below to train a custom tokenizer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('webtext')

In [2]:
txt = "I am learning Natural Language Processing. I'm learning Python programming. It is very user friendly. I'm ready to start coding."

# Using sent_tokenize to split text into sentences
# This is useful when you need to process text at the sentence level
print(f"Sentence tokenization:\n{sent_tokenize(txt)}")

# Using word_tokenize to split text into individual words
# This is useful for word-level analysis and processing
print(f"\nWord tokenization:\n{word_tokenize(txt)}")

# Using RegexpTokenizer to extract only word characters
# This is useful when you want to remove punctuation and keep only alphanumeric characters
tokenizer = RegexpTokenizer(r'\w+')
print(f"\nRegex tokenization (words only):\n{tokenizer.tokenize(txt)}")

# Using TreebankWordTokenizer for standard word tokenization
# This follows the Penn Treebank tokenization conventions
treebank_tokenizer = nltk.TreebankWordTokenizer()
print(f"\nTreebankWordTokenizer:\n{treebank_tokenizer.tokenize(txt)}")

# Using WordPunctTokenizer to split text into words and punctuation
# This is useful when you need to preserve punctuation as separate tokens
wordpunct_tokenizer = WordPunctTokenizer()
print(f"\nWordPunctTokenizer:\n{wordpunct_tokenizer.tokenize(txt)}")

# Using simple whitespace tokenization
# This is the most basic form of tokenization: str.split() splits on any run of whitespace
print(f"\nWhitespace tokenization:\n{txt.split()}")

# Using character-level tokenization
# This is useful for character-level analysis or when working with non-standard text
print(f"\nCharacter tokenization:\n{list(txt)}")


Sentence tokenization:
['I am learning Natural Language Processing.', "I'm learning Python programming.", 'It is very user friendly.', "I'm ready to start coding."]

Word tokenization:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.', 'I', "'m", 'learning', 'Python', 'programming', '.', 'It', 'is', 'very', 'user', 'friendly', '.', 'I', "'m", 'ready', 'to', 'start', 'coding', '.']

Regex tokenization (words only):
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', 'I', 'm', 'learning', 'Python', 'programming', 'It', 'is', 'very', 'user', 'friendly', 'I', 'm', 'ready', 'to', 'start', 'coding']

TreebankWordTokenizer:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing.', 'I', "'m", 'learning', 'Python', 'programming.', 'It', 'is', 'very', 'user', 'friendly.', 'I', "'m", 'ready', 'to', 'start', 'coding', '.']

WordPunctTokenizer:
['I', 'am', 'learning', 'Natural', 'Language', 'Processing', '.', 'I', "'", 'm', 'learning', 'Python', 'programming', '.', 'It',
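
To make the regex-based tokenizers above less of a black box, the sketch below reproduces their behavior with the standard `re` module. The pattern `\w+|[^\w\s]+` is the one NLTK documents for `WordPunctTokenizer`; the contraction-preserving pattern and the three function names are illustrative assumptions, not part of NLTK.

```python
import re

txt = "I'm learning Python programming. It is very user friendly."

def regex_words(text):
    """Keep runs of word characters only, like RegexpTokenizer(r'\\w+')."""
    return re.findall(r"\w+", text)

def wordpunct_like(text):
    """Split into word runs and punctuation runs, like WordPunctTokenizer."""
    return re.findall(r"\w+|[^\w\s]+", text)

def words_keep_contractions(text):
    """Hypothetical variant: keep an apostrophe suffix attached to its word."""
    return re.findall(r"\w+(?:'\w+)?", text)

print(regex_words(txt))              # "I'm" becomes 'I', 'm'
print(wordpunct_like(txt))           # "I'm" becomes 'I', "'", 'm'
print(words_keep_contractions(txt))  # "I'm" stays one token
```

The same patterns can be passed to `RegexpTokenizer`; the group is written as non-capturing (`(?:...)`) so that each match is returned whole.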

In [68]:
# Open the text file (absolute path; adjust it to your own project structure).
# The with-statement closes the file automatically after reading.
with open("D:/Natural-language-processing/Data/sample_text.txt", mode='r', encoding='utf-8') as txt_file:
    txt_read = txt_file.read()
print(txt_read)

Hello! Mr reza. How are you today? I can't stand this weather.
The sun is too bright and the temperature is unbearable.
I don't know how people can work in these conditions.
Maybe we should move to a cooler place.
What do you think about that?


In [33]:
len(txt_read)  # number of characters, including spaces and newlines

243
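
Character count is only one basic property of a text. The sketch below collects a few more (words, lines, average word length) using plain string methods. The `text_stats` helper is a hypothetical name introduced here, and the sample is just two lines copied from the file shown above, so its counts will not match the full file.

```python
def text_stats(text):
    """Return simple size statistics for a text (whitespace-based word split)."""
    words = text.split()
    lines = text.splitlines()
    total_word_chars = sum(len(word) for word in words)
    return {
        "characters": len(text),
        "words": len(words),
        "lines": len(lines),
        "avg_word_length": total_word_chars / len(words) if words else 0.0,
    }

# Two lines taken from the sample file above (not the whole file).
sample = (
    "Hello! Mr reza. How are you today? I can't stand this weather.\n"
    "Maybe we should move to a cooler place.\n"
)

stats = text_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Because words are split on whitespace, punctuation stays attached to them (`'weather.'`); passing the output of a word tokenizer instead would give cleaner counts.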

In [69]:
# Load the pre-trained English Punkt tokenizer model
punkt_tok = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

# Tokenize the text using the loaded Punkt tokenizer
punkt_tok.tokenize(txt_read)

['Hello!',
 'Mr reza.',
 'How are you today?',
 "I can't stand this weather.",
 'The sun is too bright and the temperature is unbearable.',
 "I don't know how people can work in these conditions.",
 'Maybe we should move to a cooler place.',
 'What do you think about that?']

In [70]:
# Count the sentences found by the Punkt tokenizer
len(punkt_tok.tokenize(txt_read))

8

In [53]:
# Load raw text data from the 'overheard.txt' file in the webtext corpus
text_parameter = webtext.raw('overheard.txt')
# print(text_parameter)

In [73]:
# Create a new instance of PunktSentenceTokenizer and train it with our text data
my_tokenizer = PunktSentenceTokenizer(text_parameter)

In [74]:
type(my_tokenizer)

nltk.tokenize.punkt.PunktSentenceTokenizer
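
Passing text to the `PunktSentenceTokenizer` constructor trains it in one step. A more explicit route, sketched below, uses `PunktTrainer` so that training text can be fed in several batches before finalizing. The short dialogue is a hypothetical stand-in for a real corpus such as `webtext.raw('overheard.txt')`, and with so little data the learned parameters are only illustrative.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical in-domain training text; a real run would use far more data.
training_text = (
    "Guy: So, do you have any plans for this evening? "
    "Girl: Not really. I might stay in and read. "
    "Guy: That sounds nice. What are you reading? "
    "Girl: A book about language. It is surprisingly fun."
)

trainer = PunktTrainer()
# finalize=False only collects statistics, so train() can be called again
# with further batches of text before the parameters are computed.
trainer.train(training_text, finalize=False)
trainer.finalize_training()

# Build a tokenizer from the learned parameters instead of raw text.
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())

sample = "I can't stand this weather. Maybe we should move to a cooler place."
sentences = custom_tokenizer.tokenize(sample)
for sentence in sentences:
    print(sentence)
```

This form is mainly useful when the training data arrives in pieces (for example, one file at a time) and does not fit comfortably into a single string.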

In [78]:
# Tokenize the text using the pre-trained sent_tokenize function
pre_token = sent_tokenize(text_parameter)

# Tokenize the text using our custom trained tokenizer
our_token = my_tokenizer.tokenize(text_parameter)

print(f"pre_token[0]: {pre_token[0]}")

print(f"our_token[0]: {our_token[0]}")

pre_token[0]: White guy: So, do you have any plans for this evening?
our_token[0]: White guy: So, do you have any plans for this evening?
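
The first sentences match, which says little about the rest of the output. One way to compare two tokenizers more thoroughly is to look for the first position where their sentence lists diverge, as in the sketch below. The `first_difference` helper and the two short lists are hypothetical; in the notebook, `pre_token` and `our_token` from the cell above could be passed in instead.

```python
from itertools import zip_longest

def first_difference(sentences_a, sentences_b):
    """Return (index, a, b) for the first differing position, or None if equal."""
    pairs = zip_longest(sentences_a, sentences_b)  # pads the shorter list with None
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index, a, b
    return None

# Hypothetical outputs from two tokenizers that disagree about "Dr."
pretrained_like = ["Dr. Smith arrived.", "He sat down."]
custom_like = ["Dr.", "Smith arrived.", "He sat down."]

print(first_difference(pretrained_like, custom_like))
print(first_difference(pretrained_like, pretrained_like))
```

Comparing `len(pre_token)` with `len(our_token)` is a quick first check as well, since a different sentence count already shows that the two tokenizers split the text differently.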
