**Welcome to Natural Language Processing (NLP) in Python**<br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>


NLP is a branch of artificial intelligence that allows computers to understand, interpret, and generate human language by combining linguistics, computer science, and machine learning.<br/>

**Key Areas of NLP:**<br/>
`Text Analysis:`<br/>
▪ Tokenization: Breaking text into words or sentences.<br/>
▪ Part-of-Speech (POS) Tagging: Identifying grammatical components (e.g., nouns, verbs).<br/>
▪ Named Entity Recognition (NER): Extracting entities like names, dates, or organizations.<br/>
▪ Sentiment Analysis: Determining the emotional tone (positive, negative, neutral).<br/>

`Language Generation:`<br/>
▪ Text Summarization: Condensing long texts into shorter summaries.<br/>
▪ Machine Translation: Converting text between languages (e.g., Google Translate).<br/>
▪ Text Generation: Creating human-like text (e.g., chatbots, story generators).<br/>

`Speech Processing:`<br/>
▪ Speech Recognition: Converting spoken words to text (e.g., Siri, Alexa).<br/>
▪ Text-to-Speech (TTS): Generating spoken language from text.<br/>
▪ Voice Assistants: Combining speech recognition and NLP for interactive systems.<br/>

`Semantic Understanding:`<br/>
▪ Word Embeddings: Representing words as vectors (e.g., Word2Vec, BERT).<br/>
▪ Question Answering: Providing precise answers to user queries.<br/>
▪ Dialogue Systems: Enabling conversational agents to maintain context.<br/>

**Text Word-level Representation (Word Embedding)**<br/>
`One-hot Encoding:`<br/>
A one-hot encoding transforms categorical variables into binary vectors where each category is represented by a vector with a single '1' and all other positions as '0'. This creates a sparse representation where each category is equidistant from others in vector space.<br/>
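
As a minimal sketch of the idea above (the helper names `one_hot_encode` and `squared_distance` are illustrative, not part of any library), each token gets a vector with a single 1 at its vocabulary index:

```python
def one_hot_encode(tokens):
    """Return (vocabulary, vectors) with one binary vector per token."""
    vocab = sorted(set(tokens))                      # fixed, ordered vocabulary
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)                       # all zeros ...
        vec[index[tok]] = 1                          # ... except one position
        vectors.append(vec)
    return vocab, vectors


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


tokens = ["i", "am", "learning", "nlp"]
vocab, vectors = one_hot_encode(tokens)
print(vocab)                                         # ['am', 'i', 'learning', 'nlp']
for tok, vec in zip(tokens, vectors):
    print(tok, vec)

# Any two distinct one-hot vectors are the same distance apart
print(squared_distance(vectors[0], vectors[1]))      # 2
print(squared_distance(vectors[0], vectors[3]))      # 2
```

Because every pair of distinct words is equally far apart, one-hot vectors carry no notion of similarity between words.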

`Bag-Of-Words (BoW):`<br/>
BoW represents text as an unordered collection of words, maintaining word frequency but ignoring grammar and word order. It creates a vocabulary from all unique words and represents each document as a vector of word counts, making it simple but losing contextual relationships.<br/>
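
The counting step can be sketched in plain Python. This is a simplified illustration that assumes lowercase whitespace splitting as the tokenizer; the function name `bag_of_words` and the sample documents are made up for the example:

```python
from collections import Counter


def bag_of_words(documents):
    """Build a shared vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)                        # word -> frequency in this document
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


docs = ["I am learning NLP", "I am learning Python and I like Python"]
bow_vocab, bow_vectors = bag_of_words(docs)
print(bow_vocab)       # ['am', 'and', 'i', 'learning', 'like', 'nlp', 'python']
print(bow_vectors[0])  # [1, 0, 1, 1, 0, 1, 0]
print(bow_vectors[1])  # [1, 1, 2, 1, 1, 0, 2]
```

Shuffling the words of a document would leave its vector unchanged, which is exactly the loss of word order described above.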

`Word Embedding:`<br/>
Word embeddings map words to dense vectors in a continuous vector space, where semantic relationships are preserved. Words with similar meanings are positioned closer together, enabling the capture of semantic relationships and analogies (e.g., king - man + woman ≈ queen).<br/>
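
To make the analogy concrete, here is a toy sketch using hand-picked 2-D vectors. The numbers in `toy_embeddings` are invented for illustration (real embeddings are learned from data and have hundreds of dimensions), and `analogy` is a hypothetical helper:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: first axis roughly "royalty", second axis roughly "gender"
toy_embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


print(analogy("king", "man", "woman", toy_embeddings))  # queen
```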

`TF-IDF (Term Frequency-Inverse Document Frequency):`<br/>
TF-IDF measures word importance by considering both:<br/>
▪ Term Frequency (TF): How often a word appears in a document.<br/>
▪ Inverse Document Frequency (IDF): How rare the word is across all documents.<br/>
This helps identify words that are significant in specific documents but not common across the corpus.<br/>
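
The two factors can be combined as in the sketch below. It assumes one common textbook variant, TF = count / document length and IDF = log(N / document frequency); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function name `tf_idf` and the sample documents are illustrative:

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: score} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
scores = tf_idf(corpus)
for word, score in sorted(scores[0].items()):
    print(f"{word}: {score:.3f}")
```

In the first document, "the" scores 0 because it occurs in every document, while "sat" scores higher than "cat" because it occurs in only one.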

`Word2Vec:`<br/>
Word2Vec is a neural network-based approach that learns word embeddings either by predicting a word from its surrounding context (CBOW) or by predicting the surrounding words from a given word (Skip-gram). It captures semantic relationships and can perform tasks like:<br/>
▪ Finding similar words<br/>
▪ Solving word analogies<br/>
▪ Identifying word relationships<br/>
▪ Generating word suggestions<br/>
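
The difference between the two Word2Vec training objectives is easiest to see in the training examples they produce: CBOW maps a context window to its center word, while Skip-gram maps a center word to each of its context words. The sketch below only builds those pairs and does not train a network; the helper names and the window size are assumptions made for illustration:

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens, window=1):
    """(context words, center word): predict the center from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window=1):
    """(center word, context word): predict each neighbor from the center."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


sentence = ["i", "am", "learning", "nlp"]
print(cbow_pairs(sentence)[1])      # (['i', 'learning'], 'am')
print(skipgram_pairs(sentence)[:3]) # [('i', 'am'), ('am', 'i'), ('am', 'learning')]
```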

**[NLTK](https://www.nltk.org/)**<br/>
▪ `Tokenization:` Breaking text into smaller units (words, sentences, or phrases)<br/>
▪ `Stemming:` Reducing words to their root form by removing suffixes/prefixes<br/>
▪ `Tagging:` Assigning grammatical categories (POS tags) to words<br/>
▪ `Parsing:` Analyzing sentence structure and grammatical relationships<br/>
▪ `Semantic Reasoning:` Understanding meaning and relationships between words/concepts<br/>
▪ `Wrappers:` Interface layers that connect to external, industrial-strength NLP tools such as Stanford CoreNLP<br/>

<span style="color:#ee0b0b; font-size:20px; font-weight:bold;">Importing libraries</span>

In [51]:
# !pip install nltk
import nltk
import nltk.data
from nltk.tokenize import sent_tokenize
from nltk.corpus import webtext

In [None]:
# Download the NLTK resources used in this notebook:
# - 'punkt': Pre-trained Punkt sentence tokenizer models (pickle format)
# - 'punkt_tab': The same Punkt models in a tab-separated format, required by recent NLTK releases
# - 'webtext': Small corpus of informal web text, used below to train a custom tokenizer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('webtext')

In [7]:
txt = 'I am learning Natural Language Processing. I am learning Python programming. It is very user friendly. I am ready to start coding.'
sent_tokenize(txt)

['I am learning Natural Language Processing.',
 'I am learning Python programming.',
 'It is very user friendly.',
 'I am ready to start coding.']
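
For comparison, here is a naive sentence splitter built on a regular expression. It is a hypothetical baseline, not how `sent_tokenize` works: it breaks after every `.`, `!` or `?` followed by whitespace, so it matches the output above on this simple text but also splits after abbreviations, which a trained Punkt model is designed to avoid.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = ('I am learning Natural Language Processing. I am learning Python '
          'programming. It is very user friendly. I am ready to start coding.')
print(naive_sent_split(sample))
# ['I am learning Natural Language Processing.', 'I am learning Python programming.',
#  'It is very user friendly.', 'I am ready to start coding.']

# The naive rule wrongly breaks after the abbreviation "Dr."
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```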

In [32]:
# Open the sample text file (absolute path; adjust it to your own project structure)
with open("D:/Natural-language-processing/Data/sample_text.txt", mode='r', encoding='utf-8') as txt_file:
    txt_read = txt_file.read()
print(txt_read)

Hello! Mr reza. How are you today? I can't stand this weather.
The sun is too bright and the temperature is unbearable.
I don't know how people can work in these conditions.
Maybe we should move to a cooler place.
What do you think about that?


In [33]:
len(txt_read)  # the number of characters

243

In [35]:
Punkt_tok = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
Punkt_tok.tokenize(txt_read)

['Hello!',
 'Mr reza.',
 'How are you today?',
 "I can't stand this weather.",
 'The sun is too bright and the temperature is unbearable.',
 "I don't know how people can work in these conditions.",
 'Maybe we should move to a cooler place.',
 'What do you think about that?']

In [36]:
len(Punkt_tok.tokenize(txt_read))

8

**Custom Tokenizer Training**<br/>
We can train our own sentence tokenizer using custom text data. This allows the tokenizer to learn patterns specific to our domain.<br/>

Webtext Corpus<br/>
NLTK's `webtext` corpus is a small collection of informal web text: posts from a Firefox discussion forum, conversations overheard in New York (`overheard.txt`, used below), a movie script, personal advertisements, and wine reviews. Its informal style makes it a useful sample for training a tokenizer on conversational text. It is unrelated to OpenAI's much larger [WebText](https://paperswithcode.com/dataset/webtext) dataset.<br/>

In [53]:
text_parameter = webtext.raw('overheard.txt')
# print(text_parameter)

In [41]:
# Train a custom Punkt sentence tokenizer on the raw corpus text
from nltk.tokenize import PunktSentenceTokenizer
My_tokenizer = PunktSentenceTokenizer(text_parameter)

In [42]:
type(My_tokenizer)

nltk.tokenize.punkt.PunktSentenceTokenizer

In [43]:
from nltk.tokenize import sent_tokenize    # to compare two methods
pre_token = sent_tokenize(text_parameter)
our_token = My_tokenizer.tokenize(text_parameter)

In [45]:
pre_token[0]

'White guy: So, do you have any plans for this evening?'

In [46]:
our_token[0]

'White guy: So, do you have any plans for this evening?'
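
Matching first sentences do not show that the two tokenizers agree everywhere. One way to check is to look for the first position where the two sentence lists diverge. The helper below is a hypothetical utility, shown here on made-up lists; in the notebook it could be called as `first_difference(pre_token, our_token)`.

```python
def first_difference(a, b):
    """Index of the first position where two lists differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))   # one list is a strict prefix of the other
    return None


# Illustrative data only, not real tokenizer output
list_a = ["Hello!", "Mr reza.", "How are you today?"]
list_b = ["Hello!", "Mr reza. How are you today?"]
print(first_difference(list_a, list_b))  # 1
print(first_difference(list_a, list_a))  # None
```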

## Word Tokenization

In [47]:
from nltk.tokenize import word_tokenize
word_tokenize(txt)

['I',
 'am',
 'learning',
 'Natural',
 'Language',
 'Processing',
 '.',
 'I',
 'am',
 'learning',
 'Python',
 'programming',
 '.',
 'It',
 'is',
 'very',
 'user',
 'friendly',
 '.',
 'I',
 'am',
 'ready',
 'to',
 'start',
 'coding',
 '.']

In [48]:
word_tokenize("don't")

['do', "n't"]

### TreebankWordTokenizer

In [49]:
from nltk import TreebankWordTokenizer
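Tree_Tokenizer = TreebankWordTokenizer()  # Create a tokenizer object
# Contractions are split as with word_tokenize ("can't" -> 'ca', "n't"); note that 'reza.' keeps its period
Tree_Tokenizer.tokenize("Hello! Mr reza. How are you today? I can't stand")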

['Hello',
 '!',
 'Mr',
 'reza.',
 'How',
 'are',
 'you',
 'today',
 '?',
 'I',
 'ca',
 "n't",
 'stand']

### WordPunctTokenizer

In [50]:
from nltk.tokenize import WordPunctTokenizer
Punct_tok = WordPunctTokenizer()  # splits text into alphanumeric runs and punctuation runs
Punct_tok.tokenize("can't")

['can', "'", 't']
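
`WordPunctTokenizer` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, which matches either a run of word characters or a run of punctuation. The same behavior can be sketched with the standard `re` module (the function name `word_punct_tokenize` is illustrative):

```python
import re

WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return WORD_PUNCT_PATTERN.findall(text)


print(word_punct_tokenize("can't"))            # ['can', "'", 't']
print(word_punct_tokenize("Hello! Mr reza."))  # ['Hello', '!', 'Mr', 'reza', '.']
```

Unlike the Treebank tokenizer, this rule always separates punctuation from words, so the apostrophe in a contraction becomes its own token.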