# Text Mining and NLP Tutorial
- NLP works on text and audio (speech) data

Progression: NLP (Natural Language Processing) --> NLU (Natural Language Understanding) --> LLM (Large Language Model)

Text Mining - the process of deriving meaningful information from natural language text

NLP - components:
- NLU
    - Ambiguity
        - Lexical ambiguity
            - A word has more than one meaning. E.g. bank - river bank / money bank
        - Syntactic ambiguity
            - Also called structural ambiguity - a sentence can be parsed in more than one way
        - Referential ambiguity
            - It is unclear which entity a pronoun or other referring expression points to.
            - E.g. The teacher told the student to bring their book to class. Ambiguity - it's unclear whether “their” refers to the teacher or the student.
            - E.g. The manager spoke to the employee while he was leaving the office. Ambiguity - it's unclear whether “he” refers to the manager or the employee.
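
Syntactic ambiguity from the list above can be made concrete by counting how many parse trees a grammar allows for one sentence. The sketch below is a small CKY-style counter in plain Python; the grammar, the lexicon, and the name `count_parses` are all made up for illustration and are not part of NLTK.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
# Keys are (left child, right child); values are the possible parent labels.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # the PP attaches to the verb phrase
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # the PP attaches to the noun phrase
    ("P", "NP"): ["PP"],
}
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(tokens, start="S"):
    """Count the distinct parse trees for tokens under the toy grammar."""
    n = len(tokens)
    # chart[(i, j)][label] = number of ways label can span tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for label in LEXICON.get(token, []):
            chart[(i, i + 1)][label] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, left_count in list(chart[(i, k)].items()):
                    for right, right_count in list(chart[(k, j)].items()):
                        for parent in BINARY_RULES.get((left, right), []):
                            chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)][start]

print(count_parses("I saw the man".split()))                     # 1
print(count_parses("I saw the man with the telescope".split()))  # 2
```

The second sentence gets two parses because "with the telescope" can attach either to "saw" or to "the man", which is exactly the structural ambiguity described above.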

The examples below use Python with NLTK (and, later, spaCy).


## Lexical Ambiguity

In [1]:
# Lexical ambiguity: pick a WordNet sense for "bank" with the Lesk algorithm

import nltk
from nltk.wsd import lesk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

sentence = "I went to the bank to deposit money"
ambiguous_word = "bank"

# lesk() returns the synset whose definition overlaps most with the context words
sense = lesk(sentence.split(), ambiguous_word)
print(sense, sense.definition())




Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home
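
Lesk scores each candidate sense by how many words its dictionary definition shares with the context. The sketch below is a simplified version of that idea in plain Python, not NLTK's implementation. The three glosses are WordNet definitions of "bank"; the short labels and the helper names are made up for illustration.

```python
import re

# WordNet glosses for three senses of "bank"; the keys are hypothetical labels.
GLOSSES = {
    "river": "sloping land (especially the slope beside a body of water)",
    "financial": "a financial institution that accepts deposits and channels "
                 "the money into lending activities",
    "container": "a container (usually with a slot in the top) for keeping "
                 "money at home",
}

def word_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def overlap_scores(context, glosses):
    """Count the words shared between the context and each gloss."""
    context_words = word_set(context)
    return {label: len(context_words & word_set(gloss))
            for label, gloss in glosses.items()}

scores = overlap_scores("I went to the bank to deposit money", GLOSSES)
print(scores)  # {'river': 1, 'financial': 2, 'container': 2}
```

In this sketch the financial and container senses tie, since both glosses share only "the" and "money" with the sentence ("deposits" does not match "deposit" exactly). That may help explain why a plain overlap count can return the money-box sense rather than the financial institution.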


OMW (Open Multilingual Wordnet) version 1.4 provides access to wordnets in a variety of languages, all linked to the Princeton WordNet of English (PWN).

In [2]:
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')

def check_ambiguity(word):
  # Each synset is one sense of the word in WordNet
  synsets = wordnet.synsets(word)
  print(f"word : {word}")
  if len(synsets) > 1:
    print(f"Ambiguity Detected. The word '{word}' has {len(synsets)} meanings")
    for syn in synsets:
      print(f"Meaning : {syn.definition()}")
  else:
    print(f"The word '{word}' has only one meaning")

words = ["bank","Bark","Date","Judge","Plant"]
for word in words:
  check_ambiguity(word)
  print()




word : bank
Ambiguity Detected. The word 'bank' has 18 meanings
Meaning : sloping land (especially the slope beside a body of water)
Meaning : a financial institution that accepts deposits and channels the money into lending activities
Meaning : a long ridge or pile
Meaning : an arrangement of similar objects in a row or in tiers
Meaning : a supply or stock held in reserve for future use (especially in emergencies)
Meaning : the funds held by a gambling house or the dealer in some gambling games
Meaning : a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Meaning : a container (usually with a slot in the top) for keeping money at home
Meaning : a building in which the business of banking transacted
Meaning : a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
Meaning : tip laterally
Meaning : enclose with a bank
Meaning : do business with a bank or keep an account a

In [3]:
import nltk
from nltk.corpus import wordnet as wn
# 'omw-1.4' is a data package fetched with nltk.download(), not an importable module

nltk.download('wordnet')
nltk.download('omw-1.4')

a = wn.synset('bank.n.01')  # Specify a synset name, e.g., 'bank.n.01'
print(a.definition())
print(a.examples())


sloping land (especially the slope beside a body of water)
['they pulled the canoe up on the bank', 'he sat on the bank of the river and watched the currents']




## Tokenization 
Tokenization is the first step of an NLP pipeline: it breaks text into units called tokens.

- Unigram - each word is a single token
- Bigram - two consecutive words form a single token
- Trigram - three consecutive words form a single token
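
The n-gram definitions above amount to sliding a window of size n over the token list. Here is a minimal plain-Python sketch of that idea; `make_ngrams` is a made-up helper name, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip() over n shifted copies of the list stops at the shortest one,
    # so each tuple is a complete window of n tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I went to the bank".split()
print(make_ngrams(tokens, 1))  # unigrams
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```

A sentence with k tokens yields k - n + 1 n-grams, so the five tokens here give 5 unigrams, 4 bigrams, and 3 trigrams.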

In [6]:
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

# Define the sentence
sentence = "I went to the bank to deposit money"

# Tokenize the sentence using RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(sentence)

# Generate trigrams (3-grams)
trigrams = list(ngrams(tokens, 3))

# Print the trigrams
print(trigrams)


[('I', 'went', 'to'), ('went', 'to', 'the'), ('to', 'the', 'bank'), ('the', 'bank', 'to'), ('bank', 'to', 'deposit'), ('to', 'deposit', 'money')]


In [9]:
from nltk import bigrams
from nltk.tokenize import RegexpTokenizer

sentence = "I went to the bank to deposit money to the"
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(sentence)

bigram_list = list(bigrams(tokens))
bigram_list


[('I', 'went'),
 ('went', 'to'),
 ('to', 'the'),
 ('the', 'bank'),
 ('bank', 'to'),
 ('to', 'deposit'),
 ('deposit', 'money'),
 ('money', 'to'),
 ('to', 'the')]

In [11]:
import nltk
from nltk.util import ngrams

nltk.download('punkt_tab')

sentence = "I went to the bank to deposit money"
tokens = nltk.word_tokenize(sentence)

# n=3 yields trigrams
trigrams = list(ngrams(tokens, 3))
print(trigrams)




[('I', 'went', 'to'), ('went', 'to', 'the'), ('to', 'the', 'bank'), ('the', 'bank', 'to'), ('bank', 'to', 'deposit'), ('to', 'deposit', 'money')]


In [14]:
import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

sentence = "I went to the bank to deposit money to the"
tokens = word_tokenize(sentence)

bigram_list = list(bigrams(tokens))
bigram_list




[('I', 'went'),
 ('went', 'to'),
 ('to', 'the'),
 ('the', 'bank'),
 ('bank', 'to'),
 ('to', 'deposit'),
 ('deposit', 'money'),
 ('money', 'to'),
 ('to', 'the')]

In [15]:
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Sample sentence
sentence = "I went to the bank to deposit money to the"

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Generate bigrams
bigram_list = list(bigrams(tokens))

# Compute frequency distribution of bigrams
bigram_freq = FreqDist(bigram_list)

# Display frequency count of each bigram
for bigram, freq in bigram_freq.items():
    print(f"Bigram: {bigram}, Frequency: {freq}")


Bigram: ('I', 'went'), Frequency: 1
Bigram: ('went', 'to'), Frequency: 1
Bigram: ('to', 'the'), Frequency: 2
Bigram: ('the', 'bank'), Frequency: 1
Bigram: ('bank', 'to'), Frequency: 1
Bigram: ('to', 'deposit'), Frequency: 1
Bigram: ('deposit', 'money'), Frequency: 1
Bigram: ('money', 'to'), Frequency: 1


## Stemming

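Stemming reduces a word to its root (stem) by stripping suffixes with fixed rules, e.g. affection --> affect.

The stem is not always a dictionary word (easily --> easili), and stemming is not the same as abbreviating a word (please --> pls).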
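
To show the suffix-stripping idea in isolation, here is a deliberately crude stemmer in plain Python. It is a toy sketch, far simpler than the Porter algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are made up for illustration.

```python
# Hypothetical suffix list; real stemmers use many more rules and conditions.
SUFFIXES = ("ation", "ing", "ion", "ly", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["affection", "running", "easily", "run"]:
    print(w, "-->", toy_stem(w))
# affection --> affect
# running --> runn
# easily --> easi
# run --> run
```

The output "runn" shows why real stemmers add extra rules (such as undoing doubled consonants), and why a stem is not guaranteed to be a valid word.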

## Lemmatization
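Lemmatization maps a word to its dictionary form (lemma) based on its meaning and part of speech, so the result is always a valid word.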
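
Because the lemma depends on the part of speech, a lemmatizer needs to know whether a word is being used as a noun or a verb. NLTK's `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to noun. The sketch below mimics that behaviour with a tiny hand-written lookup table; the table and the name `toy_lemmatize` are made up for illustration and stand in for WordNet.

```python
# Hypothetical mini-dictionary standing in for WordNet: (word, pos) -> lemma
LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); unknown pairs come back unchanged."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("running"))       # running  (treated as a noun)
print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("ran", "v"))      # run
print(toy_lemmatize("mice"))          # mouse
```

With the default noun setting, verb forms such as "running" and "ran" come back unchanged, so passing the part of speech matters.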


In [18]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

nltk.download('punkt')

porter = PorterStemmer()        # connectivity -> connect
lancaster = LancasterStemmer()  # connectivity -> connec
snowball = SnowballStemmer(language='english')

words = ["running", "ran", "runner", "run", "easily", "fairly"]
print("original words", words)
print()

print("Porter stemmer results")
porter_stemmed = [porter.stem(word) for word in words]
print(porter_stemmed)
print()


original words ['running', 'ran', 'runner', 'run', 'easily', 'fairly']

Porter stemmer results
['run', 'ran', 'runner', 'run', 'easili', 'fairli']



In [19]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

nltk.download('punkt')

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')

words = ["running", "ran", "runner", "run", "easily", "fairly"]
print("original words", words)
print()

print("Porter stemmer results")
porter_stemmed = [porter.stem(word) for word in words]
print(porter_stemmed)
print()

print("###################################################################")

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

words = ["running", "ran", "runner", "run", "easily", "fairly"]
print("original words", words)
print()

# lemmatize() treats every word as a noun unless pos is given (e.g. pos='v'),
# so the verb forms below come back unchanged
print("WordNet lemmatizer results")
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
print()


original words ['running', 'ran', 'runner', 'run', 'easily', 'fairly']

Porter stemmer results
['run', 'ran', 'runner', 'run', 'easili', 'fairli']

###################################################################
original words ['running', 'ran', 'runner', 'run', 'easily', 'fairly']

WordNet lemmatizer results
['running', 'ran', 'runner', 'run', 'easily', 'fairly']



### Stop words

In [20]:
# List the English stop words that ship with NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)




['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
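
A stop word list is typically used to filter tokens before further analysis. The sketch below shows that filtering step in plain Python. The small `STOP_WORDS` set is a hand-picked subset of the list above, and `remove_stop_words` is a made-up helper name.

```python
import re

# Hand-picked subset of the NLTK English stop words, for illustration only.
STOP_WORDS = {"i", "me", "my", "to", "the", "a", "an", "of", "and", "is"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase and tokenize the text, then drop tokens in stop_words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words("I went to the bank to deposit money"))
# ['went', 'bank', 'deposit', 'money']
```

With NLTK available, the same function could be called with `set(stopwords.words('english'))` as the `stop_words` argument.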



### Parts of Speech

In [8]:
# Part-of-speech tagging with NLTK
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_tab')
nltk.download('averaged_perceptron_tagger')

text = "This is a sample sentence for parts of speech tagging."
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

pos_tags




[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('sample', 'JJ'),
 ('sentence', 'NN'),
 ('for', 'IN'),
 ('parts', 'NNS'),
 ('of', 'IN'),
 ('speech', 'NN'),
 ('tagging', 'NN'),
 ('.', '.')]

In [4]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
text = "Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.[2] Its best-known software products are the Windows line of operating systems, the Microsoft 365 suite of productivity applications, the Azure cloud computing platform and the Edge web browser. Its flagship hardware products are the Xbox video game consoles and the Microsoft Surface lineup of touchscreen personal computers. Microsoft ranked No. 14 in the 2022 Fortune 500 rankings of the largest United States corporations by total revenue;[3] and it was the world's largest software maker by revenue in 2022 according to Forbes Global 2000. It is considered one of the Big Five American information technology companies, alongside Alphabet (parent company of Google), Amazon, Apple, and Meta (parent company of Facebook)."
doc = nlp(text)

print("Named Entities")
for ent in doc.ents:
  print(f"{ent.text} ({ent.label_})")


Named Entities
Microsoft Corporation (ORG)
American (NORP)
Redmond (GPE)
Windows (NORP)
Microsoft (ORG)
Azure (ORG)
Edge (ORG)
Xbox (ORG)
Microsoft Surface (ORG)
Microsoft (ORG)
14 (CARDINAL)
2022 (DATE)
Fortune 500 (LAW)
United States (GPE)
revenue;[3] (PERSON)
2022 (DATE)
Forbes Global 2000 (ORG)
American (NORP)
Alphabet (GPE)
Google (ORG)
Amazon (ORG)
Apple (ORG)
Meta (ORG)


In [3]:
import spacy

# Load the SpaCy model
nlp = spacy.load('en_core_web_sm')

# Define the text to be processed
text = "This is a sample sentence for parts of speech tagging."

# Process the text with SpaCy
doc = nlp(text)

# Extract and print POS tags
pos_tags = [(token.text, token.pos_) for token in doc]
print(pos_tags)


[('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('sample', 'NOUN'), ('sentence', 'NOUN'), ('for', 'ADP'), ('parts', 'NOUN'), ('of', 'ADP'), ('speech', 'NOUN'), ('tagging', 'NOUN'), ('.', 'PUNCT')]


In [5]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
text = "Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington.[2] Its best-known software products are the Windows line of operating systems, the Microsoft 365 suite of productivity applications, the Azure cloud computing platform and the Edge web browser. Its flagship hardware products are the Xbox video game consoles and the Microsoft Surface lineup of touchscreen personal computers. Microsoft ranked No. 14 in the 2022 Fortune 500 rankings of the largest United States corporations by total revenue;[3] and it was the world's largest software maker by revenue in 2022 according to Forbes Global 2000. It is considered one of the Big Five American information technology companies, alongside Alphabet (parent company of Google), Amazon, Apple, and Meta (parent company of Facebook)."
doc = nlp(text)

# Entity labels that appear in the output:
#   NORP = Nationalities, religious or political groups
#   ORG = Companies, agencies, institutions, etc.
#   GPE = Countries, cities, states.
#   PERSON = People, including fictional.
print("Named Entities")
for ent in doc.ents:
  print(f"{ent.text} ({ent.label_})")

displacy.render(doc, style="ent", jupyter=True)


Named Entities
Microsoft Corporation (ORG)
American (NORP)
Redmond (GPE)
Windows (NORP)
Microsoft (ORG)
Azure (ORG)
Edge (ORG)
Xbox (ORG)
Microsoft Surface (ORG)
Microsoft (ORG)
14 (CARDINAL)
2022 (DATE)
Fortune 500 (LAW)
United States (GPE)
revenue;[3] (PERSON)
2022 (DATE)
Forbes Global 2000 (ORG)
American (NORP)
Alphabet (GPE)
Google (ORG)
Amazon (ORG)
Apple (ORG)
Meta (ORG)


### Syntax Tree (Dependency Parse)

In [9]:
# Render the dependency parse of a sentence with displaCy

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = "This is a sample sentence."
doc = nlp(text)

displacy.render(doc, style="dep", jupyter=True)


## Chunking



In [10]:
# Noun-phrase chunking with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
text = "We caught the black panther."
doc = nlp(text)

for chunk in doc.noun_chunks:
  print(chunk.text, chunk.label_, chunk.root.text)


We NP We
the black panther NP panther
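
Chunking groups POS-tagged tokens into phrases. As a rough illustration of what a noun-phrase chunker does, the plain-Python sketch below applies one hand-written rule (optional determiner, any adjectives, then one or more nouns, or a lone pronoun). The tags are written by hand and the name `np_chunks` is made up; spaCy's `noun_chunks` works from the dependency parse rather than from a rule like this.

```python
def np_chunks(tagged):
    """Group (word, tag) pairs into noun phrases: DT? JJ* NN+ or a lone PRP."""
    chunks, current = [], []

    def flush():
        # Keep the pending words only if they end in a noun.
        if current and current[-1][1].startswith("NN"):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag == "PRP":                 # a pronoun is a chunk on its own
            flush()
            chunks.append(word)
        elif tag == "DT":                # a determiner starts a new chunk
            flush()
            current.append((word, tag))
        elif tag == "JJ":                # an adjective after a noun starts a new chunk
            if current and current[-1][1].startswith("NN"):
                flush()
            current.append((word, tag))
        elif tag.startswith("NN"):       # nouns extend the current chunk
            current.append((word, tag))
        else:                            # anything else ends the chunk
            flush()
    flush()
    return chunks

# Hand-tagged tokens (Penn Treebank tags), assumed for illustration.
tagged = [("We", "PRP"), ("caught", "VBD"), ("the", "DT"),
          ("black", "JJ"), ("panther", "NN"), (".", ".")]
print(np_chunks(tagged))  # ['We', 'the black panther']
```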
