# Natural language processing (NLP)
NLP is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.
![image.png](attachment:image.png)

# NLTK 

NLTK is a popular Python framework for working with human language data. It includes a set of text processing libraries for classification and semantic reasoning, as well as wrappers for industrial-strength NLP libraries and an active discussion forum.
- The NLTK module is a large toolkit aimed at helping you with the entire natural language processing (NLP) workflow.
- NLTK helps with everything from splitting paragraphs into sentences, splitting sentences into words, and recognizing the part of speech of those words, to highlighting the main subjects and helping your machine understand what the text is about.

### Installation
- conda install -c anaconda nltk

## Components of NLP
The five main components of natural language processing are:

## Morphological and Lexical Analysis
- The lexicon of a language is its vocabulary: its words and expressions.
- Lexical analysis identifies, analyzes and describes the structure of words.
- It includes dividing a text into paragraphs, sentences and words.
- Individual words are analyzed into their components, and non-word tokens such as punctuation are separated from the words.

## Semantic Analysis
- Semantic analysis assigns meanings to the structures created by the syntactic analyzer.
- It maps linear sequences of words onto structures that carry meaning.
- It shows how the words are associated with each other.

## Pragmatic Analysis
- Pragmatic analysis deals with the overall communicative and social context and its effect on interpretation.
- It means abstracting or deriving the meaningful use of language in situations.
- The focus is on reinterpreting what was said in terms of what was actually meant.

## Syntax Analysis
- Words are commonly accepted as the smallest units of syntax.
- Syntax refers to the principles and rules that govern the sentence structure of an individual language.

## Discourse Integration
- Discourse integration provides a sense of context.
- The meaning of a single sentence depends on the sentences that precede it, and it can also affect the meaning of the sentences that follow.

### NLP and writing systems
The kind of writing system used for a language is one of the deciding factors in determining the best approach for text pre-processing. Writing systems can be:

- Logographic: a large number of individual symbols represent words. Examples: Japanese kanji, Mandarin Chinese.
- Syllabic: individual symbols represent syllables.
- Alphabetic: individual symbols represent sounds.

### Challenges

- Extracting meaning (semantics) from text is difficult.
- NLP depends on the quality of the corpus. If the domain is vast, it is difficult to understand context.
- Approaches depend on the character set and the language.

In [None]:
## list,sets,dictionary,Tuples,strings

In [7]:
import nltk
#nltk.download()

- First step: `conda install -c anaconda nltk` or `pip install nltk`
- Second step: `import nltk`, then `nltk.download()`

## NLP libraries in Python
- NLTK
- Gensim (topic modelling, document summarization)
- CoreNLP (linguistic analysis)
- spaCy
- TextBlob
- Pattern (web mining)

Behind a corporate proxy, pass the proxy URL to pip: `pip install --proxy http://noidasezproxy.corp.exlservice.com:8000 package`


### Tokenizing Words & Sentences

Text can be split into sentences with `sent_tokenize()` and into words with `word_tokenize()`.


In [1]:
from nltk.tokenize import  word_tokenize 

E_TEXT = "Hello -Hello , - i am Suyashi Raiwani"

a= word_tokenize(E_TEXT)
#type(a)
a

['Hello', '-Hello', ',', '-', 'i', 'am', 'Suyashi', 'Raiwani']

In [2]:
from nltk.tokenize import sent_tokenize
# Sentences are split at boundary punctuation such as '.', '?' and '!'

S2_TEXT = "Positive thinking! You know. is all imp A matter of habits? If you are not quite a positive thinker Change Yourself?"

print(sent_tokenize(S2_TEXT))

## type(sent_tokenize(E_TEXT))  # a list; sentence boundaries are '!', '?' and '.'

['Positive thinking!', 'You know.', 'is all imp A matter of habits?', 'If you are not quite a positive thinker Change Yourself?']
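
To get a feel for what the two tokenizers are doing, the sketch below approximates them with regular expressions from the standard library. It is not NLTK's algorithm (Punkt also copes with abbreviations such as "Mr."), and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this illustration.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (rough approximation)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

S2_TEXT = ("Positive thinking! You know. is all imp A matter of habits? "
           "If you are not quite a positive thinker Change Yourself?")
E_TEXT = "Hello -Hello , - i am Suyashi Raiwani"

sentences = simple_sent_tokenize(S2_TEXT)
words = simple_word_tokenize(E_TEXT)
print(sentences)
print(words)
```

On these two samples the sentence split agrees with `sent_tokenize`, while the word split differs in one place: NLTK keeps `-Hello` as a single token, whereas this regex separates the hyphen.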


## Quick Practice:
- Do you know Customer and target audience’s reviews can be analyzed? You can  use this! to create a roadmap of features and products.
### Convert the paragraph above into word tokens and sentence tokens

In [5]:
# Store the words and sentences, then cast the word list to a NumPy array

from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
 
phrases = sent_tokenize(data)
words = word_tokenize(data)

new_array=np.array(words)
new_array
# print(type(new_array))

array(['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy',
       '.', 'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a',
       'dull', 'boy', '.'], dtype='<U5')

### Stop Words
- Before a computer can analyze text, we need a way to convert words to values, numbers, or signal patterns. The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out useless data. In natural language processing, unimportant words are referred to as stop words.

- For now, we'll treat stop words as words that carry little meaning on their own, and we want to remove them.

- We can do this easily by storing a list of words that we consider to be stop words. NLTK starts you off with a list of words that it considers to be stop words; you can access it via the NLTK corpus.


In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning  DATA Science will bring a big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"
word_tokens = word_tokenize(a)
print(word_tokens)
print("Length of words = ", len(word_tokens))

['I', 'think', 'i', 'that', 'Learning', 'DATA', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains']
Length of words =  58


In [7]:
stop_words1 = set(stopwords.words('english'))  # load NLTK's English stop-word list (needs nltk.download('stopwords') once)
word_tokens = word_tokenize(a)

filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(filtered_sentence)
#print(word_tokens)
#print(filtered_sentence)
print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence)))
print("Length of words = ", len(filtered_sentence))

['I', 'think', 'Learning', 'DATA', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
The number of words stopped : 18
Length of words =  40


In [8]:
 
b = ["I", ".", ",", "?", ":"]  # your own additional stop words
stop_words1 = list(stop_words1)
stop_words2 = b  # the custom stop words defined above
stop_words = stop_words1 + stop_words2  # NLTK list plus the custom list
word_tokens = word_tokenize(a)

filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

#print(word_tokens)
#print(filtered_sentence)
print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence)))
print("Length of words filtered sentence = ", len(filtered_sentence))

['think', 'Learning', 'DATA', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', 'processes', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', 'structured', 'unstructured', 'data', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
The number of words stopped : 24
Length of words filtered sentence =  34
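
In the first filtered output, the capital `'I'` survived because NLTK's English stop words are all lowercase, which is why it had to be added by hand. A possible alternative, sketched below, is to compare tokens case-insensitively and drop punctuation in the same pass. The small `STOP_WORDS` set and the helper name `remove_stop_words` are assumptions made for this example; in the notebook you would pass `set(stopwords.words('english'))` instead.

```python
import string

# Hypothetical, hand-picked stop words for illustration only.
STOP_WORDS = {"i", "a", "an", "the", "is", "in", "that", "will",
              "your", "and", "to", "of", "from"}

def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (compared in lowercase) and pure-punctuation tokens."""
    return [t for t in tokens
            if t.lower() not in stop_words and not is_punctuation(t)]

tokens = ["I", "think", "that", "Learning", "DATA", "Science",
          "will", "bring", "a", "big", "leap", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the surviving tokens, which matters later for steps such as named entity recognition.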


### Write a Python script that takes a paragraph of text and performs word tokenization using NLTK. Print the list of tokens

In [9]:
# Import necessary modules
import nltk
from nltk.tokenize import word_tokenize

# Download the necessary NLTK data files (you may need to run this once)
nltk.download('punkt')

# Function to tokenize text
def tokenize_text(paragraph):
    # Tokenize the paragraph into words
    tokens = word_tokenize(paragraph)
    return tokens

# Example paragraph of text
paragraph = """Life is a journey, not a destination. 
It's about the experiences we gather, the relationships we build, and the lessons we learn along the way. Embrace each moment, 
find joy in the simple things, and remember that even the challenges help shape who we are."""

# Get the tokens
tokens = tokenize_text(paragraph)

# Print the list of tokens
print(tokens)

# Alternatively, print each token on a new line
# for token in tokens:
#     print(token)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\om\AppData\Roaming\nltk_data...


['Life', 'is', 'a', 'journey', ',', 'not', 'a', 'destination', '.', 'It', "'s", 'about', 'the', 'experiences', 'we', 'gather', ',', 'the', 'relationships', 'we', 'build', ',', 'and', 'the', 'lessons', 'we', 'learn', 'along', 'the', 'way', '.', 'Embrace', 'each', 'moment', ',', 'find', 'joy', 'in', 'the', 'simple', 'things', ',', 'and', 'remember', 'that', 'even', 'the', 'challenges', 'help', 'shape', 'who', 'we', 'are', '.']


[nltk_data]   Package punkt is already up-to-date!


### STEMMING
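A word stem is the part of a word that remains after its suffixes are stripped. Stemming is a kind of linguistic normalization: different forms of a word are reduced to the same stem, which need not be a dictionary word.
- For example, the stem of the word **Using** is **use**.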

In [10]:
from nltk.stem import PorterStemmer

ps = PorterStemmer() ## defining stemmer
s_words = ["AIIMS","Aims","Aimed","Aimmer","Aiming","Aim"]
for i in s_words:
    print(ps.stem(i))


aiim
aim
aim
aimmer
aim
aim


In [11]:
from nltk.stem import PorterStemmer

ps = PorterStemmer() ## defining stemmer
s_words = ["Dance", "dances", "Dancing", "dancer", "dances", "danced", "Goods","Good","sings","singings","that"]
for i in s_words:
    print(ps.stem(i))

danc
danc
danc
dancer
danc
danc
good
good
sing
sing
that


In [12]:
from nltk.stem import PorterStemmer
ps = PorterStemmer() ## defining stemmer
s_words = ["Calls","Caller","Calling","Call","Called"]
for i in s_words:
    print(ps.stem(i))

call
caller
call
call
call
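
The idea behind suffix stripping can be shown with a deliberately tiny stemmer. This is a toy sketch, not the Porter algorithm (which applies several ordered rule phases); the suffix list and the name `toy_stem` are assumptions for illustration.

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Lowercase the word and strip one common suffix, if enough letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["Calls", "Caller", "Calling", "Call", "Called"]]
print(stems)
```

For this word list the toy version happens to agree with `PorterStemmer`, but it diverges quickly on other inputs (for instance, it leaves "dance" unchanged while Porter returns "danc").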


help(nltk)

### Part of Speech tagging 
Part-of-speech (POS) tagging labels each word in a sentence with its grammatical category, such as noun, adjective, or verb.

![image.png](attachment:image.png)

In [13]:
import nltk 
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to DataScience or an paracetamol , it\'s easy to learn and use Python.Are you Good enough in Prgramming? I Am based in Delhi location'
sentences = nltk.sent_tokenize(document)   
for sent in sentences: 
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('DataScience', 'NNP'), ('or', 'CC'), ('an', 'DT'), ('paracetamol', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python.Are', 'NNP'), ('you', 'PRP'), ('Good', 'NNP'), ('enough', 'RB'), ('in', 'IN'), ('Prgramming', 'NNP'), ('?', '.')]
[('I', 'PRP'), ('Am', 'VBP'), ('based', 'VBN'), ('in', 'IN'), ('Delhi', 'NNP'), ('location', 'NN')]
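
The tags in this output come from the Penn Treebank tag set. A small lookup table can make the tagger output easier to read; the sketch below covers only the tags that appear in this notebook, and the names `TAG_MEANINGS` and `describe_tags` are made up for the example.

```python
# Partial Penn Treebank tag glossary (only tags seen in this notebook).
TAG_MEANINGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
}

def describe_tags(tagged_tokens):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, TAG_MEANINGS.get(tag, "other or punctuation"))
            for word, tag in tagged_tokens]

tagged = [('I', 'PRP'), ('Am', 'VBP'), ('based', 'VBN'),
          ('in', 'IN'), ('Delhi', 'NNP'), ('location', 'NN')]
described = describe_tags(tagged)
for word, tag, meaning in described:
    print(f"{word:10} {tag:4} {meaning}")
```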


In [14]:
sentences

["Whether you're new to DataScience or an paracetamol , it's easy to learn and use Python.Are you Good enough in Prgramming?",
 'I Am based in Delhi location']

In [15]:
##from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to DataScience or an experienced, it\'s easy to learn and use Python.'
sentences = nltk.sent_tokenize(document)
data = []
for sent in sentences:
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
   
for word in data: 
    if 'CC' in word[1]: 
        print(word)


('or', 'CC')
('and', 'CC')


## Get synonyms/antonyms using WordNet

- WordNet’s structure makes it a useful tool for computational linguistics and natural language processing

- WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings

In [16]:
# First, you're going to need to import wordnet:
from nltk.corpus import wordnet

# Then, we're going to use the term "Nostalgic" to find synsets like so:
syns = wordnet.synsets("Nostalgic")

# An example of a synset:
print(syns[0].name())

# Just the word:
print(syns[0].lemmas()[0].name())

# Definition of that first synset:
print(syns[0].definition())

# Examples of the word in use in sentences:
print(syns[0].examples())


nostalgic.s.01
nostalgic
unhappy about being away and longing for familiar things or persons
[]


In [17]:
# Then, we're going to use the term "Generative" to find synsets like so:
syns = wordnet.synsets("Generative")

# An example of a synset:
print(syns[0].name())

# Just the word:
print(syns[0].lemmas()[0].name())

# Definition of that first synset:
print(syns[0].definition())

# Examples of the word in use in sentences:
print(syns[0].examples())

generative.a.01
generative
having the ability to produce or originate
['generative power', 'generative forces']


In [18]:
import nltk
from nltk.corpus import wordnet
synonyms = []
antonyms = []

for syn in wordnet.synsets("Sound"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print("Similar words =",set(synonyms))
print(set(antonyms))


Similar words = {'healthy', 'vocalise', 'level-headed', 'vocalize', 'legal', 'heavy', 'strait', 'sound', 'fathom', 'wakeless', 'speech_sound', 'levelheaded', 'intelligent', 'go', 'phone', 'profound', 'audio', 'effectual', 'reasoned', 'good', 'auditory_sensation', 'voice', 'well-grounded'}
{'silence', 'unsound', 'devoice'}


## Let’s compare the similarity index of any two words

Similarity index is a measure of how much a text matches other sources. It can be used to detect plagiarism or compare sets of data. It is usually expressed as a percentage, where a higher value means more similarity. 
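
One simple way to compute such a percentage is the Jaccard index over the sets of tokens in two texts: the size of the intersection divided by the size of the union. The sketch below uses whitespace splitting to stay dependency-free, which is an assumption of this example; `word_tokenize` could be swapped in. NLTK's WordNet interface also offers word-level measures on synsets, such as `path_similarity` and `wup_similarity`.

```python
def jaccard_similarity(text_a, text_b):
    """Return the Jaccard index of the two texts' word sets, as a percentage."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 100.0  # two empty texts are treated as identical
    return 100.0 * len(set_a & set_b) / len(set_a | set_b)

score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
same = jaccard_similarity("Data Science", "data science")
print(f"{score:.1f}% similar")
print(f"{same:.1f}% similar")
```

Here the two sentences share three of seven distinct words, so the score is 3/7, roughly 42.9%.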

## Filtering duplicate words using sets

In [19]:
##Sets 
s={1,2,33,33,44,0,-5}
s

{-5, 0, 1, 2, 33, 44}

In [20]:
import nltk
word_data = "The python is a a python data analytics language" 

# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

# Applying Set
no_order = list(set(nltk_tokens))

print (no_order)

['data', 'language', 'analytics', 'is', 'python', 'The', 'a']


In [21]:
ordered_tokens = set()
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)
     
print (result )       

['The', 'python', 'is', 'a', 'data', 'analytics', 'language']


## Lemmatization
Like stemming, lemmatizing reduces words to a base form, but it gives you a complete English word (the lemma) that makes sense on its own instead of just a fragment of a word like 'danc'.

In [22]:
import nltk 
from nltk.tokenize import  word_tokenize
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
a = "I have yellow and Black scarves. I love to wear scarf"
words = word_tokenize(a)
words

['I',
 'have',
 'yellow',
 'and',
 'Black',
 'scarves',
 '.',
 'I',
 'love',
 'to',
 'wear',
 'scarf']

In [23]:
from nltk.tokenize import word_tokenize
lemmatized_words = [lem.lemmatize(word) for word in words]
lemmatized_words

['I',
 'have',
 'yellow',
 'and',
 'Black',
 'scarf',
 '.',
 'I',
 'love',
 'to',
 'wear',
 'scarf']

In [24]:
from nltk.stem import PorterStemmer
ps = PorterStemmer() ## defining stemmer
s_words = ["Dances", "dances", "Dancing", "dancer", "dances", "danced", "ddd","Sang","sings","singings","that"]
s_words1 = ["dancess", "dances", "dancing", "dancer", "dances", "danced", "ddd"]
for i in s_words1:
    print(ps.stem(i))

dancess
danc
danc
dancer
danc
danc
ddd


In [25]:
s = "dancess dances dancing dancer dances danced ddd"

words = word_tokenize(s)
lm = [lem.lemmatize(word) for word in words]
lm

['dance', 'dance', 'dancing', 'dancer', 'dance', 'danced', 'ddd']

## The lemmatizer treats every word as a noun by default; pass a POS tag for other parts of speech

In [26]:
a="dancer"
lem.lemmatize(a,pos ="n")
#lemmatizer.lemmatize("worst", pos="a") #y

'dancer'

In [27]:
s="these places have many worst wolves. My friends love nicer to visit Zoo.All of Us goods were wearing beautiful dressess"
words = word_tokenize(s)
lemmatized_words = [lem.lemmatize(word) for word in words]
lemmatized_words

['these',
 'place',
 'have',
 'many',
 'worst',
 'wolf',
 '.',
 'My',
 'friend',
 'love',
 'nicer',
 'to',
 'visit',
 'Zoo.All',
 'of',
 'Us',
 'good',
 'were',
 'wearing',
 'beautiful',
 'dress']

In [28]:
from nltk.tokenize import word_tokenize
d = "Good nice , India is place for young people, Suyashi are you from Delhi? I love this place"
quote = word_tokenize(d)
quote

['Good',
 'nice',
 ',',
 'India',
 'is',
 'place',
 'for',
 'young',
 'people',
 ',',
 'Suyashi',
 'are',
 'you',
 'from',
 'Delhi',
 '?',
 'I',
 'love',
 'this',
 'place']

In [29]:
# next step is to tag those words by part of speech:
import nltk
#nltk.download("averaged_perceptron_tagger")
pos_tags = nltk.pos_tag(quote)
pos_tags

[('Good', 'JJ'),
 ('nice', 'NN'),
 (',', ','),
 ('India', 'NNP'),
 ('is', 'VBZ'),
 ('place', 'NN'),
 ('for', 'IN'),
 ('young', 'JJ'),
 ('people', 'NNS'),
 (',', ','),
 ('Suyashi', 'NNP'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('from', 'IN'),
 ('Delhi', 'NNP'),
 ('?', '.'),
 ('I', 'PRP'),
 ('love', 'VBP'),
 ('this', 'DT'),
 ('place', 'NN')]

In [30]:
a="worst"
lem.lemmatize(a, pos ="a")
#lemmatizer.lemmatize("worst", pos="a") #y

'bad'
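
Since `pos_tag` returns Penn Treebank tags while the lemmatizer expects WordNet's single-letter codes (`'n'`, `'v'`, `'a'`, `'r'`), a small mapping function can connect the two steps. This is a sketch under that assumption; `penn_to_wordnet` is a hypothetical helper name, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else falls back to noun, the lemmatizer's default

mapped = [penn_to_wordnet(t) for t in ["JJS", "VBD", "RB", "NNS", "DT"]]
print(mapped)
```

With the objects already defined in this notebook, it could be used as `[lem.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in pos_tags]`, so that adjectives and verbs are lemmatized with the right part of speech instead of being treated as nouns.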

## Chunk grammar
A chunk grammar is a set of rules that describe how sentences should be chunked. It often uses regular expressions.

According to the rule defined in the next cell, chunks:
- Start with an optional (?) determiner (DT)
- Can have any number (*) of adjectives (JJ)
- End with a noun (NN)

In [31]:
grammar = "NP: {<DT>?<JJ>*<NN>}"  # NP stands for noun phrase

In [32]:
chunk_parser = nltk.RegexpParser(grammar)

In [33]:
tree = chunk_parser.parse(pos_tags)

In [34]:
tree.draw()

In [43]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Sample text
text = "The quick brown fox jumps over the lazy dog"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


### Steps for Chunking

- We define the chunk grammar using regular expressions. The grammar specifies patterns that indicate how chunks should be formed. In this example, we define a simple grammar to chunk noun phrases (NP) consisting of optional determiners (DT), adjectives (JJ), and nouns (NN).

- Chunk Parser: We create a chunk parser using the defined grammar.

- Apply the chunk parser to the tagged tokens, which identifies and groups tokens according to the patterns specified in the grammar.

- Print the chunked tokens, which represent the identified phrases based on the chunking rules.

In [44]:
# Define chunk grammar using regular expressions
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # Chunk sequences of DT, JJ, and NN
    """

# Create a chunk parser using the defined grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunked_tokens = chunk_parser.parse(tagged_tokens)

# Print the chunked tokens
print(chunked_tokens)

(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN))


## Rules
According to the rule you created, your chunks:
- Start with an optional (?) determiner (DT)
- Can have any number (*) of adjectives (JJ)
- End with a noun (NN)
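
The three rules above can be traced by hand with a plain Python loop. The sketch below is not how `RegexpParser` is implemented; it only mimics the single pattern `<DT>?<JJ>*<NN>` on already tagged tokens, and `chunk_noun_phrases` is a name invented for this example.

```python
def chunk_noun_phrases(tagged):
    """Collect word groups matching: optional DT, any number of JJ, then one NN."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                           # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":    # zero or more adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":       # must end with a noun
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1                                      # continue after the chunk
        else:
            i += 1                                         # no chunk starts here
    return chunks

tagged_tokens = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
                 ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
                 ('dog', 'NN')]
np_chunks = chunk_noun_phrases(tagged_tokens)
print(np_chunks)
```

Because the tagger labelled "brown" as a noun (NN), the first chunk ends there and "fox" forms a chunk of its own, exactly as in the parser output above.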

## Let's create another example for chunking, this time focusing on verb phrases (VP) in a sentence

In [45]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Sample text
text = "Suyashi has a Rabbit that ran from the Table. She bought it from Isha . The Jumping is best Habit of the Rabbit"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
word_tokens_text = pos_tag(tokens)

# Define chunk grammar using regular expressions
chunk_grammar = r"""
    VP: {<VB.*><DT>?<JJ>*<NN>}   # Chunk sequences of verbs, determiners, adjectives, and nouns
    """

# Create a chunk parser using the defined grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunked_tokens = chunk_parser.parse(word_tokens_text)

# Print the chunked tokens
print(chunked_tokens)


(S
  Suyashi/NNP
  has/VBZ
  a/DT
  Rabbit/NNP
  that/WDT
  ran/VBD
  from/IN
  the/DT
  Table/NN
  ./.
  She/PRP
  bought/VBD
  it/PRP
  from/IN
  Isha/NNP
  ./.
  The/DT
  Jumping/NNP
  is/VBZ
  best/JJS
  Habit/NN
  of/IN
  the/DT
  Rabbit/NNP)
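
Note that the tree above contains no `VP` subtree. The pattern requires a plain `NN` right after the optional determiner and adjectives, but "Rabbit" was tagged `NNP` and "best" was tagged `JJS`, so nothing matched. One possible fix, sketched below, is to widen the pattern with `<JJ.*>` and `<NN.*>`. The tags are copied from the output above so that no tagger model is needed; the broadened grammar is a suggestion, not part of the original exercise.

```python
from nltk import RegexpParser

# (word, tag) pairs copied from the tagged output above
tagged = [("Suyashi", "NNP"), ("has", "VBZ"), ("a", "DT"), ("Rabbit", "NNP"),
          ("that", "WDT"), ("ran", "VBD"), ("from", "IN"), ("the", "DT"),
          ("Table", "NN"), (".", "."), ("She", "PRP"), ("bought", "VBD"),
          ("it", "PRP"), ("from", "IN"), ("Isha", "NNP"), (".", "."),
          ("The", "DT"), ("Jumping", "NNP"), ("is", "VBZ"), ("best", "JJS"),
          ("Habit", "NN"), ("of", "IN"), ("the", "DT"), ("Rabbit", "NNP")]

# Any verb, optional determiner, any kind of adjective, any kind of noun
broad_grammar = r"VP: {<VB.*><DT>?<JJ.*>*<NN.*>}"
tree = RegexpParser(broad_grammar).parse(tagged)

# Collect the words of every VP subtree
verb_phrases = [" ".join(word for word, _ in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "VP"]
print(verb_phrases)
```

"ran from the Table" and "bought it from Isha" are still not chunked, because a preposition or pronoun follows the verb; covering them would need additional rules.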


## Quick Practice
Let's create another example for chunking, this time focusing on extracting noun phrases (NP) along with prepositional phrases (PP) from a sentence

In [46]:
# Solution

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

# Sample text
text = "Ram Loves his Life. He have a cat named RUMMY"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Define chunk grammar using regular expressions
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}    # Chunk noun phrases
    PP: {<IN><NP>}           # Chunk prepositional phrases
    """

# Create a chunk parser using the defined grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply chunking
chunked_tokens = chunk_parser.parse(tagged_tokens)

# Print the chunked tokens
print(chunked_tokens)

(S
  Ram/NNP
  Loves/VBZ
  his/PRP$
  (NP Life/NN)
  ./.
  He/PRP
  have/VBP
  (NP a/DT cat/NN)
  named/VBN
  RUMMY/NNP)


## Using Named Entity Recognition (NER)
- Named entities are noun phrases that refer to specific locations, people, organizations, and so on.
- With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity they are.
- You can use `nltk.ne_chunk()` to recognize named entities.

In [47]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = "You know Ashi, she works  in ABC pvt Lt. India, and its CEO  Rommy  is from Australia. Rabbit plays with Cat"
# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Perform named entity recognition
named_entities = ne_chunk(tagged_tokens)  #This function identifies named entities in the text based on the part-of-speech tags.

# Print the named entities
print(named_entities) 

(S
  You/PRP
  know/VBP
  (PERSON Ashi/NNP)
  ,/,
  she/PRP
  works/VBZ
  in/IN
  (ORGANIZATION ABC/NNP)
  pvt/NN
  Lt./NNP
  (GPE India/NNP)
  ,/,
  and/CC
  its/PRP$
  (ORGANIZATION CEO/NNP Rommy/NNP)
  is/VBZ
  from/IN
  (GPE Australia/NNP)
  ./.
  (PERSON Rabbit/NNP)
  plays/VBZ
  with/IN
  (ORGANIZATION Cat/NNP))


In [48]:
%%time
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = "The teacher have a kid named Asha. They stay in Bali. They have a pet named Kipy."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Initialize a list to store named entities
named_entities = []

# Iterate through each sentence
for sentence in sentences:
    # Tokenize the sentence into words
    tokens = word_tokenize(sentence)
    # Perform part-of-speech tagging
    tagged_tokens = pos_tag(tokens)
    # Perform named entity recognition
    named_entities.extend(ne_chunk(tagged_tokens))

# Print the named entities
for entity in named_entities:
    if hasattr(entity, 'label'):
        print(' '.join(c[0] for c in entity.leaves()), '-', entity.label())


Asha - PERSON
Bali - GPE
Kipy - PERSON
CPU times: total: 0 ns
Wall time: 8.32 ms


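- Lambda functions are typically used for simple operations on single elements of a list, not for complex operations involving iteration, tokenization, tagging, and parsing. For a multi-step pipeline like this one, a list comprehension, as in the next cell, is a more natural way to condense the loop.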

In [49]:
%%time
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = "The teacher have a kid named Asha. They stay in Bali. They have a pet named Kipy."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Perform named entity recognition for each sentence and flatten the result
named_entities = [entity for sentence in sentences 
                        for entity in ne_chunk(pos_tag(word_tokenize(sentence))) 
                            if hasattr(entity, 'label')]

# Print the named entities
for entity in named_entities:
    print(' '.join(c[0] for c in entity.leaves()), '-', entity.label())


Asha - PERSON
Bali - GPE
Kipy - PERSON
CPU times: total: 15.6 ms
Wall time: 9.4 ms


In [35]:
nltk.download("maxent_ne_chunker")
nltk.download("words")
tree = nltk.ne_chunk(pos_tags)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\om\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\om\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [36]:
quote = """
She has a cat named Rajat. she bought it from Isha . Suyashi  lives in India.
 """

## Now create a function to extract named entities

In [37]:
from nltk.tokenize import word_tokenize
def extract_ne(quote):
    words = word_tokenize(quote, language="english")
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    return set(
        " ".join(i[0] for i in t)
        for t in tree
        if hasattr(t, "label") and t.label() == "NE"
    )

In [38]:
extract_ne(quote)

{'India', 'Isha', 'Rajat', 'Suyashi'}

## Named Entity Recognition (NER) with Custom Entities:
- Extracting custom named entities from text with a regular-expression chunk grammar over POS tags (here, any run of proper nouns, `NNP`) instead of NLTK's built-in NER chunker

In [50]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text
text = " Tina and Roohi are best Friends. They work in same company. They stay in America"

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

# Define a custom named entity chunker
chunk_rule = r"NE: {<NNP>+}"
custom_chunker = nltk.RegexpParser(chunk_rule)

# Apply custom chunker
custom_named_entities = custom_chunker.parse(tagged_tokens)

# Print custom named entities
print(custom_named_entities)


(S
  (NE Tina/NNP)
  and/CC
  (NE Roohi/NNP)
  are/VBP
  best/JJS
  Friends/NNS
  ./.
  They/PRP
  work/VBP
  in/IN
  same/JJ
  company/NN
  ./.
  They/PRP
  stay/VBP
  in/IN
  (NE America/NNP))


## Extracting email IDs from data

In [39]:
import re
# re is part of Python's standard library, so nothing needs to be installed
text = "Please contact me at ashi@py.com for further information."+\
        " You can also give feedback at ashi@py.in"


emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print(emails)

['ashi@py.com', 'ashi@py.in']
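
The pattern above only matches lowercase addresses. A slightly more forgiving variant, sketched below, ignores case and removes repeated addresses while keeping their first-seen order. The pattern is still a rough heuristic rather than a full email validator, and `extract_emails` is a name chosen for this example.

```python
import re

# Case-insensitive heuristic: local part, '@', domain, and a final dot-suffix
EMAIL_RE = re.compile(r"[a-z0-9._+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)

def extract_emails(text):
    """Return unique email addresses in lowercase, in order of first appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        address = match.lower()
        if address not in found:
            found.append(address)
    return found

sample = "Write to Ashi@Py.com or ashi@py.in. Again: ASHI@PY.COM"
unique_emails = extract_emails(sample)
print(unique_emails)
```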


## Conversion & Encryption

Encryption is a way of scrambling data so that only authorized parties can understand the information.

In [40]:
import string

t= 'Here I am'
pwd = "ABCD@1234"  # sample password, masked with str.translate in the next cell

print (string.capwords(t)) # Capitalize each word
print(str.upper(t)) # converts into upper case

Here I Am
HERE I AM


In [41]:
transtable = str.maketrans("ABCD@1234","1***@0000")
print (pwd.translate(transtable)) 
## the first two maketrans arguments must have equal length
## if you give only one argument to maketrans it must be a dict
#string keys in translate table must be of length 1

1***@0000
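
The translation table above masks characters, but the result cannot be turned back into the original. To illustrate the "only authorized parties can understand" idea with the same `str.maketrans` tool, here is a toy Caesar cipher where the shift acts as the key. It is a teaching sketch only and offers no real security; the function names are assumptions made for this example.

```python
import string

def make_caesar_tables(shift):
    """Build matching encryption and decryption tables for a letter shift."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    shift %= 26
    shifted = lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift]
    encrypt_table = str.maketrans(lower + upper, shifted)
    decrypt_table = str.maketrans(shifted, lower + upper)
    return encrypt_table, decrypt_table

encrypt_table, decrypt_table = make_caesar_tables(3)

secret = "Here I am".translate(encrypt_table)
restored = secret.translate(decrypt_table)
print(secret)
print(restored)
```

Characters that are not in the table, such as spaces and digits, pass through `translate` unchanged.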


## The corpora with NLTK

- The NLTK corpus is a massive dump of all kinds of natural language data sets 

In [42]:
import nltk
from nltk.corpus import PlaintextCorpusReader
# corpus_root must point to an existing folder of text files on your own machine
corpus_root=(r"C:\Users\suyashi144893\Documents\Python Analytics\Natural Language Processing using Python")
filelists = PlaintextCorpusReader(corpus_root, '.*')

OSError: No such file or directory: 'C:\\Users\\suyashi144893\\Documents\\Python Analytics\\Natural Language Processing using Python'
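
The error above appears because the hard-coded folder exists only on the author's machine. To try `PlaintextCorpusReader` without depending on any particular path, one option is to create a temporary folder with a couple of text files first. The file names and contents below are invented for this sketch.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Create a throwaway corpus folder with two small text files
corpus_root = tempfile.mkdtemp()
samples = {
    "first.txt": "NLTK reads plain text files.",
    "second.txt": "Each file becomes one document.",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf8") as handle:
        handle.write(text)

# Load every .txt file in the folder as a corpus
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")
file_ids = reader.fileids()
first_words = list(reader.words("first.txt"))
print(file_ids)
print(first_words)
```

`words()` relies on the reader's default word tokenizer and needs no downloaded data, whereas `sents()` would additionally require the Punkt sentence model.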

### Example

We can train the Punkt sentence tokenizer on our own text like this:

In [1]:
from nltk.corpus import state_union
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [2]:
train_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nFebruary 2, 2005\n\n\n9:10 P.M. EST \n\nTHE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: \n\nAs a new Congress gathers, all of us in the elected branches of government share a great privilege: We\'ve been placed in office by the votes of the people we serve. And tonight that is a privilege we share with newly-elected leaders of Afghanistan, the Palestinian Territories, Ukraine, and a free and sovereign Iraq. (Applause.) \n\nTwo weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. \n\nTonight, with a healthy, growing economy, with more Americans going back to work, with our nation an active force for good in the world -- the state of our union is confident and strong. (Applause.

In [3]:
sample_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King. (Applause.)\n\nPresident George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan. 31, 2006. White House photo by Eric DraperEvery time I\'m invited to this rostrum, I\'m humbled by the privilege, and mindful of the history we\'ve seen together. We have gathered under this Capitol dome in moments of national mourning and national achievement. We have serv

In [4]:
from nltk.tokenize import PunktSentenceTokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [5]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [8]:
### creating a function that will run through and tag all of the parts of speech per sentence 
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))


process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat
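To get a feel for the tagger output, the tags can be tallied. The sketch below is not part of the original notebook: it copies a few `(word, tag)` pairs from the second sentence printed above into a hypothetical `tagged_sample` list and counts them with `collections.Counter`, so it runs without NLTK.

```python
from collections import Counter

# A few (word, tag) pairs copied from the tagger output above (second sentence).
tagged_sample = [
    ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'),
    ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'),
    ('of', 'IN'), ('Congress', 'NNP'),
]

# Count how often each part-of-speech tag occurs.
tag_counts = Counter(tag for _, tag in tagged_sample)

# Group the words by tag to see what each tag covers.
words_by_tag = {}
for word, tag in tagged_sample:
    words_by_tag.setdefault(tag, []).append(word)

print(tag_counts.most_common())
print(words_by_tag['NNP'])
```

In the Penn Treebank tag set used here, `NNP` marks a singular proper noun, `NNS` a plural noun, and `IN` a preposition. In the notebook itself, the same tally could be run over the `tagged` list inside `process_content()`.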

### Text Classification with NLTK

In [9]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

print(documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
#print(all_words.most_common(15))
#print(all_words["poor"])

(['first', 'impressions', ':', 'critically', ',', 'a', 'close', '-', 'to', '-', 'awful', 'film', ',', 'but', 'money', '-', 'wise', ',', 'it', 'has', 'been', 'doing', '(', 'and', 'will', 'continue', 'to', 'do', ')', 'great', '.', 'a', 'sometimes', '-', 'funny', 'film', 'that', 'sags', 'and', 'lags', 'and', 'oftentimes', 'gets', 'boring', '.', 'an', 'orginal', 'plot', 'that', 'grows', 'old', 'real', 'fast', '.', 'one', 'of', 'the', 'only', '90', 'minute', 'films', 'that', 'i', "'", 've', 'gotten', 'bored', 'through', '.', 'men', 'in', 'black', 'has', 'defied', 'the', 'odds', '.', 'when', 'i', 'first', 'saw', 'that', 'the', 'flick', 'was', '89', 'minutes', 'long', ',', 'i', 'thought', 'maybe', 'that', 'this', 'was', 'a', 'poor', 'attempt', 'at', 'an', 'independence', 'day', 'type', 'film', 'that', 'just', 'ran', 'out', 'of', 'gas', '.', 'however', ',', 'i', 'now', 'realize', 'that', 'not', 'only', 'did', 'men', 'in', 'black', 'run', 'out', 'of', 'gas', ',', 'but', 'the', 'film', 'in', '90

In [10]:
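# Convert words to features.
random.shuffle(documents)

# Rebuild the frequency distribution over all lowercased words in the corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Note: keys() follows insertion order, so this takes the first 200 distinct
# words encountered, not the 200 most frequent ones.
word_features = list(all_words.keys())[:200]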

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

{'plot': True, ':': True, 'two': True, 'teen': True, 'couples': True, 'go': True, 'to': True, 'a': True, 'church': True, 'party': True, ',': True, 'drink': True, 'and': True, 'then': True, 'drive': True, '.': True, 'they': True, 'get': True, 'into': True, 'an': True, 'accident': True, 'one': True, 'of': True, 'the': True, 'guys': True, 'dies': True, 'but': True, 'his': True, 'girlfriend': True, 'continues': True, 'see': True, 'him': True, 'in': True, 'her': True, 'life': True, 'has': True, 'nightmares': True, 'what': True, "'": True, 's': True, 'deal': True, '?': True, 'watch': True, 'movie': True, '"': True, 'sorta': True, 'find': True, 'out': True, 'critique': True, 'mind': True, '-': True, 'fuck': True, 'for': True, 'generation': True, 'that': True, 'touches': True, 'on': True, 'very': True, 'cool': True, 'idea': True, 'presents': True, 'it': True, 'bad': True, 'package': True, 'which': True, 'is': True, 'makes': True, 'this': True, 'review': True, 'even': True, 'harder': True, 'wri
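The difference between taking the first N keys and taking the N most frequent words is easy to see on a toy list. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses) so it runs without NLTK. The toy words and the names `first_n` and `top_n` are made up for illustration.

```python
from collections import Counter

# Toy corpus: 'the', 'film' and 'good' are frequent but first appear late.
words = ['plot', 'two', 'teen', 'the', 'film', 'good',
         'the', 'film', 'good', 'the', 'film', 'the']

freq = Counter(words)

# First N distinct words, in order of first appearance (what keys()[:N] gives).
first_n = list(freq.keys())[:3]

# N most frequent words, ranked by count.
top_n = [word for word, _ in freq.most_common(3)]

print(first_n)
print(top_n)
```

If a frequency-based vocabulary is wanted in the notebook, `all_words.most_common(200)` would be the analogous call. Switching to it would change the feature set and therefore the accuracies reported later.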

In [11]:
# Do the same for every document, storing the feature-presence booleans
# together with each review's positive or negative category:

featuresets = [(find_features(rev), category) for (rev, category) in documents]
featuresets

[({'plot': True,
   ':': True,
   'two': True,
   'teen': False,
   'couples': False,
   'go': True,
   'to': True,
   'a': True,
   'church': False,
   'party': False,
   ',': True,
   'drink': False,
   'and': True,
   'then': True,
   'drive': False,
   '.': True,
   'they': True,
   'get': False,
   'into': True,
   'an': True,
   'accident': False,
   'one': True,
   'of': True,
   'the': True,
   'guys': False,
   'dies': False,
   'but': True,
   'his': True,
   'girlfriend': False,
   'continues': False,
   'see': False,
   'him': False,
   'in': True,
   'her': False,
   'life': False,
   'has': True,
   'nightmares': False,
   'what': False,
   "'": True,
   's': True,
   'deal': False,
   '?': True,
   'watch': True,
   'movie': True,
   '"': True,
   'sorta': False,
   'find': True,
   'out': True,
   'critique': True,
   'mind': False,
   '-': True,
   'fuck': False,
   'for': True,
   'generation': False,
   'that': True,
   'touches': False,
   'on': True,
   'very': Fal

## Naive Bayes classifiers
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
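
In symbols, for a class $c$ and features $f_1, \dots, f_n$ (notation introduced here for illustration, not taken from the notebook), Bayes' theorem combined with the independence assumption gives:

```latex
P(c \mid f_1, \dots, f_n)
  = \frac{P(c)\, P(f_1, \dots, f_n \mid c)}{P(f_1, \dots, f_n)}
  \propto P(c) \prod_{i=1}^{n} P(f_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c).
```

The denominator does not depend on $c$, so it can be dropped when picking the most probable class $\hat{c}$.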

## Constructing classifiers
A classifier is a model that assigns class labels to problem instances. Each instance is represented as a vector of feature values, and the class labels are drawn from some finite set.
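
A classifier of this kind can be sketched in a few lines of plain Python. The version below is a minimal illustration, not NLTK's implementation: the function names `train_naive_bayes` and `classify`, the toy data, and the add-one smoothing are all assumptions made for this example. It takes the same `(feature dict, label)` pairs that `featuresets` holds.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count labels and, per (label, feature), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify(features, label_counts, value_counts):
    """Return the label with the highest log-posterior score."""
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing over the two boolean values (an assumption here).
            score += math.log((seen + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data in the same shape as the notebook's featuresets.
train = [
    ({'great': True,  'boring': False}, 'pos'),
    ({'great': True,  'boring': False}, 'pos'),
    ({'great': False, 'boring': True},  'neg'),
    ({'great': False, 'boring': True},  'neg'),
]

label_counts, value_counts = train_naive_bayes(train)
prediction = classify({'great': True, 'boring': False}, label_counts, value_counts)
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.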

In [12]:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against.
testing_set = featuresets[1900:]

In [13]:
# Define and train the classifier:

classifier = nltk.NaiveBayesClassifier.train(training_set)

In [14]:
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

Classifier accuracy percent: 65.0


In [15]:
# Show which words are most informative for telling positive reviews from negative ones:

classifier.show_most_informative_features(20)

Most Informative Features
                 unravel = True              pos : neg    =      5.6 : 1.0
                    mess = True              neg : pos    =      3.7 : 1.0
                 jumbled = True              neg : pos    =      3.4 : 1.0
                terribly = True              neg : pos    =      2.7 : 1.0
                 touches = True              pos : neg    =      2.6 : 1.0
                 highway = True              neg : pos    =      2.5 : 1.0
               thrilling = True              pos : neg    =      2.4 : 1.0
                    mold = True              neg : pos    =      2.4 : 1.0
                meantime = True              neg : pos    =      2.3 : 1.0
                    hide = True              neg : pos    =      2.2 : 1.0
                  normal = True              pos : neg    =      2.2 : 1.0
                 applaud = True              pos : neg    =      2.1 : 1.0
                      it = False             neg : pos    =      2.1 : 1.0
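
Each row above is a likelihood ratio: how much more probable a feature value is under one label than under the other. The sketch below shows the arithmetic on made-up counts. The function name `likelihood_ratio`, the counts, and the smoothing constant are assumptions for illustration and are not read from the trained classifier.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio P(feature | class a) / P(feature | class b) with additive smoothing."""
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b

# Hypothetical counts: a word occurs in 30 of 950 positive
# and in 10 of 950 negative training reviews.
ratio = likelihood_ratio(30, 950, 10, 950)
print(f"pos : neg = {ratio:.1f} : 1.0")
```

A ratio well above 1 means the word points strongly toward the first label. Rare words can reach high ratios from only a handful of occurrences, so such rows deserve some caution.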

## Pickle module
The `pickle` module lets us serialize the trained classifier object to a file, so that later we only need to load that file instead of training again.

- To save, open a file for writing in binary mode and call `pickle.dump()`. Its first argument is the object to dump; the second is the open file to dump it into.
- The resulting `.pickle` file holds the serialized object. Reading it back into memory with `pickle.load()` is about as quick as reading any other ordinary file.
- Once loaded, the object can be used directly, so we no longer need to retrain the classifier every time we want to classify.
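
The save and load steps can be sketched as follows. This is a minimal example, not the notebook's original cell: a plain dictionary named `model` stands in for the trained classifier, and the file name `naivebayes.pickle` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier; any picklable Python object works the same way.
model = {'name': 'naive_bayes_demo', 'labels': ['neg', 'pos']}

# Hypothetical file name, placed in a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), 'naivebayes.pickle')

# Open the file for writing in binary mode and dump the object into it.
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Later (for example in a new session): open in binary read mode and load it back.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

In the notebook, `classifier` would take the place of `model`, and the loaded object could be passed straight to `nltk.classify.accuracy()`.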

## Scikit-learn (sklearn) with NLTK

In [16]:
from nltk.classify.scikitlearn import SklearnClassifier

#### The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [17]:
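import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data: 9 samples with 100 integer count features each, one class per sample.
X = np.random.randint(5, size=(9, 100))
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])

clf = MultinomialNB()
clf.fit(X, y)
# Predict the classes of the 7th and 8th training samples.
print(clf.predict(X[6:8]))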

[6 7]


In [18]:
# From here, you can use just about any of the sklearn classifiers. For example, let's bring in two more variants of the Naive Bayes algorithm:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
# Accuracy is printed as a fraction between 0 and 1 here, not as a percentage.
print("MultinomialNB accuracy:", nltk.classify.accuracy(MNB_classifier, testing_set))

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BernoulliNB accuracy:", nltk.classify.accuracy(BNB_classifier, testing_set))

MultinomialNB accuracy: 0.67
BernoulliNB accuracy: 0.65


In [19]:
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [20]:
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

Original Naive Bayes Algo accuracy percent: 65.0
Most Informative Features
                 unravel = True              pos : neg    =      5.6 : 1.0
                    mess = True              neg : pos    =      3.7 : 1.0
                 jumbled = True              neg : pos    =      3.4 : 1.0
                terribly = True              neg : pos    =      2.7 : 1.0
                 touches = True              pos : neg    =      2.6 : 1.0
                 highway = True              neg : pos    =      2.5 : 1.0
               thrilling = True              pos : neg    =      2.4 : 1.0
                    mold = True              neg : pos    =      2.4 : 1.0
                meantime = True              neg : pos    =      2.3 : 1.0
                    hide = True              neg : pos    =      2.2 : 1.0
                  normal = True              pos : neg    =      2.2 : 1.0
                 applaud = True              pos : neg    =      2.1 : 1.0
                      it 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression_classifier accuracy percent: 75.0
SGDClassifier_classifier accuracy percent: 62.0
SVC_classifier accuracy percent: 72.0
LinearSVC_classifier accuracy percent: 74.0
NuSVC_classifier accuracy percent: 71.0
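
To compare these results, the accuracies visible in the output above can be ranked. The sketch below also attaches a rough binomial standard error to each figure. It assumes the test set holds 100 reviews (2000 documents in total, minus the 1900 used for training). The standard-error formula is a simplification added here, not something the notebook computes.

```python
import math

# Accuracies (percent) copied from the output above.
accuracy_percent = {
    'Original Naive Bayes': 65.0,
    'LogisticRegression': 75.0,
    'SGDClassifier': 62.0,
    'SVC': 72.0,
    'LinearSVC': 74.0,
    'NuSVC': 71.0,
}

# Assumption: featuresets[1900:] holds 100 reviews.
n_test = 100

# Sort classifiers from most to least accurate.
ranked = sorted(accuracy_percent.items(), key=lambda item: item[1], reverse=True)

for name, pct in ranked:
    p = pct / 100
    # Rough binomial standard error of the accuracy, in percentage points.
    std_err = 100 * math.sqrt(p * (1 - p) / n_test)
    print(f"{name:22s} {pct:5.1f}%  (+/- {std_err:.1f})")
```

With only about 100 test reviews, each percentage point corresponds to a single review, so a gap of one or two points between classifiers (such as 75 versus 74) is likely within noise. Because `documents` is shuffled before the split, the figures will also vary from run to run.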
