## Natural Language Processing (NLP)
Natural Language Processing (NLP) is a technology that helps computers understand human language. It’s used in many areas, such as social media, banking, and insurance.
When you’re analyzing text (like a book or a tweet), you often start with a big chunk of text. Your job is to make sense of this text. For example, you might want to translate this text into another language, like Hindi.
To do this, you need to break down the big task (translating the text) into smaller tasks. These smaller tasks might include splitting the text into sentences or words, or figuring out what a word means based on the words around it.
In this course, you’ll learn about the different steps you usually take to go from a big chunk of text to understanding what the text means. This process can be split into three main parts.
The course also includes a video that explains how to understand text. The video covers topics like semantic, syntactic, and lexical processing.
Let’s take an example. Imagine you have a Wikipedia article. The article is just a bunch of characters that a computer can’t understand on its own. So, you need to follow certain steps to help the computer understand the text.
## Lexical Processing
The first step is lexical processing.
Here, you convert the raw text into words, sentences, or paragraphs, depending on what you need. For simple tasks, the words alone can be enough. For instance, if an email contains words like ‘lottery’, ‘prize’, and ‘luck’, it’s probably a spam email.
However, just looking at the words isn’t enough for more complex tasks, like translating text. For example, the sentences ‘My cat ate its third meal’ and ‘My third cat ate its meal’ mean different things, but if you’re just looking at the words, you might think they mean the same thing.
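A minimal sketch of why word-level information is not enough: if each sentence is reduced to a bag of words (a count of each word, ignoring order), the two example sentences become indistinguishable. The helper name `bag_of_words` is made up for this illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase the sentence, split on whitespace, and count each word."""
    return Counter(sentence.lower().split())

first = bag_of_words("My cat ate its third meal")
second = bag_of_words("My third cat ate its meal")

# Same words with the same counts, so the two bags are equal
# even though the sentences mean different things.
print(first == second)  # True
```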
## Syntactic Processing
That’s where syntactic processing comes in. This is the next step after lexical processing. Here, you try to understand the sentence better by looking at its structure or grammar. For example, you might look at who is doing an action and who is affected by it. This can help you tell the difference between sentences like ‘Ram thanked Shyam’ and ‘Shyam thanked Ram’.
## Semantic Processing
Semantic processing is the next step after lexical and syntactic processing. It’s used when you’re building more advanced NLP applications, like language translation and chatbots.
After Lexical and Syntactic processing, the machine might still not understand the meaning of the text. For example, it might not know that ‘PM’ and ‘Prime Minister’ mean the same thing. So, you need a way for the machine to learn this on its own.
One way to do this is by looking at the words that usually appear around a word. If ‘PM’ and ‘Prime Minister’ often appear around similar words, you can assume they mean the same thing.
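The idea can be sketched with a toy example. The corpus below is invented, ‘Prime Minister’ is written as the single token `prime_minister` to keep the code short, and the helper names `context_counts` and `cosine` are assumptions of this sketch rather than part of any library. Real systems learn from far larger corpora.

```python
from collections import Counter
import math

# Tiny made-up corpus, purely for illustration.
corpus = [
    "the pm addressed the nation today",
    "the prime_minister addressed the nation today",
    "the pm met the president",
    "the prime_minister met the president",
    "the cat chased the mouse",
]

def context_counts(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    w for j, w in enumerate(words[lo:hi], start=lo) if j != i
                )
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

pm = context_counts("pm", corpus)
prime_minister = context_counts("prime_minister", corpus)
cat = context_counts("cat", corpus)

print(cosine(pm, prime_minister))  # contexts are identical in this toy corpus
print(cosine(pm, cat))             # lower: only 'the' is shared
```

The first score is higher than the second, which is the signal a machine can use to guess that the two terms are interchangeable.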
The machine should also be able to understand other relationships between words. For example, it should know that ‘King’ and ‘Queen’ are related, and that ‘Queen’ is the female version of ‘King’. These words can be grouped under the word ‘Monarch’.
Once the machine understands the meaning of words through Semantic Analysis, it can be used for various applications. <b>These applications need a complete understanding of the text, from the words (Lexical level), to the sentence structure (Syntactic level), to the meaning of the words (Semantic level)</b>.
In simpler applications, only lexical processing might be needed. But in most applications, lexical and syntactic processing form the ‘preprocessing’ layer of the overall process.
Now that you have a basic idea of how to analyze text and understand its meaning, the next segment will teach you how the text is stored on machines.

In [1]:
amount="$50^"
amount_encoded = amount.encode('utf-16')  # 'utf-8' or 'ascii' would also work here; these are different ways of encoding text
print(f"The amount {amount} and its encoded version {amount_encoded}")
amount_decoded=amount_encoded.decode("utf-16")
print(f"The encoded version {amount_encoded} and its normal format {amount_decoded}")

The amount $50^ and its encoded version b'\xff\xfe$\x005\x000\x00^\x00'
The encoded version b'\xff\xfe$\x005\x000\x00^\x00' and its normal format $50^
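As a rough comparison, the sketch below encodes the same string with three codecs and prints the size of each result. The rupee sign in the second part is an illustrative choice that does not appear in the cell above; it shows what happens when a character falls outside the ASCII range.

```python
amount = "$50^"

for codec in ("ascii", "utf-8", "utf-16"):
    encoded = amount.encode(codec)
    print(f"{codec:>6}: {len(encoded):2d} bytes -> {encoded}")

# A character outside ASCII cannot be encoded as ASCII, but UTF-8 handles it.
price = "₹50"
try:
    price.encode("ascii")
except UnicodeEncodeError as err:
    print("ascii failed:", err.reason)
print(price.encode("utf-8"))
```

UTF-16 needs two bytes per character here plus a two-byte byte order mark, which is why its output is the longest.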


## Regular Expressions – Quantifiers I
From this segment onwards, you’ll learn about regular expressions. They are often known as regexes and are extremely powerful programming tools that can be used to extract features from the text, replace strings and perform other string manipulations. Being familiar with regular expressions is essential for becoming a text analytics expert.

A regular expression is a pattern, that is, a sequence of characters, used to find substrings in a text.

For example, let’s say you want to extract all of the hashtags from a tweet. A hashtag follows a set pattern: a pound (‘#’) character followed by a string. Some hashtags are ‘#mumbai’, ‘#bengaluru’ and ‘#upgrad’. You can extract them by passing the pattern (in this example, any string beginning with ‘#’) together with the tweet to a regex function.

Studying regular expressions means learning how to recognise and define these patterns.
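The hashtag example could look like the sketch below. The tweet is invented, and the pattern assumes that a hashtag is ‘#’ followed by letters, digits, or underscores.

```python
import re

# Made-up tweet for illustration.
tweet = "Monsoon is here! #mumbai #bengaluru learning NLP with #upgrad"

# '#' followed by one or more word characters (letters, digits, underscore).
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#mumbai', '#bengaluru', '#upgrad']
```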



In [2]:
import re

In [3]:
example="this consist of sentences which is random i even dont know what i am typing"

In [4]:
re.search("this",example)

<re.Match object; span=(0, 4), match='this'>

In [5]:
def find_pattern(word,pattern):
    # search once and reuse the result; return "Not Found" when nothing matches
    match = re.search(pattern,word)
    if match:
        return match
    else:
        return "Not Found"

In [6]:
find_pattern("abb","ab*")#zero or more 

<re.Match object; span=(0, 3), match='abb'>

In [7]:
find_pattern("abbb","ab+")#one or more

<re.Match object; span=(0, 4), match='abbb'>

In [8]:
find_pattern("abbb","ab?")#zero or one

<re.Match object; span=(0, 2), match='ab'>

In [9]:
find_pattern("abbb","ab{1,3}")  # {m,n} sets the allowed number of repetitions, here 1 to 3

<re.Match object; span=(0, 4), match='abbb'>

In [10]:
# ^ anchors the match at the start of the string
# $ anchors the match at the end of the string, so it must come after the character, as in "n$"
print(find_pattern("james","^j"))
print(find_pattern("rohan","$n"))  # "$n" asks for an 'n' after the end of the string, so nothing matches

<re.Match object; span=(0, 1), match='j'>
Not Found
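For comparison, a short sketch with the anchors in their usual positions, calling `re.search` directly. The strings are the same kind of toy input as in the cell above.

```python
import re

start = re.search(r"^j", "james")            # 'j' at the start of the string
end = re.search(r"n$", "rohan")              # 'n' at the end of the string
whole = re.search(r"^rohan$", "rohan kumar")  # the whole string must be exactly "rohan"

print(start)
print(end)
print(whole)  # None
```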


WildCard

In [11]:
# . is a wildcard: it matches any single character (except a newline, by default)
print(find_pattern("hello",".$"))

<re.Match object; span=(4, 5), match='o'>


Character Sets []

In [12]:
# characters inside square brackets form a set; the pattern matches any one character from that set
# a hyphen inside the brackets gives a range of characters
find_pattern("rohan","[a-r]")
find_pattern("hello","[hehe]")  # duplicates in a set are ignored, so this is the same as [he]
# internally, a range is compared by character code (ASCII/Unicode)
# for example, 'a' has the code 97, so [a-c] covers the codes 97, 98 and 99


<re.Match object; span=(0, 1), match='h'>

<b>Some popular Character sets</b>
-  [aeiou] matches any vowel.
-  [0-9] matches any digit.
-  [^0-9] matches any non-digit character.
-  [\d\s] matches any digit or whitespace character.
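-  [A-Z] matches any uppercase letter.
-  [abcAbc] matches a, b, c or A; repeated characters in a set are ignored.
-  [A-z] is not a case-insensitive match: it covers every character code from A to z, which also includes [, \, ], ^, _ and `. Use [A-Za-z] to match letters in either case.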
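The difference between `[A-z]` and `[A-Za-z]` is easy to check. The sample string below is made up so that it contains some of the punctuation that sits between `Z` and `a` in the ASCII table.

```python
import re

sample = "Hi_there^[ok]"

loose = re.findall(r"[A-z]", sample)      # letters plus _, ^, [ and ]
strict = re.findall(r"[A-Za-z]", sample)  # letters only
flagged = re.findall(r"[a-z]", sample, flags=re.IGNORECASE)  # letters only, via a flag

print(loose)
print(strict)
print(flagged)
```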


<b>Meta Sequence</b>
-  \d: Matches any decimal digit; this is equivalent to the set [0-9].
-  \D: Matches any non-digit character; this is equivalent to the set [^0-9].
-  \s: Matches any whitespace character; this includes spaces, tabs, and newline characters.
-  \S: Matches any non-whitespace character.
-  \w: Matches any alphanumeric character; this includes letters, digits, and underscores.
-  \W: Matches any non-alphanumeric character.
-  \b: Matches a word boundary (position between a word character and a non-word character).
-  \B: Matches a non-word boundary.
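A small sketch of `\b` and `\D`, which the next cell does not cover. The sample strings are invented for illustration.

```python
import re

text = "cat concat cats cat."

anywhere = re.findall(r"cat", text)         # every occurrence, even inside other words
whole_words = re.findall(r"\bcat\b", text)  # only the standalone word
non_digits = re.findall(r"\D+", "room 42b")  # runs of non-digit characters

print(anywhere)
print(whole_words)
print(non_digits)
```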

In [13]:
print(find_pattern("Hello ",r"\s+"))  # matches when one or more whitespace characters are present
print(find_pattern("124",r"\d+"))  # matches when one or more digits are present
print(find_pattern("hey12He",r"\w+"))
print(find_pattern("Hey",r"\w"))
print(find_pattern("1ey",r"\w"))

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(0, 3), match='124'>
<re.Match object; span=(0, 7), match='hey12He'>
<re.Match object; span=(0, 1), match='H'>
<re.Match object; span=(0, 1), match='1'>


<b>Greedy vs Non-Greedy regex </b>

<b>Greedy Matching</b>: Greedy matching attempts to match as much as possible while still satisfying the entire pattern. For example, if you use the pattern bat* on the string "batsman", it will match "bat" followed by as many "t" characters as possible, resulting in "bat" being matched entirely.


<b>Non-greedy (Lazy) Matching</b>: Non-greedy matching, on the other hand, attempts to match as little as possible while still satisfying the entire pattern. For example, if you use the pattern bat*? on the string "batsman", it will match "bat" followed by as few "t" characters as possible, resulting in just "ba" being matched.
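The two "batsman" claims can be checked directly. The HTML-like string in the second half is an added illustration of a case where the difference matters in practice.

```python
import re

greedy = re.search(r"bat*", "batsman").group()
lazy = re.search(r"bat*?", "batsman").group()
print(greedy)  # bat  (greedy: takes the 't')
print(lazy)    # ba   (lazy: takes zero 't' characters)

html = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.*>", html)   # one long match from the first '<' to the last '>'
lazy_tags = re.findall(r"<.*?>", html)    # each tag separately
print(greedy_tags)
print(lazy_tags)
```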

In [14]:
print(find_pattern("aaabbbbc","ab{3,5}"))

<re.Match object; span=(2, 7), match='abbbb'>


In [15]:
print(find_pattern("aaabbbb","ab{3,5}?"))
# the lazy quantifier stops as soon as the minimum (here 3) is satisfied and does not go beyond it

<re.Match object; span=(2, 6), match='abbb'>


### re.match(pattern, string):
This function attempts to match the pattern at the beginning of the string. If the pattern matches, it returns a match object; otherwise, it returns None.
### re.search(pattern, string):
This function searches the entire string for a match to the pattern. It returns a match object if the pattern is found anywhere in the string; otherwise, it returns None.
### re.findall(pattern, string):
This function finds all occurrences of the pattern in the string and returns them as a list of strings. It does not return match objects but only the matched substrings. If the pattern contains capture groups, the list holds the groups instead of the whole match.
### re.sub(pattern, repl, string):
This function replaces occurrences of the pattern in the string with the replacement string repl. It returns the modified string.
### re.split(pattern, string):
This function splits the string into substrings based on the occurrences of the pattern and returns them as a list of strings.
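One detail of `re.findall` that is easy to miss is how capture groups change its return value. The sketch below uses an invented string, and also shows `re.finditer`, a related function that yields match objects when positions are needed.

```python
import re

text = "order 12 shipped, order 345 pending"

no_groups = re.findall(r"order \d+", text)           # whole matches
one_group = re.findall(r"order (\d+)", text)         # just the group
two_groups = re.findall(r"order (\d+) (\w+)", text)  # tuples of groups

print(no_groups)
print(one_group)
print(two_groups)

for m in re.finditer(r"\d+", text):  # finditer yields match objects
    print(m.group(), m.span())
```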

In [16]:
import re

# Sample string
text = "The quick brown fox jumps over the lazy dog"

# 1. re.match()
match_obj = re.match(r"The", text)
if match_obj:
    print("Match found at the beginning:", match_obj.group())
else:
    print("No match found")

# 2. re.search()
search_obj = re.search(r"fox", text)
if search_obj:
    print("Match found anywhere:", search_obj.group())
else:
    print("No match found")

# 3. re.findall()
matches = re.findall(r"\b\w{4}\b", text)
print("All words with 4 characters:", matches)

# 4. re.sub()
modified_text = re.sub(r"brown", "red", text)
print("Modified text:", modified_text)

# 5. re.split()
words = re.split(r"\s", text)
print("Words in the text:", words)

# usage of the group function (Python patterns are plain strings and take no /.../ delimiters):
pattern=r"\b\d{3}"

Match found at the beginning: The
Match found anywhere: fox
All words with 4 characters: ['over', 'lazy']
Modified text: The quick red fox jumps over the lazy dog
Words in the text: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


<b>A practical implementation of regex</b>

In [17]:
import re
group= ("(555) 123-4567","555-123-4567","5551234567")
information="So here is my phone no.555-123-4567 and my gmail rohanpatel.737797045@gmail.com"
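pattern=r"\(\d{3}\)\W?\s*\d{3}\W?\s?\d{4}|\d{3}\W?\d{3}\W?\d{4}|\d{10}"
# the e-mail branch of pattern2 never matches here: [0.9] is the set {'0', '.', '9'} rather than the range [0-9],
# and the extra \. demands a dot right before '@', which is why only the phone number is found
pattern2=r"\(\d{3}\)\W?\s*\d{3}\W?\s?\d{4}|\d{3}\W?\d{3}\W?\d{4}|\d{10}|[a-zA-Z0-9]*\.[0.9]*\.@gmail\.com"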
# for x in group:
#     print(re.findall(pattern,x))
print(re.findall(pattern2,information))

['555-123-4567']


In [18]:
import re

group = ("(555) 123-4567", "555-123-4567", "5551234567")
pattern = r"(\d{3})\W?\s*(\d{3})\W?\s?(\d{4})"

for phone_number in group:
    match = re.search(pattern, phone_number)
    if match:
        area_code = match.group(1)
        exchange_code = match.group(2)
        subscriber_number = match.group(3)
        print(f"Area Code: {area_code}, Exchange Code: {exchange_code}, Subscriber Number: {subscriber_number}")
    else:
        print(f"No match found for '{phone_number}'")

Area Code: 555, Exchange Code: 123, Subscriber Number: 4567
Area Code: 555, Exchange Code: 123, Subscriber Number: 4567
Area Code: 555, Exchange Code: 123, Subscriber Number: 4567


In [19]:
import re
information = "So here is my phone no.555-123-4567 and my gmail rohanpatel.737797045@gmail.com"
pattern_phone = r"\(\d{3}\)\W?\s*\d{3}\W?\s?\d{4}|\d{3}\W?\d{3}\W?\d{4}|\d{10}"
pattern_gmail = r"[a-zA-Z0-9]+\.[0-9]+@gmail\.com"
phone_numbers = re.findall(pattern_phone, information)
gmail_addresses = re.findall(pattern_gmail, information)

print("Phone Numbers:", phone_numbers)
print("Gmail Addresses:", gmail_addresses)


Phone Numbers: ['555-123-4567']
Gmail Addresses: ['rohanpatel.737797045@gmail.com']


In [20]:
!python -m spacy info
# this command shows information about the spaCy installation: version, location, platform and installed pipelines


spaCy version    3.7.4                         
Location         C:\Users\rashi\anaconda\lib\site-packages\spacy
Platform         Windows-10-10.0.22621-SP0     
Python version   3.9.13                        
Pipelines                                      



In [21]:
import nltk
from nltk.tokenize import sent_tokenize
sentence="Hello this is an example of how we can use Tokenization. Here it ends"
sent_tokenize(sentence)



['Hello this is an example of how we can use Tokenization.', 'Here it ends']

In [22]:
from nltk.tokenize import word_tokenize
word_tokenize(sentence)

['Hello',
 'this',
 'is',
 'an',
 'example',
 'of',
 'how',
 'we',
 'can',
 'use',
 'Tokenization',
 '.',
 'Here',
 'it',
 'ends']

## Design Philosophy:
-  NLTK: NLTK is a comprehensive library for natural language processing (NLP) written in Python. It provides a wide range of tools and modules for tasks such as tokenization, stemming, tagging, parsing, and classification. NLTK is designed to be a learning tool, making it suitable for educational purposes and experimentation.
-  spaCy: spaCy is a modern and efficient library for NLP written in Python and Cython. It's designed to be fast, easy to use, and production-ready. spaCy focuses on providing state-of-the-art performance and capabilities for real-world NLP applications.
## Ease of Use:
-  NLTK: NLTK is more flexible and customizable, allowing users to build NLP pipelines from scratch and experiment with different algorithms and techniques. It provides a wide range of modules and functions, but this flexibility can make it more complex for beginners.
-  spaCy: spaCy is designed to be simple and intuitive, with pre-trained models and streamlined APIs that make common NLP tasks easy to perform. It provides out-of-the-box support for tasks such as tokenization, named entity recognition (NER), part-of-speech (POS) tagging, and dependency parsing, making it more beginner-friendly and efficient for many use cases.
## Performance:
-  NLTK: NLTK provides a rich set of algorithms and tools, but it may not be as fast as spaCy, especially for large-scale or production-grade NLP tasks.
-  spaCy: spaCy is known for its speed and efficiency, thanks to its optimized implementation and use of Cython. It's designed to handle large volumes of text and perform complex NLP tasks quickly, making it well-suited for production environments and real-time applications.

In [None]:
nltk.download()  # opens the NLTK downloader, which lists all the packages NLTK offers
# use it to see what each package is for and which ones are already installed

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [43]:
import sys
print(sys.version) # this command is used to check the version of python


3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]


In [None]:
import spacy
nlp = spacy.blank("en")

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

for token in doc:
    print(token)

In [None]:
doc[0]  # the first token; its attributes can tell us more about what it is

In [None]:
example=doc[0]  # a reference to the first token
example.is_alpha  # after typing "is_", press Tab to see the other available attributes

In [None]:
num=doc[11]
num.is_digit  # Tab completion works only after the cell that defines the variable has been run,
# because only then does the notebook know the variable and can list its attributes

In [None]:
with open("ForNlp.txt") as f:
    text=f.readlines()
text
type(text)

In [None]:
text[0]

In [None]:
from nltk.tokenize import word_tokenize
hello=word_tokenize(text[0])
print(hello)

In [None]:
hey=" ".join(hello)
print(hey)

#### Reading a file and extracting Gmail addresses and phone numbers with built-in library features

In [None]:
with open("Dummie.txt") as f:
    text=f.readlines()
type(text)
# readlines() returns a list, so join all of its elements into one single string
textz=" ".join(text)

## Steps to use spaCy
1. Create a pipeline for the language you want to work with, for example <b>nlp=spacy.blank("en")</b> for English or <b>spacy.blank("hi")</b> for Hindi (one language code per call), and store it in a variable so that it can be used as an object.
2. Pass a sentence to the object you have created, like <b>nlp("here is the sentence")</b>.
3. Perform operations on the result using the built-in attributes and functions provided by the spaCy library.

In [None]:
import spacy
nlp=spacy.blank("en")
docsec=nlp(textz)
emails=[]
for token in docsec:
    if token.like_email:
        emails.append(token)
num=1
for x in emails:
    print(f"This is the {num} email {x} ")
    num+=1

In [None]:
doc_text="का हिन्दी अनुवाद | कोलिन्स अंग्रेज़ी-हिन्दी शब्दकोश"
with open("hindifile.txt","w",encoding="utf-8") as f:
    f.write(doc_text)

In [None]:
doc=nlp("here is an example of an sentence where it includes many thing which cannot be compiled at once")
token_doc=[token.text for token in doc]
# token.text gives each token as a plain string, so token_doc is a list of strings;
# using just `token` in the comprehension would store spacy.tokens.token.Token objects instead
print(token_doc)
type(token_doc[1])


In [None]:
# suppose a sentence contains "gimme" and we want the tokenizer to split it into two tokens
# we can add a special case just for "gimme" so that tokenization always returns "gim" and "me"
from spacy.symbols import ORTH
nlp.tokenizer.add_special_case("gimme",[{ORTH:"gim"},{ORTH:"me"}])
# add_special_case() takes exactly 2 positional arguments
nlp.tokenizer.add_special_case("sandwich",[{ORTH:"sand"},{ORTH:"wich"}])
example=nlp("gimme a sandwich")
token=[token.text for token in example]
print(token)
# remember: a special case can only split a word into pieces, it can never change the characters
# the ORTH values must join back to the original text, so spaCy rejects "give" in place of "gim"

In [None]:
# to split a text into sentences, the pipeline needs a component that sets sentence boundaries
# a pipeline is a series of components that run after the tokenizer
# a blank pipeline has no components, so asking for doc.sents below fails
doctor=nlp("Dr. Ambedakar Centrally Sponsored Scheme of Post-Matric Scholarships for the Economically Backward Class (EBC) Students\" is a Scholarship Scheme by the Department of Social Justice and Empowerment, Ministry of Social Justice and Empowerment.")
for sent in doctor.sents:
    print(sent)

# this raises an error because no component has set the sentence boundaries yet

In [None]:
nlp.pipe_names

In [None]:
nlp.add_pipe("sentencizer")

In [None]:
nlp.pipe_names
# the sentencizer component is now part of the pipeline, so the pipeline can split text into sentences

In [46]:
# now we are able to perform the operation above
doctor=nlp(doctor.text)  # run the same text through the updated pipeline
for sent in doctor.sents:
    print(sent)


![Alt text](2.png)
### When you use spacy.blank, you create a tokenizer with an empty pipeline
![Alt text](1.png)
### Here you get a pipeline filled with components, whose attributes and rules are applied after tokenization
### A pipeline is basically a set of components

## Language Processing Pipeline

Overview of spaCy Language Processing Pipeline:

-  Components: Tokenization, Part-of-Speech Tagging, Lemmatization, Named Entity Recognition (NER).
-  Importance: Essential for NLP tasks, enhances text processing capabilities.

Customization of spaCy Pipeline:

-  Blank Pipeline Customization: Users can include only necessary components for their specific needs.
-  Differences: Pre-trained pipelines vs. blank pipelines, flexibility vs. specificity.

Accessing NLP Tools:

-  APIs: Provide quick access for implementation without deep knowledge.
-  Importance of Understanding: Key components like part of speech and entities aid in NLP analysis and processing.

Named Entity Recognition (NER):

-  Customization: Allows recognition of entities in text, enhancing language analysis.
-  Variation in Pipelines: Different pipelines offer varying language support and components for entity recognition.

Utilizing 'displacy' Module:

-  Visual Entity Display: Helps understand entity types like organization, money, and people.

Importance of Language Pipelines:

-  Obtaining POS, Lemma, and Entity Recognition: Crucial for text processing and comprehension.
-  Differences between Blank and Trained Pipelines: Highlighted importance of customization for specific tasks.

Customizing Blank Pipeline:

-  Adding Custom Components: Enhances entity recognition for specific tasks in English sentences.

In [1]:
import spacy

!: This symbol indicates that the command is executed in the shell (command-line interface) rather than as Python code; whatever follows it is a shell command.

python: This is the command used to execute Python code in the shell.

-m: This flag stands for "module" and tells Python to run the specified module as a script. It's used when you want to execute a module as a standalone script, without needing to know the exact location of the module file.

spacy: This is the module or package name that we want to run.

download: This is a command provided by the spacy module to download resources such as language models or data files.
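To see what `-m` does without downloading anything, the sketch below runs a standard-library module as a script from Python. `json.tool` is chosen only for illustration (it pretty-prints JSON read from standard input), and `sys.executable` is used so that the module is looked up in the same environment as the notebook.

```python
import subprocess
import sys

# Equivalent to typing `python -m json.tool` in a shell and piping JSON into it.
result = subprocess.run(
    [sys.executable, "-m", "json.tool"],
    input='{"model": "en_core_web_sm"}',
    capture_output=True,
    text=True,
)
print(result.stdout)
```

The same list form, with `"spacy", "download", "en_core_web_sm"` after `"-m"`, would run the download command without relying on the `!` notebook syntax.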



In [5]:
!python -m spacy download en_core_web_sm
#command to download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 10.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
# this command shows where Python is installed and which environment is in use
import sys
print(sys.executable)

C:\Users\rashi\anaconda\python.exe


In [1]:
import spacy
nlp=spacy.load("en_core_web_sm")


In [6]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1f7a9dc6640>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1f7a9dc65e0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1f7a9ca4dd0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1f7a9cdd680>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1f7a9f29900>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1f7a9da7350>)]

In [6]:
one=nlp("it means that the command is being executed in the shell or command line interface rather than being interpreted as Python code.")
for token in one:
    print(f"{token}| {token.pos_} | {token.lemma_}")

it| PRON | it
means| VERB | mean
that| SCONJ | that
the| DET | the
command| NOUN | command
is| AUX | be
being| AUX | be
executed| VERB | execute
in| ADP | in
the| DET | the
shell| NOUN | shell
or| CCONJ | or
command| NOUN | command
line| NOUN | line
interface| NOUN | interface
rather| ADV | rather
than| ADP | than
being| AUX | be
interpreted| VERB | interpret
as| ADP | as
Python| PROPN | Python
code| NOUN | code
.| PUNCT | .


In [7]:
# an example of NER (Named Entity Recognition)
two=nlp("Tesla will spend more than $500 million this year to expand its fast-charging network, CEO Elon Musk said on Friday")
for ent in two.ents:
    print(ent.text,"|",ent.label_,"|",spacy.explain(ent.label_))
    # label_ tells you what kind of entity a span is, for example an organisation or a geographical place

Tesla | ORG | Companies, agencies, institutions, etc.
more than $500 million | MONEY | Monetary values, including unit
this year | DATE | Absolute or relative dates or periods
Elon Musk | PERSON | People, including fictional
Friday | DATE | Absolute or relative dates or periods


## A way to visualise NER

In [8]:
from spacy import displacy
displacy.render(two,style="dep")

In [42]:
displacy.render(two,style="ent")

In [40]:
help(displacy)

Help on package spacy.displacy in spacy:

NAME
    spacy.displacy - spaCy's built in visualization suite for dependencies and named entities.

DESCRIPTION
    DOCS: https://spacy.io/api/top-level#displacy
    USAGE: https://spacy.io/usage/visualizers

PACKAGE CONTENTS
    render
    templates

FUNCTIONS
    app(environ, start_response)
    
    get_doc_settings(doc: spacy.tokens.doc.Doc) -> Dict[str, Any]
    
    parse_deps(orig_doc: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], options: Dict[str, Any] = {}) -> Dict[str, Any]
        Generate dependency parse in {'words': [], 'arcs': []} format.
        
        orig_doc (Union[Doc, Span]): Document to parse.
        options (Dict[str, Any]): Dependency parse specific visualisation options.
        RETURNS (dict): Generated dependency parse keyed by words and arcs.
    
    parse_ents(doc: spacy.tokens.doc.Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]
        Generate named entities in [{start: i, end: i, label: 'label'}

In [11]:
nlp_one=spacy.blank("en")
nlp_one.add_pipe("ner",source=nlp)
hey=nlp_one("Dr. Strange hey yeah. Hulk loves chat from delhi")
for token in hey:
    print(f"{token} | {token.ent_type_}")

Dr. | 
Strange | PERSON
hey | 
yeah | 
. | 
Hulk | 
loves | 
chat | 
from | 
delhi | GPE


## Stemming and lemmatization
### Stemming
Stemming uses fixed rules, such as removing suffixes like "able" and "ing", to derive the base form of a word.
### Lemmatization
Lemmatization brings a word to its root word (lemma) using knowledge of the language.
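To make the "fixed rules" idea concrete, here is a deliberately naive suffix stripper. It is a toy written for this note, not the Porter algorithm used in the next cell, and the suffix list and the helper name `toy_stem` are assumptions of the sketch.

```python
# Toy suffix-stripping rules, checked in order. This is NOT the Porter algorithm.
SUFFIXES = ("able", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["eating", "walked", "cats", "readable", "running", "ate"]:
    print(f"{word} | {toy_stem(word)}")
```

The rules turn "running" into "runn" and leave "ate" untouched. Mapping "ate" to "eat" needs knowledge of the language, which is what lemmatization adds.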

In [2]:
# stemming needs the NLTK library because spaCy does not provide a stemmer
import nltk 
import spacy
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import PorterStemmer  # PorterStemmer is a class, so create an object of it before calling its methods
stemmer= PorterStemmer()
words=["running","cats","jumped","studying","friendly","eating","walked","swimming","houses","faster"]
stemmed_words=[]
for word in words:
    stemmed_words.append(stemmer.stem(word))
for worde,wordr in zip(words,stemmed_words):
    print(f"{worde}|{wordr}")



running|run
cats|cat
jumped|jump
studying|studi
friendly|friendli
eating|eat
walked|walk
swimming|swim
houses|hous
faster|faster


In [3]:
#performing lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer= WordNetLemmatizer()
words=["running","cats","jumped","studying","friendly","eating","walked","swimming","houses","faster"]
for word in words:
    print(f"{word} | {lemmatizer.lemmatize(word)}")
# the results are not what we expected because lemmatize() treats every word as a noun unless a POS tag is passed, e.g. pos="v"
# now let's use spaCy

running | running
cats | cat
jumped | jumped
studying | studying
friendly | friendly
eating | eating
walked | walked
swimming | swimming
houses | house
faster | faster


In [4]:
import spacy
nlp=spacy.load("en_core_web_sm")
hey=nlp("running cats jumped studying friendly eating walked swimming houses faster better")
for word in hey:
    print(f"{word} | {word.lemma_}")


running | run
cats | cat
jumped | jump
studying | study
friendly | friendly
eating | eating
walked | walk
swimming | swimming
houses | house
faster | fast
better | well


In [5]:
       
# suppose you know that some words mean the same thing but the model does not; you can add a custom rule for them
nlp.pipe_names
# attribute_ruler assigns attributes to tokens, so you can customise it to get the result you want
ar=nlp.get_pipe("attribute_ruler")

ar.add([[{"TEXT":"bro"}],[{"TEXT":"bruh"}]],{"LEMMA":"Brother"})
hey=nlp("bruh means brother and bro also means brother")
for word in hey:
    print(f"{word} | {word.lemma_}")
    

bruh | Brother
means | mean
brother | brother
and | and
bro | Brother
also | also
means | mean
brother | brother


### POS (Part of Speech) tagging with spaCy

In [6]:
import spacy
nlp=spacy.load("en_core_web_sm")
sentence=nlp("Dr. Strange hey yeah. Hulk loves chat from delhi.And hulk made it to the very end")
for word in sentence:
    # also print the fine-grained tag, which carries details such as the tense of a verb
    print(f"{word} | {word.pos_} | {spacy.explain(word.pos_)} |{word.pos} | {word.tag_} | {spacy.explain(word.tag_)}")

Dr. | PROPN | proper noun |96 | NNP | noun, proper singular
Strange | PROPN | proper noun |96 | NNP | noun, proper singular
hey | INTJ | interjection |91 | UH | interjection
yeah | INTJ | interjection |91 | UH | interjection
. | PUNCT | punctuation |97 | . | punctuation mark, sentence closer
Hulk | PROPN | proper noun |96 | NNP | noun, proper singular
loves | VERB | verb |100 | VBZ | verb, 3rd person singular present
chat | NOUN | noun |92 | NN | noun, singular or mass
from | ADP | adposition |85 | IN | conjunction, subordinating or preposition
delhi | ADV | adverb |86 | RB | adverb
. | PUNCT | punctuation |97 | . | punctuation mark, sentence closer
And | CCONJ | coordinating conjunction |89 | CC | conjunction, coordinating
hulk | NOUN | noun |92 | NN | noun, singular or mass
made | VERB | verb |100 | VBD | verb, past tense
it | PRON | pronoun |95 | PRP | pronoun, personal
to | ADP | adposition |85 | IN | conjunction, subordinating or preposition
the | DET | determiner |90 | DT | deter

-  {word.pos_}: This placeholder is replaced with the textual representation of the part-of-speech (POS) tag of the word.
-  {spacy.explain(word.pos_)}: This placeholder is replaced with the explanation of the POS tag provided by spaCy.
-  {word.pos}: This placeholder is replaced with the numerical representation of the POS tag of the word.
-  {word.tag_}: This placeholder is replaced with the textual representation of the detailed POS tag (also known as the fine-grained tag) of the word.
-  {spacy.explain(word.tag_)}: This placeholder is replaced with the explanation of the detailed POS tag provided by spaCy.

In [7]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:
·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)

In [8]:
count = doc.count_by(spacy.attrs.POS)
count

{96: 13,
 92: 46,
 100: 24,
 90: 9,
 85: 16,
 93: 16,
 97: 27,
 98: 1,
 84: 20,
 103: 10,
 87: 6,
 99: 5,
 89: 12,
 86: 3,
 94: 3,
 95: 2}

In [9]:
for k,v in count.items():
    print(doc.vocab[k].text,"|",v)

PROPN | 13
NOUN | 46
VERB | 24
DET | 9
ADP | 16
NUM | 16
PUNCT | 27
SCONJ | 1
ADJ | 20
SPACE | 10
AUX | 6
SYM | 5
CCONJ | 12
ADV | 3
PART | 3
PRON | 2


In [27]:
import spacy
ner=nlp.get_pipe("ner")  # the named entity recognizer component
for label in ner.labels:
    print(f"{label}| {spacy.explain(label)}")

CARDINAL| Numerals that do not fall under another type
DATE| Absolute or relative dates or periods
EVENT| Named hurricanes, battles, wars, sports events, etc.
FAC| Buildings, airports, highways, bridges, etc.
GPE| Countries, cities, states
LANGUAGE| Any named language
LAW| Named documents made into laws.
LOC| Non-GPE locations, mountain ranges, bodies of water
MONEY| Monetary values, including unit
NORP| Nationalities or religious or political groups
ORDINAL| "first", "second", etc.
ORG| Companies, agencies, institutions, etc.
PERCENT| Percentage, including "%"
PERSON| People, including fictional
PRODUCT| Objects, vehicles, foods, etc. (not services)
QUANTITY| Measurements, as of weight or distance
TIME| Times smaller than a day
WORK_OF_ART| Titles of books, songs, etc.


In [37]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [64]:
doc=nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(f"{ent}|{ent.label_}")
# the model missed Tesla and labelled Twitter as a PRODUCT; NER models make mistakes like this,
# so we can set the entities ourselves


Twitter|PRODUCT
$45 billion|MONEY


In [63]:
from spacy.tokens import Span
# Span(doc, start, end, label) uses token indices: token 0 is "Tesla" and token 5 is "Twitter"
s1=Span(doc,0,1,label="ORG")
s2=Span(doc,5,6,label="ORG")
doc.set_ents([s1,s2],default="unmodified")