# Start by loading the model and processing a text as a Doc object

In [1]:
import spacy

# Load a spaCy model
nlp = spacy.load("en_core_web_sm")  # small English pipeline

# Process a text to create a Doc object
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Different Components of a Doc

#### 1   token.text: The original word text.
#### 2   token.lemma_: The base form of the word.
#### 3   token.pos_: Part of Speech (POS) tag.
#### 4   token.dep_: Syntactic dependency. 

### 1 POS (Part of Speech)

##### Part-of-Speech (POS) refers to the grammatical categories, or word classes, that classify words by their syntactic role and function in a sentence. In Natural Language Processing (NLP), POS tagging is the process of assigning one of these categories to each token in a text.






ADJ (Adjective): Describes a noun, providing more information about it.
Example: "beautiful", "quick"

ADP (Adposition): Relates a noun to another word, often indicating direction, place, or time (includes prepositions and postpositions).
Example: "in", "on", "by"


ADV (Adverb): Modifies a verb, adjective, or another adverb, often describing how, when, or where something happens.
Example: "quickly", "very"


AUX (Auxiliary Verb): Helps the main verb by expressing tense, mood, or voice.
Example: "is", "have", "will"

CCONJ (Coordinating Conjunction): Connects words, phrases, or clauses that are of equal syntactic importance.
Example: "and", "but", "or"
    
DET (Determiner): Introduces a noun, specifying its definiteness, quantity, or possession.
Example: "the", "a", "some"

INTJ (Interjection): Expresses emotion or a reaction, often standing alone.
Example: "wow", "ouch"
    
NOUN (Noun): Refers to a person, place, thing, or idea.
Example: "dog", "computer"

NUM (Numeral): Represents a number or numerical value.
Example: "one", "two", "3"

PART (Particle): A small function word that has a grammatical role but doesn’t belong to the main word classes (e.g., adverbial particles in phrasal verbs).
Example: "not", "to" (as in "to go")

PRON (Pronoun): Replaces a noun in a sentence.
Example: "he", "she", "they"

PROPN (Proper Noun): Refers to specific names of people, places, organizations, etc., usually capitalized.
Example: "John", "London", "Microsoft"
    
PUNCT (Punctuation): Any punctuation mark that contributes to the structure and meaning of a text.
Example: ".", ",", "?"
    
SCONJ (Subordinating Conjunction): Connects a subordinate clause to a main clause, often introducing a dependent idea.
Example: "because", "although", "if"

SYM (Symbol): Non-alphabetic characters that represent concepts, often mathematical or currency symbols.
Example: "$", "%", "+"

VERB (Verb): Describes an action, state, or occurrence.
Example: "run", "eat", "write"

X (Other): Used for words that cannot be assigned any other category, such as unanalyzable strings, typos, or foreign words in otherwise English text.
Example: "xfgh", "jklw"

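SPACE (Space): A spaCy-specific tag for tokens that consist only of whitespace, such as an extra consecutive space or a line break. An ordinary single space between two words does not become a token.
Example: "\n" (a line break), the extra space in "a  b"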
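
The tag descriptions do not have to be memorized: `spacy.explain` looks up a short description for a label. This is a minimal sketch that needs only the spaCy package, not a trained model; the `pos_tags` selection is an arbitrary subset of the list above.

```python
import spacy

# A few of the Universal POS tags described in the list above
pos_tags = ["ADJ", "ADP", "AUX", "CCONJ", "NOUN", "PROPN", "SCONJ", "SYM"]

# spacy.explain returns a short description, or None for unknown labels
descriptions = {tag: spacy.explain(tag) for tag in pos_tags}
for tag, description in descriptions.items():
    print(f"{tag:<6} {description}")
```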

In [2]:
for token in doc:
    print(token.pos_)

PROPN
AUX
VERB
ADP
VERB
PROPN
NOUN
ADP
SYM
NUM
NUM
PUNCT


### 2 Lemmatization (Convert Each Word to Its Base Form)

In [3]:
for token in doc:
    print(token.lemma_)

Apple
be
look
at
buy
U.K.
startup
for
$
1
billion
.


### 3 Explanation of token.dep_
#### dep_: This attribute returns the syntactic dependency label of a token as a string. It describes the role of the token in relation to its head token (the word it is syntactically connected to).

In [4]:
for token in doc:
    print(token.dep_)

nsubj
aux
ROOT
prep
pcomp
dobj
dep
prep
quantmod
compound
pobj
punct
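
A dependency label is only half of the relation: the other half is the head token, available as `token.head`. The sketch below assumes spaCy v3, where a `Doc` can be built by hand with absolute head indices, so it runs without a trained model. The toy sentence and its annotations are hand-written for illustration and are not model output; the same attributes apply to a `Doc` produced by `nlp(...)`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated toy sentence (illustrative, not model output).
# heads holds the absolute index of each token's head; the root points to itself.
words = ["She", "eats", "apples"]
heads = [1, 1, 1]
deps = ["nsubj", "ROOT", "dobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# Show each token, its dependency label, and the head it attaches to
for token in doc:
    print(f"{token.text:<8}{token.dep_:<8}head={token.head.text}")

# The root is the token whose head is itself
root = [token for token in doc if token.head.i == token.i][0]
print("Root:", root.text)
```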


### 4 Original Text

In [6]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.


# Named Entity Recognition (NER)

Named Entity Recognition (NER) is a process in Natural Language Processing (NLP) that identifies and classifies key information (entities) in text into predefined categories such as names of people, organizations, locations, dates, and more. Here's a more detailed explanation with clear examples.

Common Entity Types in NER:

0 PERSON: Names of people.

1 ORG: Organizations such as companies, institutions, government agencies.

2 GPE: Geopolitical entities like countries, cities, states.

3 LOC: Non-GPE locations, like mountains, rivers, regions.

4 DATE: Dates, including days, months, years.

5 TIME: Times, such as "2:00 PM".

6 MONEY: Monetary values.

7 PERCENT: Percentage values.

8 FAC: Buildings, airports, highways, bridges, etc.

9 PRODUCT: Objects, vehicles, foods, etc. (Not services.)

### Why NER is Important:


Information Extraction: Automatically extracting important information from large volumes of text.

Data Structuring: Structuring unstructured data into a more usable format for further analysis.
                                                                         
Search Optimization: Enhancing search engines to retrieve information more accurately.
                                                                         

### Example 1 Basic NER on a Sentence

In [12]:
doc1 = nlp("i work at Apple")
for ent in doc1.ents:
    print(ent.label_)

ORG


### Example 2 NER with Entity Text and Labels

In [46]:
# Process a sentence
doc3 = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Print the entities found in the sentence
for ent in doc3.ents:
    print(ent.text, ent.label_)


Apple ORG
U.K. GPE
$1 billion MONEY


### Example 3 NER with Detailed Information

In [24]:
# Process another sentence
doc = nlp("Elon Musk was born on June 28, 1971, in Pretoria, South Africa.")

# Print detailed information about the entities
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}, Start: {ent.start_char}, End: {ent.end_char}")


Entity: Elon Musk, Label: PERSON, Start: 0, End: 9
Entity: June 28, 1971, Label: DATE, Start: 22, End: 35
Entity: Pretoria, Label: GPE, Start: 40, End: 48
Entity: South Africa, Label: GPE, Start: 50, End: 62
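
`start_char` and `end_char` are character offsets into the original string, with the end offset exclusive. The sketch below uses plain Python and the offsets copied from the output above to show that slicing the text recovers each entity; the `spans` list is typed in by hand for illustration.

```python
text = "Elon Musk was born on June 28, 1971, in Pretoria, South Africa."

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 9, "PERSON"), (22, 35, "DATE"), (40, 48, "GPE"), (50, 62, "GPE")]

# end_char is exclusive, so text[start:end] yields exactly the entity text
recovered = [(text[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(f"{label:<8}{entity_text}")
```

This is what makes the offsets convenient for highlighting or storing annotations alongside the raw text.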


### Example 4 Visualizing Named Entities with displacy

In [30]:
from spacy import displacy

# Custom text
text = "Amazon is headquartered in Seattle, Washington, and was founded by Jeff Bezos in 1994."
doc = nlp(text)

# Render the named entities
displacy.render(doc, style="ent", jupyter=True)
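
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. This is a minimal sketch, assuming spaCy v3. To stay independent of a trained model it uses a blank pipeline and two hand-annotated entities, which are illustrative stand-ins for the model's predictions.

```python
import spacy
from spacy import displacy

# Blank pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
text = "Amazon is headquartered in Seattle, Washington, and was founded by Jeff Bezos in 1994."
doc = nlp(text)

# Hand-annotated entities (illustrative stand-ins for model predictions)
start = text.index("Seattle")
doc.ents = [
    doc.char_span(0, len("Amazon"), label="ORG"),
    doc.char_span(start, start + len("Seattle"), label="GPE"),
]

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
print(html)
```

The returned string can be written to an `.html` file or embedded in a web page.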


# Sentence Segmentation

In [32]:
doc = nlp("Google plans to open a new office in Tokyo by April 2023. Tesla unveiled the new Model S Plaid at the Fremont factory.")
for sent in doc.sents:
    print(sent.text)


Google plans to open a new office in Tokyo by April 2023.
Tesla unveiled the new Model S Plaid at the Fremont factory.
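
With a trained pipeline such as `en_core_web_sm`, `doc.sents` is typically derived from the dependency parse. When only sentence boundaries are needed, spaCy's rule-based `sentencizer` component is a lighter alternative that splits on sentence-final punctuation. This is a minimal sketch, assuming spaCy v3 and no trained model.

```python
import spacy

# Blank English pipeline with only the rule-based sentencizer added
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp(
    "Google plans to open a new office in Tokyo by April 2023. "
    "Tesla unveiled the new Model S Plaid at the Fremont factory."
)

# Collect the text of each detected sentence
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The trade-off is that punctuation rules can missplit text where a period does not end a sentence, which a parser-based splitter handles more often.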


In [59]:
transactions = "Tony gave  $ 2  to Peter, Bruce gave $ 500 to Steve"
docs = nlp(transactions)

# Print every entity the model labels as MONEY
for ent in docs.ents:
    if ent.label_ == "MONEY":
        print(ent.text, ent.label_)


In [53]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# The text containing transactions
transactions = "Tony gave  two $ to Peter, Bruce gave 500 € to Steve"

# Process the text with the NLP model
doc = nlp(transactions)

# Set of currency symbols to look for
currency_symbols = {"$", "€", "£", "¥"}

# Extract monetary values
for i, token in enumerate(doc):
    # Check if the token is a currency symbol
    if token.text in currency_symbols:
        # Look for the next token and previous token to find the amount
        if i + 1 < len(doc):
            next_token = doc[i + 1]
            if next_token.like_num or next_token.pos_ == "NUM":
                print(f"Monetary value: {token.text} {next_token.text}")
        if i - 1 >= 0:
            prev_token = doc[i - 1]
            if prev_token.like_num or prev_token.pos_ == "NUM":
                print(f"Monetary value: {prev_token.text} {token.text}")

# Also print entities labeled as MONEY
for ent in doc.ents:
    if ent.label_ == "MONEY":
        print(f"Monetary value: {ent.text}")


Monetary value: two $
Monetary value: 500 €
