<a href="https://colab.research.google.com/github/Abhilitcode/NLP_Practical/blob/main/POS_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

SpaCy is a powerful and popular open-source Natural Language Processing (NLP) library in Python. It is designed specifically for production-level work and real-world use cases, making it widely used for building NLP pipelines. Here's why it's used and what it offers:

### Key Features of SpaCy:
1. **Fast and Efficient**: SpaCy is implemented in Cython and optimized for performance, making it faster than many other NLP libraries.
2. **Pre-trained Models**: It provides pre-trained models for many languages, allowing you to easily work with tasks like part-of-speech tagging, named entity recognition, and dependency parsing.
3. **Tokenization**: It breaks down text into smaller units (tokens), which is a fundamental step in NLP.
4. **POS Tagging**: SpaCy assigns parts of speech (nouns, verbs, adjectives, etc.) to each token.
5. **Named Entity Recognition (NER)**: It identifies entities like people, organizations, dates, and more from text.
6. **Dependency Parsing**: It analyzes the grammatical structure of a sentence, showing relationships between words.
7. **Custom Pipelines**: SpaCy allows you to build custom NLP pipelines by adding your own components for specific use cases.
8. **Word Vectors**: It supports word vectors for semantic similarity comparisons and other tasks involving the meaning of words.
9. **Integration with Deep Learning**: SpaCy is designed to work seamlessly with deep learning frameworks like PyTorch and TensorFlow for advanced use cases.

### Common Use Cases of SpaCy:
1. **Text Preprocessing**: Tokenizing and lemmatizing text for further analysis (SpaCy provides lemmatization rather than stemming).
2. **Named Entity Recognition (NER)**: Extracting information like names, organizations, locations, etc., from text.
3. **Dependency Parsing**: Understanding the syntactic structure of sentences, which is useful for tasks like question answering or relationship extraction.
4. **Text Classification**: Assigning categories or labels to text (e.g., sentiment analysis, topic classification).
5. **Machine Translation**: Serving as a preprocessing tool for translating text from one language to another.
6. **Text Summarization**: Reducing text length while preserving important information.
7. **Question Answering**: Providing relevant answers from a set of documents.
8. **Information Extraction**: Extracting structured information from unstructured text.
   
### Why SpaCy Is Preferred:
- **Ease of Use**: It has a simple and intuitive API for handling common NLP tasks.
- **Production-Ready**: SpaCy is designed with production in mind, making it easy to scale for real-world applications.
- **Extensive Documentation**: There is strong support and excellent documentation, making it beginner-friendly while offering advanced features.
- **Multilingual Support**: It supports multiple languages out of the box, which is useful for global projects.

In summary, SpaCy is used for building robust NLP pipelines, handling text processing tasks with high performance and scalability.

In [20]:
!pip install spacy
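
Installing the `spacy` package does not by itself install a trained pipeline such as `en_core_web_sm`. Hosted notebooks may already include it, but a local environment may not. The following is a minimal sketch for checking, assuming spaCy v3; `MODEL_NAME` and `model_installed` are illustrative names, not part of the original notebook.

```python
import spacy

MODEL_NAME = "en_core_web_sm"  # the pipeline this notebook loads later

# is_package() reports whether a pip-installed package with this name exists.
model_installed = spacy.util.is_package(MODEL_NAME)

if model_installed:
    print(f"{MODEL_NAME} is installed")
else:
    # In a notebook cell this would be: !python -m spacy download en_core_web_sm
    print(f"{MODEL_NAME} is missing; run: python -m spacy download {MODEL_NAME}")
```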



In [21]:
import spacy

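Once `spacy` is imported, a trained pipeline is loaded with `spacy.load()`. Assign the result to a separate name such as `nlp`; assigning it back to `spacy` would shadow the module and break later calls like `spacy.explain()`.

```python
nlp = spacy.load("en_core_web_sm")
```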

### What It Means:

- **`spacy.load()`**: This function is used to load a pre-trained SpaCy model. Models in SpaCy contain pipelines for NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, and more. You load the model to use its pipeline on text data.

- **`"en_core_web_sm"`**: This is the name of one of SpaCy’s pre-trained models. Let's break it down:
  - **`en`**: Indicates the language of the model (in this case, English).
  - **`core`**: Denotes that this is a general-purpose, core model designed for common NLP tasks.
  - **`web`**: Refers to the data source used to train the model, which in this case includes web texts.
  - **`sm`**: Indicates the model's size (small). SpaCy offers different model sizes, such as:
    - **`sm`** (small): Fastest to load and run; ships without static word vectors, which is fine for lightweight tasks.
    - **`md`** (medium): Adds a reduced table of static word vectors.
    - **`lg`** (large): Adds a much larger word vector table, which helps with similarity and other vector-based tasks.
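
The naming breakdown above can be sketched in a few lines of plain Python. `split_model_name` is a hypothetical helper written for this illustration (it is not a spaCy function), and it assumes the name follows the four-part `lang_type_genre_size` pattern described above.

```python
def split_model_name(name: str) -> dict:
    """Split a name like 'en_core_web_sm' into its four parts (hypothetical helper)."""
    lang, kind, genre, size = name.split("_", 3)
    return {"lang": lang, "type": kind, "genre": genre, "size": size}


parts = split_model_name("en_core_web_sm")
print(parts)
```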
    
### What It Does:

- When you load the `"en_core_web_sm"` model, SpaCy initializes a pipeline that can process English text for various tasks like tokenization, part-of-speech (POS) tagging, named entity recognition (NER), and more.

### Example Usage:
After loading the model, you can use it to process text:

```python
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process a text string
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

# Extract tokens, POS tags, entities, etc.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entity recognition (NER)
for ent in doc.ents:
    print(ent.text, ent.label_)
```

### Why Use `en_core_web_sm`?
This small model is often used for tasks that require:
- **Quick processing**: Since it’s small, it’s faster to load and use.
- **Basic NLP tasks**: If you don’t need highly detailed word vectors or more complex parsing, this model works well.

In more complex applications where accuracy is critical, you might opt for a larger model (e.g., `en_core_web_md` or `en_core_web_lg`).


In [22]:
nlp = spacy.load('en_core_web_sm')



In [23]:
doc = nlp("My name is Abhishek and i will purchase a book today")

`.text` is one of the attributes of the `Doc` object that `nlp` returns. It holds the original text as a plain string.

In [24]:
doc.text

'My name is Abhishek and i will purchase a book today'

doc.text[0]
What it is: This accesses the first character of the entire processed document's text.

Explanation: doc.text is a string representation of the entire text in the doc object. By using [0], you're accessing the first character (not the first token) of that string.

In [25]:
doc.text[0]

'M'

doc[0]
What it is: This accesses the first token in the doc object.

Explanation: doc is a sequence of tokens, and using [0] gives you the first token, which is an object that contains much more information (like text, part-of-speech tags, etc.) than just the character.

In [26]:
doc[0]

My
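
The difference between indexing the string and indexing the `Doc` can be checked side by side. This is a small sketch that uses `spacy.blank("en")`, a tokenizer-only pipeline, so it runs without a downloaded model; the names `nlp_blank` and `doc_demo` are illustrative and chosen so they do not overwrite the notebook's `nlp` and `doc`.

```python
import spacy

# A blank English pipeline has only a tokenizer, so no model download is needed.
nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("My name is Abhishek and i will purchase a book today")

first_char = doc_demo.text[0]   # indexing the string gives a 1-character str
first_token = doc_demo[0]       # indexing the Doc gives a Token object

print(repr(first_char), type(first_char).__name__)
print(repr(first_token.text), type(first_token).__name__)
print(len(doc_demo.text), "characters vs", len(doc_demo), "tokens")
```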

In SpaCy, `pos_` and `tag_` are attributes of tokens that provide information about the grammatical roles of the words in a sentence. These attributes are assigned during the part-of-speech tagging process, which is an important step in Natural Language Processing (NLP).

### 1. **`pos_`** (Part-of-Speech)
- **Description**: `pos_` returns the coarse or universal part-of-speech tag for a token. These are general categories of words, like **noun**, **verb**, **adjective**, etc.
- **Examples of `pos_` values**:
  - `NOUN`: Noun (person, place, thing)
  - `VERB`: Verb (action word)
  - `ADJ`: Adjective (describes a noun)
  - `ADV`: Adverb (modifies a verb, adjective, or other adverbs)
  - `PRON`: Pronoun (replaces a noun)
  - `PROPN`: Proper noun (specific name, like "Apple")

  **Example Usage**:
  ```python
  doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
  for token in doc:
      print(token.text, token.pos_)
  ```

  **Output** (for each word):
  ```
  Apple PROPN
  is AUX
  looking VERB
  at ADP
  buying VERB
  a DET
  U.K. PROPN
  startup NOUN
  for ADP
  $ SYM
  1 NUM
  billion NUM
  ```

### 2. **`tag_`** (Detailed Part-of-Speech Tag)
- **Description**: `tag_` provides a more fine-grained part-of-speech tag whose label set depends on the language and on the data the model was trained on. For the English models these are Penn Treebank-style tags, which encode details such as verb tense and noun number.
- **Examples of `tag_` values** (for English):
  - `NN`: Noun, singular
  - `NNS`: Noun, plural
  - `VB`: Verb, base form
  - `VBD`: Verb, past tense
  - `JJ`: Adjective
  - `RB`: Adverb

  **Example Usage**:
  ```python
  doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
  for token in doc:
      print(token.text, token.tag_)
  ```

  **Output** (for each word):
  ```
  Apple NNP
  is VBZ
  looking VBG
  at IN
  buying VBG
  a DT
  U.K. NNP
  startup NN
  for IN
  $ $
  1 CD
  billion CD
  ```

### Difference Between `pos_` and `tag_`:
- **`pos_`**: Coarse, general category of the word (universal POS tags). It tells you **what kind of word** it is.
- **`tag_`**: Fine-grained, detailed tag based on the specific grammar of the language. It tells you **additional grammatical information** about the word (e.g., tense, number).

### Example for Both `pos_` and `tag_`:
```python
doc = nlp("She was reading a book.")
for token in doc:
    print(token.text, token.pos_, token.tag_)
```

**Output**:
```
She PRON PRP
was AUX VBD
reading VERB VBG
a DET DT
book NOUN NN
. PUNCT .
```

- In this case, `pos_` indicates general categories (PRON for pronoun, AUX for auxiliary verb), while `tag_` gives more specific details (`PRP` for personal pronoun, `VBD` for past tense verb, etc.).

These attributes allow you to extract and analyze grammatical information from text for various NLP tasks.

In [27]:
doc[0].pos_

'PRON'

The attribute `.text` holds the entire text as a string, so iterating over it yields individual characters, not tokens (words). To get tokens, iterate over the `doc` itself.

In [28]:
for token in doc:
  print(token.text, "-------->",token.pos_, token.tag_, spacy.explain(token.tag_))

My --------> PRON PRP$ pronoun, possessive
name --------> NOUN NN noun, singular or mass
is --------> AUX VBZ verb, 3rd person singular present
Abhishek --------> PROPN NNP noun, proper singular
and --------> CCONJ CC conjunction, coordinating
i --------> PRON PRP pronoun, personal
will --------> AUX MD verb, modal auxiliary
purchase --------> VERB VB verb, base form
a --------> DET DT determiner
book --------> NOUN NN noun, singular or mass
today --------> NOUN NN noun, singular or mass


The output above shows the `pos_` (coarse part-of-speech tag) and `tag_` (fine-grained part-of-speech tag) for each word in the sentence, followed by the `spacy.explain()` description of the fine-grained tag.

### Explanation of the output:

- **My PRON PRP$**:
  - `PRON`: Pronoun (a coarse tag).
  - `PRP$`: Possessive pronoun (a more specific tag indicating possession, e.g., "my," "your").
  
- **name NOUN NN**:
  - `NOUN`: Noun (general part-of-speech tag).
  - `NN`: Singular noun (fine-grained tag indicating singular form).

- **is AUX VBZ**:
  - `AUX`: Auxiliary verb (helper verb, used with a main verb).
  - `VBZ`: Present tense verb, 3rd person singular (e.g., "is," "does").

- **Abhishek PROPN NNP**:
  - `PROPN`: Proper noun (specific name).
  - `NNP`: Proper noun, singular.

- **and CCONJ CC**:
  - `CCONJ`: Coordinating conjunction (joins two elements of equal importance, like "and," "or").
  - `CC`: The specific tag for coordinating conjunction.

- **i PRON PRP**:
  - `PRON`: Pronoun.
  - `PRP`: Personal pronoun (e.g., "I," "he," "she").

- **will AUX MD**:
  - `AUX`: Auxiliary verb.
  - `MD`: Modal verb (e.g., "will," "can," "shall").

- **purchase VERB VB**:
  - `VERB`: Verb (general action word).
  - `VB`: Base form of a verb (e.g., "purchase," "eat").

- **a DET DT**:
  - `DET`: Determiner (used to specify a noun, e.g., "a," "the").
  - `DT`: Determiner.

- **book NOUN NN**:
  - `NOUN`: Noun.
  - `NN`: Singular noun.

- **today NOUN NN**:
  - `NOUN`: Noun.
  - `NN`: Singular noun (in this case, "today" is tagged as a noun, though it might often be considered an adverb depending on context).
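
To see how the coarse and fine-grained tags relate, the rows above can be grouped by `pos_`. The sketch below rebuilds the tagged sentence by hand from the notebook output (so it runs without a trained model) rather than calling the tagger; `tagged_doc` and `fine_by_coarse` are illustrative names.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc

# Words and tags copied from the output above; nothing is predicted here.
words = ["My", "name", "is", "Abhishek", "and", "i", "will", "purchase", "a", "book", "today"]
pos = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "AUX", "VERB", "DET", "NOUN", "NOUN"]
tags = ["PRP$", "NN", "VBZ", "NNP", "CC", "PRP", "MD", "VB", "DT", "NN", "NN"]

tagged_doc = Doc(spacy.blank("en").vocab, words=words, pos=pos, tags=tags)

# Collect every fine-grained tag seen under each coarse tag.
fine_by_coarse = defaultdict(set)
for token in tagged_doc:
    fine_by_coarse[token.pos_].add(token.tag_)

for coarse, fine in sorted(fine_by_coarse.items()):
    print(coarse, "->", sorted(fine))
```

One coarse tag can cover several fine-grained ones: here `PRON` covers both `PRP$` and `PRP`, and `AUX` covers both `VBZ` and `MD`.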


If you want an explanation of a tag or label, pass it to the function `spacy.explain()`, for example `spacy.explain('NN')`. Labels are case-sensitive, so write the tag exactly as SpaCy prints it.

In [29]:
spacy.explain('NN')

'noun, singular or mass'

In [30]:
spacy.explain('AUX')

'auxiliary'
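
`spacy.explain()` is a plain lookup in SpaCy's glossary, so it can be applied to several labels at once. The sketch below is illustrative; `NOT_A_TAG` is a made-up label included only to show that an unknown term returns `None` (newer versions may also emit a warning).

```python
import spacy

labels = ["PRP$", "MD", "nsubj", "NOT_A_TAG"]  # NOT_A_TAG is deliberately invalid

# Map each label to its glossary description (None when the label is unknown).
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:>10}: {meaning}")
```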

Concept of word sense disambiguation

In [32]:
sentence1 = nlp("I went to the bank to deposit some money.")
sentence2 = nlp("We sat by the river bank and had a picnic.")

# Print token and part of speech to give context
for token in sentence1:
    print(f"{token.text}: {token.pos_} - {token.dep_}")


I: PRON - nsubj
went: VERB - ROOT
to: ADP - prep
the: DET - det
bank: NOUN - pobj
to: PART - aux
deposit: VERB - advcl
some: DET - det
money: NOUN - dobj
.: PUNCT - punct


In [33]:
for token in sentence2:
    print(f"{token.text}: {token.pos_} - {token.dep_}")

We: PRON - nsubj
sat: VERB - ROOT
by: ADP - prep
the: DET - det
river: PROPN - compound
bank: NOUN - pobj
and: CCONJ - cc
had: VERB - conj
a: DET - det
picnic: NOUN - dobj
.: PUNCT - punct


In [34]:
sentence3 = nlp("I left the room")
sentence4 = nlp("to the left of the room")

# Print token and part of speech to give context
for token in sentence3:
    print(f"{token.text}: {token.pos_} - {token.dep_}")

I: PRON - nsubj
left: VERB - ROOT
the: DET - det
room: NOUN - dobj


In [35]:
# Print token and part of speech to give context
for token in sentence4:
    print(f"{token.text}: {token.pos_} - {token.dep_}")

to: ADP - ROOT
the: DET - det
left: NOUN - pobj
of: ADP - prep
the: DET - det
room: NOUN - pobj


Word Sense Disambiguation (WSD) is the task of determining the correct meaning (or sense) of a word based on its context. Many words have multiple meanings, and their sense depends on how they are used in a sentence.

For example, consider the word "bank", which can mean:

- A financial institution.
- The side of a river.

In Sentence 1, "bank" is tagged as a noun (NOUN) and is the object of a preposition (pobj). The surrounding words ("deposit", "money") tell a reader that it refers to a financial institution.
In Sentence 2, "bank" gets the same tag and dependency label (NOUN, pobj), but it is modified by "river" (a compound relation), and the context points to the side of a river.

Note that SpaCy's tagger and parser do not choose a sense for "bank": its labels are identical in both sentences, so the sense has to be inferred from the neighbouring tokens. Part-of-speech tags only separate senses that differ grammatically, as with "left", which is a VERB in "I left the room" and a NOUN in "to the left of the room".
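
The comparison can be made concrete by looking up the coarse tag of the ambiguous word in each sentence. The sketch below rebuilds the four sentences with the tags printed in the cells above instead of running the tagger, so it needs no trained model; `tagged` and `pos_of` are hypothetical helpers written for this illustration.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab


def tagged(words, pos):
    """Build a Doc with hand-assigned coarse tags (copied from the outputs above)."""
    return Doc(vocab, words=words, pos=pos)


def pos_of(doc, word):
    """Return the coarse tag of every occurrence of `word` in `doc`."""
    return [t.pos_ for t in doc if t.text == word]


s1 = tagged("I went to the bank to deposit some money .".split(),
            ["PRON", "VERB", "ADP", "DET", "NOUN", "PART", "VERB", "DET", "NOUN", "PUNCT"])
s2 = tagged("We sat by the river bank and had a picnic .".split(),
            ["PRON", "VERB", "ADP", "DET", "PROPN", "NOUN", "CCONJ", "VERB", "DET", "NOUN", "PUNCT"])
s3 = tagged("I left the room".split(), ["PRON", "VERB", "DET", "NOUN"])
s4 = tagged("to the left of the room".split(), ["ADP", "DET", "NOUN", "ADP", "DET", "NOUN"])

bank_tags = (pos_of(s1, "bank"), pos_of(s2, "bank"))
left_tags = (pos_of(s3, "left"), pos_of(s4, "left"))
print("bank:", bank_tags)   # same tag in both sentences
print("left:", left_tags)   # tag differs between the two sentences
```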

### What is `.dep_`?
`.dep_` tells us how a word (or token) in a sentence is connected to other words. It shows the relationship between words, like whether one word is the subject, the object, or describes another word.

### Think of It Like This:
Imagine you have a group of friends talking about a trip:

"Sara (the subject) went (the action) to the beach (where she went)."

In this example:

- "Sara" is doing the action (the subject).
- "went" is the action (the verb).
- "to the beach" tells us where she went (this is a prepositional phrase).

### How `.dep_` Works:
- "Sara" would have a `.dep_` of "nsubj" (nominal subject: the one doing the action).
- "went" would have a `.dep_` of "ROOT" (it’s the main action).
- "to" would have a `.dep_` of "prep" (it’s a preposition that starts a phrase).
- "beach" would have a `.dep_` of "pobj" (it’s the object of the preposition).

### Summary:
- `.dep_` tells you how words in a sentence relate to each other.
- It helps you understand the sentence structure, like who is doing what and where.
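
The "Sara" example can be reproduced as a small sketch. To keep it runnable without a trained model, the parse is written out by hand: the `deps` follow the labels described above, while the `det` label for "the" and the `heads` indices are assumptions added for illustration. A real pipeline would predict these values.

```python
import spacy
from spacy.tokens import Doc

words = ["Sara", "went", "to", "the", "beach"]
heads = [1, 1, 1, 4, 2]  # absolute index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "prep", "det", "pobj"]

sara_doc = Doc(spacy.blank("en").vocab, words=words, heads=heads, deps=deps)

# Each token carries its relation (dep_) and the word it attaches to (head).
for token in sara_doc:
    print(token.text, token.dep_, token.head.text)

# Find the main action and who is doing it.
root = [t for t in sara_doc if t.dep_ == "ROOT"][0]
subject = [t for t in root.children if t.dep_ == "nsubj"][0]
print("root:", root.text, "| subject:", subject.text)
```

Pairing `token.dep_` with `token.head` is what turns a list of labels into a tree: the label says what kind of relation it is, and the head says which word it attaches to.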

Let's visualize the dependency parse and the named entities of a sentence.

**Displacy** is a visualization tool included with SpaCy that allows you to create visual representations of syntactic dependencies and named entities in text. It's particularly useful for understanding the structure of sentences or the relationships between words in a visually intuitive way.

```python
from spacy import displacy

# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")

# Process a sentence
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Render the dependency parse in a Jupyter notebook
displacy.render(doc, style='dep', jupyter=True)
```

#### Output
When you run this code in a Jupyter notebook, you'll see the sentence laid out left to right in its original word order, with labeled arcs drawn above it connecting each head word to its dependents. The root of the parse (usually the main verb, here "jumps") is the one word with no incoming arc.

### 3. Visualizing Named Entities

You can also visualize named entities in text. Here’s how to do that:

```python
import spacy
from spacy import displacy

# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")

# Process a sentence with named entities
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Render the named entities in a Jupyter notebook
displacy.render(doc, style='ent', jupyter=True)
```

#### Output
This will show you the entities (like "Apple," "U.K.," and "$1 billion") highlighted in different colors based on their type (e.g., organization, location, monetary value).
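
The entities that displaCy highlights are the spans in `doc.ents`, so they can also be inspected as plain data. The sketch below builds the entity annotations by hand instead of running the model; the labels `ORG`, `GPE`, and `MONEY` are an assumption about what the trained pipeline predicts for this sentence, and `ent_doc` is an illustrative name.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

words = ["Apple", "is", "looking", "at", "buying", "a", "U.K.", "startup",
         "for", "$", "1", "billion", "."]
# Whether each token is followed by a space, so ent.text matches the original text.
spaces = [True, True, True, True, True, True, True, True,
          True, False, True, False, False]
# Token-level IOB tags (assumed labels, see the note above).
ents = ["B-ORG", "O", "O", "O", "O", "O", "B-GPE", "O",
        "O", "B-MONEY", "I-MONEY", "I-MONEY", "O"]

ent_doc = Doc(spacy.blank("en").vocab, words=words, spaces=spaces, ents=ents)

found = [(ent.text, ent.label_) for ent in ent_doc.ents]
print(found)

# With jupyter=False, render() returns the HTML markup instead of displaying it.
html = displacy.render(ent_doc, style="ent", jupyter=False)
```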

### 4. Customizing Visualization

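You can customize the appearance of the visualization by passing an `options` dictionary to `displacy.render`. For the dependency view this includes settings such as `compact`, `bg` (background color), `color` (text color), and `distance` (spacing between words).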

### Summary

- **Displacy** is a great tool for visualizing the structure of sentences and the relationships between words (dependency parsing) or identifying important entities (named entity recognition).
- It's especially useful for understanding complex sentences and for educational purposes.

In [36]:
from spacy import displacy

In [39]:
doc6 = nlp("I like Apple as an organization, it is located in U.K and it earns more than $1 billion")

In [40]:
displacy.render(doc6, style='ent', jupyter=True)

In [41]:
displacy.render(doc6, style='dep', jupyter=True)

In [46]:
options = {
    "compact": True,  # Makes the visualization more compact
    "bg": "#80ced6",  # Background color
    "color": "blue",  # Color of the arrows and text
    "distance": 80    # Distance between words
}

In [47]:
displacy.render(doc6, style='dep', jupyter=True, options=options)
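
`jupyter=True` only displays output inside a notebook. In a script, passing `jupyter=False` makes `displacy.render` return the markup as a string, which can then be saved. The sketch below uses a short hand-written parse (the `heads` and `deps` values are assumptions for illustration, not model output) together with the same options as the cell above.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

fox_words = ["The", "quick", "brown", "fox", "jumps"]
fox_heads = [3, 3, 3, 4, 4]  # absolute head indices; "jumps" is the root
fox_deps = ["det", "amod", "amod", "nsubj", "ROOT"]

fox_doc = Doc(spacy.blank("en").vocab, words=fox_words, heads=fox_heads, deps=fox_deps)

render_options = {"compact": True, "bg": "#80ced6", "color": "blue", "distance": 80}

# Returns SVG markup as a string rather than displaying it.
svg = displacy.render(fox_doc, style="dep", jupyter=False, options=render_options)
print(svg[:60])

# To keep the picture, write the string to a file, for example:
# pathlib.Path("parse.svg").write_text(svg, encoding="utf-8")
```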