# spaCy Functions Explained

## Introduction to spaCy
spaCy is an open-source library for advanced natural language processing in Python. It is designed for efficiency and ease of use, providing pre-trained models for various languages and supporting many NLP tasks.

## 1. Tokenization
Tokenization in spaCy breaks text into tokens: words, punctuation marks, and similar units. Sentence segmentation then groups those tokens into sentences.

### **1.1 Word Tokenization**
- **Function**: `nlp()`
- **Syntax**: `nlp(text)`
- **Description**: Processes a given text and breaks it down into tokens.
- **Example**:
    ```python
    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = "Natural language processing with spaCy is efficient!"
    doc = nlp(text)
    words = [token.text for token in doc]
    print(words)  # Output: ['Natural', 'language', 'processing', 'with', 'spaCy', 'is', 'efficient', '!']
    ```
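
Tokenization itself is rule-based and does not need a trained pipeline. As a minimal sketch, assuming spaCy v3 is installed, `spacy.blank("en")` (a tokenizer-only English pipeline, used here instead of `en_core_web_sm`) is enough to split text and inspect lexical flags such as `is_punct`:

```python
import spacy

# Assumption: a blank English pipeline, which contains only the tokenizer
blank_nlp = spacy.blank("en")
doc = blank_nlp("Natural language processing with spaCy is efficient!")

words = [token.text for token in doc]
# Lexical flags such as is_punct are available without trained components
punctuation = [token.text for token in doc if token.is_punct]

print(words)
print(punctuation)
```

The variable name `blank_nlp` is only there to keep this pipeline distinct from the `nlp` object loaded above.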

### **1.2 Sentence Tokenization**
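- **Function**: `doc.sents`
- **Syntax**: `nlp(text).sents`
- **Description**: Processes the text and exposes its sentences as spans through `doc.sents`. Sentence boundaries are set by a pipeline component such as the dependency parser.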
- **Example**:
    ```python
    doc = nlp("Hello world. How are you?")
    sentences = [sent.text for sent in doc.sents]
    print(sentences)  # Output: ['Hello world.', 'How are you?']
    ```
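
`doc.sents` is only populated when the pipeline contains a component that sets sentence boundaries, such as the dependency parser in `en_core_web_sm`. A lighter alternative, sketched here under the assumption that punctuation-based splitting is good enough, is the built-in `sentencizer` component on a blank pipeline:

```python
import spacy

# Assumption: a blank pipeline plus the rule-based sentencizer, no trained model
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

doc = blank_nlp("Hello world. How are you?")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```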

## 2. Stop Words Removal
spaCy includes a built-in list of stop words that can be easily filtered out.

- **Function**: `is_stop`
- **Syntax**: `token.is_stop`
- **Description**: Checks if a token is a stop word.
- **Example**:
    ```python
    doc = nlp("The cat sat on the mat.")
    stop_words = [token.text for token in doc if token.is_stop]
    print(stop_words)  # Output: ['The', 'on', 'the']
    ```
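
The example above lists the stop words; to actually remove them, invert the condition. A minimal sketch, using a blank English pipeline (an assumption made to keep the snippet independent of a downloaded model), that also drops punctuation:

```python
import spacy

blank_nlp = spacy.blank("en")
doc = blank_nlp("The cat sat on the mat.")

# Keep only tokens that are neither stop words nor punctuation
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```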

## 3. Stemming and Lemmatization
spaCy does not include a stemmer; it provides lemmatization instead.

- **Function**: `lemma_`
- **Syntax**: `token.lemma_`
- **Description**: Provides the lemma (base form) of a word.
- **Example**:
    ```python
    doc = nlp("better")
    lemmatized_word = [token.lemma_ for token in doc]
    print(lemmatized_word)  # Typically ['good'] when tagged as an adjective; the result depends on the model and context
    ```

## 4. Part-of-Speech Tagging (POS)
spaCy performs part-of-speech tagging to identify the grammatical categories of tokens.

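- **Function**: `pos_`, `tag_`
- **Syntax**: `token.pos_`, `token.tag_`
- **Description**: `token.pos_` gives the coarse-grained Universal POS tag (e.g. `PROPN`); `token.tag_` gives the fine-grained, model-specific tag (e.g. `NNP`).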
- **Example**:
    ```python
    doc = nlp("Python is great")
    pos_tags = [(token.text, token.pos_) for token in doc]
    print(pos_tags)  # Output: [('Python', 'PROPN'), ('is', 'AUX'), ('great', 'ADJ')]
    ```
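
Tag names such as `PROPN` or `AUX` can be opaque. `spacy.explain()` looks up a short description for a tag or label in spaCy's glossary; the sketch below applies it to the tags from the example output above.

```python
import spacy

tags = ["PROPN", "AUX", "ADJ"]
# spacy.explain returns a human-readable description, or None for unknown labels
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag}: {description}")
```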

## 5. Named Entity Recognition (NER)
spaCy has robust named entity recognition capabilities.

- **Function**: `ents`
- **Syntax**: `doc.ents`
- **Description**: Extracts named entities from the processed document.
- **Example**:
    ```python
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(entities)  # Output: [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]
    ```
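
Each item in `doc.ents` is a `Span`, so it carries character offsets alongside its text and label. The sketch below sets one entity by hand with `doc.char_span()` on a blank pipeline, purely to show those attributes without depending on a trained model's predictions; in normal use the `ner` component fills `doc.ents`.

```python
import spacy

blank_nlp = spacy.blank("en")
doc = blank_nlp("Apple is looking at buying U.K. startup for $1 billion")

# Assumption for illustration: mark characters 0-5 ("Apple") as an ORG by hand
apple = doc.char_span(0, 5, label="ORG")
doc.ents = [apple]

entity_info = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
print(entity_info)
```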

## 6. Dependency Parsing
Dependency parsing identifies the grammatical structure of a sentence by establishing relationships between words.

- **Function**: `dep_`
- **Syntax**: `token.dep_`
- **Description**: Provides the syntactic dependency relation of a token.
- **Example**:
    ```python
    doc = nlp("The quick brown fox jumps over the lazy dog.")
    dependencies = [(token.text, token.dep_) for token in doc]
    print(dependencies)  # Output: [('The', 'det'), ('quick', 'amod'), ('brown', 'amod'), ('fox', 'nsubj'), ('jumps', 'ROOT'), ...]
    ```
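
Each dependency label links a token to its syntactic head, available as `token.head`, and a token's dependents are available as `token.children`. To keep the sketch deterministic, the parse below is written by hand through the `Doc` constructor (the `heads` and `deps` values are assumptions for illustration, not model output, and assume spaCy v3); with a trained pipeline the parser fills them in.

```python
import spacy
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")
words = ["The", "fox", "jumps"]
heads = [1, 2, 2]  # absolute index of each token's head; the root points to itself
deps = ["det", "nsubj", "ROOT"]
doc = Doc(blank_nlp.vocab, words=words, heads=heads, deps=deps)

# (token, relation, head) triples and the direct dependents of the root
arcs = [(t.text, t.dep_, t.head.text) for t in doc]
root_children = [child.text for child in doc[2].children]

print(arcs)
print(root_children)
```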

## 7. Text Classification
spaCy supports text classification through custom training.

- **Function**: `TextCategorizer`
- **Syntax**: `nlp.add_pipe("textcat", config={...})`
- **Description**: Allows for adding a text categorization pipeline.
- **Example**:
    ```python
    # spaCy v3: add the built-in text categorizer by its string name
    textcat = nlp.add_pipe("textcat")
    textcat.add_label("POSITIVE")
    textcat.add_label("NEGATIVE")
    # The component must be trained before its predictions are meaningful
    ```
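
Training is what makes the component useful. The following is a minimal sketch of a training loop, assuming spaCy v3 and a blank English pipeline; the two labels, the four toy examples, and the number of iterations are made up for illustration and far too small for a real model.

```python
import spacy
from spacy.training import Example

blank_nlp = spacy.blank("en")
textcat = blank_nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Hypothetical toy data: (text, annotations) pairs
train_data = [
    ("I love this library", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This works really well", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hate this bug", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("This is terrible", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(blank_nlp.make_doc(text), ann) for text, ann in train_data]

# Initialize the model weights, then run a few update passes
optimizer = blank_nlp.initialize(lambda: examples)
for _ in range(10):
    losses = {}
    blank_nlp.update(examples, sgd=optimizer, losses=losses)

doc = blank_nlp("I love this")
print(doc.cats)  # one score per label
```

For real projects, spaCy's config-driven `spacy train` command is the usual route; the loop above only shows the moving parts.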

## 8. Corpora Access
spaCy does not bundle corpora the way NLTK does. Instead, it processes text that you load yourself, for example from files or a dataset library.
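
In practice, integrating with a dataset means loading the texts yourself and streaming them through the pipeline. A minimal sketch, assuming the texts are already in a Python list (the `texts` list is made up for illustration) and using `nlp.pipe()` to process them in batches:

```python
import spacy

blank_nlp = spacy.blank("en")

# Hypothetical stand-in for texts loaded from files or a dataset library
texts = ["spaCy processes text.", "NLTK ships many corpora."]

# nlp.pipe yields one Doc per input text and is more efficient than calling nlp() in a loop
token_counts = [len(doc) for doc in blank_nlp.pipe(texts, batch_size=2)]
print(token_counts)
```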

## 9. Custom Pipeline Components
spaCy allows for custom components in the NLP pipeline.

- **Function**: `nlp.add_pipe()`
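- **Syntax**: `nlp.add_pipe("component_name")`
- **Description**: Adds a custom component to the processing pipeline. In spaCy v3, the function must first be registered with `@Language.component` and is then added by name.
- **Example**:
    ```python
    from spacy.language import Language

    @Language.component("custom_component")
    def custom_component(doc):
        print("Custom component processing text...")
        return doc

    nlp.add_pipe("custom_component", last=True)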
    doc = nlp("This will trigger the custom component.")
    ```
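
Placement in the pipeline can be controlled with `first`, `last`, `before`, or `after`. The sketch below assumes spaCy v3, where a function is registered with the `@Language.component` decorator and then added by its string name; `token_counter` is a hypothetical component that stores the token count in `doc.user_data`.

```python
import spacy
from spacy.language import Language

@Language.component("token_counter")
def token_counter(doc):
    # Hypothetical component: record how many tokens the Doc contains
    doc.user_data["n_tokens"] = len(doc)
    return doc

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
# Insert the custom component ahead of the sentencizer
blank_nlp.add_pipe("token_counter", before="sentencizer")

pipeline_order = blank_nlp.pipe_names
doc = blank_nlp("One. Two.")
print(pipeline_order)
print(doc.user_data["n_tokens"])
```

Checking `pipe_names` after each `add_pipe()` call is a quick way to confirm that the component landed where intended.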

## 10. Visualizing Dependency Parse
spaCy can visualize dependency parsing with its integrated visualization tools.

- **Function**: `displacy.render()`
- **Syntax**: `displacy.render(doc, style='dep')`
- **Description**: Renders the dependency tree of a document.
- **Example**:
    ```python
    from spacy import displacy

    doc = nlp("The quick brown fox jumps over the lazy dog.")
    displacy.render(doc, style='dep', jupyter=True)  # Displays in Jupyter Notebook
    ```
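
Outside a notebook, `displacy.render()` returns the markup as a string, which can be written to a file. A minimal sketch, with a hand-written parse (an assumption for illustration, requiring spaCy v3) so that it runs without a trained model; the output path is arbitrary:

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

blank_nlp = spacy.blank("en")
# Hand-written parse for illustration; a trained pipeline would predict heads and deps
doc = Doc(
    blank_nlp.vocab,
    words=["The", "fox", "jumps"],
    heads=[1, 2, 2],
    deps=["det", "nsubj", "ROOT"],
)

# jupyter=False forces render() to return the SVG markup instead of displaying it
svg = displacy.render(doc, style="dep", jupyter=False)

output_path = Path(tempfile.gettempdir()) / "dependency_parse.svg"
output_path.write_text(svg, encoding="utf-8")
print(f"Wrote {output_path}")
```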


## Table of spaCy Functions

| **Function**                       | **Syntax**                          | **Description**                                                        | **Example**                                                                                                                                                       |
|------------------------------------|-------------------------------------|------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Tokenization**                   | `nlp(text)`                        | Processes a given text and breaks it down into tokens.                | ```python<br>import spacy<br>nlp = spacy.load("en_core_web_sm")<br>text = "Natural language processing with spaCy is efficient!"<br>doc = nlp(text)<br>words = [token.text for token in doc]<br>print(words)  # Output: ['Natural', 'language', 'processing', 'with', 'spaCy', 'is', 'efficient', '!']``` |
| **Sentence Tokenization**          | `nlp(text)`                        | Segments the text into sentences.                                     | ```python<br>doc = nlp("Hello world. How are you?")<br>sentences = [sent.text for sent in doc.sents]<br>print(sentences)  # Output: ['Hello world.', 'How are you?']```                         |
| **Stop Words Removal**             | `token.is_stop`                   | Checks if a token is a stop word.                                     | ```python<br>doc = nlp("The cat sat on the mat.")<br>stop_words = [token.text for token in doc if token.is_stop]<br>print(stop_words)  # Output: ['The', 'on', 'the']```                            |
| **Lemmatization**                  | `token.lemma_`                    | Provides the lemma (base form) of a word.                             | ```python<br>doc = nlp("better")<br>lemmatized_word = [token.lemma_ for token in doc]<br>print(lemmatized_word)  # Typically ['good'] when tagged as an adjective; depends on model and context```                                                   |
| **Part-of-Speech Tagging (POS)**   | `token.pos_`, `token.tag_`        | Provides the coarse-grained (`pos_`) and fine-grained (`tag_`) part-of-speech tag for each token. | ```python<br>doc = nlp("Python is great")<br>pos_tags = [(token.text, token.pos_) for token in doc]<br>print(pos_tags)  # Output: [('Python', 'PROPN'), ('is', 'AUX'), ('great', 'ADJ')]```        |
| **Named Entity Recognition (NER)**  | `doc.ents`                        | Extracts named entities from the processed document.                  | ```python<br>doc = nlp("Apple is looking at buying U.K. startup for $1 billion")<br>entities = [(ent.text, ent.label_) for ent in doc.ents]<br>print(entities)  # Output: [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]``` |
| **Dependency Parsing**             | `token.dep_`                      | Provides the syntactic dependency relation of a token.                | ```python<br>doc = nlp("The quick brown fox jumps over the lazy dog.")<br>dependencies = [(token.text, token.dep_) for token in doc]<br>print(dependencies)  # Output: [('The', 'det'), ('quick', 'amod'), ('brown', 'amod'), ('fox', 'nsubj'), ('jumps', 'ROOT'), ...]``` |
| **Text Classification**            | `nlp.add_pipe("textcat", config={...})` | Adds a text categorization component to the pipeline; it must be trained before use. | ```python<br>textcat = nlp.add_pipe("textcat")<br>textcat.add_label("POSITIVE")<br>textcat.add_label("NEGATIVE")<br># Continue with model training as required``` |
| **Custom Pipeline Components**      | `nlp.add_pipe("component_name")` | Adds a registered custom component to the processing pipeline.        | ```python<br>from spacy.language import Language<br>@Language.component("custom_component")<br>def custom_component(doc):<br>    print("Custom component processing text...")<br>    return doc<br>nlp.add_pipe("custom_component", last=True)<br>doc = nlp("This will trigger the custom component.")``` |
| **Visualizing Dependency Parse**   | `displacy.render(doc, style='dep')` | Renders the dependency tree of a document.                            | ```python<br>from spacy import displacy<br>doc = nlp("The quick brown fox jumps over the lazy dog.")<br>displacy.render(doc, style='dep', jupyter=True)  # Displays in Jupyter Notebook``` |



# Comparison of spaCy and NLTK

## Overview
Both spaCy and NLTK are popular libraries for Natural Language Processing (NLP) in Python, but they have different design philosophies and functionalities.

| **Feature**                   | **spaCy**                                                 | **NLTK**                                                  |
|-------------------------------|----------------------------------------------------------|----------------------------------------------------------|
| **Purpose**                   | Designed for production and efficiency in NLP tasks.     | More of an educational tool with a wide variety of NLP tasks. |
| **Ease of Use**               | User-friendly API, straightforward and intuitive.       | Slightly steeper learning curve due to its complexity.    |
| **Speed**                     | Optimized for performance, faster in processing.         | Generally slower; offers more flexibility and options.     |
| **Pre-trained Models**        | Provides state-of-the-art pre-trained models for various languages. | Limited pre-trained models; focuses on educational resources. |
| **Tokenization**              | Advanced tokenization that accounts for various cases.   | Basic tokenization; can be less efficient for large texts.  |
| **Lemmatization/Stemming**    | Integrated lemmatization; no built-in stemmer.           | Both stemming and lemmatization options available.          |
| **Named Entity Recognition**   | Built-in NER with high accuracy.                         | NER functionality is available but less efficient.         |
| **Dependency Parsing**        | Powerful dependency parsing features included.           | Dependency parsing available but less intuitive.           |
| **Visualization**             | Offers visualization tools (e.g., displaCy) for NER and dependencies. | Visualization requires additional tools (e.g., Matplotlib). |
| **Community and Support**     | Growing community, well-documented with active support.  | Established community with extensive documentation and resources. |
| **Text Classification**       | Integrated text classification support with pipelines.   | Requires more setup for text classification tasks.         |
| **Corpora**                   | Limited built-in corpora; focuses on model training.     | Extensive collection of corpora and linguistic datasets.    |


# Comparison of spaCy and NLTK Syntax

## Overview
Both spaCy and NLTK offer unique syntaxes and functionalities for natural language processing, catering to different use cases. Below is a comparison highlighting their syntax for various NLP tasks.

| **Task**                        | **spaCy Syntax**                                       | **NLTK Syntax**                                           |
|---------------------------------|-------------------------------------------------------|----------------------------------------------------------|
| **Importing Library**           | ```python<br>import spacy<br>nlp = spacy.load("en_core_web_sm")``` | ```python<br>import nltk<br>nltk.download('punkt')```     |
| **Tokenization**                | ```python<br>doc = nlp("Text to tokenize")<br>tokens = [token.text for token in doc]``` | ```python<br>from nltk.tokenize import word_tokenize<br>tokens = word_tokenize("Text to tokenize")``` |
| **Sentence Tokenization**       | ```python<br>sentences = [sent.text for sent in doc.sents]``` | ```python<br>from nltk.tokenize import sent_tokenize<br>sentences = sent_tokenize("Text to tokenize.")``` |
| **Stop Words Removal**          | ```python<br>stop_words = [token.text for token in doc if token.is_stop]``` | ```python<br>from nltk.corpus import stopwords<br>stop_words = set(stopwords.words('english'))``` |
| **Lemmatization**               | ```python<br>lemmatized_word = [token.lemma_ for token in doc]``` | ```python<br>from nltk.stem import WordNetLemmatizer<br>lemmatizer = WordNetLemmatizer()<br>lemmatized_word = lemmatizer.lemmatize("better", pos='a')``` |
| **Stemming**                    | ```python<br># spaCy has no built-in stemmer; use the lemma instead<br>lemma = nlp("running")[0].lemma_``` | ```python<br>from nltk.stem import PorterStemmer<br>ps = PorterStemmer()<br>stemmed_word = ps.stem("running")``` |
| **Part-of-Speech Tagging (POS)**| ```python<br>pos_tags = [(token.text, token.pos_) for token in doc]``` | ```python<br>from nltk import pos_tag<br>pos_tags = pos_tag(tokens)``` |
| **Named Entity Recognition (NER)**| ```python<br>entities = [(ent.text, ent.label_) for ent in doc.ents]``` | ```python<br>from nltk import ne_chunk<br>ne_tree = ne_chunk(pos_tags)``` |
| **Dependency Parsing**          | ```python<br>dependencies = [(token.text, token.dep_) for token in doc]``` | ```python<br>from nltk.parse.corenlp import CoreNLPDependencyParser<br># Requires a running Stanford CoreNLP server<br>parser = CoreNLPDependencyParser()<br>parsed_sentence = list(parser.raw_parse("The cat sat on the mat."))``` |
| **Text Classification**         | ```python<br>nlp.add_pipe("textcat")<br>text_cat = nlp.get_pipe("textcat")``` | ```python<br>from nltk.classify import NaiveBayesClassifier<br>classifier = NaiveBayesClassifier.train(train_data)``` |
| **Corpora Access**              | ```python<br># spaCy ships no corpora; this downloads a trained pipeline instead<br>from spacy.cli import download<br>download("en_core_web_sm")``` | ```python<br>from nltk.corpus import movie_reviews<br>words = movie_reviews.words()``` |
| **Visualizing Dependency Parse**| ```python<br>from spacy import displacy<br>displacy.render(doc, style='dep')``` | ```python<br>import matplotlib.pyplot as plt<br># Use additional libraries for visualization``` |


