# <center> <font size = 24 color = 'steelblue'> <b> Document vectors

## Overvi: 

The goal is to generate document vectors using SpaCy by processing text data and calculating average vector representations. The notebook demonstrates how to extract linguistic annotations and token-level vectors for further analysis. This approach enables efficient handling of text data for downstream NLP tasks.

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b> Introduction

In Natural Language Processing (NLP), representing texts in a numerical format is essential for applying machine learning models to textual data. Doc2Vec is a technique that allows us to obtain dense, fixed-length vector representations for texts of arbitrary lengths—whether it's a phrase, sentence, paragraph, or entire document. Unlike traditional methods (like Bag-of-Words or TF-IDF), Doc2Vec considers the context of words in a text, which results in richer and more meaningful embeddings.

### Why Use spaCy for Document Embeddings?
spaCy is a popular and powerful Python library for NLP that provides pre-trained models, word embeddings, and robust tools for tokenization, lemmatization, part-of-speech tagging, named entity recognition, and more. spaCy is highly efficient and is designed for production use.

While Doc2Vec is typically associated with the Gensim library, we can also leverage spaCy's capabilities to generate document vectors using averaging techniques with pre-trained word embeddings.

<div class="alert alert-block alert-success">
<font size = 4>
    
- Doc2vec allows us to directly learn the representations for texts of arbitrary lengths (phrases, sentences, paragraphs and documents), by taking the context of words in the text into account.
- This notebook demonstrates the creation of a document vector using averaging with spaCy.
- spaCy is a python library for natural language processing (NLP) which has a lot of built-in capabilities and features.
- spaCy has different types of models. "The default English language model is `en_core_web_sm`.

##### <a id = 'c1'>
<font size = 10 color = 'midnightblue'> <b>  Install packages (spaCy) and necessary dependencies

In [None]:
!pip install spacy==2.2.4
!python -m spacy download en_core_web_lg

Collecting spacy==2.2.4
  Downloading spacy-2.2.4.tar.gz (6.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.1/6.1 MB[0m [31m60.8 MB/s[0m eta [36m0:00:00[0m
[?25h  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpip subprocess to install build dependencies[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Installing build dependencies ... [?25l[?25herror
[1;31merror[0m: [1msubprocess-exited-with-error[0m

[31m×[0m [32mpip subprocess to install build dependencies[0m did not run successfully.
[31m│[0m exit code: [1;36m1[0m
[31m╰─>[0m See above for output.

[1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_cor

<font size = 5 color = seagreen> <b> Import spacy and load the model

In [1]:
import spacy
nlp = spacy.load("en_core_web_lg")

  hasattr(torch, "has_mps")
  and torch.has_mps  # type: ignore[attr-defined]
2024-10-28 03:24:59.963160: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-28 03:25:00.002362: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


VOC-NOTICE: GPU memory for this assignment is capped at 1024MiB


2024-10-28 03:25:02.135341: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


<font size = 5 color = seagreen> <b> Define document set.
<div class="alert alert-block alert-success">
<font size = 4>
    
**Assuming each statement corresponds to a separate document**

In [2]:
documents = ["The analyst reviews the dataset to identify trends and patterns."
             "Data analysis helps businesses make informed decisions based on facts and figures.",
             "In a research project the team gathers data for subsequent analysis.",
             "Charts and graphs are used to visually represent the results of data analysis.",
             "Analyzing customer feedback data provides valuable insights for product improvement."]

In [3]:
processed = [doc.lower().replace(".","") for doc in documents]
print("Document After Pre-Processing:",processed)

Document After Pre-Processing: ['the analyst reviews the dataset to identify trends and patternsdata analysis helps businesses make informed decisions based on facts and figures', 'in a research project the team gathers data for subsequent analysis', 'charts and graphs are used to visually represent the results of data analysis', 'analyzing customer feedback data provides valuable insights for product improvement']


<div class="alert alert-block alert-success">
<font size = 4>  <b>Iterate over each document and instantiate an nlp instance.

In [4]:
for doc in processed:
    # Create a spacy object which is a container for accessing linguistic annotations.
    oc_nlp = nlp(doc)
    print("-"*30)

    # This gives the average vector of each document.
    print("Average Vector of '{}'\n".format(doc),oc_nlp.vector) # Use oc_nlp

    # This gives the text of each word in the doc and their respective vectors.
    for token in oc_nlp: # Use oc_nlp
        print()
        print(token.text,token.vector)

------------------------------
Average Vector of 'the analyst reviews the dataset to identify trends and patternsdata analysis helps businesses make informed decisions based on facts and figures'
 [-1.0201000e+00  6.2543923e-01 -2.0599742e+00 -8.0464765e-02
  4.5161052e+00 -6.0325146e-01  8.3400942e-02  4.1078167e+00
 -1.5159620e+00 -1.1299504e+00  5.6197333e+00  2.8904324e+00
 -5.1036949e+00  7.2585815e-01  4.3935531e-01  2.9015341e+00
  3.8535662e+00 -5.3094435e-01 -1.3011557e+00 -2.9100089e+00
  7.8894794e-02 -1.6501561e+00 -1.6402471e+00 -7.4460286e-01
 -5.6991267e-01 -2.0238583e+00 -2.1424439e+00 -2.6869714e-01
 -1.0888776e+00  1.0297308e+00  1.2883589e+00  7.8147525e-01
 -1.7258956e+00 -2.0856442e+00 -8.2885236e-01 -9.3300593e-01
 -5.0691485e-01  6.8077087e-01  1.4439805e+00  1.1614985e+00
  1.4680083e+00  9.5859581e-01 -2.5663763e-01 -6.2966055e-01
 -2.0209975e+00  1.3653687e+00  2.4446952e+00 -2.3646829e+00
 -3.5524315e-01  1.3939896e+00  1.3539715e-01  2.5206051e+00
 -1.715843