## <p style = 'text-align: center'>Introduction to spaCy</p>
---

In [1]:
import spacy

spaCy provides trained pipelines for many languages, including German and Chinese. The small models for a few of them are:

- English : *en_core_web_sm*
- Spanish : *es_core_news_sm*
- German : *de_core_news_sm*
- French : *fr_core_news_sm*
- Dutch : *nl_core_news_sm*

In [2]:
nlp = spacy.load('en_core_web_sm')

We create a new document by passing a string to the **nlp** object.

In [3]:
text = "Apple Inc. is based in Cupertino, California, and it was founded by Steve Jobs."
doc = nlp(text)

In spaCy, the Doc object represents a processed text and exposes its linguistic annotations through attributes and properties. Some of the most commonly used ones are:

**text**: The original text of the document.

**ents**: The named entities found in the text, returned as a tuple of Span objects.

**sents**: An iterator over the sentences in the document, each one a Span.

**Tokens**: The Doc has no tokens attribute. Iterate over the Doc itself (for token in doc) or index it (doc[0]) to get Token objects, where each token represents a word or punctuation mark in the text.

**noun_chunks**: An iterator over the base noun phrases in the text. It relies on the dependency parse.

**vector**: The document's vector representation, if a word vectors model is available.

**vector_norm**: The L2 norm of the document's vector.

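**is_parsed**: A Boolean value indicating whether the text has been syntactically parsed. Deprecated since spaCy v3; use doc.has_annotation("DEP") instead.

**is_tagged**: A Boolean value indicating whether part-of-speech tagging has been performed. Deprecated since spaCy v3; use doc.has_annotation("TAG") instead.

**is_nered**: A Boolean value indicating whether named entity recognition (NER) has been performed. Deprecated since spaCy v3; use doc.has_annotation("ENT_IOB") instead.

**has_annotation()**: A method that takes the name of a token attribute, such as "TAG" or "DEP", and returns whether the document carries that annotation.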

**user_data**: A dictionary where custom data can be stored.

**vocab**: The vocabulary of the language model used for tokenization and linguistic analysis.

**lang_**: The language code of the document, e.g. 'en'. The **lang** attribute without the underscore holds the corresponding integer ID.

**cats**: The document's category labels if text classification is performed.

**similarity()**: A method that estimates the semantic similarity between the document and another Doc, Span or Token. By default it is the cosine similarity of their vectors.
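A few of these attributes can be checked on a tiny example. The sketch below assumes spaCy v3 and uses spacy.blank("en"), a tokenizer-only pipeline chosen here so that nothing has to be downloaded, instead of the trained model. The sentence and the names blank_nlp and sample are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), not en_core_web_sm.
blank_nlp = spacy.blank("en")
sample = blank_nlp("Steve Jobs founded Apple.")

# A Doc is iterated or indexed directly to reach its tokens.
tokens = [token.text for token in sample]
print(tokens)
print(sample[0].text, len(sample))

# The language code is stored as a string in lang_.
print(sample.lang_)

# No tagger or NER component ran, so these annotations are absent.
print(sample.has_annotation("TAG"), sample.ents)
```

With the trained pipeline loaded through spacy.load('en_core_web_sm'), the same has_annotation("TAG") call should report True, since that pipeline includes a tagger.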

In [4]:
# Each noun chunk is a Span; its label_ is "NP" (noun phrase).
for chunk in doc.noun_chunks:
    print(f"{chunk.text} : {chunk.label_}")

Apple Inc. : NP
Cupertino : NP
California : NP
it : NP
Steve Jobs : NP


In [5]:
doc.ents

(Apple Inc., Cupertino, California, Steve Jobs)

In [6]:
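# Print each named entity together with its label.
for entity in doc.ents:
    print(f'{entity.text} : {entity.label_}')

# Collect the same information as (text, label) pairs.
y = [(entity.text, entity.label_) for entity in doc.ents]

print(y)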

Apple Inc. : ORG
Cupertino : GPE
California : GPE
Steve Jobs : PERSON
[('Apple Inc.', 'ORG'), ('Cupertino', 'GPE'), ('California', 'GPE'), ('Steve Jobs', 'PERSON')]


In [7]:
s = "Volkswagen is developing an electric sedan which could potentially come to America next fall."
doc = nlp(s)

## Part-of-Speech Tagging

Part-of-speech (POS) tagging classifies how each word is used in a sentence, e.g. as a noun, verb or adjective.

Coarse-grained POS tags can be accessed through the **pos_** attribute of each token.

In [9]:
[(t.text, t.pos_) for t in doc]

[('Volkswagen', 'PROPN'),
 ('is', 'AUX'),
 ('developing', 'VERB'),
 ('an', 'DET'),
 ('electric', 'ADJ'),
 ('sedan', 'NOUN'),
 ('which', 'PRON'),
 ('could', 'AUX'),
 ('potentially', 'ADV'),
 ('come', 'VERB'),
 ('to', 'ADP'),
 ('America', 'PROPN'),
 ('next', 'ADJ'),
 ('fall', 'NOUN'),
 ('.', 'PUNCT')]
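Once the tags are available, simple statistics and filters follow directly. The sketch below works on the (token, tag) pairs copied from the output above, so it runs without loading a model. The names tagged, pos_counts and content_words, and the choice of which tags count as content words, are illustrative assumptions.

```python
from collections import Counter

# (token, coarse POS tag) pairs copied from the output above.
tagged = [
    ("Volkswagen", "PROPN"), ("is", "AUX"), ("developing", "VERB"),
    ("an", "DET"), ("electric", "ADJ"), ("sedan", "NOUN"),
    ("which", "PRON"), ("could", "AUX"), ("potentially", "ADV"),
    ("come", "VERB"), ("to", "ADP"), ("America", "PROPN"),
    ("next", "ADJ"), ("fall", "NOUN"), (".", "PUNCT"),
]

# How often does each tag occur in the sentence?
pos_counts = Counter(pos for _, pos in tagged)
print(pos_counts)

# Keep only the words that carry most of the content.
content_tags = {"NOUN", "PROPN", "VERB", "ADJ"}
content_words = [text for text, pos in tagged if pos in content_tags]
print(content_words)
```

On a live Doc the same counter could be built with Counter(t.pos_ for t in doc).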

To get a description of a tag, we can use **spacy.explain()**.

In [10]:
spacy.explain('PROPN')

'proper noun'

We can also access fine-grained tags through the **tag_** attribute. They provide more detailed information about a token, such as its tense and, if the word is a pronoun, what type of pronoun it is.

In [11]:
[(t.text, t.tag_) for t in doc]

[('Volkswagen', 'NNP'),
 ('is', 'VBZ'),
 ('developing', 'VBG'),
 ('an', 'DT'),
 ('electric', 'JJ'),
 ('sedan', 'NN'),
 ('which', 'WDT'),
 ('could', 'MD'),
 ('potentially', 'RB'),
 ('come', 'VB'),
 ('to', 'IN'),
 ('America', 'NNP'),
 ('next', 'JJ'),
 ('fall', 'NN'),
 ('.', '.')]

In [12]:
spacy.explain('VBZ')

'verb, 3rd person singular present'
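Looking tags up one at a time gets tedious, so a small glossary can be built in one pass. This is a minimal sketch that only needs the spaCy package, not a trained model. The tag list is copied from the fine-grained output above, and the name glossary is an illustrative choice.

```python
import spacy

# Fine-grained tags taken from the output above.
tags = ["NNP", "VBZ", "VBG", "DT", "JJ", "NN", "WDT", "MD", "RB", "VB", "IN"]

# spacy.explain returns None for labels it does not know.
glossary = {tag: spacy.explain(tag) for tag in tags}

for tag, description in glossary.items():
    print(f"{tag:4} {description}")
```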

In [8]:
y = [(entity.text, entity.label_) for entity in doc.ents]

print(y)

[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next fall', 'DATE')]


## Bag of Words

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
corpus = [
    "Red Bull drops hint on F1 engine.",
    "Honda exists F1, leaving F1 partner Red Bull.",
    "Hamilton eyes record eighth F1 title.",
    "Aston Martin announces sponsor."
]

<img src = "Doc.png">

The **fit_transform** method performs 2 steps:
- *fit*: learns a vocabulary dictionary from the corpus, i.e. assigns an index to each unique word in the corpus.

<img src = "fit_method.png">

- *transform*: creates a **BoW** matrix with appropriate token counts for each document. It returns a matrix where each row represents a document and each column represents a token (i.e. term).

<img src = "transform_method.png">
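The two steps can be sketched in plain Python to make them concrete. This is not scikit-learn's implementation: the helper names tokenize, fit and transform are hypothetical, and the tokenization rule (lowercase, then keep runs of two or more word characters) is an assumption meant to approximate CountVectorizer's defaults.

```python
import re
from collections import Counter

corpus = [
    "Red Bull drops hint on F1 engine.",
    "Honda exists F1, leaving F1 partner Red Bull.",
    "Hamilton eyes record eighth F1 title.",
    "Aston Martin announces sponsor.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit(documents):
    # Give every unique token a column index, in alphabetical order.
    unique_tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}

def transform(documents, vocabulary):
    # One row per document, one count per vocabulary entry.
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocabulary])
    return rows

vocabulary = fit(corpus)
matrix = transform(corpus, vocabulary)

# Column of "f1" and its count in the second headline.
print(vocabulary["f1"], matrix[1][vocabulary["f1"]])  # prints: 8 2
```

The second headline mentions F1 twice, which is why its count in that column is 2 while every other non-zero entry in the matrix is 1.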


In [15]:
vectorizer = CountVectorizer()

bow = vectorizer.fit_transform(corpus)

The **CountVectorizer** took care of tokenization for us, removed punctuation and lower-cased everything.

In [16]:
print(bow)

  (0, 17)	1
  (0, 2)	1
  (0, 3)	1
  (0, 10)	1
  (0, 14)	1
  (0, 8)	1
  (0, 5)	1
  (1, 17)	1
  (1, 2)	1
  (1, 8)	2
  (1, 11)	1
  (1, 6)	1
  (1, 12)	1
  (1, 15)	1
  (2, 8)	1
  (2, 9)	1
  (2, 7)	1
  (2, 16)	1
  (2, 4)	1
  (2, 19)	1
  (3, 1)	1
  (3, 13)	1
  (3, 0)	1
  (3, 18)	1
