# **Get started with spaCy**

* spaCy official site: https://spacy.io/
* spaCy models: https://github.com/explosion/spacy-models

In [7]:
# Install spaCy and the medium English pipeline first:
#   pip install -U spacy
#   python -m spacy download en_core_web_md
import spacy
print(spacy.__version__)
# Load the English pipeline (tokenizer, tagger, parser, NER, ...)
nlp = spacy.load("en_core_web_md")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)


3.7.4


In [2]:
nlp.component_names

['tok2vec',
 'tagger',
 'parser',
 'senter',
 'attribute_ruler',
 'lemmatizer',
 'ner']

## Analyze syntax

In [45]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']


## Part-of-speech tags and dependencies (Linguistic annotations)


In [46]:
# Fixed-width columns, one per token attribute
format_string = "{:10} | {:10} | {:10} | {:10} | {:10}| {:10} | {:10} | {:10}"

# Header row
print(
    format_string.format(
        "Text",
        "Lemma",
        "POS",
        "Tag",
        "Dependency",
        "Shape",
        "Is Alpha",
        "Is Stop",
    )
)
print('-'*100)
# One row per token; the boolean flags are printed as 1/0 by the format spec
for token in doc[:10]:
    print(
        format_string.format(
            token.text,
            token.lemma_,
            token.pos_,
            token.tag_,
            token.dep_,
            token.shape_,
            token.is_alpha,
            token.is_stop,
        )
    )


Text       | Lemma      | POS        | Tag        | Dependency| Shape      | Is Alpha   | Is Stop   
----------------------------------------------------------------------------------------------------
When       | when       | SCONJ      | WRB        | advmod    | Xxxx       |          1 |          1
Sebastian  | Sebastian  | PROPN      | NNP        | compound  | Xxxxx      |          1 |          0
Thrun      | Thrun      | PROPN      | NNP        | nsubj     | Xxxxx      |          1 |          0
started    | start      | VERB       | VBD        | advcl     | xxxx       |          1 |          0
working    | work       | VERB       | VBG        | xcomp     | xxxx       |          1 |          0
on         | on         | ADP        | IN         | prep      | xx         |          1 |          1
self       | self       | NOUN       | NN         | npadvmod  | xxxx       |          1 |          0
-          | -          | PUNCT      | HYPH       | punct     | -          |          0
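
The tag and dependency abbreviations in the table above can be looked up with `spacy.explain`, which returns a short description from spaCy's built-in glossary. This is a small sketch that needs no trained pipeline; the selection of labels is only illustrative.

```python
import spacy

# Labels taken from the table above; spacy.explain returns None for unknown labels.
labels = ["SCONJ", "PROPN", "WRB", "VBD", "advcl", "xcomp"]

descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:8} {description}")
```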

## Find Named Entities
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.


In [47]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun FAC
Recode ORG
earlier this week DATE
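
In the output above, the second mention of "Thrun" is labelled `FAC` rather than `PERSON`, which is an example of the imperfect predictions mentioned earlier. One possible way to tune such cases is a rule-based `entity_ruler`. The sketch below is an assumption about how that could look, not part of the original notebook: it uses a blank English pipeline (`nlp_rules`, a name chosen for illustration) so that it runs without a trained model. With the trained pipeline you would probably add the ruler with `before="ner"` so the rules take precedence.

```python
import spacy

# Blank pipeline: tokenizer only, no statistical components (illustrative setup).
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Hypothetical patterns for this one example text.
ruler.add_patterns([
    {"label": "PERSON", "pattern": [{"LOWER": "sebastian"}, {"LOWER": "thrun"}]},
    {"label": "PERSON", "pattern": "Thrun"},
    {"label": "ORG", "pattern": "Google"},
])

doc_rules = nlp_rules("Sebastian Thrun worked at Google, said Thrun.")
rule_ents = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(rule_ents)
```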


## Visualization

In [48]:
# Serve the dependency parse of the first 24 tokens on a local web server
spacy.displacy.serve(doc[:24], options={"compact": True})




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
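
`displacy.serve` blocks until the server is stopped. As an alternative sketch, `displacy.render` returns the markup as a string, which can be saved to a file or shown inline. The example below is an assumption-laden illustration: it uses the `ent` visualizer on a blank pipeline with entities set by hand (`nlp_viz` and `doc_viz` are names chosen here), so it does not need the trained model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_viz = spacy.blank("en")
doc_viz = nlp_viz("Sebastian Thrun worked at Google.")

# Entities are set manually because a blank pipeline has no NER component.
doc_viz.ents = [
    Span(doc_viz, 0, 2, label="PERSON"),
    Span(doc_viz, 4, 5, label="ORG"),
]

# jupyter=False forces a string return value instead of inline display.
html = displacy.render(doc_viz, style="ent", jupyter=False)
print(html[:200])
```

Inside a notebook, passing `jupyter=True` (or leaving auto-detection on) displays the result in the output cell instead.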


## Pipelines

When you call `nlp` on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the **processing pipeline**. A trained pipeline typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

The tokenizer always runs before the components. Further pipeline components can be added using [`Language.add_pipe`](https://spacy.io/api/language#add_pipe).
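
As a minimal sketch of this flow, the snippet below builds a blank English pipeline and adds two components with `add_pipe`: the built-in `sentencizer` and a small custom component. The names `token_counter`, `nlp_demo` and the `n_tokens` key are made up for illustration.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")
def token_counter(doc):
    # A component receives the Doc, may modify it, and must return it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp_demo = spacy.blank("en")                    # tokenizer only
nlp_demo.add_pipe("sentencizer")                # rule-based sentence splitter
nlp_demo.add_pipe("token_counter", last=True)   # custom component runs last

print(nlp_demo.pipe_names)
doc_demo = nlp_demo("spaCy tokenizes first. Components run next.")
print(doc_demo.user_data["n_tokens"], [sent.text for sent in doc_demo.sents])
```

Keyword arguments such as `first`, `last`, `before` and `after` control where a component is placed in the pipeline.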

![spaCy processing pipeline](resources/pepelines.PNG)

| Name                              | Description                                                                                                 |
|-----------------------------------|-------------------------------------------------------------------------------------------------------------|
| Tokenization                      | Segmenting text into words, punctuation marks etc.                                                          |
| Part-of-speech (POS) Tagging      | Assigning word types to tokens, like verb or noun.                                                          |
| Dependency Parsing                | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Lemmatization                     | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences.                                                                |
| Named Entity Recognition (NER)    | Labelling named “real-world” objects, like persons, companies or locations.                                 |
| Entity Linking (EL)               | Disambiguating textual entities to unique identifiers in a knowledge base.                                  |
| Similarity                        | Comparing words, text spans and documents and how similar they are to each other.                           |
| Text Classification               | Assigning categories or labels to a whole document, or parts of a document.                                  |
| Rule-based Matching               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| Training                          | Updating and improving a statistical model’s predictions.                                                   |
| Serialization                     | Saving objects to files or byte strings.                                                                    |
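
To illustrate one row of the table, here is a small sketch of rule-based matching with spaCy's `Matcher`. It runs on a blank pipeline and matches on token text only; the pattern name `SELF_DRIVING` and the example sentence are assumptions made for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_match = spacy.blank("en")
matcher = Matcher(nlp_match.vocab)

# One dict per token; the tokenizer splits "self-driving" into "self", "-", "driving".
pattern = [
    {"LOWER": "self"},
    {"ORTH": "-"},
    {"LOWER": "driving"},
    {"LOWER": "cars"},
]
matcher.add("SELF_DRIVING", [pattern])

doc_match = nlp_match("Thrun started working on self-driving cars at Google.")
matches = [doc_match[start:end].text for _, start, end in matcher(doc_match)]
print(matches)
```

With a trained pipeline, patterns can also refer to linguistic annotations such as `POS` or `LEMMA`.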


In [8]:
from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
config = {
   "threshold": 0.5,
   "model": DEFAULT_MULTI_TEXTCAT_MODEL,
}
# Add a multi-label text classifier to the end of the pipeline
texcat = nlp.add_pipe("textcat_multilabel", config=config)

## Writing your own train and predict functions

* https://wandb.ai/authors/Kaggle-NLP/reports/Kaggle-s-NLP-Text-Classification--VmlldzoxOTcwNTc
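
A rough sketch of what such functions could look like with spaCy's training API is shown below. It is not taken from the linked report: the function names `train` and `predict`, the labels, and the four toy training texts are all assumptions, and a blank pipeline is used so that the example runs without a trained model. Real use needs far more data and an evaluation step.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import fix_random_seed

fix_random_seed(0)

# Toy data, invented for illustration only.
TRAIN_DATA = [
    ("I loved this film", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("A wonderful, warm story", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hated every minute", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("A dull and boring plot", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]


def train(nlp, data, epochs=10):
    """Train all components of `nlp` on (text, annotations) pairs."""
    examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in data]
    optimizer = nlp.initialize(lambda: examples)  # infers the labels from the data
    losses = {}
    for _ in range(epochs):
        random.shuffle(examples)
        losses = {}
        nlp.update(examples, sgd=optimizer, losses=losses)
    return losses


def predict(nlp, texts):
    """Return one {label: score} dict per input text."""
    return [doc.cats for doc in nlp.pipe(texts)]


nlp_cat = spacy.blank("en")
nlp_cat.add_pipe("textcat_multilabel")
final_losses = train(nlp_cat, TRAIN_DATA)
predictions = predict(nlp_cat, ["What a lovely film", "So boring"])
print(final_losses, predictions)
```

Because `textcat_multilabel` scores each label independently, the scores for one text do not have to sum to 1.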