# Basic features of spaCy library

## Preliminary actions

```bash
$ pip install spacy numpy pandas scikit-learn matplotlib
$ python -m spacy download en_core_web_sm
$ python -m spacy download en_core_web_lg
```

## Loading libraries

In [1]:
import pandas as pd
import numpy as np
import spacy

## Loading the English model (and its associated data)

In [2]:
nlp = spacy.load("en_core_web_sm")

## Tokenization

*Segmenting text into words, punctuation marks etc.*

#### Example

In [3]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


In [4]:
pd.DataFrame.from_dict(data={str(key): [token.text] for (key, token) in enumerate(doc)})

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Apple,is,looking,at,buying,U.K.,startup,for,$,1,billion


### Tokenization rules

* spaCy can split complex, nested tokens such as combinations of abbreviations and multiple punctuation marks.
* Tokenizer exceptions strongly depend on the specifics of the individual language.
* Each language has its own subclass, like English or German, that loads lists of hard-coded data and exception rules.

![Alt text](https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg)
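
To see which kind of rule produced each token, the tokenizer can be asked to explain itself. The sketch below is illustrative only: it assumes a spaCy version that provides `Tokenizer.explain`, and it uses a blank English pipeline (`spacy.blank("en")`) so that no trained model is required.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")

# explain() returns (rule name, token text) pairs for debugging the tokenizer.
explained = nlp.tokenizer.explain("(don't)")
for rule, text in explained:
    print(f"{rule:<10} {text}")
```

Prefix and suffix rules strip the parentheses, while a language-specific exception splits the contraction into two tokens.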

### Customization of the default tokenization rules

* Customizing the rules is crucial for uncommon use cases,
* e.g. adding rules for specific abbreviations in technical reports.
* In such cases it often has a bigger impact on performance than spending time on model selection or hyper-parameter tuning.
* More information on how to customize rules with spaCy is available [here](https://spacy.io/usage/linguistic-features#tokenization).
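
As a minimal sketch of such a customization, a special case can be registered on the tokenizer. The word "gimme" and its split are purely illustrative; a blank English pipeline is assumed so the example runs without a downloaded model.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Default behaviour: "gimme" stays a single token.
before = [token.text for token in nlp("gimme that")]

# Register a special case that splits "gimme" into two tokens.
# The ORTH values must concatenate back to the original string.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]

print(before)
print(after)
```

The same mechanism could be used for domain abbreviations, provided the pieces still add up to the original text.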

## Lemmatization

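* Reduces the inflected forms of a word to a common base form (the lemma), discarding spelling variations that are useless for machine processing.
* E.g. connect, connecting, connected => **connect**. A derived noun such as connection keeps its own lemma, as the output below shows.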

#### Example

In [5]:
# Implementing lemmatization
lem = nlp("connect connection connecting connected")
# finding lemma for each word
pd.DataFrame.from_dict({
    "TEXT": [token.text for token in lem],
    "LEMMA": [token.lemma_ for token in lem],
})

Unnamed: 0,TEXT,LEMMA
0,connect,connect
1,connection,connection
2,connecting,connect
3,connected,connect


## spaCy’s Statistical Models

* The **engines** that power spaCy
* Used to perform several NLP tasks such as part-of-speech tagging, named entity recognition, dependency parsing and, for the larger models, word vectors.

List of the different statistical models in spaCy along with their specifications:

* **en_core_web_sm**: English multi-task CNN trained on OntoNotes. Size — 11 MB
* **en_core_web_md**: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size — 91 MB
* **en_core_web_lg**: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size — 789 MB

Importing these models:
```python
nlp = spacy.load('en_core_web_sm')
```
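
Because `spacy.load` fails when a model package is missing, it can help to check what is installed first. This is a small sketch using `spacy.util.is_package`; the tuple `MODEL_NAMES` is just a convenience name introduced here.

```python
from spacy.util import is_package

# The three English models listed above.
MODEL_NAMES = ("en_core_web_sm", "en_core_web_md", "en_core_web_lg")

# is_package() reports whether a model is installed as a Python package.
status = {name: is_package(name) for name in MODEL_NAMES}
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```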

## Part-of-speech tags and dependencies 

**Part-of-Speech tagging (POS)**
> Assigning word types to tokens, like verb or noun.

**Dependency Parsing**
> Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

#### Example

In [6]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

pd.DataFrame.from_dict({
    "TEXT": [token.text for token in doc],
    "LEMMA": [token.lemma_ for token in doc],
    "POS": [token.pos_ for token in doc],
    "TAG": [token.tag_ for token in doc],
    "DEP": [token.dep_ for token in doc],
    "SHAPE": [token.shape_ for token in doc],
    "STOP WORD": [token.is_stop for token in doc]
})

Unnamed: 0,TEXT,LEMMA,POS,TAG,DEP,SHAPE,STOP WORD
0,Apple,Apple,PROPN,NNP,nsubj,Xxxxx,False
1,is,be,AUX,VBZ,aux,xx,True
2,looking,look,VERB,VBG,ROOT,xxxx,False
3,at,at,ADP,IN,prep,xx,True
4,buying,buy,VERB,VBG,pcomp,xxxx,False
5,U.K.,U.K.,PROPN,NNP,compound,X.X.,False
6,startup,startup,NOUN,NN,dobj,xxxx,False
7,for,for,ADP,IN,prep,xxx,True
8,$,$,SYM,$,quantmod,$,False
9,1,1,NUM,CD,compound,d,False


* __Text__: The original word text.
* __Lemma__: The base form of the word.
* __POS__: The simple UPOS part-of-speech tag.
* __Tag__: The detailed part-of-speech tag.
* __Dep__: Syntactic dependency, i.e. the relation between tokens.
* __Shape__: The word shape – capitalization, punctuation, digits.
* __is stop__: Is the token part of a stop list, i.e. the most common words of the language?
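
The tag and dependency labels in the table are terse. `spacy.explain` returns a short human-readable description for many of them; the labels chosen below are simply the ones that appear in the output above.

```python
import spacy

# Look up short descriptions for a few labels from the table.
labels = ("PROPN", "VBZ", "nsubj", "pcomp")
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```

No model needs to be loaded for this, since the descriptions come from a glossary shipped with the library.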

In [7]:
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

## Named Entity Recognition 

> Labelling named “real-world” objects, like persons, companies or locations.

[List of all entities](https://spacy.io/api/annotation#named-entities)

#### Example

In [8]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

pd.DataFrame.from_dict({
    "TEXT": [ent.text for ent in doc.ents],
    "START": [ent.start_char for ent in doc.ents],
    "END": [ent.end_char for ent in doc.ents],
    "LABEL": [ent.label_ for ent in doc.ents],
    "DESCRIPTION": ["Companies, agencies, institutions.",
                    "Geopolitical entity, i.e. countries, cities, states.",
                    "Monetary values, including unit."]
})

Unnamed: 0,TEXT,START,END,LABEL,DESCRIPTION
0,Apple,0,5,ORG,"Companies, agencies, institutions."
1,U.K.,27,31,GPE,"Geopolitical entity, i.e. countries, cities, s..."
2,$1 billion,44,54,MONEY,"Monetary values, including unit."


In [9]:
displacy.render(doc, style="ent", jupyter=True)
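
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes displaCy's manual mode and reuses the character offsets from the entity table above, so it runs without a trained model.

```python
from spacy import displacy

# Entities copied from the table above; offsets are character positions.
example = {
    "text": "Apple is looking at buying U.K. startup for $1 billion",
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 27, "end": 31, "label": "GPE"},
        {"start": 44, "end": 54, "label": "MONEY"},
    ],
    "title": None,
}

# manual=True renders from a dict; jupyter=False returns the HTML as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(len(html), "characters of HTML")
```

The returned string can then be written to a file or embedded in a web page.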

### Customization 

* You can add your own entity types
* and retrain a model to recognize them (more information [here](https://spacy.io/usage/training))
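
Retraining needs annotated data. A lighter way to add entities, shown here as a sketch and not as a replacement for training, is the rule-based entity ruler. The example assumes spaCy v3, where the component is added by name with `nlp.add_pipe("entity_ruler")`; the patterns and the sentence are illustrative.

```python
import spacy

nlp = spacy.blank("en")

# Assumes spaCy v3: built-in components are added by their string name.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match.
    {"label": "ORG", "pattern": "Apple"},
    # Token-based match, case-insensitive.
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Apple is opening its first big office in San Francisco.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```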

## Word vectors and similarity 

> Comparing words, text spans and documents and how similar they are to each other.

* Each word is embedded/represented by a vector of floats.
* Similarity between two words is determined by the proximity of their vectors.
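
Proximity is usually measured with cosine similarity, which spaCy's `similarity` methods use by default. The sketch below implements it in plain Python on made-up toy vectors; the function name `cosine_similarity` and the vectors are illustrative and not part of spaCy.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention used here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (real spaCy vectors have 300 dimensions).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.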

#### We need to load a larger English model in order to have those vectors

In [10]:
%%time
nlp = spacy.load("en_core_web_lg")

CPU times: user 5.3 s, sys: 1.24 s, total: 6.54 s
Wall time: 6.79 s


#### Example

In [11]:
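# nlp("house") returns a Doc, not a Token. Doc.vector is the average of the
# token vectors, which for a one-word text is that word's own vector.
house_doc = nlp("house")
house_vector = house_doc.vector
print(house_vector.shape)
house_vector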

(300,)


array([ 1.9847e-01,  1.8087e-01, -8.9119e-02, -2.5626e-01,  7.4104e-02,
        5.9422e-03, -8.0814e-02, -8.7499e-01,  1.6353e-01,  2.7836e+00,
       -8.9134e-01,  3.7017e-02, -5.5995e-01, -2.1853e-01, -3.6847e-01,
        4.2609e-01,  2.5508e-02,  1.1834e+00, -5.9869e-02, -1.6261e-02,
        3.6331e-01,  1.2664e-01,  3.1424e-01,  2.3845e-02,  5.7331e-02,
       -4.7905e-01, -2.3247e-01,  2.3379e-02, -2.9739e-01,  1.0735e-01,
        2.9723e-01,  5.4123e-02, -2.6837e-01,  4.8272e-01, -4.8055e-02,
       -1.0766e-02,  1.6169e-01, -7.4395e-02,  1.2789e-03, -6.1155e-02,
        2.4258e-01,  1.4165e-02,  8.3789e-02, -3.5793e-01, -4.8655e-02,
        1.1436e-01,  2.7535e-01, -9.2720e-01,  3.2332e-01,  1.6197e-01,
       -2.6260e-01, -3.2542e-01,  1.8347e-01,  5.7849e-01,  1.9925e-01,
       -3.7611e-01,  1.8520e-01,  1.3349e-01,  1.9571e-01,  5.1844e-01,
        2.0733e-01,  2.0470e-01,  8.3850e-02,  4.2725e-01,  1.1571e-01,
       -1.2066e-01, -7.6344e-02,  2.2959e-01, -1.9066e-01,  2.88

In [12]:
tokens = nlp("dog cat banana afskfsd")

pd.DataFrame.from_dict({
    "TEXT": [token.text for token in tokens],
    "HAS VECTOR": [token.has_vector for token in tokens],
    "IS OOV": [token.is_oov for token in tokens],
})

Unnamed: 0,TEXT,HAS VECTOR,IS OOV
0,dog,True,False
1,cat,True,False
2,banana,True,False
3,afskfsd,False,True


* If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package, for example, **en_vectors_web_lg**, which includes over 1 million unique vectors.
* Most state-of-the-art pretrained word representations (e.g. BERT) can be found [here](https://spacy.io/models/en-starters).

In [13]:
dog_token = tokens[0]
cat_token = tokens[1]
banana_token = tokens[2]
print("Cat and Dog similarity: {0:.2f}".format(dog_token.similarity(cat_token)))
print("Cat and Banana similarity: {0:.2f}".format(cat_token.similarity(banana_token)))

Cat and Dog similarity: 0.80
Cat and Banana similarity: 0.28


## What are Stop Words?

* Common words in a vocabulary which are of little value when considering word frequencies in text. 
* They don't provide much useful information about what the sentence is telling the reader.

> Example: "the","and","a","are","is"
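
A quick way to check whether a given word is on spaCy's English stop list is a set membership test. The sketch below checks the five example words above; the variable names are introduced here for illustration.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of lowercase strings.
examples = ["the", "and", "a", "are", "is"]
flags = {word: word in STOP_WORDS for word in examples}
print(flags)
```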

In [14]:
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
# Print the total number of stop words
print('Number of stop words: %d' % len(spacy_stopwords))
# Print twenty stop words (STOP_WORDS is a set, so the order is arbitrary)
print('Twenty stop words: %s' % list(spacy_stopwords)[:20])

Number of stop words: 326
Twenty stop words: ['herself', 'her', '‘s', 'alone', 'being', 'besides', 'whether', 'it', 'we', 'well', 'them', 'everywhere', 'around', 'due', 'however', 'along', 'what', 'throughout', 'between', 'n‘t']


In [15]:
sentence = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Keep only the tokens that are not stop words
doc_without_stop_words = [word for word in sentence if not word.is_stop]
print("Filtered Sentence:", doc_without_stop_words)

Filtered Sentence: [Apple, looking, buying, U.K., startup, $, 1, billion]
