# **spaCy**
spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It features tokenization, sentence splitting, named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors and more. spaCy supports 72+ languages, including <b>Greek</b>.

# **NLP pipeline**
![picture](https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several steps, together referred to as the **processing pipeline**. The pipeline used by the <a href="https://spacy.io/models">trained pipelines</a> typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.
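
To make the "each component returns the `Doc`" idea concrete, here is a minimal sketch that builds a tiny pipeline by hand. It assumes spaCy v3 and uses a blank English pipeline so that no trained model is needed; the component name `token_counter` and the variable names are made up for illustration and are not part of spaCy.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: it receives the Doc, annotates it,
# and must return it so the next component can continue.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


# A blank pipeline contains only the tokenizer.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")    # built-in rule-based sentence splitter
nlp_blank.add_pipe("token_counter")  # our illustrative component runs last

doc_blank = nlp_blank("spaCy tokenizes first. Then each component runs in order.")

print(nlp_blank.pipe_names)            # components in execution order
print(doc_blank.user_data["n_tokens"])  # value stored by token_counter
```

The trained pipelines work the same way, only with statistical components such as the tagger and parser in place of these toy ones.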

## Install spaCy and download the pretrained models for English and Greek

In [1]:
!pip install -U spacy
!python -m spacy download en_core_web_sm
!python -m spacy download el_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
2023-04-04 20:32:32.481588: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-04 20:32:35.711637: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 65.5 MB/s eta 0:00:00
✔ Downloa

In [2]:
text = 'The Athens University of Economics and Business (AUEB) was originally founded in \
1920 under the name of Athens School of Commercial Studies. \
It was renamed in 1926 as the Athens School of Economics and Business, \
a name that was retained until 1989 when it assumed its present name, \
the Athens University of Economics and Business. It is the third oldest \
university in Greece and the oldest in the fields of economics and business. \
Up to 1955 the school offered only one degree in the general area of economics and \
commerce. In 1955 it started two separate programs leading to two separate degrees: \
one in economics and the other in business administration. In 1984 the school was \
divided into three departments, namely the Department of Economics, the Department of \
Business Administration and the Department of Statistics and Informatics. In 1989, the \
university expanded to six departments. From 1999 onwards, the university developed \
even further and nowadays it includes eight academic departments, offering eight \
undergraduate degrees, 28 master\'s degrees and an equivalent number of doctoral programs.'

## Load and use spaCy's pretrained model for English
* `en_core_web_sm`: English multi-task CNN trained on the datasets `OntoNotes 5`, `ClearNLP Constituent-to-Dependency Conversion` and `WordNet 3.0`. The model assigns context-specific token vectors, POS tags, dependency parses and named entities.
* `el_core_news_sm`: The corresponding model for Greek, trained on the datasets `UD Greek GDT v2.5`, `Greek NER Corpus` and `spaCy lookups data`.

More details about the pretrained models can be found at: https://spacy.io/models/

In [5]:
import spacy

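# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')
# Alternative: load the Greek pipeline instead
# nlp = spacy.load('el_core_news_sm')
# Alternative: skip components you do not need (the cells below need the tagger and parser)
# nlp = spacy.load('en_core_web_sm', disable=["tagger", "parser", "lemmatizer"])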

doc = nlp(text)
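
The commented-out `disable=[...]` option above switches components off at load time. As a rough sketch of what disabling does, the snippet below uses a blank English pipeline with only the rule-based `sentencizer` (an assumption made so it runs without a downloaded model) and spaCy v3's `select_pipes` context manager to turn that component off temporarily. The variable names are illustrative.

```python
import spacy

nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

sample = "First sentence. Second sentence."

# Inside the block the sentencizer is disabled, so no sentence
# boundaries are written to the Doc.
with nlp_demo.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp_demo.pipe_names)
    doc_without = nlp_demo(sample)

# Outside the block the component is active again.
doc_with = nlp_demo(sample)

print(active_inside)
print(doc_without.has_annotation("SENT_START"))
print(doc_with.has_annotation("SENT_START"))
```

Disabling components you do not need makes processing faster, but any attribute those components would have set stays empty.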

## Tokenized sentences
The trained spaCy pipelines split your text into sentences and into words (tokens). The sentences are available through the `Doc.sents` property, and each token's text through the `Token.text` attribute.

In [6]:
for sent in doc.sents:
    toks = []
    for token in sent:
        toks.append(token.text)
    print(toks)
    print("_________________")

['The', 'Athens', 'University', 'of', 'Economics', 'and', 'Business', '(', 'AUEB', ')', 'was', 'originally', 'founded', 'in', '1920', 'under', 'the', 'name', 'of', 'Athens', 'School', 'of', 'Commercial', 'Studies', '.']
_________________
['It', 'was', 'renamed', 'in', '1926', 'as', 'the', 'Athens', 'School', 'of', 'Economics', 'and', 'Business', ',', 'a', 'name', 'that', 'was', 'retained', 'until', '1989', 'when', 'it', 'assumed', 'its', 'present', 'name', ',', 'the', 'Athens', 'University', 'of', 'Economics', 'and', 'Business', '.']
_________________
['It', 'is', 'the', 'third', 'oldest', 'university', 'in', 'Greece', 'and', 'the', 'oldest', 'in', 'the', 'fields', 'of', 'economics', 'and', 'business', '.']
_________________
['Up', 'to', '1955', 'the', 'school', 'offered', 'only', 'one', 'degree', 'in', 'the', 'general', 'area', 'of', 'economics', 'and', 'commerce', '.']
_________________
['In', '1955', 'it', 'started', 'two', 'separate', 'programs', 'leading', 'to', 'two', 'separate',
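
The nested loop above can also be written as a list comprehension. The sketch below shows this on a two-sentence sample; it assumes a blank English pipeline with the rule-based `sentencizer` instead of `en_core_web_sm`, so the sentence boundaries come from punctuation rules rather than from the parser. The variable names are illustrative.

```python
import spacy

nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # splits on sentence-final punctuation

small_doc = nlp_rules("It was founded in 1920. It was renamed in 1926.")

# Doc.sents yields one Span per sentence; iterating a Span yields Tokens.
sentences = [[token.text for token in sent] for sent in small_doc.sents]

for toks in sentences:
    print(toks)
```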

## POS Tags

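The spaCy models can also find which part-of-speech (POS) tag corresponds to each token. Each token carries a coarse-grained Universal POS tag in `token.pos_` and a fine-grained tag in `token.tag_`.

A list of the most common fine-grained POS tags (Penn Treebank style) that the English model stores in `token.tag_`: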

| Tag         | Description |
| ----------- | ----------- |
| CC   | Coordinating Conjunction |
| CD   | Cardinal Number |
| DT   | Determiner |
| EX   | Existential There. Example: “there is” (think of it as “there exists”) |
| FW   | Foreign Word. |
| IN   | Preposition/Subordinating Conjunction. |
| JJ   | Adjective. |
| JJR  | Adjective, Comparative. |
| JJS  | Adjective, Superlative. |
| LS   | List Marker. Example: 1) |
| MD   | Modal. |
| NN   | Noun, Singular. |
| NNS  | Noun Plural. |
| NNP  | Proper Noun, Singular. |
| NNPS | Proper Noun, Plural. |
| PDT  | Predeterminer. |
| POS  | Possessive Ending. Example: parent’s |
| PRP  | Personal Pronoun. Examples: I, he, she |
| RB   | Adverb. Examples: very, silently |
| RBR  | Adverb, Comparative. Example: better |
| RBS  | Adverb, Superlative. Example: best |
| RP   | Particle. Example: give up |
| TO   | to. Example: go ‘to’ the store. |
| UH   | Interjection. Example: errrrrrrrm |
| VB   | Verb, Base Form. Example: take |
| VBD  | Verb, Past Tense. Example: took |
| VBG  | Verb, Gerund/Present Participle. Example: taking |
| VBN  | Verb, Past Participle. Example: taken |
| VBP  | Verb, Non-3rd Person Singular Present. Example: take |
| VBZ  | Verb, 3rd Person Singular Present. Example: takes |
| WDT  | wh-determiner. Example: which |
| WP   | wh-pronoun. Example: who, what |
| WP$  | possessive wh-pronoun. Example: whose |
| WRB  | wh-adverb. Example: where, when |
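
Instead of looking tags up in the table, you can ask spaCy for a short description with `spacy.explain`. This is a small sketch; the list of labels is just a sample taken from the tags used in this notebook, and no trained model is needed to run it.

```python
import spacy

# A few fine-grained tags (token.tag_) and coarse-grained tags (token.pos_)
sample_labels = ["DT", "NNP", "VBD", "PROPN", "AUX"]

descriptions = {label: spacy.explain(label) for label in sample_labels}

for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```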

In [8]:
for sent in doc.sents:
    for token in sent:
        print("{} | {} | {}".format(token.text, token.pos_, token.tag_))
    print("_________________________")

The | DET | DT
Athens | PROPN | NNP
University | PROPN | NNP
of | ADP | IN
Economics | PROPN | NNP
and | CCONJ | CC
Business | PROPN | NNP
( | PUNCT | -LRB-
AUEB | PROPN | NNP
) | PUNCT | -RRB-
was | AUX | VBD
originally | ADV | RB
founded | VERB | VBN
in | ADP | IN
1920 | NUM | CD
under | ADP | IN
the | DET | DT
name | NOUN | NN
of | ADP | IN
Athens | PROPN | NNP
School | PROPN | NNP
of | ADP | IN
Commercial | PROPN | NNP
Studies | PROPN | NNPS
. | PUNCT | .
_________________________
It | PRON | PRP
was | AUX | VBD
renamed | VERB | VBN
in | ADP | IN
1926 | NUM | CD
as | ADP | IN
the | DET | DT
Athens | PROPN | NNP
School | PROPN | NNP
of | ADP | IN
Economics | PROPN | NNP
and | CCONJ | CC
Business | PROPN | NNP
, | PUNCT | ,
a | DET | DT
name | NOUN | NN
that | PRON | WDT
was | AUX | VBD
retained | VERB | VBN
until | ADP | IN
1989 | NUM | CD
when | SCONJ | WRB
it | PRON | PRP
assumed | VERB | VBD
its | PRON | PRP$
present | ADJ | JJ
name | NOUN | NN
, | PUNCT | ,
the | DET | DT
Athens

## Display dependency trees

In [9]:
from spacy import displacy

for sent in doc.sents:
    displacy.render(sent, style='dep', jupyter=True)
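
Besides the drawing, the dependency tree can be read directly from the tokens: `token.dep_` holds the relation label and `token.head` the governing token. The sketch below shows these attributes on a hand-annotated `Doc`. The words, head indices and labels are written by hand for illustration (they are not output of `en_core_web_sm`), which keeps the snippet runnable without a trained model. With the `doc` from above, the same loop works unchanged.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-written parse: each head is the absolute index of the governing token;
# the root token points to itself.
words = ["AUEB", "was", "founded", "in", "1920"]
heads = [2, 2, 2, 2, 3]
deps = ["nsubjpass", "auxpass", "ROOT", "prep", "pobj"]
parsed = Doc(vocab, words=words, heads=heads, deps=deps)

for token in parsed:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")

# The root is the only token that is its own head.
root = [token for token in parsed if token.head == token][0]
print("Root:", root.text, "| children:", [child.text for child in root.children])

# Outside a notebook, jupyter=False returns the SVG markup as a string.
svg = displacy.render(parsed, style="dep", jupyter=False)
```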

## Named Entity Recognition (NER)

A list of the entity types recognized by the English model:

| Type        | Description |
| ----------- | ----------- |
| PERSON      | People, including fictional. |
| NORP        | Nationalities or religious or political groups. |
| FAC         | Buildings, airports, highways, bridges, etc. |
| ORG         | Companies, agencies, institutions, etc. |
| GPE         | Countries, cities, states. |
| LOC         | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT     | Objects, vehicles, foods, etc. (Not services.) |
| EVENT       | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW         | Named documents made into laws. |
| LANGUAGE    | Any named language. |
| DATE        | Absolute or relative dates or periods. |
| TIME        | Times smaller than a day. |
| PERCENT     | Percentage, including “%”. |
| MONEY       | Monetary values, including unit. |
| QUANTITY    | Measurements, as of weight or distance. |
| ORDINAL     | “first”, “second”, etc. |
| CARDINAL    | Numerals that do not fall under another type. |

In [13]:
text2 = "ChatGPT is an artificial-intelligence (AI) chatbot developed by OpenAI and launched in November 2022.\
 It is built on top of OpenAI's GPT-3.5 and GPT-4 families of large language models (LLMs) and has been fine-tuned\
 (an approach to transfer learning) using both supervised and reinforcement learning techniques.\
 ChatGPT was launched as a prototype on November 30, 2022. It garnered attention for its detailed responses and articulate answers\
 across many domains of knowledge. Its uneven factual accuracy, however, has been identified as a significant drawback.\
 Following the release of ChatGPT, OpenAI's valuation was estimated at US$29 billion in 2023.\
 The original release of ChatGPT was based on GPT-3.5. A version based on GPT-4, the newest OpenAI model, was released on March\
 14, 2023, and is available for paid subscribers on a limited basis."
 
doc2 = nlp(text2)

# Visualize the entities
print()
displacy.render(doc2, style="ent", jupyter=True)

print()
for entity in doc2.ents:
    print("Text: {} - Label: {} - Range: ({}-{}): ".format(entity.text, entity.label_, entity.start_char, entity.end_char,))





Text: ChatGPT - Label: ORG - Range: (0-7): 
Text: AI - Label: ORG - Range: (39-41): 
Text: OpenAI - Label: GPE - Range: (64-70): 
Text: November 2022 - Label: DATE - Range: (87-100): 
Text: OpenAI - Label: GPE - Range: (124-130): 
Text: GPT-4 - Label: PERSON - Range: (145-150): 
Text: November 30, 2022 - Label: DATE - Range: (351-368): 
Text: OpenAI - Label: GPE - Range: (595-601): 
Text: US$29 billion - Label: MONEY - Range: (631-644): 
Text: 2023 - Label: DATE - Range: (648-652): 
Text: GPT-4 - Label: PERSON - Range: (727-732): 
Text: OpenAI - Label: GPE - Range: (745-751): 
Text: March 14, 2023 - Label: DATE - Range: (775-789): 
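
The ranges printed above are character offsets into the original string: `start_char` is the index of the entity's first character and `end_char` is one past its last. The sketch below checks this by slicing the text. To stay runnable without a trained model, it assumes a blank English pipeline with spaCy's rule-based `entity_ruler`, and the two patterns are hand-written for illustration, so the labels here come from those rules and not from the statistical model.

```python
import spacy

nlp_ner = spacy.blank("en")
ruler = nlp_ner.add_pipe("entity_ruler")
# Hypothetical hand-written patterns, only for this demonstration
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "ChatGPT"},
    {"label": "ORG", "pattern": "OpenAI"},
])

sample_text = "ChatGPT is a chatbot developed by OpenAI."
doc_ner = nlp_ner(sample_text)

found = []
for entity in doc_ner.ents:
    # Slicing the raw text with the offsets gives back the entity text
    sliced = sample_text[entity.start_char:entity.end_char]
    found.append((entity.text, entity.label_, sliced))
    print(f"Text: {entity.text} - Label: {entity.label_} - Slice: {sliced}")
```

These offsets are handy when you need to highlight or replace entities in the original string.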


## Resources
* https://spacy.io/usage/
* https://spacy.io/usage/linguistic-features#section-sbd
* https://spacy.io/usage/models