<a href="https://github.com/kennethenevoldsen/asent"><img src="https://github.com/KennethEnevoldsen/asent/blob/main/docs/img/logo_black_font.png?raw=true" width="300" /></a>

## Installation
Before we start, we need to install asent. This can be done by uncommenting the following line:

In [1]:
#!pip install asent

# Tutorial: Creating and Customizing your Pipeline

Asent is built using a series of extension attributes on the spaCy classes `Doc`, `Token` and `Span`. This allows you to switch out the parts and improve one component at a time. In this tutorial we will walk through how to customize your own pipeline. We will first take a quick approach, using the `Asent` component to implement a Swedish pipeline, and then customize the way the pipeline checks whether a word is negated.

> To read more about the custom spaCy extensions, check out the [documentation](https://kennethenevoldsen.github.io/asent/languages/index.html).

Before we start, we will need a spaCy pipeline for Swedish. spaCy only supplies an experimental version for Swedish, which we will use here, but I also recommend checking out the model by [the Swedish Royal Library](https://github.com/Kungbib/swedish-spacy).

In [3]:
# !pip install https://huggingface.co/explosion/sv_udv25_swedishtalbanken_trf/resolve/main/sv_udv25_swedishtalbanken_trf-any-py3-none-any.whl
# note this model is quite big so it can take a while

In [4]:
import spacy

nlp = spacy.load("sv_udv25_swedishtalbanken_trf")

## Creating the pipeline
A large part of the customization is made simple by the `Asent` component. Here we will implement the Swedish asent pipeline. For this we need a dictionary of rated words and, optionally, lists of intensifiers, negations and contrastive conjunctions. Only the rated words are mandatory.

We can extract these using the `lexicons.get` function:


In [5]:
from asent import lexicons

rated_words = lexicons.get("lexicon_sv_v1")
negations = lexicons.get("negations_sv_v1")
intensifiers = lexicons.get("intensifiers_sv_v1")

Before we move on, let us check what each of these contains. We start with `rated_words`, a dictionary that maps each word to a human rating of how positive or negative it is:

In [6]:
list(rated_words.items())[140:150]  # the start of the dictionary is mostly emoticons

[('motbjudande', -3.1),
 ('avskyr', -2.9),
 ('förmågor', 1.0),
 ('förmåga', 1.3),
 ('ombord', 0.1),
 ('frånvarande', -1.1),
 ('frikänna', 1.6),
 ('befrias', 1.5),
 ('frikänner', 1.6),
 ('missbruk', -3.2)]

`negations` is simply a list or a set of words that are considered negations:

In [7]:
list(negations)[:10]

['utan',
 'aldrig',
 'ingenting',
 'sällan',
 'trots',
 'ingenstans',
 'ingen',
 'nej',
 'varken',
 'inte']

Finally, `intensifiers` is a dictionary of words that intensify the valence of another word (e.g. "very"). Each word is associated with a score for how much it intensifies the following word.

In [8]:
list(intensifiers.items())[:10]

[('absolut', 0.293),
 ('otroligt', 0.293),
 ('väldigt', 0.293),
 ('helt', 0.293),
 ('betydande', 0.293),
 ('betydligt', 0.293),
 ('bestämt', 0.293),
 ('djupt', 0.293),
 ('effing', 0.293),
 ('enorm', 0.293)]
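
To make the role of the score concrete, here is a minimal sketch of how an intensifier could be applied. It follows the convention described for VADER by Hutto and Gilbert (2014), where the score is added in the direction of the word's valence. The function `apply_intensifier` is hypothetical and is not part of Asent, whose exact computation may differ (for example by dampening the effect with distance).

```python
def apply_intensifier(valence: float, boost: float) -> float:
    """Push a valence further away from zero by an intensifier score.

    Hypothetical helper for illustration only, not Asent's API.
    """
    if valence == 0:
        return valence  # a neutral word has no direction to intensify
    return valence + boost if valence > 0 else valence - boost


boost = 0.293  # the score of "väldigt" in the output above

# illustrative valences for a positive and a negative word
strong_positive = apply_intensifier(3.1, boost)
strong_negative = apply_intensifier(-2.1, boost)

print(strong_positive, strong_negative)
```

The point of the sign handling is that an intensifier makes a positive word more positive and a negative word more negative, rather than shifting everything upwards.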

We can now add a sentiment component to the pipeline, using the `Asent` component: 

In [10]:
import asent
from asent.component import Asent

Asent(
    nlp,
    name="asent_sv",
    lexicon=rated_words,
    intensifiers=intensifiers,
    negations=negations,
    lowercase=True,
    lemmatize=False,
)

# test it out and visualize results
doc = nlp("Jag är enormt lycklig")
asent.visualize(doc)



Note that we specified that the model should lowercase the word before looking it up in the dictionaries (`lowercase=True`) and that it should not lemmatize it (`lemmatize=False`). These settings should match the lexicons you are using: if your lexicon contains lemmas, you should lemmatize before the lookup, and if your lexicon is case sensitive, you should not lowercase before the lookup.
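
The effect of these two settings can be sketched with a plain dictionary lookup. The helper `lookup_valence` below is hypothetical and only mimics the idea; it is not how Asent is implemented internally. The lemma *avsky* for *avskyr* is supplied by hand here, whereas in a real pipeline it would come from the spaCy model.

```python
from typing import Dict


def lookup_valence(
    text: str,
    lemma: str,
    lexicon: Dict[str, float],
    lowercase: bool = True,
    lemmatize: bool = False,
) -> float:
    """Return the rating of a word, or 0.0 if it is not in the lexicon.

    Hypothetical helper for illustration only, not Asent's API.
    """
    key = lemma if lemmatize else text  # choose which form to look up
    if lowercase:
        key = key.lower()
    return lexicon.get(key, 0.0)


# a toy lexicon holding one lowercased, inflected entry from the output above
toy_lexicon = {"avskyr": -2.9}

hit = lookup_valence("Avskyr", "avsky", toy_lexicon)
miss_case = lookup_valence("Avskyr", "avsky", toy_lexicon, lowercase=False)
miss_lemma = lookup_valence("Avskyr", "avsky", toy_lexicon, lemmatize=True)

print(hit, miss_case, miss_lemma)
```

Only the first call finds the word. The second misses because the lexicon is lowercased, and the third misses because the lexicon holds inflected forms rather than lemmas.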

## Customizing your Pipeline
In the following, we will customize our pipeline a bit further, looking at negations in particular. The current implementation, based on [Hutto and Gilbert (2014)](https://ojs.aaai.org/index.php/ICWSM/article/view/14550), assumes that a word is negated if one of the three preceding words is a negation. This is a simplifying assumption which has been shown to generally work well. However, with spaCy performing a dependency parse and part-of-speech tagging, we can do better!
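
The window heuristic is easy to sketch on a plain list of tokens. The function `negated_by_window` is a hypothetical stand-in written for this tutorial and is not Asent's own code; the whitespace tokenization is also a simplification.

```python
from typing import List, Optional, Set


def negated_by_window(
    tokens: List[str], negations: Set[str], window: int = 3
) -> List[Optional[str]]:
    """For each token, return the nearest negation among the
    `window` preceding tokens, or None if there is none."""
    result: List[Optional[str]] = []
    for i in range(len(tokens)):
        found = None
        # scan the preceding tokens, nearest first
        for j in range(i - 1, max(i - window, 0) - 1, -1):
            if tokens[j].lower() in negations:
                found = tokens[j]
                break
        result.append(found)
    return result


tokens = "Jag är inte glad men jag skulle inte säga att jag är ledsen .".split()
flags = negated_by_window(tokens, {"inte"})

for token, negation in zip(tokens, flags):
    print(token, "\t", negation)
```

With a window of three, *ledsen* falls outside the reach of the second *inte*, while function words such as *men* and *att* are marked as negated. This is the weakness the rest of this section addresses.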

### Examining an example:
First let us examine an example where it fails:

In [11]:
doc = nlp("Jag är inte glad men jag skulle inte säga att jag är ledsen.")
# I am not happy but I would not say that I am sad.

asent.visualize(doc)



In [12]:
from spacy import displacy

# examine the part-of-speech tags and dependency tree
displacy.render(doc)

From this we can notice two things:
1) The negation has the [*PART*](https://universaldependencies.org/u/pos/PART.html) part-of-speech tag, indicating that it is a particle, a class which among other things includes negations.
2) The negation is related to another word by the [*advmod*](https://universaldependencies.org/u/dep/advmod.html) relation, and the words we wish to negate are "down the tree" (or down the arrow, if you will) from the word it modifies.

We can even go a bit further and examine the `morph` attribute of each token:

In [13]:
for t in doc:
    print(t, "\t", t.morph)

Jag 	 Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs
är 	 Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
inte 	 Polarity=Neg
glad 	 Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing
men 	 
jag 	 Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs
skulle 	 Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Act
inte 	 Polarity=Neg
säga 	 VerbForm=Inf|Voice=Act
att 	 
jag 	 Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs
är 	 Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
ledsen 	 Case=Nom|Definite=Ind|Gender=Com|Number=Sing|Tense=Past|VerbForm=Part
. 	 


Here we see that the negation *"inte"* is marked with [*Polarity=Neg*](https://universaldependencies.org/u/feat/Polarity.html#Neg), indicating that it is a negation.
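
The feature strings printed above follow the `Feature=Value|Feature=Value` layout, so the check can be sketched on the raw strings without a model. The helpers `parse_morph` and `has_negative_polarity` are hypothetical names used only for this illustration.

```python
from typing import Dict


def parse_morph(morph: str) -> Dict[str, str]:
    """Turn a string such as 'Mood=Ind|Tense=Pres' into a dictionary."""
    if not morph:
        return {}  # tokens such as "men" have no features at all
    return dict(feature.split("=", 1) for feature in morph.split("|"))


def has_negative_polarity(morph: str) -> bool:
    """True if the feature string contains Polarity=Neg."""
    return parse_morph(morph).get("Polarity") == "Neg"


# feature strings copied from the output above
print(has_negative_polarity("Polarity=Neg"))  # inte
print(has_negative_polarity("Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act"))  # är
print(has_negative_polarity(""))  # men
```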

From this, there are two things we can change. First, instead of looking up the negation in a word list, we can check whether the token is morphologically marked as a negation, or at least that it has the right part-of-speech tag. Second, we can implement a method which checks whether a word is negated using the dependency tree.

### Morphology and Part-of-Speech for negations
Asent checks if a word is a negation using the `is_negation` token extension. We can see this using:

In [14]:
for t in doc:
    print(t, "\t", t._.is_negation)

Jag 	 False
är 	 False
inte 	 True
glad 	 False
men 	 False
jag 	 False
skulle 	 False
inte 	 True
säga 	 False
att 	 False
jag 	 False
är 	 False
ledsen 	 False
. 	 False


We will now simply overwrite the extension with one that uses the morph tag. First, we create a function which, applied to a token, returns whether it is a negation. Second, we overwrite the extension using the `Token.set_extension` method.

In [15]:
from spacy.tokens import Token


def is_negation(token: Token) -> bool:
    """Checks if a token is a negation

    Args:
        token (Token): A spaCy token

    Returns:
        bool: a boolean indicating whether the token is a negation
    """
    m_dict = token.morph.to_dict()
    # a negation has the Polarity feature and its value is "Neg"
    return m_dict.get("Polarity") == "Neg"


Token.set_extension("is_negation", getter=is_negation, force=True)

Now our negations use the morph tag. In this case it provides the same results, so the outcome isn't that interesting. What we really want is the second part:


### Using the dependency tree for negations
In the following, we will overwrite the `is_negated` extension used by Asent to check if a word is negated. We can start by examining it:

In [16]:
for t in doc:
    print(t, "\t", t._.is_negated)

Jag 	 None
är 	 None
inte 	 None
glad 	 inte
men 	 inte
jag 	 inte
skulle 	 None
inte 	 None
säga 	 inte
att 	 inte
jag 	 inte
är 	 None
ledsen 	 None
. 	 None


Notably, we see that *ledsen* is not negated, although it should be. We also clearly see that the three words following each negation are negated, as expected from the heuristic rule.

In [21]:
from typing import Optional


def is_negated(token: Token) -> Optional[Token]:
    """Checks if a token is negated

    Args:
        token (Token): A spaCy token

    Returns:
        Optional[Token]: return the negation if the token is negated
    """
    # only check if a word is negated if it is rated (it is not meaningful to do otherwise)
    if token._.valence:
        for c in token.children:
            # if the token is modified by a negation
            if c.dep_ == "advmod" and c._.is_negation:
                return c
        # or if its head is negated:
        for c in token.head.children:
            if c.dep_ == "advmod" and c._.is_negation:
                return c


Token.set_extension("is_negated", getter=is_negated, force=True)
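
The same rule can be tried out without loading a model by writing a small tree by hand. The `Node` class and the tree below are assumptions made for illustration: the attachments are a plausible Universal Dependencies reading of the relevant part of the example sentence, not the output of the Swedish pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A toy stand-in for a spaCy token (hypothetical, illustration only)."""

    text: str
    dep: str
    valence: float = 0.0
    is_negation: bool = False
    head: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def attach(self, child: "Node") -> "Node":
        """Add `child` below this node and return it."""
        child.head = self
        self.children.append(child)
        return child


def find_negation(node: Node) -> Optional[Node]:
    """Return the negation modifying the node or its head, if any."""
    if not node.valence:  # only rated words are checked
        return None
    candidates = list(node.children)
    if node.head is not None:
        candidates += node.head.children
    for child in candidates:
        if child.dep == "advmod" and child.is_negation:
            return child
    return None


# assumed parse: "glad" is the root, "säga" is conjoined to it,
# and "ledsen" is the complement of "säga"
glad = Node("glad", "ROOT", valence=3.1)
first_inte = glad.attach(Node("inte", "advmod", is_negation=True))
saga = glad.attach(Node("säga", "conj"))
second_inte = saga.attach(Node("inte", "advmod", is_negation=True))
ledsen = saga.attach(Node("ledsen", "ccomp", valence=-2.1))

for node in (glad, saga, ledsen):
    negation = find_negation(node)
    print(node.text, "\t", negation.text if negation else None)
```

In this sketch *ledsen* is reached through its head *säga*, which is exactly the case the three-word window missed, while the unrated *säga* is skipped.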

In [22]:
for t in doc:
    print(t, "\t", t._.valence, t._.is_negation, "-", t._.is_negated)

Jag 	 0.0 False - None
är 	 0.0 False - None
inte 	 0.0 True - None
glad 	 3.1 False - inte
men 	 0.0 False - None
jag 	 0.0 False - None
skulle 	 0.0 False - None
inte 	 0.0 True - None
säga 	 0.0 False - None
att 	 0.0 False - None
jag 	 0.0 False - None
är 	 0.0 False - None
ledsen 	 -2.1 False - inte
. 	 0.0 False - None


In [23]:
asent.visualize(doc)
doc[-2]._.polarity

TokenPolarityOutput(polarity=1.554, token=ledsen, span=inte säga att jag är ledsen)
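
The polarity of *ledsen* is consistent with multiplying its valence by the negation constant of -0.74 used in VADER (Hutto and Gilbert, 2014). That Asent applies this same constant is an assumption here, but the arithmetic can be checked directly:

```python
# VADER's negation constant; assumed (not confirmed here) to be what Asent applies
NEGATION_SCALAR = -0.74

ledsen_valence = -2.1  # valence of "ledsen" from the output above
negated_polarity = round(ledsen_valence * NEGATION_SCALAR, 3)

print(negated_polarity)
```

A negation therefore flips the sign of the rating and dampens it slightly, rather than simply inverting it.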

## Exercise
You will notice that we did not supply any contrastive conjunctions for Swedish, but that the part-of-speech tags do include a tag for conjunctions (CCONJ). Overwrite the `is_contrastive_conj` extension to include contrastive conjunctions.