<a href="https://colab.research.google.com/github/HoseinBahmany/learning-llms/blob/main/spacy/getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_md

In [11]:
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("It's been a crazy week!!!")
print([token.text for token in doc])

['It', "'s", 'been', 'a', 'crazy', 'week', '!', '!', '!']


## Customizing the tokenizer

When we work with a specific domain such as medicine, insurance, or finance, we often come across words, abbreviations, and entities that need special attention. Most domains have characteristic words and phrases that call for custom tokenization rules. Here's how to add a special-case rule to an existing Tokenizer instance:



In [21]:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_md")
doc = nlp("lemme that")
print([d.text for d in doc])  # default tokenization keeps "lemme" as one token

# Split "lemme" into two tokens; the ORTH values must concatenate back to the original string.
special_case = [{ORTH: "lem"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("lemme", special_case)
doc = nlp("lemme that")
print([d.text for d in doc])

# A special case that includes punctuation takes precedence over punctuation splitting:
nlp.tokenizer.add_special_case("...lemme...?", [{ORTH: "...lemme...?"}])
print([w.text for w in nlp("...lemme...?")])

['lemme', 'that']
['lem', 'me', 'that']
['...lemme...?']
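
To check which rule produced each token, the tokenizer's `explain` method can help. The following is a minimal sketch, assuming spaCy v3; it uses `spacy.blank("en")` (a tokenizer-only English pipeline, chosen here so that no model download is needed) in place of `en_core_web_md`.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline stands in for en_core_web_md,
# since only the tokenizer is needed for this check.
nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])

# explain() returns (rule_label, token_text) pairs describing
# which tokenizer rule produced each token.
explanation = nlp.tokenizer.explain("lemme that")
for rule, text in explanation:
    print(f"{text!r} <- {rule}")

# The explained tokens should match what the tokenizer actually produces.
explained_tokens = [text for _, text in explanation]
actual_tokens = [t.text for t in nlp("lemme that")]
```

Comparing `explained_tokens` with `actual_tokens` is a quick way to confirm that a new special case behaves as intended before relying on it in a larger pipeline.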


## Sentence segmentation

We saw that breaking a sentence into its tokens is not a straightforward task at all. How about breaking a text into sentences? Marking where a sentence starts and ends is somewhat harder still, for the same reasons: punctuation, abbreviations, and so on.

A Doc object's sentences are available via the doc.sents property:

In [22]:
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I flied to N.Y yesterday. It was around 5 pm.")
print([sent.text for sent in doc.sents])

['I flied to N.Y yesterday.', 'It was around 5 pm.']


Determining sentence boundaries is a more complicated task than tokenization. As a result, spaCy's trained pipelines rely on the dependency parser to perform sentence segmentation by default.
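
When a full parse is not needed, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The sketch below assumes spaCy v3 and uses a blank English pipeline (an assumption made here so that no trained model is required); the example sentence is made up for illustration.

```python
import spacy

# Assumption: spaCy v3 API. A blank pipeline has no parser, so the built-in
# rule-based "sentencizer" component supplies the sentence boundaries instead.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Book me a flight. I leave on Monday!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The rule-based splitter is lighter than the parser, but because it only looks at punctuation it can be less reliable on text with abbreviations or unusual punctuation.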

## Understanding lemmatization

A lemma is the base form of a token. You can think of a lemma as the form in which the token appears in a dictionary. For instance, eating, eats, and ate all share the lemma eat. Lemmatization is the process of reducing word forms to their lemmas. The following code is a quick example of how to do lemmatization with spaCy:

In [23]:
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I went there for working and worked for 3 years.")
for token in doc:
  print(token.text, token.lemma_)

I I
went go
there there
for for
working work
and and
worked work
for for
3 3
years year
. .


## Lemmatization in NLU

Suppose you are designing an NLP pipeline for a ticket booking system. Your application processes a customer's sentence, extracts the necessary information from it, and then passes that information to the booking API.

The pipeline needs to extract the means of travel (a flight, bus, or train), the destination city, and the date. The first thing the application needs to determine is the means of travel:

```
fly – flight – airway – airplane – plane

bus

railway – train
```

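We have this list of keywords, and we want to recognize the means of travel by searching for each token in the list. The most compact way of doing this search is to look up the token's lemma rather than its surface form, so that one keyword covers every inflected variant. Consider the following customer sentences, where flights, flight, and flew all reduce to a lemma in the first keyword group: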

```
List me all flights to Atlanta.

I need a flight to NY.

I flew to Atlanta yesterday evening and forgot my baggage.
```

In [28]:
import spacy
from spacy.attrs import ORTH, NORM

nlp = spacy.load("en_core_web_md")
# special_case = [{ORTH: 'Angeltown', NORM: 'Los Angeles'}]
# nlp.tokenizer.add_special_case("Angeltown", special_case)

doc = nlp("I'm flying to Angeltown")

for token in doc:
  print(token.text, token.lemma_)

I I
'm be
flying fly
to to
Angeltown Angeltown
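
The keyword lookup described above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the names `TRAVEL_KEYWORDS` and `detect_travel_mode` are hypothetical, and the lemma lists are written by hand as stand-ins for `[token.lemma_ for token in doc]` so that the sketch runs without a model.

```python
# Hypothetical keyword table built from the keyword groups listed earlier.
TRAVEL_KEYWORDS = {
    "flight": {"fly", "flight", "airway", "airplane", "plane"},
    "bus": {"bus"},
    "train": {"railway", "train"},
}


def detect_travel_mode(lemmas):
    """Return the first travel mode whose keyword set contains one of the lemmas."""
    for lemma in lemmas:
        for mode, keywords in TRAVEL_KEYWORDS.items():
            if lemma.lower() in keywords:
                return mode
    return None  # no known means of travel mentioned


# Hand-written lemma lists standing in for [token.lemma_ for token in doc].
flying = detect_travel_mode(["I", "be", "fly", "to", "Angeltown"])
need_flight = detect_travel_mode(["I", "need", "a", "flight", "to", "NY", "."])
no_match = detect_travel_mode(["book", "a", "hotel"])
print(flying, need_flight, no_match)
```

Because the lookup works on lemmas, the table only needs one entry per base form; inflected forms such as flying or flights never have to be listed.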


## Understanding the difference between lemmatization and stemming

A lemma is the base form of a word and is always a member of the language's vocabulary. The stem does not have to be a valid word at all. For instance, the lemma of improvement is improvement, but the stem is improv. You can think of the stem as the smallest part of the word that carries the meaning.

Both stemming and lemmatization have their own advantages. Stemming works well if you apply only statistical algorithms to the text, without further semantic processing such as pattern lookup, entity extraction, or coreference resolution. Stemming can also trim a large corpus to a more moderate size and give you a compact representation. If your pipeline also uses linguistic features or performs keyword search, include lemmatization. Lemmatization algorithms are more accurate but come at a higher computational cost.
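
The contrast between a stem and a lemma can be sketched with a toy example. The suffix stripper below is a deliberately naive stand-in for a real stemming algorithm, and the lemma table is a tiny hand-written dictionary taken from the words used earlier in this notebook; both `toy_stem` and `TOY_LEMMAS` are hypothetical names introduced for illustration.

```python
# Toy suffix list: an assumption for illustration, not a real stemming algorithm.
SUFFIXES = ("ement", "ing", "ed", "s")

# Tiny hand-written lemma dictionary (a real lemmatizer uses rules and larger tables).
TOY_LEMMAS = {
    "went": "go",
    "working": "work",
    "worked": "work",
    "years": "year",
    "improvement": "improvement",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("improvement", "working", "went", "years"):
    print(word, "| stem:", toy_stem(word), "| lemma:", TOY_LEMMAS[word])
```

The stemmer turns improvement into improv, which is not a dictionary word, and leaves the irregular form went untouched, while the lemma lookup maps went to go. That difference is why lemmas are the better fit for keyword search.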