# An Introduction to Natural Language in Python using spaCy

## Introduction

This tutorial provides a brief introduction to working with natural language (sometimes called "text analytics") in Python, using [spaCy](https://spacy.io/) and related libraries.
Data science teams in industry must work with lots of text, one of the top four categories of data used in machine learning.
Usually that's human-generated text, although not always.

Think about it: how does the "operating system" for business work? Typically, there are contracts (sales contracts, work agreements, partnerships), there are invoices, there are insurance policies, there are regulations and other laws, and so on.
All of those are represented as text.

You may run across a few acronyms: _natural language processing_ (NLP), _natural language understanding_ (NLU), _natural language generation_ (NLG) — which are roughly speaking "read text", "understand meaning", "write text" respectively.
Increasingly these tasks overlap and it becomes difficult to categorize any given feature.

The _spaCy_ framework — along with a wide and growing range of plug-ins and other integrations — provides features for a wide range of natural language tasks.
It's become one of the most widely used natural language libraries in Python for industry use cases, and has quite a large community — and with that, much support for commercialization of research advances as this area continues to evolve rapidly.

## Getting Started

Check out the excellent _spaCy_ [installation notes](https://spacy.io/usage) for a "configurator" which generates installation commands based on which platforms and natural languages you need to support.

Some people tend to use `pip` while others use `conda`, and there are instructions for both. To find the right model for your language, see https://spacy.io/usage/models. For example, to get started with _spaCy_ working with text in English, installed via `conda` on a Linux system:

```
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```

BTW, the second line above is a download for language resources (models, etc.) and the `_sm` at the end of the download's name indicates a "small" model. There's also "medium" and "large", albeit those are quite large. Some of the more advanced features depend on the latter, although we won't quite be diving to the bottom of that ocean in this (brief) tutorial.

Now let's load _spaCy_ and run some code:

In [None]:
!pip install spacy --upgrade

In [None]:
!python -m spacy download pl_core_news_md

In [None]:
import spacy

nlp = spacy.load("pl_core_news_md")

That `nlp` variable is now your gateway to all things _spaCy_ and loaded with `pl_core_news_md`, the medium model for Polish.
Next, let's run a small "document" through the natural language parser:

In [None]:
text = "Litwo, Ojczyzno moja, Ty jesteś jak zdrowie, ile Cię trzeba cenić ten tylko się dowie, kto Cię stracił."
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

First we created a [doc](https://spacy.io/api/doc) from the text, which is a container for a document and all of its annotations. Then we iterated through the document to see what _spaCy_ had parsed.

Good, but it's a lot of info and a bit difficult to read. Let's reformat the _spaCy_ parse of that sentence as a [pandas](https://pandas.pydata.org/) dataframe:

In [None]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
    
df

Much more readable!
In this simple case, the entire document is merely one short sentence.
For each word in that sentence _spaCy_ has created a [token](https://spacy.io/api/token), and we accessed fields in each token to show:

 - raw text
 - [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) – a root form of the word
 - [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
 - a flag for whether the word is a _stopword_ – i.e., a common word that may be filtered out
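As a rough illustration of how such flags can be used, the sketch below drops stopwords and punctuation from a sentence. It assumes spaCy v3 and uses `spacy.blank("pl")`, a tokenizer-only pipeline that needs no model download. The names `nlp_blank`, `sample_doc`, and `content_words` are illustrative, and lemmas or part-of-speech tags would still require a trained model.

```python
import spacy

# Tokenizer-only Polish pipeline: lexical flags such as is_stop and is_punct
# are available without downloading a trained model.
nlp_blank = spacy.blank("pl")
sample_doc = nlp_blank("Litwo, Ojczyzno moja, Ty jesteś jak zdrowie.")

# Keep tokens that are neither stopwords nor punctuation.
content_words = [t.text for t in sample_doc if not t.is_stop and not t.is_punct]
print(content_words)
```

The same list comprehension works unchanged on a `doc` produced by the full `nlp` pipeline.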

Next let's use the [displaCy](https://ines.io/blog/developing-displacy) library to visualize the parse tree for that sentence:

In [None]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

Does that bring back memories of grade school? Frankly, for those of us coming from more of a computational linguistics background, that diagram sparks joy.

But let's back up for a moment. How do you handle multiple sentences?

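There are features for _sentence boundary detection_ (SBD), also known as _sentence segmentation_. In a trained pipeline like the one loaded here, the boundaries typically come from the dependency parser, and a rule-based [sentencizer](https://spacy.io/api/sentencizer) is available as a lighter alternative: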

In [None]:
text = "W mieścinie pewnej, prowincji Manchy, której nazwiska nie powiem, żył niedawnymi czasy hidalgo pewien, z liczby tych, co to prócz spisy u siodła, szabliska starego, szkapy chudziny i paru gończych, niewiele co więcej mają. Rosolina powszednia, z baraniny częściej niż z wołowiny wygotowana na obiad, bigosik z resztek obiadu prawie co wieczór na kolację, co piątek soczewica, co sobota jaja sadzone po hiszpańsku, a na niedzielę gołąbeczek jakiś w dodatku do codziennej strawy, zjadały mu corocznie trzy czwarte części całego dochodu. Reszta szła na przyodziewek: na opończę z sukna cienkiego, hajdawery aksamitne z takimiż łapciami i na świtkę z krajowego samodziału dobornego, którą się w powszednie dni tygodnia obchodził."

doc = nlp(text)

for sent in doc.sents:
    print(">", sent)
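For comparison, a rule-based sentencizer can be added to a pipeline that has no trained components at all. This is a minimal sketch assuming spaCy v3, where `add_pipe` accepts the component name as a string. The names `nlp_rules`, `rule_doc`, and `sentences` are illustrative.

```python
import spacy

# Blank Polish pipeline plus the rule-based sentencizer, which splits on
# sentence-final punctuation and needs no trained model.
nlp_rules = spacy.blank("pl")
nlp_rules.add_pipe("sentencizer")

rule_doc = nlp_rules("Litwo, Ojczyzno moja! Ty jesteś jak zdrowie.")
sentences = [sent.text for sent in rule_doc.sents]

for s in sentences:
    print(">", s)
```

Punctuation rules are fast, but they can mis-split text with abbreviations or unusual punctuation, which is one reason to prefer a trained pipeline for real documents.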

When _spaCy_ creates a document, it uses a principle of _non-destructive tokenization_ meaning that the tokens, sentences, etc., are simply indexes into a long array. In other words, they don't carve the text stream into little pieces. So each sentence is a [span](https://spacy.io/api/span) with a _start_ and an _end_ index into the document array:

In [None]:
for sent in doc.sents:
    print(">", sent.start, sent.end)
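The non-destructive tokenization described above can be checked directly: every token keeps its character offset into the original string, and joining the tokens with their trailing whitespace gives back the input. The sketch below assumes spaCy v3 and a tokenizer-only pipeline; `nlp_blank`, `sample`, and `rebuilt` are illustrative names.

```python
import spacy

# Tokenizer-only Polish pipeline; no trained model is required for this check.
nlp_blank = spacy.blank("pl")

sample = "Litwo, Ojczyzno moja, Ty jesteś jak zdrowie."
sample_doc = nlp_blank(sample)

# Each token remembers its character offset (idx) into the original string.
for tok in sample_doc:
    assert sample[tok.idx : tok.idx + len(tok.text)] == tok.text

# Joining tokens with their trailing whitespace reproduces the input exactly.
rebuilt = "".join(tok.text_with_ws for tok in sample_doc)
print(rebuilt == sample)

# A span is a (start, end) window of token indexes into the same document.
span = sample_doc[0:2]
print(span.start, span.end, span.text)
```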

We can index into the document array to pull out the tokens for one sentence:

In [None]:
doc[97:129]

Or simply index into a specific token, such as the third token, `pewnej`, in the first sentence:

In [None]:
token = doc[2]
print(token.text, token.lemma_, token.pos_)

At this point we can parse a document, segment that document into sentences, then look at annotations about the tokens in each sentence. That's a good start.

## Acquiring Text

Now that we can parse texts, where do we get texts?
One quick source is to leverage the interwebs.
Of course when we download web pages we'll get HTML, and then need to extract text from them.
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a popular package for that.

First, a little housekeeping:

In [None]:
import sys
import warnings

warnings.filterwarnings("ignore")
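The extraction step mentioned above can be sketched as follows. This is a minimal example and not a full scraper: the HTML is a hypothetical inline string standing in for a downloaded page, `extract_paragraphs` is an illustrative helper name, and the `bs4` package is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page.
html = """
<html><head><title>Example</title><style>p {color: red}</style></head>
<body><h1>Pan Tadeusz</h1><p>Litwo, Ojczyzno moja!</p>
<script>console.log("ignored")</script>
<p>Ty jesteś jak zdrowie.</p></body></html>
"""

def extract_paragraphs(html_text):
    """Return the visible text of every <p> element, in document order."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

paragraphs = extract_paragraphs(html)
page_text = "\n".join(paragraphs)
print(page_text)
```

The resulting `page_text` string can be passed straight to `nlp(...)`. Restricting the extraction to `<p>` elements is a design choice that skips scripts, styles, and navigation; other pages may need different selectors.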

## Natural Language Understanding

Now let's dive into some of the _spaCy_ features for NLU.
Given that we have a parse of a document, from a purely grammatical standpoint we can pull the [noun chunks](https://spacy.io/usage/linguistic-features#noun-chunks), i.e., each of the noun phrases:

In [None]:
text = "Józef Piłsudski herbu Piłsudski urodził się 5 grudnia 1867 w Zułowie na Wileńszczyźnie, w rodzinie o tradycjach patriotycznych."
doc = nlp(text)
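Noun chunks depend on a language-specific syntax iterator and on a dependency parse, so they may not be available for every language or pipeline. A defensive sketch, with `safe_noun_chunks` as a hypothetical helper name and a tokenizer-only pipeline used purely for demonstration:

```python
import spacy

def safe_noun_chunks(doc):
    """Return noun-chunk texts, or an empty list when they are unavailable."""
    try:
        return [chunk.text for chunk in doc.noun_chunks]
    except (NotImplementedError, ValueError):
        # NotImplementedError: the language defines no noun-chunk iterator.
        # ValueError: the doc has no dependency parse to derive chunks from.
        return []

# Demonstration on a tokenizer-only pipeline, which has no parser.
unparsed = spacy.blank("pl")("Józef Piłsudski urodził się w Zułowie.")
print(safe_noun_chunks(unparsed))
```

On the parsed document from the cell above, the call would be `safe_noun_chunks(doc)`.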


Not bad. The noun phrases in a sentence generally carry more of its information content, so they make a simple filter for reducing a long document to a more "distilled" representation.

We can take this approach further and identify [named entities](https://spacy.io/usage/linguistic-features#named-entities) within the text, i.e., the proper nouns:

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_, ent.lemma_)

The _displaCy_ library provides an excellent way to visualize named entities:

In [None]:
displacy.render(doc, style="ent", jupyter=True)
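Once entities are available, counting them is a short step. The sketch below tallies entity texts for one label. To stay runnable without a trained model it sets the entities by hand on a toy document, which is an assumption made for illustration, as are the label names `persName` and `placeName` and the helper name `count_entities`.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

def count_entities(doc, label):
    """Count how often each entity text with the given label occurs in doc."""
    return Counter(ent.text for ent in doc.ents if ent.label_ == label)

# Toy document with hand-set entities (an assumption for illustration; a
# trained pipeline would fill doc.ents itself).
toy_doc = spacy.blank("pl")(
    "Józef Piłsudski urodził się w Zułowie. Piłsudski mieszkał w Wilnie."
)
toy_doc.ents = [
    Span(toy_doc, 0, 2, label="persName"),     # Józef Piłsudski
    Span(toy_doc, 5, 6, label="placeName"),    # Zułowie
    Span(toy_doc, 7, 8, label="persName"),     # Piłsudski
    Span(toy_doc, 10, 11, label="placeName"),  # Wilnie
]

for name, freq in count_entities(toy_doc, "persName").most_common():
    print(name, freq)
```

Counting surface text treats "Józef Piłsudski" and "Piłsudski" as different people; grouping on `ent.lemma_` from a trained pipeline is one way to merge inflected forms.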

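## Vectors

The medium model ships with word vectors, so documents and tokens can be compared with `similarity`: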

In [None]:
import spacy

doc1 = nlp("Jedziemy na święta do domu.")
doc2 = nlp("Wracamy z wakacji nad morzem.")
doc3 = nlp("Uczę się na egzamin.")

print(doc1, "<->", doc2, doc1.similarity(doc2))
print(doc1, "<->", doc3, doc1.similarity(doc3))

dom = doc1[4]      # "domu" (home), from the first sentence
wakacje = doc2[2]  # "wakacji" (holidays), from the second sentence
print(dom, "<->", wakacje, dom.similarity(wakacje))
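By default, spaCy's `similarity` is a cosine similarity between vectors, and a document vector defaults to the average of its token vectors. A minimal plain-Python sketch of the measure, using made-up two-dimensional vectors, not real embeddings (`cosine_similarity` here is an illustrative helper, not a spaCy function):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Because only the angle matters, vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.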

### Exercises

1. Using the Wikipedia API (from the first class), download an article on any topic and use spaCy to extract all the people who appear in it (using lemmas). Do the same for English (the lemmatizer for NER works better in English). Print all the people in order, together with how many times each one occurs.

2. In the Twitter exercise, modify the `create_word_cloud` function so that it outputs only named entities.

3. Pick any short book from Wolne Lektury. Using dependency parsing, list all the adjectives that describe a given character, either directly (`dep`) or as a modifier. Do the same for English; see the example below. Look at the structure of the dependency tree and propose improvements so that the descriptions we are interested in are found more reliably.

In [None]:
!wget https://wolnelektury.pl/media/book/txt/kamizelka.txt

In [None]:
import re

In [None]:
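with open('kamizelka.txt', 'r', encoding='utf-8') as file:
    # Replace newlines with spaces so words on adjacent lines are not glued together.
    data = file.read().replace('\n', ' ')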

In [None]:
doc = nlp(data)

In [None]:
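okreslenia = []  # modifiers ("określenia") found for the vest
for sent in doc.sents:
    # Only inspect sentences that mention the vest ("kamizelka") in some inflected form.
    if re.search("kamizelk", sent.text):
        for token in sent:
            print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
            # Keep adjectival (amod) and nominal (nmod) modifiers whose head is the vest.
            if re.search("kamizelk", token.head.text) and token.dep_ in ("amod", "nmod"):
                okreslenia.append(token.lemma_)
        print("\n\n\n")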

In [None]:
print(okreslenia)
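The loop above can also be packaged as a small function and tried on a hand-built parse, which makes it easier to experiment with different dependency labels. This is a sketch under assumptions: it requires spaCy v3 (where `Doc` accepts `heads` and `deps`), `modifiers_of` is a hypothetical helper name, and the toy annotations are written by hand and do not come from the model.

```python
import re

import spacy
from spacy.tokens import Doc

def modifiers_of(doc, head_pattern, deps=("amod", "nmod")):
    """Collect lemmas of tokens that modify a head whose text matches head_pattern."""
    found = []
    for token in doc:
        if token.dep_ in deps and re.search(head_pattern, token.head.text):
            found.append(token.lemma_)
    return found

# Hand-annotated toy parse (an assumption for illustration, not model output):
# "Stara" is an adjectival modifier (amod) of "kamizelka".
vocab = spacy.blank("pl").vocab
toy = Doc(
    vocab,
    words=["Stara", "kamizelka", "leży", "."],
    lemmas=["stary", "kamizelka", "leżeć", "."],
    heads=[1, 2, 2, 2],  # absolute index of each token's head
    deps=["amod", "nsubj", "ROOT", "punct"],
)
print(modifiers_of(toy, "kamizelk"))
```

On the parsed book, `modifiers_of(doc, "kamizelk")` would collect the same lemmas as the loop, and passing a different `deps` tuple is one way to explore which relations capture the descriptions of interest.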