# NLP - Basics

## What is NLP?

Natural Language Processing (NLP) is a field of computer science that focuses on the interaction between human language and computers. NLP enables computers to understand, interpret, and generate human language. It has many applications in journalism, from performing complex analysis of written documents to generating parts of articles. In this notebook, we will cover the core concepts of NLP before moving on to some advanced applications.

* **Setup and Packages**: There are several key NLP packages that provide a lot of functionality and are important to know. We will provide a basic overview of each, and will then use them in the following sections.
* **Tokenization**: This critical step involves taking the original text and breaking it up into pieces, such as words, which many algorithms require as input.
* **Part-of-Speech (POS) Tagging**: Grammar is a core part of every language and identifying the verbs, subjects, and objects of a sentence is a foundational task for understanding language.
* **Dependency Parsing**: Dependency parsing is similar to POS tagging, but instead of just finding what part of speech each word is, it finds how the words are connected. We can use this to find, for instance, which part of the sentence the verb acts on: in the sentence "The boy kicks the ball," the direct object of "kicks" is "ball".
* **Named Entity Recognition (NER)**: Knowing who was mentioned in a text can be a very useful analysis, and we will cover how to extract names of people and businesses.

In the next section, we will cover representing words as "vectors", sentiment analysis, meaning similarity, and text generation.

### Gensim

We will use this library in the next notebook. It will allow us to perform some complex analysis, such as finding how similar two words are.

In [None]:
import gensim

# Words as Data

In the previous lesson we learned the core ideas that enable NLP, and now we will go into some more advanced use cases.



In [None]:
import hazm

## Words as Vectors

Computers do not think in concepts or words but rather in mathematical operations, so to do advanced NLP tasks we need to be able to convert the problem into a mathematical one. One key idea is representing words not as text, but rather as numbers. This allows us to apply all kinds of mathematical operations to them.

More specifically, we represent words as "vectors." A vector is a series of numbers; for example, `(1, 2, 3, 4)` is a 4-dimensional vector, meaning it has 4 numbers.

We can represent words by using a vector with as many dimensions as there are words, so "the" might be `(1, 0, 0, 0, ...)`, "and" might be `(0, 1, 0, 0, ...)`, and so on. This scheme is known as one-hot encoding. Models such as GPT start from a similar idea, giving each entry in their vocabulary its own index.
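
The one-hot scheme described above can be sketched in a few lines of plain Python. This is only an illustration: the four-word `vocabulary` and the `one_hot` helper are assumptions made for this example and do not come from any library.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocabulary = ["the", "and", "boy", "ball"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's position and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

the_vector = one_hot("the", vocabulary)
and_vector = one_hot("and", vocabulary)
print(the_vector)  # [1, 0, 0, 0]
print(and_vector)  # [0, 1, 0, 0]
```

Note that every pair of different words is equally far apart in this representation, so it says nothing about which words have similar meanings.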

Another way is to use only a few dimensions and place each word close to the words that are similar to it. For example, "chocolate" and "sugar" could be represented as `(0.9, 1.0)` and `(1.0, 0.9)` respectively, while "sour" could be represented by `(-1.0, -1.0)`.
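
One common way to measure how close two such vectors are is cosine similarity, which is near 1 for vectors pointing in the same direction and near -1 for vectors pointing in opposite directions. The sketch below applies it to the example vectors above; `cosine_similarity` is a helper written for this illustration, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Example vectors taken from the text above.
chocolate = (0.9, 1.0)
sugar = (1.0, 0.9)
sour = (-1.0, -1.0)

sweet_pair = cosine_similarity(chocolate, sugar)
opposite_pair = cosine_similarity(chocolate, sour)
print(round(sweet_pair, 3))     # 0.994
print(round(opposite_pair, 3))  # -0.999
```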

Assigning these numbers by hand would be impossible because of how many words there are, so the model must be "trained": it learns from the examples given to it which words are similar to each other and which numbers best represent each word.

Typically, these models are trained on very large datasets, but because that takes a long time, we will use just a few sentences for now.

In [None]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

To train a Word2Vec model, one must first convert the input into a series of words. From the previous notebook, we know how to do that.
We will load the text and then use `hazm` to convert it into a list of words.

In [None]:
# TODO: load the real text; for now, tokenize a short placeholder string.
examples = hazm.word_tokenize("Hello world. hello world. hello world.")
print(examples[:10])

['Hello', 'world', '.', 'hello', 'world', '.', 'hello', 'world', '.']
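
Gensim's Word2Vec expects its training data as a list of sentences, where each sentence is itself a list of tokens, rather than one flat list. The following is a minimal sketch of regrouping a flat token list, assuming a period marks the end of each sentence. `split_into_sentences` is a hypothetical helper written for this example and is not part of `hazm` or `gensim`.

```python
def split_into_sentences(tokens, sentence_end="."):
    """Group a flat token list into sentences, dropping the end marker."""
    sentences, current = [], []
    for token in tokens:
        if token == sentence_end:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # keep a trailing sentence that lacks a final period
        sentences.append(current)
    return sentences

tokens = ['Hello', 'world', '.', 'hello', 'world', '.', 'hello', 'world', '.']
sentences = split_into_sentences(tokens)
print(sentences)
# [['Hello', 'world'], ['hello', 'world'], ['hello', 'world']]
```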


Now that we have our data loaded and processed, we can train a simple Word2Vec model.

In [None]:
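# Word2Vec expects a list of sentences (each a list of tokens), so wrap the flat token list.
model = Word2Vec([examples], min_count=1, size=10, workers=3, window=3, sg=1)

Above, we create a Word2Vec model from our `examples` data, wrapped in a list because Word2Vec expects a list of sentences, each of which is a list of words. `min_count` excludes uncommon words that occur fewer times than that value (with `min_count=1`, no words are excluded). `size` sets how many dimensions our word vectors should have (Gensim 4.0 and later call this parameter `vector_size`). `workers` sets the number of worker threads used for training. `window` sets how far apart two words can be and still be considered related. Finally, `sg` selects the training algorithm; `sg=1` selects skip-gram.

In [None]:
# The trained word vectors live on `model.wv`, so query the similarity there.
model.wv.similarity('hello', 'world')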
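
To make the `window` parameter more concrete, the sketch below lists the (center word, context word) pairs that a skip-gram style model would learn from for a given window. It is a simplified illustration in plain Python, not Gensim's actual training code, which includes further details omitted here. `skip_gram_pairs` is a hypothetical helper.

```python
def skip_gram_pairs(tokens, window):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "boy", "kicks", "the", "ball"]
pairs = skip_gram_pairs(sentence, window=1)
print(pairs[:4])
# [('the', 'boy'), ('boy', 'the'), ('boy', 'kicks'), ('kicks', 'boy')]
```

A larger `window` produces more pairs per word and lets the model relate words that sit farther apart in the sentence.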

For BERT-based Named Entity Recognition, see `https://github.com/hooshvare/parsner`