# Common natural language processing tasks and techniques

## Tokenization
The first step in most NLP algorithms is usually to split the text into tokens, typically words. This sounds simple, but accounting for punctuation and for the word and sentence delimiters of different languages can make it tricky. You might need several methods to determine where one token ends and the next begins.
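
As a minimal sketch of the idea, the snippet below tokenizes an English sentence with a single regular expression. The `tokenize` helper and its pattern are illustrative assumptions, not a standard API, and they would not handle languages that do not separate words with spaces.

```python
import re

def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+(?:'\w+)* keeps contractions such as "didn't" together;
    # [^\w\s] captures each punctuation character as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = tokenize("The quick red fox didn't jump, it leapt!")
print(tokens)
# ['The', 'quick', 'red', 'fox', "didn't", 'jump', ',', 'it', 'leapt', '!']
```

Treating punctuation as separate tokens is one design choice among several; other tokenizers drop it entirely or split contractions into two tokens.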

## Embeddings
[Word embeddings](https://en.wikipedia.org/wiki/Word_embedding) are a way to represent your text data numerically. Embeddings are built so that words with a similar meaning, or words that are used together, cluster together.
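
To make "cluster together" concrete, here is a small sketch that compares word vectors with cosine similarity. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real embeddings are learned from large corpora and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these made-up values, "king" scores much closer to "queen" than to "apple", which is the behavior a trained embedding is expected to show for related words.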

## Parsing & Part-of-speech Tagging
Every word that has been tokenized can be tagged as a part of speech, such as a noun, verb, or adjective. The sentence "the quick red fox jumped over the lazy brown dog" might be POS tagged as fox = noun, jumped = verb.
Parsing is recognizing which words are related to each other in a sentence. For instance, "the quick red fox jumped" is an adjective-noun-verb sequence that is separate from the "lazy brown dog" sequence.
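
The tagging step can be sketched with a simple dictionary lookup. The `LEXICON` table and the `pos_tag` helper below are hypothetical and cover only the example sentence; real taggers use statistical models that also consider the surrounding words.

```python
# Hypothetical mini-lexicon covering only the example sentence.
LEXICON = {
    "the": "determiner",
    "quick": "adjective", "red": "adjective",
    "lazy": "adjective", "brown": "adjective",
    "fox": "noun", "dog": "noun",
    "jumped": "verb",
    "over": "preposition",
}

def pos_tag(sentence):
    """Pair each word with its tag, or 'unknown' if it is not in the lexicon."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in sentence.split()]

tagged = pos_tag("the quick red fox jumped over the lazy brown dog")
for word, tag in tagged:
    print(f"{word} = {tag}")
```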

## Word and Phrase Frequencies
A useful procedure when analyzing a large body of text is to build a dictionary of every word or phrase of interest and how often it appears. The phrase "the quick red fox jumped over the lazy brown dog" has a word frequency of 2 for "the". Phrase frequencies can be case insensitive or case sensitive as required.
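
A minimal sketch of such a frequency dictionary, using `collections.Counter` from the Python standard library and a case-insensitive count:

```python
from collections import Counter

sentence = "the quick red fox jumped over the lazy brown dog"

# Lowercasing first makes the count case insensitive.
word_counts = Counter(sentence.lower().split())

print(word_counts["the"])          # 2
print(word_counts.most_common(1))  # [('the', 2)]
```

Dropping the `.lower()` call would give a case-sensitive count instead, so "The" and "the" would be tallied separately.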

## N-grams
A text can be split into sequences of words of a set length: a single word (unigram), two words (bigram), three words (trigram), or any number of words (n-gram).

For instance, "the quick red fox jumped over the lazy brown dog" with an n-gram size of 2 produces the following n-grams:

- the quick
- quick red
- red fox
- fox jumped
- jumped over
- over the
- the lazy
- lazy brown
- brown dog
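
The list above can be reproduced with a short sliding-window function. The `ngrams` helper is an illustrative name, and it assumes words are separated by whitespace.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.split()
    # Slide a window of size n across the word list, one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = ngrams("the quick red fox jumped over the lazy brown dog", 2)
for gram in bigrams:
    print(gram)
```

Calling `ngrams(text, 3)` on the same sentence would return the trigrams instead.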

## Noun phrase Extraction
In most sentences, there is a noun that is the subject or object of the sentence. In English, it is often identifiable as having 'a', 'an', or 'the' preceding it. Identifying the subject or object of a sentence by 'extracting the noun phrase' is a common task in NLP when attempting to understand the meaning of a sentence.

✅ In the sentence "I cannot fix on the hour, or the spot, or the look or the words, which laid the foundation. It is too long ago. I was in the middle before I knew that I had begun.", can you identify the noun phrases?

In the sentence "the quick red fox jumped over the lazy brown dog" there are 2 noun phrases: "quick red fox" and "lazy brown dog".
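
One rough way to find these two phrases is to collect each run of adjectives that ends in a noun. The word lists and the `noun_phrases` helper below are hypothetical and cover only this sentence; a real extractor would rely on a part-of-speech tagger rather than fixed lists.

```python
# Hypothetical word lists covering only the example sentence.
ADJECTIVES = {"quick", "red", "lazy", "brown"}
NOUNS = {"fox", "dog"}

def noun_phrases(sentence):
    """Collect runs of adjectives that end in a noun."""
    phrases, current = [], []
    for word in sentence.lower().split():
        if word in ADJECTIVES:
            current.append(word)               # keep collecting modifiers
        elif word in NOUNS:
            current.append(word)               # a noun closes the phrase
            phrases.append(" ".join(current))
            current = []
        else:
            current = []                       # any other word breaks the run
    return phrases

phrases = noun_phrases("the quick red fox jumped over the lazy brown dog")
print(phrases)  # ['quick red fox', 'lazy brown dog']
```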

## Sentiment analysis
A sentence or text can be analyzed for sentiment, or how positive or negative it is. Sentiment is measured in polarity and objectivity/subjectivity. Polarity is measured from -1.0 to 1.0 (negative to positive) and subjectivity from 0.0 to 1.0 (most objective to most subjective).

## Inflection
Inflection enables you to take a word and get the singular or plural of the word.
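
As a rough sketch, pluralizing regular English nouns can be approximated with a few suffix rules. The `pluralize` helper below is an illustrative assumption and deliberately ignores irregular nouns such as "mouse" or "child".

```python
def pluralize(noun):
    """Naive English pluralization; irregular nouns are not handled."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                     # fox -> foxes
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"               # city -> cities
    return noun + "s"                          # dog -> dogs

for word in ["fox", "dog", "city", "day"]:
    print(word, "->", pluralize(word))
```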

## Lemmatization
A lemma is the root or headword for a set of words. For instance, "flew", "flies", and "flying" have a lemma of the verb "fly".
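
The simplest possible lemmatizer is a lookup table. The `LEMMAS` table below is hypothetical and holds only the example words; real lemmatizers combine large dictionaries with morphological rules.

```python
# Hypothetical lookup table holding only the example words.
LEMMAS = {"flew": "fly", "flies": "fly", "flying": "fly"}

def lemmatize(word):
    """Return the known lemma, or the lowercased word if it is not in the table."""
    return LEMMAS.get(word.lower(), word.lower())

for word in ["flew", "flies", "flying", "fly"]:
    print(word, "->", lemmatize(word))
```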

---

## Translation

A naive translation program might translate words only, ignoring the sentence structure.
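
A word-for-word translator can be sketched as a dictionary lookup. The tiny English-to-French word list below is hypothetical and exists only to show the failure mode: the output keeps English word order and ignores grammatical gender, whereas correct French would be "la voiture rouge".

```python
# Hypothetical English-to-French word list, for illustration only.
EN_TO_FR = {"the": "le", "red": "rouge", "car": "voiture"}

def translate_word_by_word(sentence):
    """Replace each word independently; unknown words are left unchanged."""
    return " ".join(EN_TO_FR.get(word, word) for word in sentence.lower().split())

naive = translate_word_by_word("the red car")
print(naive)  # le rouge voiture
```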

Another approach is to ignore the meaning of the words, and instead use machine learning to detect patterns. This can work in translation if you have lots of text (a corpus) or texts (corpora) in both the source and target languages.

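```python
from textblob import TextBlob

# Note: translate() calls an online translation service. To our knowledge it was
# deprecated and later removed in newer TextBlob releases, so this cell assumes
# an older version of the library.
blob = TextBlob(
    # Adjacent string literals are joined without the stray indentation that a
    # backslash continuation inside the string would introduce.
    "It is a truth universally acknowledged, that "
    "a single man in possession of a good fortune, must be in want of a wife!"
)

print(blob.translate(to="fr"))
```

```text
C'est une vérité universellement reconnue, qu'un homme célibataire en possession d'une bonne fortune, doit avoir besoin d'une femme !
```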


---

## Sentiment Analysis

Another area where machine learning can work very well is sentiment analysis. A non-ML approach to sentiment is to identify words and phrases which are 'positive' and 'negative'. Then, given a new piece of text, calculate the total value of the positive, negative and neutral words to identify the overall sentiment.

This approach is easily tricked, as you may have seen in the Marvin task: the sentence "Great, that was a wonderful waste of time, I'm glad we are lost on this dark road" is sarcastic and negative in sentiment, but the simple algorithm detects 'great', 'wonderful', and 'glad' as positive and 'waste', 'lost', and 'dark' as negative. The overall sentiment is swayed by these conflicting words.
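
The word-counting approach can be sketched in a few lines. The `POSITIVE` and `NEGATIVE` sets are hypothetical and contain only the words named above; the score is simply the number of positive words minus the number of negative words.

```python
import re

# Hypothetical word lists containing only the words from the example.
POSITIVE = {"great", "wonderful", "glad"}
NEGATIVE = {"waste", "lost", "dark"}

def lexicon_score(text):
    """Count positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

sarcastic = "Great, that was a wonderful waste of time, I'm glad we are lost on this dark road"
score = lexicon_score(sarcastic)
print(score)  # 0
```

The three positive and three negative words cancel out, so the sentence scores as neutral even though a human reader would call it clearly negative.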

The ML approach would be to manually gather negative and positive bodies of text, such as tweets, movie reviews, or anything where a human has given both a score and a written opinion. Then NLP techniques can be applied to the opinions and scores, so that patterns emerge (e.g., positive movie reviews tend to have the phrase 'Oscar worthy' more than negative movie reviews, or positive restaurant reviews say 'gourmet' much more than 'disgusting').
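
One way to let such patterns emerge is a naive Bayes classifier, which is swapped in here as an example technique and is not necessarily what any particular library uses. The sketch below trains on four invented reviews; the data, the `train` and `classify` helpers, and the labels are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled reviews, invented for illustration only.
training_data = [
    ("an oscar worthy performance", "positive"),
    ("a gourmet meal and lovely service", "positive"),
    ("the food was disgusting", "negative"),
    ("a boring film and a waste of time", "negative"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"positive": Counter(), "negative": Counter()}
    label_counts = Counter()
    for text, label in examples:
        word_counts[label].update(text.split())
        label_counts[label] += 1
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocabulary = set(word_counts["positive"]) | set(word_counts["negative"])
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Log prior: how common the label is in the training data.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            # Laplace (add-one) smoothing avoids log(0) for unseen words.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(training_data)
prediction = classify("a gourmet performance", word_counts, label_counts)
print(prediction)  # positive
```

With only four training sentences the result is a toy; the point is that the classifier learns which words go with which label from the data instead of relying on a hand-written word list.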

```python
from textblob import TextBlob

quote1 = """It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."""

quote2 = """Darcy, as well as Elizabeth, really loved them; and they were both ever sensible of the warmest gratitude towards the persons who, by bringing her into Derbyshire, had been the means of uniting them."""

sentiment1 = TextBlob(quote1).sentiment
sentiment2 = TextBlob(quote2).sentiment

print("Quote 1 has a sentiment of " + str(sentiment1))
print("Quote 2 has a sentiment of " + str(sentiment2))
```

```text
Quote 1 has a sentiment of Sentiment(polarity=0.20952380952380953, subjectivity=0.27142857142857146)
Quote 2 has a sentiment of Sentiment(polarity=0.7, subjectivity=0.8)
```
