# Introduction to Natural Language Processing (NLP)

---
* Rémy Frenoy
* https://www.linkedin.com/in/rfrenoy/
* https://github.com/rfrenoy
* https://github.com/rfrenoy/essec_nlp_course

---

# Disclaimer

* A lot of content in very little time
* Focus on the main ideas, with plenty of links to the best resources I know if you want to go further
* All content is available; at the end I'll show you how to run the code on your laptops

# Why is NLP an important topic?

# NLP and machine learning

## What's machine learning?

Tom Mitchell: *"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E"*

* Experience E -> new data
* Class of task T -> Classification, generation, translation, ...
* Performance P -> How good the model is at doing T

The big difference between machine learning and "classical" programming is that *we cannot explicitly program the rules* (hand-written rules would not improve with experience *E*). So how do we find the rules?

Instead of computing a result from a set of predefined rules, such as "if the flat is located in the 5th arrondissement of Paris, then its value per square meter is 13000 euros", we construct a *"vector of criteria"* (the features) and a *"vector of weights"*, and our output is the inner product of the two.
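
As a minimal sketch of that idea, the snippet below encodes one flat as a vector of criteria and computes a price as an inner product. The feature encoding and the weight values are made up for illustration; in practice the weights are learned, not written by hand.

```python
# Hypothetical criteria for one flat: [constant term, is in the 5th arrondissement, floor]
criteria = [1.0, 1.0, 3.0]

# Hypothetical weights (made up, not learned): base price, premium for the 5th, price per floor
weights = [9000.0, 3500.0, 100.0]


def predict(criteria, weights):
    """Return the inner product of the two vectors."""
    return sum(c * w for c, w in zip(criteria, weights))


price = predict(criteria, weights)
print(price)  # 9000 + 3500 + 300 = 12800.0
```

Changing the model then means changing the numbers in `weights`, not rewriting the rule.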

We will **optimize** the choice of values for our vector of weights to *minimize some cost function*, generally using [gradient descent](https://youtu.be/AeRwohPuUHQ).
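
To make the optimization step concrete, here is a minimal sketch of gradient descent on a model with a single weight. The data, the learning rate and the number of steps are made up for illustration; real problems have many weights and usually rely on a library optimizer.

```python
# Toy data generated from y = 2 * x, so the ideal weight is 2.0 (made-up example)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def cost(w):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w):
    """Derivative of the cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # arbitrary starting point
learning_rate = 0.01  # step size, chosen by hand for this toy problem
for step in range(200):
    w -= learning_rate * gradient(w)  # move against the gradient

print(round(w, 3), round(cost(w), 6))
```

After the loop, `w` ends up very close to 2.0 and the cost very close to 0.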

## An example of regression

|    |   arrondissement |   floor |   max floor in building |   price per square meter (EUR) |
|---:|-----------------:|--------:|------------------------:|-------------------------------:|
|  0 |               17 |       7 |                       7 |                          11000 |
|  1 |                5 |       1 |                       7 |                          10000 |
|  2 |                3 |       4 |                       4 |                          13000 |
|  3 |               20 |       2 |                       3 |                           9500 |
|  4 |               19 |      11 |                      20 |                           8700 |
|  5 |               17 |       1 |                       6 |                           8000 |

Our model will try to predict the price from the first three columns. We can express our loss as some measure of the difference between the predicted price and the real price. The loss function must be chosen wisely, so that gradient descent can be applied to it (in practice, it should be differentiable with respect to the weights).
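
One common choice, used here only as an illustration, is the mean squared error (MSE). The sketch below computes the MSE and its gradient on the rows of the table above, then takes a single gradient step. Treating the arrondissement as a plain number is a simplification, and the learning rate is picked by hand.

```python
# Rows copied from the table above: (arrondissement, floor, max floor) and price
features = [(17, 7, 7), (5, 1, 7), (3, 4, 4), (20, 2, 3), (19, 11, 20), (17, 1, 6)]
prices = [11000, 10000, 13000, 9500, 8700, 8000]


def predict(weights, bias, row):
    """Inner product of weights and features, plus a constant term."""
    return bias + sum(w * x for w, x in zip(weights, row))


def errors(weights, bias):
    return [predict(weights, bias, row) - p for row, p in zip(features, prices)]


def mse(weights, bias):
    errs = errors(weights, bias)
    return sum(e ** 2 for e in errs) / len(errs)


def mse_gradient(weights, bias):
    """Partial derivatives of the MSE with respect to each weight and the bias."""
    errs = errors(weights, bias)
    n = len(errs)
    grad_w = [sum(2 * e * row[j] for e, row in zip(errs, features)) / n
              for j in range(len(weights))]
    grad_b = sum(2 * e for e in errs) / n
    return grad_w, grad_b


weights, bias = [0.0, 0.0, 0.0], 0.0
learning_rate = 1e-4  # small on purpose: the features are not rescaled

loss_before = mse(weights, bias)
grad_w, grad_b = mse_gradient(weights, bias)
weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
bias -= learning_rate * grad_b
loss_after = mse(weights, bias)

print(loss_before > loss_after)  # True: one step against the gradient lowers the loss
```

Because the MSE is a smooth function of the weights, its gradient is available in closed form, which is what makes gradient descent applicable.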

## What about now?

|    |   id | text                                                                                                                                  |   is real disaster? |
|---:|-----:|:--------------------------------------------------------------------------------------------------------------------------------------|--------------------:|
|  0 |    1 | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all                                                                 |                   1 |
|  1 |    4 | Forest fire near La Ronge Sask. Canada                                                                                                |                   1 |
|  2 |    5 | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected |                   1 |
|  3 |    6 | 13,000 people receive #wildfires evacuation orders in California                                                                      |                   1 |
|  4 |    7 | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school                                               |                   1 |


# Data Preparation

NLP may be the domain where data preparation has the biggest impact on performance. Choices made during this phase can make the difference between a bad, a good and an excellent model. 

**Choices made during data preparation depend on the context studied.**

*Example: It is usually good practice to standardize the text and put everything in lower or upper case. But if you are trying to classify toxic comments, case carries signal, and this standardization step will lower the performance of your final model.*
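
A small sketch of why case can carry signal: the hypothetical feature below measures how much of a comment is written in capitals. The two sentences are made up; after lowercasing they become identical, so the feature can no longer be computed.

```python
def uppercase_ratio(text):
    """Share of alphabetic characters that are upper case (0.0 if there are none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


calm = "I disagree with you."
shouting = "I DISAGREE WITH YOU."

print(uppercase_ratio(calm), uppercase_ratio(shouting))  # 0.0625 1.0
print(calm.lower() == shouting.lower())  # True: lowercasing erases the difference
```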

## Extracting data and formatting it with regular expressions

In [None]:
import pandas as pd
fake_operation_data = pd.read_excel('fake_operation_data.xlsx')

In [None]:
print(fake_operation_data.to_markdown())

In [None]:
import re

# Optional leading quote, dash and space, then the rest of the line as the remark
COMMENT_PATTERN = re.compile(r"'?\-? ?(.*)")

def sanitize_comments(raw_df):
    """Split each comment into one row per non-empty line, keeping its timestamp."""
    new_lines = []
    for _, row in raw_df.iterrows():
        # '.' does not match newlines, so findall yields one match per line
        new_line_comments = [match for match in COMMENT_PATTERN.findall(row.comment)
                             if match != '']
        new_lines.append(pd.DataFrame({'timestamp': [row.timestamp] * len(new_line_comments),
                                       'comment': new_line_comments}))
    return pd.concat(new_lines)

In [None]:
cleaned_df = sanitize_comments(fake_operation_data)
print(cleaned_df.to_markdown())

Many tools let you quickly test your regular expressions, such as https://regex101.com/. The Python regular expression HOWTO is available at https://docs.python.org/3/howto/regex.html.
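
To see what the pattern used in `sanitize_comments` does without the Excel file, here is a small sketch on a made-up multi-line comment. The sample string is hypothetical; the actual data may be formatted differently.

```python
import re

# Same pattern as in sanitize_comments: an optional quote, an optional dash,
# an optional space, then the rest of the line as the captured group.
pattern = re.compile(r"'?\-? ?(.*)")

# Hypothetical raw cell content: several remarks packed into one multi-line string
raw_comment = "- pump restarted\n- pressure back to normal\n'- no further action"

# findall also returns empty matches at line ends, hence the filter
remarks = [match for match in pattern.findall(raw_comment) if match != ""]
print(remarks)
# ['pump restarted', 'pressure back to normal', 'no further action']
```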

## Tokenization, lemmatization, stop words, ...

There are a lot of words in a vocabulary, some carrying more meaning than others. The goal of tokenization is to split sentences into terms (tokens). Lemmatization maps the words of a same family to their common base form (the lemma), which reduces the number of dimensions (= makes the problem simpler). Stop-word removal drops the words that appear so often that they should not carry any differentiating meaning.

**Treat this with care!**
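
Here is a toy, standard-library-only sketch of why care is needed. The stop-word list and the lemma dictionary are tiny and hand-written for illustration (real libraries ship much larger ones), and the two sentences are made up.

```python
import re

# Tiny hand-written resources, for illustration only
STOP_WORDS = {"these", "were", "not", "the", "a"}
LEMMAS = {"movies": "movie", "were": "be"}


def preprocess(sentence):
    """Lowercase, split into word tokens, drop stop words, map tokens to lemmas."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


positive = preprocess("These movies were good")
negative = preprocess("These movies were not good")

print(positive)  # ['movie', 'good']
print(negative)  # ['movie', 'good']
```

Both sentences end up with the same tokens: dropping "not" as a stop word erased the negation, which would be a serious loss for a sentiment model.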

In [None]:
import spacy
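# spaCy 3+ dropped the 'en' shortcut; the small English model must be installed first
# (python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')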

In [None]:
text = cleaned_df.comment.iloc[3]
print(text)
print('---')
doc = nlp(text)
for token in doc:
    print(token)
print()

In [None]:
text = cleaned_df.comment.iloc[3]
print("{:>10s}\t{:>10s}\t{:>10s}".format('Token', 'Lemma', 'Stopword'))
print("-"*50)
doc = nlp(text)
for token in doc:
    print("{:>10s}\t{:>10s}\t{:>10s}".format(str(token), str(token.lemma_), str(token.is_stop)))
print()

## Emojis

Emojis are a good example of how language keeps evolving: their use is now very common. So common that taking their meaning into account can have a large impact on your model! (Again, it depends on your context... If you work for Doctrine, you should not see emojis that often ;-) )

In [None]:
import requests
import re

r = requests.get('https://unicode.org/Public/emoji/12.0/emoji-test.txt')
lines = r.text.split('\n')
filtered_lines = [line for line in lines if len(line) > 0 and line[0] != '#']

In [None]:
filtered_lines[:5]

In [None]:
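# Capture the whole emoji sequence (group 1) and its name (group 2).
# The quantifier must sit inside the group, otherwise only the last code point is kept.
p = re.compile(r".*# (\S+) (.*)")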
code, meaning = zip(*[p.findall(line)[0] for line in filtered_lines])

In [None]:
my_emoji_translator = dict(zip(code, meaning))

In [None]:
my_emoji_translator
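
The parsing steps above can be tried offline on a couple of sample lines. The lines below are modeled on the layout of `emoji-test.txt` (assumed to be "code points ; status # emoji name") and are not downloaded from unicode.org.

```python
import re

# Sample lines modeled on the emoji-test.txt layout (assumption, see above)
sample_lines = [
    "1F600 ; fully-qualified # \U0001F600 grinning face",
    "1F44D 1F3FD ; fully-qualified # \U0001F44D\U0001F3FD thumbs up: medium skin tone",
]

# The quantifier sits inside the group, so a multi-code-point emoji is captured whole
pattern = re.compile(r".*# (\S+) (.*)")

pairs = [pattern.findall(line)[0] for line in sample_lines]
translator = dict(pairs)  # emoji -> English name

print(translator["\U0001F600"])            # grinning face
print(translator["\U0001F44D\U0001F3FD"])  # thumbs up: medium skin tone
```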

In [None]:
text = cleaned_df.comment.iloc[3]
print("{:>10s}\t{:>10s}\t{:>10s}".format('Token', 'Lemma', 'Stopword'))
print("-"*50)
doc = nlp(text)
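for token in doc:
    # Only emoji tokens are handled here: each one is replaced by its English name
    if str(token) in my_emoji_translator:
        emoji_translation = my_emoji_translator[str(token)]
        # Run the name through the pipeline so its words get lemmas and stop-word flags too
        doc_for_emoji = nlp(emoji_translation)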
        for t in doc_for_emoji:
            print("{:>10s}\t{:>10s}\t{:>10s}".format(str(t), str(t.lemma_), str(t.is_stop)))
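
As an offline alternative to downloading `emoji-test.txt`, Python's standard `unicodedata` module can name a single code point. This is a different technique from the translation table built above: it only works for emojis made of one code point, and the names come from the Unicode database bundled with your Python version. The helper name `describe_emoji` is made up for this sketch.

```python
import unicodedata


def describe_emoji(char):
    """Return the lowercase Unicode name of a single code point, or None if it has none."""
    name = unicodedata.name(char, None)
    return name.lower() if name is not None else None


print(describe_emoji("\U0001F600"))  # grinning face
print(describe_emoji("\U0001F525"))  # fire
```

Sequences such as skin-tone or flag emojis span several code points, so for those a table like `my_emoji_translator` remains the better option.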

# Classification

# Generation

# Translation