# Introduction to NLP

## Course Overview

In this course, you will:

* Learn text processing fundamentals, including stemming and lemmatization.
* Apply fundamental NLP techniques to process specific sets of textual data.
* Use statistical techniques and machine learning to analyze text and build a part-of-speech tagging model.

## Lesson Overview

In this lesson, we will begin with some overarching concepts needed to understand how textual language is processed.

* Text Processing
* Feature Extraction
* Modeling

Let's get started!

<img src="img_0.png">

## What makes it hard for computers to understand us?

One feature of human communication is that it lacks a precisely defined structure.

Why does that make things difficult? The relationships between the elements of human communication are ambiguous in a way that those of mathematics and computer programming are not.
> * Math and programming languages are designed to be as unambiguous as possible, which makes them well suited for computers to process.

Structured languages are easy to parse and understand for computers, because they are defined by a strict set of rules, or **grammar**. When a statement doesn't match the implied grammar, **the computer doesn't try to guess the meaning**. Humans do.
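
As a small illustration of this strictness, the sketch below uses Python's own parser as the structured language. The helper name `parses` is made up for this example. A statement that breaks the grammar is rejected outright rather than interpreted.

```python
import ast

def parses(source: str) -> bool:
    """Return True if `source` is valid Python according to the language grammar."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        # The parser does not guess what was meant; it rejects the input.
        return False

print(parses("total = 1 + 2"))  # True
print(parses("total = 1 +"))    # False
```

A human reader would probably guess that something was meant to follow the `+`; the parser makes no such attempt.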

Computers can process words and phrases to identify **keywords, parts of speech (POS), named entities (found via named entity recognition, NER), dates, and quantities**. Using this information, they can make sense of **statements, questions, or instructions**. At the document level, they can then **identify frequent and rare words, assess tone and sentiment, or cluster similar documents together**.
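
One of these document-level analyses, finding frequent and rare words, can be sketched with the standard library alone. The sample document and the helper name `word_counts` are assumptions for this illustration, and the regular expression is a deliberately crude tokenizer.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Count lowercase word occurrences in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Made-up sample document for illustration.
document = "The cat sat on the mat. The dog sat too."
counts = word_counts(document)

# The two most frequent words, and every word that appears exactly once.
frequent = counts.most_common(2)
rare = sorted(word for word, n in counts.items() if n == 1)

print(frequent)  # [('the', 3), ('sat', 2)]
print(rare)      # ['cat', 'dog', 'mat', 'on', 'too']
```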

Context is also important. Consider the sentence:
> The sofa didn't fit through the door because it was too narrow.

In that sentence, it is unclear what "it" refers to. Grammatically, "it" could be either the sofa or the door. Humans know that wide things don't fit through narrow openings, so they conclude that the door is too narrow, but the sentence itself is ambiguous. A computer that does not **understand the relationship between wide, narrow, and fitting through something** would be unable to disambiguate it!

## NLP and Pipelines

NLP pipelines typically have three stages:
1. Text Processing
2. Feature Extraction
3. Modeling

### 1: Text Processing:
Goal: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.

### 2: Feature Extraction:
Goal: Extract and produce feature representations that are appropriate for the type of model you are planning to use.

### 3: Modeling:
Goal: Accomplish the NLP task.
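
The three stages can be sketched as three functions chained together. This is only a toy outline: the function names are invented for this example, the "features" are plain word counts, and the "model" is a trivial rule that reports the most common word.

```python
from collections import Counter

def process_text(raw: str) -> list[str]:
    """Stage 1: clean and normalize raw text into lowercase tokens."""
    return raw.lower().split()

def extract_features(tokens: list[str]) -> Counter:
    """Stage 2: turn tokens into a feature representation (here, word counts)."""
    return Counter(tokens)

def model(features: Counter) -> str:
    """Stage 3: perform the task (here, a toy rule: report the most common word)."""
    word, _ = features.most_common(1)[0]
    return word

def run_pipeline(raw: str) -> str:
    """Chain the three stages in order."""
    return model(extract_features(process_text(raw)))

print(run_pipeline("Spam spam eggs and spam"))  # spam
```

Keeping each stage in its own function makes it easy to change one stage and rerun the rest.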

However, the pipeline may not be perfectly linear. In practice, you might work through it like this:
1. Process the text.
2. Extract features.
3. Build the model.
4. Examine the model's outcomes and find the results unsatisfactory.
5. Review the extracted features and find them unsatisfactory as well.
6. Alter the text processing.
7. Repeat steps 1 through 3. If you are satisfied with the results, you are done; otherwise, repeat steps 4 through 6 as well until you are.


### 1: Text Processing - Why do we need to process text?

Why can we not feed text in directly? Think about where we get text to begin with.

Consider a web page: the page rendered on the left is generated from the HTML source on the right. We would not want to process the HTML tags.
<img src="img_1.png">
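
A minimal sketch of stripping tags from a page, using Python's standard `html.parser` module, might look like this. The class name `TextExtractor`, the helper `html_to_text`, and the sample page are assumptions for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags.
        self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())

page = """<html>
  <body>
    <h1>States of Matter</h1>
    <p>Solid, liquid, <b>gas</b>, and plasma.</p>
  </body>
</html>"""

print(html_to_text(page))  # States of Matter Solid, liquid, gas, and plasma.
```

This sketch keeps all text nodes, including the contents of `<script>` and `<style>` elements, so a real page would usually need further filtering.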

There are many sources of text: PDFs, web pages, Word documents, other file formats, speech recognition systems (speech to text), and books scanned using Optical Character Recognition (OCR).
<img src="img_2.png">

Some knowledge of the source medium can help you properly handle the input. In the end, **your goal is to extract plain text that is free of any source-specific markers or constructs that are not relevant to your task**. Once you have the cleaned text, further processing may be required. For example, capitalization doesn't usually change the meaning of a word, but a computer treats differently capitalized forms as different words. So, the text may need to be lowercased so that all forms of a word are treated the same.

Example (lowercasing, then removing punctuation and common words such as "the" and "of"):
The Four states of matter are: Solid, liquid, gas, and plasma $\rightarrow$ four states matter solid liquid gas plasma
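
The transformation in this example can be reproduced with a short sketch. The stop-word list here is a tiny hand-picked assumption chosen to match the example; real stop-word lists are much longer.

```python
import re

# A tiny, hand-picked stop-word list for this sketch only.
STOP_WORDS = {"the", "of", "are", "and"}

def normalize(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and remove stop words."""
    lowered = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [token for token in no_punct.split() if token not in STOP_WORDS]

sentence = "The Four states of matter are: Solid, liquid, gas, and plasma"
print(" ".join(normalize(sentence)))
# four states matter solid liquid gas plasma
```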

$$\require{cancel}$$
### 2: Feature Extraction - Getting Data Right for a Statistical Model

Text data is represented in computers using an encoding such as ASCII or Unicode, which maps every character to a number. Computers store and transmit these values as binary. The values also have an ordering, but does it make sense for the letter "A" to be less than the letter "B"?

$$\text{ASCII}\left(\text{A}\right)=65,\text{ASCII}\left(\text{B}\right)=66, \text{ASCII}\left(\text{C}\right)=67 \land 65<66<67 \cancel{\implies} \text{A}<\text{B}<\text{C}$$

If we let $A<B<C$ stand, it might mislead our model. Moreover, individual characters don't carry much meaning at all. Words are what we should be concerned with, but computers do not have a standard representation for words.
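
The character codes are easy to inspect in Python with the built-in `ord`. The helper name `code_points` is made up for this sketch.

```python
def code_points(text: str) -> list[tuple[str, int, str]]:
    """Return (character, numeric code, 8-bit binary string) for each character."""
    return [(ch, ord(ch), format(ord(ch), "08b")) for ch in text]

for ch, number, bits in code_points("ABC"):
    print(ch, number, bits)
# A 65 01000001
# B 66 01000010
# C 67 01000011

# String comparison follows the numeric codes, not any linguistic meaning:
# every uppercase letter sorts before every lowercase one.
print("Z" < "a")  # True
```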

For computer vision, an image can be represented as a matrix of pixel values, where each value is the light intensity in a color channel (RGB). By analogy, how can we come up with a similar representation for text so that we can model it? **It depends on what kind of model you're using and what you are trying to do.**

*Graph-Based Models*
Represent words as symbolic nodes with relationships between them, as in WordNet.
<img src="img_3.png">
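
To make the idea of symbolic nodes and relationships concrete, here is a toy graph of "is-a" links stored in a dictionary. The words and edges are hypothetical and hand-written for this sketch; they are not taken from WordNet, which is far larger and supports many relationship types.

```python
# Hypothetical "is-a" edges: each word points to its more general category.
IS_A = {
    "poodle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}

def ancestors(word: str) -> list[str]:
    """Follow is-a edges upward from `word` until no parent remains."""
    chain = []
    while word in IS_A:
        word = IS_A[word]
        chain.append(word)
    return chain

def is_a(word: str, category: str) -> bool:
    """Return True if `category` is reachable from `word` via is-a edges."""
    return category in ancestors(word)

print(ancestors("poodle"))      # ['dog', 'mammal', 'animal']
print(is_a("cat", "animal"))    # True
print(is_a("cat", "dog"))       # False
```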

*Statistical Models*
These need some sort of numerical representation.

<img src="img_4.png">

* Document-level tasks: per-document representation, such as bag-of-words or doc2vec<br>
Examples:
> * Spam detection
> * Sentiment analysis

<img src="img_5.png">
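
A bag-of-words representation can be sketched in a few lines: each document becomes a vector of word counts over a shared vocabulary. The function name `bag_of_words` and the two sample documents are assumptions for this example, and splitting on whitespace stands in for proper text processing.

```python
from collections import Counter

def bag_of_words(documents: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One entry per vocabulary word; words absent from the document count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["free prize free entry", "meeting at noon"]
vocab, vectors = bag_of_words(docs)

print(vocab)    # ['at', 'entry', 'free', 'meeting', 'noon', 'prize']
print(vectors)  # [[0, 1, 2, 0, 0, 1], [1, 0, 0, 1, 1, 0]]
```

Word order is discarded: only the counts survive, which is often enough for document-level tasks.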

* Word- and phrase-level tasks: word-level representation, such as word2vec or GloVe<br>
Examples:
> * Text generation
> * Machine translation

<img src="img_6.png">
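
Word-level representations place each word at a point in a vector space, so that related words end up close together. The sketch below compares words with cosine similarity. The three-dimensional vectors are hand-made for illustration only; real word2vec or GloVe embeddings are learned from large corpora and have many more dimensions.

```python
import math

# Hand-made toy vectors, NOT real word2vec or GloVe embeddings.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
fruit = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"])

# With these toy vectors, "king" is much closer to "queen" than to "apple".
print(round(royal, 2), round(fruit, 2))
```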

There are many ways of representing textual information; experience and experimentation will usually tell you which works best for your problem.

### 3: Modeling

* Designing a model
* Fitting its parameters to training data
* Predicting on unseen data

<img src="img_7.png">
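
The three steps above (design a model, fit it to training data, predict on unseen data) can be sketched with a small Naive Bayes text classifier written from scratch. This is one possible model choice, picked here only because it is short. The class name, the four training sentences, and the labels are made up for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """A minimal multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocabulary = set()
        for doc, label in zip(documents, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, document):
        tokens = document.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data for illustration.
train_docs = ["win a free prize now", "free cash prize",
              "meeting at noon today", "lunch at noon"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayes().fit(train_docs, train_labels)
print(clf.predict("free prize today"))  # spam
print(clf.predict("noon meeting"))      # ham
```

With only four training sentences this is purely a demonstration of the fit and predict steps, not a usable spam filter.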

Working with numerical features allows us to use almost any machine learning algorithm.

<img src="img_8.png">