<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Natural-Language-Processing-Intro" data-toc-modified-id="Natural-Language-Processing-Intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Natural Language Processing Intro</a></span></li><li><span><a href="#Challenges-of-NLP" data-toc-modified-id="Challenges-of-NLP-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Challenges of NLP</a></span></li><li><span><a href="#The-NLP-Pipeline" data-toc-modified-id="The-NLP-Pipeline-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The NLP Pipeline</a></span><ul class="toc-item"><li><span><a href="#Text-processing-involves-removing-the-extra-&quot;junk&quot;" data-toc-modified-id="Text-processing-involves-removing-the-extra-&quot;junk&quot;-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Text processing involves removing the extra "junk"</a></span></li><li><span><a href="#Feature-Extraction" data-toc-modified-id="Feature-Extraction-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Feature Extraction</a></span><ul class="toc-item"><li><span><a href="#Graph-representation" data-toc-modified-id="Graph-representation-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Graph representation</a></span></li><li><span><a href="#Document-level" data-toc-modified-id="Document-level-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Document level</a></span></li><li><span><a href="#Words-&amp;-Phrases" data-toc-modified-id="Words-&amp;-Phrases-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Words &amp; Phrases</a></span></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Modeling</a></span></li><li><span><a href="#Overall-Example-of-Pipeline" data-toc-modified-id="Overall-Example-of-Pipeline-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Overall Example of Pipeline</a></span></li></ul></li></ul></div>

# Natural Language Processing Intro

Highly structured languages (logic, mathematics, programming languages, etc.) contrast with natural language, which is fluid, complex, and unstructured.

Computers bridge the gap by processing words & phrases (identifying parts of speech, keywords, etc.), parsing sentences (statements, questions, etc.), and applying more advanced techniques like tone & sentiment analysis and document grouping/clustering.

# Challenges of NLP

Meaning is difficult for a computer to understand:

> I was led to believe that the Fyre Festival would be an amazing, transcendent event - I was conned.

Ambiguity because of lack of **context** (meaning or *semantics*):

> The pipe couldn't fit through the hole in the wall since it was too big.

versus:

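> The pipe couldn't fit through the hole in the wall since it was too small.

In the first sentence "it" refers to the pipe; in the second, "it" refers to the hole. Only the meaning of the final word tells us which.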


# The NLP Pipeline

## Text processing involves removing the extra "junk"

> Can't simply feed in data unprocessed

The first step in NLP is to process the **text** (here, the text representation of the natural language being used). The next step is extracting features that we can use for modeling (we'll see how to model with these features later in this module).

In this section, we will focus on the text processing portion and some feature extraction/selection.
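
As a rough illustration of what "processing the text" can mean, here is a minimal sketch using only Python's standard library. The function name `simple_clean`, the simplified URL pattern, and the sample sentence are assumptions made for this example; the full pipeline later in this notebook uses NLTK instead.

```python
import re


def simple_clean(text):
    """Replace URLs with a placeholder, lowercase, and split into word tokens."""
    # Replace anything that looks like a URL with a single placeholder token
    text = re.sub(r"https?://\S+", "urlplaceholder", text)
    # Lowercase so "Fyre" and "fyre" map to the same token
    text = text.lower()
    # Keep runs of letters, digits, and apostrophes; drop other punctuation
    return re.findall(r"[a-z0-9']+", text)


tokens = simple_clean("I was conned! Details at https://example.com/fyre")
print(tokens)
```

Even this small amount of cleaning removes punctuation and casing differences that would otherwise show up as separate features.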

## Feature Extraction

### Graph representation 

WordNet: https://wordnet.princeton.edu/
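
WordNet links words through relations such as "is-a" (hypernym) edges, which makes it a graph. The sketch below imitates that idea with a tiny hand-written dictionary. The edges and the helper names `path_to_root` and `lowest_common_ancestor` are made up for illustration and are not real WordNet data or part of any WordNet API.

```python
# Toy "is-a" (hypernym) edges, hand-written for illustration; not real WordNet data
HYPERNYMS = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def path_to_root(word):
    """Follow is-a edges upward and return the chain of increasingly general concepts."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """Return the most specific concept shared by both words, or None if there is none."""
    ancestors_of_a = set(path_to_root(a))
    for concept in path_to_root(b):
        if concept in ancestors_of_a:
            return concept
    return None


print(path_to_root("poodle"))
print(lowest_common_ancestor("poodle", "cat"))
```

Walking the graph like this is one way to turn "how related are two words?" into something a program can compute.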

### Document level 

> bag of words
>
> doc2vec
    

Uses:

+ Sentiment analysis
+ Spam detection
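
To make the bag-of-words idea concrete, here is a minimal sketch that counts words per document using only the standard library. The two sample documents and the helper name `bag_of_words` are assumptions for this example; in practice a tool such as scikit-learn's `CountVectorizer` (used later in this notebook) does this work.

```python
from collections import Counter

# Two made-up documents standing in for a corpus
docs = [
    "free prize claim your free prize now",
    "meeting moved to monday",
]

# Vocabulary: every distinct word across the corpus, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc):
    """Return one count per vocabulary word; word order in the document is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length vector of counts, which is the kind of input a spam or sentiment classifier can work with.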


### Words & Phrases 

> word2vec
>
> GloVe

Uses:
+ Text generation
+ Machine Translation
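
Word-level embeddings represent each word as a vector so that related words end up close together. The sketch below shows how closeness is commonly measured with cosine similarity. The three-dimensional vectors are hand-picked toy values, not output from word2vec or GloVe, and `cosine_similarity` is a helper written for this example.

```python
import math

# Hand-picked 3-dimensional toy vectors; real embeddings are learned and much larger
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_royal > sim_fruit)
```

With these toy values, "king" sits much closer to "queen" than to "apple", which is the behavior learned embeddings aim for.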

## Modeling

> Using numerical features lets us apply the ML algorithms we already know!
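
Once text is turned into numbers, any standard classifier can be trained on it. The sketch below uses a hand-rolled nearest-centroid classifier purely as a stand-in. The feature vectors, labels, and helper names are made up for this example, and the full pipeline later in this notebook uses a random forest instead.

```python
def centroid(rows):
    """Element-wise mean of a list of equal-length numeric vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Made-up feature vectors: counts of the words ["free", "prize", "meeting"]
X_train = [[2, 1, 0], [3, 2, 0], [0, 0, 1], [0, 0, 2]]
y_train = ["spam", "spam", "ham", "ham"]

# "Training" here just means computing one centroid per class
centroids = {
    label: centroid([x for x, y in zip(X_train, y_train) if y == label])
    for label in set(y_train)
}


def predict(x):
    """Assign the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: squared_distance(x, centroids[label]))


print(predict([1, 1, 0]))
print(predict([0, 0, 3]))
```

The classifier never sees words, only counts, which is why the feature extraction step matters so much.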

## Overall Example of Pipeline

```python

import nltk
# Tokenizer and lemmatizer data; recent NLTK releases may also need 'punkt_tab'
nltk.download(['punkt', 'wordnet'])

import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Raw string so the backslash escapes reach the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


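def load_data():
    """Load messages.csv and return message texts (X) and category labels (y)."""
    df = pd.read_csv('messages.csv')
    # Keep rows whose category has full confidence and is not marked 'Exclude'
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y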


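def tokenize(text):
    """Replace URLs with a placeholder, then tokenize, lemmatize, and lowercase."""
    # Swap each URL for a single token so links don't inflate the vocabulary
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # Lemmatize (noun lemmas by default), lowercase, and trim each token
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens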


def display_results(y_test, y_pred):
    # Use labels from both arrays so a class that is never predicted still appears
    labels = np.union1d(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)


def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])

    # train classifier
    pipeline.fit(X_train, y_train)

    # predict on test data
    y_pred = pipeline.predict(X_test)

    # display results
    display_results(y_test, y_pred)


main()
```