In [1]:
%matplotlib inline

import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [9.0, 6.0]

# Session 3: Text Classification and Topic Modeling 
**July 12, 2018**

In this session, we'll begin using machine learning techniques to broaden our text analytics. Now that we have a corpus of documents, we can start building topic models to determine whether documents are similar to each other and how to categorize them broadly. We can also use text classification methods to create tools that allow us to model documents and utterances more precisely. We'll take a look at a specific kind of classification, sentiment analysis, and how it can be used for deeper analytics.

## Intro to Machine Learning 

In general, a learning problem considers a set of $n$ instances (or examples) that models are trained on. 

Instances are represented by a multidimensional entry (aka multivariate data) having several attributes or _features_.

The goal of the learning problem is to create a model that predicts a _target_ attribute or feature.

![Feature Space](figures/s3_feature_space.png)

The best way to think about machine learning is as a set of problems that use mathematical and statistical methods to find patterns in _high dimensional space_.

### Learning by Example

| Problem Domain                              | Machine Learning Class |
|---------------------------------------------|------------------------|
| Infer a function from labeled data          | Supervised learning    |
| Discover structure of data without feedback | Unsupervised learning  |
| Interact with environment towards a goal    | Reinforcement learning |

_Given examples (data) extract a meaningful pattern upon which to act._

### Algorithms by Output 

| Type of Output                              | Algorithm Category              |
|---------------------------------------------|---------------------------------|
| **Output is one or more discrete classes**  | **Classification (Supervised)** |
| **Output is continuous**                    | **Regression (Supervised)**     |
| **Output is membership in a similar group** | **Clustering (Unsupervised)**   |
| Output is the distribution of inputs        | Density Estimation              |
| Output is simplified from higher dimensions | Dimensionality Reduction        |

_Use training data to fit a model, which is then used to make predictions for new inputs._

### Classification 

![Classification](figures/s3_classification.png)

Given labeled input data (with two or more labels), fit a function that can determine the label for any input.

### Regression 

![Regression](figures/s3_regression.png)

Given input data paired with continuous target values, fit a function that predicts the continuous value for a new input from its other attributes.

### Clustering 

![Clustering](figures/s3_clustering.png)

Given data, determine a pattern of associated data points or clusters via their similarity or distance from one another.

## The Bag of Words Model 

Machine learning models operate on numeric features, so in order to do machine learning on words we need to represent text numerically somehow.

![Bag of Words](figures/s3_bag_of_words.png)

We've already noted that words that co-occur carry a statistical signal that might be related to meaning. The bag-of-words model takes advantage of this by representing each document numerically through the words that appear together in it, disregarding their order.

## Vectorization 

Vectorization, also called feature extraction, is the process of transforming text documents into numeric representations (vectors) that can be used to do machine learning.

![Vector Encoding](figures/s3_vector_encoding.png)

The common method of vectorization is to take the _vocabulary_ of the corpus and order it lexicographically (in alphabetical order), so that each word has a fixed position in the vector. To transform a document into a vector, we simply assign each position a number that represents that word's relationship to the document.
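As a minimal sketch of this first step, the snippet below builds a lexicographically ordered vocabulary from a corpus. The three-document corpus, the whitespace tokenizer, and the helper names `tokenize` and `build_vocabulary` are all assumptions made for illustration; a real pipeline would use a proper tokenizer.

```python
# Toy corpus; these three documents are invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Map each distinct token to a vector position, in lexicographic order."""
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}


vocabulary = build_vocabulary(corpus)
print(vocabulary)
```

Every document vector then has one position per vocabulary entry, so all documents in the corpus share the same dimensionality.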

### One Hot Encoding 

![One Hot Encoding](figures/s3_one_hot_encoding.png)

The simplest method is one-hot encoding, where we simply assign a 1 if the word exists in the document, or a 0 otherwise. This encoding is very common and is often used as input to artificial neural networks.
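One way this encoding might look in code is sketched below. The vocabulary and the helper name `one_hot_encode` are hypothetical, and tokenization is a naive lowercase-and-split.

```python
def one_hot_encode(document, vocabulary):
    """Return a vector with 1 where a vocabulary word occurs in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary, already sorted lexicographically.
vocabulary = ["bat", "cat", "dog", "mat", "on", "sat", "the"]

vector = one_hot_encode("The cat sat on the mat", vocabulary)
print(vector)  # [0, 1, 0, 1, 1, 1, 1]
```

Note that "the" appears twice in the document but still contributes a single 1, since only presence is recorded.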

### Frequency Encoding 

![Frequency Encoding](figures/s3_frequency_encoding.png)

If we believe that the number of times a word appears in a document matters, we can simply count the number of occurrences and use that number in the word's vector position.
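A small sketch of frequency encoding, using the standard library's `collections.Counter`, could look like this. The vocabulary and the helper name `frequency_encode` are assumptions for illustration.

```python
from collections import Counter


def frequency_encode(document, vocabulary):
    """Return a vector holding the count of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    # A Counter returns 0 for words it has not seen.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary, already sorted lexicographically.
vocabulary = ["bat", "cat", "dog", "mat", "on", "sat", "the"]

vector = frequency_encode("The cat sat on the mat", vocabulary)
print(vector)  # [0, 1, 0, 1, 1, 1, 2]
```

Compared with one-hot encoding, the only difference here is the final position, where "the" is counted twice.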

### TF-IDF 

![TF-IDF](figures/s3_tfidf_encoding.png)

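Term-Frequency, Inverse-Document-Frequency is a measure of a word's relative importance to the document, given its frequency in the entire corpus. Words that are frequent in a document but rare in the rest of the corpus receive the highest scores, while words that appear in nearly every document score close to zero.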
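The sketch below computes one textbook variant of TF-IDF: term frequency normalized by document length, multiplied by $\log(N / df)$, where $N$ is the number of documents and $df$ is the number of documents containing the term. This particular weighting is an assumption; libraries typically apply additional smoothing and normalization, so their numbers will differ. The corpus and the function name `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus; invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sang",
]


def tfidf(corpus):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocabulary = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


vocabulary, vectors = tfidf(corpus)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term:>5}: {weight:.3f}")
```

In this variant, "the" occurs in every document, so its weight is exactly zero, while "cat" occurs in only one document and outweighs "sat", which occurs in two.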

### word2vec 

![Distributed Representations](figures/s3_distributed_representation.png)

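The current state of the art is _word embeddings_, of which word2vec is a well-known example. This representation learns a vector for each word from the contexts in which it appears, so that similar words end up close together; e.g. "king" and "queen" will be closer together than "red" and "banana".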
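"Closer together" is commonly measured with cosine similarity. The sketch below shows that calculation only; the three-dimensional vectors are made-up toy values, not output from a trained word2vec model, and real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors for illustration only (NOT trained embeddings).
embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "red": [0.10, 0.20, 0.90],
    "banana": [0.70, 0.10, 0.30],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
mixed = cosine_similarity(embeddings["red"], embeddings["banana"])
print(f"king/queen: {royal:.3f}")
print(f"red/banana: {mixed:.3f}")
```

With these toy values the king/queen pair scores higher than the red/banana pair, which is the relationship a trained model is expected to capture.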

## Topic Modeling 

Topic modeling uses unsupervised methods to cluster related documents into _topics_, e.g. to find groups of related documents based on the terms that they contain.

![Topic Modeling Pipeline](figures/s3_topic_modeling.png)

## Text Classification 

If we have documents labeled with a tag or class, we can train a model of text to identify those categorizations in other texts. For example, we can detect bullying, positive or negative sentiment, product categories, etc.

![Classification Pipeline](figures/s3_classification_pipeline.png)
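To make the pipeline concrete, here is a minimal sketch of a sentiment classifier built on word counts. It uses multinomial Naive Bayes with add-one smoothing, which is an assumption chosen for brevity rather than the method this session prescribes. The four training sentences and the function names `train` and `predict` are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # Ignore words never seen in training.
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Tiny invented training set; real classifiers need far more data.
training_data = [
    ("I loved this film", "positive"),
    ("what a great and fun movie", "positive"),
    ("I hated this film", "negative"),
    ("what a boring and awful movie", "negative"),
]

model = train(training_data)
print(predict("a great film", model))    # positive
print(predict("an awful movie", model))  # negative
```

The same structure carries over to other labels, such as product categories: only the training data changes, not the training or prediction code.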