## Week 10 - Text Clustering / Categorization

#### 10.1 - Clustering Motivation
Clustering bias: any two objects can be considered similar depending on how you view them, so you must define the perspective used to assess similarity. What can be clustered:
- Documents
- Terms
- Passages/Sentences/Segments
- Websites
- Text Clusters can also be clustered

Why cluster? It is useful for exploratory analysis and text mining:
- Get sense of overall content in collection
- Link text objects
- Create structure on text data
- Induce additional features for classification

#### 10.2 - Clustering Generative Probabilistic Model Part I
Two ways to do text clustering:
- Generative Probabilistic Models
- Similarity-Based Approaches

In text clustering we only allow a document to cover one topic, with each topic representing a cluster.

###### Generative Model for Clustering

Input $C, k, V$, with the output $\{\, \theta_1, ..., \theta_k \,\}$ and $\{\, c_1, ..., c_N \,\}$, each $c_i \in [1, k]$, where $c_i$ is the cluster that document $d_i$ belongs to; each document is described by exactly one topic. This is similar to topic mining, except that the cluster assignments $c_i$ take the place of the topic coverage $\pi$. So instead of a topic coverage for each document, we get the cluster that each document most likely belongs to.

Within the topic mining method think of the $\pi$ as being elements of a matrix where the columns represent the documents and the rows are the topics. $\pi$ is the topic coverage for the document.

Mine one topic:
- Input $C = \{\, d \,\}, V$
- Output $\{\, \theta \}\,$

Why can the generative topic model not be used for clustering? Because it allows multiple topics to contribute words to a document, which makes it hard to identify which topic/cluster the document came from. For clustering, all words in a document must be generated from a single distribution: the topic model of the document's cluster.

###### Mixture Model for Document Clustering
Once a topic is chosen, its distribution is used to generate all words in the document. The choice of distribution is made only once per document.

How does the mixture model differ from the topic model?
- Choice of distribution is made just once in mixture model
- One distribution generates all words for document in mixture model

Likelihood Function:
$$p(d) = p(\theta_1)\ \prod_{i = 1}^L\ p(x_i | \theta_1) + p(\theta_2) \prod_{i = 1}^L\ p(x_i | \theta_2)$$
$$p(C|\Lambda) = \prod_{j = 1}^N\ p(d_j | \Lambda)$$

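The first equation sums, over the two distributions, the probability of choosing that distribution times the probability of generating all $L$ words of the document from it. The second is the likelihood of the whole collection: the product of the document likelihoods over all $N$ documents.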
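The two-component document likelihood can be checked numerically with a small sketch. The function name `doc_likelihood` and the toy word distributions are illustrative assumptions, not values from the lecture.

```python
import math

def doc_likelihood(doc, priors, thetas):
    """p(d) = sum_i p(theta_i) * prod_x p(x | theta_i) for a list of word tokens."""
    total = 0.0
    for prior, theta in zip(priors, thetas):
        # One distribution generates every word of the document.
        word_prob = math.prod(theta[w] for w in doc)
        total += prior * word_prob
    return total

# Hypothetical toy parameters for two clusters.
theta_1 = {"text": 0.6, "mining": 0.3, "food": 0.1}
theta_2 = {"text": 0.1, "mining": 0.1, "food": 0.8}
priors = [0.5, 0.5]

p_d = doc_likelihood(["text", "mining", "text"], priors, [theta_1, theta_2])
# 0.5 * (0.6 * 0.3 * 0.6) + 0.5 * (0.1 * 0.1 * 0.1) = 0.0545
```

For long documents the raw product underflows, so practical code usually works with log probabilities instead.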

#### 10.3 - Clustering Generative Probabilistic Model Part II
How to generalize to $k$ clusters? Use a model that is a mixture of $k$ unigram language models, $\Lambda = (\{\theta_i\}; \{p(\theta_i)\}), i \in [1, k]$. Likelihood:
$$p(d | \Lambda) = \sum_{i = 1}^k [p(\theta_i)\ \prod_{w \in V} p(w | \theta_i)^{c(w,d)}]$$

Maximize the likelihood by choosing the parameters $\Lambda$ that maximize $p(C | \Lambda)$. Once the parameters are estimated, each document is assigned to the cluster with the highest probability of having generated it.

Parameters:
- Each $\theta_i$ represents the content of cluster $i : p(w | \theta_i)$
- $p(\theta_i)$ indicates the size of cluster $i$. NOTE this does not depend on $d$

Which cluster should a document belong to? Assign $d$ to the cluster corresponding to the topic $\theta_i$ that was most likely used to generate $d$. Using the likelihood alone, $c_d = \arg\max_i\ p(d | \theta_i)$, ignores cluster size. The Bayesian approach multiplies the likelihood by the prior, $c_d = \arg\max_i\ p(d | \theta_i)\ p(\theta_i)$, and therefore favors large clusters: it prefers a cluster that is both large and likely to have generated the document.
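A minimal sketch of this assignment rule, computed in log space so that long documents do not underflow. The helper name `assign_cluster` and the toy parameters are assumptions made for illustration.

```python
import math
from collections import Counter

def assign_cluster(doc, priors, thetas):
    """Return argmax_i [ log p(theta_i) + sum_w c(w, d) * log p(w | theta_i) ]."""
    counts = Counter(doc)  # c(w, d)
    scores = [
        math.log(prior) + sum(c * math.log(theta[w]) for w, c in counts.items())
        for prior, theta in zip(priors, thetas)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy word distributions for two clusters.
thetas = [
    {"text": 0.6, "mining": 0.3, "food": 0.1},
    {"text": 0.1, "mining": 0.1, "food": 0.8},
]

equal = assign_cluster(["text"], [0.5, 0.5], thetas)   # likelihood decides: cluster 0
skewed = assign_cluster(["text"], [0.1, 0.9], thetas)  # large prior wins: cluster 1
```

With equal priors the document goes to the cluster with the higher likelihood; with a strongly skewed prior the same document is pulled into the larger cluster, which is the bias described above.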

#### 10.4 - Clustering Generative Probabilistic Model Part III
Use the EM algorithm for document clustering:
- Randomly initialize the parameters $\Lambda$
- Repeat until likelihood $p(C | \Lambda)$ converges


- E-Step: infer which distribution was used to generate document $d$, using the hidden variable $Z_d$.
$$p^{(n)}(Z_d = i | d) \propto p^{(n)}(\theta_i)\ \prod_{w \in V}p^{(n)}(w | \theta_i)^{c(w, d)}$$
- M-Step: re-estimate the parameters.
$$p^{(n + 1)}(\theta_i) \propto \sum_{j = 1}^N\ p^{(n)}(Z_{d_j} = i | d_j)$$

is the probability of selecting a specific distribution.

$$p^{(n + 1)}(w | \theta_i) \propto \sum_{j = 1}^N\ c(w, d_j)\ p^{(n)}(Z_{d_j} = i | d_j)$$

is the probability of each word within that distribution / cluster.
$p(Z_d = i | d)$ is computed with Bayes' rule: the numerator is the probability of selecting the topic/cluster times the probability of generating the document's words from that topic, and the denominator is the sum of this quantity over all clusters (normalization). Because the product of many small word probabilities can underflow, we add a normalizer. $\bar \theta$ is the average word distribution across all topics.

To add the normalizer, divide each word probability in both the numerator and the denominator by the corresponding probability under the average word distribution $\bar \theta$.
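The E-step and M-step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the name `em_cluster`, the fixed seed, and the small constant `eps` (which keeps every probability nonzero) are hypothetical choices. To avoid underflow it uses the log-sum-exp trick in the E-step, which is a different normalization from dividing by the average word distribution.

```python
import math
import random
from collections import Counter

def em_cluster(docs, k, max_iters=100, tol=1e-8, seed=0, eps=1e-6):
    """EM for a mixture of k unigram language models (one cluster per document)."""
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]              # c(w, d_j)
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)

    def normalize(raw):
        z = sum(raw.values())
        return {w: v / z for w, v in raw.items()}

    # Random initialization of Lambda.
    priors = [1.0 / k] * k
    thetas = [normalize({w: rng.random() + eps for w in vocab}) for _ in range(k)]

    prev_ll, post = -math.inf, []
    for _ in range(max_iters):
        # E-step: p(Z_d = i | d) for every document, computed in log space.
        post, ll = [], 0.0
        for c in counts:
            logs = [
                math.log(priors[i])
                + sum(m * math.log(thetas[i][w]) for w, m in c.items())
                for i in range(k)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]  # log-sum-exp shift
            z = sum(exps)
            post.append([e / z for e in exps])
            ll += top + math.log(z)                   # log p(d | Lambda)
        # M-step: re-estimate p(theta_i) and p(w | theta_i) from the posteriors.
        for i in range(k):
            priors[i] = (sum(p[i] for p in post) + eps) / (n + k * eps)
            raw = {w: eps for w in vocab}
            for c, p in zip(counts, post):
                for w, m in c.items():
                    raw[w] += m * p[i]
            thetas[i] = normalize(raw)
        if ll - prev_ll < tol:                        # likelihood has converged
            break
        prev_ll = ll
    return priors, thetas, post

toy_docs = [["text", "mining", "text"], ["text", "mining", "text"],
            ["food", "recipe", "food"]]
priors, thetas, post = em_cluster(toy_docs, k=2)
labels = [max(range(2), key=p.__getitem__) for p in post]  # hard cluster per document
```

Because EM only finds a local maximum, different seeds can give different clusterings; running several random restarts and keeping the highest likelihood is a common remedy.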

#### 10.5 - Clustering Similarity-Based Approaches
General Idea:
- Explicitly define a similarity function to measure similarity between two text objects
- Find an optimal partitioning of the data that maximizes intra-group similarity (similar within groups) and minimizes inter-group similarity (different between groups)
- Strategies
	- Progressively construct a hierarchy of clusters, either bottom-up (group, e.g. Hierarchical Agglomerative Clustering) or top-down (partition)
	- Start with initial clustering and iteratively improve it (K-Means)

###### Agglomerative Hierarchical Clustering
Given a similarity function to measure similarity between two objects, gradually group similar objects together in a bottom-up fashion to form a hierarchy. Stop when a threshold has been achieved.  

Similarity-induced structure: gradually group objects one by one based on which two are most similar, which essentially results in a binary tree. The cutoff/threshold acts like pruning this binary tree. Methods used to measure the similarity of two groups include:
- Single-link algorithm: similarity of the closest pair. Produces loose clusters and is sensitive to outliers.
- Complete-link algorithm: similarity of the farthest pair. Produces tight clusters and is also sensitive to outliers, because even the least similar pair must be similar for two groups to merge.
- Average-link algorithm: average similarity of all pairs. In between the other two; a group decision that is not sensitive to outliers.
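The three linkage rules and the bottom-up merge loop can be sketched as below. The function names, the stopping rule (stop when the best group similarity falls below `threshold`), and the one-dimensional toy data with similarity defined as negative distance are all illustrative assumptions; for text, `sim` would typically be a similarity between term vectors.

```python
def single_link(a, b, sim):
    return max(sim(x, y) for x in a for y in b)    # closest pair

def complete_link(a, b, sim):
    return min(sim(x, y) for x in a for y in b)    # farthest pair

def average_link(a, b, sim):
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerate(items, sim, linkage, threshold):
    """Merge the two most similar groups until the best similarity < threshold."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j], sim)
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < threshold:
            break                                   # prune the tree here
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy example: numbers on a line, similarity = negative distance.
points = [1, 2, 3, 10, 11]
neg_dist = lambda x, y: -abs(x - y)

loose = agglomerate(points, neg_dist, single_link, threshold=-1.5)
tight = agglomerate(points, neg_dist, complete_link, threshold=-1.5)
# loose -> [[1, 2, 3], [10, 11]]
# tight -> [[1, 2], [3], [10, 11]]
```

With the same threshold, single-link chains 1, 2 and 3 together through neighboring pairs, while complete-link refuses to add 3 because its farthest pair (1 and 3) is too dissimilar.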

###### K-Means
Represent each text object as a term vector and assume a similarity function on two objects.
- Start with $k$ randomly selected vectors and assume they are the centroids of $k$ clusters.
- With these $k$ centroids, assign each vector to the cluster whose centroid is closest to it
- Recompute each centroid based on the objects within its cluster
- Repeat until the objective function (within-cluster sum of squares) converges to a local minimum

This process is very similar to EM-based clustering. K-Means differs in the E-Step: no probabilistic decision is made; instead, each point is assigned outright to the cluster whose centroid is nearest.
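The K-Means steps can be sketched in a few lines of plain Python. This is a minimal version under assumptions: squared Euclidean distance stands in for the similarity function, the helper names and toy vectors are hypothetical, and an empty cluster simply keeps its old centroid.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Start with k randomly selected vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iters):
        # Hard assignment: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: sq_dist(v, centroids[i])) for v in vectors
        ]
        if new_labels == labels:
            break                      # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for i in range(k):
            members = [v for v, l in zip(vectors, labels) if l == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

toy_vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
km_labels, km_centroids = kmeans(toy_vectors, k=2)
```

The hard `min` in the assignment step is where this differs from EM, which would keep a posterior probability for every cluster instead.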

#### 10.6 - Clustering Evaluation
###### Direct Evaluation of Text Clusters
Question to answer: how close are the system-generated clusters to the ideal clusters generated by humans? Closeness can be assessed in different ways and can be quantified. The clustering bias is imposed by the human assessors.

Evaluation procedure:
- Given test set, have humans create an ideal clustering result
- Use a system to produce clusters from the same test set
- Quantify similarity between the system-generated clusters and the human-generated clusters. 
- Similarity can be measured from multiple perspectives (purity, normalized mutual information, F measure, etc.)
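As one example of such a measure, purity can be computed as below: each system cluster is credited with its most common human label, and the credited documents are divided by the total. The function name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def purity(system_labels, gold_labels):
    """Fraction of documents that carry the majority gold label of their cluster."""
    clusters = {}
    for s, g in zip(system_labels, gold_labels):
        clusters.setdefault(s, []).append(g)
    majority_total = sum(
        Counter(golds).most_common(1)[0][1] for golds in clusters.values()
    )
    return majority_total / len(gold_labels)

system = [0, 0, 0, 1, 1, 1]
gold = ["a", "a", "b", "b", "b", "c"]
score = purity(system, gold)  # (2 + 2) / 6
```

Purity alone rewards many tiny clusters (one document per cluster scores 1.0), which is one reason it is usually reported alongside other measures.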

###### Indirect Evaluation of Text Clusters
How useful are the clustering results for the intended application?
- Usefulness is application-specific
- Clustering bias is imposed by the intended application

Evaluation procedure:
- Create a test set for intended application to quantify the performance of any system for this application
- Choose baseline system to compare with
- Add clustering algorithm to the baseline system
- Compare the performance of the clustering system and the baseline in terms of any performance measure for the application.

###### Approaches
- Strong clusters tend to show up no matter what method is used
- Effectiveness of a method highly depends on whether the desired clustering bias is captured appropriately 
- Deciding the optimal number of clusters is generally difficult for any method due to the unsupervised nature of clustering

#### 10.7 - Categorization Motivation
###### Text Categorization
Given a set of predefined categories and, often, a training set of labeled text objects, the goal is to classify a text object into one or more of the categories. Internal categories characterize the text object itself. External categories characterize an entity associated with the text object.

###### Examples of Text Categorization
- News categorization
- Literature article categorization
- Spam email detection
- Sentiment categorization of product reviews or tweets
- Automatic email sorting

###### Variants
- Binary categorization
- K-Category categorization
- Hierarchical categorization
- Joint categorization: multiple related categorization tasks done in a joint manner

###### Why Text Categorization
- Enrich text representation. Text can now be represented at multiple levels, such as keywords and categories.
- Assigned semantic categories can facilitate aggregation of text content (e.g. positive or negative opinions)
- Infer properties of entities associated with text data.
- As long as an entity can be associated with text data, we can use text data to help categorize associated entities

#### 10.8 - Categorization Methods
###### Manual (Rule-Based) Categorization Works Well When:
- Categories are well defined
- Categories are easily distinguished based on surface features in text
- Sufficient domain knowledge is available to suggest many effective rules

###### Problems
- Does not scale well
- Cannot handle uncertainty in rules
- Rules may be inconsistent

###### Automatic Categorization
Use humans to:
- Annotate data sets with category labels
- Provide a set of features to represent each text object that can potentially provide a clue about the category

Use machine learning to learn soft rules for categorization from the training data:
- Figure out which features are most useful
- Optimally combine features to minimize the errors of categorization on training data
- The trained classifier can then be applied to a new text object to predict the most likely category a human expert would assign to it

###### Discriminative vs. Generative Classifiers
Generative classifiers attempt to model the joint probability of features and label, then compute the label using Bayes' rule. The objective function is the likelihood. They learn what the data looks like in each category.

Discriminative classifiers attempt to model the label given the features directly; they learn which features separate the categories. The objective function measures categorization errors on the training data. Examples include Logistic Regression, SVMs, K-Nearest Neighbors, etc.

#### 10.9 - Categorization Generative Probabilistic Model
Text clustering discovers the categories, whereas text categorization places documents into predefined categories.
$$cluster(d) = \arg\max_i \prod_{w \in V}\ p(w | \theta_i)^{c(w, d)}\ p(\theta_i)$$
$$p(\theta_i | d) = \frac{p(d | \theta_i)\ p(\theta_i)}{\sum_{j = 1}^k\ p(d | \theta_j)\ p(\theta_j)}$$

###### Text Categorization with Naive Bayes Classifier
$$category(d) = \arg\max_i\ \left[ \log p(\theta_i) + \sum_{w \in V}\ c(w, d)\ \log p(w | \theta_i ) \right]$$

How do we ensure that $\theta_i$ represents category $i$? Use the training data, which are documents with categories assigned by human experts. The training data is used to estimate both the prior $p(\theta_i)$ and the word likelihood $p(w | \theta_i)$.

$$p(\theta_i) = \frac{N_i}{\sum_{j = 1}^k N_j} \propto \mathbf{|T_i|}$$

estimates which category is most popular. $N_i$ is the number of training documents in category $i$, and $T_i$ is the set of training documents in category $i$.

$$p(w | \theta_i) = \frac{\sum_{j = 1}^{N_i}\ c(w, d_{ij})}{\sum_{w' \in V}\ \sum_{j = 1}^{N_i}\ c(w', d_{ij})}\ \propto c(w, T_i)$$

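estimates how frequent each word is in category $i$. Here $d_{ij}$ is the $j$-th training document in category $i$, and $c(w, T_i)$ is the count of $w$ over all documents in $T_i$.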

Why smooth in Naive Bayes? Smoothing addresses data sparseness, incorporates prior knowledge, and achieves discriminative weighting (IDF weighting). We do this by adding pseudocounts, first to the category counts and then to the word counts:
$$p(\theta_i) = \frac{N_i + \delta}{\sum_{j = 1}^k\ N_j + k \delta}$$

where $\delta \ge 0$.

$$p(w | \theta_i) = \frac{\sum_{j = 1}^{N_i}\ c(w, d_{ij}) + \mu p(w | \theta_B)}{\sum_{w' \in V}\ \sum_{j = 1}^{N_i}\ c(w', d_{ij}) + \mu}$$
$$p(w | \theta_B) = \frac{1}{|V|}$$

As $\mu$ approaches infinity the model approaches the background language model. $\theta_B$ is the background model.
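The smoothed estimates and the log-space decision rule can be sketched as follows. The function names `train_nb` and `classify_nb`, the toy spam/ham data, and the default values $\delta = 1$ and $\mu = 1$ are assumptions chosen for illustration; the uniform background model $p(w | \theta_B) = 1/|V|$ follows the formula above.

```python
import math
from collections import Counter

def train_nb(labeled_docs, vocab, delta=1.0, mu=1.0):
    """Estimate p(theta_i) and p(w | theta_i) with pseudocount smoothing."""
    categories = sorted({c for _, c in labeled_docs})
    k = len(categories)
    n_docs = Counter(c for _, c in labeled_docs)        # N_i
    word_counts = {c: Counter() for c in categories}    # c(w, T_i)
    for doc, c in labeled_docs:
        word_counts[c].update(w for w in doc if w in vocab)
    p_bg = 1.0 / len(vocab)                             # uniform background model
    priors = {
        c: (n_docs[c] + delta) / (len(labeled_docs) + k * delta)
        for c in categories
    }
    word_probs = {}
    for c in categories:
        total = sum(word_counts[c].values())
        word_probs[c] = {
            w: (word_counts[c][w] + mu * p_bg) / (total + mu) for w in vocab
        }
    return priors, word_probs

def classify_nb(doc, priors, word_probs):
    """argmax_i [ log p(theta_i) + sum_w c(w, d) log p(w | theta_i) ]."""
    counts = Counter(doc)
    def log_score(c):
        return math.log(priors[c]) + sum(
            n * math.log(word_probs[c][w])
            for w, n in counts.items() if w in word_probs[c]  # skip unseen words
        )
    return max(priors, key=log_score)

training = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "cheap", "now"], "spam"),
    (["meeting", "notes", "today"], "ham"),
    (["project", "meeting", "now"], "ham"),
]
vocab = {w for doc, _ in training for w in doc}
nb_priors, nb_word_probs = train_nb(training, vocab)
prediction = classify_nb(["cheap", "buy"], nb_priors, nb_word_probs)
```

Thanks to the pseudocounts, a word never seen in a category (such as "cheap" in "ham") still gets a small nonzero probability, so its logarithm stays finite.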

###### Anatomy of Naive Bayes Classifier
Assume we have two categories:
$$score(d) = \log\left(\frac{p(\theta_1)}{p(\theta_2)}\right) + \sum_{w \in V}\ c(w, d)\ \log\left(\frac{p(w | \theta_1)}{p(w | \theta_2)}\right)$$

The larger the score, the more likely the document is in category one, $\theta_1$. The term to the left of the + is the log ratio of the priors; it is the category bias and can be represented as $\beta_0$. The term to the right of the + is a sum over all words of the word count times a per-word weight, the log ratio $\log \frac{p(w | \theta_1)}{p(w | \theta_2)}$, which can be represented as $\beta_i$. The weight essentially compares the probability of the word under the two distributions. This can be generalized as:
$$score(d) = \beta_0 + \sum_{i = 1}^M\ f_i \beta_i$$

where $f_i$ is the count of word $i$ and $M$ is the number of features (here, the vocabulary size). This has the same linear form as logistic regression.
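The linear form can be made concrete by turning two Naive Bayes models into a bias and per-word weights. The function names and the two-word toy distributions are hypothetical, chosen only to illustrate the formula.

```python
import math
from collections import Counter

def nb_linear_weights(prior_1, prior_2, theta_1, theta_2):
    """beta_0 = log prior ratio; beta_w = log ratio of word probabilities."""
    beta_0 = math.log(prior_1 / prior_2)
    betas = {w: math.log(theta_1[w] / theta_2[w]) for w in theta_1}
    return beta_0, betas

def linear_score(doc, beta_0, betas):
    counts = Counter(doc)  # f_i: count of each word in the document
    return beta_0 + sum(n * betas[w] for w, n in counts.items())

# Hypothetical smoothed word distributions for two categories.
theta_pos = {"good": 0.7, "bad": 0.3}
theta_neg = {"good": 0.2, "bad": 0.8}
beta_0, betas = nb_linear_weights(0.5, 0.5, theta_pos, theta_neg)

s = linear_score(["good", "good", "bad"], beta_0, betas)
# s = 2 * log(0.7 / 0.2) + log(0.3 / 0.8) > 0, so category one is favored
```

The difference from logistic regression lies in how the weights are obtained: here they come from separately estimated word distributions, whereas logistic regression fits them directly to minimize classification error.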
