## Week 11 - Text Categorization Continued

#### 11.1 - Discriminative Classifier Part I 
You will have a binary response variable, where each value maps to one of two categories.

###### Conditional Likelihood
$$p(T | \bar \beta) = \prod_{i = 1}^{\mathbf{|T|}}\ p(Y = Y_i | X = X_i, \bar \beta)$$

###### Estimation of Parameters (Logistic Function)
The logistic function maps any real-valued score to a value between 0 and 1, so the result can be read as a probability. Given training data $T$ and parameters $\bar \beta$, the conditional likelihood is $p(T | \bar \beta) = \prod_{i = 1}^{\mathbf{|T|}}\ p(Y = Y_i | X = X_i, \bar \beta)$, where:

$$p(Y = 1 | X) = \frac{e^{\beta_0 + \sum_{i = 1}^M\ x_i\beta_i}}{e^{\beta_0 + \sum_{i = 1}^M\ x_i\beta_i} + 1}$$
$$p(Y = 0| X) = \frac{1}{e^{\beta_0 + \sum_{i = 1}^M\ x_i\beta_i} + 1}$$

and the maximum likelihood estimate of the parameters is:
$$\bar \beta^{*} = \arg\max_{\bar \beta}\ p(T | \bar \beta)$$
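
As a rough illustration of the formulas above, the sketch below computes $p(Y = 1 | X)$, the log of the conditional likelihood, and a maximum likelihood fit by plain gradient ascent. The function names, the toy data, the step count, and the learning rate are all assumptions made for this example, not something specified in the lecture.

```python
import math

def p_y1(x, beta0, beta):
    """p(Y = 1 | X = x) under the logistic model."""
    z = beta0 + sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (e^z + 1)

def log_likelihood(data, beta0, beta):
    """log p(T | beta): sum of log p(Y = y | X = x, beta) over the training pairs."""
    total = 0.0
    for x, y in data:
        p1 = p_y1(x, beta0, beta)
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total

def fit(data, steps=500, lr=0.1):
    """Gradient ascent on the log-likelihood (steps and lr are illustrative)."""
    m = len(data[0][0])
    beta0, beta = 0.0, [0.0] * m
    for _ in range(steps):
        g0, g = 0.0, [0.0] * m
        for x, y in data:
            err = y - p_y1(x, beta0, beta)  # gradient contribution of one example
            g0 += err
            for j in range(m):
                g[j] += err * x[j]
        beta0 += lr * g0
        beta = [b + lr * gj for b, gj in zip(beta, g)]
    return beta0, beta

# Made-up toy data: (feature vector, label) pairs.
toy = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.2, 0.9], 0), ([0.9, 0.1], 1)]
b0, b = fit(toy)
```

Working with the log of the likelihood gives the same argmax as the product form while avoiding numerical underflow on larger training sets.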

###### KNN
- Find the $k$ examples in the training set most similar to the text object to be classified
- Assign the category that is most common among these neighbor text objects
- Can be improved by taking the distance of each neighbor into account
- A way to directly estimate the conditional probability of a label
- Needs a similarity function to measure the similarity between two text objects

Assume $p(\theta_i | d)$ is locally smooth, i.e. it is the same for all documents $d$ in a region $R$. Then estimate $p(\theta_i | R)$ from the known categories in the region $R$:
$$p(\theta_i | R) = \frac{c(\theta_i, R)}{\mathbf{|R|}}$$

where $c(\theta_i, R)$ is the number of documents in $R$ labeled with category $\theta_i$ and $|R|$ is the total number of documents in $R$.
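
The sketch below is one possible way to put these pieces together. It assumes bag-of-words vectors with cosine similarity as the similarity function (the notes do not fix a particular one), treats the $k$ nearest neighbors as the region $R$, and returns both the majority category and the estimate $c(\theta_i, R) / |R|$. The training examples are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, train, k=3):
    """Return (majority label, estimated p(label | R)) using the k nearest neighbors as R."""
    query = Counter(text.lower().split())
    ranked = sorted(
        train,
        key=lambda ex: cosine(query, Counter(ex[0].lower().split())),
        reverse=True,
    )
    region = [label for _, label in ranked[:k]]
    label, count = Counter(region).most_common(1)[0]
    return label, count / len(region)

# Made-up labeled text objects.
train = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
    ("lunch on monday", "ham"),
]

label, prob = knn_classify("buy cheap pills", train, k=3)
```

This version gives every neighbor an equal vote; weighting votes by similarity would be one way to take neighbor distance into account.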

#### 11.2 - Discriminative Classifier Part II (Optional)

#### 11.3 - Evaluation Part I
###### General Evaluation Method (Cranfield Methodology)
- Have people create test collection where every document is tagged with desired categories
- Generate categorization decisions with the system, compare them with the human-made decisions, and quantify their similarity
- The higher the similarity, the better the results

###### Classification Accuracy
Measures the percentage of correct decisions. Arrange the decisions in a matrix whose columns are the categories $c_i$ and whose rows are the documents $d_j$. Each element is a binary (yes/no) system decision, which counts as correct when it matches the human decision for that document and category.
$$\text{accuracy} = \frac{\text{Number of Correct Decisions}}{\text{Number of Decisions}}$$
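
A minimal sketch of this computation, assuming the system and human decisions are stored as two boolean matrices of the same shape (documents as rows, categories as columns). The matrices below are made-up values.

```python
def accuracy(system, human):
    """Fraction of (document, category) decisions where the system matches the human."""
    total = correct = 0
    for sys_row, hum_row in zip(system, human):
        for s, h in zip(sys_row, hum_row):
            total += 1
            correct += (s == h)
    return correct / total

# Rows are documents d_j, columns are categories c_i; True means "assigned".
human = [[True, False, False], [False, True, False], [True, True, False]]
system = [[True, False, False], [False, False, False], [True, True, True]]

acc = accuracy(system, human)  # 7 of the 9 decisions agree
```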

###### Problems with Classification Accuracy
The cost of a decision error depends on the application; some decisions for some documents or categories are more important than others. Accuracy can also mislead on a skewed test set: a trivial system that places all instances in the majority category can score high accuracy without being useful.

|*|System (y) |System (n)|
|-|-|-|
|Human (+) |True Positive |False Negative |
|Human (-)  |False Positive |True Negative |

- Precision: $\frac{TP}{TP + FP}$; when the system says yes, how many of those decisions are correct?
- Recall: $\frac{TP}{TP + FN}$; of the decisions that should be yes, how many did the system make? Per document, this asks whether the doc has all the categories it should have.
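
The following sketch counts the four cells of the table from paired yes/no decisions and derives precision and recall from them. The data is made up, and returning 0.0 when a denominator is zero is a convention assumed here rather than something stated in the notes.

```python
def confusion_counts(system, human):
    """Count TP, FP, FN, TN over paired yes/no decisions."""
    tp = fp = fn = tn = 0
    for s, h in zip(system, human):
        if s and h:
            tp += 1
        elif s and not h:
            fp += 1
        elif h:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision(tp, fp):
    # Assumed convention: 0.0 when the system never says yes.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when there are no human-labeled positives.
    return tp / (tp + fn) if tp + fn else 0.0

# Made-up toy data: 2 positives among 10 decisions.
human = [True, True] + [False] * 8
system = [True, False, True] + [False] * 7
always_no = [False] * 10  # trivial system that never assigns the category

tp, fp, fn, tn = confusion_counts(system, human)
```

On this toy data the `always_no` system gets 8 of 10 decisions right yet has zero recall, which is the kind of case accuracy alone hides.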

###### Per-Category Evaluation
- Precision: when the system says yes for this category, how many of those decisions are correct
- Recall: has the category been assigned to all the docs of this category

###### F-Measure (Harmonic Mean)
$$F_{\beta} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$

$$F_1 = \frac{2PR}{P + R}$$

$F_1$ is the case $\beta = 1$, which weights precision $P$ and recall $R$ equally; it is the most commonly used.
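
A small sketch of the $F_{\beta}$ formula. The function name and the choice to return 0.0 when the denominator is zero are assumptions for this example.

```python
def f_beta(p, r, beta=1.0):
    """F-measure combining precision p and recall r; beta > 1 favors recall."""
    b2 = beta * beta
    denom = b2 * p + r
    return (b2 + 1) * p * r / denom if denom else 0.0

balanced = f_beta(0.5, 0.5)    # equal precision and recall
lopsided = f_beta(0.9, 0.1)    # high precision, low recall
```

With $P = 0.9$ and $R = 0.1$ the arithmetic mean would be 0.5, while $F_1$ stays close to the lower of the two values, so a system cannot score well by maximizing only one of them.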

#### 11.4 - Evaluation Part II
###### Macro Average Over All Categories
For each category, compute precision, recall, and $F_1$. Then aggregate the per-category values to get an overall precision, recall, and $F_1$.
- Arithmetic mean is dominated by high values
- Geometric mean is dominated by low values

The same process can be applied over all documents instead of all categories/topics. Macro averaging is typically more helpful and informative than micro averaging.

###### Micro Averaging of Precision and Recall
Pool all decisions, then compute precision and recall by counting the number of cases in each of the four cells of the table above. This treats all instances equally, but it is often less informative than macro averaging.
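
To make the difference concrete, the sketch below computes both averages from per-category counts. The category names and counts are made up, the macro average uses the arithmetic mean, and a zero denominator is treated as 0.0 by assumption.

```python
def safe_div(num, den):
    return num / den if den else 0.0

def macro_micro(counts):
    """counts maps category -> (tp, fp, fn). Returns (macro (P, R), micro (P, R))."""
    precisions = [safe_div(tp, tp + fp) for tp, fp, fn in counts.values()]
    recalls = [safe_div(tp, tp + fn) for tp, fp, fn in counts.values()]
    # Macro: average the per-category values, so each category counts equally.
    macro = (sum(precisions) / len(counts), sum(recalls) / len(counts))
    # Micro: pool the counts first, so each decision counts equally.
    tp_all = sum(tp for tp, _, _ in counts.values())
    fp_all = sum(fp for _, fp, _ in counts.values())
    fn_all = sum(fn for _, _, fn in counts.values())
    micro = (safe_div(tp_all, tp_all + fp_all), safe_div(tp_all, tp_all + fn_all))
    return macro, micro

# One large, easy category and one small, hard category.
counts = {"sports": (90, 10, 10), "rare_topic": (1, 9, 4)}
macro, micro = macro_micro(counts)
```

Here the micro averages are dominated by the large category, while the macro averages expose the weak performance on the small one.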

###### Ranking
Categorization can be passed to people for:
- Further editing
- Prioritizing a task

In such cases, if the system can give scores for its decisions, the results can be evaluated as a ranked list.
- Example: discovery of spam emails
- It is often more appropriate to frame the problem as a ranking problem instead of a categorization problem
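
As one possible way to evaluate scored decisions as a ranked list, the sketch below sorts items by system score and measures precision among the top $k$. Precision at $k$ is chosen here only as a simple illustrative measure (the notes do not name a specific one), and the scores and labels are made up.

```python
def precision_at_k(scored, k):
    """scored: list of (system score, human says yes). Precision among the top-k by score."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(1 for _, is_yes in top if is_yes) / len(top)

# Made-up spam scores with human judgments (True means the email really is spam).
scored = [(0.40, True), (0.95, True), (0.10, False), (0.80, False), (0.75, True)]

p_at_1 = precision_at_k(scored, 1)
p_at_3 = precision_at_k(scored, 3)
```

A person reviewing the list from the top down cares most about how many of the first few items are correct, which a single yes/no accuracy figure does not capture.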

###### Categorization Evaluation
- Commonly used measures for relative comparison
	- Accuracy, precision, recall, F-measure
	- Variation: per-document, per-category, micro, macro averaging
- Sometimes evaluating the results as a ranked list is more appropriate than evaluating them as categorization decisions

#### 11.5 - Sentiment Analysis
###### Opinion Representation
- Opinion Holder: whose opinion is this? Reviewer $X$
- Opinion Target: what opinion is about? Product $P$
- Opinion Content: what exactly is the opinion? Review Text $T$

###### Enriched Opinion Representation
- Opinion Context: in what situation was the opinion expressed (time or location)? Context $C$
- Opinion Sentiment: what does the opinion tell us about the opinion holder's feeling? Sentiment $S$
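
One way to picture the basic and enriched representations is as a single record with three required fields and two optional ones. The class name, field names, and example values below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    holder: str                      # whose opinion is this?
    target: str                      # what is the opinion about?
    content: str                     # the opinion text itself
    context: Optional[str] = None    # enriched: time, location, or other situation
    sentiment: Optional[str] = None  # enriched: e.g. "positive" or "negative"

# Made-up example: a basic representation with the sentiment filled in.
review = Opinion(
    holder="Reviewer X",
    target="Product P",
    content="Battery life is great.",
    sentiment="positive",
)
```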

###### Variations of Opinions
- Opinion Holder: Individual or Group
- Opinion Target: one entity, group of entities, one attribute of entity, or someone else's opinion
- Opinion Content: 
	- Surface Variation: one sentence or phrase, paragraph, article, etc.
	- Sentiment/Emotion Variation: positive or negative
- Opinion Context:
	- Simple Context: different time, location, etc.
	- Complex Context: potentially includes entire discourse context of an opinion

###### Different Kinds of Opinions in Text Data
- Author Opinion
- Reported Opinion
- Observed Opinion
- Indirect/Inferred Opinions

###### Opinion Mining Pipeline
- Take Text Data
- Generate Opinion Representation Set
	- Opinion Holder
	- Opinion Target
	- Opinion Content & Context to create Opinion Sentiment

###### Why Opinion Mining
- Decision Support
	- Help choose products or services
	- Help decide vote choices
	- Help design new policies
- Understand People
	- People preferences to better serve them
	- Advertising
- Voluntary Survey
	- Business Intelligence
	- Market Research
	- Data-Driven Social Science Research
	- Gain an advantage in any prediction task

#### 11.6 - Sentiment Classification
###### Task Definition
- Input is an opinionated text object
- Output is a sentiment tag / label
	- Polarity Analysis: categories like positive or negative, or numeric ratings
	- Emotion Analysis: categories based on mood
- Any text classification method can be used to do sentiment classification
- Improvements:
	- More sophisticated features for sentiment tagging
	- Consideration of the order of the categories

###### Common Text Features
- Character n-grams: general and robust to spelling/recognition errors, but less discriminative than words
- Word n-grams: Unigrams are often very effective, but not for sentiment analysis. Long n-grams are discriminative, but may cause overfitting.
- Part-of-Speech (POS) tag n-grams: n-grams built from tags like noun, verb, adjective, etc.
- Word Classes
	- Syntactic
	- Semantic
	- Empirical Word Clusters
- Frequent Patterns in Text
	- More specific/discriminative than words
	- May generalize better than pure n-grams
- Parse tree-based
	- Even more discriminative, but need to avoid overfitting

Optimizing the tradeoff between exhaustivity (high coverage) and specificity (discriminative power) is the major goal.
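
The two simplest feature types above can be sketched as follows. Whitespace tokenization and lowercasing are simplifying assumptions, and the function names are made up for this example.

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All contiguous word sequences of length n (whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = word_ngrams("It is not good", 1)
bigrams = word_ngrams("It is not good", 2)
trigrams_of_good = char_ngrams("good", 3)
```

The unigram `good` appears in both "it is good" and "it is not good", so it cannot separate them, while the bigram `not good` can. This illustrates why longer n-grams are more discriminative for sentiment, at the cost of being rarer and easier to overfit.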

###### Feature Construction for Text Categorization
- Feature design affects categorization accuracy significantly
- Combination of ML, error analysis, and domain knowledge is most effective:
	- Domain knowledge -> seed features, feature space
	- ML -> feature selection, feature learning
	- Error analysis -> feature validation
- NLP enriches text representation
- Optimize tradeoff between exhaustivity and specificity
	- Exhaustivity favors frequent features (high coverage)
	- Specificity favors discriminative features, which tend to be infrequent

#### 11.7 - Ordinal Logistic Regression (Optional)