# Features and Classification

### Materials from CSC401, Prof. Frank Rudzicz, University of Toronto

### http://www.cs.toronto.edu/~frank/csc401/lectures2018/4_Features_classification.pdf



### $\bullet$ Features

#### $\bullet$ Feature:

A measurable **variable** that is (rather, should be) **distinctive** of something we want to model.

We usually choose features that are useful to **identify** something, i.e. to do **_classification_**.

#### $\bullet$ Feature Vector: 

Values for several features of an observation can be put into a single vector.

Feature Vectors should be useful in **_discriminating_** between categories.
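
As a minimal sketch of the idea, the snippet below turns a piece of text into a fixed-length feature vector. The function name `extract_features` and the four features it computes (token count, exclamation marks, first-person words, average token length) are illustrative assumptions, not features prescribed by the lecture.

```python
def extract_features(text):
    """Map a piece of text to a fixed-length feature vector (hypothetical features)."""
    tokens = text.lower().split()
    n_tokens = len(tokens)
    # Count exclamation marks anywhere in the raw text.
    n_exclaim = text.count("!")
    # Count first-person words after stripping trailing punctuation.
    first_person = {"i", "me", "my", "we", "our"}
    n_first_person = sum(1 for t in tokens if t.strip(".,!?") in first_person)
    # Average token length, guarding against empty input.
    avg_len = sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
    return [n_tokens, n_exclaim, n_first_person, avg_len]


vector = extract_features("I love my new phone!")
print(vector)
```

Every observation maps to a vector of the same length, which is what lets a classifier compare observations across categories.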

#### $\bullet$ Preprocessing:

Preprocessing involves **preparing** your data to make feature extraction easier or more valid.

$\bullet$ **Features in Sentiment Analysis:**

Sentiment Analysis can involve detecting:

> **Stress** or __Frustration__ in a conversation.

>  __Interests__, __Confusion__ or __preferences__. Useful to marketers.

>  __Lies__.

### $\bullet$ Parts of Speech

$\bullet$ Linguists like to group words according to their **structural function** in building sentences.

$\bullet$ **Part-of-speech:** lexical category or morphological class.
> (e.g. Nouns collectively constitute a part of speech, called _Nouns_)

> (other examples: Verb, Adjective, Adverb, Preposition, Pronoun, Determiner, Conjunction, Particles, Auxiliaries, Numerals)

$\bullet$ **Contentful Parts-of-Speech**

**Contentful PoS** usually contain more words, and new words are readily added to them (i.e. an **open** class):

e.g. an _app_, to _google_ ...

$\bullet$ **Functional Parts-of-Speech**

> **Functional** PoS usually cover a small and fixed number of word types (i.e. a **closed** class)

> Their semantics depend on the contentful words with which they are used (e.g. _I am __on__ time_, _I am __on__ a robot_)

$\bullet$ **Grammatical Features**

There are several grammatical features that can be associated with words:

> Case 

> Person

> Number

> Gender

These features can **restrict** other words in a sentence.

$\bullet$ **Grammatical Features - Case:** The grammatical form of a noun or a pronoun

> Nominative: the **subject** of a verb

> Accusative: the **direct object** of a verb

> Dative: the **indirect** object of a verb

> Genitive: indicates **possession** (e.g. your **mom**'s book)

$\bullet$ **Grammatical Features - Person:** First, Second, Third

$\bullet$ **Grammatical Features - Number:** broad numerical distinction

$\bullet$ **Grammatical Features - Gender:** typically partitions **nouns** into classes associated with biological gender. **Not** typical in English.

$\bullet$ **Agreement**

> Parts-of-Speech should match in certain ways.

> **Articles** have to agree with the **number** of their **noun**.

> **Verbs** have to agree with their **subject**.

$\bullet$ **PoS tagging**

**Tagging**: the process of **assigning** a **part-of-speech** to each word in a sequence.

$\bullet$ The use of tagging:

> **Speech Synthesis:** how to pronounce text (the same word may have different pronunciation in different parts of speech)

> **Information Extraction:** quickly find names and relations

> **Machine Translation:** identifying grammatical chunks is useful

$\bullet$ **Tagging as Classification**

We have access to a **sequence of observations** and are expected to decide on the best assignment of a **hidden variable**, i.e. the PoS

$\bullet$ **Rule-based Tagging**

> 1. Start with a dictionary

> 2. Assign all possible tags to words from that dictionary

> 3. Write rules to selectively remove tags
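
The three steps above can be sketched as follows. The toy `LEXICON`, the tag names, and the single pruning rule (a word right after an unambiguous determiner cannot be a verb or auxiliary) are all assumptions made for illustration; a real rule-based tagger would use a large dictionary and many hand-written rules.

```python
# Step 1: a hypothetical toy dictionary mapping each word to every tag it may take.
LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"VERB", "NOUN"},
}


def rule_based_tag(words):
    """Return, for each word, the set of tags that survive the pruning rules."""
    # Step 2: assign all possible tags (unknown words default to NOUN here).
    candidates = [set(LEXICON.get(w.lower(), {"NOUN"})) for w in words]
    # Step 3: apply rules that selectively remove tags.
    for i in range(1, len(words)):
        if candidates[i - 1] == {"DET"}:
            pruned = candidates[i] - {"VERB", "AUX"}
            if pruned:  # never remove the last remaining tag
                candidates[i] = pruned
    return candidates


tags = rule_based_tag(["The", "can", "rusts"])
print(tags)
```

With only one rule, "rusts" stays ambiguous between noun and verb, which shows why practical systems need many rules.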

$\bullet$ **Statistical PoS tagging**

Determine the **most-likely** tag sequence $t_{1:n}$ by:

$$
\underset{t_{1:n}}{\operatorname{argmax}} P(t_{1:n}|w_{1:n}) = \underset{t_{1:n}}{\operatorname{argmax}} P(w_{1:n}|t_{1:n})P(t_{1:n}) \text{         (using Bayes' rule; } P(w_{1:n}) \text{ does not depend on the tags)}
$$ 

Thus, 

$$
\underset{t_{1:n}}{\operatorname{argmax}} P(t_{1:n}|w_{1:n}) \approx \underset{t_{1:n}}{\operatorname{argmax}} \displaystyle \prod_{i=1}^{n} P(w_i|t_i)P(t_i|t_{i-1}) \text{        (assuming independence and the Markov property)}
$$ 

(Note: the equation is simplified under two assumptions. Independence: each word depends only on its own tag. Markov: the current tag depends only on the previous tag.)
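
To make the product concrete, the sketch below scores every candidate tag sequence with $\prod_i P(w_i|t_i)P(t_i|t_{i-1})$ and returns the argmax by brute force. The probability tables are made-up, partial toy values, `<s>` is an assumed start-of-sentence marker, and the exhaustive search is chosen only for clarity; it is exponential in sentence length, so practical taggers use dynamic programming (the Viterbi algorithm) instead.

```python
from itertools import product

TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical transition probabilities P(t_i | t_{i-1}); "<s>" marks the start.
TRANSITION = {
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "DET"): 0.05, ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05,
    ("NOUN", "DET"): 0.1, ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.6,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}

# Hypothetical emission probabilities P(w_i | t_i); missing pairs count as 0.
EMISSION = {
    ("the", "DET"): 0.7,
    ("dog", "NOUN"): 0.4,
    ("barks", "VERB"): 0.3,
    ("barks", "NOUN"): 0.05,
}


def sequence_score(words, tags):
    """Product over i of P(w_i | t_i) * P(t_i | t_{i-1})."""
    score = 1.0
    previous = "<s>"
    for word, tag in zip(words, tags):
        score *= EMISSION.get((word, tag), 0.0) * TRANSITION.get((previous, tag), 0.0)
        previous = tag
    return score


def best_tags(words):
    """Brute-force argmax over all tag sequences of the right length."""
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: sequence_score(words, tags))


sentence = ["the", "dog", "barks"]
print(best_tags(sentence), sequence_score(sentence, best_tags(sentence)))
```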

### $\bullet$ Classification

**General Process**:

> We gather a big and relevant **training** corpus.

> We learn our **parameters** (e.g. probabilities) from that corpus to build our **model**.

> Once that model is fixed, we use those probabilities to evaluate **testing** data.

$\bullet$ **K-fold Cross Validation**

Split all the data into $K$ **partitions**, then iteratively test on each partition after training on the rest (report means and variances).
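
A minimal sketch of this procedure in plain Python follows. The names `k_fold_indices` and `cross_validate`, and the `train_fn` / `eval_fn` callables, are hypothetical placeholders for whatever model and metric are being evaluated; the folds here are contiguous and unshuffled, which assumes the data are already in random order.

```python
from statistics import mean, variance


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(data, labels, k, train_fn, eval_fn):
    """Train on k-1 folds, test on the held-out fold, report mean and variance."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(eval_fn(model,
                              [data[i] for i in test_idx],
                              [labels[i] for i in test_idx]))
    return mean(scores), variance(scores)


folds = list(k_fold_indices(10, 3))
print([test for _, test in folds])
```

Each item lands in exactly one test fold, so every observation is used for testing once and for training $K-1$ times.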

$\bullet$ **Types of Classifiers**:

Generative Classifiers model the world:

> Parameters set to maximize likelihood of training data.

> We can generate new observations from these (e.g. hidden Markov models)

Discriminative classifiers emphasize class boundaries:

> Parameters set to minimize error on training data. (e.g. support vector machines, decision trees)

$\bullet$ **Kernel Trick**

We can sometimes linearize a non-linear case by moving the data into a higher dimension with a **kernel function**.
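
As an illustrative sketch, consider one-dimensional points labelled by whether $|x| > 1$: no single threshold on $x$ separates them, but after mapping $x \mapsto (x, x^2)$ a linear rule on the second coordinate does. The mapping `feature_map`, the matching `kernel`, and the toy data are assumptions chosen for this example; the kernel computes the dot product in the mapped space without building the mapped vectors.

```python
def feature_map(x):
    """Explicit mapping into a higher-dimensional space: x -> (x, x^2)."""
    return (x, x * x)


def kernel(x, z):
    """Same dot product as feature_map(x) . feature_map(z), computed implicitly."""
    return x * z + (x * z) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy data: label +1 when |x| > 1, else -1. Not separable by a threshold on x.
points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, -1, -1, 1]

# In the mapped space, the linear rule "second coordinate > 1" separates the classes.
predictions = [1 if feature_map(x)[1] > 1.0 else -1 for x in points]
print(predictions == labels)
```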

$\bullet$ **Support Vector Machines (SVM)**

The **margin** is the width by which the boundary could be **increased** before it hits a training datum.

> The **maximum margin linear classifier** is the linear classifier with the maximum margin.

> The **support vectors** are those data points against which the margin is pressed.

> The bigger the margin, the less sensitive the boundary is to error.

*The maximum margin helps SVM **generalize** to situations where it is **impossible** to linearly separate the data.*

We simultaneously:

> **maximize the margin**

> **minimize the misclassification error**

(There is a straightforward approach to solving this system based on **quadratic programming**)
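
For reference, one standard way to write these two goals as a single quadratic program is the soft-margin formulation below. The notation is assumed here rather than taken from the notes: $w$ and $b$ define the linear boundary, $y_i \in \{-1, +1\}$ are the class labels, $\xi_i$ are slack variables measuring how far point $i$ violates the margin, and $C$ is a constant trading off the two goals. Minimizing $\|w\|^2$ maximizes the margin (whose width is $2/\|w\|$), and minimizing $\sum_i \xi_i$ penalizes misclassification.

$$
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0
$$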

$\bullet$ SVMs are empirically very accurate classifiers
 
> They perform well in situations where data are **static** (i.e. do not change over time)

> SVMs do not generalize as well to **time-variant** systems (Kernel functions tend to not allow for observations of **different lengths**, i.e. all data points have to be of the same dimensionality)

$\bullet$ **Decision Trees**

Decision trees consist of **rules** for classifying data that have many **attributes** (features).

> **Decision Nodes**: **Non-terminal**: Consist of a question asked of one of the attributes, and a branch for each possible answer.

> **Leaf Nodes**: **Terminal**: Consist of a single class/category, so no further testing is required.

**ID3** is an algorithm invented by Ross Quinlan to produce decision trees from data. It works as follows:

> 1. Compute the entropy of asking about each attribute.

> 2. Choose the attribute that reduces entropy the most (i.e. has the highest information gain).

> 3. Make a node asking a question of that attribute. 

> 4. Go to step 1 for each branch, excluding the chosen attribute.
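
The steps above can be sketched in a few lines of Python. This is a simplified illustration rather than Quinlan's full algorithm: it assumes categorical attributes stored in dictionaries, stops when a subset is pure or attributes run out, and uses a made-up four-row dataset. The names `entropy`, `information_gain`, `best_attribute`, and `id3` are chosen for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it (step 1)."""
    total = len(labels)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Step 2: the attribute whose question reduces entropy the most."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


def id3(rows, labels, attributes):
    """Steps 3-4: build a nested-dict tree, recursing without the chosen attribute."""
    if len(set(labels)) == 1:          # pure subset: leaf node
        return labels[0]
    if not attributes:                 # nothing left to ask: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3([r for r, _ in pairs], [l for _, l in pairs], remaining)
    return tree


# Hypothetical toy data: "outlook" fully determines the label, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["play", "play", "stay", "stay"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

In the resulting nested dictionary, each key at an even depth is a decision node (an attribute being asked about) and each string value is a leaf node (a class).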