# Classification Based on Item Attributes

<ul>
    <li>Collaborative filtering has a "rich get richer" effect and fails to recognize newer bands </li> 
    <li>Music Genome Project broke songs down into quantitative attributes </li>
    <li>For example: might have a "Genre" attribute with 1 = Rock, 2 = Pop, etc.</li>
</ul> 

Flaw with our usual methods: <b>DISTANCE MEANS NOTHING WITH CATEGORICAL DATA!</b> (Rock = 1 is not "closer" to Pop = 2 than to whatever genre is coded 3.) 
Instead, we have to <b>split the categories out</b> and put each one on its own numerical scale (e.g. 2/5 Rock, 4/5 Rap) 
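
A minimal sketch of "splitting the categories out", in its simplest 0/1 form. The genre list and the `split_genre` helper are hypothetical and only illustrate the idea; graded scores like 2/5 Rock would replace the 0/1 values with a number per genre.

```python
# Hypothetical genre list; the integer codes (1 = Rock, 2 = Pop, ...) are dropped.
GENRES = ["Rock", "Pop", "Rap"]

def split_genre(genre):
    """Turn one genre label into one 0/1 column per genre."""
    return [1 if g == genre else 0 for g in GENRES]

def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

rock, pop, rap = (split_genre(g) for g in GENRES)

# With integer codes, Rock (1) looks closer to Pop (2) than to Rap (3).
# After splitting, every pair of different genres is equally far apart.
print(manhattan(rock, pop), manhattan(rock, rap))
```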

Once we have done this, we can find the <b>distance</b> between any two songs with normal distance methods (Manhattan, cosine similarity, etc.). This is effectively "rating" the items themselves across a series of attributes.
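
As a rough sketch, here is how two of those distance methods might look when songs are stored as attribute dictionaries. The song names, attribute names, and scores are all made up for illustration.

```python
from math import sqrt

# Hypothetical attribute scores on a 1-5 scale.
songs = {
    "Song A": {"rock": 4.0, "rap": 1.0, "piano": 2.0},
    "Song B": {"rock": 3.5, "rap": 1.5, "piano": 2.5},
    "Song C": {"rock": 1.0, "rap": 5.0, "piano": 1.0},
}

def manhattan(a, b):
    """Sum of absolute differences over the attributes both songs share."""
    return sum(abs(a[k] - b[k]) for k in a if k in b)

def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors (1.0 = same direction)."""
    keys = [k for k in a if k in b]
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = sqrt(sum(b[k] ** 2 for k in keys))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no usable attributes to compare
    return dot / (norm_a * norm_b)

# Song B should come out much closer to Song A than Song C does.
print(manhattan(songs["Song A"], songs["Song B"]))
print(manhattan(songs["Song A"], songs["Song C"]))
```

Note that Manhattan is a distance (smaller means more alike), while cosine similarity runs the other way (larger means more alike).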

We also gain the ability to <b>EXPLAIN</b> our recommendations! The attributes (vocal style, keyboard intensity, etc.) which are closest can be said to "explain" the recommendation. 

## Problem of Scale

Different variables can sit on very different <b>scales</b>. For example: $100,000 net worth vs 3 cars owned. The net worth would dominate our distance calculations.
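
A tiny illustration of the dominance problem, using three made-up people described by (net worth, cars owned):

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# (net worth in dollars, cars owned); all three people are hypothetical.
alice = (100_000, 3)
bob = (105_000, 3)    # same number of cars, 5% more net worth
carol = (100_000, 0)  # same net worth, no cars at all

alice_bob = manhattan(alice, bob)      # driven entirely by net worth
alice_carol = manhattan(alice, carol)  # driven entirely by cars owned
print(alice_bob, alice_carol)
```

The raw distances are 5000 and 3, so Carol looks far more similar to Alice than Bob does, even though going from 3 cars to 0 is arguably the bigger difference.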

How can we fix this tremendous problem?? 

## Normalization!

One common method of normalization is <b>bringing data between 0 and 1</b> (min-max scaling). The formula looks like this: 

$$ \frac{x - Min_x}{Max_x - Min_x} $$ 
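
A minimal sketch of this formula in Python. The function name is my own, and returning all zeros for a constant column is an assumption made to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```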

This method, however, is <b>NAIVE!</b> Ok, maybe it's appropriate sometimes, but the <b>STANDARD SCORE (Z-SCORE)</b> is often better: it tells us how many standard deviations a value sits from the mean. The formula is <b>shown below</b> (the denominator is the standard deviation):

$$ Z_s = \frac{x - \bar{x}}{\sqrt{\frac{\sum_i{(x_i - \bar{x})^2}}{card(x)}}} $$ 
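
A small sketch of the standard score, assuming the population standard deviation (each deviation from the mean is squared, and the sum is divided by the number of values, card(x)). The function name and the zero-deviation fallback are my own choices.

```python
from math import sqrt

def standard_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation: divide by n, not n - 1.
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        # Assumption: if every value is identical, score them all as 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Mean is 5 and standard deviation is 2 for this list.
print(standard_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```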

Outliers often throw off the mean and standard deviation, so we use the <b>modified standard score</b>, which swaps in the median and the absolute standard deviation (asd): 

$$asd = \frac{1}{card(x)}\sum_i{|x_i - median|} $$ 

Modified standard score is <b>(EACH VALUE - MEDIAN) / (Absolute Standard Deviation)</b>, where the absolute standard deviation is the <i>asd</i> defined above.
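
A sketch of the modified standard score as described here: subtract the median, then divide by the absolute standard deviation. The function name and the zero-asd fallback are assumptions of mine.

```python
from statistics import median

def modified_standard_scores(values):
    """Return (x - median) / asd for every x in values."""
    med = median(values)
    # Absolute standard deviation: mean absolute distance from the median.
    asd = sum(abs(v - med) for v in values) / len(values)
    if asd == 0:
        # Assumption: if every value equals the median, score them all as 0.
        return [0.0 for _ in values]
    return [(v - med) / asd for v in values]

# Median is 4 and asd is 4 here; the outlier 15 does not move the median.
print(modified_standard_scores([1, 3, 5, 15]))  # [-0.75, -0.25, 0.25, 2.75]
```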

### When to normalize? 

We should normalize when: 
<ul>
    <li>The method calculates distance based on the values of the features </li> 
    <li>The features are on different scales </li> 
</ul> 

TRADEOFFS: Normalization isn't always necessary, and in fact sometimes <i>reduces</i> accuracy. There is also a <b>computational cost</b> involved with normalization to consider. 

## Back to Pandora! :D 

"Likes" and "Dislikes" can oftentimes group themselves along parameters. For example: if we plot "driving beat" against "dirty guitar" (1-5 scale), we may find the Likes and Dislikes cluster together. 

The simplest method is to <b>assume a mystery item's class (like vs dislike) will be the same as its nearest neighbor's!</b> This is arguably the most rudimentary form of classification. 
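
A minimal nearest-neighbor sketch along those lines. The (driving beat, dirty guitar) scores and labels are invented, Manhattan distance is just one possible choice, and ties go to whichever rated song appears first.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbor_classify(item, labeled):
    """Return the label of the rated item closest to `item`."""
    _, label = min(labeled, key=lambda pair: manhattan(item, pair[0]))
    return label

# Hypothetical (driving beat, dirty guitar) scores on a 1-5 scale.
rated = [
    ((5, 4), "like"),
    ((4, 5), "like"),
    ((1, 2), "dislike"),
    ((2, 1), "dislike"),
]

print(nearest_neighbor_classify((4, 4), rated))  # like
print(nearest_neighbor_classify((1, 1), rated))  # dislike
```

Since this is distance-based, the attributes should be on comparable scales (or normalized) before classifying.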

### A classifier is a program that uses an object's attributes to figure out which class it belongs to! 

Possible applications (note: classification we have done so far is item-based): 
<ul> 
    <li>Twitter sentiment classification </li> 
    <li>Automatic identification of people in photographs</li> 
    <li>Targeted political ads (classifying people into demographics)</li> 
    <li>Targeted marketing (likely buyers)</li> 
    <li>Health and the Quantified Self</li> 
    <li>Terrorist vs Non-Terrorist (these algorithms could use some work since I get pulled aside every flight I take)</li> 
</ul> 

Modified standard scores are commonly used to normalize attributes before classification. 