### Predicting sentiment by topic: An intelligent restaurant review system
- All reviews for restaurant
- Break all reviews into sentences
- Select sentences about "sushi"
- Sentence Sentiment Classifier
- Average predictions
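The pipeline above can be sketched in a few lines of Python. This is only an illustration: `split_sentences`, `topic_sentiment`, and `toy_classifier` are hypothetical names, the sentence splitter is deliberately naive, and the stand-in classifier exists only so the example runs end to end.

```python
import re

def split_sentences(review):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]

def topic_sentiment(reviews, topic, classify):
    """Average the classifier's predictions over sentences mentioning `topic`.

    `classify` maps a sentence to +1 or -1.
    Returns None when no sentence mentions the topic.
    """
    sentences = [s for r in reviews for s in split_sentences(r)]
    about_topic = [s for s in sentences if topic.lower() in s.lower()]
    if not about_topic:
        return None
    predictions = [classify(s) for s in about_topic]
    return sum(predictions) / len(predictions)

# Hypothetical stand-in for a real sentence sentiment classifier.
def toy_classifier(sentence):
    return 1 if "great" in sentence.lower() else -1

reviews = [
    "The sushi was great. Service was slow.",
    "Bad sushi! The ramen was great.",
]
sushi_score = topic_sentiment(reviews, "sushi", toy_classifier)
```

Here one sushi sentence is predicted positive and one negative, so the average lands in the middle of the range.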

### Classifier applications
- Sentence from review
 - Input: $x$
- Classifier MODEL
 - Output: $y$ ( Predicted class: + or - )
- Output $y$ has more than 2 categories
- Spam filtering
 - Email -> Spam or No Spam
- Image classification
 - Image pixels -> Predicted object
- Personalized medical diagnosis

### Linear classifiers
- Simple threshold classifier
 - List of positive words: great, awesome, good, amazing..
 - List of negative words: bad, terrible, disgusting, sucks..
 - Count positive & negative words in sentence
   - If # of positive words > # of negative words: $\hat{y} = +$
   - Else: $\hat{y} = -$
- Problems with threshold classifier
 - How do we get list of positive/negative words?
 - Words have different degrees of sentiment:
   - Great > good
   - How do we weigh different words?
 - Single words are not enough:
   - Good -> Positive
   - Not good -> Negative
- Will use training data to learn a weight for each word
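A minimal sketch of the simple threshold classifier, assuming the short word lists from the notes and a crude tokenizer that only strips periods and commas. The function name `threshold_classify` is hypothetical, and treating ties as negative is an assumption.

```python
POSITIVE_WORDS = {"great", "awesome", "good", "amazing"}
NEGATIVE_WORDS = {"bad", "terrible", "disgusting", "sucks"}

def threshold_classify(sentence):
    """Predict '+' if positive words outnumber negative words, else '-'."""
    words = sentence.lower().replace(".", " ").replace(",", " ").split()
    n_pos = sum(w in POSITIVE_WORDS for w in words)
    n_neg = sum(w in NEGATIVE_WORDS for w in words)
    return "+" if n_pos > n_neg else "-"

# "not good" is counted as one positive word, so the prediction is wrong:
negated = threshold_classify("The sushi was not good.")
```

The last line illustrates the "single words are not enough" problem: the classifier sees "good" and ignores "not".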


$\text{Score}(x) = \text{weighted count of words in sentence}$
- If $\text{Score}(x) > 0:$
  - $\hat{y} = +$
- Else:
  - $\hat{y} = -$
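The scoring rule can be written as a weighted word count. The weights below are made up for illustration (in practice they are learned from training data), and `sentence_score` and `predict_sentiment` are hypothetical names.

```python
# Hypothetical weights; words not listed get weight 0.
weights = {"great": 1.5, "good": 1.0, "bad": -1.0, "terrible": -2.0}

def sentence_score(sentence, weights):
    """Score(x): sum of the weights of the words in the sentence."""
    words = sentence.lower().strip(".!?").split()
    return sum(weights.get(w, 0.0) for w in words)

def predict_sentiment(sentence, weights):
    """Predict '+' when Score(x) > 0, otherwise '-'."""
    return "+" if sentence_score(sentence, weights) > 0 else "-"

mixed = "The food was good but the service was terrible."
mixed_score = sentence_score(mixed, weights)  # 1.0 + (-2.0)
```

Because "terrible" carries a larger weight than "good", the mixed sentence ends up with a negative score.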

### Decision boundaries
- $\text{Score}(x) = 1.0 \cdot \text{#awesome} - 1.5 \cdot \text{#awful}$
 - $\text{Score}(x) < 0$ : Negative area
 - **$\text{Score}(x) = 1.0 \cdot \text{#awesome} - 1.5 \cdot \text{#awful} = 0$ : Decision boundary**
 - $\text{Score}(x) > 0$ : Positive area
- Decision boundary separates positive & negative predictions
 - For linear classifiers:
   - When 2 weights are non-zero -> **line**
   - When 3 weights are non-zero -> **plane**
   - When many weights are non-zero -> **hyperplane**
 - For more general classifiers
   - **more complicated shapes**
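As a quick check of the two-weight example, the sketch below places a few (#awesome, #awful) points relative to the line $1.0 \cdot \text{#awesome} - 1.5 \cdot \text{#awful} = 0$. The function names are hypothetical.

```python
def boundary_score(n_awesome, n_awful):
    """Score(x) = 1.0 * #awesome - 1.5 * #awful."""
    return 1.0 * n_awesome - 1.5 * n_awful

def region(n_awesome, n_awful):
    """Report which side of the decision boundary a point falls on."""
    s = boundary_score(n_awesome, n_awful)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "boundary"

points = [(2, 1), (1, 1), (3, 2)]
regions = [region(a, b) for a, b in points]
```

The point (3, 2) scores exactly zero, so it sits on the decision boundary itself.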

### Training a classifier = Learning the weights
- Evaluate on test examples
 - `Hide label` -> feed the sentence to the `Learned classifier` -> compare the prediction with the true label
 - Count correct predictions & mistakes
- Classification error & accuracy
 - Error measures fraction of mistakes
$$\text{error} = \frac{\text{# of mistakes}}{\text{Total # of sentences}}$$
   - Best possible value is 0.0
 - Often, measure **accuracy**
   - Fraction of correct predictions:
$$\text{accuracy} = \frac{\text{# of correct predictions}}{\text{Total # of sentences}}$$
   - Best possible value is 1.0
   - error = 1 - accuracy
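Both metrics are simple to compute from true and predicted labels. The four labels below are made-up data for illustration, and the function names are hypothetical.

```python
def classification_error(y_true, y_pred):
    """Fraction of predictions that are mistakes."""
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)

def classification_accuracy(y_true, y_pred):
    """Fraction of predictions that are correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["+", "-", "+", "+"]
y_pred = ["+", "-", "-", "+"]  # one mistake out of four

err = classification_error(y_true, y_pred)
acc = classification_accuracy(y_true, y_pred)
```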

### Random guess
- For binary classification:
 - Half the time, you'll get it right! (on average)
 - accuracy = 0.5
- For $k$ classes, accuracy = $1/k$
- Is a classifier with 90% accuracy good? Depends...
 - "90% of emails sent are spam!"
 - Predicting that every email is spam gets you 90% accuracy!
 - `Majority class prediction`
 - Looks like amazing performance when there is class imbalance
- So always dig in and ask hard questions about reported accuracies
 - Is there class imbalance?
 - How does it compare to a simple, baseline approach?
   - Random guessing
   - Majority class
   - ...
 - Most importantly: What accuracy does my application need?
   - What is good enough for my user's experience?
   - What is the impact of the mistakes we make?
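The two baselines mentioned above can be computed directly from the label distribution. This sketch uses a made-up set of ten labels with the 90% spam rate from the example; `majority_class_accuracy` is a hypothetical name.

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

labels = ["spam"] * 9 + ["not spam"] * 1

majority_baseline = majority_class_accuracy(labels)
random_baseline = 1 / len(set(labels))  # 1/k for k classes
```

A reported accuracy is only interesting to the extent that it beats both of these numbers.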

### Confusion matrix

- binary classification

<img src="https://docs.wso2.com/download/attachments/48292444/Binary_Classification_Matrix_Definition.png?version=1&modificationDate=1447821750000&api=v2">

- False Negative(FN) & False Positive(FP) have different impact!
 - Spam filtering : FN -> Annoying // **FP -> Email lost (high risk)**
 - Medical diagnosis : **FN -> Disease not treated // FP -> Wasteful treatment**
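A minimal sketch of how the four cells of the binary confusion matrix are counted, using made-up labels. The function name `confusion_counts` and the choice of "spam" as the positive class are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            # Predicted positive: right if the true label is positive.
            counts["TP" if t == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the true label is positive.
            counts["FN" if t == positive else "TN"] += 1
    return counts

y_true = ["spam", "spam", "not spam", "not spam", "spam"]
y_pred = ["spam", "not spam", "spam", "not spam", "spam"]

cells = confusion_counts(y_true, y_pred)
```

Keeping FP and FN as separate counts is what lets you weigh them differently, for example treating a lost email as costlier than a missed spam message.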
 
 
- multiclass classification

<img src="https://docs.wso2.com/download/attachments/48292444/Multiclass_Classification_Matrix_Definition.png?version=1&modificationDate=1447821750000&api=v2">

### Learning curves: How much data do I need?
- The more the merrier
 - But data quality is the most important factor
- Theoretical techniques sometimes can bound how much data is needed
 - Typically too loose for practical application
 - But provide guidance
- In practice:
 - More complex models require more data
 - Empirical analysis can provide guidance
 
 
- Learning curves: test error versus amount of training data

<img src="https://kwtrnka.files.wordpress.com/2015/05/learning_curve_basic_3.png">


- **Bias of model**
 - Even with infinite data, test error will not go to zero
- More complex models tend to have less bias
 - Sentiment classifier using single words can do OK, but...
 - It will never correctly classify: "The sushi was *not good*."
 - More complex model: consider pairs of words (bigrams)
 - Less bias -> potentially more accurate, but needs more data to learn
- A more complex model trained on little data will not do well, because it has more parameters to fit
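To make the single-word versus bigram contrast concrete, the sketch below scores the "not good" sentence with both feature sets. The weights are invented for illustration, and the helper names are hypothetical.

```python
def unigrams(sentence):
    """Single-word features."""
    return sentence.lower().strip(".!?").split()

def bigrams(sentence):
    """Features made of adjacent word pairs."""
    words = unigrams(sentence)
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def feature_score(features, weights):
    """Sum the weights of the features that appear."""
    return sum(weights.get(f, 0.0) for f in features)

sentence = "The sushi was not good."

# Hypothetical learned weights.
unigram_weights = {"good": 1.0}
bigram_weights = {"good": 1.0, "not good": -2.0}

unigram_only = feature_score(unigrams(sentence), unigram_weights)
with_bigrams = feature_score(unigrams(sentence) + bigrams(sentence), bigram_weights)
```

With single words only, the score stays positive; adding the bigram feature gives the model a way to flip it. The cost is many more possible features, and so many more weights to learn.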

### Class probabilities
-  Many classifiers provide a confidence level:
 - $P(y|x)$
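The notes do not say how $P(y|x)$ is obtained. One common choice, assumed here and not taken from the notes, is to pass the linear score through the logistic (sigmoid) function, so that a score of 0 maps to a probability of 0.5. The function name `prob_positive` is hypothetical.

```python
import math

def prob_positive(score):
    """Assumed mapping from Score(x) to P(y = + | x) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

p_boundary = prob_positive(0.0)   # on the decision boundary
p_confident = prob_positive(2.0)  # well inside the positive area
p_negative_class = 1.0 - p_confident  # P(y = - | x) for the same input
```

Under this assumption, points far from the decision boundary get probabilities close to 0 or 1, and points near it stay close to 0.5.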

### Regression ML block diagram
- `Training Data` -> `Feature extraction` -> $x$
- $x$ -> `ML model` ( $\hat{w}$ ) -> $\hat{y}$
- `Quality metric` -> `ML algorithm` -> RSS ( $y-\hat{y}$ ) -> $\hat{w}$
 - loop, updating the weights or model parameters

### Classification ML block diagram
- `Training Data` -> `Feature extraction` -> $x$ (word counts)
- $x$ -> `ML model` ( $\hat{w}$ ) -> $\hat{y}$ (predicted sentiment)
- `Quality metric` -> `ML algorithm` -> accuracy ( $y, \hat{y}$ ) -> $\hat{w}$ (weights for each word)
 - loop, updating the weights or model parameters
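The block diagram can be sketched as a small training loop. The notes do not name a specific learning algorithm, so this example assumes perceptron-style updates (adjust the weights only when a prediction is wrong). The four training sentences are made up, all function names are hypothetical, and accuracy is measured on the training data only to keep the example short.

```python
from collections import Counter

def extract_features(sentence):
    """Feature extraction: sentence -> word counts (x)."""
    return Counter(sentence.lower().strip(".!?").split())

def predict(weights, x):
    """ML model: +1 if the weighted word count is positive, else -1."""
    score = sum(weights.get(word, 0.0) * count for word, count in x.items())
    return 1 if score > 0 else -1

def accuracy(weights, data):
    """Quality metric: fraction of correct predictions."""
    correct = sum(predict(weights, extract_features(s)) == y for s, y in data)
    return correct / len(data)

def train(data, epochs=10):
    """ML algorithm (assumed perceptron-style): update weights on mistakes."""
    weights = {}
    for _ in range(epochs):
        for sentence, y in data:
            x = extract_features(sentence)
            if predict(weights, x) != y:
                # Move each word's weight toward the correct label.
                for word, count in x.items():
                    weights[word] = weights.get(word, 0.0) + y * count
    return weights

training_data = [
    ("The sushi was awesome", 1),
    ("The service was awful", -1),
    ("Awesome food", 1),
    ("Awful awful ramen", -1),
]

w_hat = train(training_data)
train_accuracy = accuracy(w_hat, training_data)
```

After training, `w_hat` holds one weight per word seen, with "awesome" ending up positive and "awful" negative on this toy data.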