<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Social Network Analysis – An Introduction](20.00-mlpg-Social-Network-Analysis–An-Introduction.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Case Study 2: Develop and evaluate an Anomaly Detection system](21.02-mlpg-CS2-Develop-and-evaluate-an-Anomaly-Detection-system.ipynb) ]>

# 21.1. Case Study 1: ML system design for email Spam detection

**System design for building a Spam classifier:**
* **Step 1**: _Construct a spam vector, typically with 10,000 to 50,000 entries, in which each entry is a word that is indicative of spam; this is a manual effort_. For example,<br>
  ```
  Spam vector = (buy, deal, discount, offer, early-bird, lucky, hurry, etc)
  ```
  
* **Step 2**: _Construct a vector for each email_: If a word in the email is found in the spam vector, assign its respective entry a 1, else 0.<br>
  _Build feature vectors X_: Consider all the words in an email, whether they are indicative of spam or not. For example,<br>
  ```
  X = (0, 1, 1, 0, …, 1, …, 1, …) for words in an email (Bala, buy, deal, company, …, discount, …, offer, …)
  ```
          
* **Step 3**: _Once we have the “x” vectors for all our emails (covering both the spam and not-spam categories), we train our algorithm, which can then be used to classify whether an email is spam or not_. For example,
  ```
  From: cheapdeals@buyonline.com
  To: bala.chandran@gmail.com
  Subject: Hurry, buy now!
  Deal of the day! Now or never.
  ```
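
The three steps above can be sketched in plain Python. This is a minimal illustration, not the book's implementation: it assumes a fixed-length encoding with one entry per spam-vector word, and it uses logistic regression as the learning algorithm (the text does not name one). `SPAM_VOCAB`, the helper functions, and the six training emails are all hypothetical.

```python
import math

# Hypothetical, tiny spam vector; a real one would have 10,000 to 50,000 entries.
SPAM_VOCAB = ["buy", "deal", "discount", "offer", "early-bird", "lucky", "hurry"]

def tokenize(text):
    """Lower-case the text, split on whitespace, and strip surrounding punctuation."""
    return {word.strip(".,!?:;\"'()") for word in text.lower().split()}

def to_feature_vector(text, vocab=SPAM_VOCAB):
    """Step 2: entry j is 1 if vocab[j] occurs in the email, else 0."""
    words = tokenize(text)
    return [1 if term in words else 0 for term in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Step 3: fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x_i, y_i in zip(X, y):
            err = sigmoid(b + sum(w_j * x_j for w_j, x_j in zip(w, x_i))) - y_i
            grad_b += err
            for j, x_j in enumerate(x_i):
                grad_w[j] += err * x_j
        w = [w_j - lr * g / len(X) for w_j, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(text, w, b):
    """Estimated probability that the email is spam."""
    x = to_feature_vector(text)
    return sigmoid(b + sum(w_j * x_j for w_j, x_j in zip(w, x)))

# Made-up training emails (1 = spam, 0 = not spam).
emails = [
    ("Hurry, buy now! Deal of the day!", 1),
    ("Lucky you: discount offer inside", 1),
    ("Early-bird offer, buy today", 1),
    ("Meeting notes from the company review", 0),
    ("Lunch tomorrow with the project team?", 0),
    ("Please find the quarterly report attached", 0),
]
X = [to_feature_vector(text) for text, _ in emails]
y = [label for _, label in emails]
w, b = train_logistic_regression(X, y)

test_email = "Subject: Hurry, buy now! Deal of the day! Now or never."
print(to_feature_vector(test_email))           # which spam words were found
print(predict_proba(test_email, w, b) >= 0.5)  # True means "classified as spam"
```

With a realistic vocabulary the feature vectors are very sparse, so in practice a library vectorizer and classifier would likely replace these hand-written helpers.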

**How to improve the accuracy of spam classifiers?**
* Collect lots of data (doesn't always help)
* Develop sophisticated features (e.g., use email header data in spam emails)
* Develop algorithms to process input data in different ways (e.g., recognizing misspellings in spam)
* However, it's difficult to tell in advance which of these options will be most helpful

**The importance of numerical evaluation:**
* Should the following words be treated as the same word (i.e., should we apply stemming)?<br>
  `discount`/`discount`s/`discount`ing/`discount`ed<br>
  `fail`/`fail`ing/`fail`ed<br>
  `univer`se/`univer`sal/`univer`sity
* Error analysis may not help decide if this is likely to improve performance
* The only solution is to try it and see if it works or not
* We need a numerical evaluation (e.g., CV error) of the algorithm's performance with and without stemming, and with and without distinguishing upper case from lower case. For example,
  ```
  without Stemming: 5% error
  with Stemming: 3% error
  Distinguish upper vs. lower case (Dad/dad): 3.2% error
  ```
* It's very important to get error results as a single numerical value to assess the model's performance
* From the example above, we should add stemming as a new feature into our model as we see a significant improvement (5% down to 3% error), whereas we should avoid distinguishing upper case vs. lower case, since it gives a 3.2% error
* Hence, try new things, get a numerical value for errors, and based on the result decide whether we want to keep the new feature or not
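
The idea of stemming and of choosing between variants by a single number can be sketched as follows. `crude_stem` is a hypothetical, deliberately naive suffix stripper written for illustration only (a real project would use an established stemmer), and the error values are simply the figures quoted above.

```python
def crude_stem(word):
    """Very rough stemmer: lower-case, then strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Each group collapses to a single stem.
print({crude_stem(w) for w in ["discount", "discounts", "discounting", "discounted"]})
print({crude_stem(w) for w in ["fail", "failing", "failed"]})

# CV error for each variant, taken from the example figures in the text.
cv_error = {
    "without stemming": 0.05,
    "with stemming": 0.03,
    "distinguish upper vs. lower case": 0.032,
}

# A single number per variant makes the decision mechanical.
best_variant = min(cv_error, key=cv_error.get)
print(best_variant)
```

Note that a more aggressive stemmer could also merge unrelated words such as universe and university, which is exactly why the decision should rest on the measured CV error rather than on intuition.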

**Error Metrics for Skewed Classes:**
* It's sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm. _For example, in predicting a cancer diagnosis where 0.5% of the examples have cancer, suppose our learning algorithm has a 1% error. If we were to simply classify every single example as 0, the error would drop to 0.5% even though we did not improve the algorithm_
* This usually happens with skewed classes, i.e., when a class is very rare in the entire dataset, or, when we have a lot more examples from one class than from the other classes (**outliers**)

* **Precision/Recall** are the metrics used for this purpose
  - Actual: 1, Predicted: 1 → True Positive
  - Actual: 0, Predicted: 0 → True Negative
  - Actual: 0, Predicted: 1 → False Positive
  - Actual: 1, Predicted: 0 → False Negative
<br><br>
* **Precision**: Of all the patients for whom we predicted y=1, what fraction actually has cancer?<br>
  $Precision = \frac{(True \ Positives)}{(Total \ \# \ of \ predicted \ positives)} = \frac{(True \ Positives)}{(True \ Positives \ + \ False \ Positives)}$

* **Recall**: Of all the patients that actually have cancer, what fraction did we correctly predict as having cancer?<br>
  $Recall = \frac{(True \ Positives)}{(Total \ \# \ of \ actual \ positives)} = \frac{(True \ Positives)}{(True \ Positives \ + \ False \ Negatives)}$

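* **Accuracy**: The fraction of all examples that are classified correctly. If a model is performing well, then both Precision and Recall will be high, and hence the Accuracy will be high as well (with skewed classes, however, a high Accuracy alone does not imply a good model):<br>
  $Accuracy = \frac{(True \ Positives \ + \ True \ Negatives)}{(Total \ Population)}$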

* If an algorithm predicts only negatives, then the precision and F1-score cannot be defined (there are no predicted positives, so precision becomes 0/0, which is undefined)
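
The sketch below computes these metrics from raw labels and replays the cancer example on 1,000 patients, 5 of whom (0.5%) have cancer. The function name and the exact split of the model's 10 errors (1% error) into 9 false positives and 1 false negative are assumptions made for illustration; the text gives only the overall error rates.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1; undefined values are None."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Precision is 0/0 when nothing is predicted positive, so report None.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 1,000 patients: the first 5 have cancer (label 1), the remaining 995 do not.
y_true = [1] * 5 + [0] * 995

# "Classify every example as 0": lower error, but it never finds a single case.
always_zero = classification_metrics(y_true, [0] * 1000)

# Hypothetical learned model with 1% error: finds 4 of 5 cases, 9 false alarms.
y_model = [1, 1, 1, 1, 0] + [1] * 9 + [0] * 986
model = classification_metrics(y_true, y_model)

print("always zero:", always_zero)
print("model:      ", model)
```

The always-zero classifier wins on accuracy, yet its recall is 0 and its precision is undefined, which is the reason accuracy alone is a poor guide for skewed classes.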

**Precision and Recall Trade-off:**
* One way to increase the confidence of our positive predictions in a 2-class problem is to increase the threshold on the predicted probability $h_\theta(x)$:
  - Predict 1, if $h_\theta(x) \geq 0.7$
  - Predict 0, if $h_\theta(x) < 0.7$

  This means we predict cancer only if the patient has at least a 70% chance of having it, and by doing this, we will have `higher precision` but `lower recall`
* On the other hand, we can lower the threshold to do the opposite
  - Predict 1, if $h_\theta(x) \geq 0.3$
  - Predict 0, if $h_\theta(x) < 0.3$

  This way, we get a very **safe prediction** (fewer cancer cases are missed), which results in `higher recall` but `lower precision`
* The impact of the threshold is as follows:
  - The greater the threshold, the greater the precision, and the lower the recall
  - The lower the threshold, the greater the recall, and the lower the precision
* If both precision and recall are equally important, then use F-Value which will produce a single number
* F-Value is also known as F-Score or F1-Score<br>
  $F1\text{-}Score = 2 \times \frac{(Precision \ \times \ Recall)}{(Precision \ + \ Recall)}$
* Both Precision and Recall must be large for F1-Score to be large
* Always choose the threshold by evaluating Precision and Recall on the CV set, not on the test set, so that the test set is not biased by the choice
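
The trade-off can be made concrete by sweeping the threshold over a set of predicted probabilities and recording precision, recall, and F1-Score at each setting. The labels and scores below are made up to stand in for a CV set, and the function name is hypothetical.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Predict 1 when score >= threshold, then compute precision, recall and F1."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up CV labels and predicted probabilities (illustrative only).
y_cv = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05]

results = {}
for threshold in (0.3, 0.5, 0.7):
    results[threshold] = precision_recall_f1(y_cv, scores, threshold)
    precision, recall, f1 = results[threshold]
    print(f"threshold={threshold}: precision={precision:.3f}, "
          f"recall={recall:.3f}, F1={f1:.3f}")

# Pick the threshold with the best F1 on the CV set (undefined F1 ranks last).
best_threshold = max(
    results, key=lambda t: results[t][2] if results[t][2] is not None else -1.0
)
print("best threshold by F1:", best_threshold)
```

On this toy data precision rises and recall falls as the threshold increases, and the F1-Score selects the setting that balances the two; with different data the best threshold would of course differ.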
