# Data Mining
General ideas, mostly from Lecture 03.

[Book](https://www-users.cse.umn.edu/~kumar001/dmbook/index.php) by Tan et al.
I ordered a copy and got a strange reprint of parts of the book!

## Hypothesis Testing

$H_0$ = null hypothesis, says the predictor variable has no effect.  
$H_1$ = alternative hypothesis, says the predictor variable has some effect.  

For example, when comparing two group means:  
$H_0$ says $mean_1 = mean_2$.  
$H_1$ says $mean_1 > mean_2$ (a one-sided alternative).  

The experimenter tries to reject the null hypothesis, 
leaving the alternative as the best explanation.

From a [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996198/) 
in Industrial Psychiatry Journal:
* A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population.
* A type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population. 

In other words:
* Type I error = False Positive: I claimed a discovery when actually there was nothing there.
* Type II error = False Negative: I overlooked the effect and missed out on a Nobel Prize.

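$\alpha$ = probability of a Type I error, i.e. the significance level
(typically set to 0.05, so $H_0$ is rejected when p < 0.05).  
$\beta$ = probability of a Type II error
(typically kept below 0.20, i.e. power $1-\beta$ of at least 80%, for some assumed difference between means).  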

Confusion matrix, showing where each error type lands. Some people draw it transposed (predictions as rows), so check the axes.  

|          | Pred pos | Pred neg |  
| :------- | :-------- | :--------- |
| Is pos     |     TP      |$\beta$ FN |   
| Is neg    | $\alpha$ FP |    TN      |  
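
As a rough illustration of how the four cells get filled, here is a minimal Python sketch. The function name `confusion_counts` and the label lists are made up for this example, and it assumes labels are coded 1 = positive, 0 = negative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1  # missed a real positive (Type II error)
        elif truth == 0 and pred == 1:
            fp += 1  # false alarm (Type I error)
        else:
            tn += 1
    return tp, fn, fp, tn


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn}")  # first row of the table: Is pos
print(f"FP={fp} TN={tn}")  # second row of the table: Is neg
```

The two print lines follow the same row order as the table above.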

Given two large enough samples, 1 and 2, drawn from one population, 
the difference between their sample means is approximately normally distributed around zero (central limit theorem).
Thus, a difference value far out in the tail is unlikely under $H_0$.

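Test statistic: $z = (mean_1 - mean_2) / \sqrt{ std_1^2/n_1 + std_2^2/n_2 }$.  
Here, $std_i$ is each population std (estimated by the sample std when the samples are large), and each variance $std_i^2$ is divided by its own sample size $n_i$.  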

When aiming to reject the null hypothesis,
define the rejection region as: test statistic > the critical value that leaves an upper-tail area of $\alpha = 0.05$ (about 1.645 for a one-sided z-test).  
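
A minimal sketch of this test in Python, standard library only. It assumes the standard error is sqrt(var_1/n_1 + var_2/n_2) with population variances (`pvariance`) and a one-sided alternative. The function name `two_sample_z` and the sample values are made up for illustration. With samples this small a t-test would usually be the better choice.

```python
import math
from statistics import NormalDist, mean, pvariance


def two_sample_z(sample_1, sample_2):
    """z = (mean_1 - mean_2) / sqrt(var_1/n_1 + var_2/n_2)."""
    n_1, n_2 = len(sample_1), len(sample_2)
    standard_error = math.sqrt(
        pvariance(sample_1) / n_1 + pvariance(sample_2) / n_2
    )
    return (mean(sample_1) - mean(sample_2)) / standard_error


alpha = 0.05
# One-sided critical value: the point with upper-tail area alpha (about 1.645).
critical_value = NormalDist().inv_cdf(1 - alpha)

# Made-up measurements, for illustration only.
treated = [5.1, 5.6, 6.0, 5.8, 5.4, 6.2, 5.9, 5.7]
control = [5.0, 5.2, 4.9, 5.3, 5.1, 4.8, 5.2, 5.0]

z = two_sample_z(treated, control)
reject_null = z > critical_value
print(f"z = {z:.2f}, critical value = {critical_value:.3f}, reject H0: {reject_null}")
```

If `z` lands in the rejection region, the observed difference would be unlikely if both samples came from the same population.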

## Learning
Induction: infer the rules from the training data. Build a model.   
Deduction: apply the model to unseen data.  
Supervised: given gold standard labels (as in classification).  
Unsupervised: given no labels for the data (as in clustering).  

Eager Learning: Build a model on the training data before any prediction is requested.   
Lazy Learning: Wait until a prediction is needed, e.g. K Nearest Neighbors, recommenders, spam detection, online and continuously learning systems. 
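
To make the lazy idea concrete, here is a small K Nearest Neighbors sketch in plain Python. The class name `LazyKNN`, the Euclidean distance, majority voting, and the toy points are all assumptions for illustration. Note that `fit` only stores the data and all of the work happens at prediction time.

```python
import math
from collections import Counter


class LazyKNN:
    """Lazy learner: fit() only stores the data; predict() does the work."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # No model is built here, the training set is just remembered.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # Rank the stored points by Euclidean distance to the query.
        by_distance = sorted(
            zip(self.points, self.labels),
            key=lambda pair: math.dist(pair[0], query),
        )
        nearest_labels = [label for _, label in by_distance[: self.k]]
        # Majority vote among the k nearest neighbors.
        return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D points in two well-separated groups (made up).
model = LazyKNN(k=3).fit(
    [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)],
    ["a", "a", "a", "b", "b", "b"],
)
print(model.predict((1.1, 1.0)))
print(model.predict((4.0, 4.0)))
```

An eager learner would instead do its computation inside `fit` and keep only the fitted parameters.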

Classification:   
After supervised learning, 
classifiers find a (linear or non-linear) decision boundary.  
Bayesian classifiers find the Bayes decision boundary, influenced by priors.  
SVM finds a boundary with a margin.  
Rules-based classifiers use greedy algorithms to define a set of rules.  
Tree-based classifiers arrange the rules in trees: Decision Tree, C4.5, Random Forest.   
Associative classification algorithms build predicate logic: L3, CMAR, 
[CPAR](http://hanj.cs.illinois.edu/pdf/sdm03_cpar.pdf).  

Prediction:  
These use another form of supervised learning: learning from time t-1 to predict time t.   
Time series forecasting.  
Recommender systems.  


## Statistical Measures

Precision = (Predict Yes Correctly)/(Total Yes Predictions) = (TP)/(TP+FP)

Recall = Sensitivity = (Predict Yes Correctly)/(Total Actually Yes) = (TP)/(TP+FN)

Specificity = (Predict No Correctly)/(Total Actually No) = (TN)/(TN+FP)

Here P = TP+FN (all actual positives) and N = FP+TN (all actual negatives).  
TPR = Sensitivity = Recall = TP/P = TP/(TP+FN) = 1-FNR  
FPR = FP/N = FP/(FP+TN) = rate of mistaking negatives for positives  
FNR = FN/P = FN/(TP+FN) = rate of missing the positives  
TNR = Specificity = TN/N = TN/(FP+TN) = 1-FPR  
Precision = PosPredictValue = TP/all_pred_pos = TP/(TP+FP) = 1-FDR  
FDR = FalseDiscovRate = FP/all_pred_pos = FP/(TP+FP)  
Accuracy = (TP+TN)/all = (TP+TN)/(P+N)  
Balanced accuracy = (TPR+TNR)/2  
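
The rates above can be sketched as one small Python function. The name `rates` and the counts are hypothetical, and the sketch assumes every denominator is non-zero (no guard against empty classes).

```python
def rates(tp, fn, fp, tn):
    """Compute the listed rates from the four confusion-matrix counts."""
    p = tp + fn  # all actual positives
    n = fp + tn  # all actual negatives
    tpr = tp / p
    tnr = tn / n
    return {
        "TPR": tpr,
        "FNR": fn / p,
        "FPR": fp / n,
        "TNR": tnr,
        "precision": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
        "balanced_accuracy": (tpr + tnr) / 2,
    }


# Hypothetical imbalanced data: 100 actual positives, 900 actual negatives.
r = rates(tp=80, fn=20, fp=90, tn=810)
for name, value in r.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy looks healthy while precision is below one half, which is one reason to look at more than one measure on imbalanced data.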

For binary classification:

F-score, or F1, is the harmonic mean of precision and recall   
  = 2 / [ (1/prec) + (1/recall) ]  
  = 2 * (prec * recall) / [ prec + recall ]  
  = 2 * TP / ( 2 * TP + FN + FP )   
  = TP / [ TP + 1/2 * (FN + FP) ]   
F1 gives no credit for TN.   
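
A quick check in Python that the harmonic-mean form and the count form agree. The counts are hypothetical and `f1_from_counts` is just an illustrative name.

```python
from statistics import harmonic_mean


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); TN never appears."""
    return 2 * tp / (2 * tp + fp + fn)


tp, fp, fn = 80, 90, 20  # hypothetical counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1_harmonic = harmonic_mean([precision, recall])
f1_counts = f1_from_counts(tp, fp, fn)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"F1 via harmonic mean = {f1_harmonic:.4f}")
print(f"F1 via counts        = {f1_counts:.4f}")
```

Because TN is not an argument of `f1_from_counts`, adding more true negatives cannot change the score.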

MCC is the Matthews Correlation Coefficient   
 = ( [ TP * TN ] - [ FP * FN ] ) / sqrt [ (TP+FP)(TP+FN)(TN+FP)(TN+FN) ]
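
A small Python sketch of MCC, assuming the usual form (TP*TN - FP*FN) / sqrt[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]. Returning 0 when the denominator is zero is a common convention that is assumed here, not something stated in these notes. The counts are made up.

```python
import math


def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: score 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


print(f"imbalanced example: {mcc(tp=80, fn=20, fp=90, tn=810):.3f}")
print(f"perfect classifier: {mcc(tp=50, fn=0, fp=0, tn=50):.1f}")
print(f"always wrong:       {mcc(tp=0, fn=50, fp=50, tn=0):.1f}")
```

Unlike F1, all four cells enter the formula, so MCC does give credit for TN.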

Jaccard index   
 = TP / ( TP + FP + FN )  
Jaccard, like F1, gives no credit for TN.
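
A short Python sketch, assuming the usual definition of the Jaccard index for binary classification, TP / (TP + FP + FN), which is the same as intersection over union of the actual-positive and predicted-positive sets. The item ids and function names are made up.

```python
def jaccard_from_counts(tp, fp, fn):
    """Jaccard index = TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)


def jaccard_from_sets(actual_pos, predicted_pos):
    """Same quantity as intersection over union of the two positive sets."""
    return len(actual_pos & predicted_pos) / len(actual_pos | predicted_pos)


# Made-up item ids: which items are truly positive, which were predicted positive.
actual_pos = {1, 2, 3, 5}
predicted_pos = {2, 3, 4}

tp = len(actual_pos & predicted_pos)  # in both sets
fp = len(predicted_pos - actual_pos)  # predicted positive but not actually
fn = len(actual_pos - predicted_pos)  # actually positive but missed

print(jaccard_from_counts(tp, fp, fn))
print(jaccard_from_sets(actual_pos, predicted_pos))
```

Items that are in neither set (the true negatives) never enter the calculation.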

## The bias/variance trade-off
High bias, low variance = underfit.  
Think of a line model of parabolic data.  
A low-degree model systematically misses some regions of the data (high bias), 
but test error will be about the same as train error.

High variance, low bias = overfit.   
Think of a squiggly curve model for essentially linear data.  
A high-degree model can bend to reach every region of the training data (low bias), 
but test error will be higher than train error.

Generalization error = out-of-sample error = risk.  
The true risk is unknown, so we estimate it with the empirical risk on out-of-sample data.    
Empirical risk = avg loss = sum [ loss(pred,true) ] / n   
or, with squared-error loss, error = sum [ (pred - true)^2 ].
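
A minimal Python sketch of empirical risk as an average loss. The function names and the numbers are made up, and the two loss functions are just common choices, not the only ones.

```python
def empirical_risk(predictions, targets, loss):
    """Average loss over n (prediction, target) pairs."""
    total = sum(loss(pred, true) for pred, true in zip(predictions, targets))
    return total / len(targets)


def squared_loss(pred, true):
    return (pred - true) ** 2


def zero_one_loss(pred, true):
    return 0 if pred == true else 1


# Regression example with squared loss (made-up values).
reg_risk = empirical_risk([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0], squared_loss)

# Classification example with 0/1 loss (made-up labels).
clf_risk = empirical_risk([1, 0, 1, 1], [1, 1, 1, 0], zero_one_loss)

print(f"mean squared loss = {reg_risk}")
print(f"mean 0/1 loss     = {clf_risk}")
```

To estimate generalization error, the predictions and targets passed in should come from data the model was not trained on.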

Train Loss ≈ Test Loss ==> Low variance but possibly high bias (model too simple).  
Train Loss << Test Loss ==> High variance and low bias (model too complex).
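
Both patterns can be seen in a toy experiment, sketched here in plain Python. Everything is an assumption for illustration: the data is a noisy parabola, a least-squares line plays the "too simple" model, and a 1-nearest-neighbor regressor stands in for the "too complex" model (it is not a high-degree polynomial, but it memorizes the training set in the same spirit).

```python
import random


def mse(model, xs, ys):
    """Mean squared error of model(x) against y."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope*x + intercept (simple model)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_one_nearest_neighbor(xs, ys):
    """Memorize the training set; predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def make_data(rng, n):
    """Noisy parabola: y = x^2 + Gaussian noise."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 200)

losses = {}
for name, fit in [("line", fit_line), ("1-NN", fit_one_nearest_neighbor)]:
    model = fit(train_x, train_y)
    losses[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name}: train loss = {losses[name][0]:.2f}, test loss = {losses[name][1]:.2f}")
```

The line should show a large train loss with a test loss of similar size, while 1-NN reproduces its training points exactly and then does worse on the test set.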