## 1. Definition: 
- Machine learning (ML) algorithms build models from sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

## 2. Object:
- Data (values of attributes):
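  - Numeric: usually normalized to a common scale before learning
  - Categorical: values are mutually exclusive; encoded as numbers; suffers from the synonymy problem (different names for the same category); only $=$ and $\neq$ are meaningful
  - Ordinal: values have an order; encoded as numbers that preserve the ordering; cannot be added or multiplied; only $<$, $>$, and $=$ are meaningful; it can be hard to tell whether an attribute is ordered or not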
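
The three attribute types above can be encoded as in the following minimal sketch. The function names and the sample values are made up for illustration; min-max scaling and one-hot encoding are just one common choice for each type, not the only one.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values: List[str]) -> List[List[int]]:
    """Encode a categorical attribute so that only equality is preserved."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]


def ordinal_encode(values: List[str], order: List[str]) -> List[int]:
    """Encode an ordinal attribute as integer ranks that preserve the order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Hypothetical attribute values.
heights = min_max_normalize([150.0, 160.0, 170.0])
colors = one_hot(["red", "green", "red"])
sizes = ordinal_encode(["small", "large", "medium"],
                       order=["small", "medium", "large"])
```

One-hot vectors avoid implying an order or a distance between categories, whereas the integer ranks of the ordinal encoding should only be compared, not added or multiplied.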

## 3. Aim:
- Decide which model to learn and how to learn it, so as to achieve accurate prediction and analysis as efficiently as possible.

## 4. Process:
<p align="center">
<img src=images/1-1.png width="400" height="150" alt="5" align=centering>

- 1. Get a limited training set;
- 2. Determine the hypothesis space (the set of all candidate models);
- 3. Decide on the learning strategy, i.e. the criterion for choosing a model;
- 4. Use an algorithm to solve for the optimal model;
- 5. Choose the optimal model;
- 6. Use the model to analyze or predict new data.
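
The six steps can be traced in a toy example. This is only a sketch under simplifying assumptions: the data are made up, the hypothesis space is restricted to one-parameter linear functions, and squared loss is chosen as the strategy.

```python
# 1. A limited training set (made-up data that follows y = 2x).
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


# 2. Hypothesis space: all functions f_w(x) = w * x with w a real number.
def predict(w, x):
    return w * x


# 3. Strategy: prefer the model with the smallest mean squared error.
def empirical_risk(w, data):
    return sum((y - predict(w, x)) ** 2 for x, y in data) / len(data)


# 4-5. Algorithm: for this hypothesis space the minimizer has the
#      closed form w* = sum(x * y) / sum(x * x).
w_best = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# 6. Use the chosen model on new data.
y_new = predict(w_best, 4.0)
```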

## 5. Basic Categories: 
### 5.1 Supervised and Unsupervised model:
#### Supervised
- learning predictive models from labelled data.
<p align="center">
<img src=images/1-2.png width="400" height="250" alt="5" align=centering>

where $y_{N+1} = \arg\max_{y}\hat{P}(y|x_{N+1})$ or $y_{N+1} = \hat{f}(x_{N+1})$
- More specifically, there are three problems in supervised learning:
  - Classification
  - Regression
  - Tagging

#### Unsupervised
- learning predictive models from unlabelled data.
<p align="center">
<img src=images/1-3.png width="400" height="300" alt="5" align=centering>

- Other learning paradigms:
  - Reinforcement learning
  - Semi-supervised learning
  - Active learning

### 5.2 Probabilistic and deterministic model:
- The main difference is in the inner structure: a probabilistic model can be expressed as a joint probability distribution, whereas a non-probabilistic model often cannot.
- Logistic regression can be viewed as both.

#### Probabilistic model:
- $P(y|x)$ or $P(z|x)$, $P(x|z)$
- Decision trees, naive Bayes, GMM

#### Deterministic model:
- $y = f(x)$
- SVM, KNN, AdaBoost, K-means, neural networks

#### Generative and discriminative approach(model)
- Generative: learn the joint probability distribution $P(X,Y)$, then compute $P(Y|X)$ as the predictive model:
$$ P(Y|X) = \frac{P(X,Y)}{P(X)} $$ 
Naive Bayes is an example.
- Discriminative: learn the decision function $f(X)$ or the conditional distribution $P(Y|X)$ directly, e.g. KNN, decision trees, logistic regression, SVM, and boosting methods.
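
The generative route from $P(X,Y)$ to $P(Y|X)$ can be illustrated with a small discrete example. The joint table below is hypothetical, and in practice the joint distribution would be estimated from training data rather than written down.

```python
# Hypothetical joint distribution P(X, Y) over weather X and arrival Y.
joint = {
    ("rain", "late"): 0.375,
    ("rain", "on_time"): 0.125,
    ("sun", "late"): 0.125,
    ("sun", "on_time"): 0.375,
}


def conditional(joint, x):
    """Return P(Y | X = x) as a dict, using P(Y|X) = P(X,Y) / P(X)."""
    # Marginal P(X = x): sum the joint over all values of Y.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


posterior = conditional(joint, "rain")
# Predict the label with the highest conditional probability.
prediction = max(posterior, key=posterior.get)
```

A discriminative model would skip the joint table and fit $P(Y|X)$ or $f(X)$ directly.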

### 5.3 Parametric and non-parametric model:
- The distinction is whether the number of parameters (the dimension of the model) is fixed and finite.
- Parametric: naive Bayes, logistic regression, k-means, GMM
- Non-parametric: decision trees, SVM, AdaBoost, KNN

### 5.4 Bayesian learning and kernel method:

#### Bayesian learning
- Calculate the probability of a model given the observed data, i.e. its posterior probability.
<p align="center">
<img src=images/1-4.png width="400" height="150" alt="5" align=centering>

#### Kernel method
- Kernel SVM, PCA, and k-means
<p align="center">
<img src=images/1-5.png width="400" height="200" alt="5" align=centering>

## 6. Three elements of machine learning

### 6.1 model
- For supervised learning, the model is a conditional probability distribution or a decision function.
- Hypothesis space for decision functions:
$$ F = \{f \mid Y=f_{\theta}(X),\ \theta \in R^{n}\} $$
- Hypothesis space for conditional probability distributions:
$$ F = \{P \mid P_{\theta}(Y|X),\ \theta \in R^{n}\} $$

### 6.2 strategy
- The criterion by which the optimal model is learned or chosen.

#### Loss (cost) function and risk function (expected loss):
- Loss(cost) function:
  - 0-1 loss function
  
  $$ L(Y,f(X)) = 
  \begin{cases}
  1, & Y \neq f(X)\\
  0, & Y = f(X)
  \end{cases} $$

  - quadratic loss function
  $$  L(Y,f(X)) = (Y-f(X))^2 $$
  - absolute loss function
  $$ L(Y,f(X)) = |Y-f(X)| $$
  - logarithmic loss function
  $$ L(Y,P(Y|X)) = -\log P(Y|X) $$

- Risk function:
$$ R_{exp}(f) = E_{p}[L(Y,f(X))] = \int_{\mathcal{X}\times\mathcal{Y}} L(y,f(x))P(x,y) \,{\rm d}x{\rm d}y $$

However, $P(X,Y)$ is unknown, so we can only compute the empirical risk on the given training data. By the law of large numbers, the empirical risk converges to the expected risk as $N$ tends to infinity.
$$ R_{emp}(f) = \frac{1}{N} \sum_{i=1}^N L(y_{i},f(x_{i})) $$
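
The loss functions and the empirical risk $R_{emp}$ translate directly into code. The following is a minimal sketch; the function names and the sample labels and predictions are made up for illustration.

```python
import math


def zero_one_loss(y, y_pred):
    return 0 if y == y_pred else 1


def quadratic_loss(y, y_pred):
    return (y - y_pred) ** 2


def absolute_loss(y, y_pred):
    return abs(y - y_pred)


def log_loss(p_true_label):
    """Logarithmic loss, given the probability assigned to the true label."""
    return -math.log(p_true_label)


def empirical_risk(loss, ys, preds):
    """Average loss over the N training examples."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


# Hypothetical targets and predictions.
ys = [1.0, 2.0, 3.0]
preds = [1.0, 2.5, 2.0]
risk_sq = empirical_risk(quadratic_loss, ys, preds)   # (0 + 0.25 + 1) / 3
risk_abs = empirical_risk(absolute_loss, ys, preds)   # (0 + 0.5 + 1) / 3
risk_01 = empirical_risk(zero_one_loss, ys, preds)    # 2 / 3
```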

#### Minimization strategy
- ERM (empirical risk minimization)
  - The optimal model is the one with minimal empirical risk. The minimization problem is:
  $$ \min_{f \in F} \frac{1}{N}\sum_{i=1}^N L(y_{i},f(x_{i}))$$
- SRM (structural risk minimization)
  - To prevent over-fitting, a regularization term $\lambda J(f)$ is added, where $J(f)$ measures the complexity of the model and $\lambda \geq 0$ weights it. The structural risk is:
  $$ R_{srm}(f) = \frac{1}{N}\sum_{i=1}^N L(y_{i},f(x_{i})) + \lambda J(f) $$
  The minimization problem is:
  $$ \min_{f \in F} \frac{1}{N}\sum_{i=1}^N L(y_{i},f(x_{i})) + \lambda J(f) $$
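
To see how the regularization term changes the solution, consider a sketch that assumes the one-parameter model $f(x) = wx$, squared loss, and $J(f) = w^2$. These choices and the data are illustrative only. Setting the derivative of $R_{srm}$ to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + N\lambda)$.

```python
def structural_risk(w, data, lam):
    """R_srm for f(x) = w * x with squared loss and J(f) = w ** 2."""
    n = len(data)
    empirical = sum((y - w * x) ** 2 for x, y in data) / n
    return empirical + lam * w ** 2


def fit(data, lam=0.0):
    """Closed-form minimizer of structural_risk; lam = 0 is plain ERM."""
    n = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + n * lam)


# Made-up noisy data.
data = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
w_erm = fit(data)            # minimizes the empirical risk only
w_srm = fit(data, lam=0.5)   # also penalizes large weights
```

The SRM solution is shrunk toward zero relative to the ERM solution, and a larger $\lambda$ shrinks it further.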

### 6.3 algorithm
- The algorithm specifies how to solve the minimization problem;
- An ML problem is therefore an optimization problem;
- The algorithm of ML is the algorithm used to solve that optimization problem.
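
Gradient descent is one common algorithm for such optimization problems. The sketch below assumes the model $f(x) = wx$ with squared loss; the learning rate, step count, and data are arbitrary illustrative choices.

```python
def gradient(w, data):
    """Derivative of the mean squared error of f(x) = w * x w.r.t. w."""
    n = len(data)
    return -2.0 / n * sum(x * (y - w * x) for x, y in data)


def gradient_descent(data, lr=0.05, steps=200, w0=0.0):
    """Repeatedly step against the gradient of the empirical risk."""
    w = w0
    for _ in range(steps):
        w -= lr * gradient(w, data)
    return w


# Made-up data that follows y = 2x, so the optimum is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = gradient_descent(data)
```

If the learning rate is too large the iterates diverge, so in practice it has to be tuned to the problem.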

## 7. Model evaluation and selection

### Training error and test error
- Training error indicates how easy the given problem is to learn;
- Test error reflects the generalization ability.

### Overfitting problem
- The model includes too many parameters;
- It predicts well on the training data;
- It predicts poorly on the test data.
<p align="center">
<img src=images/1-6.png width="400" height="250" alt="5" align=centering>

## 8. Regularization and cross-validation

### 8.1 Regularization
- The regularization term balances the complexity of a model against its fit to the training data.
- L1 regularization:
$$ L(w) = \frac{1}{N}\sum_{i=1}^N (f(x_{i};w)-y_{i})^2 + \lambda \|w\|_{1} $$

- L2 regularization:
$$ L(w) = \frac{1}{N}\sum_{i=1}^N (f(x_{i};w)-y_{i})^2 + \frac{\lambda}{2} \|w\|_{2}^{2} $$
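
The two regularized objectives can be evaluated as follows. This sketch assumes a linear model $f(x;w) = w \cdot x$, which the notes do not fix, and uses made-up weights and data.

```python
def l1_penalty(w):
    """Sum of absolute weights, i.e. the L1 norm of w."""
    return sum(abs(wj) for wj in w)


def l2_penalty(w):
    """Half the squared L2 norm of w."""
    return 0.5 * sum(wj * wj for wj in w)


def regularized_loss(w, X, y, lam, penalty):
    """Mean squared error of the linear model plus lam * penalty(w)."""
    n = len(y)
    mse = sum(
        (sum(wj * xj for wj, xj in zip(w, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    ) / n
    return mse + lam * penalty(w)


# Hypothetical weights and data.
w = [1.0, -2.0]
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, -1.0]
loss_l1 = regularized_loss(w, X, y, lam=0.1, penalty=l1_penalty)  # 0.5 + 0.1 * 3
loss_l2 = regularized_loss(w, X, y, lam=0.1, penalty=l2_penalty)  # 0.5 + 0.1 * 2.5
```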

### 8.2 Cross-validation
- The training set is used to train (construct) models; the validation set is used to choose among models (model selection and hyperparameter settings); the test set is used to evaluate the final model (estimate the future error rate).
- Cross-validation is used to choose the optimal model. Common variants:
- 1) simple hold-out, e.g. a 70-30 split
- 2) S-fold cross-validation
- 3) leave-one-out cross-validation (S=N)
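
S-fold cross-validation amounts to partitioning the example indices into S folds and holding out each fold once. The sketch below uses hypothetical helper names and splits the indices deterministically; in practice the data would usually be shuffled first.

```python
def s_fold_indices(n, s):
    """Partition the indices 0..n-1 into s folds of near-equal size."""
    return [list(range(i, n, s)) for i in range(s)]


def cross_validation_splits(n, s):
    """Yield (train_indices, validation_indices), one pair per fold."""
    folds = s_fold_indices(n, s)
    for k in range(s):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation


splits = list(cross_validation_splits(n=6, s=3))
# Leave-one-out is the special case S = N: every fold holds one example.
loo_splits = list(cross_validation_splits(n=6, s=6))
```

Each candidate model is trained on `train` and scored on `validation` for every split, and the model with the best average score is selected.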

### 8.3 Problems with leave-one-out cross validation
- High computational cost;
- Imbalanced classes: the folds may not preserve the class proportions;
- Stratification can help with the class imbalance issue:
  - first, randomly split each class into K parts;
  - then assemble the $i$-th part from all classes to make the $i$-th fold.
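
The stratification procedure can be sketched as below. The function name, the seed parameter, and the labels are hypothetical; each class is shuffled and dealt out across the K folds so that every fold receives roughly the same class proportions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Build k folds of example indices with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                  # random split within the class
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)   # i-th part goes to i-th fold
    return folds


# Made-up imbalanced labels: 4 positives and 8 negatives.
labels = ["pos"] * 4 + ["neg"] * 8
folds = stratified_folds(labels, k=4)
```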

## 9. Generalization ability
### 9.1 Generalization error
- The generalization error, equal to the expected loss, is measured by:
$$ R_{exp}(\hat{f}) = E_{p}[L(Y,\hat{f}(X))] = \int_{\mathcal{X}\times\mathcal{Y}} L(y,\hat{f}(x))P(x,y) \,{\rm d}x{\rm d}y $$

### 9.2 Generalization error bound
- The generalization ability is measured by the generalization error bound:
  - Some properties: 1) the generalization error bound approaches 0 as the sample size $N$ increases; 2) the generalization error bound is larger when the hypothesis space is larger.

### 9.3 Using testing error as an estimate
- As we do not know the probability distribution $P(x,y)$ in the formula above, we cannot calculate the true generalization error. However, we can use the test error as an estimate. When the number of test samples is very large, the test error will be very close to the true generalization error.

## 10. Evaluation

### 10.1 Evaluating regression
- (root) mean squared error;
- mean(median) absolute error;
- correlation coefficient
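
These regression metrics can be computed as in the following sketch. The function names and the sample values are made up, and the correlation coefficient is taken to be the Pearson coefficient, which is an assumption since the notes do not say which one is meant.

```python
import math
import statistics


def rmse(ys, preds):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


def mean_absolute_error(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


def median_absolute_error(ys, preds):
    return statistics.median(abs(y - p) for y, p in zip(ys, preds))


def pearson_correlation(ys, preds):
    """Pearson correlation between targets and predictions."""
    my, mp = statistics.mean(ys), statistics.mean(preds)
    cov = sum((y - my) * (p - mp) for y, p in zip(ys, preds))
    var_y = sum((y - my) ** 2 for y in ys)
    var_p = sum((p - mp) ** 2 for p in preds)
    return cov / math.sqrt(var_y * var_p)


# Hypothetical targets and predictions: the model always predicts double.
ys = [1.0, 2.0, 3.0, 4.0]
preds = [2.0, 4.0, 6.0, 8.0]
scores = {
    "rmse": rmse(ys, preds),
    "mae": mean_absolute_error(ys, preds),
    "median_ae": median_absolute_error(ys, preds),
    "correlation": pearson_correlation(ys, preds),
}
```

Here the correlation is perfect even though the errors are large, which shows that correlation measures agreement in trend rather than closeness of the predicted values.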

### 10.2 Evaluating classification
<p align="center">
<img src=images/1-7.png width="400" height="200" alt="5" align=centering>

- classification error:
$$ \frac{FP+FN}{TP+TN+FP+FN} $$
- accuracy:
$$ \frac{TP+TN}{TP+TN+FP+FN} $$
However, these indicators are misleading for imbalanced classes. In that case we can use:
- recall: the fraction of actual positives that are classified correctly
$$ \frac{TP}{TP+FN} $$
- precision: the fraction of predicted positives that are actually positive
$$ \frac{TP}{TP+FP} $$
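
The effect of class imbalance on these metrics shows up in a small example. The confusion-matrix counts below are made up: only 10 of 1000 examples are positive.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for a rare positive class.
metrics = classification_metrics(tp=5, tn=980, fp=10, fn=5)
```

Accuracy is 0.985 here, yet recall is only 0.5 and precision is about 0.33, so the classifier is far weaker on the rare class than the accuracy suggests.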

<p align="center">
<img src=images/1-8.png width="400" height="100" alt="5" align=centering>

- ROC:
<p align="center">
<img src=images/1-9.png width="400" height="300" alt="5" align=centering>
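
An ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered. The following is a minimal sketch with made-up labels and scores; it assumes binary labels (1 for positive, 0 for negative) and no tied scores, and it adds the area under the curve (AUC) as a common summary that the notes do not mention.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0) and ending at (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples by decreasing score; each step lowers the threshold
    # just enough to predict one more example as positive.
    for _, label in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)  # 8/9: eight of the nine (positive, negative) pairs are ranked correctly
```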