Course website: Machine Learning | Coursera.
Supervised Learning: Given the "right answers" (labels) to train on.
Unsupervised Learning: No "right answers" (labels) are given.
Regression: Predict continuous-valued output.
Classification: Predict discrete-valued output.
Parameters | $\theta_0,\theta_1$ |
---|---|
Hypothesis | $h_\theta(x)=\theta_0+\theta_1x$ |
Cost Function | $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$ |
Goal | minimize $J(\theta_0,\theta_1)$ |
Repeat until convergence (updating $\theta_0$ and $\theta_1$ simultaneously):
$\theta_0:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$
$\theta_1:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$
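As a concrete illustration, here is a minimal NumPy sketch of the updates above (function and variable names are illustrative, not from the course materials):

```python
# A minimal sketch of batch gradient descent for one-variable linear regression.
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1500):
    """Fit h(x) = theta0 + theta1 * x by minimizing J(theta0, theta1)."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                 # predictions on all m examples
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = (1 / m) * np.sum(h - y)
        grad1 = (1 / m) * np.sum((h - y) * x)
        # Simultaneous update.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1
```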
Parameters | $\theta=[\theta_0,\theta_1,\dots,\theta_n]^T$ |
---|---|
Hypothesis | $h_\theta(x)=\theta^Tx=\theta_0x_0+\theta_1x_1+\dots+\theta_nx_n$ (with $x_0=1$) |
Cost Function | $J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$ |
Goal | minimize $J(\theta)$ |
Gradient Descent | $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$ |
Feature Scaling: Get every feature into approximately a $-1\le x_i\le1$ range.
$x_i:=\frac{x_i}{\text{max}(|x_i|)}$
Mean Normalization: Make features have approximately zero mean.
$x_i:=\frac{x_i-\mu_i}{\sigma_i}$
$x_i:=\frac{x_i-\mu_i}{\text{max}(x_i)-\text{min}(x_i)}$
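A short sketch of the normalization above (the helper name is illustrative):

```python
# Column-wise mean normalization; keep mu and sigma to transform new data later.
import numpy as np

def mean_normalize(X):
    """Scale each feature (column of X) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```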
Choosing the learning rate $\alpha$: if $\alpha$ is too small, convergence is slow; if $\alpha$ is too large, $J(\theta)$ may fail to decrease on every iteration and may not converge.
Intuition: at the minimum, every partial derivative is zero: $\frac{\partial}{\partial\theta_0}J(\theta)=\frac{\partial}{\partial\theta_1}J(\theta)=\dots=\frac{\partial}{\partial\theta_n}J(\theta)=0$
Solution: $\theta=(X^TX)^{-1}X^TY$
Example:
$\theta=\begin{bmatrix}\theta_0\\\theta_1\\\theta_2\\\theta_3\\\theta_4\end{bmatrix}$ $X=\begin{bmatrix}1&x_1^{(1)}&x_2^{(1)}&x_3^{(1)}&x_4^{(1)}\\1&x_1^{(2)}&x_2^{(2)}&x_3^{(2)}&x_4^{(2)}\\1&x_1^{(3)}&x_2^{(3)}&x_3^{(3)}&x_4^{(3)}\\\vdots&\vdots&\vdots&\vdots&\vdots\\1&x_1^{(m)}&x_2^{(m)}&x_3^{(m)}&x_4^{(m)}\end{bmatrix}$ $Y=\begin{bmatrix}y^{(1)}\\y^{(2)}\\y^{(3)}\\\vdots\\y^{(m)}\end{bmatrix}$
Note that feature scaling is not necessary when using the normal equation.
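A one-line sketch of the closed-form solution in NumPy:

```python
# Normal equation: theta = (X^T X)^{-1} X^T y, with X including the column of ones.
import numpy as np

def normal_equation(X, y):
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```

Using `pinv` instead of `inv` also covers the case where redundant features make $X^TX$ singular.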
Gradient Descent | Normal Equation |
---|---|
Need to choose $\alpha$. | No need to choose $\alpha$. |
Need many iterations to converge. | No need to iterate. |
Still works well when $n$ is large. | Need to compute $(X^TX)^{-1}$. Slow if $n$ is very large. |
Sigmoid/Logistic Function: $g(x)=\frac{1}{1+e^{-x}}$
$h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$, which denotes the probability that $y=1$ given input $x$: $h_\theta(x)=P(y=1|x;\theta)$.
$\begin{aligned}&y=1\\\Leftrightarrow\ &P(y=1|x;\theta)=h_\theta(x)=g(\theta^Tx)\ge0.5\\\Leftrightarrow\ &\theta^Tx\ge0\end{aligned}$
Add polynomial features like $x_1^2,x_2^2,x_1x_2$ to obtain non-linear decision boundaries.
Minus Log Cost | Mean Square Cost |
---|---|
$\text{Cost}(h_\theta(x),y)=\begin{cases}-\log(h_\theta(x))\ ,y=1\\-\log(1-h_\theta(x))\ ,y=0\end{cases}$ | $\text{Cost}(h_\theta(x),y)=\frac{1}{2}(h_\theta(x)-y)^2$ |
Has a unique global minimum. (convex) | Has many local minima. (non-convex) |
$\begin{aligned}J(\theta)&=\frac{1}{m}\sum_{i=1}^m\text{Cost}(h_\theta(x^{(i)}),y^{(i)})\\&=-\frac{1}{m}\sum_{i=1}^m\bigg[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\bigg]\end{aligned}$
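A sketch of this convex cost in NumPy (names are illustrative):

```python
# Logistic regression "minus log" cost; assumes predictions stay strictly in (0, 1).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```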
Examples | Advantages |
---|---|
Conjugate Gradient, BFGS, L-BFGS | No need to pick $\alpha$ manually. Often faster than gradient descent. |
For a classification problem with more than two classes ($y\in\{1,2,\dots,K\}$):
Train a logistic regression classifier $h_\theta^{(k)}(x)$ for each class $k$ to predict the probability that $y=k$.
To make a prediction on a new $x$, pick the class $k$ that maximizes $h_\theta^{(k)}(x)$.
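A minimal prediction sketch, assuming one trained weight vector per class:

```python
# One-vs-all prediction: return the class whose classifier is most confident.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_one_vs_all(classifiers, x):
    """`classifiers` is a list of trained weight vectors, one per class."""
    probs = [sigmoid(theta @ x) for theta in classifiers]
    return int(np.argmax(probs))
```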
Underfit | Well-fit | Overfit |
---|---|---|
High Bias | Just Right | High Variance |
There are two main options to address the issue of overfitting:
- Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm.
- Regularization:
  - Keep all the features, but reduce the magnitude of parameters $\theta_j$.
  - Regularization works well when we have a lot of slightly useful features.
Add a penalty term to shrink the parameters. ($\theta_0$ is conventionally not penalized.)
$J(\theta)=\frac{1}{m}\sum_{i=1}^m\text{Cost}(h_\theta(x^{(i)}),y^{(i)})+\lambda\sum_{j=1}^n\theta_j^2$
Regularization Parameter $\lambda$: controls the trade-off between fitting the training set well and keeping the parameters small. Too large a $\lambda$ causes underfitting (high bias); too small a $\lambda$ leaves overfitting (high variance).
Gradient Descent | Normal Equation |
---|---|
$\theta_0:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta)$ | $\theta=\big(X^TX+\lambda L\big)^{-1}X^TY$ |
$\theta_j:=\theta_j-\alpha\big[\frac{\partial}{\partial\theta_j}J(\theta)+\frac{\lambda}{m}\theta_j\big]=\big(1-\frac{\alpha\lambda}{m}\big)\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\space,j\in\{1,2,\dots,n\}$ | where $L=\begin{bmatrix}0&0&0&\cdots&0\\0&1&0&\cdots&0\\0&0&1&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\0&0&0&\cdots&1\end{bmatrix}$ |
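A sketch of one regularized gradient step for logistic regression; note that $\theta_0$ is excluded from the penalty:

```python
# One regularized gradient descent step (logistic regression).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    grad = (1 / m) * X.T @ (h - y)   # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                     # do not penalize theta_0
    return theta - alpha * (grad + reg)
```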
First Layer | Intermediate Layer | Last Layer |
---|---|---|
Input Layer | Hidden Layer | Output Layer |
In the example network (3 input units, one hidden layer with 3 units, and 1 output unit), the data flow looks like:
$[x_1,x_2,x_3]\rightarrow\big[a_1^{(2)},a_2^{(2)},a_3^{(2)}\big]\rightarrow h_\theta(x)$
The values of the "activation" nodes are obtained as follows:
$a_1^{(2)}=g\big(\Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3\big)$
$a_2^{(2)}=g\big(\Theta_{20}^{(1)}x_0+\Theta_{21}^{(1)}x_1+\Theta_{22}^{(1)}x_2+\Theta_{23}^{(1)}x_3\big)$
$a_3^{(2)}=g\big(\Theta_{30}^{(1)}x_0+\Theta_{31}^{(1)}x_1+\Theta_{32}^{(1)}x_2+\Theta_{33}^{(1)}x_3\big)$
$h_\theta(x)=a_1^{(3)}=g\big(\Theta_{10}^{(2)}a_0^{(2)}+\Theta_{11}^{(2)}a_1^{(2)}+\Theta_{12}^{(2)}a_2^{(2)}+\Theta_{13}^{(2)}a_3^{(2)}\big)$
where the added $x_0=1$ is the bias unit.
The matrix representation of the above computations is as follows:
Inputs | Weights |
---|---|
$a^{(1)}=x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\overset{\text{add bias}}{\Longrightarrow}\begin{bmatrix}x_0\\x_1\\x_2\\x_3\end{bmatrix}$ | $\Theta^{(1)}=\begin{bmatrix}\Theta_{10}^{(1)}&\Theta_{11}^{(1)}&\Theta_{12}^{(1)}&\Theta_{13}^{(1)}\\\Theta_{20}^{(1)}&\Theta_{21}^{(1)}&\Theta_{22}^{(1)}&\Theta_{23}^{(1)}\\\Theta_{30}^{(1)}&\Theta_{31}^{(1)}&\Theta_{32}^{(1)}&\Theta_{33}^{(1)}\end{bmatrix}$ |
$a^{(2)}=g\big(z^{(2)}\big)=g\big(\Theta^{(1)}a^{(1)}\big)=\begin{bmatrix}a_1^{(2)}\\a_2^{(2)}\\a_3^{(2)}\end{bmatrix}\overset{\text{add bias}}{\Longrightarrow}\begin{bmatrix}a_0^{(2)}\\a_1^{(2)}\\a_2^{(2)}\\a_3^{(2)}\end{bmatrix}$ | $\Theta^{(2)}=\begin{bmatrix}\Theta_{10}^{(2)}&\Theta_{11}^{(2)}&\Theta_{12}^{(2)}&\Theta_{13}^{(2)}\end{bmatrix}$ |
$a^{(3)}=g\big(z^{(3)}\big)=g\big(\Theta^{(2)}a^{(2)}\big)=\begin{bmatrix}a_1^{(3)}\end{bmatrix}$ | none |
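A vectorized sketch of this forward pass for the 3-4-1 example network (shapes match the table above; names are illustrative):

```python
# Forward propagation for the example network: Theta1 is 3x4, Theta2 is 1x4.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))   # add bias x_0 = 1
    a2 = sigmoid(Theta1 @ a1)         # hidden activations, shape (3,)
    a2 = np.concatenate(([1.0], a2))  # add bias a_0^(2) = 1
    return sigmoid(Theta2 @ a2)       # h_theta(x), shape (1,)
```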
One-vs-all for neural networks: Define the set of resulting classes like:
$y^{(i)}=\begin{bmatrix}1\\0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\0\\1\end{bmatrix}$
Take the multiclass classification problem as an example:
Logistic Regression: $J(\theta)=-\frac{1}{m}\sum_{i=1}^m\bigg[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\bigg]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
Neural Network: $J(\Theta)=-\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K\bigg[y_k^{(i)}\log\big(h_\Theta(x^{(i)})_k\big)+(1-y_k^{(i)})\log\big(1-h_\Theta(x^{(i)})_k\big)\bigg]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2$
The first part: Sum up the costs of all $K$ output units for every training example.
The second part: Sum up the squared values of all parameters except bias.
Backpropagation Algorithm:
Our objective is to calculate $\frac{\partial J(\Theta)}{\partial\Theta_{i,j}^{(l)}}$ for every weight $\Theta_{i,j}^{(l)}$.
It is quite hard to calculate $\frac{\partial J(\Theta)}{\partial\Theta_{i,j}^{(l)}}$ directly.
By applying the chain rule of derivatives, we get $\frac{\partial J(\Theta)}{\partial\Theta_{i,j}^{(l)}}=\frac{\partial J(\Theta)}{\partial z_j^{(l+1)}}\frac{\partial z_j^{(l+1)}}{\partial\Theta_{i,j}^{(l)}}$.
It is easy to deduce that $\frac{\partial z_j^{(l+1)}}{\partial\Theta_{i,j}^{(l)}}=\frac{\partial\big(\sum_{k=0}^{s_l}\Theta_{k,j}^{(l)}a_k^{(l)}\big)}{\partial\Theta_{i,j}^{(l)}}=a_i^{(l)}$.
So we define the "error value" $\delta_j^{(l)}=\frac{\partial J(\Theta)}{\partial z_j^{(l)}}$.
Since we already have $\frac{\partial z_j^{(l+1)}}{\partial\Theta_{i,j}^{(l)}}=a_i^{(l)}$, all that remains is to compute the $\delta$ values. For each training example $(x^{(t)},y^{(t)})$:
- Perform forward propagation to compute $a^{(l)}$ for $l=1,2,3,\dots,L$.
- Compute $\delta^{(L)}=a^{(L)}-y^{(t)}$.
- Compute $\delta^{(L-1)},\delta^{(L-2)},\dots,\delta^{(2)}$, using $\delta^{(l)}=\big(\Theta^{(l)}\big)^T\delta^{(l+1)}\odot g'\big(z^{(l)}\big)$, where $\odot$ is the element-wise product.
- Compute $\Delta^{(l)}_{i,j}:=\delta_i^{(l+1)}a_j^{(l)}$, or with vectorization, $\Delta^{(l)}:=\delta^{(l+1)}\big(a^{(l)}\big)^T$.
If we didn't apply regularization, then $D_{i,j}^{(l)}=\frac{\partial J(\Theta)}{\partial\Theta_{i,j}^{(l)}}=\frac{1}{m}\Delta^{(l)}_{i,j}$.
Otherwise, we have $D_{i,j}^{(l)}=\frac{\partial J(\Theta)}{\partial\Theta_{i,j}^{(l)}}=\begin{cases}\frac{1}{m}\big(\Delta^{(l)}_{i,j}+\lambda\Theta_{i,j}^{(l)}\big)\space,j\ne0\\\frac{1}{m}\Delta^{(l)}_{i,j}\space,j=0\end{cases}$.
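A sketch of these steps for a single training example in the same 3-4-1 network, using $g'(z^{(l)})=a^{(l)}\odot(1-a^{(l)})$ for the sigmoid (names are illustrative):

```python
# Backpropagation for one example: Theta1 is 3x4, Theta2 is 1x4.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def backprop_one(x, y, Theta1, Theta2):
    # Forward pass, keeping intermediate activations.
    a1 = np.concatenate(([1.0], x))
    a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
    a3 = sigmoid(Theta2 @ a2)
    # Error terms.
    delta3 = a3 - y                                 # output layer
    delta2 = (Theta2.T @ delta3) * (a2 * (1 - a2))  # hidden layer (incl. bias)
    delta2 = delta2[1:]                             # drop the bias error
    # Gradient accumulators: Delta^(l) = delta^(l+1) (a^(l))^T.
    Delta1 = np.outer(delta2, a1)
    Delta2 = np.outer(delta3, a2)
    return Delta1, Delta2
```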
Random Initialization: Initialize each $\Theta_{i,j}^{(l)}$ to a random value in $[-\epsilon,\epsilon]$.
Initializing all weights to zero does not work: every hidden unit would compute the same function after each update.
- Randomly initialize the weights.
- Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$.
- Implement the cost function.
- Implement backpropagation to compute partial derivatives.
- Use gradient checking (see the sketch after this list) to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
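The gradient-checking step can be sketched with a two-sided finite difference (the `cost_fn` callable is an assumption):

```python
# Numerically approximate the gradient to compare against backprop derivatives.
import numpy as np

def numerical_gradient(cost_fn, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (cost_fn(theta + step) - cost_fn(theta - step)) / (2 * eps)
    return grad
```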
Once we have done some troubleshooting for errors in our predictions by:
- Getting more training examples.
- Trying smaller sets of features.
- Trying additional features.
- Trying polynomial features.
- Increasing or decreasing $\lambda$.
One way to break down our dataset into the three sets is:
- Training set: 60%.
- Cross validation set: 20%.
- Test set: 20%.
We can move on to evaluate our new hypothesis:
- Learn $\Theta$ and minimize $J_{train}(\Theta)$.
- Find the best model according to $J_{cv}(\Theta)$.
- Compute the test set error $J_{test}(\Theta)$.
Training Set Error | Cross Validation Set Error | |
---|---|---|
Underfit (High Bias) | High | High |
Overfit (High Variance) | Low | High |
| Training Set Cost | Cross Validation Set Cost |
---|---|---|
Large $\lambda$ (High Bias) | High | High |
Small $\lambda$ (High Variance) | Low | High |
Experiencing High Bias:
The model underfits both the training set and the cross validation set; getting more training data will not help much.
Experiencing High Variance:
The model overfits the training set; getting more training data is likely to help.
- Getting more training examples: fixes high variance.
- Trying smaller sets of features: fixes high variance.
- Adding features: fixes high bias.
- Adding polynomial features: fixes high bias.
- Decreasing $\lambda$: fixes high bias.
- Increasing $\lambda$: fixes high variance.
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
- Make sure the quick implementation incorporates a single real-number evaluation metric.
To better handle skewed data, we can use precision and recall to evaluate the model.
$\text{precision}=\frac{\text{true positive}}{\text{true positive + false positive}}$
$\text{recall}=\frac{\text{true positive}}{\text{true positive + false negative}}$
We can further trade off precision against recall using what is called the F1 score.
$\text{F1 score}=2\,\frac{\text{precision}\times\text{recall}}{\text{precision + recall}}$
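A direct translation of these formulas (assumes binary 0/1 labels):

```python
# Precision, recall, and F1 from predicted and true binary labels.
import numpy as np

def f1_score(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```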
Starting from the logistic regression cost, we obtain the cost function of SVM:
$J(\theta)=C\sum_{i=1}^{m}\big[y^{(i)}\text{cost}_1(\Theta^Tx^{(i)})+(1-y^{(i)})\text{cost}_0(\Theta^Tx^{(i)})\big]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
The cost function for each label looks like:
So our hypothesis is:
$h_\theta(x)=\begin{cases}1\space,\Theta^Tx\ge1\\0\space,\Theta^Tx\le-1\end{cases}$
If $C$ is very large, minimizing $J(\theta)$ drives the first term to zero:
$\sum_{i=1}^{m}\big[y^{(i)}\text{cost}_1(\Theta^Tx^{(i)})+(1-y^{(i)})\text{cost}_0(\Theta^Tx^{(i)})\big]=0$
$\begin{aligned}\text{s.t.}\quad&\Theta^Tx^{(i)}\ge1\quad\space\space\space\text{if}\quad y^{(i)}=1\\&\Theta^Tx^{(i)}\le-1\quad\text{if}\quad y^{(i)}=0\end{aligned}$
So our final objective is to calculate the following equation:
$\min_\theta\space\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
$\begin{aligned}\text{s.t.}\quad&\Theta^Tx^{(i)}\ge1\quad\space\space\space\text{if}\quad y^{(i)}=1\\&\Theta^Tx^{(i)}\le-1\quad\text{if}\quad y^{(i)}=0\end{aligned}$
Note: SVM is sensitive to noise.
Take $\Theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}$ as an example (regard $\theta_0$ as $0$):
$\begin{aligned}\min_\theta\space\frac{1}{2}\sum_{j=1}^{2}\theta_j^2&=\frac{1}{2}\big(\theta_1^2+\theta_2^2\big)\\&=\frac{1}{2}\bigg(\sqrt{\theta_1^2+\theta_2^2}\bigg)^2\\&=\frac{1}{2}\Theta^T\Theta\\&=\frac{1}{2}\big\|\Theta\big\|^2\end{aligned}$
$\begin{aligned}\text{s.t.}\quad&\Theta^Tx^{(i)}\ge1\quad\space\space\space\text{if}\quad y^{(i)}=1\\&\Theta^Tx^{(i)}\le-1\quad\text{if}\quad y^{(i)}=0\end{aligned}$
Essentially, the SVM minimizes the squared norm $\big\|\Theta\big\|^2$ of the weights subject to the margin constraints.
Let $p^{(i)}$ be the signed projection of $x^{(i)}$ onto the vector $\Theta$, so that $\Theta^Tx^{(i)}=\big\|\Theta\big\|\cdot p^{(i)}$. The constraints become:
$\begin{aligned}\text{s.t.}\quad&\big\|\Theta\big\|\cdot p^{(i)}\ge1\quad\space\space\space\text{if}\quad y^{(i)}=1\\&\big\|\Theta\big\|\cdot p^{(i)}\le-1\quad\text{if}\quad y^{(i)}=0\end{aligned}$
and the whole problem becomes:
$\min_\theta\space\frac{1}{2}\big\|\Theta\big\|^2$
$\begin{aligned}\text{s.t.}\quad&\big\|\Theta\big\|\cdot p^{(i)}\ge1\quad\space\space\space\text{if}\quad y^{(i)}=1\\&\big\|\Theta\big\|\cdot p^{(i)}\le-1\quad\text{if}\quad y^{(i)}=0\end{aligned}$
To minimize $\big\|\Theta\big\|$ while satisfying these constraints, the projections $p^{(i)}$ must be as large as possible; this is why the SVM chooses a large-margin decision boundary.
For some non-linear classification problems, we can remap the original features $x$ into new features $f$ using kernels.
Given $m$ training examples:
- Choose each training example $x^{(i)}$ as the landmark $l^{(i)}$, i.e. $l^{(i)}=x^{(i)}$ for all $i$. Finally we acquire $m$ landmarks $l^{(1)},l^{(2)},\dots,l^{(m)}$ in total.
- For each training example $x^{(i)}$, use all $m$ landmarks to calculate the new feature vector $f^{(i)}$. The new feature $f^{(i)}$ has $m$ dimensions, and $f_j^{(i)}=f(x^{(i)},l^{(j)})$. Finally we acquire $m$ feature vectors $f^{(1)},f^{(2)},\dots,f^{(m)}$ in total.
- Use $f^{(1)},f^{(2)},\dots,f^{(m)}$ to train the SVM model. The new training objective is $\min_\theta\space C\sum_{i=1}^{m}\big[y^{(i)}\text{cost}_1(\Theta^Tf^{(i)})+(1-y^{(i)})\text{cost}_0(\Theta^Tf^{(i)})\big]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$. ($n=m$)
Linear Kernel: Same as no kernel.
Gaussian Kernel: $f(x,l)=\exp\big(-\frac{\|x-l\|^2}{2\sigma^2}\big)$.
Polynomial Kernel: $f(x,l)=(x^Tl+c)^d$.
And so on...
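A sketch of computing the Gaussian-kernel features $f^{(i)}$ of one example against the $m$ landmarks:

```python
# Map an example to Gaussian kernel features, one per landmark (l^(j) = x^(j)).
import numpy as np

def gaussian_kernel_features(x, landmarks, sigma=1.0):
    """f_j = exp(-||x - l^(j)||^2 / (2 sigma^2)) for each landmark."""
    diffs = landmarks - x                  # shape (m, n)
    sq_dists = np.sum(diffs ** 2, axis=1)  # shape (m,)
    return np.exp(-sq_dists / (2 * sigma ** 2))
```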
| $\sigma^2$ (Gaussian Kernel) | $C\ (=\frac{1}{\lambda})$ (Cost Function) |
---|---|---|
Large | Features vary more smoothly. High bias. | Small $\lambda$. High variance. |
Small | Features vary less smoothly. High variance. | Large $\lambda$. High bias. |
Suppose we have $n$ features and $m$ training examples:
| SVM | Logistic Regression | Neural Network |
---|---|---|---|
$n$ large relative to $m$ | Linear Kernel. (avoid overfitting) | Works fine. | Always works well. |
$n$ small, $m$ intermediate | Gaussian Kernel. (or other kernels) | Works fine. (SVM may be better) | Always works well. (SVM is faster) |
$n$ small, $m$ large | Linear Kernel. (reduce time cost) | Works fine. | Always works well. |
Assume that we have $m$ training examples and want to group them into $K$ clusters:
- Randomly pick $K$ training examples $x_1,x_2,\dots,x_K$. Initialize the $K$ cluster centroids $\mu_1,\mu_2,\dots,\mu_K$ with the picked examples, i.e. $\mu_i=x_i$.
- Loop:
  - For each $x^{(i)}$, choose the nearest centroid as its cluster, denoted $c^{(i)}$.
  - For each cluster, update $\mu_k$ with the mean value of the points assigned to it.
- Stop when no cluster assignment changes.
Minimize the average distance of all examples to their corresponding centroids.
$J(C,M)=\frac{1}{m}\sum_{i=1}^m\big\|x^{(i)}-\mu_{c^{(i)}}\big\|$
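A compact sketch of the loop above (assumes no cluster goes empty; the seed is arbitrary):

```python
# K-means: alternate cluster assignment and centroid updates until stable.
import numpy as np

def kmeans(X, K, num_iters=100):
    rng = np.random.default_rng(0)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # pick K examples
    for _ in range(num_iters):
        # Assign each example to the nearest centroid.
        c = np.argmin(((X[:, None] - mu[None]) ** 2).sum(axis=2), axis=1)
        new_mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):  # stop when nothing changes
            break
        mu = new_mu
    return c, mu
```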
Elbow Method:
Plot the cost $J$ against the number of clusters $K$, and pick the $K$ at the "elbow" of the curve.
More Importantly:
Evaluate $K$ based on a metric of how well the clustering performs for the later, downstream purpose.
Assume that we have $m$ training examples, each with $n$ features.
We want to compress each example into a $k$-dimensional vector ($k<n$):
- Do feature scaling or mean normalization.
- Compute the "covariance matrix" $\Sigma$: $\Sigma=\frac{1}{m}\sum_{i=1}^mx^{(i)}\big(x^{(i)}\big)^T$ ($n\times n$)
- Compute the "eigenvector matrix" $U$ and "eigenvalue matrix" $V$ of $\Sigma$: $U=\begin{bmatrix}u^{(1)}&u^{(2)}&\cdots&u^{(n)}\end{bmatrix}$ ($n\times n$) and $V=\begin{bmatrix}v^{(1)}&v^{(2)}&\cdots&v^{(n)}\end{bmatrix}$ ($1\times n$), where $u^{(i)}$ is the $i$th eigenvector of $\Sigma$ ($n\times1$) and $v^{(i)}$ is the $i$th eigenvalue of $\Sigma$.
- Select the largest $k$ eigenvalues from $V$ and concatenate the corresponding $k$ eigenvectors into a new matrix $U'$: $U'=\begin{bmatrix}u'^{(1)}&u'^{(2)}&\cdots&u'^{(k)}\end{bmatrix}$ ($n\times k$)
- Compute the new feature matrix $Z$: $Z=XU'=\begin{bmatrix}z^{(1)}\\z^{(2)}\\\vdots\\z^{(m)}\end{bmatrix}$ ($m\times k$)
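A sketch of the whole procedure via `numpy.linalg.eigh`, including the reconstruction loss rate used below for choosing $k$ (`X` is assumed already mean-normalized, shape $m\times n$):

```python
# PCA: eigendecomposition of the covariance matrix, projection, reconstruction.
import numpy as np

def pca(X, k):
    m = len(X)
    Sigma = (X.T @ X) / m                       # covariance matrix (n x n)
    vals, U = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    U_prime = U[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors (n x k)
    Z = X @ U_prime                             # compressed features (m x k)
    X_re = Z @ U_prime.T                        # reconstruction (m x n)
    loss_rate = np.sum((X - X_re) ** 2) / np.sum(X ** 2)
    return Z, U_prime, loss_rate
```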
Minimize the average distance of all examples to the hyperplane.
$\min_U\space\frac{1}{m}\sum_{i=1}^md\big(x^{(i)},U\big)$
We can approximately reconstruct our original $n$-dimensional examples from $Z$:
$X_{re}=Z(U')^T=\begin{bmatrix}x_{re}^{(1)}\\x_{re}^{(2)}\\\vdots\\x_{re}^{(m)}\end{bmatrix}$ ($m\times n$)
We can then compute the average information loss:
$L=\frac{1}{m}\sum_{i=1}^m\big\|x^{(i)}-x_{re}^{(i)}\big\|^2$
So the loss rate is:
$r=\frac{\frac{1}{m}\sum_{i=1}^m\|x^{(i)}-x_{re}^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^m\|x^{(i)}\|^2}=\frac{\sum_{i=1}^m\|x^{(i)}-x_{re}^{(i)}\|^2}{\sum_{i=1}^m\|x^{(i)}\|^2}$
We can choose the smallest value of $k$ such that $r\le0.01$, i.e. 99% of the variance is retained.
With the reconstructed examples, our training objective can also be described as:
$\min_U\space\frac{1}{m}\sum_{i=1}^m\big\|x^{(i)}-x_{re}^{(i)}\big\|^2$
- Speed up computation by composing features.
- Visualization.
Notice: It is a bad idea to use PCA to prevent overfitting. (It might work, but why not use regularization instead?)
We have a very skewed dataset: many negative (normal) examples and very few positive (anomalous) examples.
One possible approach is to use a supervised learning algorithm to build a classification model. But we have too few positive examples for a model to learn all possible "types" of anomalies, and future anomalies may look nothing like the ones used for training. As a result, a supervised model may behave quite badly here.
To handle extreme datasets like this, we need to use another method called "density estimation".
Suppose we have $m$ training examples, each with $n$ features:
- Assume that our training examples follow a specific distribution $D\big(\Theta\big)$; we have $x^{(i)}\sim D\big(\Theta\big)$ for $i=1,2,\dots,m$.
- Estimate the parameters $\Theta$ to fit this distribution.
- For a new example $x^{new}$, calculate the probability that $x^{new}$ follows the distribution $D$: $p^{new}=P\big(x_1^{new},x_2^{new},\dots,x_n^{new};\Theta\big)$
- If $p^{new}<\epsilon$, we flag $x^{new}$ as an anomaly.
Single Variate Gaussian Distribution:
Each feature follows a single variate gaussian distribution. (all features are independent of each other)
$p^{new}=\prod_{k=1}^n P_k\big(x_k^{new};\mu_k,\sigma_k^2\big)$
Multivariate Gaussian Distribution:
All features together follow an $n$-dimensional multivariate Gaussian distribution (features may be correlated), with mean vector $\mu$ and covariance matrix $\Sigma$:
$p^{new}=P\big(x_1^{new},x_2^{new},\dots,x_n^{new};\mu,\Sigma\big)$
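A sketch of the single-variate version: fit per-feature Gaussians, then flag examples with $p<\epsilon$ (the threshold value is illustrative):

```python
# Density estimation with independent Gaussians, one per feature.
import numpy as np

def fit_gaussians(X):
    """Estimate mu_k and sigma_k^2 for each feature (column of X)."""
    return X.mean(axis=0), X.var(axis=0)

def is_anomaly(x, mu, var, epsilon=1e-6):
    p = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
    return p < epsilon
```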
Training Set | Cross Validation Set | Test Set | |
---|---|---|---|
Negative Examples | 60% | 20% | 20% |
Positive Examples | 0% | 50% | 50% |
Suppose we have $n_u$ users and $n_m$ movies; $r(i,j)=1$ if user $j$ has rated movie $i$, and $y^{(i,j)}$ is that rating.
We can train a linear regression model for every user.
For movie $i$ with feature vector $x^{(i)}$, the predicted rating by user $j$ is $\big(\theta^{(j)}\big)^Tx^{(i)}$.
The cost function for user $j$ is:
$J\big(\theta^{(j)}\big)=\frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}^{m^{(j)}}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2+\frac{\lambda}{2m^{(j)}}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$
Combine all $n_u$ users' cost functions together:
$J\big(\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}\big)=\frac{1}{2n_m}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{m^{(j)}}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2+\frac{\lambda}{2n_m}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$
Given the movie features $x^{(1)},x^{(2)},\dots,x^{(n_m)}$, we can estimate the user parameters $\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}$ by minimizing:
$J\big(\theta^{(1)},\theta^{(2)},\dots,\theta^{(n_u)}\big)=\frac{1}{2n_m}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{m^{(j)}}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2+\frac{\lambda}{2n_m}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$
Similarly, given the user parameters, we can estimate the movie features $x^{(1)},x^{(2)},\dots,x^{(n_m)}$ by minimizing:
$J\big(x^{(1)},x^{(2)},\dots,x^{(n_m)}\big)=\frac{1}{2n_u}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}^{m'^{(i)}}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2+\frac{\lambda}{2n_u}\sum_{i=1}^{n_m}\sum_{k=1}^n\big(x_k^{(i)}\big)^2$
Notice that both functions share the same objective once the regularization terms are removed:
$\min\space\frac{1}{2}\sum_{(i,j):r(i,j)=1}^{(n_m,n_u)}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2$
So we can combine these two cost functions together:
$J\big(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\big)=\frac{1}{2}\sum_{(i,j):r(i,j)=1}^{(n_m,n_u)}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n\big(x_k^{(i)}\big)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2$
We can randomly initialize $x^{(1)},\dots,x^{(n_m)}$ and $\theta^{(1)},\dots,\theta^{(n_u)}$ to small values and minimize $J$ with gradient descent; this is collaborative filtering.
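A vectorized sketch of this combined cost (the matrix shapes are assumptions stated in the comments):

```python
# Collaborative filtering cost.
#   X:     n_m x n   movie features
#   Theta: n_u x n   user parameters
#   Y:     n_m x n_u ratings; R marks which entries of Y exist (r(i,j) = 1).
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    err = (X @ Theta.T - Y) * R      # only count rated entries
    return (np.sum(err ** 2)
            + lam * np.sum(X ** 2)
            + lam * np.sum(Theta ** 2)) / 2
```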
For a person who hasn't rated any movie, the objective turns out to be:
$\begin{aligned}J\big(\theta^{(j)}\big)&=\frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}^{m^{(j)}}\Big({\theta^{(j)}}^Tx^{(i)}-y^{(i,j)}\Big)^2+\frac{\lambda}{2m^{(j)}}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2\\&=\frac{\lambda}{2m^{(j)}}\sum_{k=1}^n\big(\theta_k^{(j)}\big)^2\end{aligned}$
So the estimated parameter will be $\theta^{(j)}=\mathbf{0}$, and every predicted rating will be $\big(\theta^{(j)}\big)^Tx^{(i)}=0$.
It is unreasonable to predict that a person will rate every movie 0 score.
We can fix this problem by applying mean normalization for each movie:
$\overline{y^{(i)}}=\frac{1}{m'^{(i)}}\sum_{j:r(i,j)=1}^{n_u}y^{(i,j)}$ (mean rating of movie $i$)
$y^{(i,j)}:=y^{(i,j)}-\overline{y^{(i)}}$
Then a predicted rating of 0 corresponds to the movie's mean score, which is a neutral prediction.
(Each Iteration) | (Batch) Gradient Descent | Stochastic Gradient Descent | Mini-batch Gradient Descent |
---|---|---|---|
number of examples | all $m$ examples | $1$ example | $b$ examples |
cost function | $J(\theta)=\frac{1}{2m}\sum_{i=1}^m\big(h_\theta(x^{(i)})-y^{(i)}\big)^2$ | $\text{cost}\big(\theta,(x^{(i)},y^{(i)})\big)=\frac{1}{2}\big(h_\theta(x^{(i)})-y^{(i)}\big)^2$ | $\frac{1}{2b}\sum_{k=i}^{i+b-1}\big(h_\theta(x^{(k)})-y^{(k)}\big)^2$ |

(Each Epoch) | (Batch) Gradient Descent | Stochastic Gradient Descent | Mini-batch Gradient Descent |
---|---|---|---|
number of examples | $m$ | $m$ | $m$ |
number of iterations | $1$ | $m$ | $m/b$ |
randomly shuffle | No | Yes | Yes |
Choosing the mini-batch size $b$: typically $b=2\sim100$ (a common choice is $b=10$).
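A sketch of mini-batch gradient descent with per-epoch shuffling (the `grad_fn` callable is an assumption):

```python
# Mini-batch gradient descent: shuffle once per epoch, step through b examples at a time.
import numpy as np

def minibatch_gd(X, y, theta, grad_fn, alpha=0.01, b=10, epochs=10):
    m = len(y)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        perm = rng.permutation(m)  # randomly shuffle each epoch
        for i in range(0, m, b):
            idx = perm[i:i + b]
            theta -= alpha * grad_fn(theta, X[idx], y[idx])
    return theta
```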
For online learning problems, we have a continuous stream of new data.
We train our model only on the latest data and then discard it.
The model can thus track changes in the data stream and stay up to date.
- Repeatedly move a window across the picture in fixed steps to detect different parts of it.
- If the content inside a window is likely to be text, mark it as a "text part".
- For all contiguous "text parts", use a smaller window to do character segmentation.
- For all segmented characters, do character recognition.
Artificial Data:
We can generate data automatically using the computer's font libraries.
In this way, we have theoretically unlimited training data.
Crowd Source:
Hiring the crowd to label data.
Points to Consider:
- Make sure we have a low-bias classifier first; otherwise, we should add more features to our model instead.
- Estimate how much time artificial data will save compared with collecting the same amount of real-life data.