# W1
## 1. Introduction to Supervised Learning
### 1.1. Definition
- A **model** (a learning algorithm) is a small thing that captures a larger thing. A good model omits unimportant details while retaining what's important.
- Example applications: web search, fraud detection, movie recommendations, vehicle driver assistance, web advertisement, social networks, speech recognition.

### 1.2. Types of ML  
>- Supervised
>- Unsupervised
>- Semi-supervised  

_Two main modeling approaches:_
- Regression: y is numeric.
- Classification: y is categorical.
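
A minimal sketch of the two approaches, assuming scikit-learn is available. The data are synthetic and the labels `"spam"` / `"ham"` are made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # synthetic features (assumption)

# Regression: y is numeric
y_numeric = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_numeric)
reg_pred = reg.predict(X[:5])  # real-valued predictions

# Classification: y is categorical
y_label = np.where(X[:, 0] + X[:, 1] > 0, "spam", "ham")
clf = LogisticRegression().fit(X, y_label)
clf_pred = clf.predict(X[:5])  # predicted class labels
```

The only thing that changes between the two is the type of `y`, and with it the kind of estimator that makes sense.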

### 1.3. Interpretation and Prediction
**a. Interpretation**
- The primary objective is to train a model to find **insights** from the data.
- Workflow:
    * Gather $x$, $y$; train the model by finding the parameters $\omega$ that give the best prediction.
    * Focus on $\omega$ (rather than the predictions $y_p$) to generate insights -> choose a **less complex** model, one we can understand well enough to see how each feature affects the outcome.

**b. Prediction**
- The primary objective is to make the best prediction.
- Focus on **performance metrics**, which measure the quality of the model's predictions.
- Without focusing on interpretability, we risk ending up with a black-box model.
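
The two objectives can be sketched on the same fitted model. This is an illustration on synthetic data (the true coefficients 5.0, 0.5 and 0.0 are assumptions chosen for the example): interpretation reads the fitted parameters, prediction reads a metric on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
# Synthetic target: feature 0 matters a lot, feature 1 a little, feature 2 not at all
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Interpretation: look at the fitted parameters, one per feature
coefficients = model.coef_

# Prediction: look at a performance metric on data the model has not seen
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

With a linear model both views are available at once; a more complex model may improve `test_mse` while making `coefficients` (or their equivalent) much harder to read.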

## 2. Linear Regression
### 2.1. Preprocessing 
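```python
from sklearn.linear_model import LinearRegression

# X_train, y_train and X_test are assumed to be prepared already
model = LinearRegression()
model = model.fit(X_train, y_train)  # fit() returns the fitted estimator
y_predict = model.predict(X_test)
```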

* Making our target variable normally distributed will often lead to better results.
    If our target is not normally distributed, we can apply a transformation to it and then fit our regression to predict the transformed values. Predictions then have to be mapped back to the original units with the inverse transformation.
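
A minimal sketch of that workflow, assuming a log transformation and a synthetic right-skewed target (both are choices made for this example, not something the notes prescribe):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(500, 1))
# Synthetic, strictly positive, right-skewed target: log(y) is linear in X
y = np.exp(1.0 + 1.5 * X[:, 0] + rng.normal(scale=0.3, size=500))

y_log = np.log(y)                          # transform the target
model = LinearRegression().fit(X, y_log)   # fit on the transformed values
y_pred = np.exp(model.predict(X))          # invert to get back to original units
```

scikit-learn's `TransformedTargetRegressor` (in `sklearn.compose`) wraps the same transform / fit / invert pattern if you prefer not to do it by hand.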

* 2 ways to tell if our target is normally distributed:
    * Visually
    * Using a statistical test
* The D'Agostino K^2 test is a statistical test of whether a distribution is normally distributed or not:

```python
from scipy.stats.mstats import normaltest  # D'Agostino K^2 Test

# data is assumed to be a DataFrame with the target in column Y
normaltest(data.Y.values)
```

* The test isn't perfect, but suffice it to say:
    * It outputs a "p-value". The _higher_ this p-value is, the less evidence there is against the distribution being normal.
    * Frequentist statisticians would say that you accept that the distribution is normal (more specifically: fail to reject the null hypothesis that it is normal) if p > 0.05.
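
To make the p-value rule concrete, here is a small sketch on two synthetic samples (sample sizes and distributions are assumptions for illustration). It uses `scipy.stats.normaltest`, the plain-array version of the same test.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)  # clearly not normal

# normaltest returns (statistic, p-value)
_, p_normal = normaltest(normal_sample)
_, p_skewed = normaltest(skewed_sample)

alpha = 0.05
reject_normal = p_normal < alpha  # usually False for a truly normal sample
reject_skewed = p_skewed < alpha  # True: strong evidence against normality
```

Note that a truly normal sample will still be rejected about 5% of the time at `alpha = 0.05`, which is one reason to pair the test with a visual check such as a histogram.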
 
### 2.2. Box-Cox Transformation
The Box-Cox transformation is a parametrized transformation that tries to get distributions "as close to a normal distribution as possible".

It is defined as:

$$ \text{boxcox}(y_i) = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln y_i & \lambda = 0 \end{cases} $$

You can think of it as a generalization of the square root function: the square root function uses an exponent of 0.5, but Box-Cox lets its exponent vary so it can find the best one. It requires $y_i > 0$.
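
A few worked values of $\lambda$ may help show what the formula does. With $\lambda = 1$ the data are only shifted, so the shape of the distribution is unchanged:

$$ \lambda = 1: \quad \frac{y_i^{1} - 1}{1} = y_i - 1 $$

With $\lambda = 0.5$ we get a shifted and scaled square root:

$$ \lambda = 0.5: \quad \frac{y_i^{0.5} - 1}{0.5} = 2\left(\sqrt{y_i} - 1\right) $$

As $\lambda \to 0$, writing $y_i^{\lambda} = e^{\lambda \ln y_i} \approx 1 + \lambda \ln y_i$ gives the natural log:

$$ \lim_{\lambda \to 0} \frac{y_i^{\lambda} - 1}{\lambda} = \ln y_i $$
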
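```python
from scipy.stats import boxcox

# Box-Cox needs strictly positive data; data.Y is assumed to satisfy this
bc_result = boxcox(data.Y)  # with no lambda given, returns (transformed, lambda)
rs = bc_result[0]   # transformed data
lam = bc_result[1]  # lambda found by maximum likelihood
```
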
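
The snippet above can be expanded into a self-contained sketch. The data here are synthetic (an exponential sample, chosen only because it is positive and skewed), and the use of `scipy.special.inv_boxcox` to return to the original units is an addition for illustration.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, normaltest

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)  # strictly positive, right-skewed

# Let scipy pick lambda by maximum likelihood
y_bc, lam = boxcox(y)

# Same transformation written out from the formula (valid because lam != 0 here)
y_manual = (y ** lam - 1.0) / lam

# Compare normality before and after the transformation
_, p_before = normaltest(y)
_, p_after = normaltest(y_bc)

# Map transformed values (e.g. model predictions) back to the original scale
y_back = inv_boxcox(y_bc, lam)
```

When a regression is fit on `y_bc`, its predictions live on the transformed scale, so `inv_boxcox` with the same `lam` is needed before reporting them.
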
### 2.3. Evaluation Metrics
- The variability of the data set can be measured with two sums of squares. The first is the sum of squares of residuals, also called the residual sum of squares, where $y_i$ are the observed values, $f_i$ the predicted values, and $e_i = y_i - f_i$ the residuals:
$$SS_{\text{res}} = \sum_{i} (y_i - f_i)^2 = \sum_{i} e_i^2$$
- The second is the total sum of squares (proportional to the variance of the data), where $\bar{y}$ is the mean of the observed values:
$$SS_{\text{tot}} = \sum_{i} (y_i - \bar{y})^2$$
- The most general definition of the coefficient of determination is:
$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
- In the best case, the modeled values exactly match the observed values, which results in $SS_{\text{res}} = 0$ and $R^2 = 1$. A baseline model, which always predicts $\bar{y}$, will have $R^2 = 0$. Models that have worse predictions than this baseline will have a negative $R^2$.
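
The three cases can be checked numerically. This is a small sketch with made-up numbers; `r_squared` is a hypothetical helper written from the formulas above, compared against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y = np.array([3.0, 5.0, 7.0, 9.0])             # mean is 6, SS_tot = 20
reversed_pred = np.array([9.0, 7.0, 5.0, 3.0])

perfect = r_squared(y, y)                           # SS_res = 0  -> 1.0
baseline = r_squared(y, np.full_like(y, y.mean()))  # SS_res = 20 -> 0.0
worse = r_squared(y, reversed_pred)                 # SS_res = 80 -> -3.0

# The helper agrees with scikit-learn's implementation
matches_sklearn = np.isclose(worse, r2_score(y, reversed_pred))
```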