## 1. What is the influence of the regularization parameter on a supervised learning algorithm?

The core idea behind regularization is that we prefer models that are simpler, for a certain definition of "simpler", even if they make more errors on the training set. Source: https://scipy-lectures.org/packages/scikit-learn/index.html

**Consider logistic regression as an example of a supervised learning algorithm. Regularization parameter: C**

Regularization is designed to penalize model complexity. In scikit-learn's logistic regression, C is the inverse of the regularization strength: the lower the value of C, the stronger the penalty and the less complex the model, which decreases the error due to variance (**overfit**). Too much regularization (a value of C that is too low), on the other hand, increases the error due to bias (**underfit**). It is important to choose the regularization parameter so that the combined error from both sources is as small as possible.
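A minimal sketch of how C changes the fitted model, using the iris data. The feature scaling step, the `max_iter` setting and the three C values are illustrative assumptions, not part of the original notes. With a smaller C the penalty is stronger, so the coefficients are shrunk more.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fit the same model with weak, default and strong regularization
# (illustrative values; small C means a strong penalty).
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X, y)
    coef = model.named_steps["logisticregression"].coef_
    coef_norms[C] = np.linalg.norm(coef)
    print(f"C={C:>6}: ||coef|| = {coef_norms[C]:.2f}, "
          f"train accuracy = {model.score(X, y):.3f}")
```

The norm of the coefficients grows with C. Training accuracy alone does not tell us which C generalizes best, which is why the value is chosen by cross-validation.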

**Understanding the bias-variance trade-off:** http://scott.fortmann-roe.com/docs/BiasVariance.html

In [5]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV

In [6]:
X, y = load_iris(return_X_y=True)

Choosing the value of C using cross-validation

In [8]:
# Cs: candidate values of C (inverse regularization strength) to try
# C_: the value of C selected by cross-validation, one entry per class
# the lower C is, the more regularization we use

Cs = [0.1, 1.0, 2.0, 3.0]
clf = LogisticRegressionCV(Cs=Cs, cv=5, random_state=0).fit(X, y)
clf.C_

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

array([1., 1., 1.])

The parameter C and logistic regression: https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-regression

Regularization in SVMs: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

Model selection using scikit-learn: https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

### Lasso vs Ridge
Both are regularized linear regression models: they reduce the influence of non-informative features. Ridge (L2 penalty) shrinks the coefficients towards zero but does not set them exactly to zero, so it only lessens the impact of irrelevant features. Lasso (L1 penalty) can shrink some coefficients to exactly zero, which effectively removes those features from the model.
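The difference can be seen on a small synthetic problem. This is a sketch under stated assumptions: the data, the choice of three informative features out of ten, and the `alpha` values are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Hypothetical ground truth: only the first 3 of 10 features matter.
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero in each model.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros: ridge =", n_zero_ridge, ", lasso =", n_zero_lasso)
```

Ridge keeps small but non-zero weights on the irrelevant features, while Lasso sets some of them exactly to zero.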

## 2. Logistic regression vs nearest neighbours

KNN
1. non-parametric. no training
2. works for non-linear data (it can learn non-linear boundaries)
3. predicts the labels
4. it can be slow and memory-hungry because it has to keep all the training data: to classify a new point we need to compute its distance to each training example
5. prone to overfitting and sensitive to outliers. Using k > 1 helps in getting a smoother, more stable algorithm
6. depends on the choice of the distance function (e.g. Euclidean is sensitive to large deviations in one feature)

For categorical attributes, use the Hamming distance (the number of attributes in which x1 and x2 differ)
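A minimal sketch of the Hamming distance for categorical examples. The function name and the example tuples are hypothetical and only serve as an illustration.

```python
def hamming_distance(x1, x2):
    """Number of positions at which two equal-length examples differ."""
    if len(x1) != len(x2):
        raise ValueError("examples must have the same number of attributes")
    return sum(a != b for a, b in zip(x1, x2))


# Hypothetical examples described by (colour, size, shape).
a = ("red", "small", "round")
b = ("red", "large", "round")
c = ("blue", "large", "square")

print(hamming_distance(a, b))  # 1
print(hamming_distance(a, c))  # 3
```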

Questions to think about: how to resolve ties in nearest neighbours?
How to handle missing values? (trying not to affect the distances too much. e.g. use the average value for the missing attribute)

https://www.youtube.com/watch?v=k_7gMp5wh5A overview of KNN

Logistic regression
1. parametric. requires training
2. linear classifier
3. predicts probabilities

https://www.youtube.com/watch?v=-la3q9d7AKQ overview of logistic regression

https://web.stanford.edu/class/stats202/content/lec8-cond.pdf more detailed notes

**Exercise in scikit-learn #1**

**Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test prediction performance on these observations**

In [10]:
# TODO
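One possible way to approach this exercise, as a sketch rather than a reference solution. The pixel rescaling, the default number of neighbours and the `max_iter` value are assumptions made here, and logistic regression is used as the linear model.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X = X / X.max()  # rescale pixel intensities to [0, 1]

# Keep the first 90% for training and hold out the last 10% for testing.
n_train = int(0.9 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

knn = KNeighborsClassifier().fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

knn_score = knn.score(X_test, y_test)
logreg_score = logreg.score(X_test, y_test)
print(f"KNN accuracy on the held-out 10%: {knn_score:.3f}")
print(f"Logistic regression accuracy on the held-out 10%: {logreg_score:.3f}")
```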

**Exercise in scikit-learn #2**

**Cross-validation in scikit-learn.**

In [None]:
# TODO

## 3. What happens if we do K-means with increasing numbers of clusters?

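If K is larger than the true number of clusters, data points that belong to the same true cluster will be split across multiple clusters. If K equals the total number of examples, each point will be assigned to its own cluster. The within-cluster sum of squared distances keeps decreasing as K grows, so it cannot be used on its own to choose K.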
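A small sketch of this behaviour on synthetic data. The blob dataset (30 points around 3 centres) and the list of K values are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 30 points drawn around 3 well-separated centres (synthetic data).
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.5, random_state=0)

inertias = {}
n_used_clusters = {}
for k in [1, 3, 10, 30]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances
    n_used_clusters[k] = len(set(km.labels_))
    print(f"K={k:>2}: inertia = {km.inertia_:.3f}, "
          f"clusters used = {n_used_clusters[k]}")
```

The inertia drops sharply up to the true number of clusters and then keeps shrinking more slowly, reaching zero when every point is its own cluster.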

## 4. Suppose we take a dataset and do PCA. We could do classification/regression with the original data, or with the data after PCA. Which approach is better?

PCA is used for reducing dimensionality and it can be used as a preprocessing step before applying a classification/regression algorithm. It can help but it can also hurt since PCA does not use the class labels in dimensionality reduction. https://www.youtube.com/watch?v=7kyOhArH1tg

PCA may discard low-variance directions that are nevertheless important for the classifier.
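A sketch comparing a classifier with and without PCA. The digits dataset, logistic regression as the classifier, 5-fold cross-validation and the numbers of components tried are all assumptions made for this illustration; results on other data may look different.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Baseline: classifier on all 64 (standardized) pixel features.
raw = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
raw_score = cross_val_score(raw, X, y, cv=5).mean()
print(f"no PCA (64 features): mean CV accuracy = {raw_score:.3f}")

# Same classifier after projecting onto the first few principal components.
pca_scores = {}
for n_components in [2, 10, 30]:
    with_pca = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LogisticRegression(max_iter=2000),
    )
    pca_scores[n_components] = cross_val_score(with_pca, X, y, cv=5).mean()
    print(f"PCA with {n_components:>2} components: "
          f"mean CV accuracy = {pca_scores[n_components]:.3f}")
```

Keeping only two components throws away much of the information the classifier needs, while a moderate number of components may come close to the baseline with fewer features.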

In the class: show an example of applying a classifier with and without PCA.

An existing example: https://towardsdatascience.com/dimensionality-reduction-does-pca-really-improve-classification-outcome-6e9ba21f0a32

More about PCA: