# Monday 12 Feb: Introduction

**Why estimate F?**

&rarr; Compare the variance explained by the model to the unexplained variance.

* which predictors are associated with the response?
* what is the relationship between each predictor and the response?
* which predictors are sufficient to sum up the response?

The formula for the F-statistic is: 
$$\large F=\frac{MSR}{MSE}=\frac{SSR/k}{SSE/(n-k-1)}$$

* MSR (Mean Square Regression) = the Sum of Squares Regression (SSR), i.e. the variation explained by the model, divided by its degrees of freedom k (the number of predictors). &rarr; SSR is measured by summing the squares of the differences between the predicted values $\hat{y}_i$ and the sample mean $\bar{y}$: $\large SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$
* MSE (Mean Square Error) = the Sum of Squares Error (SSE), i.e. the unexplained variation, divided by the degrees of freedom of the error term, n - k - 1. &rarr; $\large SSE=\sum_{i=1}^n (y_i - \hat{y}_i)^2$

In [7]:
def f(ssr, sse, n, k):  # F-statistic: MSR / MSE
    return (ssr / k) / (sse / (n - k - 1))

In [8]:
f1 = f(10, 5, 50, 10)
f2 = f(10, 5, 50, 20)
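
A minimal sketch of where the sums of squares come from, assuming simulated data and a simple least-squares fit with one predictor (so k = 1). The data, the seed and the names `ssr`, `sse`, `sst` and `f_stat` are illustrative assumptions, not part of the notes.

```python
import random

random.seed(0)                       # illustrative data, not from the notes
n, k = 50, 1                         # 50 observations, one predictor
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

# Least-squares fit of y = b0 + b1 * x
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total, equals ssr + sse

f_stat = (ssr / k) / (sse / (n - k - 1))
print(f_stat)
```

For a least-squares fit with an intercept, `ssr + sse` equals the total sum of squares, which is a quick sanity check on both formulas.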

**How to estimate f(X)?**

&darr;

* Parametric methods: make an assumption about the shape of the relationship (linear, quadratic, ...), then fit the model to the data, i.e. choose the parameters that minimize the MSE. Example &rarr; linear regression.

* Non-parametric methods: make no assumption about the functional form of f, but try to get as close to the data points as possible without being too "rough", i.e. without overfitting. Example &rarr; thin-plate spline method.
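
The two approaches can be contrasted on simulated data. This is a rough sketch under assumptions: the true relationship is taken to be `sin(2x)`, and nearest-neighbour averaging stands in for the non-parametric method (it is not the thin-plate spline mentioned above). All names are hypothetical.

```python
import math
import random

random.seed(1)
n = 40
x = [random.uniform(0, 3) for _ in range(n)]
y = [math.sin(2 * xi) + random.gauss(0, 0.2) for xi in x]   # true f is non-linear

# Parametric: assume f(x) = b0 + b1 * x, so only two numbers are estimated
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

def predict_linear(x_new):
    return b0 + b1 * x_new

# Non-parametric: no functional form, average the k nearest responses
def predict_knn(x_new, k=5):
    nearest = sorted(range(n), key=lambda i: abs(x[i] - x_new))[:k]
    return sum(y[i] for i in nearest) / k

# Training MSE of both fits
mse_linear = sum((yi - predict_linear(xi)) ** 2 for xi, yi in zip(x, y)) / n
mse_knn = sum((yi - predict_knn(xi)) ** 2 for xi, yi in zip(x, y)) / n
print(mse_linear, mse_knn)
```

The linear model is summarised by `b0` and `b1` alone, while the nearest-neighbour fit has no compact description: it needs the whole training set to make a prediction.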

Trade-off between prediction accuracy and model interpretability: *the more flexible the model, the harder it is to interpret* &rarr; *a flexible model is less suitable for inference, whereas an inflexible model is more suitable*



MSE: Mean Squared Error
* Training MSE &rarr; minimized when fitting the model, but not informative about its predictive capacity.
* Test MSE &rarr; used for assessing model performance on unseen data.

![image.png](attachment:493ed960-e12e-48df-ad0d-87b40d775daa.png)
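
The gap between training and test MSE can be reproduced with a small simulation. This sketch assumes a true function `sin(2x)` and uses nearest-neighbour averaging, where a smaller `k` means a more flexible fit. The setup and names are illustrative assumptions.

```python
import math
import random

random.seed(2)

def f_true(x):
    return math.sin(2 * x)

def sample(n, noise=0.3):
    xs = [random.uniform(0, 3) for _ in range(n)]
    return xs, [f_true(x) + random.gauss(0, noise) for x in xs]

x_train, y_train = sample(50)
x_test, y_test = sample(1000)

def knn_predict(x_new, k):
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return sum(y_train[i] for i in order[:k]) / k

def mse(xs, ys, k):
    return sum((y - knn_predict(x, k)) ** 2 for x, y in zip(xs, ys)) / len(xs)

results = {}
for k in (50, 5, 1):                     # from least to most flexible
    results[k] = (mse(x_train, y_train, k), mse(x_test, y_test, k))
    print(k, round(results[k][0], 3), round(results[k][1], 3))
```

With `k = 1` every training point is its own nearest neighbour, so the training MSE is zero, while the test MSE stays well above zero. That is why the training MSE alone says little about predictive performance.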

The bias-variance trade-off: variance refers to the amount by which the estimate of f would change if we estimated it using a different training dataset. Bias refers to the error that is introduced by approximating a (complicated) real-world problem with a simplified model.

&rarr; *ideally the bias and variance would be low*

&rarr; *more flexible methods have less bias, but higher variance*
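
The trade-off can be written as a decomposition of the expected test MSE at a single point. The symbols $x_0$, $y_0$ (a test observation) and $\hat{f}$ (the fitted model) are notation introduced here for illustration:

$$\large E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\epsilon)$$

The last term is the irreducible error: it does not depend on the model, so the expected test MSE can never fall below it.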



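A classifier is assessed by calculating the fraction of misclassified instances, the ***error rate***. Computed on the training data it is the ***training error rate***; computed on unseen data it is the ***test error rate***.

Bayes classifier &rarr; assigns each observation to the class with the highest conditional probability given its predictor values.

The Bayes decision boundary is the boundary at which the probability of either class is 50 percent (in the two-class case).

The Bayes classifier attains the lowest possible test error rate, the Bayes error rate. At a given point this error is one minus the largest conditional class probability.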

![image.png](attachment:8bdc1e89-1639-43a2-8ba6-84f816ec3f25.png)
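
When the class distributions are known, the Bayes error rate can be computed exactly. The sketch below assumes a toy problem that is not in the notes: two equally likely classes with one predictor, normally distributed around 0 and 2 with standard deviation 1. The Bayes decision boundary is then the midpoint between the means.

```python
import math
import random

mu0, mu1, sigma = 0.0, 2.0, 1.0      # assumed class means and shared std
boundary = (mu0 + mu1) / 2           # point where P(Y=1 | X=x) = 0.5

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# With equal priors, each class lands on the wrong side with this probability
bayes_error = normal_cdf(-(mu1 - mu0) / (2 * sigma))

# Check by simulation: classify with the Bayes rule and count mistakes
random.seed(3)
n = 100_000
errors = 0
for _ in range(n):
    label = random.randint(0, 1)
    x = random.gauss(mu1 if label == 1 else mu0, sigma)
    predicted = 1 if x > boundary else 0
    errors += predicted != label
simulated_error = errors / n
print(round(bayes_error, 4), round(simulated_error, 4))
```

The simulated error of the Bayes rule settles near the theoretical value, and no other classifier can do better on this problem because the two class distributions overlap.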

The K-nearest neighbours (KNN) algorithm estimates the relationship at a point $x_0$ from the training observations closest to it, which works even when there are few or no observations exactly at $x_0$. It takes the K nearest training points as the neighbourhood of $x_0$ and averages their responses (or, for classification, takes the majority class). K is the number of neighbours used.

&rarr; *this method breaks down for higher-dimensional data, because the nearest neighbours are then typically far away (curse of dimensionality)*

![image.png](attachment:c6a5b188-fbb7-495a-aa82-1c5439e7ed8e.png)
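
A minimal sketch of KNN classification by majority vote, using only the standard library. The function name `knn_classify` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x0, k):
    """Majority vote among the k training points closest to x0."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x0))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(X_train, y_train, (0.5, 0.5), k=3))   # A
print(knn_classify(X_train, y_train, (5.5, 5.5), k=3))   # B
```

Sorting all distances is fine for a handful of points; for large training sets a spatial index would be the usual choice.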

# questions

**1. For each of parts (a) through (d), indicate whether we would generally
expect the performance of a flexible statistical learning method to be
better or worse than an inflexible method. Justify your answer**

(a) The sample size n is extremely large, and the number of predictors p is small.

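Flexible: with a large n, a flexible method can fit the data closely (lower bias) while the large sample keeps its variance under control. An inflexible method could also work here if the true relationship is simple.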

(b) The number of predictors p is extremely large, and the number
of observations n is small.

Inflexible: with many predictors and few observations, a flexible method would overfit (high variance), so we expect it to perform worse.

(c) The relationship between the predictors and response is highly
non-linear.

Flexible, as it does not need to make any assumption about what the relationship between predictors and response looks like.

(d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{Var}(\epsilon)$, is extremely
high.

Inflexible: a flexible method would fit the noise in the error terms, which increases its variance, so we expect it to perform worse.

**2. Explain whether each scenario is a classification or regression problem,
and indicate whether we are most interested in inference or prediction.
Finally, provide n and p.**

(a) We collect a set of data on the top 500 firms in the US. For each
firm we record profit, number of employees, industry and the
CEO salary. We are interested in understanding which factors
affect CEO salary.

This is a regression problem, as CEO salary is a continuous response. We are interested in inference, as we want to identify which factors affect it. n = 500, p = 3 (profit, number of employees, industry).

(b) We are considering launching a new product and wish to know
whether it will be a success or a failure. We collect data on 20
similar products that were previously launched. For each product
we have recorded whether it was a success or failure, price
charged for the product, marketing budget, competition price,
and ten other variables.

This is a classification problem, as the response is binary (success or failure), and we are interested in prediction. n = 20, p = 13 (price, marketing budget, competition price and ten other variables).

(c) We are interested in predicting the % change in the USD/Euro
exchange rate in relation to the weekly changes in the world
stock markets. Hence we collect weekly data for all of 2012. For
each week we record the % change in the USD/Euro, the %
change in the US market, the % change in the British market,
and the % change in the German market.

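This is a regression problem, as the response (the % change in the USD/Euro exchange rate) is continuous, and the goal is prediction. n = 52 (weekly data for one year), p = 3 (the % changes in the US, British and German markets).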

**3. We now revisit the bias-variance decomposition.**

(a) Provide a sketch of typical (squared) bias, variance, training error,
test error, and Bayes (or irreducible) error curves, on a single
plot, as we go from less flexible statistical learning methods
towards more flexible approaches. The x-axis should represent
the amount of flexibility in the method, and the y-axis should
represent the values for each curve. There should be five curves.
Make sure to label each one.

![image.png](attachment:d4af87b8-b221-4e79-ad71-45fe400787fa.png)

(b) Explain why each of the five curves has the shape displayed in
part (a)

Squared bias decreases as the model becomes more flexible: bias is the error introduced by approximating real-world data with a simplified model &rarr; the more flexible the model, the higher its fidelity.

Variance is the amount by which the fit changes when a different training set is used &rarr; the more flexible the model, the more it varies between training datasets, so variance increases.

Training error decreases monotonically as the model becomes more flexible, because a flexible model follows the training data ever more closely (eventually overfitting).

Test error is U-shaped: it goes down as the model becomes more flexible, until the model starts overfitting; then it goes up because the fit is too specific to the training data.

The Bayes (irreducible) error does not depend on the flexibility of the model, so it is a horizontal line. It is the noise that no model can remove, and the test error always lies above it.
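
The shapes of the bias and variance curves can be checked with a simulation. This is a sketch under assumptions not taken from the notes: the true function is `sin(2x)`, the method is nearest-neighbour averaging (smaller `k` is more flexible), and bias and variance are estimated at one arbitrary test point by refitting on many simulated training sets.

```python
import math
import random

random.seed(4)

def f_true(x):
    return math.sin(2 * x)

x0, sigma = 0.75, 0.3        # test point and noise std, both chosen arbitrarily
n_train, n_reps = 30, 500

def knn_at_x0(k):
    # Draw a fresh training set and predict at x0 from its k nearest points
    xs = [random.uniform(0, 3) for _ in range(n_train)]
    ys = [f_true(x) + random.gauss(0, sigma) for x in xs]
    nearest = sorted(range(n_train), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

summary = {}
for k in (30, 5, 1):         # from least to most flexible
    preds = [knn_at_x0(k) for _ in range(n_reps)]
    mean_pred = sum(preds) / n_reps
    bias_sq = (mean_pred - f_true(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_reps
    summary[k] = (bias_sq, variance)
    print(k, round(bias_sq, 4), round(variance, 4))
```

The least flexible setting (`k = 30`, the mean of all responses) has the largest squared bias and the smallest variance, and the most flexible one (`k = 1`) shows the reverse.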

**4. You will now think of some real-life applications for statistical learning**

(a) Describe three real-life applications in which classification might
be useful. Describe the response, as well as the predictors. Is the
goal of each application inference or prediction? Explain your
answer.

1. classifying disease states
2. classifying the success of an election candidate
3. classifying presence of fish 

**5. What are the advantages and disadvantages of a very flexible (versus
a less flexible) approach for regression or classification? Under what
circumstances might a more flexible approach be preferred to a less
flexible approach? When might a less flexible approach be preferred?**

The more flexible the model, the more prone it is to overfitting and the less interpretable it becomes. Its advantage is lower bias: it can capture complex, non-linear relationships. A flexible model is therefore more suitable for prediction, as this doesn't necessitate an understanding of the model.

However, when the goal is inference, i.e. understanding the most important features, a simpler model helps because it is easier to interpret.

**6. Describe the differences between a parametric and a non-parametric
statistical learning approach. What are the advantages of a parametric
approach to regression or classification (as opposed to a non-parametric
approach)? What are its disadvantages?**

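The difference between parametric and non-parametric approaches lies in the assumption about f: a parametric model assumes a functional form and so reduces the problem to estimating a small number of parameters, which makes it easier to fit and to interpret. A non-parametric model does not assume a fixed functional form and is more of a black box. The disadvantage of a parametric approach is that the assumed form may not match the true relationship; a non-parametric approach avoids this but needs more observations.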

**7. The table below provides a training data set containing six observations,
three predictors, and one qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when
X1 = X2 = X3 = 0 using K-nearest neighbors.**

![image.png](attachment:d7198e00-f708-4427-8e71-55391ee62ed9.png)

In [3]:
from scipy.spatial import distance


**a) Compute the Euclidean distance between each observation and
the test point, X1 = X2 = X3 = 0.**

In [4]:
points = [(0,3,0), (2,0,0), (0,1,3), (0,1,2), (-1,0,1), (1,1,1)]
test = (0,0,0)

for point in points:
    d = distance.euclidean(point, test)
    print(d)

3.0
2.0
3.1622776601683795
2.23606797749979
1.4142135623730951
1.7320508075688772


**b) What is our prediction with K = 1? Why?**

K = 1 &rarr; the prediction is the class of the single nearest neighbour, observation 5 at distance 1.41 (Green in the table above), since with one neighbour no vote is needed.

**c) What is our prediction with K = 3? Why?**

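The three nearest neighbours are observations 5, 6 and 2, at distances 1.41, 1.73 and 2.0. Because the response is qualitative, the prediction is the majority class among these three rather than an average of their distances: in the table above two of them are Red and one is Green, so the prediction is Red.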
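
Since Y is qualitative, the KNN prediction is a class label decided by majority vote among the nearest neighbours. The sketch below applies that to the six points used in part (a). The `labels` list is an assumption: it is transcribed from the table shown as an image above, so adjust it if your table differs. `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

points = [(0, 3, 0), (2, 0, 0), (0, 1, 3), (0, 1, 2), (-1, 0, 1), (1, 1, 1)]
# ASSUMPTION: class labels as read from the table above; adjust if they differ
labels = ["Red", "Red", "Red", "Green", "Green", "Red"]
test = (0, 0, 0)

def knn_predict(k):
    # Rank observations by Euclidean distance to the test point, then vote
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], test))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(1))   # Green: the nearest point is (-1, 0, 1)
print(knn_predict(3))   # Red: two of the three nearest points are Red
```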

**(d) If the Bayes decision boundary in this problem is highly non-linear,
then would we expect the best value for K to be large or
small? Why?**

The best value for K will be small, as a small K makes the model more flexible and so able to follow a non-linear boundary, as opposed to a large K, which gives a smoother, closer-to-linear boundary.