# Chapter 2 Exercises

## 1.) For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

#### a.) The sample size *n* is extremely large, and the number of predictors *p* is small

**FIRST/INCORRECT ANSWER**: An inflexible method would generally be better here. With such a large number of samples, we don't want our model reacting heavily to changes in a small amount of samples, since it is more likely to be true that a small amount of sample changes will not have a significant impact on the overall pattern of the relationship between the predictors and the outcomes.

**Why was this answer incorrect?**: The statement that "we don't want our model reacting heavily to changes in a small amount of samples" is true, but "small amount of samples" is relative to the sample size. A flexible model fit to 10 observations may respond too much to changes in 5 of them, but a model fit to 1000 observations will not be affected much by 5 changes. My mistake was interpreting "flexible" to mean "responding in an extreme way to a small number of changes", which leaves out the context of the model. The correct version of that statement is: a model that is **too** flexible responds in an extreme way to changes in a number of observations that is **small relative to the total sample size**. Any model can be made overly flexible with respect to its observations, but flexibility does not automatically imply overfitting.

**CORRECT ANSWER**: A flexible method would be better suited to this situation than an inflexible method. With many observations and few predictors, the risk of overfitting is low, and a flexible method can capture potentially complex patterns that an inflexible method would miss.

#### b.) The number of predictors *p* is extremely large, and the number of observations *n* is small.

An inflexible method would be better than a flexible method here. With few observations and many predictors, overfitting sets in much earlier on the flexibility scale, and a small sample cannot reliably support estimating a complicated pattern.

#### c.) The relationship between the predictors and response is highly non-linear

This is a situation clearly suited to a flexible method as opposed to an inflexible method. With a highly non-linear relationship, we need flexibility in order to capture that pattern.

#### d.) The variance of the error terms, i.e. $σ^{2} = Var(ε)$, is extremely high.

An inflexible model would be better here. When the variance of the error terms is high, the training data contain a lot of irreducible noise. A flexible method tends to fit that noise, so its predictions change substantially from one training set to another (high variance). A less flexible, higher-bias method is less affected by the noise.
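
Parts (a) and (d) can be checked with a small simulation. The following is a minimal sketch, not part of the exercise: it assumes a true function `sin(2πx)`, Gaussian noise, a least-squares line as the inflexible method, and 1-nearest-neighbour regression as the flexible method. All function names and parameter values are illustrative choices.

```python
import math
import random


def true_f(x):
    """Assumed true relationship (illustrative, not from the exercise)."""
    return math.sin(2 * math.pi * x)


def simulate(n, noise_sd, rng):
    """Draw n observations y = f(x) + eps, with eps ~ N(0, noise_sd^2)."""
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Inflexible method: simple least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_1nn(xs, ys):
    """Flexible method: return the response of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x0: min(pairs, key=lambda p: abs(p[0] - x0))[1]


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(noise_sd, n_train=1000, n_test=500, seed=0):
    """Return (line_mse, nn_mse) measured on a fresh test set."""
    rng = random.Random(seed)
    x_train, y_train = simulate(n_train, noise_sd, rng)
    x_test, y_test = simulate(n_test, noise_sd, rng)
    line_mse = mean_squared_error(fit_line(x_train, y_train), x_test, y_test)
    nn_mse = mean_squared_error(fit_1nn(x_train, y_train), x_test, y_test)
    return line_mse, nn_mse


if __name__ == "__main__":
    for noise_sd in (0.2, 2.0):  # low noise as in (a), high noise as in (d)
        line_mse, nn_mse = compare(noise_sd)
        print(f"noise_sd={noise_sd}: line={line_mse:.3f}, 1-NN={nn_mse:.3f}")
```

Under these assumed settings, the flexible fit should have the lower test MSE when the noise is small, and the line should win when the noise is large, because the nearest-neighbour prediction reproduces the noise of whichever training point it lands on.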

## 2.) Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

#### a.) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

This is a regression problem. We would create a model where the profit, number of employees, and industry are used to predict the CEO salary. We are interested in inference here.
`n = 500` and `p = 3`

#### b.) We are considering launching a new product and wish to know whether it will be a *success* or a *failure*. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or a failure, price charged for the product, marketing budget, competition price, and ten other variables.

This is a classification problem, with *success* and *failure* buckets. Prediction is the interest. Here we have `n = 20` and `p = 13`.

#### c.)  We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

This is a regression problem, with the prediction being the percent change in the USD/Euro exchange rate. Prediction is the obvious interest here, with `n = 52` (number of weeks in a year) and `p = 3`.

## 3.) We now revisit the bias-variance decomposition.

#### a.) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

Sketch in LiquidText.
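
The five curves can also be approximated numerically. The sketch below is one hedged way to do that, under assumptions that are not part of the exercise: the true function is `sin(2πx)`, the noise standard deviation is 0.3, and flexibility is varied through K-nearest-neighbour regression, where a smaller `k` means a more flexible fit. The names `error_curves`, `knn_predict`, and `NOISE_SD` are illustrative.

```python
import math
import random
import statistics

NOISE_SD = 0.3  # assumed; the Bayes (irreducible) error is NOISE_SD ** 2


def true_f(x):
    """Assumed true regression function (illustrative only)."""
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def error_curves(ks, n_train=50, reps=100, seed=0):
    """Estimate squared bias, variance, training MSE and test MSE for each k."""
    rng = random.Random(seed)
    grid = [(i + 0.5) / 20 for i in range(20)]  # fixed test inputs in (0, 1)
    curves = {}
    for k in ks:
        # preds[j] collects the prediction at grid[j] from each training set
        preds = [[] for _ in grid]
        train_sq, test_sq = [], []
        for _ in range(reps):
            xs = [rng.random() for _ in range(n_train)]
            ys = [true_f(x) + rng.gauss(0, NOISE_SD) for x in xs]
            train_sq += [(y - knn_predict(xs, ys, x, k)) ** 2 for x, y in zip(xs, ys)]
            for j, x0 in enumerate(grid):
                pred = knn_predict(xs, ys, x0, k)
                preds[j].append(pred)
                y0 = true_f(x0) + rng.gauss(0, NOISE_SD)  # fresh test response
                test_sq.append((y0 - pred) ** 2)
        curves[k] = {
            "bias_sq": statistics.fmean(
                [(statistics.fmean(p) - true_f(x0)) ** 2 for p, x0 in zip(preds, grid)]
            ),
            "variance": statistics.fmean([statistics.pvariance(p) for p in preds]),
            "train": statistics.fmean(train_sq),
            "test": statistics.fmean(test_sq),
            "bayes": NOISE_SD ** 2,
        }
    return curves


if __name__ == "__main__":
    # k = 50 averages every training point (least flexible); k = 1 is most flexible.
    for k, row in error_curves([50, 15, 5, 1]).items():
        print(k, {name: round(value, 3) for name, value in row.items()})
```

Under these assumptions, squared bias should fall and variance should rise as `k` decreases, training error should reach zero at `k = 1`, and test error should be lowest at an intermediate `k` while staying roughly above `NOISE_SD ** 2`.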

#### b.) Explain why each of the five curves has the shape displayed in part (a).

- Bias - squared bias has an inverse relationship to flexibility, so when flexibility is low, bias will be high, and it will decrease as the flexibility increases, rapidly at first and then more gradually.
- Variance - variance has the opposite behavior from bias. It will start low, but as method flexibility increases, variance will increase along with it.
- Training error - the training error decreases steadily as flexibility increases, because a more flexible method can follow the training data more closely.
- Test error - the test error will have a U-shape, at first decreasing as flexibility increases (bias falls faster than variance rises), but then increasing again as too much flexibility starts to over-fit the data.
- Bayes error - This is a constant (though unknown), so it is represented as a horizontal line. It is a lower bound on the expected test error, so the line sits at or below the minimum of the test error curve.
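
These shapes are tied together by the standard bias-variance decomposition of the expected test MSE at a point $x_0$. The notation here is an assumption on my part: $\hat{f}$ is the fitted model and $y_0 = f(x_0) + \epsilon$ is a new observation.

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
$$

The first two terms are non-negative, so the expected test error cannot fall below $\mathrm{Var}(\epsilon)$, which is the Bayes (irreducible) error. The U-shape of the test error comes from the squared bias falling while the variance rises.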

## 4.) You will now think of some real-life applications for statistical learning.

#### a.) Describe three real-life applications in which *classification* might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

- Classifying email into "spam", "promotions", and "social" categories, like Gmail does. The response is an integer that represents one of the three (or more) categories, and the predictors could be things like word choice, frequency of exclamation marks, length of email, number of attachments, etc. The goal here is prediction. We wouldn't be very concerned with what constituted a spam or social email, just that it was filtered into the correct inbox.
- Classifying a current market as a "bull" or "bear" market. The response would be one of those two categories, and the predictors would be things like recent trade volume, market cap, performance over different time periods, options distributions, and any number of global economic factors. Here inference would be the goal, as we would be interested in how different factors affect the market.
- Classifying a song into genres. The response would be something like "rock", "pop", or "rap", and the predictors could be song length, frequency of selected words, topic (love story, breakup, partying, reflection, etc.), or even instruments used. This application would most likely only be concerned with prediction, to help sort songs into their respective genre buckets, like Spotify or similar apps do.

#### b.) Describe three real-life applications in which *regression* might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

- Predicting rainfall amount on a farm. Response would be some number representing amount of rainfall, and predictors could be previous day's or week's rainfall amount, geographic location, temperature, and cloud cover. Prediction would be the goal here, as the predicted amount is the information that would drive a farmer's decisions (for example, when to irrigate).
- Predicting average house prices for a given month/year. Response would be a numerical value representing the average (or maybe median) price of a house in a certain area. Predictors could be current listing count, average/median price of houses in the previous month, average time on the market for recent listings, and population/population growth of the area. The application would probably be more focused on inference, as the interest is in which factors affect house prices.
- Predicting future salary of an individual. The response would be a yearly salary that a person would be expected to make at a given age, and the predictors would be things like education completed, zip code, family economic status, race, and field of study (if college was completed). This model would most likely be focused on inference, to learn about what can be done to mitigate negative impacts on future salary.

#### c.) Describe three real-life applications in which *cluster analysis* might be useful

- Finding the types of users of a particular product or service.
- Finding restaurant types and frequencies in a given city.
- Finding patterns in grade distributions for a high school or university.

## 5.) What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

A flexible approach lets a model capture complicated patterns in the data. If a dataset contains highly non-linear underlying relationships, a flexible approach will be able to capture them, whereas an inflexible one will be biased. The disadvantages are that a flexible approach needs more observations to fit well, is usually harder to interpret, and is prone to overfitting when the underlying relationships are relatively simple (maybe linear) or the data are noisy. A more flexible approach is preferred when there are many observations and the relationship is complex, or when prediction matters more than interpretability. A less flexible approach is preferred when the relationship is simple, the sample is small, or the goal is inference.

## 6.) Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

A parametric approach assumes a functional form for `f` (for example, linear), which reduces the problem to estimating the parameters/weights of that function. This is much easier, both computationally and statistically, than trying to estimate an arbitrary `f` from scratch, and it needs fewer observations. The disadvantage is that the accuracy of the final result depends on how close the assumed form is to the true shape of `f`.

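A non-parametric approach makes no explicit assumption about the shape of `f`, so it could potentially lead to a more accurate final model, but it needs many more observations to estimate `f` well and is typically more computationally demanding.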
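
The trade-off can be illustrated with a small experiment. This is a hedged sketch under my own assumptions, not something from the exercise: the parametric method is a least-squares line, the non-parametric method is 3-nearest-neighbour regression, each training set has only 20 observations, and the two "true" functions (`linear_truth`, `curved_truth`) are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Parametric fit: least squares reduces the data to (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_predict(xs, ys, x0, k=3):
    """Non-parametric fit: no assumed shape, but all training points are kept."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def average_test_mse(true_f, n_train=20, n_test=50, reps=200, noise_sd=0.3, seed=0):
    """Return (line_mse, knn_mse), averaged over many small training sets."""
    rng = random.Random(seed)
    line_total = knn_total = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        intercept, slope = fit_line(xs, ys)
        for _ in range(n_test):
            x0 = rng.random()
            y0 = true_f(x0) + rng.gauss(0, noise_sd)  # fresh test observation
            line_total += (y0 - (intercept + slope * x0)) ** 2
            knn_total += (y0 - knn_predict(xs, ys, x0)) ** 2
    count = reps * n_test
    return line_total / count, knn_total / count


def linear_truth(x):
    """Hypothetical case where the linear assumption is correct."""
    return 1 + 3 * x


def curved_truth(x):
    """Hypothetical case where the linear assumption is wrong."""
    return math.sin(2 * math.pi * x)


if __name__ == "__main__":
    for name, f in (("linear truth", linear_truth), ("curved truth", curved_truth)):
        line_mse, knn_mse = average_test_mse(f)
        print(f"{name}: line={line_mse:.3f}, 3-NN={knn_mse:.3f}")
```

When the assumed form matches the truth, the line should do better, since it only has two parameters to estimate. When the truth is curved, the line stays biased however the parameters are chosen, and the nearest-neighbour fit should come out ahead.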

## 7.)