# **Quiz Instructions**
## **About this Quiz**

This is a closed-book, individual quiz. Do not use any outside help or work on questions with a partner. I will likely know. The quiz is worth 11 points and has 4 questions, each with multiple subquestions. It should not take more than 35 minutes for those without accommodations. Be sure to start the quiz no later than 11:25 PM PT on quiz day, or you will not have the full time to finish.

## **About the Questions**

- Be sure to read the instructions for all questions carefully. I've taken great care to explain exactly what a full-credit response should contain.

# **Question 1: Bias, Variance, and Their Tradeoffs...**

1. When we refer to model selection, what do we mean by a model's bias?
2. When we discuss a model with high variance, what do we mean?
3. What is the bias-variance tradeoff?


# **Answer 1**

**Bias:**
- When we choose a simplistic or overly "rigid" model, we have a model with high bias. This causes problems in prediction when our data has more variability than the model can capture, because our error will rise.

- High bias models are like people with stereotypes about the world based on little data: they need to be more flexible because they aren't basing their choices on the data.

- They tend to perform poorly on both the training data and unseen data (underfitting).

**Variance:**

- When we choose a model that is overly flexible, we run the risk of memorizing our data and overfitting it. This is the phenomenon of high variance. A model with high variance has learned the random noise in the training data, so its predictions are overly sensitive to which samples it happened to see. It tends to perform well on the training data but poorly on unseen data.

**Bias-Variance Tradeoff:**
- The bias-variance tradeoff is the tradeoff between a model's ability to generalize well to new data (low variance) and its ability to fit the training data well (low bias).
- Ideally, a model should have low bias and low variance, but in practice, reducing bias may increase variance and vice versa. This tradeoff highlights the challenge of finding a model with just the right amount of flexibility to capture the underlying structure of the data while still generalizing well to new, unseen data.
- The goal is to find a good balance that minimizes the total expected error, which is the sum of squared bias, variance, and irreducible error.
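
The decomposition above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true function (a sine curve), the noise level, the fixed training grid, and the polynomial degrees are all made up for illustration and are not part of the quiz. It refits polynomials of different flexibility to many noisy copies of the same data and estimates the squared bias and the variance of the predictions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_fn(x):
    # Assumed "true" relationship, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_train=30, n_repeats=200, noise_sd=0.3):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    x_train = np.linspace(0, 1, n_train)   # fixed design; only the noise changes
    x_test = np.linspace(0.05, 0.95, 50)   # points where predictions are compared
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        y_train = true_fn(x_train) + rng.normal(0, noise_sd, n_train)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[i] = model(x_test)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average prediction is from the truth.
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    # Variance: how much predictions move from one training sample to the next.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the rigid degree-1 model should show the largest squared bias and the smallest variance, while the flexible degree-9 model should show the reverse.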

# **Question 2: Encoding Text Variables**

You are provided the following data on tipping behavior at restaurants. Suppose an analyst, in order to use the "day" text data in a logistic regression, converts its string values to ordinal integers from 1 to 4.

1. Is this a good idea? Why or why not?
2. What would you suggest as an alternative method of encoding the data?


```python
import seaborn as sns

tips = sns.load_dataset("tips")

tips.head()
```

| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |


```python
tips['day'].value_counts()
```

```
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
```

# **Answer 2**
1. It's a bad idea. An ordinal encoding implies both an order and a consistent spacing between the categories, which goes beyond "one follows the other". For instance, if Sunday is encoded as 4 and Thursday as 1, does that mean Sunday is worth four times Thursday? Probably not. Besides, we don't even have all days of the week in our data, so assigning them consecutive values is probably not wise.
2. Any of the following would be sufficient:
- aggregate your data by day and use the day counts instead, if there's enough data.
- one hot encode each day (is_thurs, is_fri, etc.)
- convert to embeddings.
- etc.
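
As a rough illustration of the one-hot option above, here is a minimal sketch using `pandas.get_dummies`. The five rows are toy values standing in for the "day" column, and the names `days`, `ordinal_map`, and `one_hot` are hypothetical, chosen only for this example.

```python
import pandas as pd

# Toy rows standing in for the "day" column of the tips data (illustrative only).
days = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sat"]})

# Ordinal encoding: the integers impose an arbitrary order and spacing.
ordinal_map = {"Thur": 1, "Fri": 2, "Sat": 3, "Sun": 4}
days["day_ordinal"] = days["day"].map(ordinal_map)

# One-hot encoding: one 0/1 indicator column per category, with no implied order.
one_hot = pd.get_dummies(days["day"], prefix="is", dtype=int)

print(days)
print(one_hot)
```

When the indicators feed a regression that also has an intercept, passing `drop_first=True` to `get_dummies` drops one category as the baseline and avoids perfectly collinear columns.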


# **Question 3**

1. Why do we use a regression like log-log regression? In other words, what do we seek to achieve in log-log regression that we cannot with regular linear regression?
2. How should you interpret the coefficient on a logged independent variable when the dependent variable is also logged?

# **Answer 3**

**Why log-log regression?**

Log-log regression is used to linearize power-law (multiplicative) relationships between variables by applying logarithmic transformations to both the dependent and independent variables. This enables the use of linear regression techniques for easier analysis and interpretation. It's particularly useful for estimating elasticities, mitigating heteroskedasticity, and resolving scaling issues when variables span several orders of magnitude.

**Interpreting Coefficients in Log-Log Regression:**

In a log-log regression model, both the dependent and independent variables are logged. The coefficients in this model represent elasticities, which are interpretable as **the percentage change in the dependent variable associated with a one percent change in the independent variable, holding other factors constant.**

- For example, if a coefficient of an independent variable is 0.8, this suggests that a 1% increase in that independent variable is associated with an approximate 0.8% increase in the dependent variable, all else being equal. This interpretation is straightforward and often more intuitive when discussing relationships in terms of relative (percentage) change rather than absolute (unit) change.
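
The elasticity reading can be checked on synthetic data. The sketch below assumes a made-up power-law relationship `y = 3 * x**0.8` with multiplicative noise; the constant 3, the noise level, and the sample size are illustrative assumptions rather than anything from the quiz. It fits a straight line to `log(y)` against `log(x)` and converts the slope into the percentage change in `y` for a 1% increase in `x`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: x spans about three orders of magnitude (assumed for illustration).
true_elasticity = 0.8
x = np.exp(rng.uniform(0, np.log(1000), size=200))
y = 3.0 * x ** true_elasticity * np.exp(rng.normal(0, 0.1, size=200))

# Ordinary least squares on the logged variables: log(y) = intercept + slope * log(x).
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

# Implied percentage change in y when x increases by 1%.
pct_change = (1.01 ** slope - 1) * 100

print(f"estimated elasticity: {slope:.3f}")
print(f"a 1% increase in x is associated with about a {pct_change:.2f}% change in y")
```

The estimated slope should land close to the assumed elasticity of 0.8, and the implied percentage change should be close to 0.8% as well, which is why the coefficient is read directly as an elasticity.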

# **Question 4**

Define the Receiver Operating Characteristic:
1. What does it measure?
2. What two rates comprise it?
3. What is the range of the ROC?
4. How does the AUC relate to the ROC?

# **Answer 4**

1. The ROC curve measures the tradeoff between the TPR and FPR of a binary classifier across every unique probability threshold from 0.0 to 1.0.
2. False positive rate, True positive rate
3. Zero to one on both axes, since the TPR and FPR are each rates between 0 and 1.
4. AUC: The area under the ROC curve, known as AUC (Area Under the Curve), provides a single metric summarizing the overall discriminative ability of the classifier across all thresholds. On the ROC curve itself, a point closer to the top-left corner represents a better trade-off, with a higher true positive rate and a lower false positive rate. An AUC of 0.5 corresponds to random guessing, and an AUC of 1.0 corresponds to a perfect classifier.
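
To make the answers above concrete, here is a minimal pure-Python sketch that builds the ROC points and the AUC by hand. The four labels and scores are toy values, and the helper names `roc_points` and `auc` are hypothetical, written for this example rather than taken from a library.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per unique threshold, starting from (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Sweep thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area)    # 0.75
```

Each point is one threshold's (FPR, TPR) pair, both axes run from 0 to 1, and the AUC is simply the area beneath the line connecting those points.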