## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Do not write your name on the assignment.

3. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

4. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

5. The assignment is worth 100 points, and is due on **Tuesday, 17th January 2023 at 11:59 pm**. 

6. There is a **bonus** question worth 5 points.

7. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 pt); *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.* 
- No name can be written on the assignment, nor can there be any indicator of the student’s identity; for example, printouts of the working directory should not be included in the final submission (1 pt).
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt).
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt).

8. The maximum possible score in the assignment is 95 + 5 (formatting) + 5 (bonus question) = 105 out of 100. There is no partial credit for the bonus question.

## Regression vs Classification; Prediction vs Inference

Explain (1) whether each scenario is a classification or regression problem, and (2) whether we are most interested in inference or prediction. Answers to both parts must be supported by a justification.

### 
Consider a company that is interested in conducting a marketing campaign. The goal is to identify individuals who are likely to respond positively to a marketing campaign, based on observations of demographic variables *(such as age, gender, income, etc.)* measured on each individual. 

*(2+2 points)*

### 
Consider that the company mentioned in the previous question is interested in understanding the impact of advertising promotions in different media types on the company sales. For example, the company is interested in the question, *'how large of an increase in sales is associated with a given increase in radio vis-a-vis TV advertising?'*

*(2+2 points)*

### 
Consider a company selling furniture that is interested in finding the association between demographic characteristics of customers (such as age, gender, income, etc.) and their probability of purchasing a particular company product.

*(2+2 points)*

### 
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence, we collect weekly data for all of 2022. For each week, we record the % change in the USD/Euro exchange rate, the % change in the US market, the % change in the British market, and the % change in the German market.

*(2+2 points)*

## RMSE vs MAE

### 
Describe a regression problem in which it is more appropriate to assess model accuracy using the root mean squared error (RMSE) than the mean absolute error (MAE).

**Note:** Don't use the examples presented in class

*(4 points)*

### 
Describe a regression problem in which it is more appropriate to assess model accuracy using the mean absolute error (MAE) than the root mean squared error (RMSE).

**Note:** Don't use the examples presented in class

*(4 points)*
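
For reference, the two metrics named in this section can be computed as in the minimal sketch below. The arrays `y_true` and `y_pred` are hypothetical values chosen only for illustration; they are not taken from any course dataset.

```python
import numpy as np

# Hypothetical observed values and model predictions (illustrative only).
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.0, 15.0, 28.0])

errors = y_true - y_pred

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average the absolute values of the errors.
mae = np.mean(np.abs(errors))

print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

Both metrics are expressed in the units of the response variable, so they can be compared directly on the same data.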

## FNR vs FPR

### 
A classification model is developed to predict which customers will respond positively to a company's telemarketing campaign. All customers predicted to respond positively will be called by phone and offered the product being marketed. If a called customer purchases the product ($y = 1$), the company makes a profit of \$100. If a called customer does not purchase ($y = 0$), the company incurs a loss of \$1. Between the false positive rate (FPR) and the false negative rate (FNR), which metric is more important to minimize in order to reduce the loss associated with misclassification? Justify your answer. 

In your justification, you must clearly interpret False Negatives (FN) and False Positives (FP) first.

**Assumption:** Assume that based on the past marketing campaigns, around 50% of the customers will actually respond positively to the campaign.

*(4 points)*

### 
Can the answer to the previous question change if the assumption stated in the question is false? Justify your answer.

*(6 points)*
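
As a reminder of how the two rates are defined, the sketch below computes FPR and FNR from a pair of label vectors. The labels `y_true` and `y_pred` are hypothetical and unrelated to the campaign described above.

```python
import numpy as np

# Hypothetical actual labels and predicted labels (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Cells of the confusion matrix.
tp = np.sum((y_true == 1) & (y_pred == 1))  # actual positive, predicted positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # actual positive, predicted negative
fp = np.sum((y_true == 0) & (y_pred == 1))  # actual negative, predicted positive
tn = np.sum((y_true == 0) & (y_pred == 0))  # actual negative, predicted negative

# FPR: share of actual negatives that were predicted positive.
fpr = fp / (fp + tn)
# FNR: share of actual positives that were predicted negative.
fnr = fn / (fn + tp)

print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```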

## Petrol consumption

Read the dataset *petrol_consumption_train.csv*. It contains the following five columns: 

`Petrol_tax`: Petrol tax (cents per gallon) 

`Per_capita_income`: Average income (dollars) 

`Paved_highways`: Paved Highways (miles) 

`Prop_license`: Proportion of population with driver's licenses 

`Petrol_consumption`: Consumption of petrol (millions of gallons)

### 
Make a pairwise plot of all the variables in the dataset. Which variable seems to have the highest linear correlation with `Petrol_consumption`? Let this variable be predictor *P*. *Note: If you cannot figure out P by looking at the visualization, you may find the pairwise linear correlation coefficient to identify P.*

*(4 points)*
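
A minimal sketch of computing pairwise linear correlations with pandas is shown below. It uses synthetic data with hypothetical column names (`x1`, `x2`, `y`) rather than the petrol dataset, so it only illustrates the mechanics. For the plot itself, one common option is `seaborn.pairplot`.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption: illustrative only, not the assignment data).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # y depends on x1 only
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Pearson correlation of every predictor with the response.
corr_with_y = df.corr()["y"].drop("y")

# Predictor with the largest absolute correlation with the response.
strongest = corr_with_y.abs().idxmax()

print(corr_with_y)
print("Strongest linear association with y:", strongest)
```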

### 
Fit a simple linear regression model to predict `Petrol_consumption` based on predictor *P* (identified in the previous part). Print the model summary.

*(4 points)*
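
One possible way to fit a simple linear regression and print its summary is sketched below. The choice of `statsmodels` is an assumption (the assignment does not prescribe a library), and the data and column names `x` and `y` are synthetic placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption: illustrative only, not the assignment data).
rng = np.random.default_rng(1)
x = rng.uniform(0.4, 0.7, size=40)
y = 100.0 + 1500.0 * x + rng.normal(scale=20.0, size=40)
df = pd.DataFrame({"x": x, "y": y})

# Fit y = b0 + b1 * x by ordinary least squares.
model = smf.ols(formula="y ~ x", data=df).fit()

# The summary reports coefficients, standard errors, p-values, and R-squared.
print(model.summary())
```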

### 
Interpret the coefficient of *P*. What is the increase in petrol consumption for an increase of 0.05 in *P*?

*(2+2 points)*

### 
Does petrol consumption have a statistically significant relationship with the predictor *P*? Justify your answer.

*(4 points)*

### 
What is the R-squared? Interpret its value.

*(4 points)*

### 
Use the model developed above to estimate the petrol consumption for a state in which 50% of the population has a driver’s license. What are the confidence and prediction intervals for your estimate? Which interval includes the irreducible error?

*(4+3+3+2 = 12 points)*
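
If the model was fitted with `statsmodels` (an assumption), both intervals can be obtained from `get_prediction`, as in the sketch below. The data, the column names `x` and `y`, and the new value `5.0` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption: illustrative only, not the assignment data).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=60)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=60)
df = pd.DataFrame({"x": x, "y": y})
model = smf.ols("y ~ x", data=df).fit()

# New predictor value at which to estimate the response.
new_x = pd.DataFrame({"x": [5.0]})

# alpha=0.05 requests 95% intervals.
intervals = model.get_prediction(new_x).summary_frame(alpha=0.05)

# mean_ci_*: confidence interval for the mean response.
# obs_ci_*:  prediction interval for a new observation.
cols = ["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]
print(intervals[cols])
```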

### 
Use the model developed above to estimate the petrol consumption for a state in which 10% of the population has a driver’s license. Are you getting a reasonable estimate? Why or why not?

*(5 points)*

### 
What is the residual standard error of the model?

*(4 points)*
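
As a reminder of the definition, the residual standard error of a simple linear regression is $\sqrt{RSS/(n-2)}$, where $RSS$ is the residual sum of squares. The sketch below computes it on hypothetical data with NumPy only.

```python
import numpy as np

# Hypothetical data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)

residuals = y - (intercept + slope * x)
rss = np.sum(residuals ** 2)

# Two estimated parameters (intercept and slope), so n - 2 degrees of freedom.
n = len(y)
rse = np.sqrt(rss / (n - 2))

print(f"RSE = {rse:.4f}")
```

With a fitted `statsmodels` OLS result named `model` (a hypothetical name), the same quantity should be available as `np.sqrt(model.mse_resid)`.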

### 
Using the model developed above, predict the petrol consumption for the observations in *petrol_consumption_test.csv*. Find the RMSE (Root mean squared error). Include the units of RMSE in your answer.

*(5 points)*

### 
Based on the answers to the previous two questions, do you think the model is overfitting? Justify your answer.

*(4 points)*

Make a scatterplot of `Petrol_consumption` vs `Prop_license` using *petrol_consumption_test.csv*. Over the scatterplot, plot the regression line, the prediction interval, and the confidence interval. Distinguish the regression line, prediction interval lines, and confidence interval lines with the following colors. Include the legend as well.

- Regression line: red
- Confidence interval lines: blue
- Prediction interval lines: green

*(4 points)*

Among the confidence and prediction intervals, which interval is wider, and why?

*(1+2 points)*

### 
Find the correlation between `Petrol_consumption` and the rest of the variables in *petrol_consumption_train.csv*. Based on the correlations, a simple linear regression model with which predictor will have the least R-squared value for predicting `Petrol_consumption`? Don't develop any linear regression models.

*(4 points)*

**Bonus point question**

*(5 points - no partial credit)*

### 
Fit a simple linear regression model to predict `Petrol_consumption` based on predictor *P*, but without an intercept term.

*(you must answer this correctly to qualify for earning bonus points)*
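
A regression without an intercept forces the fitted line through the origin. The sketch below shows one way to fit such a model with NumPy on hypothetical data; the arrays `x` and `y` are placeholders.

```python
import numpy as np

# Hypothetical data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Design matrix with a single predictor column and no column of ones,
# so the least-squares fit has no intercept term.
X = x.reshape(-1, 1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
slope = coef[0]

# Closed form for regression through the origin: sum(x*y) / sum(x^2).
slope_closed = np.sum(x * y) / np.sum(x ** 2)

print(f"slope = {slope:.4f}")
```

With the `statsmodels` formula interface (an assumption about the tooling), the intercept can be dropped by appending `- 1` to the formula, for example `"y ~ x - 1"`.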

### 
Estimate the petrol consumption for the observations in *petrol_consumption_test.csv* using the model developed in the previous question. Find the RMSE.

*(you must answer this correctly to qualify for earning bonus points)*

### 
The RMSE for the models with and without the intercept are similar, which indicates that both models are almost equally good. However, the R-squared for the model without intercept is much higher than the R-squared for the model with the intercept. Why? Justify your answer.

*(5 points)*