## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Do not write your name on the assignment.

3. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

4. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

5. The assignment is worth 100 points, and is due on **Tuesday, 17th January 2022 at 11:59 pm**. 

6. There is a **bonus** question worth 5 points.

7. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 pt); *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.* 
- No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g., printouts of the working directory should not be included in the final submission  (1 pt).
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt).
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt).

8.  The maximum possible score in the assignment is 95 + 5 (formatting) + 5 (bonus question) = 105 out of 100. There is no partial credit for the bonus question.

## Regression vs Classification; Prediction vs Inference

Explain (1) whether each scenario is a classification or regression problem, and (2) whether we are most interested in inference or prediction. Answers to both parts must be supported by a justification.

### 
Consider a company that is interested in conducting a marketing campaign. The goal is to identify individuals who are likely to respond positively to a marketing campaign, based on observations of demographic variables *(such as age, gender, income, etc.)* measured on each individual. 

*(2+2 points)*

1) This scenario is a classification problem. The goal is to identify individuals who are likely to respond positively to a marketing campaign, which requires classifying individuals into two groups: those who will respond positively and those who will not. 

2) In this scenario, the company is most interested in prediction. The goal is to predict which individuals are likely to respond positively to the marketing campaign, rather than to understand how each demographic variable relates to the response.

### 
Consider that the company mentioned in the previous question is interested in understanding the impact of advertising promotions in different media types on the company sales. For example, the company is interested in the question, *'how large of an increase in sales is associated with a given increase in radio vis-a-vis TV advertising?'*

*(2+2 points)*

1) This is a regression problem, since the response variable (sales) is continuous.
2) This is an inference problem, because the company is trying to understand the relationship between advertising in different media types and the company's sales.

### 
Consider a company selling furniture that is interested in finding the association between demographic characteristics of customers (such as age, gender, income, etc.) and their probability of purchase of a particular company product.

*(2+2 points)*

1) This is a classification problem, since the response variable (whether the customer purchases the product) is categorical.
2) This is inference, because the company wants to know whether there is a positive or negative association between demographics and the probability of purchase.

### 
We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world
stock markets. Hence we collect weekly data for all of 2022. For each week we record the % change in the USD/Euro, the %
change in the US market, the % change in the British market, and the % change in the German market.

*(2+2 points)*

1) Regression, because the response variable (% change in the USD/Euro exchange rate) is continuous.
2) Prediction, because we are trying to predict the response variable from the predictors (weekly % changes in the world stock markets).

## RMSE vs MAE

### 
Describe a regression problem, where it will be more appropriate to assess the model accuracy using the root mean squared error (RMSE) metric as compared to the mean absolute error (MAE) metric.

**Note:** Don't use the examples presented in class


*(4 points)*

The root mean squared error (RMSE) is more appropriate when large errors should be penalized more heavily than small ones. Because the errors are squared before averaging, RMSE is sensitive to large errors and outliers.

An example is predicting the electricity consumption of a county. There will be large variations, since some counties use a very large amount of electricity, and large misses for those counties should weigh heavily in the assessment.

### 
Describe a regression problem, where it will be more appropriate to assess the model accuracy using the mean absolute error (MAE) metric as compared to the root mean squared error (RMSE) metric.

**Note:** Don't use the examples presented in class

*(4 points)*

The MAE is appropriate when the data contain only small variations, or when occasional outliers should not dominate the assessment. The MAE gives equal weight to all errors regardless of their size, so it does not penalize large errors as heavily as the RMSE.

Example: predicting the number of items sold in a local shop. There will not be large variations.
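The contrast between the two metrics can be illustrated with a small sketch. The error values below are made up purely for illustration (they are not taken from any dataset in this assignment); the point is only that a single large error moves the RMSE much more than the MAE.

```python
import math

def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical prediction errors: the second list replaces one small error with a large one.
typical = [1.0, -1.0, 2.0, -2.0]
with_outlier = [1.0, -1.0, 2.0, -20.0]

for name, errs in [("typical", typical), ("with outlier", with_outlier)]:
    print(f"{name}: MAE = {mae(errs):.2f}, RMSE = {rmse(errs):.2f}")
```

In the second list the RMSE grows faster than the MAE, which is why RMSE suits problems where large misses matter most and MAE suits problems where they should not dominate.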

## FNR vs FPR

### 
A classification model is developed to predict those customers who will respond positively to a company's tele-marketing campaign. All those customers that are predicted to respond positively to the campaign will be called by phone to buy the product being marketed. If the customer being called purchases the product ($y = 1$), the company will get a profit of \$100. On the other hand, if they are called and they don't purchase ($y = 0$), the company will have a loss of \$1. Among FPR (False positive rate) and FNR (False negative rate), which metric is more important to be minimized to reduce the loss associated with misclassification? Justify your answer. 

In your justification, you must clearly interpret False Negatives (FN) and False Positives (FP) first.

**Assumption:** Assume that based on the past marketing campaigns, around 50% of the customers will actually respond positively to the campaign.

*(4 points)*

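False positive (FP): a customer who is predicted to respond positively, and is therefore called, but does not purchase. Each FP costs the company \$1. The false positive rate is calculated by dividing the false positives by the sum of the false positives and the true negatives: FPR = FP / (FP + TN).

False negative (FN): a customer who is predicted not to respond, and is therefore not called, but would have purchased. Each FN costs the company a missed profit of \$100. The false negative rate is calculated by dividing the false negatives by the sum of the false negatives and the true positives: FNR = FN / (FN + TP).

In this case it is more important to reduce the false negative rate. With around 50% of customers responding positively, the two kinds of error apply to groups of similar size, and each missed sale (\$100) costs far more than each wasted call (\$1).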
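As a rough check of this reasoning, the sketch below computes both rates and the loss attached to each kind of error. The confusion-matrix counts are hypothetical (1000 customers, half of whom would purchase, matching the 50% assumption); only the \$100 and \$1 amounts come from the question.

```python
def rates(tp, fp, tn, fn):
    """Return (FPR, FNR) from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # share of actual non-buyers who were called
    fnr = fn / (fn + tp)  # share of actual buyers who were not called
    return fpr, fnr

# Hypothetical counts: 500 actual buyers, 500 actual non-buyers.
tp, fn = 400, 100
fp, tn = 100, 400

fpr, fnr = rates(tp, fp, tn, fn)
loss_fp = fp * 1    # $1 lost per wasted call
loss_fn = fn * 100  # $100 of profit missed per buyer who was not called

print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
print(f"Loss from false positives: ${loss_fp}")
print(f"Loss from false negatives: ${loss_fn}")
```

With equal error rates, the loss from false negatives is a hundred times the loss from false positives under these assumed counts.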

### 
Can the answer to the previous question change if the assumption stated in the question is false? Justify your answer.

*(6 points)*

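Yes, it can change. If only a very small fraction of customers actually respond positively, then nearly all calls go to people who are not interested, and the accumulated \$1 losses from false positives can outweigh the \$100 profits lost to false negatives. In that case we would focus on reducing the false positive rate, because we would waste time calling people who are not interested. With a \$100 profit against a \$1 loss, this requires positive responders to be rare, on the order of 1% of customers or fewer.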
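How the balance depends on the share of positive responders can be sketched as an expected loss per customer. This is a simplified illustration, not part of the original answer: it assumes the FPR and FNR are equal (0.2 is an arbitrary choice), and the function and variable names are hypothetical.

```python
def expected_loss(p, fpr, fnr, profit=100.0, call_cost=1.0):
    """Expected loss per customer from each error type.

    p is the fraction of customers who would actually purchase.
    """
    loss_fn = p * fnr * profit           # buyers who are never called
    loss_fp = (1 - p) * fpr * call_cost  # non-buyers who are called
    return loss_fn, loss_fp

# With equal error rates, the two losses balance when p * profit == (1 - p) * call_cost.
break_even = 1.0 / (100.0 + 1.0)

for p in (0.5, 0.005):
    loss_fn, loss_fp = expected_loss(p, fpr=0.2, fnr=0.2)
    print(f"p = {p}: FN loss = {loss_fn:.3f}, FP loss = {loss_fp:.3f}")
print(f"Break-even fraction of buyers: {break_even:.4f}")
```

Under these assumptions, false negatives dominate the loss unless buyers make up less than about 1% of customers.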

## Petrol consumption

Read the dataset *petrol_consumption_train.csv*. It contains the following five columns: 

`Petrol_tax`: Petrol tax (cents per gallon) 

`Per_capita_income`: Average income (dollars) 

`Paved_highways`: Paved Highways (miles) 

`Prop_license`: Proportion of population with driver's licenses 

`Petrol_consumption`: Consumption of petrol (millions of gallons)

### 
Make a pairwise plot of all the variables in the dataset. Which variable seems to have the highest linear correlation with `Petrol_consumption`? Let this variable be predictor *P*. *Note: If you cannot figure out P by looking at the visualization, you may find the pairwise linear correlation coefficient to identify P.*

*(4 points)*

In [2]:
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
import numpy as np
import matplotlib.pyplot as plt

In [5]:
train = pd.read_csv("petrol_consumption_train.csv")
sns.pairplot(train)
train.corr()

| | Petrol_tax | Per_capita_income | Paved_highways | Prop_license | Petrol_consumption |
|---|---|---|---|---|---|
| Petrol_tax | 1.0 | 0.082359 | -0.660022 | -0.22392 | -0.393415 |
| Per_capita_income | 0.082359 | 1.0 | 0.040256 | 0.048153 | -0.314039 |
| Paved_highways | -0.660022 | 0.040256 | 1.0 | -0.037998 | 0.098117 |
| Prop_license | -0.22392 | 0.048153 | -0.037998 | 1.0 | 0.718303 |
| Petrol_consumption | -0.393415 | -0.314039 | 0.098117 | 0.718303 | 1.0 |


The variable that has the highest correlation with `Petrol_consumption` is `Prop_license`, with a correlation coefficient of 0.718303.

### 
Fit a simple linear regression model to predict `Petrol_consumption` based on predictor *P* (identified in the previous part). Print the model summary.

*(4 points)*

In [8]:
Linear_model= smf.ols(formula= "Petrol_consumption~Prop_license", data=train)
model = Linear_model.fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:     Petrol_consumption   R-squared:                       0.516
Model:                            OLS   Adj. R-squared:                  0.503
Method:                 Least Squares   F-statistic:                     40.51
Date:                Mon, 30 Jan 2023   Prob (F-statistic):           1.80e-07
Time:                        19:41:31   Log-Likelihood:                -231.59
No. Observations:                  40   AIC:                             467.2
Df Residuals:                      38   BIC:                             470.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     -267.6155    132.038     -2.027   

### 
Interpret the coefficient of `Prop_license`. What is the increase in petrol consumption for an increase of 0.05 in *P*?
*(2+2 points)*

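1) The coefficient of `Prop_license` is 1479.1803. It is the slope of the regression line, which describes the linear association between X and Y: a one-unit increase in `Prop_license` is associated with an average increase of 1479.18 million gallons in petrol consumption.

2) A 0.05 increase in *P* is associated with an increase of about 73.96 million gallons in petrol consumption (0.05 × 1479.1803).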
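The arithmetic behind the second part can be checked in one line. The value 1479.1803 is copied from the answer above; the variable names are only illustrative.

```python
slope = 1479.1803  # value quoted in the answer above (millions of gallons per unit of Prop_license)
delta_p = 0.05     # increase in the proportion of licensed drivers

increase = slope * delta_p
print(f"Estimated increase in petrol consumption: {increase:.2f} million gallons")
```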

### 
Does petrol consumption have a statistically significant relationship with the predictor *P*? Justify your answer.

*(4 points)*

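There is a statistically significant relationship, since the p-value for the coefficient of `Prop_license` is below both 0.05 and 0.01. In a simple linear regression this p-value matches the Prob (F-statistic) of the model, which is 1.80e-07.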

### 
What is the R-squared? Interpret its value.

*(4 points)*

The value of R-squared is 0.516, which means that 51.6% of the variation in the dependent variable (`Petrol_consumption`) is explained by the independent variable (`Prop_license`).

### 
Use the model developed above to estimate the petrol consumption for a state in which 50% of the population has a driver’s license. What are the confidence and prediction intervals for your estimate? Which interval includes the irreducible error?

*(4+3+3+2 = 12 points)*

1) Confidence interval= (431.6968, 512.252454)
2) Prediction interval= (302.822725, 641.126528)
3) The interval with the irreducible error is the prediction interval. 

In [9]:
prediction = model.get_prediction(pd.DataFrame({'Prop_license':[0.5]}))
prediction.summary_frame(alpha=0.05)


| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 471.974627 | 19.896237 | 431.6968 | 512.252454 | 302.822725 | 641.126528 |
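The link between the two intervals can be sketched numerically. The prediction interval uses a standard error that adds the residual variance (the irreducible error) to the variance of the estimated mean. The numbers below are copied from the output above, and the residual standard error (81.153...) is the value reported in a later answer of this notebook; the critical t value is backed out of the confidence interval rather than looked up, which is an assumption made to keep the sketch free of extra libraries.

```python
import math

# Values copied from the summary_frame output in this notebook.
mean = 471.974627
mean_se = 19.896237
mean_ci_upper = 512.252454

# Residual standard error, as reported later in this notebook.
rse = 81.15342760294635

# Critical t value implied by the confidence interval.
t_crit = (mean_ci_upper - mean) / mean_se

# Standard error for a single new observation: mean uncertainty plus residual variance.
se_obs = math.sqrt(mean_se**2 + rse**2)

obs_lower = mean - t_crit * se_obs
obs_upper = mean + t_crit * se_obs
print(f"Prediction interval: ({obs_lower:.2f}, {obs_upper:.2f})")
```

The reconstructed bounds should agree closely with `obs_ci_lower` and `obs_ci_upper` in the table, which is consistent with the prediction interval being the one that includes the irreducible error.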


### 
Use the model developed above to estimate the petrol consumption for a state in which 10% of the population has a driver’s license. Are you getting a reasonable estimate? Why or why not?

*(5 points)*

In [None]:
prediction = model.get_prediction(pd.DataFrame({'Prop_license':[0.1]}))
prediction.summary_frame(alpha=0.05)

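No, the estimate is not reasonable: the model predicts a negative petrol consumption (about -119.7 million gallons, based on the coefficients above), and petrol consumption cannot be negative.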
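The sign of the estimate can be verified by hand from the fitted line. The intercept (-267.6155) is taken from the model summary and the slope (1479.1803) from the coefficient answer earlier in this notebook; `predict_consumption` is a hypothetical helper written only for this check.

```python
intercept = -267.6155  # from the model summary
slope = 1479.1803      # value quoted in the coefficient answer

def predict_consumption(prop_license):
    """Point estimate from the fitted simple linear regression line."""
    return intercept + slope * prop_license

estimate_10 = predict_consumption(0.1)
estimate_50 = predict_consumption(0.5)
print(f"Estimate at Prop_license = 0.1: {estimate_10:.2f}")
print(f"Estimate at Prop_license = 0.5: {estimate_50:.2f}")
```

The estimate at 0.5 reproduces the earlier prediction of about 471.97, while the estimate at 0.1 falls below zero.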

### 
What is the residual standard error of the model?

*(4 points)*

In [None]:
np.sqrt(model.mse_resid)

81.15342760294635

### 
Using the model developed above, predict the petrol consumption for the observations in *petrol_consumption_test.csv*. Find the RMSE (Root mean squared error). Include the units of RMSE in your answer.

*(5 points)*

1) RMSE = 80.139039411524012
2) Units: millions of gallons

In [None]:
test= pd.read_csv("petrol_consumption_test.csv")
pred_price = model.predict(test['Prop_license'])
np.sqrt(((test.Petrol_consumption - pred_price)**2).mean())

### 
Based on the answers to the previous two questions, do you think the model is overfitting? Justify your answer.

*(4 points)*

In [None]:
train_pred_price = model.predict(train['Prop_license'])
np.sqrt(((train.Petrol_consumption - train_pred_price)**2).mean())

The model does not appear to be overfitting, since the RMSE on the training data (about 79.10) is close to the RMSE on the test data (about 80.14).
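The training RMSE and the residual standard error are computed from the same sum of squared residuals; they differ only in the divisor (n for the RMSE, n - 2 for the residual standard error of a simple linear regression). A small sketch of that relationship, using the residual standard error and the number of observations reported earlier in this notebook:

```python
import math

rse = 81.15342760294635  # residual standard error reported earlier
n = 40                   # No. Observations in the model summary
n_params = 2             # intercept and slope

# RSE^2 = SSE / (n - n_params) and RMSE^2 = SSE / n, so:
train_rmse = rse * math.sqrt((n - n_params) / n)
print(f"Training RMSE implied by the RSE: {train_rmse:.2f}")
```

The implied value is consistent with the training RMSE computed directly in the cell above.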

Make a scatterplot of `Petrol_consumption` vs `Prop_license` using *petrol_consumption_test.csv*. Over the scatterplot, plot the regression line, the prediction interval, and the confidence interval. Distinguish the regression line, prediction interval lines, and confidence interval lines with the following colors. Include the legend as well.

- Regression line: red
- Confidence interval lines: blue
- Prediction interval lines: green

*(4 points)*

In [23]:
intervals = model.get_prediction(test)
interval_table = intervals.summary_frame(alpha=0.05)

In [1]:
# Requires the imports cell (seaborn as sns, matplotlib.pyplot as plt) to have been run in this session
sns.scatterplot(x = test.Prop_license, y = test.Petrol_consumption, color = 'orange', s = 10)
# Label one line per group so that the legend has exactly one entry for each
sns.lineplot(x = test.Prop_license, y = pred_price, color = 'red', label = 'Regression line')
sns.lineplot(x = test.Prop_license, y = interval_table.mean_ci_lower, color = 'blue', label = 'Confidence interval')
sns.lineplot(x = test.Prop_license, y = interval_table.mean_ci_upper, color = 'blue')
sns.lineplot(x = test.Prop_license, y = interval_table.obs_ci_lower, color = 'green', label = 'Prediction interval')
sns.lineplot(x = test.Prop_license, y = interval_table.obs_ci_upper, color = 'green')
plt.legend()

Among the confidence and prediction intervals, which interval is wider, and why?

*(1+2 points)*

The prediction interval is wider because, in addition to the uncertainty in the estimated regression line, it also includes the irreducible error.

### 
Find the correlation between `Petrol_consumption` and the rest of the variables in *petrol_consumption_train.csv*. Based on the correlations, a simple linear regression model with which predictor will have the least R-squared value for predicting `Petrol_consumption`? Don't develop any linear regression models.

*(4 points)*

In [None]:
train.corrwith(train.Petrol_consumption)

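`Paved_highways` will have the least R-squared. In simple linear regression, R-squared equals the squared correlation between the predictor and the response, and `Paved_highways` has the smallest absolute correlation with `Petrol_consumption` (0.098117).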
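Since the R-squared of a simple linear regression is the square of the correlation, the ranking can be read off without fitting any models. The sketch below squares the correlations with `Petrol_consumption` copied from the correlation table earlier in this notebook; the dictionary names are illustrative.

```python
# Correlations with Petrol_consumption, copied from the correlation table.
corr_with_consumption = {
    "Petrol_tax": -0.393415,
    "Per_capita_income": -0.314039,
    "Paved_highways": 0.098117,
    "Prop_license": 0.718303,
}

# R-squared of each one-predictor model is the squared correlation.
r_squared = {name: r**2 for name, r in corr_with_consumption.items()}

for name, r2 in sorted(r_squared.items(), key=lambda item: item[1]):
    print(f"{name}: R-squared = {r2:.4f}")

least = min(r_squared, key=r_squared.get)
print(f"Predictor with the least R-squared: {least}")
```

The squared correlation for `Prop_license` also rounds to the R-squared of 0.516 shown in the model summary.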

**Bonus point question**

*(5 points - no partial credit)*

### 
Fit a simple linear regression model to predict `Petrol_consumption` based on predictor *P*, but without an intercept term.

*(you must answer this correctly to qualify for earning bonus points)*

In [None]:
Linear_model2= smf.ols(formula= "Petrol_consumption~Prop_license-1", data=train)
model2= Linear_model2.fit()
model2.summary()


### 
Estimate the petrol consumption for the observations in *petrol_consumption_test.csv* using the model developed in the previous question. Find the RMSE.

*(you must answer this correctly to qualify for earning bonus points)*

In [None]:
test= pd.read_csv("petrol_consumption_test.csv")
pred_price2 = model2.predict(test['Prop_license'])
np.sqrt(((test.Petrol_consumption - pred_price2)**2).mean())

### 
The RMSE for the models with and without the intercept are similar, which indicates that both models are almost equally good. However, the R-squared for the model without intercept is much higher than the R-squared for the model with the intercept. Why? Justify your answer.

*(5 points)*

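The model without an intercept is forced to pass through the origin, which means that it is not able to adjust its fit to the data as much as the model with an intercept, so its higher R-squared does not indicate a better fit. The two R-squared values are computed differently. With an intercept, the total sum of squares is measured about the mean of `Petrol_consumption`. Without an intercept, statsmodels measures it about zero (the uncentered sum of the squared responses). Since petrol consumption is far from zero, the uncentered total is much larger, the same residual error is a much smaller share of it, and the R-squared is inflated.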
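The effect of the two definitions of the total sum of squares can be sketched on synthetic data. The data below are randomly generated for illustration only (they are not the petrol dataset), and the no-intercept slope is computed by hand rather than with statsmodels.

```python
import random

random.seed(0)

# Synthetic data whose response values sit far from zero.
x = [0.45 + 0.2 * random.random() for _ in range(40)]
y = [-250 + 1500 * xi + random.gauss(0, 80) for xi in x]

# Least-squares slope for a line forced through the origin.
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
sse = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

y_bar = sum(y) / len(y)
tss_centered = sum((yi - y_bar) ** 2 for yi in y)  # about the mean
tss_uncentered = sum(yi ** 2 for yi in y)          # about zero

r2_centered = 1 - sse / tss_centered
r2_uncentered = 1 - sse / tss_uncentered

print(f"R-squared with centered total sum of squares:   {r2_centered:.3f}")
print(f"R-squared with uncentered total sum of squares: {r2_uncentered:.3f}")
```

The residual error is identical in both cases; only the denominator changes, so the uncentered version is always at least as large.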