## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Do not write your name on the assignment.

3. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

4. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

5. The assignment is worth 100 points, and is due on **Thursday, 26th January 2023 at 11:59 pm**. 

6. **Five points are for properly formatting the assignment**. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 pt). *If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.* 
- No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission  (1 pt)
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

## Multiple linear regression

A study was conducted on 97 men with prostate cancer who were due to receive a radical prostatectomy. The dataset *prostate.csv* contains data on 9 measurements made on these 97 men. The description of the variables can be found [here](https://rafalab.github.io/pages/649/prostate.html).

### Training MLR
Fit a linear regression model with `lpsa` as the response and all the other variables as predictors. Write down the equation to predict `lpsa` based on the other eight variables.

*(2+2 points)*
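
As a rough illustration of the mechanics only, the sketch below fits a multiple linear regression with the statsmodels formula API and assembles the prediction equation from the estimated coefficients. The data frame `toy` and its columns `x1`, `x2`, and `y` are synthetic stand-ins invented for this example, not the prostate data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption: NOT the prostate dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(scale=0.1, size=n)

# Fit y on both predictors.
model = smf.ols("y ~ x1 + x2", data=toy).fit()

# Build a readable prediction equation from the fitted coefficients.
terms = " ".join(
    f"{coef:+.3f}*{name}" for name, coef in model.params.items() if name != "Intercept"
)
equation = f"y_hat = {model.params['Intercept']:.3f} {terms}"
print(equation)
```

Calling `model.summary()` on the fitted object prints the full regression table, including the overall $F$-statistic and the per-coefficient $p$-values.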

### Model significance
Is the overall regression significant at 5% level? Justify your answer.

*(2 points)*

### Coefficient interpretation
Interpret the coefficient of `svi`.

*(2 points)*

### Variable significance
Report the $p$-values for `gleason` and `age`. What do you conclude about the significance of these variables?

*(2+2 points)*

### Variable significance from confidence interval
What is the 95% confidence interval for the coefficient of `age`? Can you conclude anything about its significance based on the confidence interval?

*(2+2 points)*
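
The link between a coefficient's 95% confidence interval and its $p$-value at the 5% level can be checked numerically. The sketch below uses synthetic data (the columns `x1`, `noise`, and `y` are made up for illustration) and tabulates, for each coefficient, the interval bounds, the $p$-value, and whether the interval excludes zero.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: x1 affects y, "noise" does not.
rng = np.random.default_rng(1)
n = 150
toy = pd.DataFrame({"x1": rng.normal(size=n), "noise": rng.normal(size=n)})
toy["y"] = 0.5 + 1.5 * toy["x1"] + rng.normal(scale=1.0, size=n)

model = smf.ols("y ~ x1 + noise", data=toy).fit()

# conf_int returns one row per coefficient with the lower and upper bounds.
ci = model.conf_int(alpha=0.05)
ci.columns = ["lower", "upper"]

report = ci.assign(p_value=model.pvalues)
report["excludes_zero"] = (report["lower"] > 0) | (report["upper"] < 0)
print(report)
```

For the usual $t$-based inference, a 95% interval that excludes zero corresponds to a $p$-value below 0.05, and the two columns of the table should agree row by row.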

### $p$-value
Fit a simple linear regression on `lpsa` against `gleason`. What is the $p$-value for `gleason`?

*(1+1 points)*

### Predictor significance in presence / absence of other predictors
Is the predictor `gleason` statistically significant in the model developed in the previous question *(B.1.6)*? 

Was `gleason` statistically significant in the model developed in the first question *(B.1.1)* with multiple predictors?

Did the statistical significance of `gleason` change in the absence of other predictors? Why or why not?

*(1+1+4 points)*

### Prediction
Predict `lpsa` of a 65-year-old man with `lcavol` = 1.35, `lweight` = 3.65, `lbph` = 0.1, `svi` = 0.22, `lcp` = -0.18, `gleason` = 6.75, and `pgg45` = 25, and find the 95% prediction interval.

*(2 points)*
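
One way to obtain a point prediction together with a prediction interval in statsmodels is `get_prediction()` followed by `summary_frame()`. The sketch below assumes a toy model on synthetic columns `x1` and `x2` (invented for this example); in the resulting frame, the `obs_ci_*` columns bound a new observation, while the `mean_ci_*` columns bound the mean response.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption: not the assignment data).
rng = np.random.default_rng(2)
n = 200
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(scale=0.5, size=n)

model = smf.ols("y ~ x1 + x2", data=toy).fit()

# One new observation, with the same column names as the training data.
new = pd.DataFrame({"x1": [0.3], "x2": [-1.2]})

# alpha=0.05 gives 95% intervals.
frame = model.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])
```

The prediction interval is always wider than the confidence interval for the mean, because it also accounts for the noise in an individual observation.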

### Variable selection
Find the largest subset of predictors in the model developed in the first question *(B.1.1)* for which the hypothesis that all of their coefficients are zero cannot be rejected, i.e., the predictors in the subset are jointly statistically insignificant.

Does the model $R$-squared change a lot if you remove the set of predictors identified above from the model in the first question *(B.1.1)*?

**Hint:** You may use the `f_test()` method to test hypotheses.

*(4+1 points)*
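
A minimal sketch of how `f_test()` can test a joint hypothesis on several coefficients is given below. It assumes synthetic data in which only `x1` truly affects `y` (the names `x1`, `x2`, `x3`, and `y` are hypothetical), tests whether the coefficients of `x2` and `x3` are both zero, and then compares $R$-squared with and without those predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: only x1 truly affects y; x2 and x3 are pure noise.
rng = np.random.default_rng(3)
n = 200
toy = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
toy["y"] = 2.0 + 1.5 * toy["x1"] + rng.normal(scale=1.0, size=n)

full = smf.ols("y ~ x1 + x2 + x3", data=toy).fit()

# Joint null hypothesis: the coefficients of x2 and x3 are both zero.
joint = full.f_test("x2 = 0, x3 = 0")
joint_p = np.asarray(joint.pvalue).item()
print("joint p-value:", joint_p)

# Refit without the candidate subset and compare R-squared.
reduced = smf.ols("y ~ x1", data=toy).fit()
print("R-squared full:", full.rsquared, "reduced:", reduced.rsquared)
```

A large joint $p$-value means the data do not contradict dropping the whole subset at once, which is a stronger statement than each predictor being individually insignificant.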

## Using MLR coefficients and variable transformation

The dataset *infmort.csv* gives the infant mortality of different countries in the world. The column `mortality` contains the infant mortality in deaths per 1000 births.

### Data visualisation
Make the following plots:

1. a boxplot of log(`mortality`) against `region` *(note that a plot of log(`mortality`) against `region` distinguishes the mortality among regions better than a plot of `mortality` against `region`)*, 

2. a boxplot of `income` against `region`, and 

3. a scatter plot of `mortality` against `income`. 

What trends do you see in these plots? *Mention the trend separately for each plot.*

*(3+2 points)*

### Removing effect of predictor from response
Europe seems to have the lowest infant mortality, but it also has the highest per capita annual income. We want to see whether Europe still has the lowest mortality once the effect of income is removed from the mortality. We will answer this question with the following steps.

#### Variable transformation
Plot: 

1. `mortality` against `income`, 

2. log(`mortality`) against `income`,

3. `mortality` against log(`income`), and 

4. log(`mortality`) against log(`income`). 

Based on the plots, postulate an appropriate model to predict mortality as a function of income. *Print the model summary.*

*(2+4 points)*
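
To illustrate how log transformations can straighten a curved relationship, the sketch below generates synthetic power-law data (the columns `x` and `y` are invented, not taken from *infmort.csv*) and fits the four combinations of raw and logged variables. The log columns are precomputed so that the formulas stay simple.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic power-law data: y = 100 * x^(-0.5) with multiplicative noise.
rng = np.random.default_rng(4)
n = 200
toy = pd.DataFrame({"x": np.exp(rng.uniform(0, 5, size=n))})
toy["y"] = 100.0 * toy["x"] ** -0.5 * np.exp(rng.normal(scale=0.1, size=n))
toy["log_x"] = np.log(toy["x"])
toy["log_y"] = np.log(toy["y"])

# Fit each of the four candidate specifications.
formulas = ["y ~ x", "log_y ~ x", "y ~ log_x", "log_y ~ log_x"]
fits = {formula: smf.ols(formula, data=toy).fit() for formula in formulas}

for formula, fit in fits.items():
    print(f"{formula:>15}: R-squared = {fit.rsquared:.3f}")
```

$R$-squared values of models with different responses (`y` versus `log_y`) are not strictly comparable, so the scatter plots remain the main guide when choosing the specification.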

#### Model update
Update the model developed in the previous question by adding `region` as a predictor. Print the model summary.

*(2 points)*

Use the model developed in the previous question to compute `adjusted_mortality` for each observation in the data, where adjusted mortality is the mortality after removing the estimated effect of income. Make a boxplot of log(`adjusted_mortality`) against `region`.

*(4+2 points)*
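
One possible reading of "removing the estimated effect of a predictor" is to subtract the fitted contribution of that predictor from the response. The sketch below follows that reading on synthetic data; the columns `group`, `log_x`, and `log_y` and all coefficient values are assumptions made for illustration, and other definitions of the adjustment may be equally valid.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: group B has a HIGHER baseline response, but also a much
# higher log_x, which lowers the response and hides the baseline difference.
rng = np.random.default_rng(5)
n = 300
toy = pd.DataFrame({"group": rng.choice(["A", "B"], size=n)})
toy["log_x"] = rng.normal(loc=np.where(toy["group"] == "B", 3.0, 1.0), scale=0.5)
toy["log_y"] = (
    3.0 - 0.5 * toy["log_x"] + 0.4 * (toy["group"] == "B")
    + rng.normal(scale=0.05, size=n)
)

model = smf.ols("log_y ~ log_x + C(group)", data=toy).fit()

# Subtract the estimated contribution of log_x from the response.
toy["adj_log_y"] = toy["log_y"] - model.params["log_x"] * toy["log_x"]

raw_means = toy.groupby("group")["log_y"].mean()
adj_means = toy.groupby("group")["adj_log_y"].mean()
print(pd.DataFrame({"raw": raw_means, "adjusted": adj_means}))
```

In this toy example the ordering of the two groups flips after the adjustment, which is the kind of change a boxplot of the adjusted response against the grouping variable can reveal.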

### Data visualisation after removing effect of predictor from response
From the plot in the previous question: 

1. Does Europe still seem to have the lowest mortality as compared to other regions after removing the effect of income from mortality? 

2. After adjusting for income, is there any change in the mortality comparison among different regions? Compare the plot developed in the previous question to the plot of `log(mortality)` against `region` developed earlier *(B.2.1)* to answer this question.

**Hint:** Do any African / Asian / American countries seem to do better than all the European countries with regard to mortality after adjusting for income? 

*(1+3 points)*

## Variable transformations and interactions

The dataset *soc_ind.csv* contains the GDP per capita of some countries along with several social indicators.

### Training SLR
For a simple linear regression model predicting `gdpPerCapita`, which predictor provides the best model fit *(ignore categorical predictors)*? Let that predictor be $P$.

*(2 points)*

### Linearity in relationship
Make a scatterplot of `gdpPerCapita` vs $P$. Does the relationship between `gdpPerCapita` and $P$ seem linear or non-linear?

*(1 + 2 points)*

### Variable transformation
If the relationship identified in the previous question is non-linear, identify and include transformation(s) of the predictor $P$ in the model to improve the model fit. 

Mention the predictors of the transformed model, and report the change in the $R$-squared value of the transformed model as compared to the simple linear regression model with only $P$.

*(4+4 points)*
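
As a hedged illustration of adding a transformed predictor, the sketch below compares a straight-line fit with a fit that also includes a squared term, written with `I()` in the formula. The quadratic form and the columns `x` and `y` are assumptions for this synthetic example; the appropriate transformation for $P$ depends on the shape seen in the scatterplot.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a curved (quadratic) relationship.
rng = np.random.default_rng(6)
n = 200
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=n)})
toy["y"] = 1.0 + 0.5 * toy["x"] ** 2 + rng.normal(scale=2.0, size=n)

# Straight-line model versus a model with an added squared term.
linear = smf.ols("y ~ x", data=toy).fit()
quadratic = smf.ols("y ~ x + I(x**2)", data=toy).fit()

r2_gain = quadratic.rsquared - linear.rsquared
print(f"linear R-squared:    {linear.rsquared:.3f}")
print(f"quadratic R-squared: {quadratic.rsquared:.3f}")
print(f"gain:                {r2_gain:.3f}")
```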

### Model visualisation with transformed predictor
Plot the regression curve of the transformed model *(developed in the previous question)* over the scatterplot of `gdpPerCapita` vs $P$ made earlier to visualize model fit. Also plot the regression line of the simple linear regression model with only $P$ on the same plot.

*(3 + 1 points)*

### Training MLR with qualitative predictor
Develop a model to predict `gdpPerCapita` with $P$ and `continent` as predictors. 

1. Interpret the intercept term. 

2. For a given value of $P$, are there any continents that **do not** have a significant difference between their mean `gdpPerCapita` and that of Africa? If yes, then which ones, and why? If no, then why not? Consider a significance level of 5%.

*(4 + 4 points)*

### Variable interaction
The model developed in the previous question has a limitation. It assumes that the increase in mean `gdpPerCapita` with a unit increase in $P$ does not depend on the `continent`. 

1. Eliminate this limitation by including interaction of `continent` with $P$ in the model developed in the previous question. Print the model summary of the model with interactions.

2. Interpret the coefficient of any one of the interaction terms.

*(4 + 4 points)*
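
The sketch below shows, on synthetic data, how an interaction between a numeric predictor and a categorical one lets the slope differ by category. The columns `x`, `group`, and `y` are hypothetical; the interaction coefficient is located by searching for `:` in the parameter names rather than by hard-coding its label.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: the slope of y on x is 1 in group A and 3 in group B.
rng = np.random.default_rng(7)
n = 300
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, size=n),
    "group": rng.choice(["A", "B"], size=n),
})
is_b = (toy["group"] == "B").astype(float)
toy["y"] = (
    2.0 + 1.0 * toy["x"] + 4.0 * is_b + 2.0 * is_b * toy["x"]
    + rng.normal(scale=1.0, size=n)
)

# "x * C(group)" expands to x, C(group), and their interaction.
model = smf.ols("y ~ x * C(group)", data=toy).fit()

# The interaction coefficient is the EXTRA slope of group B relative to A.
interaction_name = [name for name in model.params.index if ":" in name][0]
slope_a = model.params["x"]
slope_b = slope_a + model.params[interaction_name]
print(f"slope in A: {slope_a:.2f}, slope in B: {slope_b:.2f}")
```

Here group A is the reference level, so the coefficient of `x` is the slope for A and the interaction coefficient is the difference in slopes between B and A.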

### Model visualisation with qualitative predictor
Use the model developed in the previous question to plot the regression lines for Africa, Asia, and Europe. Put `gdpPerCapita` on the vertical axis and $P$ on the horizontal axis. Use a legend to distinguish among the regression lines of the three continents.

*(4 points)*
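
A rough sketch of drawing one fitted line per category from an interaction model is shown below. It assumes synthetic data with hypothetical columns `x`, `group`, and `y`, predicts over a grid of `x` values for each group, and uses the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a different slope in each of three groups.
rng = np.random.default_rng(8)
n = 300
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, size=n),
    "group": rng.choice(["A", "B", "C"], size=n),
})
true_slopes = toy["group"].map({"A": 1.0, "B": 2.0, "C": 3.0})
toy["y"] = 5.0 + true_slopes * toy["x"] + rng.normal(scale=1.0, size=n)

model = smf.ols("y ~ x * C(group)", data=toy).fit()

# Predict over a grid of x values, one group at a time.
fig, ax = plt.subplots()
grid = np.linspace(toy["x"].min(), toy["x"].max(), 50)
for g in ["A", "B", "C"]:
    new = pd.DataFrame({"x": grid, "group": g})
    ax.plot(grid, model.predict(new), label=g)

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(title="group")
```

The steepest line in such a plot belongs to the category with the largest increase in the mean response per unit increase in the predictor.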

### Model interpretation
Based on the plot developed in the previous question, which continent has the highest increase in mean `gdpPerCapita` for a unit increase in $P$, and which one has the least? Justify your answer.

*(2+2 points)*