---
# <span style="color:pink">DS3000B - DS9000B Midterm Exam</span>

## <span style="color:pink">Student ID #: _________</span>

## <span style="color:pink">Grade: __ / 100</span>

## <span style="color:pink">General Comments</span>

- This exam integrates knowledge and skills acquired in the first half of the term.

- Using AI agents/chatbots such as ChatGPT and Copilot is considered an act of cheating, and you will receive a mark of 0 for the exam.

- You are allowed to use any other resources on your computer or the internet, but you are **not** allowed to share documents, post questions to forums such as Stack Overflow (this includes use of homework helpers such as Chegg), or communicate in any way with people inside or outside the exam room.

- Having any communication tools (*e.g.*, Discord, Teams, Slack, Outlook, etc.), either web-based or app-based, open on your computer (or running in the background) is considered an act of cheating, and you will receive a mark of 0 for the exam.

- To finish the midterm in the allotted time, you will have to work efficiently.

- Please read the entirety of each question carefully.

- You must have your work submitted by 6:30PM to the "Assignments" section of the course's site on OWL, *i.e.*, the same place where you originally downloaded the notebook. Late submissions will receive a mark of 0 unless you have an approved accommodation for submitting late.

- To avoid technical difficulties at the time of submission, please start your submission process at least five minutes before the deadline.

- Some questions demand a **written answer**. Please answer these in full English sentences in a markdown cell right underneath the question.

- For your figures, ensure that all axes are labeled in an informative way.

- At the end, before submitting to OWL, restart the kernel and rerun all cells to make sure that your notebook runs error free and as expected.

## <span style="color:pink">Additional Guidance</span>

- If at any point you are asking yourself "are we supposed to...", write your assumptions clearly and proceed according to them.

- If you have no clue how to approach a question, skip it, and move on. Revisit the skipped one(s) after you are done with other questions.

- Where applicable, take advantage of the argument `n_jobs=-1` to speed up processes with parallel computing.

- To navigate within the notebook, use the notebook's table of contents (normally on the left side of the screen). It saves time compared to scrolling with the mouse. In VS Code, it is nested under the "OUTLINE" tab, which is minimized by default until you click it to expand.

- Please ensure that your results are generated using the provided random seed, where applicable.

---
## <span style="color:orange">Toolbox</span>

In [None]:
from datetime import datetime
import numpy as np
seed = 240229
np.random.seed(seed)
import pandas as pd
pd.set_option('display.max_columns', None)
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, train_test_split
from sklearn.metrics import auc, roc_curve, roc_auc_score, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

---
## Question 1 - <span style="color:red">[70]</span> - Classification
For this question, you will be working with "Data_Q1.csv", which is a dataset on property prices. Below, we provide a detailed description of each variable in the dataset:

|Column Index | Attribute | Description |
| --- | --- | --- |
| 0|Id|Property's identification number|
| 1|MSSubClass|Building class|
| 2|LotArea|Lot size in square feet|
| 3|LandSlope|Slope of property's land|
| 4|HouseStyle|Style of dwelling|
| 5|OverallQual|Overall material and finish quality|
| 6|OverallCond|Overall condition rating|
| 7|YearBuilt|Original construction date|
| 8|MasVnrArea|Masonry veneer walls area in square feet|
| 9|TotalBsmtSF|Total square feet of basement area|
|10|Heating|Type of heating|
|11|CentralAir|Central air conditioning|
|12|1stFlrSF|First Floor square feet|
|13|2ndFlrSF|Second floor square feet|
|14|GrLivArea|Above grade (ground) living area square feet|
|15|FullBath|Full bathrooms above grade|
|16|HalfBath|Half baths above grade|
|17|BedroomAbvGr|Number of bedrooms above basement level|
|18|TotRmsAbvGrd|Total rooms above grade (does not include bathrooms)|
|19|Fireplaces|Number of fireplaces|
|20|GarageCars|Size of garage in car capacity|
|21|GarageArea|Size of garage in square feet|
|22|PavedDrive|Paved driveway|
|23|WoodDeckSF|Wood deck area in square feet|
|24|OpenPorchSF|Open porch area in square feet|
|25|MiscVal|$ value of miscellaneous feature|
|26|YrSold|Year sold|
|27|SalePrice|Sale price in dollars|

### Q1.1 - <span style="color:red">[20]</span> - Data preparation
Load the dataset as a pandas dataframe and perform the following steps:
1. Display its first five rows. <span style="color:green">[2]</span>
2. Print out its number of rows and columns. <span style="color:green">[2]</span>
3. Print out the count of columns for each variable type. For example, if you have a dataframe with 5 columns, of which 2 are `int64` and 3 are `float64`, your printed output will be like: `float64` 3, `int64` 2. <span style="color:green">[2]</span>
4. Print out the count of rows with missing values. Drop those rows from the dataframe, if any. <span style="color:green">[2]</span>
5. Remove the `Id` column from your dataframe. <span style="color:green">[2]</span>
6. Find the age of the properties using the `YearBuilt` variable and replace `YearBuilt` with the new variable `PropertyAge`. For this purpose, use the current year as reference. <span style="color:green">[2]</span>
7. Use the `YrSold` variable to calculate how many years ago each property was sold, and replace `YrSold` with this new column, named `YrsSinceSale`. <span style="color:green">[2]</span>
8. Encode all categorical columns using One-hot encoding. We want to get $k-1$ dummies out of $k$ categorical levels. How many new columns were added to the dataframe? <span style="color:green">[2]</span>
9. Eventually, we want to do a binary classification of properties based on their `SalePrice`. In order to prepare the data for that stage, here we want to bin `SalePrice` based on its median value, *i.e.*, if a property's `SalePrice` is above or equal to the median value of the vector `SalePrice`, the property's `SalePrice` value gets replaced with 1, otherwise 0. <span style="color:green">[2]</span>
10. Report the count of ones and zeros in your updated `SalePrice` attribute. Taking it as the target for classification, will that be a balanced or imbalanced classification problem? <span style="color:green">[2]</span>
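
As a rough illustration of steps 6, 8, and 9 above, here is a minimal sketch on a made-up four-row dataframe. The toy values, the variable names (`toy`, `encoded`, `net_new_cols`), and the choice to detect categorical columns with `select_dtypes(exclude="number")` are assumptions for illustration, not part of the exam data or a required approach. "New columns" is counted here as the net change in column count, which is only one possible reading of step 8.

```python
from datetime import datetime

import pandas as pd

# Toy stand-in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "Heating": ["GasA", "GasW", "GasA", "Wall"],
    "CentralAir": ["Y", "N", "Y", "Y"],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Replace YearBuilt with an age relative to the current year.
toy["PropertyAge"] = datetime.now().year - toy["YearBuilt"]
toy = toy.drop(columns="YearBuilt")

# One-hot encode the non-numeric columns; drop_first=True keeps k-1 dummies per column.
cat_cols = list(toy.select_dtypes(exclude="number").columns)
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
net_new_cols = encoded.shape[1] - toy.shape[1]

# Binarize the target: 1 if at or above the median, otherwise 0.
median_price = encoded["SalePrice"].median()
encoded["SalePrice"] = (encoded["SalePrice"] >= median_price).astype(int)

print(net_new_cols)
print(encoded["SalePrice"].value_counts())
```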

In [None]:
#

### Q1.2 - <span style="color:red">[8]</span> - Data Splitting
In the previous question, you converted `SalePrice` to a discrete variable. Separate it from the rest of the attributes to use it as the target variable for your machine learning model. Split your preprocessed dataset in a stratified and shuffled fashion, setting aside 30 percent of the data for testing and the rest for training. Make sure to use the provided random seed for this purpose. Then, print out the class distribution for both `ytrain` and `ytest`.
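
A stratified, shuffled 70/30 split could look like the sketch below. It uses synthetic data in place of the preprocessed dataset; the arrays `X` and `y` are hypothetical stand-ins, and only the seed value comes from the Toolbox cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 240229  # the seed provided in the Toolbox cell

# Synthetic stand-in: 100 rows, 3 features, and a balanced binary target.
rng = np.random.default_rng(seed)
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1], 50)

# stratify=y keeps the class proportions the same in both parts.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, stratify=y, shuffle=True, random_state=seed
)

print("train class counts:", np.bincount(ytrain))
print("test class counts:", np.bincount(ytest))
```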

In [None]:
#

### Q1.3 - <span style="color:red">[14]</span> - Classifier Model Training and Selection

Using `sklearn.linear_model.LogisticRegression` do the following steps:
1. Instantiate two different Logistic Regression models, namely, "model1" and "model2". Both use a `max_iter` of $20000$ and the `liblinear` solver. As for the `penalty` argument, "model1" and "model2" use `l1` and `l2`, respectively.
2. With the area under the Receiver Operating Characteristic curve as your scorer, perform 5-fold stratified and shuffled cross-validation to report the CV score of both models. Choose the best model among the two and train it.
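
The cross-validation step might be sketched as follows, assuming synthetic data from `make_classification` in place of the training set. The dictionary-based bookkeeping (`models`, `cv_scores`) is just one convenient layout, and newer scikit-learn releases may emit a deprecation warning for the `penalty` argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

seed = 240229  # the seed provided in the Toolbox cell

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=seed)

# 5-fold stratified CV with shuffling, made reproducible by the seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

models = {
    "model1": LogisticRegression(penalty="l1", solver="liblinear", max_iter=20000),
    "model2": LogisticRegression(penalty="l2", solver="liblinear", max_iter=20000),
}

# Mean ROC AUC across the folds for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}
print(cv_scores)

# Keep the model with the higher mean score and fit it on all training data.
best_name = max(cv_scores, key=cv_scores.get)
best_model = models[best_name].fit(X, y)
```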

In [None]:
#

### Q1.4 - <span style="color:red">[6]</span> - Evaluate Winner Model on Test Set
Generally, Receiver Operating Characteristic (ROC) curves should be used when there are roughly equal numbers of observations for each class. Precision-Recall (PR) curves should be used when there is a moderate to large class imbalance. Given this information, choose the right type of curve for your winner model and plot it for both the training and test sets. Also report the AUC for both sets.

In [None]:
#

### Q1.5 - <span style="color:red">[8]</span> - Fine-Tuning
The default threshold value in scikit-learn's Logistic Regression is 0.5. The stakeholders tell you that the maximum false positive rate (FPR) this project can tolerate is 0.2. Based on this information, choose the threshold value that leads to the highest recall subject to $FPR \leq 0.2$. Use the training set to find this threshold value. What is your new threshold?
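
One way to read this constraint is to scan the points returned by `roc_curve` and keep the one with the highest true positive rate (recall) among those with $FPR \leq 0.2$. The sketch below does that on hypothetical labels and probabilities; `y_true`, `y_prob`, and `max_fpr` are illustrative names, and `drop_intermediate=False` is an assumption made so that every candidate threshold is considered.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.45, 0.70, 0.35, 0.55, 0.60, 0.80, 0.90])

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_prob, drop_intermediate=False)

# Among thresholds that respect the FPR cap, pick the one with the highest recall.
max_fpr = 0.2
allowed = fpr <= max_fpr
best = int(np.argmax(np.where(allowed, tpr, -1.0)))
new_threshold = thresholds[best]

print(new_threshold, fpr[best], tpr[best])

# Applying the threshold by hand instead of calling .predict().
y_pred = (y_prob >= new_threshold).astype(int)
```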

In [None]:
#

### Q1.6 - <span style="color:red">[6]</span> - Evaluation of Fine-Tuned Model: Accuracy Score
Report accuracy scores of the model based on the new threshold (*i.e.*, found in the previous question) for both the training and test sets.

In [None]:
#

### Q1.7 - <span style="color:red">[8]</span> - Evaluation of Fine-Tuned Model: Confusion Matrix
Report the confusion matrix over the test set for both the default and new thresholds. Going from the default to the new threshold, by what percentage did the sum of false negatives and false positives change? Did this sum decrease or increase?
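
The percentage change can be computed as $100 \times (E_{new} - E_{default}) / E_{default}$, where $E$ is the sum of false positives and false negatives. A small sketch on hypothetical values follows; the helper `error_count`, the data, and the "new" threshold of 0.4 are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.45, 0.70, 0.35, 0.55, 0.60, 0.80, 0.90])


def error_count(y_true, y_prob, threshold):
    """Return FP + FN when predicting 1 for probabilities at or above threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return int(fp + fn)


errors_default = error_count(y_true, y_prob, 0.5)
errors_new = error_count(y_true, y_prob, 0.4)  # hypothetical tuned threshold

# Positive value means the number of misclassifications increased.
pct_change = 100 * (errors_new - errors_default) / errors_default
print(errors_default, errors_new, pct_change)
```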

In [None]:
#

---
## Question 2 - <span style="color:red">[30]</span> - Uncertainty Quantification
We have an apparatus which detects and records the amplitude of certain input signals. We have done repeated experiments with this device and have recorded the measured amplitudes of each input signal in "Data_Q2.csv", which has the following attributes:

|Column Index | Attribute | Description |
| --- | --- | --- |
| 0|Signal|Input physical signal|
| 1|Amplitude|Measured amplitude of input signal|

### Q2.1 - <span style="color:red">[20]</span> - Bootstrap
Using a confidence level of $97\%$, and a resampling number of $5000$, compute and report the **bootstrap** confidence interval (CI) of the **median** amplitude for each input signal.

Feel free to use `scipy.stats.bootstrap` to compute the bootstrap confidence intervals.
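
A per-signal bootstrap of the median could be sketched as below, using synthetic measurements in place of "Data_Q2.csv". The signal names, the sample sizes, and the choice of `method="percentile"` are assumptions; `scipy.stats.bootstrap` defaults to the BCa method, so state whichever method you use.

```python
import numpy as np
import pandas as pd
from scipy.stats import bootstrap

seed = 240229  # the seed provided in the Toolbox cell
rng = np.random.default_rng(seed)

# Synthetic stand-in for the measurements; signal names are made up.
toy = pd.DataFrame({
    "Signal": np.repeat(["A", "B"], 40),
    "Amplitude": np.concatenate([
        rng.normal(1.0, 0.1, 40),
        rng.normal(2.0, 0.5, 40),
    ]),
})

rows = []
for signal, group in toy.groupby("Signal"):
    amplitudes = group["Amplitude"].to_numpy()
    # The data argument is a sequence of samples, hence the one-element tuple.
    res = bootstrap(
        (amplitudes,),
        np.median,
        confidence_level=0.97,
        n_resamples=5000,
        method="percentile",
        random_state=seed,
    )
    rows.append({
        "Signal": signal,
        "Median": float(np.median(amplitudes)),
        "CI_low": float(res.confidence_interval.low),
        "CI_high": float(res.confidence_interval.high),
    })

ci_table = pd.DataFrame(rows)
print(ci_table)
```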

In [None]:
#

### Q2.2 - <span style="color:red">[10]</span> - CI Plot
Plot the median of amplitudes for each signal (*i.e.*, x-axis for signal and y-axis for the median of signal amplitude). Your plot must also show the bootstrap CIs' upper and lower bounds. Based on your calculated CIs, for which signal is the apparatus least certain?
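
One possible layout is an error-bar plot in which the bars span each CI. The sketch below uses hypothetical medians and bounds (not results from the data) and takes "least certain" to mean the widest interval, which is an interpretation you should state if you adopt it.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical medians and CI bounds, for illustration only.
signals = ["A", "B", "C"]
medians = np.array([1.0, 2.0, 1.5])
ci_low = np.array([0.9, 1.6, 1.4])
ci_high = np.array([1.1, 2.5, 1.7])

# errorbar expects non-negative offsets below and above each point.
yerr = np.vstack([medians - ci_low, ci_high - medians])

fig, ax = plt.subplots()
ax.errorbar(signals, medians, yerr=yerr, fmt="o", capsize=4)
ax.set_xlabel("Signal")
ax.set_ylabel("Median amplitude (97% bootstrap CI)")

# The widest interval marks the signal with the most uncertain median.
widest = signals[int(np.argmax(ci_high - ci_low))]
print(widest)
```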

In [None]:
#

---
$$ The\;End $$