## W3 & W4 post-studio exercises (errors, model fitting)

Enter your solution in the cell(s) below each exercise. Add a couple of inline comments explaining your code. Don't forget to add comments in a markdown cell after each exercise. Missing comments (in markdown cells and/or inline) and late submissions will incur penalties.

Once done, drag and drop your Python file into your ADS1002-name GitHub account.

Copy the URL of this file on GitHub to the appropriate folder on Moodle by 9:30 am before your next studio.

Solutions will be released later in the semester.

Max 10 marks: 2.5 marks per exercise.

***
We will use 

* [who-health-data.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/who-health-data.csv)

* [kaggle-wisconsin-cancer.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/kaggle-wisconsin-cancer.csv)

throughout the exercises. Download the datasets into the same directory as your post-studio notebook.

In [3]:
import pandas as pd
who_data_2015 = (
    pd.read_csv("who-health-data.csv") # Read in the csv data.
    .rename(columns=lambda c: c.strip())      # Clean up column names.
    .query("Year == 2015")                    # Restrict the dataset to records from 2015.
    # Removes two columns which contain a lot of missing data...
    .drop(columns=["Alcohol", "Total expenditure"])
    # ... then drop any rows with missing values.
    .dropna()
)

wisconsin_cancer_biopsies = (
    pd.read_csv("kaggle-wisconsin-cancer.csv")
    # This tidies up the naming of results (M -> malignant, B -> benign)
    .assign(diagnosis=lambda df: df['diagnosis']  
        .map({"M": "malignant", "B": "benign"})
        .astype('category')
    )
)

### Exercise 1

Given the dataframe `ex1_who_with_predictions` below, compute the Mean Absolute Error for the predicted values of life expectancy. You can repeat the process previously shown, or find a function in `sklearn.metrics` to compute this for you.

In [4]:
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Assuming who_data_2015 is already defined as per the given code

ex1_who_with_predictions = (
    who_data_2015[["Schooling", "Life expectancy"]]
    .assign(Predicted=lambda df: df["Schooling"] * 2.3 + 43)
    .dropna()
)

# Calculate the Mean Absolute Error (MAE)
mae = mean_absolute_error(
    ex1_who_with_predictions["Life expectancy"],
    ex1_who_with_predictions["Predicted"]
)

print("Mean Absolute Error:", mae)


Mean Absolute Error: 3.790230769230769
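
The exercise also allows computing the error by hand. As a minimal sketch of that route, the snippet below computes MAE manually and checks it against `sklearn.metrics.mean_absolute_error`. The `toy` table is made up for illustration (it is not taken from the WHO data) and only reuses the column names from above.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Toy data for illustration only; these are not rows from the WHO dataset.
toy = pd.DataFrame({
    "Life expectancy": [70.0, 65.0, 80.0, 75.0],
    "Predicted": [68.0, 66.0, 77.0, 79.0],
})

# MAE by hand: take the absolute residuals, then average them.
manual_mae = (toy["Life expectancy"] - toy["Predicted"]).abs().mean()

# The library function should agree with the manual calculation.
sklearn_mae = mean_absolute_error(toy["Life expectancy"], toy["Predicted"])

print(manual_mae, sklearn_mae)
```

The absolute residuals here are 2, 1, 3 and 4, so both values come out at 2.5.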


### Exercise 2

Given the classification predictions and actual results in the dataframe `ex2_biopsies_with_predictions` below, compute accuracy, precision and recall. Also find the number of false negatives.

In [5]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Assuming wisconsin_cancer_biopsies is already defined as per the given code

ex2_biopsies_with_predictions = (
    wisconsin_cancer_biopsies
    .assign(prediction=lambda df: df['texture_mean'].lt(20)
        .map({True: "benign", False: "malignant"})
    )
    [['radius_mean', 'texture_mean', 'diagnosis', 'prediction']]
)

# Compute accuracy
accuracy = accuracy_score(
    ex2_biopsies_with_predictions['diagnosis'],
    ex2_biopsies_with_predictions['prediction']
)

# Compute precision
precision = precision_score(
    ex2_biopsies_with_predictions['diagnosis'],
    ex2_biopsies_with_predictions['prediction'],
    pos_label="malignant"
)

# Compute recall
recall = recall_score(
    ex2_biopsies_with_predictions['diagnosis'],
    ex2_biopsies_with_predictions['prediction'],
    pos_label="malignant"
)

# Compute the confusion matrix
conf_matrix = confusion_matrix(
    ex2_biopsies_with_predictions['diagnosis'],
    ex2_biopsies_with_predictions['prediction'],
    labels=["malignant", "benign"]
)

# Number of false negatives is in the first row, second column of the confusion matrix
false_negatives = conf_matrix[0, 1]

# Print the results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Number of False Negatives:", false_negatives)


Accuracy: 0.7311072056239016
Precision: 0.6311111111111111
Recall: 0.6698113207547169
Number of False Negatives: 70
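
To make the link between the confusion matrix and the three metrics explicit, here is a small sketch that recomputes them from raw counts. The four counts are back-derived from the printed results above under the assumption that the dataset has 569 rows (212 malignant, 357 benign); they were not read directly from `conf_matrix`, so treat them as illustrative.

```python
# Counts back-derived from the printed metrics, ASSUMING 569 rows in total
# (212 malignant, 357 benign). "malignant" is treated as the positive class.
tp = 142   # malignant, predicted malignant
fn = 70    # malignant, predicted benign (the printed false-negative count)
fp = 83    # benign, predicted malignant
tn = 274   # benign, predicted benign

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of malignant predictions that are right
recall = tp / (tp + fn)                      # share of malignant cases that are caught

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

With `labels=["malignant", "benign"]`, these counts correspond to `conf_matrix[0, 0]`, `conf_matrix[0, 1]`, `conf_matrix[1, 0]` and `conf_matrix[1, 1]` respectively.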


### Exercise 3

Consider three different predictors for the cancer biopsy screening dataset:

* Predictor A has an accuracy of 0.95, and recall of 0.99
* Predictor B has an accuracy of 0.99, and recall of 0.95
* Predictor C has an accuracy of 0.5, and a recall of 1.0

The test required to collect data from a new patient (on which the predictor will give a predicted diagnosis) is minimally invasive. If the predictor predicts a positive (malignant) diagnosis, the patient will be referred for further screening which can be expensive.

Considering the context, which predictive model (A, B, or C) would likely be preferred for this task? Write your answer in a markdown cell below, and give a brief explanation of your reasoning.

### Preferred Predictive Model: Predictor B

#### Reasoning:
In cancer biopsy screening, the goal is to catch as many malignant cases (true positives) as possible while keeping the number of false positives low, since every positive prediction triggers expensive follow-up screening.

- **Predictor A** has a very high recall (0.99), meaning it correctly identifies almost all malignant cases. However, its lower accuracy (0.95) suggests that it may produce more false positives.

- **Predictor B** balances a very high accuracy (0.99) with a reasonably high recall (0.95). It is likely to identify most malignant cases while producing fewer false positives, which reduces unnecessary follow-up costs.

- **Predictor C** has perfect recall (1.0) but very low accuracy (0.5), which suggests it predicts almost every case as malignant and so produces a large number of false positives.
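
The trade-off can be made concrete by converting accuracy and recall into expected counts. The sketch below does this for a hypothetical cohort; `n_patients=10_000` and `prevalence=0.01` are assumed values chosen purely for illustration (the exercise does not state a prevalence), and `screening_outcomes` is a helper name introduced here.

```python
def screening_outcomes(accuracy, recall, n_patients=10_000, prevalence=0.01):
    """Turn accuracy and recall into expected counts for a hypothetical cohort.

    n_patients and prevalence are ASSUMED values for illustration only;
    they do not come from the exercise or the biopsy dataset.
    """
    positives = n_patients * prevalence          # truly malignant
    negatives = n_patients - positives           # truly benign
    true_pos = recall * positives
    false_neg = positives - true_pos             # missed malignant cases
    true_neg = accuracy * n_patients - true_pos  # correct predictions = TP + TN
    false_pos = negatives - true_neg             # unnecessary referrals
    return {"missed": round(false_neg), "referred_unnecessarily": round(false_pos)}

# (accuracy, recall) pairs as given in the exercise.
predictors = {"A": (0.95, 0.99), "B": (0.99, 0.95), "C": (0.5, 1.0)}
results = {name: screening_outcomes(acc, rec) for name, (acc, rec) in predictors.items()}

for name, outcome in results.items():
    print(name, outcome)
```

Under these assumed numbers, A misses 1 malignant case and refers 499 healthy patients, B misses 5 and refers 95, and C misses none but refers 5000. A different prevalence would shift these counts, so the comparison is only indicative.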



### Exercise 4

Choose one different input/feature variable (other than Schooling) and fit a linear regression model to predict Life Expectancy using sklearn. Can you achieve a better error rate than what we found in the pre-studio notebook? (RMSE and MAE for Schooling were 4.71 and 3.69, respectively.) Suggest a method to narrow down your choices of variables to use in order to arrive at a good model.

Hint 1: Correlation.

Hint 2: You can use the functions written in the pre-studio notebook, e.g. prediction_root_mean_squared_error(gradient, intercept), to calculate the model error once you choose your model parameters (features).

In [8]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Step 1: Load the data (assuming who_data_2015 is already defined)

# Exclude non-numeric columns before calculating correlations
numeric_data = who_data_2015.select_dtypes(include=[np.number])

# Calculate correlations with Life Expectancy
correlations = numeric_data.corr()['Life expectancy'].drop('Life expectancy')
print("Correlations with Life Expectancy:\n", correlations)

# Choose the feature with the highest absolute correlation, excluding Schooling
# (the exercise asks for a feature other than Schooling).
best_feature = correlations.drop('Schooling').abs().idxmax()
print("Best feature for prediction (highest correlation):", best_feature)

# Step 2: Fit the Linear Regression Model using the best feature
# who_data_2015 already has rows with missing values dropped, so X and y stay aligned.
X = who_data_2015[[best_feature]]  # Feature matrix
y = who_data_2015['Life expectancy']  # Target variable

model = LinearRegression()
model.fit(X, y)

# Predict Life Expectancy
predictions = model.predict(X)

# Step 3: Calculate RMSE and MAE
rmse = np.sqrt(mean_squared_error(y, predictions))
mae = mean_absolute_error(y, predictions)

print("RMSE:", rmse)
print("MAE:", mae)


Correlations with Life Expectancy:
 Year                                    NaN
Adult Mortality                   -0.731215
infant deaths                     -0.209304
percentage expenditure             0.064494
Hepatitis B                        0.372109
Measles                           -0.049305
BMI                                0.544987
under-five deaths                 -0.241013
Polio                              0.493438
Diphtheria                         0.466223
HIV/AIDS                          -0.620511
GDP                                0.487018
Population                        -0.027594
thinness  1-19 years              -0.459153
thinness 5-9 years                -0.454897
Income composition of resources    0.898059
Schooling                          0.806074
Name: Life expectancy, dtype: float64
Best feature for prediction (highest correlation): Income composition of resources
RMSE: 3.504293642534378
MAE: 2.7371964876087294


Comparison with the Schooling model:

- Schooling: RMSE 4.71, MAE 3.69
- Income composition of resources: RMSE 3.50, MAE 2.74

Using "Income composition of resources" as the predictor outperforms the model that used "Schooling" in terms of both RMSE and MAE. 

Plan for narrowing down variables:

- Correlation analysis: Start by calculating the correlation between each feature and the target variable (Life Expectancy). A high absolute correlation indicates a stronger linear relationship.

- Model evaluation: Fit models using the selected features and evaluate their performance using error metrics such as RMSE and MAE. Compare the models and choose the best-performing one.
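
The two steps above can be combined into one loop: rank features by absolute correlation, then fit a one-feature model for each of the top few and compare errors. The following is a minimal sketch of that idea. `compare_single_feature_models` is a helper name introduced here, and the `demo` frame is synthetic stand-in data rather than the WHO dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def compare_single_feature_models(df, target, top_k=3):
    """Fit one-feature linear models for the top_k most correlated columns."""
    numeric = df.select_dtypes(include=[np.number])
    # Rank candidate features by absolute correlation with the target.
    ranked = (
        numeric.corr()[target]
        .drop(target)
        .dropna()                      # constant columns give a NaN correlation
        .abs()
        .sort_values(ascending=False)
    )
    rows = []
    for feature in ranked.index[:top_k]:
        model = LinearRegression().fit(numeric[[feature]], numeric[target])
        predictions = model.predict(numeric[[feature]])
        rmse = np.sqrt(mean_squared_error(numeric[target], predictions))
        rows.append({"feature": feature, "abs_corr": ranked[feature], "rmse": rmse})
    return pd.DataFrame(rows)

# Synthetic stand-in data (NOT the WHO dataset), just to exercise the function.
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
weak = rng.normal(size=200)
demo = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "constant": 1.0,
    "target": 3 * strong + 0.5 * weak + rng.normal(scale=0.1, size=200),
})

comparison = compare_single_feature_models(demo, "target")
print(comparison)
```

On the real data the call would presumably look like `compare_single_feature_models(who_data_2015.drop(columns="Schooling"), "Life expectancy")`. Note that the errors are measured on the same rows used for fitting, as in the solution above.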

## Extra exercises

The following exercises with (*) will not be assessed. Use these to check your understanding of topics covered in the past 2 weeks.

### Exercise 5*

The function `model_correct_predictions` below returns the number of correct predictions made by a predictive model for the cancer biopsy dataset, for a given parameter value. This parameter value simply controls the threshold value for radius above which a sample is predicted as malignant.

Try different values of the parameter in this model within the range [0, 30]. Record and plot the resulting accuracy values against the parameter value (similar to the regression cost function example above).

What value of the parameter provides the best error rate? Explain how you can be confident you have found the best result here.

In [4]:
def model_correct_predictions(radius_split_parameter):
    """ Return the number of correct predictions made by the model
    for the given parameter value. """
    data = wisconsin_cancer_biopsies.assign(
        predicted=lambda df: df['radius_mean'].lt(radius_split_parameter)
            .map({True: "benign", False: "malignant"})
    )
    return (data['diagnosis'] == data['predicted']).sum()

model_correct_predictions(12)

369
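
One possible way to approach the sweep is sketched below. It uses synthetic radius values rather than the biopsy dataset so that it runs on its own; `accuracy_at` is a hypothetical stand-in for `model_correct_predictions` divided by the number of rows, and it follows the same rule (radius below the threshold is predicted benign).

```python
import numpy as np

# Synthetic stand-in for radius measurements (NOT the biopsy dataset):
# benign samples tend to have smaller radii than malignant ones.
rng = np.random.default_rng(42)
radius = np.concatenate([rng.normal(12, 2, 300), rng.normal(17, 3, 200)])
is_malignant = np.concatenate([np.zeros(300, dtype=bool), np.ones(200, dtype=bool)])

def accuracy_at(threshold):
    """Accuracy when radius >= threshold is predicted malignant."""
    predicted_malignant = radius >= threshold
    return (predicted_malignant == is_malignant).mean()

# Sweep the whole range [0, 30] on a fine grid and keep every result.
thresholds = np.linspace(0, 30, 301)
accuracies = np.array([accuracy_at(t) for t in thresholds])

best_threshold = thresholds[accuracies.argmax()]
print("Best threshold:", best_threshold, "accuracy:", accuracies.max())
```

Plotting `accuracies` against `thresholds` (for example with `matplotlib.pyplot.plot`) gives the requested curve. Because the accuracy can only change when the threshold crosses an observed radius value, a sufficiently fine grid, or the set of distinct observed values, covers every possible outcome.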

### Exercise 6*

In the examples in the pre-studio notebook (W4) we used root mean squared error (the standard cost function for linear regression) to fit the model parameters. Try re-running the `scipy.optimize` method using mean absolute error. Are the resulting model parameters the same as above? Give some brief reasoning why there might be a difference here.

In [5]:
# Hint: you only need to make one small change in the prediction_error function to do this.

In [6]:
def prediction_root_mean_squared_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    # Taking the square root of MSE gives RMSE, so we're in the same units as MAE.
    # (np.sqrt is used because `squared=False` is deprecated in recent scikit-learn releases.)
    return np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))

def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)
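
As a rough illustration of why the two cost functions can disagree, the sketch below fits a line to synthetic data containing one large outlier, once by minimising RMSE and once by minimising MAE. The data, the starting guess and the choice of the Nelder-Mead method are all assumptions made for this example; the pre-studio notebook may have used different settings.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data (NOT the WHO dataset): an exact line plus one large outlier.
x = np.arange(20, dtype=float)
y = 2.0 * x + 50.0
y[-1] += 40.0

def rmse(params):
    """Root mean squared error of the line defined by (gradient, intercept)."""
    gradient, intercept = params
    return np.sqrt(np.mean((y - (gradient * x + intercept)) ** 2))

def mae(params):
    """Mean absolute error of the line defined by (gradient, intercept)."""
    gradient, intercept = params
    return np.mean(np.abs(y - (gradient * x + intercept)))

start = np.array([1.0, 40.0])  # arbitrary starting guess
# Nelder-Mead needs no derivatives, so it can cope with the kinks in MAE.
rmse_fit = minimize(rmse, start, method="Nelder-Mead")
mae_fit = minimize(mae, start, method="Nelder-Mead")

print("RMSE fit (gradient, intercept):", rmse_fit.x)
print("MAE fit  (gradient, intercept):", mae_fit.x)
```

Squaring the residuals penalises the single large error heavily, so the RMSE fit is pulled towards the outlier, whereas MAE weights every residual linearly and is less affected by it.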

### Exercise 7*

We can see above that different methods for determining model parameters arrive at the same result, but what happens if we change the dataset slightly? Experiment by taking several (at least 10) different samples of the data, fitting a linear model for each one, and plotting a histogram of the different gradient and intercept coefficients you find. Is there a significant amount of variation in the parameter values?

In [7]:
sample_data = who_data_2015.sample(30)  # selects a small sample of 30 random rows from the data.
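
A possible shape for this experiment is sketched below. It draws repeated samples of 30 rows, fits a least-squares line to each with `np.polyfit`, and collects the coefficients. The `population` frame is synthetic stand-in data (not the WHO dataset), with column names borrowed from above and a trend chosen arbitrarily.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the WHO dataset) with a known linear trend plus noise.
rng = np.random.default_rng(1)
schooling = rng.uniform(5, 20, size=150)
population = pd.DataFrame({
    "Schooling": schooling,
    "Life expectancy": 2.3 * schooling + 43 + rng.normal(scale=4, size=150),
})

fits = []
for seed in range(20):
    # A different random_state gives a different, reproducible sample of 30 rows.
    sample = population.sample(30, random_state=seed)
    # np.polyfit with degree 1 returns (gradient, intercept) of the least-squares line.
    gradient, intercept = np.polyfit(sample["Schooling"], sample["Life expectancy"], 1)
    fits.append({"gradient": gradient, "intercept": intercept})

coefficients = pd.DataFrame(fits)
print(coefficients.describe().loc[["mean", "std", "min", "max"]])
```

Calling `coefficients.hist()` (which requires matplotlib) would draw the two histograms. The `std` row gives a quick numerical read on how much the fitted parameters vary from sample to sample.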