**Instructions:**

- For questions that require coding, you need to write the relevant code and display its output. Your output should either be the direct answer to the question or clearly display the answer in it.
- For questions that require a written answer (sometimes along with the code), you need to put your answer in a Markdown cell. Writing the answer as a comment or as a print line is not acceptable.
- You need to render this file as HTML using Quarto and submit the HTML file. **Please note that this is a requirement and not optional.** A submission cannot be graded until it is properly rendered.

Import all the libraries and tools you need below.

### 1)

Read the data from **X_train.csv**, **y_train.csv**, **X_test.csv**, and **y_test.csv**. The predictors (X's) are rhythmic and timbre features extracted from a number of songs. The responses (y's) are the emotion class labels for each song. **(2.5 points)**

### 2)

Print the first five rows of either `y_train` or `y_test`. You should observe that each observation has multiple class labels, and that more than one of those labels can equal 1 for the same observation.

**(2.5 points)**

### 3)

Create a [Random Forest Classifier](https://scikit-learn.org/1.6/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Use 500 trees so that the variance of the ensemble is adequately reduced. Leave all other hyperparameters at their default values; tuning them does not change the results substantially. Use `random_state=2` for reproducibility. **(5 points)**

### 4)

Train the Random Forest on the multi-label data using the **Binary Relevance** approach. You need to check the [scikit-learn documentation](https://scikit-learn.org/stable/api/sklearn.multioutput.html) for the correct object and its usage.

Evaluate the multi-label classifier on the test data, using [Hamming Loss](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.hamming_loss.html).

**(15 points)**
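As a point of reference, the sketch below shows the general shape of a Binary Relevance workflow in scikit-learn. It is only an illustration under assumptions: the data is synthetic (generated with `make_multilabel_classification`, not the assignment's CSV files), the forest uses 50 trees to keep it fast, and the variable names are made up for this example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data (assumption: NOT the assignment's song data).
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

# Binary Relevance: MultiOutputClassifier fits one independent copy
# of the base model per label column.
base = RandomForestClassifier(n_estimators=50, random_state=2)
br_model = MultiOutputClassifier(base).fit(X_tr, Y_tr)

# Predictions have one column per label, like Y_te.
Y_pred = br_model.predict(X_te)
br_loss = hamming_loss(Y_te, Y_pred)
print(f"Hamming loss on the synthetic test split: {br_loss:.3f}")
```

The fitted per-label models are available in `br_model.estimators_`, one entry per label column.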

### 5)

What does the Hamming Loss represent in terms of what the model predicts right/wrong? **(10 points)**
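To make the metric concrete, here is a small hand computation on toy arrays (the values are invented for illustration). It treats every (observation, label) cell as a separate prediction and reports the fraction of cells that disagree, which is how scikit-learn's `hamming_loss` behaves for multi-label indicator input.

```python
import numpy as np

# Toy multi-label targets and predictions (made-up values):
# 2 observations, 3 labels each.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 1]])

# Count label cells where prediction and truth disagree,
# then divide by the total number of cells.
n_wrong = int((y_true != y_pred).sum())
manual_hamming = n_wrong / y_true.size

print(n_wrong, "of", y_true.size, "label cells are wrong")
print(f"Hamming loss by hand: {manual_hamming:.4f}")
```

Here two of the six cells differ, so the loss is 2/6.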

### 6)

Train the Random Forest on the multi-label data using the **Classifier Chain** approach. Keep all the inputs (other than the base model) default. You may need to refer back to the scikit-learn documentation for the correct object.

Evaluate the multi-label classifier on the test data, using Hamming Loss. You should see the same performance as Question 4.

**(15 points)**
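The following is a minimal sketch of a Classifier Chain with default settings, again on synthetic data rather than the assignment files and with a smaller forest for speed. The fitted attribute `order_` shows which label order the chain actually used.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in data (assumption: NOT the assignment's song data).
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

# The base model is passed positionally; everything else stays default.
chain = ClassifierChain(RandomForestClassifier(n_estimators=50, random_state=2))
chain.fit(X_tr, Y_tr)

# Label order used by the fitted chain.
chain_order = list(chain.order_)
print("Chain order:", chain_order)

# Cast to int so predictions have the same dtype as the targets.
Y_pred_chain = chain.predict(X_te).astype(int)
chain_loss = hamming_loss(Y_te, Y_pred_chain)
print(f"Hamming loss on the synthetic test split: {chain_loss:.3f}")
```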

### 7)

Using the scikit-learn documentation, answer the following about the multi-label model in Question 6:

- Are you using the true or the predicted classes from the previous classifier(s) as the predictors of the next classifier in the chain?
- What is the order of class variables that you use for the chain?

**(10 points)**

### 8)

Repeat Question 6, but pass `cv=5` as an additional argument to the Classifier Chain. What does this change about the multi-label model?

You should see a slightly lower performance. Why is this a more realistic evaluation of the model?

**(10 points)**

### 9)

Run the given cell below. It calculates and prints the Variance Inflation Factor (VIF) of each class variable.

VIF summarizes the multi-collinearity that each variable has with all the other variables. The higher a variable's VIF, the more strongly it is correlated with the other variables taken together. **Note that having a correlation/multi-collinearity with other variables means carrying some information about other variables.**

**(0 points)**

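```python
# Assumes pandas is imported as pd and y_train is already loaded (Question 1).
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add an intercept column so each VIF regression includes a constant term.
y = add_constant(y_train)
vif_data = pd.DataFrame()
vif_data["feature"] = y.columns

# Compute the VIF of every column (including the constant) by position.
for i in range(len(y.columns)):
    vif_data.loc[i, "VIF"] = variance_inflation_factor(y.values, i)

print(vif_data)
```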
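For intuition about what the VIF measures, the sketch below recomputes it by hand with NumPy on toy binary labels (invented for illustration, not the assignment data). Each column is regressed on all the others with an intercept, and the VIF is 1 / (1 - R²) of that regression. The helper name `vif_by_hand` is hypothetical.

```python
import numpy as np

def vif_by_hand(Y):
    """Return 1 / (1 - R^2) for each column regressed on the others."""
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    vifs = []
    for j in range(k):
        target = Y[:, j]
        others = np.delete(Y, j, axis=1)
        # Design matrix: intercept plus all remaining columns.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = float((resid ** 2).sum())
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Toy labels: c is a copy of a with about 10% of entries flipped,
# while b is generated independently of both.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
b = rng.integers(0, 2, size=1000)
flip = rng.random(1000) < 0.1
c = np.where(flip, 1 - a, a)

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two related columns (`a` and `c`) get a clearly higher VIF than the independent column `b`, whose VIF stays close to 1.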

### 10)

Using the output of Question 9, repeat Question 8, only this time with the **most informative** order of the class variables. (Python starts counting from zero.)

You should see the best model performance in this assignment. Why is this the case?

**(15 points)**
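Mechanically, a custom chain order is passed through the `order` argument of `ClassifierChain` as a list of zero-based column indices. The sketch below uses synthetic data and an arbitrary, hypothetical order (`[2, 0, 3, 1]`); it is not derived from any VIF output and only illustrates the call.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in data (assumption: NOT the assignment's song data).
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

# Hypothetical order: zero-based positions of the label columns.
custom_order = [2, 0, 3, 1]

ordered_chain = ClassifierChain(
    RandomForestClassifier(n_estimators=50, random_state=2),
    order=custom_order,
    cv=5,
)
ordered_chain.fit(X_tr, Y_tr)

# predict returns the columns in the original label order,
# regardless of the order used inside the chain.
Y_pred_ordered = ordered_chain.predict(X_te).astype(int)
ordered_loss = hamming_loss(Y_te, Y_pred_ordered)
print("Order used:", list(ordered_chain.order_))
print(f"Hamming loss on the synthetic test split: {ordered_loss:.3f}")
```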

### 11)

Finally, using the predictions of the best Classifier Chain (from Question 10), calculate and print the test accuracy of the model **for each emotion**. Which emotions are predicted the most and least accurately?

**(15 points)**
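Per-label accuracy can be sketched as a column-wise comparison of the true and predicted label matrices. The arrays and the label names below are hypothetical placeholders, not the assignment's emotions or results.

```python
import numpy as np

# Hypothetical label names and toy values, for illustration only.
label_names = ["emotion_a", "emotion_b", "emotion_c"]
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1],
                   [0, 0, 1]])

# Fraction of observations predicted correctly, separately for each column.
per_label_acc = (y_true == y_pred).mean(axis=0)

for name, acc in zip(label_names, per_label_acc):
    print(f"{name}: {acc:.2f}")

best = label_names[int(per_label_acc.argmax())]
worst = label_names[int(per_label_acc.argmin())]
print("Most accurate:", best, "| Least accurate:", worst)
```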