# Homework
- **due Wed Sep 3 at 3 pm JST**
- if you had trouble setting up our `hsi25_ml-ssc_25.0` conda environment, please install `matminer` separately to complete this HW
- to install it, uncomment the line in the code cell below (remove the leading `#` so that it reads `!pip install matminer`) and run the cell; this may take a few minutes. If you run into issues, see the installation instructions at https://hackingmaterials.lbl.gov/matminer/installation.html
- please feel free to email Prof. Bartel (`cbartel@umn.edu`) if you have questions related to the assignment

In [None]:
#!pip install matminer

# Scoring
- (1) = 50 points
    - (a): 8 points
    - (b): 20 points
    - (c): 10 points
    - (d): 12 points

# (1) Consider this paper:
Machine Learning Directed Search for Ultraincompressible, Superhard Materials \
Aria Mansouri Tehrani, Anton O. Oliynyk, Marcus Parry, Zeshan Rizvi, Samantha Couper, Feng Lin, Lowell Miyagi, Taylor D. Sparks, and Jakoah Brgoch* 

J. Am. Chem. Soc. 2018, 140, 31, 9844–9853

## (a) Understanding what they did
**Guidelines**:
- Answer each of the following in 1-3 sentences
- Use markdown cells for your answers
### (1) What was the objective of this paper?
### (2) Was this supervised or unsupervised learning?
### (3) Was this regression or classification?
### (4) Describe their approach to validating their machine learning model.

**Scoring**:
- +1 point for attempting each
- +1 point for a suitable answer to each

### Students answer here

## Exploring their data

### Load their data using matminer

In [None]:
from matminer.datasets import load_dataset
import warnings
import pandas as pd
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

df = load_dataset('brgoch_superhard_training')
df.head()

### Convert their feature dicts to columns

In [None]:
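# each entry of 'brgoch_feats' is a dict mapping feature name -> value
brgoch_feat_dicts = df['brgoch_feats'].values
brgoch_feat_names = sorted(brgoch_feat_dicts[0].keys())

# expand every dict key into its own DataFrame column
for feature in brgoch_feat_names:
    df[feature] = [feat_dict[feature] for feat_dict in brgoch_feat_dicts]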

In [None]:
df.head()

In [None]:
df.describe()

### Separate features from targets and non-features

In [None]:
targets = ['bulk_modulus', 'shear_modulus']
non_features = ['formula', 'composition', 'material_id', 'structure', 'brgoch_feats', 'suspect_value']
columns = list(df)
# keep every column that is neither a target nor a non-feature
features = [f for f in columns if f not in targets and f not in non_features]

In [None]:
features

## (b) Train a support vector regressor to predict bulk modulus
**Guidelines**:
1. reserve 15% of your data for testing
2. scale your features 
3. identify the best regularization parameter (`C`) to use with a polynomial kernel
4. plot the training and validation RMSE as a function of `C` and briefly discuss why you think the `C` value you chose is best (1-2 sentences)

**Scoring**:
- +2 points for attempting each point in the guidelines
- +3 points for satisfactorily addressing each point
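
The four guideline steps can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not a reference solution: `make_regression` stands in for the real dataset, and the 80/20 fit/validation split, the grid of `C` values, and the `random_state` values are all arbitrary choices made for this example. Plotting is omitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# synthetic stand-in data (assumption: NOT the homework dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# step 1: hold out 15% of the data for final testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# carve a validation set out of the training data (80/20 is an assumption)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

# steps 2-3: scale inside a pipeline so the scaler only sees the fitting data,
# then sweep C for a polynomial-kernel SVR (the grid is an assumption)
C_values = [2.0 ** k for k in range(-2, 9, 2)]
train_rmse, val_rmse = [], []
for C in C_values:
    model = make_pipeline(StandardScaler(), SVR(kernel='poly', C=C))
    model.fit(X_fit, y_fit)
    train_rmse.append(np.sqrt(mean_squared_error(y_fit, model.predict(X_fit))))
    val_rmse.append(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

# step 4 (selection only): pick the C with the lowest validation RMSE
best_C = C_values[int(np.argmin(val_rmse))]
print('best C on this toy data:', best_C)
```

Putting the scaler inside the pipeline is one way to avoid leaking validation statistics into the fit; scaling by hand with a `StandardScaler` fitted on the training subset only would serve the same purpose.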

In [None]:
######### Students answer here (and in more cells below)



### Students comment on your solution here (justify your selection of `C`)


## (c) Use permutation importances to determine which features are most important for this prediction

**Guidelines**:
- use `sklearn.inspection.permutation_importance`
- use `SVR(kernel='poly', C=32)` as your estimator
- permutation importances should be determined on a validation set
    - you should fit your estimator to a subset of `X_train`, `y_train` and determine importances using a different subset
- use `n_repeats = 2` so that it doesn't take too long
- use `scoring = 'neg_root_mean_squared_error'`
- print the 10 most important features
    - print the feature name along with the importance value
- plot a bar chart of all sorted importances
    - the y-axis is the feature importance
    - the x-axis is the index of each feature (index = 0 should correspond with the most important feature)

**Hints**:
- review the [User Guide](https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance) for `permutation_importance`

**Scoring**:
- +4 points for attempting
- +3 points for successfully computing permutation importances
- +1 point for correctly using a validation set
- +1 point for printing the 10 most important features and their corresponding importance values
- +1 point for generating the bar chart
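
As a rough illustration of the API described in the guidelines, the sketch below runs `permutation_importance` on synthetic data. The data, the feature names (`x0`, `x1`, ...), the 75/25 split, and the `random_state` values are assumptions made for this example only; the bar chart is left out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# synthetic stand-in data with hypothetical feature names
X, y = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)
feature_names = [f'x{i}' for i in range(X.shape[1])]

# fit on one subset, compute importances on a different (validation) subset
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(StandardScaler(), SVR(kernel='poly', C=32))
model.fit(X_fit, y_fit)

result = permutation_importance(
    model,
    X_val,
    y_val,
    scoring='neg_root_mean_squared_error',
    n_repeats=2,
    random_state=0,
)

# sort features from most to least important (mean over the repeats)
order = np.argsort(result.importances_mean)[::-1]
sorted_importances = result.importances_mean[order]

# print the top three features of this toy problem with their importances
for idx in order[:3]:
    print(f'{feature_names[idx]}: {result.importances_mean[idx]:.3f}')
```

With this scoring, an importance is the drop in negative RMSE when a feature is shuffled, so larger positive values mean the model relies more on that feature.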

In [None]:
########## Students answer here (and in below cells)


## (d) Consider how feature correlations might influence your results
**Guidelines**:
- compute the pairwise Pearson correlation among three features in your model: `['crystal_radius_feat_1', 'covalent_radius_feat_1', 'ionic_radius_feat_1']`
- for the purposes of this exercise, it's OK to use your full data (i.e., the whole DataFrame). In practice, you'd want to do this only on the training set
- your printed output should be:
  - `PC(feature 1, feature 2) = <the Pearson correlation between feature 1 and feature 2>`
  - `PC(feature 1, feature 3) = <the Pearson correlation between feature 1 and feature 3>`
  - `PC(feature 2, feature 3) = <the Pearson correlation between feature 2 and feature 3>`
- discuss the implications of these correlations on: 1) training the model and 2) interpreting the feature importances

**Hints**:
- [this link](https://realpython.com/numpy-scipy-pandas-correlation-python/) may help

**Scoring**:
- +4 points for attempting
- +4 points for correctly computing and printing Pearson correlations
- +2 points for discussing implications for training
- +2 points for appropriately discussing implications for interpretation
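
One possible way to compute pairwise Pearson correlations with pandas is sketched below on a toy DataFrame. The column names `a`, `b`, `c` and the synthetic values are assumptions for illustration and are unrelated to the homework features.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# toy data: 'b' is built to track 'a' closely, 'c' is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'b': base + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

# full Pearson correlation matrix between all columns
corr = toy.corr(method='pearson')

# print each unique pair once
for f1, f2 in combinations(toy.columns, 2):
    print(f'PC({f1}, {f2}) = {corr.loc[f1, f2]:.3f}')
```

Because `b` is constructed from `a` plus a small amount of noise, the two are strongly correlated here, which is the situation in which shuffling one feature lets the model fall back on the other.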

In [None]:
######## Students answer here (and in below cells)


### Students comment on your solution here
