# Mud card questions

## Parameter tuning

- **How does GridSearchCV work? What's the difference between GridSearchCV and PredefinedSplit?**
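    - it loops over every combination of values in the parameter grid, trains a model for each combination on each training fold, and evaluates it on the corresponding validation fold
    - PredefinedSplit is not a search method but a splitter: you specify in advance which points belong to the validation set, and you pass it to GridSearchCV through the `cv` argument
    - please read through the documentation of both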
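
A minimal sketch of how the two fit together, assuming a synthetic dataset from `make_classification` and a logistic regression with a made-up grid (none of these choices come from the lecture). The hand-written loop is only meant to illustrate what GridSearchCV does internally, not how it is actually implemented:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, PredefinedSplit

# synthetic data, only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# PredefinedSplit: -1 means "always in the training set",
# 0 means "belongs to validation fold 0"
test_fold = np.full(len(X), -1)
test_fold[150:] = 0
ps = PredefinedSplit(test_fold)

model = LogisticRegression(max_iter=1000)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}

# the search written out by hand: every combination, every split
manual_scores = []
for params in ParameterGrid(param_grid):
    for train_idx, val_idx in ps.split():
        m = clone(model).set_params(**params)
        m.fit(X[train_idx], y[train_idx])
        manual_scores.append(m.score(X[val_idx], y[val_idx]))

# the same search with GridSearchCV, using the predefined split as cv
grid = GridSearchCV(model, param_grid, cv=ps, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

With a single predefined validation fold, the scores from the manual loop should match the `mean_test_score` column that GridSearchCV records.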

- **This is all really useful example code, but I'm still a little confused by how the gridsearch methods work and what exactly they return. Is there a simple way to describe how these functions operate?**
    - I recommend you print out the results dataframe we created, read through the manual, and check out the examples on the sklearn page
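
As a rough illustration of what a fitted GridSearchCV object gives you back, here is a sketch on synthetic data. The SVC model and the grid values are assumptions chosen for the example, not the ones used in class:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# synthetic data, only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
kf = KFold(n_splits=4, shuffle=True, random_state=0)
grid = GridSearchCV(SVC(), param_grid, cv=kf, scoring="accuracy",
                    return_train_score=True)
grid.fit(X, y)

# one row per parameter combination; sklearn calls the validation
# score "test_score" in these column names
results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_train_score", "mean_test_score",
               "std_test_score", "rank_test_score"]])

print(grid.best_params_)     # combination with the best mean validation score
print(grid.best_score_)      # that mean validation score
print(grid.best_estimator_)  # model refit on all of X, y with the best parameters
```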

- **Is it good to include all the hyperparameters in one GridSearchCV to find the best combination of the hyperparameters? Or do we need more steps in finding the best model?**
    - yes, I'd tune all hyperparameters in one GridSearchCV
    - the only exception might be the critical probability; you'd tune that after GridSearchCV
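
One possible way to do the two steps, sketched under several assumptions: synthetic imbalanced data, a logistic regression, ROC AUC as a threshold-independent score for the grid search, and the F1 score on a separate validation set for choosing the critical probability. All of these are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic, mildly imbalanced data, only for illustration
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# step 1: tune the hyperparameters with GridSearchCV on the training set
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, cv=4, scoring="roc_auc")
grid.fit(X_train, y_train)

# step 2: tune the critical probability on held-out data
proba = grid.predict_proba(X_val)[:, 1]
p_crits = np.linspace(0.1, 0.9, 17)
scores = [f1_score(y_val, (proba >= p).astype(int), zero_division=0)
          for p in p_crits]
best_p_crit = p_crits[int(np.argmax(scores))]
print(grid.best_params_, best_p_crit)
```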

- **How do you know which hyperparameters you want to tune? Is it a case of trial and error? Also, is the only way to check a hyperparameter's importance to run the model?**
    - over time, you'll gain more experience and you'll have a better feeling of which parameters you should tune
    - but for now, try as many parameters as you can and use the validation scores to measure which parameters impact the score the most

- **Are there any particular advantages to predefining a train/test split in advance and then using GridSearchCV compared to just splitting the data using k-fold splitting and then applying GridSearchCV?**
    - if your data is IID, k-fold is fine
    - if your data is not IID and you need a non-standard way to split your data, predefined split might be better

- **How do we split time series data with group structure?**
    - that's currently not supported by sklearn so you need to write your own custom splitter function
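
A custom splitter can be as simple as a generator that yields pairs of index arrays. The sketch below is one hypothetical version, assuming that the group labels sort into chronological order and that each group should stay entirely in either the training or the validation set. The function name `group_time_series_split` is made up for this example:

```python
import numpy as np

def group_time_series_split(groups, n_splits=3):
    """Yield (train_idx, val_idx) pairs where whole groups stay together
    and validation groups always come later than the training groups.
    Assumes sorting the group labels gives their chronological order."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)  # sorted, assumed chronological
    if n_splits >= len(unique_groups):
        raise ValueError("need more groups than splits")
    # cut the ordered groups into n_splits + 1 consecutive blocks
    blocks = np.array_split(unique_groups, n_splits + 1)
    for i in range(1, n_splits + 1):
        train_groups = np.concatenate(blocks[:i])
        val_groups = blocks[i]
        train_idx = np.flatnonzero(np.isin(groups, train_groups))
        val_idx = np.flatnonzero(np.isin(groups, val_groups))
        yield train_idx, val_idx

# 8 groups with 5 points each, group ids increase with time
groups = np.repeat(np.arange(8), 5)
splits = list(group_time_series_split(groups, n_splits=3))
for train_idx, val_idx in splits:
    print(np.unique(groups[train_idx]), np.unique(groups[val_idx]))
```

A list of such index pairs can be passed to GridSearchCV as the `cv` argument, which accepts an iterable of (train, test) index arrays.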

- **I see in the hyperparameter tuning with folds examples you do both preprocessing and model training behind the curtain. Since we can also use train_test_split (predefined split) for GridSearchCV, would you recommend that we do so if I later find out the results are the same for both ways? (I haven't tested this yet.)**
    - I recommend using ML pipelines to combine preprocessing with the actual ML algorithm
    - this is powerful because you can directly use the raw input data and it makes your code shorter and clearer
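
A sketch of such a pipeline, assuming a hypothetical raw dataframe with two numerical columns and one categorical column (the column names, the data, and the grid are invented for the example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical raw data, only for illustration
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

# parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
kf = KFold(n_splits=4, shuffle=True, random_state=0)
grid = GridSearchCV(pipe, param_grid, cv=kf, scoring="accuracy")
grid.fit(df, y)  # the preprocessor is refit on the training part of each fold
print(grid.best_params_, grid.best_score_)
```

Because the preprocessing lives inside the pipeline, it is fit on the training folds only, so nothing from the validation fold leaks into the scaler or the encoder.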

## Other

- **We've been using different random states while checking the evaluation metrics and it seems like they do have a big effect on the value of the metrics. When building our own ML pipelines, what range of random states does it make sense to try? It feels like it is likely that there will always be a random state that could make the performance better.**
    - DO NOT OPTIMIZE ON THE RANDOM STATE! 
    - there is no guarantee that a lucky state that does well on the test set will also do well on new data you'll get during deployment
    - the range itself doesn't really matter; 0-9 is fine. Use the different random states to estimate the mean and spread of your test score, not to pick a winner
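
A sketch of looping over random states to measure the spread of the test score rather than to pick the best state. The synthetic data, the logistic regression, and the grid are assumptions made for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# synthetic data, only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

test_scores = []
for random_state in range(10):
    # a different split of the data for every random state
    X_other, X_test, y_other, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    kf = KFold(n_splits=4, shuffle=True, random_state=random_state)
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [0.01, 0.1, 1.0, 10.0]}, cv=kf,
                        scoring="accuracy")
    grid.fit(X_other, y_other)
    test_scores.append(grid.score(X_test, y_test))

# report all states together; do not keep only the luckiest one
print(f"test accuracy: {np.mean(test_scores):.3f} +/- {np.std(test_scores):.3f}")
```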

- **I'm wondering why for each iteration through a random state, we find the model that performs the best and calculate an accuracy score for the test data? Won't we likely end up with models with different hyperparameters as our optimal model for a given random state? In this case we would be computing the accuracy score for our test data with several different models and these test accuracy scores could ultimately influence our decisions about which model is best overall. In that case our test data has influenced our model. Would it make more sense to save the test data till the very end after our final model has been selected, for example by finding commonalities in the hyperparameters that performed well on the validation data?**
    - excellent question!
    - KFold does roughly what you suggest. If you have 4 folds, the best hyperparameters are selected based on the average of 4 validation scores, which reduces the randomness in the selected hyperparameters
    - however, if you calculate your test score only once, you won't know the uncertainty of your test score due to randomly splitting your dataset

- **For datasets where there are fewer data points like there were in this case, is it typical to divide by time? Is there a way to divide for cases where it is not time-dependent data in order to create more data points?**
    - it's project-specific
    - think about what makes sense for your particular project
    - given that some seizures go on for minutes, it made sense to me to split the seizures into shorter chunks

- **This was a really interesting project to hear about. Since many such projects are probably open source, how can we go about finding them? (Online resources/websites)**
    - look for journals that publish in your area of interest and read papers published in that journal

- **In time-series data, will some models have limitations? If so, what models will we typically use in time-series?**
    - classical ML methods (the techniques we use in this class) typically aren't the best for time series data
    - you'll learn about LSTMs next term, which are the current state of the art

- **Should we consider other scores besides an accuracy score when choosing the optimal hyperparameters? What if the accuracy scores are very similar but one parameter value has a much higher precision?**
    - only use accuracy for classification if your dataset is reasonably balanced
    - you should choose the score before you choose the hyperparameters and stick with it
    - you don't use multiple scores to select hyperparameters; if precision matters for your problem, pick a metric that reflects that up front
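
To see why accuracy is a poor choice on imbalanced data, and how a single metric is fixed before the search, here is a sketch on synthetic data with roughly 5% positives. The choice of the F1 score here is an assumption for illustration; the right metric depends on your problem:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic imbalanced data, only for illustration
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.95, 0.05], flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# a model that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)
dummy_acc = accuracy_score(y_test, dummy_pred)            # high, but meaningless
dummy_f1 = f1_score(y_test, dummy_pred, zero_division=0)  # no positives found
print(dummy_acc, dummy_f1)

# choose one metric up front and use it for the whole search
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, cv=4, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```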
    

- **Would you explain more about the function of n_jobs?**
    - it sets how many jobs are run in parallel, which in practice means how many cores are used to run the function (e.g., GridSearchCV or RandomForest); `n_jobs=-1` uses all available cores
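
A small sketch with a random forest on synthetic data (the dataset and the number of trees are arbitrary choices for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data, only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# n_jobs=-1 uses all available cores to build the trees in parallel;
# n_jobs=2 would use two cores, and the default (None) usually means one
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```

GridSearchCV accepts the same `n_jobs` argument, in which case the individual model fits of the search are distributed across cores.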

- **Outside of splitting, what common mistakes should we be aware of that can cause our model evaluation to be incorrect and/or off its true value?**
    - leaving the target variable in the feature matrix: I did this a couple of times and the ML model gave perfect predictions :D
    - using linear models and SVMs without standardizing all features
    - using collinear features with linear models, which don't handle them well
    - imputing values when it doesn't make sense
    - using an incorrect metric (e.g., accuracy when your dataset is imbalanced)
    - generally not understanding your data, your problem, or the models you use


- **Quiz 1 and 2**