# Technical Interview: Model Training and Evaluation

## Defining a Machine Learning Problem

However, if you use the mean of all the available data before splitting it into training, validation, and test sets, then it inevitably captures traits of the test set. Hence, the ML model will be trained on imputed data that contains latent information about the test set, which sometimes causes the accuracy to increase for no reason other than the way the data was imputed. This is called data leakage. If you want to use imputation, then be sure to split the training, validation, and test sets first, and impute missing values in the training set with the summary statistics of the training set only. If you don’t mention this in the interview or explain it correctly, that’s a pretty obvious oversight in your ML model, unless you can defend your reasoning.
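The split-first, impute-second order described above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the feature values, the 60/20/20 split, and the helper names `split_indices` and `impute_with` are all assumptions made for the example.

```python
import random
from statistics import mean


def split_indices(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def impute_with(values, fill):
    """Replace None entries with a precomputed fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical single feature with missing entries (None).
feature = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0, None, 100.0]

# 1. Split first, before any statistic is computed.
train_idx, val_idx, test_idx = split_indices(len(feature))
train = [feature[i] for i in train_idx]
val = [feature[i] for i in val_idx]
test = [feature[i] for i in test_idx]

# 2. Compute the fill value from observed training values only.
train_mean = mean(v for v in train if v is not None)

# 3. Apply that same training statistic to every split.
train_imputed = impute_with(train, train_mean)
val_imputed = impute_with(val, train_mean)
test_imputed = impute_with(test, train_mean)
```

The validation and test sets are filled with `train_mean` rather than their own means, so nothing about them feeds back into training. Libraries such as scikit-learn follow the same pattern by fitting an imputer on the training split and then applying it to the other splits.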

## Interview question 4-1: What’s the difference between feature engineering and feature selection?

- Feature engineering is about creating or transforming features from raw data. This is done to better represent the data and make the data more suitable for ML compared to its raw format. Common techniques include handling missing data, standardizing data formats, and so on.

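- Feature selection is about narrowing down relevant ML features to simplify the model and prevent overfitting. Common techniques include using tree-based models’ feature importance to see which features contribute more useful signals. PCA (principal component analysis) is often mentioned alongside these techniques; note that it reduces dimensionality by building new components from the original features rather than by picking a subset of them.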


## Interview question 4-2: How do you prevent data leakage issues while conducting data preprocessing?

- Being careful with the training, validation, and test splits is the most common way to prevent data leakage: split first, then preprocess. However, things aren’t always so simple. For example, if data imputation uses the mean of all observations in a feature, that mean contains information about every observation, not just the training split. In that case, compute the imputation value from the training split only and then apply it to the other splits. Time-series data is another common source of leakage: shuffling before splitting can place later observations in the training set and earlier ones in the test set (e.g., using tomorrow to predict today instead of the other way around).
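For the time-series case, one way to avoid the problem is to sort by timestamp and cut once, with no shuffling. The sketch below assumes rows shaped as `(date, value)` tuples and an 80/20 cut; both the data and the function name `chronological_split` are hypothetical.

```python
from datetime import date, timedelta


def chronological_split(rows, train_frac=0.8):
    """Sort (timestamp, value) rows by time and cut once, without shuffling."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Hypothetical daily observations that arrive out of order.
start = date(2024, 1, 1)
rows = [(start + timedelta(days=d), float(d)) for d in (3, 0, 4, 1, 2, 7, 5, 9, 8, 6)]

train, test = chronological_split(rows)

# Every training timestamp precedes every test timestamp.
latest_train = max(ts for ts, _ in train)
earliest_test = min(ts for ts, _ in test)
```

Because the cut is made on sorted data, the model only ever sees the past when it is evaluated on the future.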

## Interview question 4-3: How do you handle a skewed data distribution during feature engineering, assuming that the minority data class is required for the machine learning problem?

- Sampling techniques, such as oversampling the minority data classes, could help during preprocessing and feature engineering (for example, using techniques like SMOTE). It’s important to note that for oversampling, any duplicate or synthetic instances should be generated only from the training data to avoid data leakage with the validation or test set.
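As a rough illustration of oversampling on the training split only, the sketch below uses plain random duplication. It is not SMOTE, which synthesizes new points instead of copying existing ones. The toy data and the helper name `oversample_minority` are assumptions made for this example.

```python
import random
from collections import Counter


def oversample_minority(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until each class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Hypothetical splits, made before any resampling happens.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y_train = [0, 0, 0, 0, 0, 1]
X_test = [[0.15], [0.95]]
y_test = [0, 1]

# Only the training split is resampled; the test split keeps its original,
# imbalanced distribution so the evaluation stays realistic.
X_bal, y_bal = oversample_minority(X_train, y_train)
```

Every added row is drawn from `X_train`, so no test observation can appear in the balanced training set.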

## Interview question 4-4: In what scenario would you use a reinforcement learning algorithm rather than, say, a tree-based method?

- RL algorithms are useful when it’s important to learn from trial and error and the sequence of actions is important. RL is also useful when the outcome can be delayed but we want the RL agent to be continuously improving. Examples include game playing, robotics, recommender systems, and so on.
- In contrast, tree-based methods, such as decision trees or random forests, are useful when the problem is static and nonsequential. In other words, it’s not as useful to account for delayed rewards or sequential decision making, and a static dataset (at the time of training) is sufficient.

## Interview question 4-5: What are some common mistakes made during model training, and how would you avoid them?
- Overfitting is a common problem: the resulting model captures overly complex information in the training data and doesn’t generalize well to new observations. Regularization techniques can be used to prevent overfitting.

- Not tuning common hyperparameters could cause models to underperform, since the default hyperparameters are often not the best choice for a given problem out of the box.

- Overengineering the problem could also cause issues during model training; sometimes it’s best to try out a simple baseline model before jumping right into very complex models or combinations of models.
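The advice to start with a simple baseline can be made concrete with a majority-class predictor, which gives a floor that any more complex model should beat. The labels below are made up, and `fit_majority_baseline` is a hypothetical helper; scikit-learn offers a comparable `DummyClassifier` for the same purpose.

```python
from collections import Counter


def fit_majority_baseline(y_train):
    """Return the most frequent training label; the 'model' is that constant."""
    return Counter(y_train).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for a binary problem.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_val = [0, 1, 0, 0, 1]

majority = fit_majority_baseline(y_train)
baseline_acc = accuracy(y_val, [majority] * len(y_val))
```

If a tuned, complex model barely improves on `baseline_acc`, that is a signal to revisit the features or the problem framing before adding more complexity.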

## Interview question 4-6: In what scenario might ensemble models be useful?

- When working with imbalanced datasets, where one class significantly outnumbers the others, ensemble methods can help improve the accuracy of results on minority data classes. By combining multiple models, we can reduce the overall bias toward the majority data class.

## Classification metrics
Classification metrics are used to measure the performance of classification models. As a shorthand, note that TP = true positive, TN = true negative, FP = false positive, and FN = false negative, as illustrated in Figure 4-5. Here are some other terms and values to know:

- Precision = TP / (TP + FP) (as illustrated in Figure 4-6)

- Recall = TP / (TP + FN) (as illustrated in Figure 4-6)

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
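The three formulas above translate directly into code. The following is a minimal sketch for binary labels; the example labels are invented, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the formulas prescribe.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def precision(tp, fp):
    # Assumption: report 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Assumption: report 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
p = precision(tp, fp)
r = recall(tp, fn)
acc = accuracy(tp, tn, fp, fn)
```

Here one positive is missed (FN = 1) and one negative is flagged (FP = 1), so precision and recall are both 3/4 while accuracy is 8/10.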

### F1 score
- Harmonic mean of precision and recall
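Written out, the harmonic mean takes the following form. The second expression follows from substituting the definitions of precision and recall in terms of TP, FP, and FN:

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

As a worked example with made-up counts, take TP = 8, FP = 2, and FN = 4. Then precision is 8/10 = 0.8, recall is 8/12 ≈ 0.667, and

$$
F_1 = \frac{2 \cdot 8}{2 \cdot 8 + 2 + 4} = \frac{16}{22} \approx 0.727
$$

This is slightly below the arithmetic mean of the two values (about 0.733), because the harmonic mean is pulled toward the smaller of precision and recall.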

### AUC (area under the ROC curve) and ROC (receiver operating characteristic)
- The curve plots the true positive rate against the false positive rate at various thresholds.
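To show how the curve and its area come about, the sketch below sweeps the decision threshold over every distinct score, records a (false positive rate, true positive rate) pair at each step, and integrates with the trapezoidal rule. The labels and scores are invented, and the function names are hypothetical; the sketch also assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strict to lenient."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and model scores (higher score = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
```

In this toy data, one negative (score 0.7) outranks one positive (score 0.6), so 8 of the 9 positive-negative pairs are ordered correctly and the area is 8/9.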

Research the company where you are interviewing and imagine what is valued there on a business level. Doing so can help you engage better in interview questions about model evaluation metrics. For example, in a malware detection ML system, false positives are important to reduce because you don’t want to create alert fatigue, which causes people to lose trust in the malware detection model itself.

## Clustering metrics
Clustering metrics are used to measure the performance of clustering models. Using clustering metrics may depend on whether you have ground truth labels or not. Here I assume you do not, but if you do, then classification metrics can also be used. Here is a list of terms to be aware of:

### Silhouette coefficient
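- Measures the cohesion of an item with the other items in its cluster and its separation from items in other clusters; ranges from -1 to 1, with higher values indicating that the item fits its own cluster better than neighboring ones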
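A small sketch may help make cohesion and separation concrete. For each item, let a be its mean distance to the other members of its own cluster and b its lowest mean distance to the members of any other cluster; the coefficient is (b - a) / max(a, b). The code below assumes Euclidean distance, at least two clusters, and points that are not all identical. Scoring a single-member cluster as 0 is a common convention adopted here, and the data is invented.

```python
from math import dist
from statistics import mean


def silhouette_samples(points, labels):
    """Per-item silhouette coefficient using Euclidean distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [
            dist(p, q)
            for j, (q, lb) in enumerate(zip(points, labels))
            if lb == lab and j != i
        ]
        if not same:
            scores.append(0.0)  # convention for a single-member cluster
            continue
        a = mean(same)  # cohesion: mean distance within own cluster
        b = min(        # separation: closest other cluster on average
            mean(dist(p, q) for q, lb in zip(points, labels) if lb == other)
            for other in set(labels)
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Two hypothetical, well-separated groups of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_samples(points, [0, 0, 1, 1])  # clusters match the groups
bad = silhouette_samples(points, [0, 1, 0, 1])   # clusters cut across the groups
```

With the matching assignment, every item scores close to 1; with the mismatched assignment, every item scores below 0 because it sits nearer to the other cluster than to its own.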

### Calinski-Harabasz Index
- A score meant to determine the quality of clusters; when the score is higher, it means clusters are dense and well separated
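The index is commonly defined as the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)], where n is the number of points and k the number of clusters. The sketch below follows that definition with squared Euclidean distances. The data is invented, and the code assumes at least two clusters, fewer clusters than points, and nonzero within-cluster spread.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom."""
    n = len(points)
    cluster_ids = sorted(set(labels))
    k = len(cluster_ids)
    overall = centroid(points)
    between = within = 0.0
    for cid in cluster_ids:
        members = [p for p, lb in zip(points, labels) if lb == cid]
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Two hypothetical, well-separated groups of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good_score = calinski_harabasz(points, [0, 0, 1, 1])  # clusters match the groups
bad_score = calinski_harabasz(points, [0, 1, 0, 1])   # clusters cut across the groups
```

The matching assignment yields tight clusters with distant centroids and therefore a high score, while the mismatched assignment yields spread-out clusters with nearly coincident centroids and a score close to zero.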

