# Mud card and Piazza questions

## Feature engineering

- **Will feature engineering always improve the performance of the model because we take into account more features and relations between features?**
    - No, there is no guarantee that the new features will improve your model.
    - With certain techniques (e.g., random forest), extra features are unlikely to make your model performance noticeably worse.
    - The performance of certain other techniques (e.g., nearest neighbors) could in fact decline if too many non-predictive features are added - more on this later.
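
A minimal sketch of the point above, on synthetic data rather than a course dataset. The dataset, the number of noise features, and the model settings are all assumptions chosen for illustration; the exact scores will differ on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic data with 5 informative features (assumption, not a course dataset)
X, y = make_classification(n_samples=500, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)

# append 200 pure-noise features that carry no information about y
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

models = {'kNN': KNeighborsClassifier(),
          'random forest': RandomForestClassifier(n_estimators=100, random_state=0)}

# 5-fold cross-validated accuracy for each model on each version of the data
scores = {}
for name, model in models.items():
    for label, data in [('original', X), ('with noise', X_noisy)]:
        scores[(name, label)] = cross_val_score(model, data, y, cv=5).mean()
        print(f'{name:>13} | {label:>10} | accuracy = {scores[(name, label)]:.3f}')
```

Comparing the two rows per model shows how sensitive each technique is to non-predictive features in this particular setup.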

- **I was a little confused about automatic feature engineering, both conceptually and in code**
- **Feature engineering was most confusing for me. Maybe if we went over another example?**

- **I am having trouble working through a real-world example for feature engineering. Could you walk through a dataset and explain how, in context, feature engineering improved the predictive power for the problem we are trying to solve?**
    - I recommend you try it with the adult dataset
    - native country feature: instead of one-hot encoding, look up what the GDP of each country was in 1990 and replace the country name with the GDP value
    - that's a continuous feature and it will likely improve the model performance because citizens of a rich country will only migrate to the USA if they earn more than they would earn in their home country.

- **In regards to Feature engineering for automatic vs manual. Which one do you think is more useful for this class and in general?**
    - manual feature engineering
    - it has the best chances of improving your model

- **Should we try manual feature engineering first? If we know enough about the dataset to select preferred features immediately and throw out the rest, isn't that an efficient way to make this decision?**
    - often you don't know in advance which features are the preferred ones
    - sometimes you are surprised: features that you wouldn't expect to be important turn out to be predictive
    - we will talk about feature importance metrics to measure how predictive each feature is later

- **Quiz 5 is a hard question for me to understand it conceptually. Hope to discuss it in class.**
- **What does the degree in the polynomial transform signify? Why wasn't abc included as a column in the last quiz?**

- **For PolynomialFeatures, the degree refers to the maximum degree of all combinations of different features, right? So if we set degree = 3 in the last quiz, we should have abc, ab^2, b^3, etc.**
    - yes, that's right but run the code, print the feature names to verify

- **So the point of automatic feature engineering is to try common feature transformations on the dataset? And PolynomialFeatures is only one of the automatic methods? What are other options? Are we going to dig in more on this?**
    - we won't dig more into this but feel free to check out [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)

- **Can you please expand a bit more on manual feature engineering and in what situations you could or should attempt to do this? I understand that it requires more domain-specific knowledge and a greater amount of work, but I guess I am unclear on whether or not it would be appropriate to attempt manual feature engineering without extensive knowledge. You gave us some examples on what can help us engineer our features better but I am not sure if this is just meant to show us what we could do if we were in a good position to do manual feature engineering, or if it is supposed to help us if we wanted to do it but aren't as knowledgeable about the relevant domain. Would it be too risky to attempt manual feature engineering if you are not an expert on the subject?**
    - feature engineering is not risky
    - you could engineer a feature, check if/how much it influences the evaluation metric of the model
    - if it helps, keep it
    - if it doesn't help, drop it
    - you could/should always attempt it
    - as you work more with the data and regularly meet with subject matter experts, you will slowly become a subject matter expert too.
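
The keep-it-or-drop-it loop described above could look roughly like this. It is a sketch on synthetic data: the ratio feature, the linear model, and the R2 metric are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, size=300)
x2 = rng.uniform(1, 10, size=300)
# synthetic target that depends on the ratio of the two features (assumption)
y = x1 / x2 + rng.normal(scale=0.1, size=300)

X_base = np.column_stack([x1, x2])          # original features only
X_eng = np.column_stack([x1, x2, x1 / x2])  # plus the candidate engineered feature

# same model, same CV scheme, same metric for both feature sets
score_base = cross_val_score(LinearRegression(), X_base, y, cv=5, scoring='r2').mean()
score_eng = cross_val_score(LinearRegression(), X_eng, y, cv=5, scoring='r2').mean()
print(f'R2 without the feature: {score_base:.3f}')
print(f'R2 with the feature:    {score_eng:.3f}')

# keep the feature only if it improves the evaluation metric
keep_feature = score_eng > score_base
print('keep it' if keep_feature else 'drop it')
```

The important part is that only the feature set changes between the two runs, so any difference in the metric can be attributed to the engineered feature.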

## Feature selection

- **Unless we are specifically interested in determining which variables correlate only linearly with the target variable, what reason do we have for using the F-test? Mutual information seems as though it should generally be a much better estimate.**
    - sometimes you know in advance that you only want to develop a linear or logistic regression model in which case the F-test is better to use.
    - if you know that you'll train more complex models, go for mutual info.

- **We looked at correlations between features (comparison between linear and sine correlations, etc.). It feels like it should be possible to calculate these correlations through EDA when given a dataset, but what if the correlation is not straightforward? (Like a trigonometric correlation, or a more complex polynomial).**
    - mutual information captures non-linear correlations, see the example with the sin-wave
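
A sketch in the spirit of the sine-wave example, not the lecture code itself. The three synthetic features and the target are assumptions: one linear term, one sine term, and one feature unrelated to the target.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x_lin = rng.uniform(0, 1, n)    # enters the target linearly
x_sin = rng.uniform(0, 1, n)    # enters the target through a sine wave
x_noise = rng.uniform(0, 1, n)  # unrelated to the target
X = np.column_stack([x_lin, x_sin, x_noise])
y = x_lin + np.sin(6 * np.pi * x_sin) + 0.1 * rng.normal(size=n)

# F-test: measures linear dependence between each feature and y
f_scores, p_values = f_regression(X, y)
# mutual information: can also pick up non-linear dependence
mi = mutual_info_regression(X, y, random_state=0)

for name, f, m in zip(['x_lin', 'x_sin', 'x_noise'], f_scores, mi):
    print(f'{name:>8}: F = {f:8.2f}, MI = {m:.3f}')
```

In this setup the F-test favors the linear feature, while mutual information gives the sine feature the highest score.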

- **Bit confused about Quiz 4. I managed to calculate the different f-score, p-values and mi. How can we use these to assess which columns of X are the most important? (I think this is a coding issue because my graphs showed x1, x2, x3)**
- **Could we go over the coding procedure for Quiz 4? How do we extract the name of the feature with the highest F score?**
- **In terms of code, how do you get the column names from the K Best feature selection (particularly for a model with many features)?**
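
One possible way to recover the feature names, sketched on synthetic data rather than the Quiz 4 data. The column names `x1` to `x4` and the target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['x1', 'x2', 'x3', 'x4'])
# synthetic target: only x2 and x4 matter (assumption)
y = 3 * X['x2'] + 1 * X['x4'] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# get_support() returns a boolean mask over the columns of X
selected = X.columns[selector.get_support()]
print('selected features:', list(selected))

# name of the single feature with the highest F score
best = X.columns[np.argmax(selector.scores_)]
print('highest F score:', best)

# all features sorted by score, most important first
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(ranking)
```

Keeping `X` as a DataFrame is what makes the name lookup easy; with a plain NumPy array you would index your own list of column names with the same mask.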

- **In the video you mentioned that we should choose whether we are interested in linear relationships vs non-linear relationships before we start the process of feature selection. Is it ever wise to check both f-test and mutual information before we decide on a model or features? Can we choose the best features from both f-test and mutual info and customize the model based off these decisions?**
    - yes, absolutely!
    - you could for example calculate the average feature rank based on the f-test and MI and select features based on that
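
A rough sketch of the average-rank idea. The synthetic data, the column names, and the choice to keep the top two features are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, size=(1000, 3)), columns=['x1', 'x2', 'x3'])
# x1 is linear, x2 is non-linear, x3 is unrelated to the target (assumption)
y = X['x1'] + np.sin(6 * np.pi * X['x2']) + 0.1 * rng.normal(size=1000)

f_scores, _ = f_regression(X, y)
mi_scores = mutual_info_regression(X, y, random_state=0)

# rank 1 = most important according to each criterion
ranks = pd.DataFrame({
    'f_rank': pd.Series(f_scores, index=X.columns).rank(ascending=False),
    'mi_rank': pd.Series(mi_scores, index=X.columns).rank(ascending=False),
})
ranks['avg_rank'] = ranks[['f_rank', 'mi_rank']].mean(axis=1)
print(ranks.sort_values('avg_rank'))

# keep the two features with the best (lowest) average rank
top = ranks.sort_values('avg_rank').index[:2].tolist()
print(top)
```

Averaging ranks instead of raw scores avoids mixing the different scales of the F statistic and mutual information.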

- **What about the case of multicollinearity? When would we discard K Best features that are highly correlated, in order to get a simpler model?**
    - you would do that during preprocessing as well, before you train the model
    - you should consider carefully which of the highly correlated features you keep, though. It's not an easy choice and as far as I know, there is no good data-driven way to make this decision

- **I am really confused about the feature selection process and what its overall purpose is. Perhaps some more concrete examples would help.**
    - sometimes you have a lot of features but a comparatively low number of data points (e.g., in genetics)
    - then you need to select features before you start training models
    - most often however you select features based on feature importance metrics derived using ML models

## Missing data

- **Does a regression analysis only work if one continuous feature has missing values? What if there are multiple?**
    - I assume you mean multivariate imputation
    - it works if multiple features have missing values, the technique is a bit more complicated then.
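
A small sketch of multivariate imputation with missing values in more than one feature, using scikit-learn's `IterativeImputer`. The synthetic correlated features and the 20% missing rate are assumptions for illustration.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
# three correlated features (assumption): the last two depend on the first
X = np.column_stack([a,
                     2 * a + rng.normal(scale=0.1, size=n),
                     -a + rng.normal(scale=0.1, size=n)])

# remove roughly 20% of the values in two different columns
X_missing = X.copy()
X_missing[rng.random(n) < 0.2, 1] = np.nan
X_missing[rng.random(n) < 0.2, 2] = np.nan

# each feature with missing values is modeled from the other features in turn
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X_missing)

mask = np.isnan(X_missing)
print('mean absolute error of imputed values:',
      np.abs(X_imputed - X)[mask].mean())
```

The error can only be computed here because the true values are known; on real data you would evaluate the imputation indirectly through the downstream model.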

- **Is there a comprehensive test to tell whether data is MCAR, MAR, or MNAR, or does that have to be done with a more interpretive hands-on look at the data?**
    - as far as I know, there is no comprehensive test to distinguish MCAR, MAR, MNAR
    - Little's test checks whether your data are consistent with MCAR: if it rejects, the data are not MCAR, but it can't tell you whether they are MAR or MNAR
    - a hands-on look might help and give you hints but it is not comprehensive

- **Can imputation for single variable be done using strategies other than "mean"? Are they more effective to avoid the std. dev. changing?**
- **Could you please explain in more detail when we should use SimpleImputer and which strategy to apply, like mean or median? In addition, when should we use IterativeImputer? What are the pros and cons of both?**
    - if you really want to impute, do multivariate imputation, that's usually the best option.
    - mean and median imputation will always distort your dataset because they reduce the standard deviation and, more generally, change the shape of the feature's distribution.
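
The effect on the standard deviation is easy to check on a toy feature. This is a sketch with a synthetic normally distributed feature and 30% of the values removed completely at random, both assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=(1000, 1))  # synthetic feature (assumption)

# remove roughly 30% of the values completely at random
x_missing = x.copy()
x_missing[rng.random(1000) < 0.3, 0] = np.nan

# standard deviation of the observed (non-missing) values
std_observed = np.nanstd(x_missing)

std_after = {}
for strategy in ['mean', 'median']:
    x_imp = SimpleImputer(strategy=strategy).fit_transform(x_missing)
    std_after[strategy] = x_imp.std()
    print(f'{strategy:>6}: std observed = {std_observed:.2f}, '
          f'std after imputation = {std_after[strategy]:.2f}')
```

Every imputed point sits at the center of the distribution, which is why the spread shrinks and a spike appears at the imputed value.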

- **What are some approaches that we may handle MNAR although it's tough? Or any tips?**
- **I'm wondering if there is anything at all that can help account for MNAR missingness, or do we just have to accept that the model will be biased? Are these biased models even still useful?**
- **If it doesn't make sense to impute some value (like garage area if there is no garage), then how can we feel safe imputing it? If we can't, then how are we supposed to build a model on the dataset?**
    - "All models are wrong, but some are useful."
    - We will cover advanced techniques in November that can deal with MNAR.

- **Regarding the names of the missing data types, I think what matters is what the missingness is correlated with, so I don't see how 'random' it is. Isn't it really just no correlation, correlation with other features, and correlation with itself?**
    - there is a gray zone in between, it's not just no correlation or perfect correlation
    - partial correlation: a mix of MCAR and MAR, or a mix of MAR and MNAR, etc.

## Other

- **How do we know which category is which when representing categories with numbers? For example, 0 and 1, which is Yes and which is no? Which is female, which is male? Which means NA, which means not?**
    - the feature names encode that information
    - the feature name will be 'sex_female' and 'sex_male' or 'sex_NA' for example after one-hot encoding

- **In the following calls, I have no idea what the purpose of the strings are (eg. 'imputer2', 'ordinal', 'num', 'cat', 'ord'). Are these strings calling a specific function, or are they just names you are giving them?**
```python
from sklearn.compose import ColumnTransformer

# each tuple is (name, transformer, columns)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_ftrs),
        ('cat', categorical_transformer, cat_ftrs),
        ('ord', ordinal_transformer, ordinal_ftrs)])
```
    - just names I gave them so we can refer to them by name later if necessary.

- **Can a problem ever be classification and regression?**
    - Not really. Sometimes a problem can be solved with either classification or regression, but you need to decide which type of model you apply to the problem, or try both.
    - You could have classification problems with more than two classes
    - You could have multiple target variables, some of them regression, some of them classification
    - There is also ranking, which is a different and less commonly used type of ML problem