# QUIZ: FEATURE ENGINEERING
---

## 1. Why is feature scaling important when using gradient descent-based algorithms? 
1. It improves the convergence speed of the algorithm 
2. It eliminates the need for data cleaning 
3. It reduces the dimensionality of the data 
4. It ensures that the features are independent of each other

The correct answer is:

**1. It improves the convergence speed of the algorithm**

---

### ✅ Explanation:

Feature scaling (like normalization or standardization) is **important in gradient descent-based algorithms** because:

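* For linear models, each weight's gradient is proportional to the values of its feature, so features on larger scales produce much larger updates.
* If features have very different scales (e.g., one ranges from 1 to 1000, another from 0 to 1), the **cost function's contours become elongated**, so a single learning rate is too large for some weights and too small for others, leading to **slow or unstable convergence**.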
* Scaling makes the **optimization landscape smoother**, allowing the algorithm to take more **consistent and efficient steps** toward the minimum.
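
The effect can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the synthetic data, the hand-picked learning rates, and the helper name `gd_iterations`. It counts how many batch gradient descent steps a least-squares fit needs with raw versus standardized features.

```python
import numpy as np

def gd_iterations(X, y, lr, tol=1e-6, max_iter=20_000):
    """Batch gradient descent on mean squared error.

    Returns the number of iterations needed for the gradient norm to fall
    below tol, or max_iter if that never happens.
    """
    w = np.zeros(X.shape[1])
    n = len(y)
    for i in range(1, max_iter + 1):
        grad = X.T @ (X @ w - y) / n
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 1000, 200)   # feature on a large scale
x2 = rng.uniform(0, 1, 200)      # feature on a small scale
X_raw = np.column_stack([x1, x2])
y = 0.01 * x1 + 3.0 * x2
y = y - y.mean()                 # center the target (no intercept is fitted)

# Standardize each column to mean 0 and standard deviation 1.
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The raw features only stay stable with a very small learning rate.
raw_iters = gd_iterations(X_raw, y, lr=1e-6)
scaled_iters = gd_iterations(X_scaled, y, lr=0.1)
print("raw:", raw_iters, "scaled:", scaled_iters)
```

In this setup the raw run is expected to exhaust its iteration budget, while the standardized run stops far earlier.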

---

### ❌ Why the other options are incorrect:

* **2. It eliminates the need for data cleaning** – Data cleaning involves handling missing values, duplicates, outliers, etc., and is unrelated to scaling.
* **3. It reduces the dimensionality of the data** – Dimensionality reduction (e.g., PCA) is a separate process from scaling.
* **4. It ensures that the features are independent of each other** – Scaling transforms each feature separately and does **not change the dependence (for example, the correlation)** between features.

---

**So, scale your features when training models with gradient descent (for example linear regression, logistic regression, or neural networks)!**


## 2. When should you consider using robust scaling? 
1. When your data is normally distributed 
2. When your data contains significant outliers 
3. When you need to normalize data to a 0-1 range 
4. When your data is categorical 

The correct answer is:

**2. When your data contains significant outliers**

---

### ✅ Explanation:

**Robust scaling** (like using **RobustScaler** in scikit-learn) **uses the median and interquartile range (IQR)** for scaling:

* It is **less sensitive to outliers** than standard scaling (which uses mean and standard deviation).
* It transforms the data by **subtracting the median and dividing by the IQR**.
* This makes it ideal when your dataset has **extreme values or outliers** that would distort other scaling methods.
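
As a rough illustration, the median/IQR formula can be written directly in NumPy and compared with standardization on a toy array containing one extreme value. The data are made up, and this is a sketch of the formula rather than a call to the scikit-learn class.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])   # last value is an outlier

# Robust scaling: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# Standardization for comparison: subtract the mean, divide by the std.
standard = (x - x.mean()) / x.std()

# Spread of the five ordinary points under each method.
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print(robust_spread, standard_spread)
```

Under standardization the outlier inflates the standard deviation, so the ordinary points end up squeezed much closer together than they do under robust scaling.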

---

### ❌ Why the other options are incorrect:

* **1. When your data is normally distributed** – Use **StandardScaler** (mean = 0, std = 1), not robust scaling.
* **3. When you need to normalize data to a 0-1 range** – Use **MinMaxScaler**, not robust scaling.
* **4. When your data is categorical** – Categorical data should be **encoded** (e.g., one-hot or label encoding), not scaled.

---

**RobustScaler = Best choice when outliers are present.**


## 3. Which of the following encoding techniques is most suitable for high-cardinality categorical variables? 
1. Label Encoding 
2. One-Hot Encoding 
3. Binary Encoding 
4. Frequency Encoding 

The correct answer is:

**3. Binary Encoding**

---

### ✅ Explanation:

**Binary Encoding** is most suitable for **high-cardinality categorical variables** (i.e., variables with a large number of unique categories) because:

* It **reduces the number of dimensions** compared to One-Hot Encoding.
* It works by first converting categories to numeric labels (like Label Encoding), then encoding those numbers into **binary format**, and finally splitting binary digits into separate columns.
* This results in **fewer columns** than One-Hot Encoding (about log2 of the number of categories) and avoids placing all categories on a single **ordinal** scale as Label Encoding does.
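
The three steps above can be sketched in plain Python. The function name `binary_encode` and the choice to number categories from 0 in sorted order are assumptions made for this example; library implementations may number categories differently.

```python
def binary_encode(values):
    """Return one list of 0/1 digits per value, plus the number of columns used."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}   # label-encoding step
    width = max(1, (len(categories) - 1).bit_length())     # binary digits needed
    rows = [[int(bit) for bit in format(index[v], f"0{width}b")] for v in values]
    return rows, width

colors = ["red", "green", "blue", "green", "yellow", "red"]
rows, width = binary_encode(colors)
print(width, rows)

# With 1000 distinct categories, 10 columns are enough (2**10 = 1024).
_, wide = binary_encode([f"cat_{i}" for i in range(1000)])
print(wide)
```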

---

### ❌ Why the other options are less suitable:

* **1. Label Encoding** – Imposes an **ordinal relationship** which can mislead algorithms if the categories are **nominal**.
* **2. One-Hot Encoding** – Creates **a new column for each unique category**, leading to **sparse and high-dimensional data**, which is inefficient for high-cardinality features.
* **4. Frequency Encoding** – Replaces categories with their frequency, but may **mislead models** if categories with similar frequencies don’t carry similar meaning.

---

**Use Binary Encoding when working with high-cardinality categorical features to balance efficiency and accuracy.**


## 4. What is the main risk of using one-hot encoding on a categorical variable with a very high number of categories? 
1. It can lead to data sparsity 
2. It simplifies the data too much 
3. It increases the likelihood of multicollinearity 
4. It reduces the interpretability of the model

The correct answer is:

**1. It can lead to data sparsity**

---

### ✅ Explanation:

**One-Hot Encoding** creates **one binary column for each category** in the feature. When the categorical variable has a **very high number of unique categories (high cardinality)**:

* It results in a **large number of columns**, most of which contain **zeros**.
* This creates a **sparse matrix**; if it is stored in a dense format, a lot of memory is used to store mostly zero values.
* Sparse data can **slow down model training**, increase **computational cost**, and even lead to **overfitting** in some models.
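
To make the sparsity concrete, here is a small NumPy sketch with made-up sizes (2,000 rows and 500 categories are arbitrary choices). It one-hot encodes random category ids and measures the share of zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_categories = 2_000, 500
codes = rng.integers(0, n_categories, size=n_rows)   # integer category ids

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(n_categories, dtype=np.uint8)[codes]

zero_fraction = (one_hot == 0).mean()
print(one_hot.shape, zero_fraction)
```

Each row holds exactly one 1, so with k categories the fraction of zeros is 1 - 1/k, which approaches 1 as cardinality grows.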

---

### ❌ Why the other options are incorrect:

* **2. It simplifies the data too much** – One-hot encoding actually **adds complexity**, not simplification.
* **3. It increases the likelihood of multicollinearity** – One-hot encoding avoids multicollinearity if you **drop one dummy variable** per feature (i.e., avoid the dummy variable trap).
* **4. It reduces the interpretability of the model** – Interpretability may decrease slightly with more features, but **this is not the main risk** compared to sparsity.

---

**Bottom line:**
Use one-hot encoding **cautiously** with high-cardinality variables — consider alternatives like **binary or target encoding** in such cases.


## 5. How does feature hashing differ from one-hot encoding? 
1. Feature hashing creates fewer columns by using a hash function 
2. Feature hashing is only applicable to numerical data 
3. Feature hashing increases the dimensionality of the dataset 
4. Feature hashing provides better interpretability than one-hot encoding

The correct answer is:

**1. Feature hashing creates fewer columns by using a hash function**

---

### ✅ Explanation:

**Feature Hashing** (also known as the **hashing trick**) is an encoding technique that:

* Uses a **hash function to map categories to a fixed number of columns** (i.e., features).
* This makes it **memory-efficient** and ideal for **high-cardinality categorical variables**, where one-hot encoding would create too many columns.
* Unlike one-hot encoding, **the number of output features is fixed** and predefined, regardless of how many unique categories are in the data.
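
A simplified sketch of the idea, using only the standard library: the helper `hash_encode`, the MD5 hash, and the width of 8 columns are all assumptions chosen for illustration, and production implementations typically use different hash functions and extra refinements.

```python
import hashlib

def hash_encode(category, n_features=8):
    """Map a category string to a vector of fixed length n_features."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_features   # column chosen by the hash
    vector = [0] * n_features
    vector[index] = 1
    return vector

cities = [f"city_{i}" for i in range(1000)]
encoded = [hash_encode(c) for c in cities]

# 1000 distinct categories share only 8 columns, so collisions are unavoidable.
distinct_vectors = len({tuple(v) for v in encoded})
print(len(encoded[0]), distinct_vectors)
```

The output width stays at 8 no matter how many categories appear, and no category dictionary has to be stored, which is what makes the approach usable on streaming data.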

---

### ❌ Why the other options are incorrect:

* **2. Feature hashing is only applicable to numerical data** – Incorrect. Feature hashing is used **mainly for categorical or text features**, whose values are hashed to column indices.
* **3. Feature hashing increases the dimensionality of the dataset** – Wrong. It actually **reduces** dimensionality compared to one-hot encoding.
* **4. Feature hashing provides better interpretability than one-hot encoding** – False. **One-hot encoding is more interpretable**, since each column clearly represents a category. Feature hashing can **cause collisions**, making interpretation difficult.

---

**In summary:**
Feature hashing is a **dimensionality-reducing** encoding method that trades off interpretability for **efficiency** — great for large-scale or streaming categorical data.


## 6. Which of the following is NOT a common method for detecting outliers? 
1. Box Plot 
2. Z-score 
3. Isolation Forest 
4. Min-Max Scaling

The correct answer is:

**4. Min-Max Scaling**

---

### ✅ Explanation:

**Min-Max Scaling** is a **feature scaling technique**, not a method for detecting outliers. It transforms features to a fixed range (usually 0 to 1) but:

* It does **not detect or identify** outliers.
* In fact, it can be **heavily affected by outliers**, as extreme values can **compress** the rest of the data into a narrow range.

---

### ✅ The other options are **commonly used for outlier detection**:

* **1. Box Plot** – Visual tool that uses the **IQR rule** to flag outliers as points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
* **2. Z-score** – Detects outliers by measuring how many **standard deviations** a data point is from the mean (typically, |z| > 3 is considered an outlier).
* **3. Isolation Forest** – An **unsupervised machine learning algorithm** that isolates anomalies using random decision trees.
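
The IQR and Z-score rules are simple enough to sketch in NumPy. The toy data below are made up, with one planted extreme value.

```python
import numpy as np

x = np.array([10, 11, 12, 13] * 5 + [95], dtype=float)

# IQR rule, the same fences a box plot uses for its whiskers.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# Z-score rule with the usual |z| > 3 cutoff.
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]

print(iqr_outliers, z_outliers)
```

Both rules flag only the planted value here. On very small samples the Z-score rule can miss an outlier, because the outlier itself inflates the standard deviation.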

---

**Bottom line:**
**Min-Max Scaling is used for normalization, not for outlier detection.**


## 7. What is the purpose of using Winsorization to handle outliers? 
1. To remove outliers from the dataset entirely 
2. To replace outliers with a constant value 
3. To reduce the impact of outliers by limiting extreme values 
4. To impute missing values with the mean or median

The correct answer is:

**3. To reduce the impact of outliers by limiting extreme values**

---

### ✅ Explanation:

**Winsorization** is a technique used to **handle outliers** by:

* **Limiting extreme values** in the data to a specified percentile range.
* Instead of removing outliers, it **replaces them with the nearest value within the allowed range** (e.g., replacing values above the 95th percentile with the 95th percentile value).

This helps in:

* **Reducing the influence** of extreme values without deleting data.
* **Improving model robustness** and reducing skewness.
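
A minimal sketch of percentile capping with NumPy, assuming the 5th and 95th percentiles as limits (the cutoffs and the toy data are illustrative choices, not fixed by the method):

```python
import numpy as np

x = np.arange(1, 101, dtype=float)   # 1, 2, ..., 100
x[-1] = 10_000.0                     # plant one extreme value

# Cap everything outside the 5th-95th percentile range at the limits.
low, high = np.percentile(x, [5, 95])
winsorized = np.clip(x, low, high)

print(x.max(), winsorized.max(), len(winsorized))
```

No rows are dropped: the array keeps its length, and only the values beyond the limits are moved to the limits.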

---

### ❌ Why the other options are incorrect:

* **1. To remove outliers from the dataset entirely** – That’s **outlier removal**, not Winsorization.
* **2. To replace outliers with a constant value** – Winsorization replaces them with the **percentile thresholds computed from the data** (which can differ at each tail), not with one arbitrary constant.
* **4. To impute missing values with the mean or median** – That’s **imputation**, unrelated to outlier handling.

---

**In summary:**
**Winsorization = Capping extreme values to reduce outlier impact without losing data.**


## 8. In feature scaling, which technique transforms the data to have a mean of 0 and a standard deviation of 1? 
1. Min-Max Scaling 
2. Robust Scaling 
3. Standardization 
4. Log Transformation

The correct answer is:

**3. Standardization**

---

### ✅ Explanation:

**Standardization** (also called **Z-score normalization**) transforms the data so that:

* The **mean becomes 0**
* The **standard deviation becomes 1**

The formula used is:

$$
z = \frac{x - \mu}{\sigma}
$$

Where:

* $x$ = original value
* $\mu$ = mean of the feature
* $\sigma$ = standard deviation of the feature
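
Applied to a small made-up array, the formula looks like this in NumPy. Note that `np.std` defaults to the population standard deviation (`ddof=0`).

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()      # 5.0
sigma = x.std()    # 2.0 (population standard deviation)
z = (x - mu) / sigma

print(z)
print(z.mean(), z.std())
```

The first value, 2.0, maps to (2 - 5) / 2 = -1.5, and the transformed array has mean 0 and standard deviation 1.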

---

### ❌ Why the other options are incorrect:

* **1. Min-Max Scaling** – Transforms data to a fixed range (typically [0, 1]), but does **not fix the mean at 0 or the standard deviation at 1**.
* **2. Robust Scaling** – Uses **median and IQR** for scaling; useful for data with outliers, but doesn’t force mean = 0 or std = 1.
* **4. Log Transformation** – Used to **reduce skewness** in distributions, not to standardize data.

---

**In summary:**
**Standardization = mean 0, standard deviation 1** — commonly used in algorithms like linear regression, SVMs, and neural networks.


## 9. Which feature extraction technique is most suitable for visualizing high-dimensional data in 2 or 3 dimensions? 
1. Principal Component Analysis (PCA) 
2. Linear Discriminant Analysis (LDA) 
3. t-Distributed Stochastic Neighbor Embedding (t-SNE) 
4. Autoencoders

The correct answer is:

**3. t-Distributed Stochastic Neighbor Embedding (t-SNE)**

---

### ✅ Explanation:

**t-SNE (t-Distributed Stochastic Neighbor Embedding)** is a **non-linear** dimensionality reduction technique that is:

* Specifically designed for **visualizing high-dimensional data** in **2 or 3 dimensions**.
* It preserves **local structure** (i.e., similarity between nearby points), making it well suited to **revealing clusters visually**; it is not itself a clustering algorithm.
* Commonly used in **image recognition, NLP, and exploratory data analysis**.

---

### ❌ Why the other options are less suitable for visualization:

* **1. PCA** – A **linear** technique that reduces dimensions while preserving **variance**, but it doesn't capture complex, non-linear relationships as well as t-SNE for visualization.
* **2. LDA** – A supervised technique focused on **maximizing class separability**, not general visualization.
* **4. Autoencoders** – Can reduce dimensions and be used for visualization, but require **neural network training** and are more complex than t-SNE for direct visualization tasks.

---

**In summary:**
**Use t-SNE when your goal is to visualize patterns or clusters in high-dimensional data in 2D or 3D.**


## 10. In time series data, which imputation method is commonly used to fill in missing values? 
1. Mean Imputation 
2. Hot-Deck Imputation 
3. Last Observation Carried Forward (LOCF) 
4. K-Nearest Neighbors (KNN) Imputation

The correct answer is:

**3. Last Observation Carried Forward (LOCF)**

---

### ✅ Explanation:

In **time series data**, the **Last Observation Carried Forward (LOCF)** method is commonly used to fill in missing values by:

* Replacing a missing value with the **last available (previous) observation**.
* Preserving the **temporal order**, since each filled value depends only on earlier observations (it implicitly assumes the variable stays constant until the next reading).

This method works well when:

* The variable is **stable or changes slowly over time**.
* The missing values are **not too frequent**.
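
LOCF is short enough to sketch in plain Python. The helper name `locf` and the use of `None` to mark missing readings are assumptions for this example.

```python
def locf(values):
    """Fill each missing entry (None) with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

readings = [20.1, None, None, 20.7, None, 21.0]
print(locf(readings))   # [20.1, 20.1, 20.1, 20.7, 20.7, 21.0]
```

A gap at the very start of the series has no earlier observation to carry forward, so in this sketch it stays `None`.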

---

### ❌ Why the other options are less suitable for time series:

* **1. Mean Imputation** – Ignores temporal structure by using a **global average**, which can **distort trends** in time series.
* **2. Hot-Deck Imputation** – Fills missing values using **similar records**, not suitable for sequential dependencies.
* **4. K-Nearest Neighbors (KNN) Imputation** – Can work, but is **computationally intensive** and not always effective for **sequential data** like time series.

---

**In summary:**
For time series, **LOCF** is a simple and effective method to impute missing values while **maintaining the time-based nature** of the data.


## 11. What is the main advantage of using K-Nearest Neighbors (KNN) imputation for missing data? 
1. It always provides the most accurate results
2. It can handle non-linear relationships between variables 
3. It is computationally inexpensive 
4. It is unaffected by outliers

The correct answer is:

**2. It can handle non-linear relationships between variables**

---

### ✅ Explanation:

**K-Nearest Neighbors (KNN) imputation** works by:

* Finding the **k most similar records** (neighbors) to the one with missing data (based on distance metrics like Euclidean distance).
* Then **imputing** the missing value using the values from these neighbors (e.g., mean or mode of neighbors’ values).

This method is powerful because:

* It can **capture complex, non-linear relationships** between features that simpler methods like mean or median imputation cannot.
* It adapts based on the **local structure** of the data.
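
The following is a simplified NumPy sketch of the procedure, not a library implementation. The function `knn_impute`, the restriction to fully observed rows as donors, and the toy data (where the second column is the square of the first) are all assumptions for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each row with the column mean over the k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]   # candidate neighbors
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Euclidean distance using only the features observed in this row.
        diffs = complete[:, ~missing] - row[~missing]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = complete[np.argsort(dists)[:k]]
        filled[i, missing] = nearest[:, missing].mean(axis=0)
    return filled

X = np.array([
    [1.0, 1.0],
    [2.0, 4.0],
    [3.0, 9.0],
    [4.0, 16.0],
    [5.0, 25.0],
    [2.5, np.nan],    # true value would be 2.5 ** 2 = 6.25
])
filled = knn_impute(X, k=2)
column_mean = np.nanmean(X[:, 1])
print(filled[5, 1], column_mean)
```

Here the two nearest rows give 6.5, close to the true 6.25, while the column mean of 11.0 ignores the curved relationship entirely.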

---

### ❌ Why the other options are incorrect:

* **1. It always provides the most accurate results** – No method is perfect; KNN works well in some contexts, but it is **not always the most accurate**, especially with large or noisy datasets.
* **3. It is computationally inexpensive** – Actually, **KNN is computationally expensive**, especially on large datasets, since it computes distances between each incomplete row and the candidate neighbor rows.
* **4. It is unaffected by outliers** – KNN **can be sensitive to outliers**, as they may influence the selection of neighbors and skew the imputation.

---

**In summary:**
**KNN imputation’s main advantage** is its ability to model **non-linear dependencies**, making it more flexible than traditional imputation techniques.


## 12. What is the risk of using the mean imputation method for missing data? 
1. It introduces bias by inflating variance 
2. It preserves the distribution of the data 
3. It can lead to underfitting 
4. It increases the sample size

The correct answer is:

**3. It can lead to underfitting**

---

### ✅ Explanation:

**Mean imputation** fills in missing values with the **mean of the observed values** for that feature. While it's simple and fast, it comes with significant drawbacks:

* It **reduces variability** in the data.
* It can **mask relationships** between variables.
* By making data artificially uniform, it may cause models to **underfit** (i.e., fail to capture the underlying patterns in the data).
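
The loss of variability is easy to check numerically. In this sketch the synthetic data and the 30% missing rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 1000)
missing = rng.random(1000) < 0.3      # roughly 30% of values go missing

observed = x[~missing]
imputed = x.copy()
imputed[missing] = observed.mean()    # mean imputation

print(observed.var(), imputed.var())
```

Every imputed value sits exactly at the mean and adds nothing to the sum of squared deviations, so the variance shrinks by the fraction of observed values.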

---

### ❌ Why the other options are incorrect:

* **1. It introduces bias by inflating variance** – Actually, mean imputation tends to **reduce variance**, not inflate it.
* **2. It preserves the distribution of the data** – False. It **distorts the distribution**, especially in skewed or multimodal data.
* **4. It increases the sample size** – Imputation doesn’t increase sample size; it **fills in missing values**, keeping the sample size the same.

---

**In summary:**
Mean imputation can **oversimplify the data**, which may lead to **underfitting**, especially in models that rely on capturing subtle patterns.


## 13. What is the main purpose of using interaction terms in feature engineering? 
1. To remove irrelevant features 
2. To capture non-linear relationships between features 
3. To scale features to a common range 
4. To reduce the dimensionality of the dataset

The correct answer is:

**2. To capture non-linear relationships between features**

---

### ✅ Explanation:

**Interaction terms** in feature engineering are **combinations of two or more features** (e.g., $x_1 \times x_2$) used to:

* **Capture complex relationships** between variables that are **not modeled well** by linear terms alone.
* Help models (especially **linear models**) account for **non-linear effects** by introducing new variables that reflect **combined influence**.

This is especially useful in:

* **Regression models**, where the effect of one variable might **depend on the level** of another variable.
* **Polynomial regression**, where products of features are added alongside powers of individual features.
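
A small NumPy sketch can show what the product term buys a linear model. The synthetic target and the helper `r_squared` are assumptions for illustration; the target is built so that the effect of `x1` depends on `x2`.

```python
import numpy as np

def r_squared(features, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
y = 2 * x1 - x2 + 3 * x1 * x2          # includes an x1 * x2 interaction

r2_main = r_squared(np.column_stack([x1, x2]), y)
r2_inter = r_squared(np.column_stack([x1, x2, x1 * x2]), y)
print(r2_main, r2_inter)
```

With only the main effects the fit leaves a large share of the variance unexplained; adding the single column `x1 * x2` lets the same linear model fit this noise-free target essentially exactly.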

---

### ❌ Why the other options are incorrect:

* **1. To remove irrelevant features** – That’s done through **feature selection**, not interaction terms.
* **3. To scale features to a common range** – That’s the job of **feature scaling** (e.g., Min-Max, StandardScaler).
* **4. To reduce the dimensionality of the dataset** – Interaction terms actually **increase** dimensionality by adding new features.

---

**In summary:**
Use **interaction terms** to **model relationships between features** that aren't captured by individual features alone — particularly **non-linear interactions** in otherwise linear models.


## 14. Which feature scaling method is most appropriate for data that follows a power law distribution? 
1. Min-Max Scaling 
2. Log Transformation 
3. Z-Score Scaling 
4. Robust Scaling

The correct answer is:

**2. Log Transformation**

---

### ✅ Explanation:

If your data follows a **power law distribution** (i.e., it is **heavily skewed** with **many small values and a few very large ones**), the most appropriate feature scaling method is:

> **Log Transformation**

Why?

* It **compresses the range** of large values and **spreads out** small values.
* It **reduces skewness**, making the data more **normally distributed** or closer to it.
* This helps models **better interpret patterns**, especially when working with **non-linear or skewed data**.
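
As a rough numerical check, the sketch below draws Pareto (power-law) samples and compares sample skewness before and after taking logs. The shape parameter, sample size, and the helper `skewness` are assumptions for illustration.

```python
import numpy as np

def skewness(a):
    """Sample skewness: third central moment over variance to the power 1.5."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(1)
# rng.pareto draws from the Lomax form; adding 1 gives a classical Pareto with minimum 1.
x = rng.pareto(2.0, 10_000) + 1.0

skew_raw = skewness(x)
skew_log = skewness(np.log(x))
print(skew_raw, skew_log)
```

The log of a Pareto variable is exponentially distributed, so the result is still right-skewed, but far less than the raw data.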

---

### ❌ Why the other options are not ideal:

* **1. Min-Max Scaling** – Sensitive to outliers and does **not reduce skewness**.
* **3. Z-Score Scaling (Standardization)** – A linear rescaling that leaves the shape of the distribution unchanged, so it is **not effective for skewed/power law data**.
* **4. Robust Scaling** – Handles outliers well but **does not address skewness or power-law behavior** as effectively as log transformation.

---

**In summary:**
Use **log transformation** to handle **power-law distributed data**, especially before applying models that assume more symmetric or linear relationships.


## 15. What is the main advantage of using polynomial features in regression models? 
1. They reduce model complexity 
2. They can model non-linear relationships 
3. They improve model interpretability 
4. They eliminate the need for feature scaling

The correct answer is:

**2. They can model non-linear relationships**

---

### ✅ Explanation:

**Polynomial features** are used in regression models to:

* **Capture non-linear relationships** between the input features and the target variable by adding terms like $x^2, x^3, x_1x_2$, etc.
* Allow **linear models** (like linear regression) to fit **non-linear data** by including these additional transformed features.

For example, transforming a feature $x$ into $[x, x^2, x^3]$ allows a linear regression model to fit **curved trends** in the data.
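
A minimal sketch of that example in NumPy, assuming a noise-free cubic target chosen for illustration: the model stays linear in its coefficients, and only the feature matrix changes.

```python
import numpy as np

x = np.linspace(-3, 3, 50)
y = 1 + 2 * x - 0.5 * x ** 3            # curved trend

# Design matrices: intercept + x, versus intercept + [x, x^2, x^3].
X_lin = np.column_stack([np.ones_like(x), x])
X_poly = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

coef_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
coef_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

mse_lin = ((y - X_lin @ coef_lin) ** 2).mean()
mse_poly = ((y - X_poly @ coef_poly) ** 2).mean()
print(coef_poly, mse_lin, mse_poly)
```

The polynomial fit recovers coefficients close to 1, 2, 0, and -0.5, while the straight-line fit leaves a much larger error.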

---

### ❌ Why the other options are incorrect:

* **1. They reduce model complexity** – False. Polynomial features actually **increase** complexity by adding more terms.
* **3. They improve model interpretability** – Not necessarily. **More complex models** with higher-degree polynomials are often **harder to interpret**.
* **4. They eliminate the need for feature scaling** – Incorrect. In fact, polynomial features can **magnify scale differences**, making **feature scaling even more important**.

---

**In summary:**
Use **polynomial features** to help regression models **learn non-linear patterns** in data without switching to more complex models.


## 16. When is it appropriate to use frequency encoding for categorical variables? 
1. When the categorical variable has no inherent order 
2. When the categorical variable has a very high cardinality 
3. When the categorical variable is ordinal 
4. When the categorical variable is binary

The correct answer is:

**2. When the categorical variable has a very high cardinality**

---

### ✅ Explanation:

**Frequency encoding** replaces each category in a categorical variable with the **number of times it appears** (its frequency) in the dataset. This method is particularly useful:

* When dealing with **high-cardinality categorical variables** (i.e., those with many unique values).
* It helps **avoid the curse of dimensionality** caused by one-hot encoding or other sparse techniques.
* It is **simple, fast, and memory-efficient**, especially for tree-based models like decision trees or random forests.
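
Frequency encoding takes only a few lines with the standard library. The city names are made-up example data.

```python
from collections import Counter

cities = ["paris", "tokyo", "paris", "lima", "tokyo", "paris"]

counts = Counter(cities)                  # category -> number of occurrences
encoded = [counts[c] for c in cities]

print(encoded)   # [3, 2, 3, 1, 2, 3]
```

The feature stays a single column however many categories there are. One caveat visible in the sketch is that two categories with the same count would receive the same code.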

---

### ❌ Why the other options are less appropriate:

* **1. When the categorical variable has no inherent order** – Frequency encoding doesn’t impose order, but **this isn’t its main use case**.
* **3. When the categorical variable is ordinal** – Ordinal encoding (assigning meaningful ranks) is usually more appropriate.
* **4. When the categorical variable is binary** – Binary variables are already encoded as 0/1 (or can easily be), so frequency encoding is unnecessary.

---

**In summary:**
Use **frequency encoding** when you're working with **high-cardinality categorical features** and want to avoid creating too many columns.


## 17. What is the impact of using log transformation on skewed data? 
1. It increases the skewness of the data 
2. It reduces the skewness, making the data more normally distributed 
3. It increases the range of the data 
4. It has no effect on the distribution of the data

The correct answer is:

**2. It reduces the skewness, making the data more normally distributed**

---

### ✅ Explanation:

**Log transformation** is commonly used to handle **right-skewed (positively skewed)** data. It works by:

* **Compressing large values** more than small ones
* **Reducing the influence of extreme values (outliers)**
* Making the distribution **closer to normal (bell-shaped)**, which many statistical and machine learning models assume

This helps improve:

* **Model performance**
* **Interpretability**
* **Symmetry in data visualization**

---

### ❌ Why the other options are incorrect:

* **1. It increases the skewness of the data** – Opposite of what log transformation does.
* **3. It increases the range of the data** – It **compresses** the range, especially for large values.
* **4. It has no effect on the distribution of the data** – It **does affect** the distribution, especially when the data is skewed.

---

**In summary:**
Use **log transformation** to **reduce skewness** and help models that perform better with **normally distributed inputs**.


## 18. Which technique should be used to handle missing data when the missingness is related to the value itself (MNAR)? 
1. Mean Imputation 
2. Complete Case Analysis 
3. Multiple Imputation with Missingness Indicators 
4. Listwise Deletion

The correct answer is:

**3. Multiple Imputation with Missingness Indicators**

---

### ✅ Explanation:

When data is **Missing Not At Random (MNAR)**, meaning the **likelihood of a value being missing is related to the value itself** (e.g., people with high income choosing not to report it), standard approaches like mean imputation or deletion can introduce bias.

**Multiple Imputation with Missingness Indicators** helps by:

* **Creating a binary indicator variable** that marks whether a value is missing.
* **Using multiple imputation** to generate several plausible values for the missing data, incorporating uncertainty.
* **Preserving the structure and possible dependence** between missingness and the variable itself.

This approach improves model performance and **reduces bias** when missingness contains valuable information.

---

### ❌ Why the other options are incorrect for MNAR:

* **1. Mean Imputation** – Assumes data is Missing Completely At Random (MCAR); **biases results under MNAR**.
* **2. Complete Case Analysis** – Deletes rows with missing data; this can **introduce significant bias and data loss** under MNAR.
* **4. Listwise Deletion** – Same as above; **not suitable** when missingness is informative.

---

**In summary:**
Use **Multiple Imputation with Missingness Indicators** for MNAR situations to **account for the missingness mechanism and reduce bias**.


## 19. How does covariance differ from correlation in terms of scale and interpretability? 
1. Covariance is standardized and easier to interpret, while correlation is dependent on the units of the variables 
2. Covariance is dependent on the units of the variables, making it harder to interpret, while correlation is standardized and easier to compare across datasets 
3. Covariance provides both the direction and strength of the relationship, while correlation only indicates the direction 
4. Covariance is not sensitive to scale, while correlation is

The correct answer is:

**2. Covariance is dependent on the units of the variables, making it harder to interpret, while correlation is standardized and easier to compare across datasets**

---

### ✅ Explanation:

* **Covariance** measures how two variables vary **together**, but:

  * It is **not standardized**, meaning its value is affected by the **units and scale** of the variables.
  * For example, changing the unit of one variable (e.g., from meters to centimeters) changes the covariance.

* **Correlation** (typically Pearson correlation):

  * Is a **standardized version** of covariance.
  * Ranges between **-1 and +1**, making it **unit-free** and **easy to interpret** across different variables and datasets.
  * Shows both the **strength and direction** of the linear relationship.
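
The unit dependence can be checked with a short NumPy sketch. The height and weight values are made up for illustration.

```python
import numpy as np

height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 80.0, 90.0])
height_cm = height_m * 100                       # same data, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_m, cov_cm)      # covariance grows by a factor of 100
print(corr_m, corr_cm)    # correlation is unchanged
```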

---

### ❌ Why the other options are incorrect:

* **1. Covariance is standardized...** – Incorrect. Covariance is **not standardized**, correlation is.
* **3. Covariance provides both direction and strength...** – Covariance does indicate **direction**, but not a standardized **strength** like correlation.
* **4. Covariance is not sensitive to scale...** – False. Covariance **is sensitive** to the **scale and units** of variables.

---

**In summary:**
Covariance is **scale-dependent** and less interpretable, while correlation is **scale-invariant** and easier to understand and compare.


## 20. What does a Pearson Correlation Coefficient of -0.85 indicate about the relationship between two variables? 
1. A weak positive linear relationship 
2. A strong positive linear relationship 
3. A weak negative linear relationship 
4. A strong negative linear relationship

The correct answer is:

**4. A strong negative linear relationship**

---

### ✅ Explanation:

The **Pearson Correlation Coefficient** (r) measures the **strength and direction** of the **linear relationship** between two variables. It ranges from **-1 to +1**:

* **+1** → Perfect **positive** linear relationship
* **0** → **No** linear relationship
* **–1** → Perfect **negative** linear relationship

So:

* An **r = –0.85** means:

  * The two variables are **strongly correlated**
  * As one **increases**, the other **tends to decrease**
  * The relationship is **linear and negative**

---

### ❌ Why the other options are incorrect:

* **1. A weak positive linear relationship** → Would be a small **positive** r (e.g., +0.2)
* **2. A strong positive linear relationship** → Would be r close to **+1**
* **3. A weak negative linear relationship** → Would be a small **negative** r (e.g., –0.2)

---

**In summary:**
A Pearson correlation of **–0.85** = **strong, negative, linear** relationship between two variables.


## 21. Why is feature engineering considered a crucial step in the machine learning pipeline? 
1. It reduces the need for data preprocessing 
2. It simplifies the data visualization process 
3. It improves model performance by capturing underlying patterns 
4. It guarantees the elimination of all data biases

The correct answer is:

**3. It improves model performance by capturing underlying patterns**

---

### ✅ Explanation:

**Feature engineering** is the process of creating, transforming, or selecting features (input variables) to help machine learning models:

* **Better understand the data**
* **Capture meaningful patterns or relationships**
* **Improve prediction accuracy and generalization**

Good feature engineering can **make a simple model perform exceptionally well**, while poor feature engineering can limit even the most complex algorithms.

---

### ❌ Why the other options are incorrect:

* **1. It reduces the need for data preprocessing** – False. Feature engineering **builds on top of** proper data preprocessing (e.g., handling missing values, scaling).
* **2. It simplifies the data visualization process** – Not its main purpose; visualization is more about **exploration**, not modeling.
* **4. It guarantees the elimination of all data biases** – No technique can **guarantee** this; biases must be carefully identified and addressed separately.

---

**In summary:**
**Feature engineering is crucial** because it **enhances model performance** by revealing deeper, more useful structures in the data.


## 22. Which technique is NOT typically used for handling imbalanced data? 
1. SMOTE 
2. Random Oversampling 
3. Tomek Links 
4. Linear Discriminant Analysis (LDA)

The correct answer is:

**4. Linear Discriminant Analysis (LDA)**

---

### ✅ Explanation:

**Linear Discriminant Analysis (LDA)** is a **classification algorithm and dimensionality reduction technique**, **not a method for handling imbalanced data**.

It works by:

* Finding a linear combination of features that best separates the classes.
* Estimating class means and priors from the data, which makes it **sensitive to class imbalance** without doing anything to correct it.

---

### ✅ The other techniques are commonly used to handle imbalanced datasets:

* **1. SMOTE (Synthetic Minority Over-sampling Technique):**
  Creates synthetic samples for the minority class to balance the dataset.

* **2. Random Oversampling:**
  Randomly duplicates instances of the minority class.

* **3. Tomek Links:**
  A **data cleaning method** that removes overlapping examples from the majority class to make class boundaries clearer.
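
Of the three, random oversampling is the simplest to sketch. The example below uses NumPy with a made-up 95/5 class split and is only an illustration of the resampling step.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)                   # 95 majority, 5 minority
X = np.arange(100, dtype=float).reshape(-1, 1)     # placeholder feature

minority_idx = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - (y == 1).sum()

# Duplicate minority rows, sampled with replacement, until classes match.
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.concatenate([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print((y_bal == 0).sum(), (y_bal == 1).sum())
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into evaluation data.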

---

**In summary:**
**LDA is a modeling technique**, not a resampling or balancing method — hence **not typically used** for handling imbalanced data.


## 23. Which imputation method involves replacing missing values with the mean or median of the non-missing data? 
1. Last Observation Carried Forward (LOCF) 
2. K-Nearest Neighbors (KNN) Imputation 
3. Simple Imputation 
4. Multiple Imputation

The correct answer is:

**3. Simple Imputation**

---

### ✅ Explanation:

**Simple Imputation** is the process of filling in missing values using a **single statistic** such as:

* **Mean** (for numerical data)
* **Median** (for skewed numerical data)
* **Mode** (for categorical data)

It is:

* **Easy to implement**
* **Fast and efficient**
* Often used as a **baseline method** before trying more advanced techniques
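
A minimal sketch of simple imputation using only the standard library. The helper `simple_impute` and the sample columns are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, mode

def simple_impute(values, strategy="mean"):
    """Replace None entries with one statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

ages = [22, None, 30, 28, None, 100]    # 100 is an outlier that pulls the mean up
colors = ["red", "blue", None, "red"]

print(simple_impute(ages, "mean"))      # gaps filled with the mean of 22, 30, 28, 100
print(simple_impute(ages, "median"))    # gaps filled with the median, less affected by 100
print(simple_impute(colors, "mode"))    # gap filled with the most frequent category
```

The mean and median fills differ noticeably here because of the single large value, which is why the median is often preferred for skewed columns.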

---

### ❌ Why the other options are incorrect:

* **1. Last Observation Carried Forward (LOCF)** – Fills missing values using the **last observed value**, mainly used in **time series**.
* **2. K-Nearest Neighbors (KNN) Imputation** – Fills missing values based on the values of the **nearest neighbors**.
* **4. Multiple Imputation** – Involves creating **multiple versions** of the dataset by imputing values multiple times to reflect uncertainty.

---

**In summary:**
**Simple imputation** uses the **mean, median, or mode** to fill missing data and is the **most basic imputation technique**.


## 24. In the context of missing data, what does MCAR stand for? 
1. Missing Categorical at Random 
2. Missing Completely at Random 
3. Missing Conditionally at Random 
4. Missing Concurrently at Random

The correct answer is:

**2. Missing Completely at Random**

---

### ✅ Explanation:

**MCAR (Missing Completely at Random)** means that the **probability of a data point being missing is entirely unrelated** to:

* The **value of the variable itself**
* **Any other variables** in the dataset

In other words, the missingness is **purely random**. This is the **best-case scenario** for handling missing data because:

* **Listwise deletion** leaves estimates **unbiased** under MCAR; the cost is a smaller sample.
* **Mean imputation** keeps the mean unbiased, although it shrinks the variance and weakens correlations, so it is only relatively safe.

---

### ❌ Why the other options are incorrect:

* **1. Missing Categorical at Random** – Not a valid term.
* **3. Missing Conditionally at Random** – Not a standard term; possibly a confusion with **MAR (Missing At Random)**, where missingness depends only on **observed** variables.
* **4. Missing Concurrently at Random** – Not a recognized concept in missing data theory.

---

**In summary:**
**MCAR = Missing Completely at Random** — no pattern or relationship in the missingness.


## 25. What is the purpose of Principal Component Analysis (PCA) in feature extraction? 
1. To create new features by combining original features 
2. To reduce dimensionality by selecting the most important features 
3. To normalize the data 
4. To remove outliers from the dataset

The correct answer is:

**1. To create new features by combining original features**

---

### ✅ Explanation:

**Principal Component Analysis (PCA)** is a **feature extraction** and **dimensionality reduction** technique that:

* **Creates new features (principal components)** by combining original features through linear transformations.
* These components capture the **maximum variance** in the data with fewer dimensions.
* The new features are **uncorrelated** and ordered by the amount of variance they capture.

So PCA doesn't just select existing features — it **creates new ones** based on combinations of the original variables.
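
The idea can be sketched with NumPy alone. This is an illustrative eigen-decomposition on hypothetical synthetic data, not a replacement for a library PCA; the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two strongly correlated features plus one independent feature.
x1 = rng.normal(size=500)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=500), rng.normal(size=500)])

# Standardize first, because PCA is sensitive to feature scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix gives the principal axes.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # re-sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = Z @ eigvecs                    # each column mixes all original features
explained_ratio = eigvals / eigvals.sum()

print("explained variance ratio:", np.round(explained_ratio, 3))
print("component correlations:\n", np.round(np.corrcoef(components, rowvar=False), 3))
```

Each new column is a weighted sum of every original feature, the columns are mutually uncorrelated, and most of the variance lands in the first components.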

---

### ❌ Why the other options are incorrect:

* **2. To reduce dimensionality by selecting the most important features** – That's **feature selection**, not PCA. PCA creates **new features** rather than selecting existing ones.
* **3. To normalize the data** – Features are usually standardized **before** PCA is applied, but PCA **doesn't perform normalization** itself.
* **4. To remove outliers from the dataset** – PCA does **not remove outliers**; in fact, it can be **sensitive to them**.

---

**In summary:**
**PCA is used to create new, compact features** that preserve most of the data’s variability, helping reduce dimensionality while retaining essential information.


## 26. Which feature selection method uses Lasso regularization to select features? 
1. Filter methods 
2. Wrapper methods 
3. Embedded methods 
4. Recursive feature elimination

The correct answer is:

**3. Embedded methods**

---

### ✅ Explanation:

**Embedded methods** perform feature selection **during the model training process**. A key example of this is **Lasso regularization** (Least Absolute Shrinkage and Selection Operator), which:

* Adds an **L1 penalty** to the loss function.
* **Shrinks some coefficients to exactly zero**, effectively **selecting features** by eliminating the less important ones.
* Helps in both **regularization** and **automatic feature selection**.
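
A small sketch of this behavior with scikit-learn's `Lasso`. The synthetic data and the choice `alpha=0.5` are assumptions made for illustration; a suitable penalty strength depends on the dataset and is usually tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns influence the target; the other three are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.5)                 # alpha sets the strength of the L1 penalty
model.fit(X, y)

selected = np.flatnonzero(model.coef_)   # indices whose coefficient was not zeroed out
print("coefficients:", np.round(model.coef_, 3))
print("selected feature indices:", selected)
```

The noise columns end up with coefficients of exactly zero, so reading off the non-zero entries gives the selected features as a by-product of training.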

---

### ❌ Why the other options are incorrect:

* **1. Filter methods** – Use statistical techniques (like correlation or chi-square tests) **independently of any model**.
* **2. Wrapper methods** – Use a model to evaluate different **combinations of features** (e.g., forward/backward selection) but do **not use regularization**.
* **4. Recursive Feature Elimination (RFE)** – A **wrapper method** that repeatedly fits a model and removes the least important features according to its coefficients or feature importances; it is not itself a regularization technique.

---

**In summary:**
**Lasso-based feature selection** is an example of an **embedded method** — it’s built into the model training process itself.


## 27. What is the primary goal of feature transformation? 
1. To eliminate irrelevant features 
2. To change the distribution or scale of features 
3. To create interaction terms between features 
4. To combine features into a single feature

The correct answer is:

**2. To change the distribution or scale of features**

---

### ✅ Explanation:

**Feature transformation** involves applying mathematical functions to your features to:

* **Adjust the scale** (e.g., using Standardization or Min-Max Scaling),
* **Normalize distributions** (e.g., using Log, Square Root, or Box-Cox transformations),
* **Handle skewed data** to make it more suitable for modeling,
* Improve the **performance and convergence** of machine learning algorithms.
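
To make the effect on a distribution concrete, the sketch below measures skewness before and after a log transform. The `skewness` helper and the sample values are hypothetical and chosen so the effect is easy to see.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]   # hypothetical right-skewed feature
logged = [math.log(v) for v in raw]        # math.log1p(v) is a common choice if zeros can occur

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

The raw values have a long right tail, while their logarithms are evenly spaced, so the transformed feature is symmetric.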

---

### ❌ Why the other options are incorrect:

* **1. To eliminate irrelevant features** → That’s **feature selection**, not transformation.
* **3. To create interaction terms between features** → That’s a part of **feature engineering**, specifically interaction creation.
* **4. To combine features into a single feature** → That can be done via **feature extraction** (like PCA), not the primary goal of transformation.

---

**In summary:**
The main purpose of **feature transformation** is to **change the distribution or scale** of features so that they are better suited for modeling.


## 28. Which of the following is an example of feature creation? 
1. Normalization 
2. Standardization 
3. Creating BMI from height and weight 
4. Log transformation

The correct answer is:

**3. Creating BMI from height and weight**

---

### ✅ Explanation:

**Feature creation** (a part of feature engineering) involves **creating new features** from existing ones to help the model better understand underlying patterns.

* **Example:** Calculating **BMI (Body Mass Index)** from **height and weight** creates a new, more meaningful feature from two raw features.
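
In code, this kind of feature creation can be as small as the sketch below. The field names `height_cm` and `weight_kg` are assumptions about how the raw data is stored; BMI is weight in kilograms divided by height in meters squared.

```python
def add_bmi(record):
    """Return a copy of the record with a derived BMI feature."""
    height_m = record["height_cm"] / 100
    return {**record, "bmi": record["weight_kg"] / height_m ** 2}

patients = [
    {"height_cm": 175, "weight_kg": 70},
    {"height_cm": 160, "weight_kg": 64},
]
patients = [add_bmi(p) for p in patients]

for p in patients:
    print(p["height_cm"], p["weight_kg"], round(p["bmi"], 1))
```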

---

### ❌ Why the other options are incorrect:

* **1. Normalization** → This is a **scaling technique**, not feature creation.
* **2. Standardization** → Another **scaling technique**, not creating new features.
* **4. Log transformation** → This is a **feature transformation**, used to deal with skewness or to stabilize variance, but it doesn't create a new feature from multiple inputs.

---

**In summary:**
Creating **BMI from height and weight** is a classic example of **feature creation**—combining existing features into a new one that may capture more useful information for the model.


## 29. When dealing with imbalanced data, what does the SMOTE technique do? 
1. Removes majority class samples 
2. Adds noise to the minority class 
3. Creates synthetic samples of the minority class 
4. Adjusts class weights during model training

The correct answer is:
**3. Creates synthetic samples of the minority class**

---

### ✅ Explanation:

**SMOTE** stands for **Synthetic Minority Over-sampling Technique**.
It is a popular **oversampling** method used to handle **imbalanced datasets**.

* SMOTE works by **creating new synthetic samples** for the **minority class** rather than just duplicating existing ones.
* It does this by **interpolating** between existing minority class examples and their nearest neighbors.
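
The interpolation step can be sketched in a few lines. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_like`, the neighbor count `k=2`, and the toy points are all assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority points and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the base point (excluding itself).
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.1), (1.2, 1.8)]
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Every synthetic point lies on a line segment between two real minority samples, which is what distinguishes this approach from plain duplication.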

---

### ❌ Why the other options are incorrect:

* **1. Removes majority class samples** → That’s **undersampling**, not SMOTE.
* **2. Adds noise to the minority class** → SMOTE does not add noise; it creates synthetic, structured samples.
* **4. Adjusts class weights during model training** → This is another technique (like using `class_weight='balanced'` in models), not what SMOTE does.

---

**In short:**
**SMOTE = Synthetic generation of minority class samples** to balance the dataset.


## 30. What is the main advantage of using an Isolation Forest for outlier detection? 
1. It is computationally less expensive than other methods 
2. It requires no tuning or parameters
3. It isolates outliers more quickly than normal points 
4. It works better with categorical data

The correct answer is:
**3. It isolates outliers more quickly than normal points**

---

### ✅ Explanation:

**Isolation Forest** is an ensemble-based outlier detection algorithm that works by **randomly selecting a feature and then randomly selecting a split value** between the maximum and minimum values of the selected feature.

* **Outliers** are more likely to be **isolated (split apart)** early in the process because they are **few and different**.
* Hence, they generally have **shorter path lengths** in the tree structure, allowing the model to detect them quickly.
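
The path-length intuition can be checked with a toy one-dimensional version. This is a heavily simplified sketch (no subsampling, no depth limit, a single feature), and the helper names are hypothetical; it is not how a production Isolation Forest is implemented.

```python
import random

def isolation_depth(point, data, rng, depth=0):
    """Count random splits until `point` is alone in its partition (1-D data)."""
    if len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as the point.
    same_side = [v for v in data if (v < split) == (point < split)]
    return isolation_depth(point, same_side, rng, depth + 1)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

data = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 50.0]
print("average depth, outlier 50.0:", average_depth(50.0, data))
print("average depth, inlier 1.5:  ", average_depth(1.5, data))
```

The value 50.0 is usually cut off by the very first split, while 1.5 needs several splits to be separated from its close neighbors.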

---

### ❌ Why other options are incorrect:

* **1. It is computationally less expensive than other methods** → While efficient, this is not the primary reason it's used.
* **2. It requires no tuning or parameters** → It does have parameters like `n_estimators`, `max_samples`, etc., though it’s relatively simple to use.
* **4. It works better with categorical data** → It is **not well-suited for categorical data**; it works best with **numerical data**.

---

**In short:**
Isolation Forest is effective because **outliers are easier to isolate**, which the algorithm does faster than for normal data points.


## 31. Which method of feature selection considers all possible subsets of features to find the best one? 
1. Forward Selection 
2. Backward Elimination 
3. Recursive Feature Elimination 
4. Exhaustive Search

The correct answer is:
**4. Exhaustive Search**

---

### ✅ Explanation:

**Exhaustive Search** (also called **best subset selection**) evaluates **all possible combinations of features** to determine which subset performs best according to a specific evaluation metric (like accuracy, AIC, BIC, etc.).

* It is the most **comprehensive** feature selection method.
* However, it is also the **most computationally expensive**, especially when dealing with a large number of features.
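
A sketch of the search loop and its cost. The scoring function `toy_score` is a hypothetical stand-in for a real evaluation such as cross-validated accuracy, and the feature names are invented for the example.

```python
from itertools import combinations

def exhaustive_search(features, score):
    """Score every non-empty subset of `features` and return the best one."""
    best_subset, best_score, n_evaluated = None, float("-inf"), 0
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            n_evaluated += 1
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score, n_evaluated

# Hypothetical score: reward the two useful features, charge a small cost per feature kept.
USEFUL = {"age": 0.30, "income": 0.25}

def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

features = ["age", "income", "zip", "id"]
best, value, n = exhaustive_search(features, toy_score)
print("best subset:", best, "| subsets evaluated:", n)
```

With `n` features there are `2**n - 1` non-empty subsets, so 4 features need 15 evaluations while 30 features would need over a billion.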

---

### ❌ Why other options are incorrect:

* **1. Forward Selection** → Starts with no features and **adds one at a time** based on performance.
* **2. Backward Elimination** → Starts with all features and **removes one at a time** based on least contribution.
* **3. Recursive Feature Elimination (RFE)** → Recursively removes the **least important features** using model-based rankings.

---

**In summary:**
**Exhaustive Search** tries **all subsets**, making it thorough but slow for high-dimensional datasets.


## 32. How does the Curse of Dimensionality affect feature engineering? 
1. It reduces the accuracy of machine learning models 
2. It increases the interpretability of models 
3. It simplifies feature selection 
4. It makes it easier to detect outliers

The correct answer is:
**1. It reduces the accuracy of machine learning models**

---

### ✅ Explanation:

The **Curse of Dimensionality** refers to various problems that arise when working with data in **high-dimensional spaces**:

* **Sparse data**: As dimensions increase, data becomes sparse, making it difficult for models to generalize.
* **Increased computation**: More features = more resources needed to process data.
* **Overfitting**: Models may learn noise instead of patterns due to too many irrelevant features.
* **Distance measures become less meaningful**, affecting algorithms like KNN or clustering.
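
The last point can be observed directly. The sketch below, using hypothetical uniform random data, compares the nearest and farthest distance from a query point as the number of dimensions grows; `distance_contrast` is a name made up for this example.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, [rng.random() for _ in range(n_dims)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(f"{d:>5} dimensions: contrast = {distance_contrast(d):.2f}")
```

As the dimension grows the contrast shrinks toward zero: the nearest and farthest neighbors end up at almost the same distance, which weakens any method that relies on "closeness".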

---

### ❌ Why other options are incorrect:

* **2. Increases interpretability** → More features usually make models **harder** to interpret.
* **3. Simplifies feature selection** → It actually makes feature selection **more complex and crucial**.
* **4. Makes it easier to detect outliers** → In high dimensions every point looks far from the others, so genuine outliers become **harder** to distinguish.

---

**In short:**
The Curse of Dimensionality **negatively impacts model performance** by making the data more sparse and the models prone to overfitting.


## 33. What does the term "data leakage" refer to in the context of feature engineering? 
1. The process of accidentally including future information in the training data 
2. The loss of data due to improper handling 
3. The sharing of sensitive data without authorization 
4. The mislabeling of data during preprocessing 

The correct answer is:
**1. The process of accidentally including future information in the training data**

---

### ✅ Explanation:

**Data leakage** occurs when **information from outside the training dataset**, especially from the **future or test data**, is used to train the model. This leads to **unrealistically high performance during training and validation**, but **poor generalization** on unseen data.

#### Examples:

* Including a target variable or a future-derived feature in your training features.
* Using data that would not be available at prediction time.

---

### ❌ Why the other options are incorrect:

* **2. Loss of data due to improper handling** → This is more about data corruption, not leakage.
* **3. Sharing of sensitive data** → This is a **security/privacy breach**, not data leakage in ML context.
* **4. Mislabeling of data** → That's a **labeling error**, not data leakage.

---

**In short:**
**Data leakage** can mislead your model to perform well during training but fail in production. It’s one of the most critical issues to avoid in feature engineering and model building.


## 34. How can high-cardinality categorical variables negatively impact a machine learning model? 
1. They increase the interpretability of the model 
2. They reduce the dimensionality of the dataset 
3. They can lead to overfitting due to the large number of unique values 
4. They always improve the model's accuracy

The correct answer is:
**3. They can lead to overfitting due to the large number of unique values**

---

### ✅ Explanation:

**High-cardinality categorical variables** (e.g., "zip code", "user ID") have **many unique values**. When encoded (e.g., using one-hot encoding), they can:

* **Greatly increase the number of features**, leading to **high dimensionality**
* Cause the model to **memorize** the data instead of generalizing → **overfitting**
* Be **sparse** and less informative for prediction
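
The dimensionality effect is easy to see on a toy column. The zip codes below are hypothetical sample data, and frequency encoding is shown as one possible compact alternative.

```python
from collections import Counter

zip_codes = ["10001", "94105", "10001", "60601", "73301", "94105", "10001", "30301"]

# One-hot encoding: one new column per distinct value.
categories = sorted(set(zip_codes))
one_hot = [[int(z == c) for c in categories] for z in zip_codes]

# Frequency encoding: a single column holding each value's relative frequency.
counts = Counter(zip_codes)
freq_encoded = [counts[z] / len(zip_codes) for z in zip_codes]

print("one-hot columns:", len(one_hot[0]))
print("frequency-encoded column:", freq_encoded)
```

Eight rows already produce five one-hot columns; a real zip-code column could produce thousands, most of them almost entirely zeros.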

---

### ❌ Incorrect options:

* **1. Increase interpretability** → No, they make models harder to interpret.
* **2. Reduce dimensionality** → They do the opposite; they increase it.
* **4. Always improve accuracy** → Not true; they often **hurt performance** without proper handling.

---

**In summary:**
High-cardinality categorical variables can make models **complex, overfit-prone**, and **computationally inefficient**. Techniques like **target encoding, frequency encoding, or embedding** can help handle them effectively.


## 35. Which of the following is a disadvantage of using recursive feature elimination (RFE)? 
1. It cannot handle categorical features 
2. It is computationally expensive for large datasets 
3. It ignores feature importance 
4. It does not provide a ranked list of features

The correct answer is:
**2. It is computationally expensive for large datasets**

---

### ✅ Explanation:

**Recursive Feature Elimination (RFE)** works by:

* Fitting the model multiple times
* Recursively removing the **least important features**, ranked by the model's coefficients or feature importances

This makes it **computationally expensive**, especially for:

* **Large datasets**
* **High-dimensional data**
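
The repeated refitting is visible in a stripped-down version of the procedure. This sketch uses ordinary least squares via NumPy and drops one feature per round; the function `rfe_least_squares` and the synthetic data are assumptions for illustration, and it presumes the features are on comparable scales so that coefficient magnitudes can be compared.

```python
import numpy as np

def rfe_least_squares(X, y, n_keep):
    """Simplified RFE: refit least squares, drop the smallest |coefficient|, repeat."""
    remaining = list(range(X.shape[1]))
    eliminated = []                        # order of removal, least important first
    n_fits = 0
    while len(remaining) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        n_fits += 1
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated, n_fits

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only columns 0 and 1 carry signal; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

kept, eliminated, n_fits = rfe_least_squares(X, y, n_keep=2)
print("kept:", kept, "| elimination order:", eliminated, "| model fits:", n_fits)
```

Going from 6 features to 2 takes 4 separate fits here; with thousands of features and a slow model, that refit count is where the cost comes from. The elimination order doubles as a feature ranking.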

---

### ❌ Incorrect options:

* **1. It cannot handle categorical features** → Not true. RFE can work with categorical features if they are properly encoded (e.g., one-hot, label encoding).
* **3. It ignores feature importance** → False. RFE is **based on feature importance** (e.g., weights from linear models, feature importance from trees).
* **4. It does not provide a ranked list of features** → Incorrect. RFE **does** provide a ranking of features during the elimination process.

---

**Summary:**
RFE is powerful but can be **slow and resource-intensive** for large datasets. Use it when you have **enough computation power** or combine it with other methods for efficiency.


## 36. How does the Ridge regularization method help in feature selection? 
1. It removes irrelevant features by setting coefficients to zero 
2. It shrinks coefficients of less important features towards zero without eliminating them 
3. It increases the coefficients of important features 
4. It selects features based on correlation with the target variable

The correct answer is:
**2. It shrinks coefficients of less important features towards zero without eliminating them**

---

### ✅ Explanation:

**Ridge Regularization** (also known as **L2 regularization**) works by:

* Adding a **penalty** proportional to the **sum of the squared coefficients** to the loss function.
* This **shrinks the coefficients** of less important features **closer to zero** but **does not set them exactly to zero**.
* Therefore, Ridge **does not perform feature selection** in the strict sense (i.e., it doesn’t eliminate features), but it **helps reduce overfitting** and handles multicollinearity.
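
The difference from Lasso shows up clearly in a special case that has closed-form solutions. The sketch assumes an orthonormal design (XᵀX = I), a Lasso objective of ½‖y − Xβ‖² + λ‖β‖₁, and a Ridge objective of ½‖y − Xβ‖² + (λ/2)‖β‖²; the starting coefficients are hypothetical.

```python
def ridge_shrink(b, lam):
    """Ridge solution per coefficient under an orthonormal design: proportional shrinkage."""
    return b / (1 + lam)

def lasso_shrink(b, lam):
    """Lasso solution per coefficient under an orthonormal design: soft-thresholding."""
    if abs(b) <= lam:
        return 0.0
    return b - lam if b > 0 else b + lam

ols = [4.0, -2.0, 0.3, -0.1]     # hypothetical unpenalized (least-squares) coefficients
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", lasso)
```

Ridge scales every coefficient by the same factor, so small ones get smaller but never reach zero. Lasso subtracts a fixed amount, so any coefficient smaller than the threshold is set to exactly zero.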

---

### ❌ Why the others are incorrect:

* **1. It removes irrelevant features by setting coefficients to zero**
  → This is true for **Lasso (L1 regularization)**, not Ridge.

* **3. It increases the coefficients of important features**
  → No, Ridge typically **shrinks all coefficients**; it doesn’t increase any.

* **4. It selects features based on correlation with the target variable**
  → This sounds like a filter method, not Ridge. Ridge is a **model-based regularization method**, not a correlation-based selection method.

---

### Summary:

* **Ridge (L2):** Shrinks all coefficients but keeps all features.
* **Lasso (L1):** Can shrink some coefficients to **zero**, effectively doing **feature selection**.


## 37. Why is it important to ensure that feature engineering is applied consistently during training and testing phases?
1. To improve model interpretability 
2. To avoid data leakage 
3. To increase the dimensionality of the dataset 
4. To ensure the model is unbiased

The correct answer is:
**2. To avoid data leakage**

---

### ✅ Explanation:

Applying **feature engineering consistently** during both the **training** and **testing** phases is crucial to:

* Ensure that the model sees **data in the same format** during both phases.
* **Avoid data leakage**, which happens when information from outside the training dataset is used to create the model — leading to **overly optimistic performance** and **poor generalization** to unseen data.

For example:

* If you scale features (like normalization or standardization) using the **mean and std of the entire dataset** (including test data), the model may inadvertently learn information from the test set — this is **data leakage**.
* The correct way is to compute transformation parameters **only from the training set** and **apply them** to both training and test sets.
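
A minimal sketch of the correct pattern next to the leaky one, using only the standard library. The helper names `fit_scaler` and `transform` and the sample values are hypothetical; library scalers with separate fit and transform steps follow the same idea.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn standardization parameters from the training split only."""
    return mean(train_values), pstdev(train_values)

def transform(values, params):
    """Apply previously learned parameters without refitting."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 40.0]                      # never seen while fitting

params = fit_scaler(train)               # mean and std come from train alone
train_scaled = transform(train, params)
test_scaled = transform(test, params)    # same parameters reused on the test split

# Leaky alternative (avoid): parameters computed on train and test together.
leaky_params = fit_scaler(train + test)

print("train-only params:", params)
print("leaky params:     ", leaky_params)
```

The leaky parameters shift because the test value 40.0 influenced them, which means information about the test set has already reached the training pipeline.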

---

### ❌ Why the others are incorrect:

* **1. To improve model interpretability**
  → Consistency doesn’t directly impact interpretability — that's more about feature clarity and model type.

* **3. To increase the dimensionality of the dataset**
  → We usually **want to reduce** or control dimensionality — not increase it unnecessarily.

* **4. To ensure the model is unbiased**
  → While consistency **may** help reduce bias, the primary goal is to **prevent leakage and ensure valid evaluation**.

---

### Summary:

> **Consistent feature engineering ensures that your model is evaluated fairly and generalizes well — by preventing test data from “leaking” into training.**


## 38. Which of the following best describes the main purpose of Principal Component Analysis (PCA)? 
1. To create new features that are correlated with the original data 
2. To transform the data into a set of uncorrelated variables while retaining most of the data's variance 
3. To reduce the sample size of the dataset 
4. To ensure all features have a mean of 0 and a standard deviation of 1

The correct answer is:
**2. To transform the data into a set of uncorrelated variables while retaining most of the data's variance**

---

### ✅ Explanation:

**Principal Component Analysis (PCA)** is a **dimensionality reduction** technique that:

* **Transforms** the original features into a new set of features called **principal components**
* These components are:

  * **Uncorrelated** (orthogonal to each other)
  * Ordered by the amount of **variance** they capture from the original data
* The goal is to **reduce dimensionality** while **preserving as much variance as possible**

---

### ❌ Why the others are incorrect:

* **1. To create new features that are correlated with the original data**
  → The defining property of PCA components is that they are **uncorrelated with each other**; they also may not be easily interpretable in terms of the original features.

* **3. To reduce the sample size of the dataset**
  → PCA reduces the **number of features**, not the number of **samples**.

* **4. To ensure all features have a mean of 0 and a standard deviation of 1**
  → This describes **standardization**, which is often done **before** PCA, but it’s **not** PCA's main purpose.

---

### Summary:

> **PCA helps simplify complex datasets by converting them into a set of uncorrelated variables that still capture the majority of the original variance.**


## 39. When would it be more appropriate to use Spearman's Rank Correlation over Pearson's Correlation? 
1. When the data is continuous and normally distributed 
2. When the relationship between variables is expected to be linear 
3. When the data is ordinal or not normally distributed, or when there are outliers 
4. When you need to measure the strength of a linear relationship between two continuous variables

The correct answer is:
**3. When the data is ordinal or not normally distributed, or when there are outliers**

---

### ✅ Explanation:

**Spearman's Rank Correlation** is a **non-parametric** measure of correlation. It is best used when:

* The data is **ordinal** (ranked), or
* The variables have a **monotonic** relationship (not necessarily linear), or
* The data is **not normally distributed**, or
* There are **outliers** that could distort Pearson’s correlation
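
Spearman's coefficient is Pearson's correlation computed on ranks, which the sketch below makes explicit. The helper functions are hypothetical, the rank function assumes there are no ties, and the data is invented to combine a monotonic relationship with one extreme value.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """1-based ranks; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 1000]      # monotonic in x, with one extreme value

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman reports a perfect monotonic relationship, while Pearson is pulled well below 1 by the extreme value and the non-linear shape.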

---

### ❌ Why the others are incorrect:

* **1. When the data is continuous and normally distributed**
  → Use **Pearson’s correlation** for this case.

* **2. When the relationship between variables is expected to be linear**
  → Again, **Pearson’s correlation** is more appropriate.

* **4. When you need to measure the strength of a linear relationship between two continuous variables**
  → This directly describes **Pearson’s correlation**, not Spearman’s.

---

### Summary:

> **Use Spearman’s Rank Correlation when data is ordinal, not normally distributed, or contains outliers, and you're interested in monotonic relationships.**
