# **Feature Engineering**

Theoretical Questions

**1. What is a parameter?**
- A parameter is a configuration variable internal to the model which is estimated from the data. Examples include weights in linear regression or coefficients in logistic regression.
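As a minimal sketch of what "estimated from the data" means, the snippet below fits a linear regression on made-up data generated from y = 3x + 2 and reads back the learned parameters. The data and the variable names `slope` and `intercept` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 3x + 2 (an assumption
# for illustration), so the true parameters are known in advance.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# The parameters are not set by hand; they are learned during fit().
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In scikit-learn, learned parameters are exposed as attributes ending in an underscore (such as `coef_` and `intercept_`), which distinguishes them from settings chosen before training.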

**2. What is correlation? What does negative correlation mean?**
- Correlation is a statistical measure that expresses the extent to which two variables are linearly related.
- A negative correlation means that as one variable increases, the other decreases.

**3. Define Machine Learning. What are the main components in Machine Learning?**
- Machine Learning is a subset of AI where systems learn from data without being explicitly programmed.

Main components:
- Dataset
- Features
- Algorithms
- Model
- Loss function
- Optimizer

**4. How does loss value help in determining whether the model is good or not?**
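- The loss value measures how far the model's predictions are from the actual values, so a lower loss generally indicates a better fit. It should be judged on validation or test data: a very low training loss paired with a much higher validation loss signals overfitting.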
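To make the idea concrete, here is a small sketch that compares two sets of predictions using mean squared error as the loss. The numbers and the names `pred_a` and `pred_b` are hypothetical, and MSE is just one possible loss function.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.5, 5.5, 7.0, 8.0])    # hypothetical model A: close to the truth
pred_b = np.array([1.0, 8.0, 4.0, 12.0])   # hypothetical model B: far from the truth

# Mean squared error: the average of the squared prediction errors.
loss_a = mean_squared_error(y_true, pred_a)
loss_b = mean_squared_error(y_true, pred_b)

print(f"loss A = {loss_a:.3f}, loss B = {loss_b:.3f}")
```

Model A has the lower loss, so it fits these values better than model B.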

**5. What are continuous and categorical variables?**
- Continuous variables are numeric values that can take any value within a range, such as height, temperature, or income. Categorical variables represent distinct groups, such as gender, color, or type of vehicle; they have no inherent numeric meaning. The distinction matters most for the target: a continuous target calls for regression, while a categorical target calls for classification. As input features, both types can appear in either kind of task, although categorical features usually need to be encoded first.

**6. How do we handle categorical variables in Machine Learning? What are the common techniques?**
- Most machine learning algorithms require numerical input, so categorical variables (like "Male"/"Female" or "Red"/"Blue") must be converted to numbers before training the model. This process is called encoding.

Common Techniques to Handle Categorical Variables:
- Label Encoding: Converts each category to a unique integer.
  - Example: {"Red": 0, "Green": 1, "Blue": 2}
  - Risk: Implies an order, which may not exist.
  - Useful for: Tree-based models (e.g., Decision Trees, Random Forests).
- One-Hot Encoding: Creates a separate binary column for each category.
  - Example: "Red" becomes [1, 0, 0], "Green" becomes [0, 1, 0].
  - Prevents false order; preferred for most ML models.
  - Can increase dimensionality if there are many categories.
- Ordinal Encoding: Like label encoding, but used when order matters.
  - Example: {"Low": 0, "Medium": 1, "High": 2}
  - Good for: Ordered categories with meaningful rankings.
- Target Encoding (Mean Encoding): Replaces each category with the mean of the target variable for that category.
  - Risk: Overfitting, so it is usually combined with cross-validation.
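The techniques above can be sketched on a tiny, made-up DataFrame. The column names (`color`, `size`, `price`) and values are assumptions for illustration, and the target encoding shown is the naive version computed on all rows, without the cross-validation safeguard mentioned above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "price" plays the role of the target variable.
df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "size": ["Low", "High", "Medium", "Low"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: the category order is stated explicitly,
# so Low < Medium < High maps to 0 < 1 < 2.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_encoded = encoder.fit_transform(df[["size"]]).ravel()

# Naive target (mean) encoding: replace each color with the mean price
# of that color. Computing this on the same rows used for training
# leaks the target, which is why out-of-fold means are normally used.
target_means = df.groupby("color")["price"].mean()
color_target = df["color"].map(target_means)

print(one_hot)
print(size_encoded)
print(color_target)
```

Passing the order explicitly to `OrdinalEncoder` matters: without it, categories are ordered alphabetically, which would place "High" before "Low".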

**7. What do you mean by training and testing a dataset?**
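- Training means fitting the model on one portion of the data (the training set) so it can learn patterns. Testing means evaluating the fitted model on a separate portion (the test set) that it has not seen, to estimate how well it generalizes.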

**8. What is sklearn.preprocessing?**
- sklearn.preprocessing is a module in Scikit-learn (a popular Python machine learning library) that provides tools to prepare and transform data before feeding it into a model. It includes classes and functions for scaling, normalizing, encoding categorical variables, binarizing, and generating polynomial features. Missing-value imputation is handled by the separate sklearn.impute module.

**9. What is a Test set?**
- A test set is a subset of the dataset used to evaluate the trained model's performance.

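**10. How do we split data for model fitting (training and testing) in Python?**
- Use train_test_split from sklearn.model_selection to split the dataset into training and testing sets:
  - `from sklearn.model_selection import train_test_split`
  - `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`
- Here test_size=0.2 holds out 20% of the rows for testing, and random_state=42 makes the split reproducible.

**How do you approach a Machine Learning problem?**

A typical workflow, in order:
- Understand the problem
- Collect data
- Clean and preprocess
- Perform EDA
- Select features
- Choose model
- Train and evaluate
- Tune hyperparameters
- Deploy

**11. Why do we have to perform EDA before fitting a model to the data?**
- Exploratory Data Analysis (EDA) reveals the structure and quality of the data before any modeling decisions are made: missing values, outliers, skewed distributions, class imbalance, and relationships between variables. These findings guide cleaning, feature selection, and the choice of model, and they help catch problems that would otherwise show up only as poor model performance.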

**12. What is correlation?**
- Correlation is a statistical measure that shows the strength and direction of a relationship between two variables. It indicates how one variable changes when the other variable changes. The correlation coefficient ranges from -1 to +1.

**13. What does negative correlation mean?**
- A negative correlation means that as one variable increases, the other decreases.

**14. How can you find correlation between variables in Python?**
- Use libraries like pandas or NumPy.
- Calculate the Pearson correlation coefficient, which measures linear correlation between two variables.
- In pandas, use the .corr() method on a DataFrame to get correlations between all columns or between two specific columns.
- In NumPy, use np.corrcoef() to compute the correlation coefficient matrix for arrays.
- The output is a value between -1 and +1 indicating the strength and direction of the relationship.
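A short sketch of both approaches, using a made-up DataFrame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: study time, exam score, and gaming time for 5 students.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 61, 70, 74],
    "hours_gaming": [9, 7, 6, 4, 2],
})

# Full correlation matrix between all numeric columns (Pearson by default).
corr_matrix = df.corr()

# Correlation between two specific columns, with pandas and with NumPy.
r_pandas = df["hours_studied"].corr(df["exam_score"])
r_numpy = np.corrcoef(df["hours_studied"], df["exam_score"])[0, 1]

# A negative correlation: more study time goes with less gaming time.
r_negative = df["hours_studied"].corr(df["hours_gaming"])

print(corr_matrix)
print(r_pandas, r_numpy, r_negative)
```

`np.corrcoef` returns a 2x2 matrix for two inputs, so the off-diagonal entry `[0, 1]` is the coefficient of interest.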

**15. What is causation? Explain difference between correlation and causation with an example.**
- Correlation means that two variables tend to move together, but one does not necessarily cause the other. It only shows an association. For example, ice cream sales and drowning incidents both increase during summer, so they are correlated, but ice cream sales do not cause drowning; hot weather drives both.
- Causation means that a change in one variable directly produces a change in another, i.e. a cause-and-effect relationship. For instance, smoking causes lung cancer: smoking itself raises the risk of the disease.
- The key difference: correlation can be measured directly from data, whereas establishing causation requires ruling out other explanations such as confounding variables, typically through controlled experiments or careful study design.
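The ice cream example can be illustrated with a small simulation. This is only a sketch under invented assumptions: temperature is simulated as the common cause, and every coefficient and noise level is made up. Removing the effect of temperature from both variables (a simple partial correlation) makes the apparent relationship vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process: temperature drives BOTH variables;
# ice cream sales have no direct effect on drownings.
temperature = rng.normal(25.0, 5.0, size=n)
ice_cream_sales = 10.0 * temperature + rng.normal(0.0, 20.0, size=n)
drownings = 0.5 * temperature + rng.normal(0.0, 1.0, size=n)

# The raw correlation is strong even though there is no causal link.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Return the part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after removing the effect of temperature from both variables.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

In real data the confounder is rarely known this cleanly, which is why controlled experiments remain the most reliable way to establish causation.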

**16. What is an Optimizer? What are different types of optimizers? Explain each with an example.**
- An optimizer is an algorithm used in machine learning and deep learning to adjust the model’s parameters (like weights) during training to minimize the loss function. The goal of the optimizer is to find the best values for these parameters so the model can make accurate predictions by reducing errors.

Different Types of Optimizers:
- Gradient Descent (GD): The simplest optimizer that updates model parameters by moving them in the direction of the negative gradient of the loss function. It uses all training data at once for each update. Example: For linear regression, gradient descent updates weights step-by-step to reduce prediction error.
- Stochastic Gradient Descent (SGD): Instead of using the entire dataset, SGD updates parameters using one training example at a time. This makes updates faster and adds noise that can help escape local minima. Example: When training a neural network on a large dataset, SGD updates weights after each example rather than waiting for the full batch.
- Mini-batch Gradient Descent: A middle ground between GD and SGD, it updates parameters using small batches of data. This balances speed and stability. Example: Using batches of 32 samples to update weights during deep learning model training.
- Adam (Adaptive Moment Estimation): A popular optimizer combining ideas from momentum and RMSProp, it adapts learning rates for each parameter individually based on first and second moments of gradients. This often leads to faster convergence. Example: Used widely in training deep learning models like CNNs and transformers.
- RMSProp: This optimizer adapts the learning rate for each parameter by dividing it by the square root of a moving average of recent squared gradients, which helps with training on noisy or sparse data. Example: Effective in recurrent neural networks (RNNs) for sequence data.
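The first three optimizers differ only in how much data each update sees, which a short NumPy sketch can show. The data, learning rates, epoch counts, and function names (`mse_gradients`, `train`) are assumptions for illustration; Adam and RMSProp are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 3x + 2 plus a little noise (assumed for illustration).
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=200)

def mse_gradients(w, b, X_batch, y_batch):
    """Gradients of the mean squared error with respect to w and b."""
    error = (w * X_batch[:, 0] + b) - y_batch
    grad_w = 2.0 * np.mean(error * X_batch[:, 0])
    grad_b = 2.0 * np.mean(error)
    return grad_w, grad_b

def train(batch_size, epochs=200, lr=0.1):
    """Fit y = w*x + b by stepping against the gradient, one batch at a time."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the rows every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad_w, grad_b = mse_gradients(w, b, X[idx], y[idx])
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w_gd, b_gd = train(batch_size=len(X))                    # gradient descent: all data per update
w_mb, b_mb = train(batch_size=32)                        # mini-batch: 32 samples per update
w_sgd, b_sgd = train(batch_size=1, epochs=20, lr=0.05)   # SGD: one sample per update

print(w_gd, b_gd)
print(w_mb, b_mb)
print(w_sgd, b_sgd)
```

All three runs should land near w = 3 and b = 2; the smaller the batch, the noisier the individual updates.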

**17. What is sklearn.linear_model ?**
- A module in scikit-learn that provides linear models like LinearRegression, LogisticRegression.

**18. What does model.fit() do? What arguments must be given?**
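- model.fit() is a method used in machine learning libraries like Scikit-learn to train a model on your data. When you call fit(), the model learns the relationship between the input features and the target variable by adjusting its internal parameters.
- Arguments: for supervised models, fit(X, y) requires the feature matrix X (shape (n_samples, n_features)) and the target values y. Unsupervised estimators such as KMeans or StandardScaler need only X.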

**19. What does model.predict() do? What arguments must be given?**
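- model.predict() is a method used to make predictions with a trained machine learning model. After you’ve trained your model using fit(), you use predict() to estimate the output (target variable) for new or unseen input data.
- Arguments: predict(X) requires only the feature matrix X, with the same number and order of features that were used in fit(). It takes no y and returns the predicted values.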
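A minimal end-to-end sketch of fit() and predict(), using the Iris dataset bundled with Scikit-learn and a logistic regression classifier. The choice of dataset, model, and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Features X have shape (150, 4); target y has shape (150,).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)

# fit() takes the features AND the target: this is where learning happens.
model.fit(X_train, y_train)

# predict() takes only features and returns one prediction per row.
y_pred = model.predict(X_test)

# score() compares predictions with the true labels (accuracy for classifiers).
accuracy = model.score(X_test, y_test)
print(y_pred[:5], accuracy)
```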

**20. What are continuous and categorical variables?**
- Continuous variables are numeric variables that can take any value within a range. They represent measurements like height, weight, temperature, or time, where values can be decimals and are infinite within a range.
- Categorical variables represent distinct groups or categories and take on a limited, fixed number of possible values. Examples include gender (male/female), colors (red, blue, green), or types of animals. They are often non-numeric or encoded into numbers for modeling.

**21. What is feature scaling? How does it help in Machine Learning?**
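- Feature scaling transforms features so they share a comparable range or distribution. It speeds up convergence of gradient-based training and improves algorithms that depend on distances or feature magnitudes (e.g., k-nearest neighbors, SVMs, regularized linear models), where a feature with large values would otherwise dominate. Tree-based models are largely unaffected by scaling.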

**22. How do we perform scaling in Python?**
- Standardization (Z-score Scaling): Scales features to have a mean of 0 and a standard deviation of 1.
  - `from sklearn.preprocessing import StandardScaler`
  - `scaler = StandardScaler()`
  - `X_scaled = scaler.fit_transform(X)`
- Min-Max Scaling (Normalization): Scales features to a fixed range, usually 0 to 1.
  - `from sklearn.preprocessing import MinMaxScaler`
  - `scaler = MinMaxScaler()`
  - `X_scaled = scaler.fit_transform(X)`
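In practice the scaler is usually fitted on the training data only and then reused on the test data, so that no information from the test set leaks into preprocessing. The sketch below shows this pattern on a tiny made-up array; the values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (hypothetical values).
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Standardization: learn mean and std from the training data only...
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
# ...then apply the same statistics to the test data.
X_test_std = std_scaler.transform(X_test)

# Min-max scaling: learn min and max from the training data only.
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)

print(X_train_std)
print(X_test_mm)
```

Because the test value 400 lies above the training maximum of 300, its min-max scaled value falls outside the 0 to 1 range. This is expected when the scaler is fitted on training data alone.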

**23. What is sklearn.preprocessing?**
- sklearn.preprocessing is a module in the Scikit-learn library that provides various tools to preprocess and transform data before applying machine learning models. It includes classes and functions for scaling, normalizing, encoding categorical variables, binarizing, and generating polynomial features. Imputing missing values is handled by the separate sklearn.impute module.

**24. How do we split data for model fitting (training and testing) in Python?**
- In Python, we split data into training and testing sets using the train_test_split function from Scikit-learn’s model_selection module. This helps evaluate the model’s performance on unseen data.
- Import the function: `from sklearn.model_selection import train_test_split`
- Use it to split your features (X) and target (y) into training and testing sets: `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`
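A self-contained sketch of the split on a small made-up dataset. The arrays are assumptions for illustration, and `stratify=y` is an optional extra that keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples, 2 features, balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the rows for testing
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y,         # preserve the class balance in both sets
)

print(X_train.shape, X_test.shape)
print(y_test)
```

With 10 samples and `test_size=0.2`, 8 rows go to training and 2 to testing; stratification places one sample of each class in the test set.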

**25. Explain data encoding?**
- Data encoding is the process of converting categorical (non-numeric) data into a numerical format so that machine learning models can understand and process it. Since most algorithms require numerical input, encoding transforms categories like “red,” “blue,” or “green” into numbers.