# Feature Engineering

1. What is a parameter?

Ans:  
In Machine Learning, a *parameter* is a value that the model learns automatically from the training data during the training process. These values define how the model makes predictions.  
  
For example:  

* In linear regression, the slope and intercept are parameters.

* In neural networks, the weights and biases are parameters.

Parameters are adjusted during training to minimize error and improve the model’s performance.
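
As a minimal sketch of the linear regression case, the snippet below fits scikit-learn's `LinearRegression` and reads back the learned slope and intercept. The data is made up for illustration (generated from y = 2x + 1), so the fitted parameters should land close to 2 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data generated from y = 2x + 1 (invented for this sketch).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model = LinearRegression()
model.fit(X, y)  # the parameters are learned during this call

slope = model.coef_[0]        # learned weight for the single feature
intercept = model.intercept_  # learned bias term
print("slope:", slope, "intercept:", intercept)
```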

2. What is correlation?  
What does negative correlation mean?

Ans:  
*Correlation* is a statistical measure that describes the strength and direction of a relationship between two variables. The correlation coefficient always lies between –1 and +1.
  
* +1 indicates a perfect positive relationship
* 0 indicates no linear relationship
* –1 indicates a perfect negative relationship
  
A negative correlation means that as one variable increases, the other variable decreases.  
For example, if product price increases and demand decreases, this reflects a negative correlation between price and demand.
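
A small sketch of the price and demand example, using NumPy. The numbers are hypothetical and chosen only so that demand falls as price rises; the resulting coefficient should be close to –1.

```python
import numpy as np

# Hypothetical price and demand figures, invented for illustration.
price = np.array([10, 12, 14, 16, 18, 20])
demand = np.array([200, 185, 160, 150, 120, 100])

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient.
r = np.corrcoef(price, demand)[0, 1]
print("Correlation between price and demand:", r)
```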

3. Define Machine Learning. What are the main components in Machine Learning?

Ans:  
*Machine Learning* is defined as the field of study that gives computers the ability to learn from experience without being *explicitly programmed*.
  
A commonly used formal definition (by Tom Mitchell) states:  
A computer program is said to learn from `Experience (E)` with respect to some `Task (T)` and `Performance measure (P)` if its performance at task T, as measured by P, improves with experience E.

Main Components:  
  
* Task (T) – The specific problem the model is trying to solve.  
Example: Classifying emails as spam or not spam.  
  
* Experience (E) – The data the model learns from.  
Example: A dataset of labeled emails.  
  
* Performance (P) – The metric used to evaluate how well the model performs the task.
Example: Accuracy, precision, recall, or error rate.  
  
In simple terms, machine learning involves improving performance on a task through experience.

4. How does loss value help in determining whether the model is good or not?

Ans:  
The *loss value* measures how far the model’s predictions are from the actual target values. It quantifies the *error* made by the model.
  
* A lower loss value indicates that the predictions are closer to the true values, meaning the model is performing well.
* A higher loss value indicates larger errors, meaning the model needs improvement.
  
During training, the objective is to minimize the loss function. If the loss consistently decreases over iterations, it shows that the model is learning effectively. However, to determine whether a model is truly good, loss should also be evaluated on validation or test data, not just training data, to ensure it generalizes well and is not overfitting.
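
The sketch below illustrates comparing training loss with validation loss, assuming mean squared error as the loss and a synthetic dataset (y = 3x plus noise) invented for the example. If the two values are close, the model is probably not overfitting; a much larger validation loss would be a warning sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + noise (invented for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

# Hold out 25% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Loss on data the model has seen vs. data it has not seen.
train_loss = mean_squared_error(y_train, model.predict(X_train))
val_loss = mean_squared_error(y_val, model.predict(X_val))
print("train MSE:", train_loss)
print("validation MSE:", val_loss)
```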

5. What are continuous and categorical variables?

Ans:  
*Continuous variables* are numerical variables that can take any value within a range, including decimals. They represent measurable quantities.
  
Examples:
* Height
* Weight
* Temperature
* Sales revenue
  
*Categorical variables* are variables that represent distinct groups or categories rather than numeric measurements.  
  
Examples:  
* Gender
* Product category
* Payment method
* Customer segment
  
In summary, continuous variables measure quantities, while categorical variables represent labels or groups.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?

Ans:     
Categorical variables cannot be directly used in most machine learning algorithms because models require numerical input. Therefore, they must be converted into numerical form before training.
  
Common techniques to handle categorical variables:  

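1. **Label Encoding:**  
* Each category is assigned a unique integer value.
* The integers imply an order, so for input features this suits ordinal data (such as Low, Medium, High) only when the mapping follows that order. Scikit-learn's `LabelEncoder` assigns integers in sorted order and is intended for target labels.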
  
2. **One-Hot Encoding:**  
* Creates separate binary columns for each category.
* Suitable for nominal data (no order), such as colors or cities.
  
3. **Ordinal Encoding:**  
* Used when categories have a natural ranking.
* Example: Education level (High School < Bachelor’s < Master’s).
  
4. **Target Encoding:**  
* Replaces categories with the mean of the target variable for that category.
* Often used in high-cardinality features.

7. What do you mean by training and testing a dataset?

Ans:      
Training and testing a dataset refers to splitting the available data into two parts to build and evaluate a machine learning model.
  
**Training Dataset:**  
  
* The training data is used to teach the model.
* The model learns patterns, relationships, and parameters from this data.
  
**Testing Dataset:**  
  
* The testing data is used to evaluate how well the model performs on unseen data.
* It helps determine whether the model generalizes well or is overfitting.
  
In simple terms, the model learns from the training set and is evaluated on the testing set to measure its real-world performance.

8. What is sklearn.preprocessing?

Ans:  
`sklearn.preprocessing` is a module in the Scikit-learn library used for data preprocessing and transformation before training machine learning models.

It provides tools to:  

* Scale numerical data (e.g., StandardScaler, MinMaxScaler)
* Encode categorical variables (e.g., LabelEncoder, OneHotEncoder)
* Normalize data
* Binarize features

Preprocessing is important because many machine learning algorithms perform better when the input data is properly scaled and formatted.
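
As a brief illustration of the last two items in the list, the sketch below applies `Normalizer` and `Binarizer` to a tiny array. The values and the threshold of 2.0 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Normalizer rescales each row (sample) to unit L2 norm by default.
normalized = Normalizer().fit_transform(X)

# Binarizer maps values above the threshold to 1 and all others to 0.
binarized = Binarizer(threshold=2.0).fit_transform(X)

print(normalized)
print(binarized)
```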

9. What is a Test set?

Ans:  
A test set is a portion of the dataset that is kept separate from the training data and used only to evaluate the performance of a trained model.
  
It contains unseen data, meaning the model has not learned from it during training. This helps measure how well the model generalizes to new, real-world data.

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

Ans:  
In Python, we commonly use `train_test_split` from scikit-learn to divide the dataset into training and testing sets.

```python
from sklearn.model_selection import train_test_split

# Example features (X) and target (y)
X = [[1], [2], [3], [4], [5]]
y = [2, 4, 6, 8, 10]

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

A structured approach to a Machine Learning problem typically includes the following steps:
  
1. Understand the Problem  
Define the objective, business goal, and evaluation metric.  

2. Collect and Explore Data (EDA)  
Understand distributions, missing values, correlations, and patterns.  

3. Data Preprocessing  
Handle missing values, encode categorical variables, scale features, and split data.  

4. Model Selection  
Choose suitable algorithms based on the problem type (regression, classification, etc.).  
  
5. Model Training  
Train the model using the training dataset.  

6. Model Evaluation  
Evaluate performance using appropriate metrics (accuracy, RMSE, precision, recall, etc.).  
  
7. Hyperparameter Tuning  
Optimize model performance by searching over hyperparameter values, for example with Grid Search or Random Search, scoring each candidate with cross-validation.  
  
8. Deployment and Monitoring  
Deploy the model and monitor performance over time.  
  
This structured workflow ensures clarity, reproducibility, and effective model development.
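
The workflow can be sketched end to end in a few lines. This is one possible arrangement, not a prescribed recipe: the Iris dataset, the logistic regression model, and the grid of `C` values are all assumptions made for the example. Deployment and monitoring (step 8) are outside what a short snippet can show.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: load a small built-in dataset (chosen only for illustration).
X, y = load_iris(return_X_y=True)

# Step 3: hold out a test set; scaling lives inside the pipeline so that
# it is fit on the training folds only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 4-5: choose a model and wrap it with its preprocessing.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 7: grid search over the regularization strength with 5-fold CV.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate the tuned model once on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("best C:", search.best_params_["clf__C"])
print("test accuracy:", test_accuracy)
```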

11. Why do we have to perform EDA before fitting a model to the data?

Ans:  
Performing Exploratory Data Analysis (EDA) before fitting a model is important because it helps you understand the data before making assumptions.
  
EDA allows you to:  
* Detect missing values, duplicates, and inconsistencies
* Identify outliers that may affect model performance
* Understand the distribution of variables
* Examine relationships and correlations between features
* Check for data imbalance in classification problems
  
Without EDA, you risk building a model on flawed or misunderstood data, which can lead to poor accuracy, biased results, or overfitting.  
  
In simple terms, EDA ensures the data is clean, meaningful, and suitable before applying any machine learning algorithm.
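
A minimal sketch of these checks with pandas is shown below. The DataFrame is a hypothetical toy dataset with one missing value, one duplicated row, and one implausible age, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 32, 120],
    "income": [40000, 52000, 81000, 61000, 52000, 58000],
    "churned": [0, 0, 1, 0, 0, 0],
})

missing_per_column = df.isna().sum()          # missing values per column
duplicate_rows = int(df.duplicated().sum())   # fully duplicated rows
summary = df.describe()                       # distribution summary
correlations = df.corr()                      # pairwise Pearson correlations
class_balance = df["churned"].value_counts(normalize=True)  # class imbalance

# Flag outliers in "age" with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column, duplicate_rows, class_balance, outliers, sep="\n\n")
```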

12. What is correlation?

Ans:  
Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables.
  
It ranges from –1 to +1:  
* +1 → Perfect positive correlation (both variables increase together)
* 0 → No linear relationship
* –1 → Perfect negative correlation (one increases while the other decreases)
  
Correlation helps understand how strongly two variables are related and whether the relationship is positive or negative.

13. What does negative correlation mean?

Ans:  
A negative correlation means that as one variable increases, the other variable decreases.
For example, if product price increases and demand decreases, this reflects a negative correlation between price and demand.

14. How can you find correlation between variables in Python?

Ans:  
You can find correlation between variables in Python using NumPy or Pandas.

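```python
# Using NumPy
import numpy as np

x = [10, 20, 30, 40, 50]
y = [15, 25, 35, 45, 60]

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation of x and y
correlation = np.corrcoef(x, y)[0, 1]
print("Correlation:", correlation)
```

```
Correlation: 0.995893206467704
```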


```python
# Using pandas
import pandas as pd

data = pd.DataFrame({
    "x": [10, 20, 30, 40, 50],
    "y": [15, 25, 35, 45, 60]
})

# corr() returns the full pairwise correlation matrix
correlation = data.corr()
print(correlation)
```

```
          x         y
x  1.000000  0.995893
y  0.995893  1.000000
```

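`np.corrcoef` computes the Pearson correlation coefficient. `DataFrame.corr()` also uses Pearson by default, and its `method` parameter accepts `"spearman"` or `"kendall"` for rank-based alternatives.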

15. What is causation? Explain difference between correlation and causation with an example.

Ans:  
**Causation** means that one variable directly causes a change in another variable. There is a clear cause-and-effect relationship.

Difference between Correlation and Causation  
* Correlation means two variables move together, but it does not prove that one causes the other.
* Causation means one variable directly influences the other.
  
Example:  
Suppose data shows that ice cream sales and drowning incidents both increase during summer.  
These two variables are positively correlated because they increase at the same time.  
However, ice cream sales do not cause drowning.  
  
The actual cause is a third factor — hot weather — which increases both swimming activity (leading to drowning incidents) and ice cream consumption.

Summary

Correlation = Relationship between variables

Causation = Direct cause-and-effect relationship

**Correlation alone is not enough to conclude causation.**
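
The ice cream example can be simulated to show how a shared cause produces correlation. In the sketch below all numbers are synthetic and the effect sizes are arbitrary assumptions: temperature drives both variables, and neither affects the other. The raw correlation is clearly positive, but it largely disappears once the linear effect of temperature is removed from both variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated confounder: temperature drives both variables (synthetic data).
temperature = rng.normal(0, 1, size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 0.5, size=n)
drownings = 0.5 * temperature + rng.normal(0, 0.5, size=n)

# Raw correlation between the two outcomes.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, confounder):
    """Remove the linear effect of the confounder from values."""
    slope, intercept = np.polyfit(confounder, values, 1)
    return values - (slope * confounder + intercept)

# Correlation after controlling for temperature (a partial correlation).
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("raw correlation:", raw_corr)
print("after controlling for temperature:", partial_corr)
```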


16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

Ans:  
An optimizer is an algorithm used in machine learning and deep learning to adjust the model’s parameters (weights and biases) in order to minimize the loss function. It updates parameters step-by-step so that the model’s predictions become more accurate.
  
Common Types of Optimizers  
1. Gradient Descent (Batch Gradient Descent):
* This optimizer calculates the gradient using the entire training dataset before updating the parameters.
* Stable but can be slow for large datasets.
* Example: Linear regression training using full dataset at each step.

2. Stochastic Gradient Descent (SGD):
* SGD updates parameters using one training example at a time.
* Faster updates.
* Noisier updates, which can help escape shallow local minima.
* Example: Training a logistic regression model using individual samples.

3. Mini-Batch Gradient Descent:
* This is a compromise between Batch Gradient Descent and SGD. It updates parameters using small batches of data.
* Faster than Batch Gradient Descent and more stable than SGD.
* Most commonly used in practice.
* Example: Neural network training with batch size = 32.

4. Momentum:
* Momentum improves gradient descent by adding a fraction of the previous update to the current one.
* Helps accelerate convergence.
* Reduces oscillations.
* Example: Deep neural network training with faster convergence.

5. RMSProp:
* RMSProp adjusts the learning rate based on recent gradient magnitudes.
* Works well for non-stationary problems.
* Often used in recurrent neural networks.

6. Adam (Adaptive Moment Estimation):
* Adam combines Momentum and RMSProp.
* Adapts learning rate automatically.
* Fast convergence.
* Most widely used optimizer in deep learning.
* Example: Training convolutional neural networks in image classification tasks.
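
To make the update rules concrete, here is a simplified NumPy sketch that fits a line to synthetic data (y = 2x + 1 plus noise) with each optimizer. It is a teaching sketch under stated assumptions, not how a deep learning library implements them: the helper names (`train`, `plain_step`, and so on), learning rates, and epoch count are all choices made for this example. Batch gradient descent, SGD, and mini-batch gradient descent share the same update and differ only in batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from y = 2x + 1 plus small noise (invented for this sketch).
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + rng.normal(0, 0.05, size=200)

def gradient(params, xb, yb):
    """Gradient of mean squared error for the line w*x + b on one batch."""
    w, b = params
    error = w * xb + b - yb
    return np.array([2 * np.mean(error * xb), 2 * np.mean(error)])

def train(update, batch_size, epochs=200, seed=0):
    """Mini-batch loop: batch_size=len(X) is batch GD, batch_size=1 is SGD."""
    shuffle_rng = np.random.default_rng(seed)
    params, state = np.zeros(2), {}
    for _ in range(epochs):
        order = shuffle_rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            params = update(params, gradient(params, X[idx], y[idx]), state)
    return params

def plain_step(params, grad, state, lr=0.1):
    # Vanilla gradient descent: step against the gradient.
    return params - lr * grad

def momentum_step(params, grad, state, lr=0.05, beta=0.9):
    # Keep a running "velocity" that carries over a fraction of past updates.
    state["v"] = beta * state.get("v", 0.0) + grad
    return params - lr * state["v"]

def rmsprop_step(params, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    # Scale each step by a running average of recent squared gradients.
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * grad ** 2
    return params - lr * grad / (np.sqrt(state["s"]) + eps)

def adam_step(params, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Momentum-style first moment plus RMSProp-style second moment,
    # both bias-corrected for their zero initialization.
    state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * grad
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return params - lr * m_hat / (np.sqrt(v_hat) + eps)

results = {
    "batch GD": train(plain_step, batch_size=len(X)),
    "SGD": train(plain_step, batch_size=1),
    "mini-batch GD": train(plain_step, batch_size=32),
    "momentum": train(momentum_step, batch_size=32),
    "RMSProp": train(rmsprop_step, batch_size=32),
    "Adam": train(adam_step, batch_size=32),
}
for name, (w, b) in results.items():
    print(f"{name:14s} w={w:.3f} b={b:.3f}")
```

Each run should recover a slope near 2 and an intercept near 1; the optimizers differ mainly in how many updates they need and how noisy the path is.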


17. What is sklearn.linear_model?


Ans:  
`sklearn.linear_model` is a module in the Scikit-learn library that provides tools for implementing linear models used in regression and classification tasks.

It includes algorithms such as:

* Linear Regression – for predicting continuous values

* Logistic Regression – for classification problems

* Ridge and Lasso Regression – linear regression with L2 and L1 regularization, respectively

* ElasticNet – combines the L1 and L2 penalties of Lasso and Ridge
 
These models assume a linear relationship between input features and the target variable. The module is widely used because it is simple, efficient, and effective for many real-world machine learning problems.
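
A short sketch comparing three of these estimators is shown below. The data is synthetic, with only the first of three features affecting the target, and the `alpha` values are arbitrary choices for the example. Ridge shrinks the coefficients, while Lasso with a strong enough penalty sets the irrelevant ones to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first feature drives the target (synthetic data for this sketch).
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

ols = LinearRegression().fit(X, y)    # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty can zero coefficients out

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```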

18. What does model.fit() do? What arguments must be given?

Ans:  
`model.fit()` is used to train a machine learning model. It allows the model to learn patterns from the training data by adjusting its parameters to minimize the loss function.
  
In simple terms, it teaches the model using the provided data.  

The required arguments are:
* X → The input features (independent variables)  
* y → The target variable (dependent variable)  
  
Some models may also accept additional optional arguments (such as sample weights), but X and y are the essential inputs for supervised learning models.

19. What does model.predict() do? What arguments must be given?

Ans:  
`model.predict()` is used to generate predictions from a trained machine learning model. It applies the learned parameters to new input data and returns predicted values or class labels.
  
In simple terms, after training with `model.fit()`, we use `model.predict()` to make predictions on unseen data.  

The required argument is:  
* X → The input feature data for which predictions are to be made.
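
The two calls can be seen together in a minimal sketch. The classifier (`LogisticRegression`) and the tiny dataset are assumptions chosen for illustration; any scikit-learn supervised estimator follows the same `fit(X, y)` then `predict(X)` pattern.

```python
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration: small values are class 0,
# large values are class 1.
X_train = [[1], [2], [3], [4], [7], [8], [9], [10]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)   # learn parameters from features and targets

X_new = [[0], [11]]           # unseen inputs
predictions = model.predict(X_new)          # predicted class labels
probabilities = model.predict_proba(X_new)  # class probabilities per row
print(predictions)
print(probabilities)
```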

20. What are continuous and categorical variables?

Ans:  
*Continuous variables* are numerical variables that can take any value within a range, including decimals. They represent measurable quantities.
  
Examples:
* Height
* Weight
* Temperature
* Sales revenue
  
*Categorical variables* are variables that represent distinct groups or categories rather than numeric measurements.  
  
Examples:  
* Gender
* Product category
* Payment method
* Customer segment
  
In summary, continuous variables measure quantities, while categorical variables represent labels or groups.

21. What is feature scaling? How does it help in Machine Learning?

Ans:  
**Feature scaling** is the process of transforming numerical features so that they are on a similar scale or range.
  
In many datasets, features can have different ranges. For example, income may range in thousands, while age ranges between 18 and 60. Without scaling, features with larger values can dominate the model.
  
Common methods:  
  
1. Standardization (Z-score scaling) – rescales each feature to have mean 0 and standard deviation 1

2. Min-Max scaling – rescales values between 0 and 1

**How it helps in Machine Learning:**
  
* Improves model convergence speed

* Prevents features with large ranges from dominating

* Improves performance of distance-based algorithms (KNN, K-Means)

* Important for models trained with gradient descent (such as Logistic Regression and Neural Networks) and for regularized linear models, where the penalty depends on the scale of each feature

22. How do we perform scaling in Python?

Ans:  
We perform scaling in Python using the `sklearn.preprocessing` module from Scikit-learn.

```python
# 1. Standardization (Z-score scaling)

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[100], [200], [300], [400], [500]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)
```

```
[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]
```


* Centers data around mean = 0
* Scales to standard deviation = 1

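```python
# 2. Min-Max scaling

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same data as in the standardization example
data = np.array([[100], [200], [300], [400], [500]])

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)
```

```
[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]
```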


* Rescales values between 0 and 1

Important Note:

Always:

* Fit the scaler on training data
* Transform both training and testing data using the same scaler

This prevents data leakage and ensures proper model evaluation.
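
A minimal sketch of that rule, assuming `StandardScaler` and a made-up single-feature dataset: the scaler's mean and standard deviation come from the training split only, and the test split is transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature with values 100, 200, ..., 1000.
X = np.arange(1, 11, dtype=float).reshape(-1, 1) * 100
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print("training mean used for scaling:", scaler.mean_)
print(X_test_scaled)
```

Because the test data never influences `scaler.mean_` or `scaler.scale_`, the scaled test values are generally not centered at exactly 0, which is expected.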

23. What is sklearn.preprocessing?

Ans:  
`sklearn.preprocessing` is a module in the Scikit-learn library used for data preprocessing and transformation before training machine learning models.

It provides tools to:  

* Scale numerical data (e.g., StandardScaler, MinMaxScaler)
* Encode categorical variables (e.g., LabelEncoder, OneHotEncoder)
* Normalize data
* Binarize features

Preprocessing is important because many machine learning algorithms perform better when the input data is properly scaled and formatted.

24. How do we split data for model fitting (training and testing) in Python?

Ans:  
In Python, we commonly use `train_test_split` from scikit-learn to divide the dataset into training and testing sets.

```python
from sklearn.model_selection import train_test_split

# Example features (X) and target (y)
X = [[1], [2], [3], [4], [5]]
y = [2, 4, 6, 8, 10]

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

25. Explain data encoding?

Ans:  
Data encoding is the process of converting categorical variables into numerical form so they can be used in machine learning models. Most algorithms require numerical input, so categorical data must be transformed before training.