## Feature Engineering


# Q1: What is a parameter?

A parameter is a variable that a machine learning model **learns from the training data** to make predictions.  
Example: In linear regression `y = w*x + b`, `w` (weight) and `b` (bias) are parameters.
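
As a rough illustration of parameters being learned, the sketch below fits scikit-learn's `LinearRegression` to synthetic data generated from `y = 2*x + 1`. The data values are made up for the example; the point is only that the fitted `coef_` and `intercept_` play the roles of `w` and `b`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 2*x + 1 (illustrative values).
X = np.arange(10, dtype=float).reshape(-1, 1)  # shape (10, 1): one feature
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

w = model.coef_[0]      # learned weight
b = model.intercept_    # learned bias
print(f"w = {w:.2f}, b = {b:.2f}")
```

Because the data is noise-free, the learned values should match the weight and bias used to generate it.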


# Q2: What is correlation?
Ans).
Correlation is a statistical measure that describes the strength and direction of a relationship
between two variables. It ranges from -1 to +1. A higher absolute value indicates a stronger relationship.


# Q3: What does negative correlation mean?
Ans).
Negative correlation means that as one variable increases, the other variable tends to decrease.
For example, as study hours increase, the number of mistakes on a test tends to decrease, so these two variables are negatively correlated.
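
A minimal sketch of measuring this with NumPy, assuming a small made-up dataset of study hours and test mistakes:

```python
import numpy as np

# Hypothetical data: hours studied and mistakes made on a test.
study_hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
mistakes = np.array([12, 10, 9, 6, 4, 3], dtype=float)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(study_hours, mistakes)[0, 1]
print(f"r = {r:.3f}")  # negative and close to -1
```

The sign of `r` gives the direction of the relationship and its absolute value gives the strength.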


# Q4: Define Machine Learning. What are the main components in Machine Learning?
Ans).
Machine Learning is a branch of Artificial Intelligence where systems learn patterns from data
to make predictions or decisions without being explicitly programmed.

### Main components:
1. Dataset - The data used for training and testing.
2. Features - Input variables used for predictions.
3. Model - The algorithm that learns patterns from the data.
4. Parameters - Internal variables learned by the model.
5. Hyperparameters - External settings that control the learning process.
6. Evaluation Metrics - Measures that check model performance.


# Q5: How does loss value help in determining whether the model is good or not?
Ans).
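Loss is a numerical measure of how far the model’s predictions are from the actual outcomes.
A lower loss means the predictions fit the data more closely, and training works by minimizing this loss.
To judge whether a model is good, compare the loss on held-out validation or test data with the training loss:
a low training loss paired with a much higher validation loss indicates overfitting.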
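
To make this concrete, here is a small sketch that computes mean squared error (one common loss for regression) for two sets of predictions. The target and prediction values are invented for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
preds_a = [2.5, 5.0, 7.5]   # close to the targets
preds_b = [1.0, 8.0, 4.0]   # far from the targets

loss_a = mean_squared_error(y_true, preds_a)
loss_b = mean_squared_error(y_true, preds_b)
print(f"loss_a = {loss_a:.3f}, loss_b = {loss_b:.3f}")
```

The predictions that sit closer to the targets produce the smaller loss.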

# Q6: What are continuous and categorical variables?
Ans).
Continuous variables are numeric variables that can take any value within a range (e.g., height, temperature).
Categorical variables are variables with distinct categories or labels (e.g., gender, color, country).


# Q7: How do we handle categorical variables in Machine Learning? What are the common techniques?
Ans).
Categorical variables need to be converted into numeric format for ML models. Common techniques include:
1. Label Encoding - Assigns each category a unique integer.
2. One-Hot Encoding - Creates binary columns for each category.
3. Ordinal Encoding - Assigns integers based on order if the categories are ordinal.
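
The three techniques above can be sketched with scikit-learn's encoders. The `colors` and `sizes` arrays are hypothetical sample data, and the category order passed to `OrdinalEncoder` is an assumption about what "small < medium < large" should mean here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

colors = np.array(["red", "green", "blue", "green"])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Label encoding: integers follow the sorted order of the categories.
# (scikit-learn documents LabelEncoder as intended for target labels.)
label_ids = LabelEncoder().fit_transform(colors)

# One-hot encoding: one binary column per category.
# The encoder returns a sparse matrix by default, so convert it for display.
one_hot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

# Ordinal encoding: integers follow the order we declare explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_ids = ordinal.fit_transform(sizes)

print(label_ids)
print(one_hot)
print(size_ids.ravel())
```

One-hot encoding avoids implying an order between categories, which is why it is usually preferred for nominal features such as color.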


# Q8: What do you mean by training and testing a dataset?
Ans).
Training data is used to fit the model, while testing data is used to evaluate
how well the model generalizes to unseen data. Comparing performance on the two sets helps detect overfitting.

# Q9: What is sklearn.preprocessing?
Ans).
sklearn.preprocessing is a module in scikit-learn that provides functions for data preprocessing,
such as scaling, normalization, encoding categorical variables, and transforming features to improve model performance.


- Scaling / Normalization – adjusting numeric features to a similar scale.

- Encoding categorical variables – converting non-numeric data to numeric.

- Handling missing values – imputation (in scikit-learn these tools live in `sklearn.impute`, e.g. `SimpleImputer`, rather than in `sklearn.preprocessing`).

- Transformations – like polynomial features, logarithmic transformations, etc.

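Preprocessing is important because most scikit-learn estimators require numeric input, and many algorithms (for example gradient-based and distance-based ones) perform better when features are on a similar scale.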


# Q10: What is a Test set?
Ans).
A test set is a subset of the dataset that the model has not seen during training.
It is used to evaluate the model’s performance and generalization on unseen data.


# Q11: How do we split data for model fitting (training and testing) in Python?
Ans).
In Python, we can use scikit-learn’s `train_test_split` function:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This splits the data into training (80%) and testing (20%) sets; `random_state` fixes the shuffle so the split is reproducible.
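
A self-contained sketch of the same call, assuming a hypothetical dataset of 100 samples with a balanced binary target. The `stratify=y` argument is an optional addition that keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 3 features, balanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```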


# Q12: How do you approach a Machine Learning problem?
Ans).
1. Understand the problem and objectives.
2. Collect and explore the data.
3. Preprocess the data (cleaning, handling missing values, encoding).
4. Perform exploratory data analysis (EDA) to find patterns.
5. Choose the appropriate ML model.
6. Train and validate the model.
7. Evaluate the model using metrics.
8. Optimize hyperparameters and improve performance.
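
The steps above can be sketched end to end on scikit-learn's built-in iris dataset. The choice of dataset, model, and hyperparameter grid is purely illustrative; a real project would pick these based on the problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: load a small built-in dataset (stands in for real data collection).
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: preprocessing and model bundled in one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 6 and 8: cross-validated search over one hyperparameter.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 7: evaluate once on the held-out test set.
accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, f"test accuracy = {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, so no information from validation data leaks into preprocessing.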


# Q13: Why do we have to perform EDA before fitting a model to the data?
Ans).
EDA (Exploratory Data Analysis) helps understand data distribution, patterns, and relationships.
It identifies missing values, outliers, and correlations, which informs preprocessing and model selection.
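
A few typical first EDA checks with pandas are sketched below. The DataFrame is hypothetical, and the 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one obvious outlier.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [40_000, 52_000, 61_000, 58_000, 49_000, 900_000],
})

print(df.describe())          # distribution summary per column
missing = df.isna().sum()     # count of missing values per column

# Flag outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(missing)
print(outliers)
```

Findings like these feed directly into preprocessing decisions, such as whether to impute `age` or to cap or drop the extreme `income` value.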


# Q14: How can you find correlation between variables in Python?
Ans).
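We can use the pandas `corr()` method:

    import pandas as pd

    # df is assumed to be an existing DataFrame.
    # numeric_only=True skips non-numeric columns, which pandas 2.0+ would otherwise reject.
    correlation_matrix = df.corr(numeric_only=True)
    print(correlation_matrix)

This computes the pairwise Pearson correlation coefficient (the default method) between numeric columns.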


# Q15: What is causation? Explain difference between correlation and causation with an example.
Ans).
Causation means that one variable directly causes a change in another.

Difference:
- Correlation: Measures how two variables move together, but one does not necessarily cause the other.
- Causation: One variable directly affects the other.

Example: Ice cream sales and drowning incidents are positively correlated (more ice cream sold, more drowning),
but ice cream sales do not cause drowning. Hot summer weather is a confounding variable: it increases both ice cream sales and swimming, and more swimming leads to more drownings.
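
The ice cream example can be simulated. In the sketch below, all numbers are invented: a `temperature` variable drives both quantities, and the helper `residuals` (a name chosen for this example) removes the part of each variable explained by temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated confounder: daily temperature drives both quantities.
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10.0 * temperature + rng.normal(0, 10, size=n)
drownings = 0.5 * temperature + rng.normal(0, 1, size=n)

# Raw correlation is strong even though neither variable causes the other.
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, confounder):
    """Remove the part of `values` explained by a straight-line fit on `confounder`."""
    slope, intercept = np.polyfit(confounder, values, deg=1)
    return values - (slope * confounder + intercept)

# After removing the temperature effect, the correlation nearly vanishes.
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]
print(f"raw r = {raw_r:.2f}, after controlling for temperature = {partial_r:.2f}")
```

A strong raw correlation that disappears once the shared cause is accounted for is a typical sign of confounding rather than causation.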




# Q16: What is an Optimizer? What are different types of optimizers? Explain each with an example.
Ans).
An optimizer is an algorithm used to update the model’s parameters (weights and biases) during training
to minimize the loss function. Optimizers adjust the model in the direction that reduces error.

Common types of optimizers:

1. Gradient Descent (GD): Computes the gradient over the full dataset for every update. Stable, but slow on large datasets.
2. Stochastic Gradient Descent (SGD): Updates parameters using one sample at a time. Each update is cheap but noisy.
3. Mini-batch Gradient Descent: Updates parameters using small batches of data. Balances speed and stability.
4. Adam (Adaptive Moment Estimation): Combines momentum with per-parameter adaptive learning rates. Widely used in deep learning.
5. RMSProp: Divides each parameter's learning rate by a moving average of its recent squared gradients. Useful for non-stationary objectives.
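
The first three variants differ only in how many samples feed each update, which a single NumPy function can show. This is a minimal sketch on synthetic data generated from `y = 3*x + 2`. The function name `fit_linear`, the learning rate, and the epoch count are assumptions chosen for the example, not tuned values.

```python
import numpy as np

def fit_linear(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by minimizing MSE with (mini-batch) gradient descent."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (w * X[idx] + b) - y[idx]
            # Gradients of the mean squared error on this batch.
            w -= lr * 2.0 * np.mean(error * X[idx])
            b -= lr * 2.0 * np.mean(error)
    return w, b

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=200)

w_full, b_full = fit_linear(X, y, batch_size=len(X))   # full-batch GD
w_sgd, b_sgd = fit_linear(X, y, batch_size=1)          # SGD
w_mini, b_mini = fit_linear(X, y, batch_size=32)       # mini-batch GD
print(w_full, b_full, w_sgd, b_sgd, w_mini, b_mini)
```

All three should land near `w = 3` and `b = 2`. The full-batch version takes one smooth step per epoch, while the smaller batch sizes take many noisier steps.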

Example (using the Adam optimizer in Keras):

    from tensorflow.keras.optimizers import Adam

    optimizer = Adam(learning_rate=0.001)


# Q17: What is sklearn.linear_model?
Ans).
sklearn.linear_model is a module in scikit-learn that contains classes for linear models.
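These models predict the target from a linear combination of the input features (for classifiers such as LogisticRegression, that linear combination is passed through a sigmoid or softmax to produce class probabilities).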

Examples:
- LinearRegression: For predicting continuous targets.
- LogisticRegression: For classification tasks.
- Ridge, Lasso: Regularized linear models to prevent overfitting.

# Q18: What does model.fit() do? What arguments must be given?
Ans).
model.fit() trains the machine learning model using the training data.
It learns the patterns in the data and updates the model's parameters.

Arguments:
- X: Input features (2D array or DataFrame)
- y: Target variable (1D array or Series)

Example:

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)

# Q19: What does model.predict() do? What arguments must be given?
Ans).
model.predict() uses the trained model to make predictions on new or test data.

Arguments:
- X: Input features for which predictions are required.

Example:

    y_pred = model.predict(X_test)
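
A runnable sketch of the fit-then-predict sequence, with made-up single-feature data. It also shows the input shapes involved: both `fit()` and `predict()` expect `X` as a 2D array of shape `(n_samples, n_features)`, so a 1D feature array is reshaped first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical single-feature data; reshape(-1, 1) turns a 1D array into
# the 2D (n_samples, n_features) layout that fit() and predict() expect.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_train = x.reshape(-1, 1)
y_train = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression()
model.fit(X_train, y_train)

X_new = np.array([[6.0], [7.0]])   # two unseen samples, one feature each
y_pred = model.predict(X_new)      # 1D array with one prediction per row
train_mse = mean_squared_error(y_train, model.predict(X_train))
print(y_pred, train_mse)
```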


# Q20: What are continuous and categorical variables?
Ans).
Continuous variables are numeric variables that can take any value in a range (e.g., height, temperature).
Categorical variables have discrete categories or labels (e.g., color, gender, city).


# Q21: What is feature scaling? How does it help in Machine Learning?
Ans).
Feature scaling is the process of transforming features to a similar scale or range.
It helps gradient-based models converge faster and prevents features with larger magnitudes from dominating distance-based or regularized models. Tree-based models are largely insensitive to feature scale.

Techniques:
- Standardization: Scales data to mean=0, std=1
- Min-Max Scaling: Scales data to a fixed range, usually [0,1]
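
Both techniques reduce to simple per-column formulas, sketched here in NumPy on a hypothetical age/income matrix: standardization is `z = (x - mean) / std` and min-max scaling is `x' = (x - min) / (max - min)`.

```python
import numpy as np

# Hypothetical features on very different scales: age (years) and income.
X = np.array([
    [25.0, 40_000.0],
    [35.0, 60_000.0],
    [45.0, 80_000.0],
])

# Standardization: z = (x - mean) / std, computed per column.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: x' = (x - min) / (max - min), computed per column.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_standardized)
print(X_minmax)
```

After either transformation the two columns are directly comparable, even though income was originally about a thousand times larger than age.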

# Q22: How do we perform scaling in Python?
Ans).

    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Standardization
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Min-Max Scaling
    scaler = MinMaxScaler()
    X_minmax = scaler.fit_transform(X)

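
When the data is split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse its statistics on the test portion. The sketch below assumes a hypothetical random feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse those statistics, no refit
print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence training, which tends to make evaluation results look better than they are.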
# Q23: What is sklearn.preprocessing?
Ans).
sklearn.preprocessing is a module in scikit-learn that provides tools for preparing and transforming data
before feeding it to a machine learning model.

It includes:
- Scaling and normalization
- Encoding categorical variables
- Feature transformations

# Q24: How do we split data for model fitting (training and testing) in Python?
Ans).
We use `train_test_split` from scikit-learn to split data into training and testing sets.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Q25: Explain data encoding.
Ans).
Data encoding is the process of converting categorical (non-numeric) data into numeric format so
that machine learning models can process it.

Common techniques:
1. Label Encoding: Assigns an integer to each category.
2. One-Hot Encoding: Creates binary columns for each category.
3. Ordinal Encoding: Assigns integers based on a defined order.

