# **Feature Engineering Assignment Questions**

What is a parameter?
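- A parameter is a quantity that helps define a system, function, or model and influences its behavior or outcome. In Machine Learning, parameters are the values a model learns from data, such as the weights and bias in linear regression.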

What is correlation?
- Correlation is a statistical measure that shows the strength and direction of the relationship between two variables.

What does negative correlation mean?
- Negative correlation means that when one variable increases, the other tends to decrease (and vice versa).
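
As a small illustration of the sign of a correlation, the sketch below computes Pearson's r with NumPy. The `temperature` and `heating_cost` values are made up for this example and are not taken from any real dataset.

```python
import numpy as np

# Hypothetical data: as temperature rises, heating cost falls.
temperature = np.array([0, 5, 10, 15, 20, 25])
heating_cost = np.array([120, 100, 85, 60, 45, 20])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(temperature, heating_cost)[0, 1]
print(f"r = {r:.3f}")  # a value close to -1 indicates strong negative correlation
```

A value of r near +1 would instead mean the two variables rise together, and a value near 0 would mean no linear relationship.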

Define Machine Learning. What are the main components in Machine Learning?
- Machine Learning is a branch of Artificial Intelligence (AI) that focuses on developing algorithms and models that enable computers to learn patterns from data and make predictions or decisions without being explicitly programmed.
- The main components are the data, a model, a learning algorithm, a loss function, and an evaluation process. Together they let a system learn patterns and make predictions.

How does loss value help in determining whether the model is good or not?
- Loss value helps measure how well a model is performing. A smaller loss means the model is better at predicting, but you must check it on unseen data to ensure the model is truly good.
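
One way to see why the loss has to be checked on unseen data is to compute it on both the training and the test portion. The sketch below uses synthetic data and mean squared error as the loss; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Loss on data the model has seen vs. data it has not seen.
train_loss = mean_squared_error(y_train, model.predict(X_train))
test_loss = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_loss:.3f}, test MSE: {test_loss:.3f}")
```

If the test loss were much larger than the training loss, that would suggest the model has overfit the training data.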

What are continuous and categorical variables?
1. Continuous Variables

- Variables that can take any numerical value within a range (infinitely many possible values).

- They are measured, not counted.

- Can have decimal or fractional values.

2. Categorical Variables

- Variables that represent groups or categories, not numerical measurements.

- Observations are counted per category rather than measured.

- Can be further divided into:

   - Nominal: Categories with no order. (e.g., Colors: Red, Blue, Green)

   - Ordinal: Categories with a logical order. (e.g., Education level: High School < Bachelor < Master < PhD)
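
The distinction above can be sketched in pandas. The column names and values below are hypothetical; the education ordering is the one given in the example.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],               # continuous
    "color": ["Red", "Blue", "Green"],                # nominal
    "education": ["Bachelor", "High School", "PhD"],  # ordinal
})

# Nominal: a plain categorical with no order.
df["color"] = df["color"].astype("category")

# Ordinal: a categorical with an explicit order.
df["education"] = pd.Categorical(
    df["education"],
    categories=["High School", "Bachelor", "Master", "PhD"],
    ordered=True,
)

print(df.dtypes)

# Ordered categoricals support comparisons; unordered ones do not.
below_master = df["education"] < "Master"
print(below_master.tolist())
```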

How do we handle categorical variables in Machine Learning? What are the common techniques?
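- Label Encoding → Maps each category to an integer. Simple, but it can imply an order that does not exist.

- One-Hot Encoding → Creates one binary column per category. Most common, safe for nominal categories.

- Ordinal Encoding → Maps categories to integers that follow their natural order. For ordered categories.

- Frequency/Target Encoding → Replaces each category with its frequency or with the mean target value for that category. Useful for features with many categories, but target encoding can leak the target into the features if it is not computed on training data only.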
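
A minimal sketch of two of these techniques with scikit-learn follows. The sample values are hypothetical, and the explicit category order passed to `OrdinalEncoder` is an assumption based on the education example above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical samples; each column is one feature.
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])
education = np.array([["Bachelor"], ["High School"], ["PhD"], ["Master"]])

# One-hot: one binary column per category (columns sorted: Blue, Green, Red).
onehot = OneHotEncoder()
color_encoded = onehot.fit_transform(colors).toarray()
print(onehot.categories_)
print(color_encoded)

# Ordinal: integers follow the order we pass in, not alphabetical order.
ordinal = OrdinalEncoder(
    categories=[["High School", "Bachelor", "Master", "PhD"]]
)
education_encoded = ordinal.fit_transform(education)
print(education_encoded.ravel())
```

Calling `.toarray()` converts the sparse matrix that `OneHotEncoder` returns by default into a dense array, which is easier to print for a small example.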

What do you mean by training and testing a dataset?
1. Training a Dataset

- Training data is the portion of the dataset used to teach the model.

- The model learns patterns, relationships, and adjusts parameters (like weights in regression or neural networks) using this data.

- Example: In predicting house prices, the model learns from data like

   - Size: 1200 sq ft → Price: ₹50 lakhs

   - Size: 1500 sq ft → Price: ₹65 lakhs
2. Testing a Dataset

- Testing data is a separate portion of the dataset used to evaluate the model’s performance after training.

- The model makes predictions on this unseen data, and we compare predictions with the actual results.

- Example: If the test data says:

   - Size: 1800 sq ft → Actual Price: ₹75 lakhs

   - Model predicts ₹73 lakhs → good performance.

What is sklearn.preprocessing?
- sklearn.preprocessing is the preprocessing module of scikit-learn. It provides tools to scale, normalize, encode, and transform raw data into a form that ML models can work with.

What is a Test set?
- A test set is a portion of the dataset that is kept separate from training data and is used only to evaluate the performance of the trained model.

How do you approach a Machine Learning problem?

- Split data → train and test.

- Preprocess → handle missing values, encode, scale.

- Train model → choose algorithm, fit training data.

- Evaluate → test on unseen data.

- Iterate → improve and deploy.
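
The steps above can be sketched end to end as follows. The Iris dataset, the choice of `LogisticRegression`, and the 80/20 split are assumptions for illustration, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2-3. Preprocess and train. The pipeline fits the scaler on the
# training data only, then fits the classifier on the scaled data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluate on unseen data.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps test data out of the preprocessing step, which avoids a common source of data leakage.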

Why do we have to perform EDA before fitting a model to the data?
- EDA is like getting to know your data before modeling. It ensures:

   - Data is clean and reliable.

   - Features are meaningful.

   - You choose the right preprocessing and model.
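
A few pandas calls cover the most basic EDA checks. The small DataFrame below is hypothetical and exists only to show what each call reports.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with some missing values.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [30000, 52000, 47000, np.nan, 39000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

print(df.dtypes)                  # which columns are numeric or text
missing = df.isna().sum()         # missing values per column
print(missing)
print(df.describe())              # summary statistics for numeric columns
print(df["city"].value_counts())  # category frequencies
```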

What is correlation?
- Correlation is a statistical measure that describes the relationship between two variables—specifically, how one variable changes when the other changes.

What does negative correlation mean?
- Negative correlation occurs when one variable increases while the other variable decreases, and vice versa. In other words, the two variables move in opposite directions.


How can you find correlation between variables in Python?

```python
import pandas as pd

# Example dataset
data = {
    'Hours_Studied': [2, 3, 5, 8, 10],
    'Exam_Score': [50, 60, 65, 80, 95]
}

df = pd.DataFrame(data)

# Compute correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
```

```
               Hours_Studied  Exam_Score
Hours_Studied       1.000000    0.988652
Exam_Score          0.988652    1.000000
```


What is causation? Explain the difference between correlation and causation with an example.
- Causation (or cause-and-effect) means that a change in one variable directly causes a change in another variable. Correlation only says that two variables move together; it does not say why.

1. Correlation but not causation:

- Ice cream sales and drowning incidents both increase in summer.

- They are positively correlated, but ice cream sales do not cause drowning.

- The hidden factor (temperature) causes both.

2. Causation:

- Smoking increases the risk of lung cancer.

- Here, smoking directly contributes to cancer development.
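
The ice cream example can be simulated to show how a hidden factor produces correlation without causation. All numbers below are synthetic assumptions: temperature drives both quantities, and neither quantity affects the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: temperature is the hidden common cause.
temperature = rng.uniform(15, 40, size=500)
ice_cream = 10 * temperature + rng.normal(0, 20, size=500)
drownings = 0.5 * temperature + rng.normal(0, 2, size=500)

# The two outcomes are strongly correlated with each other.
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# After removing the effect of temperature, the correlation largely vanishes.
r_partial = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:               {r_raw:.2f}")
print(f"controlling for temperature:   {r_partial:.2f}")
```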

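What is an Optimizer? What are different types of optimizers? Explain each with an example.
- In Machine Learning (especially in neural networks), an optimizer is an algorithm that adjusts the model’s parameters (weights and biases) to minimize the loss function during training.

- The goal is to find the best set of parameters so the model predicts accurately.

- Think of it as a guide that helps the model “descend” the error surface to reach the minimum loss.

- Common types:

   - Gradient Descent: updates the parameters using the gradient computed over the whole training set.

   - Stochastic Gradient Descent (SGD): updates using one sample or a small mini-batch at a time.

   - Momentum: adds a fraction of the previous update to the current one to smooth and speed up progress.

   - RMSprop: scales each step by a running average of recent squared gradients.

   - Adam: combines momentum with RMSprop-style step scaling.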
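
To make the idea of descending the error surface concrete, here is a minimal sketch of three common optimizers (plain gradient descent, momentum, and Adam) minimizing a toy loss. The loss function (w - 3)^2, the learning rates, and the step counts are all assumptions chosen for illustration; real libraries implement these with more options.

```python
import math

def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(w=0.0, lr=0.1, steps=500):
    # Step directly against the gradient.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum(w=0.0, lr=0.1, beta=0.9, steps=500):
    # Accumulate a velocity so past gradients keep influencing the step.
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)
        w -= lr * v
    return w

def adam(w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    # Track running averages of the gradient (m) and squared gradient (v),
    # correct their startup bias, and scale the step by sqrt(v).
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_gd, w_mom, w_adam = gradient_descent(), momentum(), adam()
print(w_gd, w_mom, w_adam)  # all three should end up close to 3
```

Stochastic gradient descent follows the same update as `gradient_descent`, except that each gradient is estimated from a single sample or a small batch rather than the full dataset.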

What is sklearn.linear_model?
- sklearn.linear_model is the scikit-learn module that provides linear models for regression and classification.

- Includes ordinary least squares (LinearRegression), regularized regression (Ridge, Lasso), and linear classifiers (LogisticRegression, Perceptron).

- Easy to use and interpretable, because each feature gets a single coefficient.
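
A short sketch comparing three estimators from this module on the same synthetic data follows. The data and the `alpha` values are assumptions; only the first feature actually influences the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (assumption): only feature 0 matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4.0 * X[:, 0] + rng.normal(0, 0.1, size=50)

estimators = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),   # L2 penalty shrinks coefficients
    "Lasso": Lasso(alpha=0.1),   # L1 penalty tends to push small ones to zero
}

coefs = {}
for name, est in estimators.items():
    est.fit(X, y)
    coefs[name] = est.coef_
    print(f"{name:17s}", np.round(est.coef_, 3))
```

Reading the printed coefficients is what makes these models interpretable: the first coefficient should be close to 4 and the others close to 0.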

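What does model.fit() do? What arguments must be given?
- In scikit-learn, model.fit() is the method used to train a machine learning model.

- When you call fit(), the model learns patterns from the training data by adjusting its internal parameters (weights, coefficients, biases, etc.) to minimize error.

- Arguments: supervised estimators are called as fit(X, y), where X is the feature matrix of shape (n_samples, n_features) and y holds the target values. Unsupervised estimators and transformers need only X.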

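What does model.predict() do? What arguments must be given?
- In scikit-learn, model.predict() is used to make predictions using a trained model.

- After you have trained a model with model.fit(), calling predict() allows you to input new data and get the model’s predicted outputs.

- Arguments: predict(X) takes only the feature matrix X, with the same number and order of features that were used in fit(). No y is passed.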
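
The two calls can be sketched together on a tiny dataset. The numbers are hypothetical and follow an exact y = 10x relationship so the result is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: X has shape (n_samples, n_features), y has shape (n_samples,).
X_train = np.array([[1], [2], [3], [4]])
y_train = np.array([10, 20, 30, 40])

model = LinearRegression()
model.fit(X_train, y_train)   # learn the coefficient and intercept

# New data must have the same number of features as X_train.
X_new = np.array([[5], [6]])
predictions = model.predict(X_new)
print(predictions)  # approximately [50. 60.]
```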

What are continuous and categorical variables?
1. Continuous Variables

- Definition: Variables that can take any numerical value within a range. They are measured, not counted, and can have decimal/fractional values.

2. Categorical Variables

- Definition: Variables that represent distinct categories or groups. Observations are counted per category rather than measured.

What is feature scaling? How does it help in Machine Learning?
- Feature scaling is the process of rescaling the values of numerical features in a dataset so that they are on a similar scale.

   - Machine Learning algorithms often perform better or converge faster when features have similar ranges, especially if the features vary widely in magnitude.

   - Scaling changes only the units of each feature. The ordering and relative spacing of values within a feature are preserved.
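
Two common ways to rescale a feature are standardization (subtract the mean, divide by the standard deviation) and min-max scaling (map values to the range 0 to 1). The sketch below computes both by hand with NumPy on a small made-up array, to show what the scaling step does to the numbers.

```python
import numpy as np

# Two features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Standardization: each column gets mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each column is mapped to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print("Standardized:\n", X_std)
print("Min-max scaled:\n", X_minmax)
```

After either transformation the two columns become identical, because they differed only in scale.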



How do we perform scaling in Python?

```python
from sklearn.preprocessing import StandardScaler

# Sample data
X = [[1, 100], [2, 200], [3, 300]]

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)
print("Standardized Data:\n", X_scaled)
```

```
Standardized Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
```


What is sklearn.preprocessing?
- sklearn.preprocessing is a module in scikit-learn that provides tools to prepare and transform raw data before feeding it into a Machine Learning model.

   - Raw data often contains different scales, missing values, or categorical labels.
   - Preprocessing makes the data clean, consistent, and suitable for ML algorithms.



How do we split data for model fitting (training and testing) in Python?

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Example dataset
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])  # Features
y = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])          # Labels

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train:", X_train)
print("X_test:", X_test)
```

```
X_train: [[ 6]
 [ 1]
 [ 8]
 [ 3]
 [10]
 [ 5]
 [ 4]
 [ 7]]
X_test: [[9]
 [2]]
```


What is data encoding?
- Data encoding in Machine Learning is the process of converting categorical data into numerical form so that algorithms can process it.

   - Most ML models (like Linear Regression, SVM, Neural Networks) cannot work directly with text or categorical data.

   - Encoding transforms these categories into numbers without losing the meaning of the data.
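
One way to see that encoding keeps the meaning of the data is to encode a column and then decode it again. The sketch below uses scikit-learn's `LabelEncoder` on a hypothetical list of labels.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["cat", "dog", "bird", "dog"]

encoder = LabelEncoder()

# Classes are sorted alphabetically: bird -> 0, cat -> 1, dog -> 2.
encoded = encoder.fit_transform(labels)
print(encoded)

# The mapping is stored in the encoder, so the original labels can be recovered.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Note that `LabelEncoder` is intended for target labels; for input features, `OneHotEncoder` or `OrdinalEncoder` is usually the better fit.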