

## Feature Engineering

# Q1. What is a parameter?
- A parameter is an internal variable or coefficient of a machine learning model that is learned from the training data.
- Example: In Linear Regression, weights and biases (slope & intercept) are parameters.
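
As a minimal sketch of what "learned from the training data" means, the snippet below fits scikit-learn's `LinearRegression` on made-up points that lie exactly on the line y = 2x + 1 and reads the learned slope and intercept back. The data values are purely illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: points on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # learned weight
intercept = model.intercept_  # learned bias
print(slope, intercept)
```

The values in `coef_` and `intercept_` are not set by hand; they are estimated during `fit`, which is what distinguishes parameters from settings chosen before training.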

# Q2. What is correlation? What does negative correlation mean?
- Correlation shows how two variables move in relation to each other.
- Positive Correlation: Both variables increase or decrease together.
- Negative Correlation: One increases while the other decreases.
- Correlation coefficient values:
  - +1: Perfect positive
  -  0: No linear correlation
  - -1: Perfect negative

# Q3. Define Machine Learning. What are the main components in Machine Learning?
- Machine Learning (ML) is the science of enabling machines to learn patterns from data without being explicitly programmed.
- Main Components:
  1. Dataset: Features (X) + Labels (y)
  2. Model/Algorithm: Mathematical system to find patterns
  3. Loss Function: Measures prediction error
  4. Optimizer: Improves parameters by minimizing loss
  5. Training Process: Model learns from data
  6. Evaluation: Measures accuracy/performance

# Q4. How does loss value help in determining whether the model is good or not?
- The loss value indicates the error between predicted and actual outputs.
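- High loss ⇒ Predictions are far from the actual values
- Low loss ⇒ Predictions are close to the actual values
- Training aims to minimize this loss using an optimizer. Comparing the loss on training data with the loss on held-out data shows whether the model also generalizes.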
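
To make the link between loss and prediction quality concrete, here is a small sketch that computes mean squared error (one common loss, chosen here as an assumption) for two hypothetical sets of predictions against the same targets.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_preds = [2.9, 5.1, 7.0]  # small errors
far_preds = [1.0, 8.0, 4.0]    # large errors

loss_close = mse(y_true, close_preds)
loss_far = mse(y_true, far_preds)
print(loss_close, loss_far)
```

The predictions that sit nearer the targets produce the smaller loss, which is the signal an optimizer follows during training.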

# Q5. What are continuous and categorical variables?
- Continuous Variables: Numeric values that can take any value within a range (e.g., salary, height).
- Categorical Variables: Values that represent categories or labels (e.g., color, city, gender).
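
One rough way to separate the two kinds of columns in practice is by dtype, sketched below with pandas on a hypothetical DataFrame. This is only a first pass: a numeric dtype does not guarantee a continuous variable (a numeric ZIP code is still categorical), so the result should be reviewed by hand.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "salary": [52000.0, 61000.5, 48000.0],
    "height_cm": [170.2, 165.0, 180.4],
    "city": ["Pune", "Delhi", "Pune"],
})

continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(continuous_cols, categorical_cols)
```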

# Q6. How do we handle categorical variables in Machine Learning? What are the common techniques?
- Most machine learning models require numeric input, so categorical variables must be encoded first.
- Common Techniques:
  1. Label Encoding
  2. One-Hot Encoding
  3. Ordinal Encoding
  4. Target Encoding

- Example (One-Hot Encoding): create an encoder with `OneHotEncoder()` from `sklearn.preprocessing`, then call `encoder.fit_transform(df[['Category']])` to obtain the encoded data.

# Q7. What do you mean by training and testing a dataset?
- Training Dataset: Data used to fit the model.
- Testing Dataset: Data used to evaluate how well the model performs on unseen data.
- Comparing performance on the two sets helps check whether the model is overfitting (good on training data, poor on test data) or underfitting (poor on both).
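
The sketch below illustrates how a gap between training and testing scores can hint at overfitting. The synthetic data and the choice of an unconstrained `DecisionTreeRegressor` are assumptions made only to provoke the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can memorize the training data, noise included
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on unseen data
print(train_r2, test_r2)
```

A training score that is much higher than the test score suggests the model has fit noise rather than the underlying pattern.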

# Q8. What is sklearn.preprocessing?
- It is a module in Scikit-learn used for data preprocessing.
- Includes:
  - Scaling (StandardScaler, MinMaxScaler)
  - Encoding (LabelEncoder, OneHotEncoder)
  - Normalization (Normalizer), binarization (Binarizer)

# Q9. What is a Test set?
- The Test Set is a part of the dataset (usually 20%-30%) that is kept aside to evaluate the model's final performance after training.
- It simulates real-world prediction scenarios.

# Q10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
- Approach to an ML Problem:
  1. Understand the problem
  2. Load and clean data
  3. Perform EDA
  4. Feature engineering
  5. Split data
  6. Train model
  7. Evaluate and improve
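
The workflow above can be sketched end to end as follows. Everything concrete here is an assumption for illustration: the data is synthetic (`make_classification`), and the scaler, model, and metric are just one reasonable combination.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 2: load data (synthetic stand-in for a real, cleaned dataset)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Step 5: split before fitting anything, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 4 and 6: preprocessing and model combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Step 7: evaluate on held-out data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Wrapping the scaler and the model in a pipeline means the scaler is fit on the training split only, which keeps test data out of the preprocessing step.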

# Q11. Why do we have to perform EDA before fitting a model to the data?
- EDA (Exploratory Data Analysis) helps to:
  - Understand data types
  - Visualize distributions
  - Identify missing values, duplicates, outliers
  - Choose correct preprocessing
  - Spot potential data leakage (e.g., a feature that is derived from the target)
- This improves model reliability and interpretability.
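
A few of these checks can be sketched with pandas as shown below. The DataFrame is hypothetical and deliberately contains one missing value and one duplicated row.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value and a duplicate row
df = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 41],
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
})

print(df.dtypes)      # data types per column
print(df.describe())  # distribution summary for numeric columns

missing_per_column = df.isna().sum()         # missing values per column
duplicate_rows = int(df.duplicated().sum())  # fully duplicated rows
print(missing_per_column, duplicate_rows)
```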

# Q12. What is correlation?
- Correlation is a statistical measure of how two variables move in relation to one another.
- Formula (Pearson): correlation = cov(X, Y) / (std(X) * std(Y))
- Values range from:
  - -1 = perfect negative
  -  0 = no linear correlation
  - +1 = perfect positive
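
As a quick check of the Pearson formula, the sketch below computes it by hand with NumPy on made-up values and compares the result with `np.corrcoef`. Sample statistics (`ddof=1`) are used for both the covariance and the standard deviations so that the two sides of the ratio are consistent.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance and sample standard deviations (ddof=1 in both)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in Pearson correlation for comparison
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```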

# Q13. What does negative correlation mean?
- Negative correlation means that as one variable increases, the other tends to decrease.
- Example: For a fixed distance, as speed increases, travel time decreases (negative correlation).

# Q14. How can you find correlation between variables in Python?
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix (df is an existing DataFrame; only numeric columns are used):
corr_matrix = df.corr(numeric_only=True)

# Display heatmap:
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
```

# Q15. What is causation? Explain difference between correlation and causation with an example.
- Correlation: Two variables move together (they may or may not be causally linked).
- Causation: One variable directly influences another.

- Example:
  - Correlation: Ice cream sales ↑ and drowning cases ↑ (common cause = summer)
  - Causation: Smoking increases the risk of lung cancer (causal effect)
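
The ice cream example can be simulated to show how a common cause produces correlation without causation. The toy model below is entirely an assumption: temperature drives both quantities, and neither affects the other. Removing the linear effect of temperature from each variable (a simple partial correlation) makes the association largely disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed toy model: temperature drives both quantities independently
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=n)
drownings = 0.5 * temperature + rng.normal(0, 1, size=n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(target, predictor):
    """Remove the linear effect of predictor from target."""
    slope, intercept = np.polyfit(predictor, target, deg=1)
    return target - (slope * predictor + intercept)

# Correlation that remains after accounting for temperature
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]
print(raw_corr, partial_corr)
```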

# Q16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
- An Optimizer is an algorithm that updates model weights to minimize loss during training.
- Types:
  1. SGD (Stochastic Gradient Descent): Updates weights using the gradient computed on a single sample or a small mini-batch.
  2. Adam (Adaptive Moment Estimation): Combines momentum with RMSprop-style per-parameter adaptive learning rates.
  3. RMSprop: Scales each update by an exponentially decaying average of squared gradients.

```python
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
```
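
To show how the three update rules differ, here is a simplified NumPy sketch that minimizes the toy loss f(w) = (w - 3)^2. It is not how Keras implements these optimizers. The gradient is exact, so the "stochastic" part of SGD is not simulated, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def run_sgd(w=0.0, lr=0.01, steps=3000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_rmsprop(w=0.0, lr=0.01, rho=0.9, eps=1e-8, steps=3000):
    v = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        v = rho * v + (1 - rho) * g**2
        w -= lr * g / (np.sqrt(v) + eps)
    return w

def run_adam(w=0.0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=3000):
    m, v = 0.0, 0.0  # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_sgd, w_rmsprop, w_adam = run_sgd(), run_rmsprop(), run_adam()
print(w_sgd, w_rmsprop, w_adam)
```

All three should end near the minimum at w = 3. They differ in how the step size is formed: a fixed learning rate for SGD, a rate divided by the root of the running squared-gradient average for RMSprop, and that same scaling applied to a momentum term for Adam.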

# Q17. What is sklearn.linear_model?
- It's a Scikit-learn module with linear models:
  - LinearRegression
  - LogisticRegression
  - Ridge, Lasso

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```
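
As a brief illustration of how the models in this module differ, the sketch below fits `LinearRegression` and `Ridge` on the same synthetic data. The data and the `alpha=10.0` value are assumptions chosen only to make the shrinkage visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha controls the L2 penalty strength

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

The Ridge coefficients have a smaller overall magnitude because the L2 penalty pulls them toward zero.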

# Q18. What does model.fit() do? What arguments must be given?
- It fits/trains the model using the training data.
- Syntax:
```python
model.fit(X_train, y_train)
```
- Arguments:
  - X_train: feature input
  - y_train: target variable

# Q19. What does model.predict() do? What arguments must be given?
- It predicts output for new/unseen data.
- Syntax:
```python
y_pred = model.predict(X_test)
```
- Arguments:
  - X_test: new feature data
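
Putting `fit` and `predict` together, a self-contained sketch could look like this. The numbers are made up and follow y = 3x - 2; note the expected shapes of the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data following y = 3x - 2
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y_train = np.array([1.0, 4.0, 7.0, 10.0])         # shape (n_samples,)

model = LinearRegression()
model.fit(X_train, y_train)       # learns the parameters

X_new = np.array([[5.0], [6.0]])  # must have the same number of features
y_pred = model.predict(X_new)     # one prediction per row
print(y_pred)
```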

# Q20. What are continuous and categorical variables?
- (Repetition of Q5)
- Continuous: Numeric values (e.g., age)
- Categorical: Class labels (e.g., Male, Female)

# Q21. What is feature scaling? How does it help in Machine Learning?
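- Feature Scaling brings independent variables onto a comparable range or distribution.
- Helps models that are sensitive to feature magnitude, such as distance-based models (KNN, SVM) and gradient-based or regularized models (e.g., Logistic Regression).
- Prevents features with larger numeric values from dominating the others.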

# Q22. How do we perform scaling in Python?
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
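# Fit on the training data only, then apply the same scaling to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)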
```
- Other options: MinMaxScaler, RobustScaler
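
The sketch below compares two of these scalers on a small made-up array whose columns differ by three orders of magnitude, to show what each transformation does per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative values)
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0], [4.0, 4000.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)  # rescaled to [0, 1] per column

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```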

# Q23. What is sklearn.preprocessing?
- (Already answered in Q8)
- A module for preprocessing tasks like:
  - Encoding
  - Scaling
  - Normalization
- Imputation of missing values is handled by a separate module, `sklearn.impute` (e.g., `SimpleImputer`).

# Q24. How do we split data for model fitting (training and testing) in Python?
- (Already answered in Q10)
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
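
For classification problems with imbalanced classes, `train_test_split` also accepts a `stratify` argument that keeps class proportions the same in both splits. The sketch below uses made-up labels with an 80/20 class balance to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced illustrative labels: 80 of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_ratio = y_train.mean()  # share of class 1 in the training set
test_ratio = y_test.mean()    # share of class 1 in the test set
print(train_ratio, test_ratio)
```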

# Q25. Explain data encoding?
- Data encoding is the process of converting categorical variables into numeric form.
- Most ML models require it because they operate on numbers.

- Techniques:
  - Label Encoding
  - One-Hot Encoding
  - Binary Encoding
  - Target Encoding

```python
from sklearn.preprocessing import LabelEncoder
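# Note: LabelEncoder is intended for target labels; OrdinalEncoder is the feature-oriented equivalent
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['category'])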
```
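
When a categorical feature has a natural order, an alternative is `OrdinalEncoder` with the order passed explicitly, as in the sketch below. The `size` column and its ordering are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered category
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()
print(df)
```

Without the explicit `categories` argument, "large" would receive the lowest code, which would misrepresent the intended order.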
