# **Assignment Questions - Feature Engineering**

**1. What is a parameter?**
--  
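A parameter is a variable inside a machine learning model whose value the algorithm learns from the data during training. For example, in linear regression, the weights (coefficients) and the intercept are parameters that are adjusted to minimize the error between predictions and actual values. Parameters differ from hyperparameters (such as the learning rate), which are set before training rather than learned.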
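
As a small illustration of learned parameters, the sketch below fits scikit-learn's `LinearRegression` to toy data and reads back the weight and intercept. The data values are made up for this example and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3x + 2 (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3 * X.ravel() + 2

model = LinearRegression()
model.fit(X, y)

# The learned parameters: one weight per feature plus an intercept
print("weight:", model.coef_[0])
print("intercept:", model.intercept_)
```

Because the toy data is noise-free, the fitted weight and intercept should match the values used to generate it.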

**2. What is correlation?**  
--
Correlation measures the strength and direction of the relationship between two variables, showing how changes in one variable are associated with changes in another. It is often quantified using Pearson’s correlation coefficient, which captures linear association and ranges from -1 (perfect negative) through 0 (no linear relationship) to 1 (perfect positive).
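
For reference, the sample Pearson coefficient for paired observations $(x_i, y_i)$, with sample means $\bar{x}$ and $\bar{y}$, is usually written as follows (the symbols are introduced here for illustration):

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$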

**3. What does negative correlation mean?**  
--
Negative correlation means that as one variable increases, the other tends to decrease. For example, if study time and number of errors on a test have a negative correlation, more study time is associated with fewer errors.

**4. Define Machine Learning. What are the main components in Machine Learning?**
--
Machine Learning is a field of computer science where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed. The main components are:
- Data
- Features (variables)
- Model
- Loss function
- Optimizer
- Evaluation metrics

**5. How does loss value help in determining whether the model is good or not?**
--  
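The loss value quantifies how far the model’s predictions are from the actual outcomes, so a lower loss on the same data and loss function means a better fit. During training, the optimizer tries to minimize this value. To judge whether the model is actually good, compare the loss on training data with the loss on held-out data: a low training loss paired with a much higher validation loss indicates overfitting.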
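
To make this concrete, here is a minimal sketch using mean squared error as the loss. The helper `mean_squared_error` and the sample values are defined here for illustration only; any other loss function would serve the same purpose.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_preds = [2.5, 5.0, 7.5]   # errors: 0.5, 0.0, -0.5
poor_preds = [0.0, 9.0, 4.0]    # errors: 3.0, -4.0, 3.0

loss_close = mean_squared_error(y_true, close_preds)
loss_poor = mean_squared_error(y_true, poor_preds)
print(loss_close, loss_poor)
```

The predictions that sit closer to the targets produce the smaller loss.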

**6. What are continuous and categorical variables?**
--  
- Continuous variables can take any numeric value within a range (e.g., height, weight).
- Categorical variables represent discrete categories or groups (e.g., color, gender).

**7. How do we handle categorical variables in Machine Learning? What are the common techniques?**  
--
Categorical variables are usually converted into numerical values using encoding techniques such as:
- Label Encoding (assigns each category a unique integer; because integers imply an order, it suits ordinal categories best)
- One-Hot Encoding (creates one binary column per category)
- Target Encoding, Count Encoding, Binary Encoding, and Hash Encoding (more advanced methods, often used for features with many categories)
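
The first two techniques can be sketched with pandas alone. The `color` column and its values are hypothetical, and pandas category codes stand in here for a dedicated label encoder.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label (integer) encoding: each category gets an integer code.
# String categories are sorted alphabetically: blue=0, green=1, red=2.
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```

Each row of `one_hot` has exactly one active column, whereas `color_label` packs the same information into a single integer column.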

**8. What do you mean by training and testing a dataset?**  
--
Training means fitting the model on one portion of the data (the training set) so it can learn patterns. Testing means evaluating the trained model on a separate portion it has never seen (the test set) to check how well it generalizes.

**9. What is sklearn.preprocessing?**  
--
`sklearn.preprocessing` is a module in the scikit-learn library that provides classes and functions for scaling, encoding, and transforming data before feeding it into a machine learning model, such as `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, and `LabelEncoder`.

**10. What is a Test set?**  
--
A test set is a subset of the data that is kept aside during training and used only to evaluate the final performance of the model.

**11. How do we split data for model fitting (training and testing) in Python?**  
--
We typically use the `train_test_split` function from scikit-learn to randomly split the dataset into training and testing sets, for example:  
```python
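from sklearn.model_selection import train_test_split
# X (features) and y (target) are assumed to exist; hold out 20% for testing, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)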
```

**12. How do you approach a Machine Learning problem?**  
--
- Understand the problem and data
- Clean and preprocess the data
- Engineer features
- Select and train a model
- Evaluate the model
- Tune hyperparameters and iterate
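
The steps above can be sketched end to end with scikit-learn. This is only one possible workflow: the iris dataset, the logistic regression model, and the grid of `C` values are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load the data (iris stands in for a real problem)
X, y = load_iris(return_X_y=True)

# 2. Hold out a test set before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Preprocessing and model in one pipeline, so scaling is fit on training data only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 4. Tune a hyperparameter with cross-validation on the training set
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print("best C:", search.best_params_["clf__C"])
print("test accuracy:", test_accuracy)
```

Wrapping the scaler and the model in a `Pipeline` keeps the cross-validation honest, because each fold refits the scaler on its own training portion.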

**13. Why do we have to perform EDA before fitting a model to the data?**  
--
Exploratory Data Analysis (EDA) helps us understand the data, detect patterns, spot anomalies, and decide on feature engineering steps, which leads to better model performance.

**14. What is correlation?**  
--
Correlation measures the strength and direction of the relationship between two variables, showing how changes in one variable are associated with changes in another. It is often quantified using Pearson’s correlation coefficient, which captures linear association and ranges from -1 (perfect negative) through 0 (no linear relationship) to 1 (perfect positive).

**15. What does negative correlation mean?**  
--
Negative correlation means that as one variable increases, the other tends to decrease. For example, if study time and number of errors on a test have a negative correlation, more study time is associated with fewer errors.

**16. How can you find correlation between variables in Python?**  
--
You can use the `.corr()` method in pandas to compute correlation:
```python
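import pandas as pd
# df is assumed to be an existing DataFrame; numeric_only=True skips non-numeric columns
corr_matrix = df.corr(numeric_only=True)  # pairwise Pearson correlation by default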
```

**17. What is causation? Explain difference between correlation and causation with an example.**  
--
Causation means a change in one variable directly produces a change in another. Correlation means two variables move together, but one doesn’t necessarily cause the other. For example, ice cream sales and drowning deaths are correlated because both rise in summer: hot weather drives both, but eating ice cream doesn’t cause drowning.

**18. What is an Optimizer? What are different types of optimizers? Explain each with an example.**  
--
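An optimizer is the algorithm that updates the model parameters to minimize the loss function. Common types:
- Gradient Descent: Updates parameters in the direction of the negative gradient, scaled by a learning rate. Stochastic and mini-batch variants compute the gradient on a subset of the data at each step.
- Adam: Adaptive Moment Estimation, combines momentum with per-parameter adaptive learning rates.
- RMSprop: Divides each update by a moving average of squared gradients, giving each parameter its own adaptive learning rate.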
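
As an example of the simplest of these, the following is a minimal sketch of plain (full-batch) gradient descent fitting a line with NumPy. The toy data, the learning rate, and the number of steps are assumptions chosen so the example converges. Adam and RMSprop follow the same loop but rescale each update.

```python
import numpy as np

# Toy data from y = 2x + 1 (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.05

for step in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

After enough steps, `w` and `b` approach the slope and intercept used to generate the data.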

**19. What is sklearn.linear_model?**  
--
`sklearn.linear_model` is a module in scikit-learn that provides linear models for regression and classification, such as `LinearRegression`, `LogisticRegression`, `Ridge`, and `Lasso`.

**20. What does model.fit() do? What arguments must be given?**  
--
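`model.fit()` trains the model on the training data, learning its parameters. For supervised models you must pass the features (X) and target (y), e.g., `model.fit(X_train, y_train)`; unsupervised estimators such as clustering models need only X.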

**21. What does model.predict() do? What arguments must be given?**  
--
`model.predict()` uses the trained model to make predictions on new data. You pass the features, e.g., `model.predict(X_test)`.
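
A short sketch of `fit` and `predict` used together is shown below. The synthetic data and the choice of `LinearRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data: y = 4*x1 - 2*x2 + 1 (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = 4 * X[:, 0] - 2 * X[:, 1] + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)        # learns coef_ and intercept_ from the training data
y_pred = model.predict(X_test)     # one prediction per row of X_test

print(y_pred.shape)
```

Note that `predict` takes only features; the true targets `y_test` are used afterwards to score the predictions.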

**22. What are continuous and categorical variables?**  
--  
- Continuous variables can take any numeric value within a range (e.g., height, weight).
- Categorical variables represent discrete categories or groups (e.g., color, gender).

**23. What is feature scaling? How does it help in Machine Learning?**  
--
Feature scaling brings features onto a similar range, usually by min-max normalization (rescaling to the range 0 to 1) or standardization (mean 0, standard deviation 1). It prevents features with large numeric ranges from dominating, which matters most for algorithms that rely on distances or assume centered data, like k-NN, SVM, and PCA[3][6].
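
Written out for a single feature $x$, the two transformations are typically defined as follows, where $x_{\min}$, $x_{\max}$, $\mu$, and $\sigma$ are the feature's minimum, maximum, mean, and standard deviation (notation introduced here for illustration):

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad z = \frac{x - \mu}{\sigma}
$$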

**24. How do we perform scaling in Python?**  
--
Using scikit-learn’s `StandardScaler` or `MinMaxScaler`:
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# X is assumed to be a numeric feature matrix; in practice, fit on the training split only
X_scaled = scaler.fit_transform(X)
```
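
For a fuller picture, the sketch below applies both scalers to a small made-up dataset, fitting on the training rows and reusing the fitted statistics on a test row. The array values are assumptions for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.0, 200.0]])

# Standardization: fit on training data, then reuse the same statistics for test data
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Min-max normalization: rescales each training feature to the range 0 to 1
mm_scaler = MinMaxScaler().fit(X_train)
X_train_mm = mm_scaler.transform(X_train)
X_test_mm = mm_scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print(X_test_mm)
```

Calling `fit` only on the training data keeps information about the test set out of the scaling statistics.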

**25. What is sklearn.preprocessing?**  
--
`sklearn.preprocessing` is a module in the scikit-learn library that provides classes and functions for scaling, encoding, and transforming data before feeding it into a machine learning model, such as `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, and `LabelEncoder`.

**26. How do we split data for model fitting (training and testing) in Python?**  
--
We typically use the `train_test_split` function from scikit-learn to randomly split the dataset into training and testing sets, for example:  
```python
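from sklearn.model_selection import train_test_split
# X (features) and y (target) are assumed to exist; hold out 20% for testing, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)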
```

**27. Explain data encoding.**
--
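Data encoding converts categorical variables into numerical values so that machine learning models can use them. Common methods include Label Encoding and One-Hot Encoding. The encoding should match the variable: integer labels imply an order, so they suit ordinal categories, while one-hot encoding suits unordered ones. To avoid data leakage, fit the encoder on the training data only and reuse it to transform the test data.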
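
One way to keep the encoder from seeing test data is sketched below with scikit-learn's `OneHotEncoder`. The color values are hypothetical, and `handle_unknown="ignore"` is one possible policy for categories that appear only at test time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_colors = np.array([["red"], ["green"], ["blue"]])
test_colors = np.array([["green"], ["purple"]])   # "purple" never appears in training

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

# Unseen categories are encoded as all zeros instead of raising an error
encoded_test = encoder.transform(test_colors).toarray()
print(encoder.categories_[0])
print(encoded_test)
```

The learned categories come only from the training rows, so the unseen value `"purple"` maps to a row of zeros.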

---