# **Assignment Feature Engineering**


### **1. What is a parameter?**

In Machine Learning, a **parameter** is an internal variable the model learns from data.

* Example: In linear regression, the **slope (m)** and **intercept (b)** of the line are parameters.
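
As a minimal sketch of this idea, the snippet below fits a line to toy data generated from y = 2x + 1 (the data values are made up for illustration) and reads back the learned slope and intercept from scikit-learn's `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative values, not from the assignment)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # learned parameter m
intercept = model.intercept_  # learned parameter b
print(slope, intercept)
```

The values are not set by hand; they are estimated from the data during `fit`.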

---

### **2. What is correlation?**

**Correlation** measures the strength and direction of a relationship between two variables.

* Value ranges from **-1 to +1**:

  * +1 = perfect positive correlation
  * 0 = no correlation
  * -1 = perfect negative correlation
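
The two extremes can be reproduced with a short sketch, assuming small made-up arrays and NumPy's `corrcoef` (Pearson correlation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises exactly in step with x: perfect positive correlation
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls exactly as x rises: perfect negative correlation
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

print(r_pos, r_neg)
```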

---

### **3. What does negative correlation mean?**

**Negative correlation** means that as one variable increases, the other decreases.

* Example: More exercise (↑), less body fat (↓).

---

### **4. Define Machine Learning. What are the main components in Machine Learning?**

**Machine Learning** is the science of getting computers to learn from data without being explicitly programmed.

**Main components:**

* **Data** – input for training the model
* **Model** – algorithm used to learn patterns
* **Loss Function** – measures how wrong the model is
* **Optimizer** – adjusts model parameters to reduce loss
* **Training** – process of learning from data
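
The five components can be seen together in a hand-written sketch. It assumes a one-parameter model `y_hat = w * x`, mean squared error as the loss, and plain gradient descent as the optimizer; the data and learning rate are illustrative choices.

```python
import numpy as np

# Data: inputs and targets generated from y = 3x (illustrative)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

# Model: y_hat = w * x, with a single learnable parameter w
w = 0.0

def loss(w):
    # Loss function: mean squared error between predictions and targets
    return np.mean((w * X - y) ** 2)

learning_rate = 0.01
for _ in range(500):                     # Training: repeat the update many times
    grad = np.mean(2 * (w * X - y) * X)  # gradient of the loss with respect to w
    w -= learning_rate * grad            # Optimizer: one gradient descent step

print(w, loss(w))
```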

---

### **5. How does loss value help in determining whether the model is good or not?**

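The **loss value** measures how far the model's predictions are from the actual targets.

* **Lower loss** = predictions closer to the targets, so better performance
* If the loss stays **high** or stops decreasing during training, the model is not learning well. A low training loss with a high test loss suggests overfitting.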
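
To make the comparison concrete, here is a small sketch that scores two hypothetical sets of predictions with mean squared error from `sklearn.metrics`; the numbers are made up.

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 7.0]
pred_a = [2.9, 5.1, 7.2]  # close to the targets
pred_b = [1.0, 8.0, 4.0]  # far from the targets

loss_a = mean_squared_error(y_true, pred_a)
loss_b = mean_squared_error(y_true, pred_b)

# The predictions with the lower loss are the better fit
print(loss_a, loss_b)
```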

---

### **6. What are continuous and categorical variables?**

* **Continuous variable**: Numeric, can take any value within a range (e.g., height, temperature)
* **Categorical variable**: Takes one of a limited set of categories (e.g., gender, color, country)
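
One rough way to separate the two kinds of columns in pandas is by dtype. The sketch below assumes a small made-up DataFrame; note that a numeric dtype does not always mean continuous (an integer ID column, for instance, is numeric but categorical in meaning).

```python
import pandas as pd

# Hypothetical data: one continuous column and one categorical column
df = pd.DataFrame({
    "height_cm": [172.5, 160.2, 181.0],
    "country": ["India", "Japan", "Brazil"],
})

continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols, categorical_cols)
```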

---

### **7. How do we handle categorical variables in Machine Learning? What are the common techniques?**

We convert them into numbers using:

* **Label Encoding** – assigns a unique integer to each category; this implies an order, so it suits ordinal categories or target labels
* **One-Hot Encoding** – creates a separate 0/1 column for each category
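
Both techniques can be sketched with pandas alone, assuming a hypothetical `color` column. The integer codes come from pandas' category dtype, which orders string categories alphabetically.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Label-style encoding: one integer code per category
df["color_code"] = df["color"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(df)
print(one_hot)
```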

---

### **8. What do you mean by training and testing a dataset?**

* **Training set** – Used to train the model (learn patterns)
* **Testing set** – Used to evaluate the model on unseen data

This helps in checking whether the model performs well on new data.

---

### **9. What is `sklearn.preprocessing`?**

It’s a module in **Scikit-learn** used for:

* Scaling data (e.g., StandardScaler, MinMaxScaler)
* Encoding categorical variables
* Normalizing features

Example:

```python
from sklearn.preprocessing import StandardScaler
```
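
As a small sketch of one of these tools, `MinMaxScaler` rescales each column to the range 0 to 1. The one-column array here is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])

# Maps the column minimum to 0 and the column maximum to 1
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax.ravel())
```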

---

### **10. What is a Test set?**

The **Test set** is a portion of the data not used during training.
It's used **only** to evaluate how well the model performs on **new, unseen** data.

---

### **How do we split data for model fitting (training and testing) in Python?**

Use `train_test_split` from Scikit-learn:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

This splits the data into 80% training and 20% testing. Setting `random_state=42` makes the shuffle reproducible.
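
A runnable version, assuming a made-up array of 10 samples with 2 features, shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (illustrative)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 8 rows go to training and 2 rows to testing
print(X_train.shape, X_test.shape)
```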

---

### **How do you approach a Machine Learning problem?**

✅ **Step-by-step approach:**

1. **Understand the problem** – Define objective (classification/regression)
2. **Collect data** – Gather relevant data
3. **Clean and preprocess** – Handle missing values, encode categorical variables
4. **Explore data** – Use visualizations and correlations
5. **Split data** – Train/Test sets
6. **Choose model** – e.g., Decision Tree, SVM, Linear Regression
7. **Train the model** – Fit model on training data
8. **Evaluate the model** – Use metrics like accuracy, MSE, etc.
9. **Tune hyperparameters** – Use grid search (e.g., `GridSearchCV`) with cross-validation
10. **Deploy** – Use model in real-world application
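
Steps 5 to 8 above can be sketched end to end. This is only one possible setup: it assumes the Iris dataset bundled with scikit-learn, a `StandardScaler` for preprocessing, and `LogisticRegression` as the model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data: a small classification dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)

# Split data: hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Choose model: scaling followed by logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train the model on the training set only
model.fit(X_train, y_train)

# Evaluate the model on unseen data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```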



### **11. Why do we have to perform EDA before fitting a model to the data?**

**EDA (Exploratory Data Analysis)** helps you understand the structure, patterns, and quality of the data before modeling.

**Why it's important:**

* Detect **missing values**, **outliers**, or **duplicate rows**
* Understand data **distribution** and **relationships**
* Choose the **right model** or **preprocessing technique**
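
A few of these checks can be sketched in pandas. The DataFrame below is hypothetical and deliberately contains a missing value, a duplicate row, and a suspicious outlier (age 120).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 120],
    "city": ["Pune", "Delhi", "Delhi", "Delhi", "Pune"],
})

missing_per_column = df.isna().sum()    # count of missing values per column
duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row
summary = df.describe()                 # count, mean, std, min, quartiles, max

print(missing_per_column)
print(duplicate_rows)
print(summary)
```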

---

### **12. What is correlation?**

**Correlation** measures the **strength and direction** of the relationship between two variables.

* Ranges from -1 to +1

---

### **13. What does negative correlation mean?**

A **negative correlation** means when one variable increases, the other decreases.

* Example: More hours of study → Less number of mistakes

---

### **14. How can you find correlation between variables in Python?**

Use **Pandas**:

```python
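import pandas as pd

# df is assumed to be an existing pandas DataFrame
df.corr(numeric_only=True)  # pairwise Pearson correlation of the numeric columns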
```

This returns a correlation matrix for the numeric columns (Pearson correlation by default).
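
A self-contained sketch with a made-up DataFrame follows. The `numeric_only=True` argument assumes pandas 1.5 or newer; it tells `corr` to skip the text column.

```python
import pandas as pd

# Hypothetical data: more study hours, fewer mistakes
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "mistakes": [9, 7, 6, 3, 1],
    "name": ["a", "b", "c", "d", "e"],
})

corr_matrix = df.corr(numeric_only=True)
r = corr_matrix.loc["hours_studied", "mistakes"]

print(corr_matrix)
```

Here `r` is strongly negative, and the `name` column does not appear in the matrix.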

---

### **15. What is causation? Explain the difference between correlation and causation with an example.**

**Causation** means **one variable directly affects** another.

✅ **Difference:**

| Aspect  | Correlation                                                                                           | Causation                                 |
| ------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------- |
| Meaning | Variables move together                                                                               | One variable **causes** change in another |
| Example | Ice cream sales ↑ and drowning incidents ↑ in summer; neither causes the other (hot weather drives both) | Exercise causes weight loss               |

🔸 *Correlation does not mean causation.*
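
The point can be illustrated with a simulation. In this hypothetical setup, temperature drives both ice cream sales and drowning incidents, and the two outcomes never influence each other; all coefficients and noise levels are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Common cause (confounder): daily temperature
temperature = rng.uniform(10, 35, size=500)

# Both outcomes depend on temperature, not on each other
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(r)
```

The correlation comes out strongly positive even though, by construction, there is no causal link between the two outcomes.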

---

### **16. What is an Optimizer? What are different types of optimizers?**

An **Optimizer** is an algorithm that updates the model's parameters during training in order to **minimize the loss**.

🔸 **Types of Optimizers:**

1. **Gradient Descent** – Updates parameters using the gradient of the loss over the whole training set
2. **Stochastic Gradient Descent (SGD)** – Updates parameters after each training sample (or small batch)
3. **Adam (Adaptive Moment Estimation)** – Combines momentum and adaptive learning rates
4. **RMSprop** – Adapts the learning rate of each parameter using a running average of squared gradients

📌 **Example using Adam in TensorFlow:**

```python
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
```
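
To show what these update rules do, here is a hand-written NumPy sketch that minimizes the toy loss f(w) = (w - 3)^2 with plain gradient descent and with the Adam update. It is an illustration under assumed hyperparameters, not TensorFlow's implementation.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

# Plain gradient descent: step against the gradient
w_gd = 0.0
for _ in range(200):
    w_gd -= 0.1 * grad(w_gd)

# Adam: running averages of the gradient (m) and the squared gradient (v)
w_adam, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g      # momentum term
    v = beta2 * v + (1 - beta2) * g * g  # adaptive scaling term
    m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w_gd, w_adam)
```

Both runs end close to the minimum at w = 3.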

---

### **17. What is `sklearn.linear_model`?**

`sklearn.linear_model` is a module in Scikit-learn for **linear models**, like:

* **LinearRegression**
* **LogisticRegression**
* **Ridge**, **Lasso**

🔸 Example:

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```
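
The same module also covers classification. A minimal sketch with `LogisticRegression`, assuming a made-up one-feature dataset with two well-separated classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small values belong to class 0, large values to class 1 (illustrative)
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

X_new = np.array([[1.5], [8.5]])
labels = clf.predict(X_new)       # predicted class per row
probs = clf.predict_proba(X_new)  # one probability column per class

print(labels)
print(probs)
```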

---

### **18. What does `model.fit()` do? What arguments must be given?**

`model.fit()` trains the model on the training data.

✅ **Required arguments:**

```python
model.fit(X_train, y_train)
```

* `X_train`: Input features, a 2D array of shape (n_samples, n_features)
* `y_train`: Target variable, one value per sample
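
A common stumbling block is the shape of `X_train`. The sketch below assumes a single made-up feature stored as a 1D array, which has to be reshaped into a column before fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])  # one feature, shape (4,)
y_train = np.array([2.0, 4.0, 6.0, 8.0])

# scikit-learn expects X with shape (n_samples, n_features)
X_train = x.reshape(-1, 1)

model = LinearRegression()
fitted = model.fit(X_train, y_train)  # fit returns the estimator itself

print(X_train.shape, fitted is model)
```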

---

### **19. What does `model.predict()` do? What arguments must be given?**

`model.predict()` makes predictions using the trained model.

✅ **Example:**

```python
y_pred = model.predict(X_test)
```

* `X_test`: Features for which you want to make predictions
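
As a short sketch with made-up numbers, `predict` returns one value per row of the input, and the input must have the same number of columns as the data used in `fit`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data following y = 10x (illustrative)
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([10.0, 20.0, 30.0])
model = LinearRegression().fit(X_train, y_train)

X_test = np.array([[4.0], [5.0]])  # same number of columns as X_train
y_pred = model.predict(X_test)     # one prediction per row

print(y_pred)
```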

---

### **20. What are continuous and categorical variables?**

* **Continuous variable** – Can take any value (e.g., salary, temperature)
* **Categorical variable** – Divided into fixed categories (e.g., gender, state)

---

### **21. What is feature scaling? How does it help in Machine Learning?**

**Feature Scaling** brings the independent variables (features) onto a comparable range.

✅ Why it's needed:

* Many ML algorithms, especially distance-based ones (like SVM, KNN), are sensitive to **feature scales**
* Prevents features with **large values from dominating** distance and gradient calculations
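
The effect on distance-based methods can be seen in a small sketch. It assumes two hypothetical people described by age (years) and salary, with no scaling applied.

```python
import numpy as np

# Columns: [age in years, salary] (hypothetical values)
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

distance = np.linalg.norm(a - b)  # Euclidean distance on raw features
salary_gap = abs(a[1] - b[1])

# The 35-year age gap barely changes the distance; salary dominates
print(distance, salary_gap)
```

Without scaling, a nearest-neighbour search on these features would effectively ignore age.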

---

### **22. How do we perform scaling in Python?**

Using **Scikit-learn's** `StandardScaler` or `MinMaxScaler`:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
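
When the data is split into training and test sets, a common practice is to fit the scaler on the training set only and reuse it on the test set, so no information from the test set leaks into training. A sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```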

---

### **23. What is `sklearn.preprocessing`?**

A **Scikit-learn module** used for:

* **Encoding** (LabelEncoder, OneHotEncoder)
* **Scaling** (StandardScaler, MinMaxScaler)
* **Normalization**

---

### **24. How do we split data for model fitting (training and testing) in Python?**

Using `train_test_split()`:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
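
For classification with imbalanced labels, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch assumes made-up labels with 80% class 0 and 20% class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)  # imbalanced labels (illustrative)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Both splits keep the 80/20 class ratio
print(y_train.mean(), y_test.mean())
```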

---

### **25. Explain data encoding?**

**Data Encoding** converts **categorical data into numeric format** so that ML algorithms can process it.

✅ **Types:**

* **Label Encoding** – Each category becomes a number (e.g., Red → 0, Blue → 1)
* **One-Hot Encoding** – Each category becomes a separate column with 0/1

📌 Example:

```python
from sklearn.preprocessing import OneHotEncoder
```
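
A runnable sketch of both encoders on a made-up list of colours follows. `LabelEncoder` sorts the classes alphabetically, so the codes depend on the category names; scikit-learn intends it for target labels, and `OrdinalEncoder` is the counterpart for feature columns.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["Red", "Blue", "Green", "Blue"])

# Label encoding: Blue -> 0, Green -> 1, Red -> 2 (alphabetical order)
labels = LabelEncoder().fit_transform(colors)

# One-hot encoding: expects a 2D input, returns a sparse matrix by default
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(colors.reshape(-1, 1)).toarray()

print(labels)
print(encoder.categories_)
print(one_hot)
```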
