# Feature Engineering

# Assignment Questions

# 1. What is a parameter?

In feature engineering, a parameter is a value that controls how a feature (column) in the data is changed or processed. Some parameters are chosen by you; others are learned from the data.

🔹Simple Example:

If you divide ages into groups:

    pd.cut(df['age'], bins=3)

Here, `bins=3` is a parameter you choose yourself: it tells `pd.cut` how many groups to make.

In scaling, the min and max values used to scale a feature are parameters learned from the data.
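
As a rough illustration of the two kinds of parameter, the sketch below uses a small made-up `ages` series (an assumption, not data from this assignment). The number of bins is chosen by hand, while `MinMaxScaler` learns the minimum and maximum during `fit` and stores them in `data_min_` and `data_max_`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

ages = pd.Series([18.0, 30.0, 42.0, 66.0])  # hypothetical ages

# Parameter chosen by you: the number of bins
groups = pd.cut(ages, bins=3)
n_groups = len(groups.cat.categories)

# Parameters learned from the data: the observed min and max
scaler = MinMaxScaler().fit(ages.to_numpy().reshape(-1, 1))
learned_min = scaler.data_min_[0]
learned_max = scaler.data_max_[0]

print(n_groups, learned_min, learned_max)
```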

# 2. What is correlation? What does Negative Correlation mean?

### What is Correlation?

**Correlation** is a statistical measure that shows how strongly two variables are related.  
It tells whether an increase in one variable tends to go with an increase or decrease in another. On its own, it does not show that one variable causes the other.

---

### What does Negative Correlation mean?

**Negative correlation** means that when one variable increases, the other tends to decrease.

**Example:**
- If exercise time increases, weight tends to decrease.
- So, exercise time and weight have a negative correlation.

---

### Correlation Values:

- `+1` → Perfect positive correlation  
- `0` → No linear correlation  
- `-1` → Perfect negative correlation
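
To make these values concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper name `pearson_r` and the sample numbers are assumptions made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]              # hypothetical exercise hours
up = [2 * h + 1 for h in hours]      # rises exactly in step with hours
down = [100 - 3 * h for h in hours]  # falls exactly in step with hours

r_up = pearson_r(hours, up)
r_down = pearson_r(hours, down)
print(r_up, r_down)
```

Because both made-up series are exact straight lines, the results sit at the two extremes of the scale; real data usually lands somewhere in between.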

# 3. Define Machine Learning. What are the main components in Machine Learning?

### Classic Definition (Tom Mitchell):

> A computer program is said to learn from **experience (E)** with respect to some class of **tasks (T)** and **performance measure (P)** if its performance at tasks in T, as measured by P, improves with experience E.

- **T (Task):** What the model is trying to do (e.g., classify emails as spam or not).
- **P (Performance):** How well it is doing (e.g., accuracy, precision).
- **E (Experience):** The data it learns from (e.g., past emails and their labels).

---

### Main Components of Machine Learning

1. **Data**
   - The source of experience for the model.
   - Example: Customer data, images, text, etc.

2. **Features**
   - Input variables used for prediction (columns in a dataset).

3. **Model**
   - A mathematical structure that finds patterns in data.

4. **Training**
   - The process of feeding data to the model so it can learn.

5. **Task**
   - The goal the model is trying to accomplish.
   - Example: Predicting house prices.

6. **Performance**
   - The metric to evaluate model success.
   - Example: Accuracy, RMSE, F1 Score.

7. **Experience**
   - Historical data used to improve performance.

8. **Loss Function**
   - Measures how far off the predictions are from actual values.

9. **Optimization Algorithm**
   - Algorithm to reduce loss and improve accuracy (e.g., Gradient Descent).

10. **Prediction**
    - Using the trained model to make future decisions.

---    
# 4. How does loss value help in determining whether the model is good or not?

**Loss value** is a number that shows **how far off** the model’s predictions are from the actual values.

- A **low loss** means the model’s predictions are **close to the actual results** → Good model.
- A **high loss** means the predictions are **far from the actual results** → Poor model.
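
As a small sketch of this idea, the code below compares two sets of made-up predictions using mean squared error, one common loss. The helper `mse` and all the numbers are assumptions for illustration only.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3.0, 5.0, 7.0]
close_preds = [2.5, 5.0, 7.5]  # hypothetical model A: near the actual values
far_preds = [0.0, 9.0, 2.0]    # hypothetical model B: far from the actual values

loss_a = mse(actual, close_preds)
loss_b = mse(actual, far_preds)
print(loss_a, loss_b)
```

Model A ends up with the smaller loss, which matches the rule above.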

---

### Why is it important?

- During training, the model tries to **minimize the loss** using optimization techniques.
- **Loss guides learning**: It helps the model improve step by step by adjusting weights.
- It tells us **how well the model is learning** from the data.

---

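### In short:

> **Lower the loss, better the model’s performance**, provided the loss is also low on data the model has not seen. A low training loss on its own can be a sign of overfitting.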

# 5. What are continuous and categorical variables?                              

#### 🔹 Continuous Variables:
- Variables that can take **any numerical value** within a range.
- They are **measurable** and often include decimal values.

**Examples:**
- Height (in cm)
- Weight (in kg)
- Temperature (in °C)
- Price (in ₹)

#### 🔹 Categorical Variables:
- Variables that represent **categories or groups**.
- They are **not numeric** (or are treated as labels even if numbers are used).

**Examples:**
- Gender (Male, Female)
- Color (Red, Blue, Green)
- Education level (High School, College)
- Zip code (even if numbers, they are categories)

---

###  In short:

| Type             | Description                     | Examples                        |
|------------------|----------------------------------|----------------------------------|
| Continuous        | Numeric, measurable              | Height, weight, salary           |
| Categorical       | Labels or categories             | Gender, city, product type       |

# 6. How do we handle categorical variables in Machine Learning? What are the common techniques?   

### How do we handle Categorical Variables in Machine Learning?

Categorical variables must be **converted into numerical form** before they can be used in most machine learning models.

---

### 🔹 Common Techniques to Handle Categorical Variables:

#### 1. **Label Encoding**

- Converts each category into a **unique integer**.
- `LabelEncoder` numbers categories in sorted (alphabetical) order, so the integers do not reflect any real ranking. Scikit-learn intends it for encoding the **target** column; for ordinal features, use Ordinal Encoding (technique 3).

Example:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    df['category_encoded'] = le.fit_transform(df['category'])

#### 2. **One-Hot Encoding**

- Creates a separate binary column for each category.
- Best for nominal data (no order).

Example:

    pd.get_dummies(df['category'])

#### 3. **Ordinal Encoding**

- Assigns ordered numbers to categories manually.
- Useful when categories have a natural order.

Example:

    education_levels = {'High School': 1, 'Bachelor': 2, 'Master': 3}
    df['education_encoded'] = df['education'].map(education_levels)

#### 4. **Target Encoding (Mean Encoding)**

- Replaces categories with the mean of the target variable.
- Useful for high-cardinality features in supervised learning.
- Compute the means on the training data only; otherwise information about the target leaks into the features.

Example:

    # Example: Average salary per job title
    df['job_encoded'] = df.groupby('job')['salary'].transform('mean')

#### 5. **Frequency Encoding**

- Replaces each category with how frequently it appears.

Example:

    freq = df['category'].value_counts()
    df['category_encoded'] = df['category'].map(freq)


# 7. What do you mean by training and testing a dataset?

In Machine Learning, a dataset is usually **split into two parts**: **training** and **testing** datasets.

### 🔹 Training Dataset:
- Used to **train the model**.
- The model **learns patterns** from this data.
- It includes both input features and the correct output (label).

**Example:**  
Model learns how hours studied affects exam score.

---

### 🔹 Testing Dataset:
- Used to **evaluate the model’s performance**.
- This data is **not shown** to the model during training.
- Helps check if the model can make accurate predictions on new/unseen data.

**Example:**  
Test if the model can predict exam scores for new students.

---

###  In short:
> - **Training set**: For learning  
> - **Testing set**: For checking accuracy  
> Good models perform well on **both** training and testing data.


# 8. What is sklearn.preprocessing?

`sklearn.preprocessing` is a **module in Scikit-learn (sklearn)** that provides tools to **prepare and transform data** before using it in machine learning models.

---

### 🔹 Why is it used?

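Most ML models work better when data is:
- Scaled properly
- Encoded correctly
- Cleaned of missing or inconsistent values

`sklearn.preprocessing` helps with scaling and encoding. Missing values are handled by the separate `sklearn.impute` module (for example, `SimpleImputer`).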

---

###  Common Classes in `sklearn.preprocessing`:

| Class                       | Purpose                                         |
|-----------------------------|-------------------------------------------------|
| `StandardScaler()`          | Scales features to have mean 0, std 1           |
| `MinMaxScaler()`            | Scales data between 0 and 1 (by default)        |
| `LabelEncoder()`            | Converts labels (categories) into numbers       |
| `OneHotEncoder()`           | Converts categories into binary columns         |
| `Binarizer()`               | Converts numerical values to 0 or 1 using a threshold |


# 9. What is a Test set?

### What is a Test Set?

A **test set** is a part of the dataset that is **used to evaluate** the performance of a machine learning model **after training**.

---

### 🔹 Purpose:
- To check how well the model performs on **unseen data**.
- Helps to measure the model’s **accuracy, precision, recall**, etc.
- It is **not used** during training.

---

###  Example:
If you have 1000 data points:
- **80% (800)** → used for training (train set)
- **20% (200)** → used for testing (test set)
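
The 80/20 example can be checked with a short sketch. The 1000 data points here are placeholder numbers (an assumption), and `random_state=0` is only there to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # 1000 hypothetical samples, one feature each
y = np.arange(1000) % 2             # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# No sample should appear in both sets
overlap = set(X_train.ravel()) & set(X_test.ravel())
print(len(X_train), len(X_test), len(overlap))
```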

---

###  In short:
> A **test set** is used to **check if the model works well on new data** that it hasn’t seen before.

# 10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

### How do we split data for model fitting (training and testing) in Python?

We use `train_test_split()` from `sklearn.model_selection`.

####  Example:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- `X` = input features
- `y` = target/output
- `test_size=0.2` means 80% for training, 20% for testing

### How do you approach a Machine Learning problem?

Easy Steps:

1. Understand the problem – What are you trying to predict?
2. Collect data – Get the data you need.
3. Clean the data – Handle missing values, errors.
4. Explore the data – Check patterns using graphs or stats.
5. Preprocess data – Encode categories, scale numbers.
6. Split the data – Train/test split.
7. Choose a model – Like Linear Regression, Decision Tree, etc.
8. Train the model – Fit it on training data.
9. Test the model – Check accuracy on test data.
10. Improve it – Tune parameters if needed.
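
The steps above could be sketched end to end roughly as follows. This is a minimal outline, assuming synthetic data from `make_regression` in place of a real, cleaned, and explored dataset, and a plain linear model as the chosen algorithm.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-4 stand-in: synthetic data instead of a real collected dataset
X, y = make_regression(n_samples=200, n_features=3, noise=1.0, random_state=0)

# Step 6: split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 5, 7, 8: scale, choose a model, and train on the training data only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Step 9: evaluate on the unseen test data
test_r2 = r2_score(y_test, model.predict(X_test))
print(test_r2)
```

Putting the scaler inside a pipeline is one way to make sure preprocessing is fitted on the training data only. Step 10 (tuning) is left out to keep the sketch short.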

# 11. Why do we have to perform EDA before fitting a model to the data?

**EDA (Exploratory Data Analysis)** is the process of understanding the data before applying any machine learning model.

---

###  Reasons to Perform EDA:

1. **Understand the Data**
   - Know what your data looks like (shape, types, range, etc.)

2. **Detect Missing Values**
   - Identify and handle missing or null values.

3. **Find Outliers**
   - Spot unusual values that can affect model performance.

4. **Discover Relationships**
   - Understand how features relate to each other and to the target.

5. **Visualize Data**
   - Use graphs to spot patterns, trends, or imbalances.

6. **Check Data Quality**
   - Fix wrong data types, duplicates, and inconsistent entries.

7. **Choose Right Features**
   - Identify which features are useful and which are not.
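
A few of these checks can be sketched with pandas. The tiny DataFrame below is invented for illustration; it deliberately contains one missing value, one duplicate row, and one extreme salary.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value, a duplicate row, and an outlier
df = pd.DataFrame({
    "age": [25, 32, 32, np.nan, 41, 29],
    "salary": [40000, 52000, 52000, 48000, 1_000_000, 45000],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai", "Pune", "Delhi"],
})

shape = df.shape                              # understand the data
missing = df.isna().sum()                     # detect missing values
n_duplicates = int(df.duplicated().sum())     # check data quality
summary = df.describe()                       # ranges and basic statistics
correlations = df[["age", "salary"]].corr()   # discover relationships

# Find outliers with the 1.5 * IQR rule of thumb
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]

print(shape, n_duplicates, len(outliers))
print(missing)
```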

---

### In short:

> EDA helps you **clean, understand, and prepare** the data so that the model can **learn better and give accurate results**.

# 12. What is correlation?

**Correlation** is a statistical measure that shows the **relationship between two variables**.

- It tells us whether an **increase or decrease in one variable** is associated with an **increase or decrease in another**.

---

###  Correlation Values:

| Correlation Value | Meaning                         |
|-------------------|----------------------------------|
| +1                | Perfect positive correlation     |
| 0                 | No linear correlation            |
| -1                | Perfect negative correlation     |

---

###  In short:

> Correlation shows how **two variables move together** – either in the **same direction** (positive) or **opposite direction** (negative).

# 13. What does negative correlation mean?

**Negative correlation** means that **as one variable increases, the other decreases**.

- The variables move in **opposite directions**.
- The correlation value is **less than 0**, down to **-1**.

---

### Example:

- As **temperature increases**, **sales of jackets decrease**.
- As **exercise time increases**, **body fat percentage may decrease**.

---

###  In short:

> Negative correlation = One goes **up**, the other goes **down**.


# 14. How can you find correlation between variables in Python?

You can use the `.corr()` method of a **Pandas** DataFrame to find the **pairwise correlation between numerical columns** (Pearson correlation by default).

---

###  Example:

    import pandas as pd

    data = {
        'height': [150, 160, 170, 180],
        'weight': [50, 60, 65, 80]
    }

    df = pd.DataFrame(data)

    correlation_matrix = df.corr()
    print(correlation_matrix)

### Visualizing Correlation (Optional):

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
    plt.show()

# 15. What is causation? Explain difference between correlation and causation with an example. 

### What is Causation?

**Causation** means that **one variable directly affects or causes a change in another**.

---

###  Correlation vs Causation

| Aspect       | Correlation                         | Causation                                |
|--------------|-------------------------------------|-------------------------------------------|
| Meaning      | Variables move together             | One variable causes the other to change   |
| Direction    | No direction implied                | Has a clear cause and effect              |
| Proof Needed | Just statistical link               | Needs experiments or strong evidence      |

---

###  Example:

- **Correlation**: Ice cream sales and drowning cases both increase in summer.  
  → They are related (correlated) but one does **not cause** the other.

- **Causation**: Smoking increases the risk of lung cancer.  
  → Smoking is a **cause** of cancer (causation).
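
The ice cream example can be imitated with simulated data. In this sketch, all numbers are invented: a hidden third variable (temperature) drives both quantities, so they correlate strongly even though neither causes the other. Removing the part explained by temperature makes the link nearly vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder: daily temperature drives both quantities
temperature = rng.normal(25.0, 5.0, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0.0, 5.0, n)
drownings = 0.5 * temperature + rng.normal(0.0, 2.0, n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation once the effect of temperature is removed from both series
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(round(raw_corr, 2), round(partial_corr, 2))
```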

---

###  In short:
> - **Correlation** = Things happen **together**  
> - **Causation** = One thing **causes** the other

# 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

### What is an Optimizer?

An **optimizer** is a method used in machine learning (especially deep learning) to **adjust the model's weights** to reduce the **loss/error** during training.

---

###  Why is it important?
- Optimizers help the model **learn faster** and more **accurately**.
- They update weights based on **gradients** from backpropagation.

---

###  Common Types of Optimizers:

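#### 1. **SGD (Stochastic Gradient Descent)**

- Updates weights using the gradient from a single example or a small batch of data.
- Simple and cheap per step, but its noisy updates may take longer to converge.

Example:

    from tensorflow.keras.optimizers import SGD

    optimizer = SGD(learning_rate=0.01)

#### 2. **Momentum**

- Improves SGD by adding a "velocity" term that accumulates past gradients, which damps oscillations and can help it move past shallow local minima.
- Moves faster along directions where the gradients consistently agree.

Example:

    from tensorflow.keras.optimizers import SGD

    optimizer = SGD(learning_rate=0.01, momentum=0.9)

#### 3. **RMSProp**

- Adapts the learning rate for each weight using a running average of squared gradients.
- Often works well for recurrent neural networks (RNNs).

Example:

    from tensorflow.keras.optimizers import RMSprop

    optimizer = RMSprop(learning_rate=0.001)

#### 4. **Adam (Adaptive Moment Estimation)**

- Combines Momentum and RMSProp.
- A widely used default choice that usually converges quickly.

Example:

    from tensorflow.keras.optimizers import Adam

    optimizer = Adam(learning_rate=0.001)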
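
To show what these optimizers do on each step, here is a simplified sketch of the four update rules in plain Python. It minimizes a toy one-weight loss `(w - 3)^2` instead of training a network; the function names, learning rates, and step counts are assumptions chosen for illustration, not the Keras implementations.

```python
from math import sqrt

def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def run_sgd(lr=0.1, steps=500):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)                   # step against the gradient
    return w

def run_momentum(lr=0.1, beta=0.9, steps=500):
    w, v = 0.0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)         # velocity accumulates past gradients
        w += v
    return w

def run_rmsprop(lr=0.01, rho=0.9, eps=1e-8, steps=1000):
    w, s = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g * g     # running average of squared gradients
        w -= lr * g / (sqrt(s) + eps)       # step scaled by recent gradient size
    return w

def run_adam(lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g           # momentum-style average of gradients
        v = b2 * v + (1 - b2) * g * g       # RMSProp-style average of squares
        m_hat = m / (1 - b1 ** t)           # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (sqrt(v_hat) + eps)
    return w

results = {
    "sgd": run_sgd(),
    "momentum": run_momentum(),
    "rmsprop": run_rmsprop(),
    "adam": run_adam(),
}
print(results)
```

All four runs start at `w = 0` and should end close to the minimum at `w = 3`; they differ in how each step is computed from the gradient.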

# 17. What is sklearn.linear_model?

`sklearn.linear_model` is a module in **Scikit-learn** that provides **linear models** for **regression** and **classification** tasks.

---

###  Common Models in `sklearn.linear_model`:

| Model                  | Use Case                         |
|------------------------|----------------------------------|
| `LinearRegression`     | Predicting continuous values     |
| `LogisticRegression`   | Binary or multi-class classification |
| `Ridge`                | Linear regression with L2 regularization |
| `Lasso`                | Linear regression with L1 regularization |
| `ElasticNet`           | Combination of L1 and L2 regularization |
| `SGDClassifier`        | Large-scale classification using SGD |
| `SGDRegressor`         | Large-scale regression using SGD |

---

###  Example: Linear Regression

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

# 18. What does model.fit() do? What arguments must be given?

### What does `model.fit()` do?

The `model.fit()` method is used to **train the machine learning model**.

- It **learns patterns** from the training data (`X_train`, `y_train`).
- After fitting, the model can **make predictions** on new data.

---

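###  Arguments required:

    model.fit(X, y)

- `X`: input features, a 2D array or DataFrame of shape `(n_samples, n_features)`
- `y`: target values, one per sample (not needed for unsupervised estimators such as clustering models)

### Example:

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X_train, y_train)  # trains the model

### In short:

> `model.fit(X, y)` trains the model using the input features and target output.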
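
As a small sketch of what "learning" means here, the code below fits a linear regression to made-up points that lie exactly on the line `y = 2x + 1` (an assumption for illustration). After `fit`, the learned slope and intercept are stored on the model as `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y = 2.0 * X.ravel() + 1.0                   # hypothetical targets on y = 2x + 1

model = LinearRegression()
fitted = model.fit(X, y)   # fit returns the estimator itself

slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```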

# 19. What does model.predict() do? What arguments must be given? 

### What does `model.predict()` do?

`model.predict()` is used to **generate predictions** from a **trained machine learning model**.

- It uses the **learned patterns** from `model.fit()` to predict outcomes on **new/unseen data**.

---

###  Argument:

    model.predict(X)

- `X`: Input features (new data for which you want predictions), with the same columns the model was trained on

### Example:

    predictions = model.predict(X_test)
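
A minimal sketch of prediction on new data, assuming made-up training points that follow `y = 10x`. Note that the new input is 2D (one row per sample) and the output has one prediction per row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([10.0, 20.0, 30.0])   # hypothetical data following y = 10x

model = LinearRegression().fit(X_train, y_train)

X_new = np.array([[4.0], [5.0]])         # two new samples, one feature each
predictions = model.predict(X_new)
print(predictions)
```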

# 20. What are continuous and categorical variables?

###  Continuous Variables:

- These are **numeric values** that can take any value within a range.
- Can be **measured** and have decimals.

**Examples:**  

- Height (170.5 cm)
 
- Weight (65.2 kg)
 
- Temperature (36.6°C)

---

###  Categorical Variables:

- These are **labels or categories** that represent different groups.
- Can be **nominal** (no order) or **ordinal** (with order).

**Examples:**  

- Gender (Male, Female)
  
- Education Level (High School, Bachelor, Master)
 
- City (Delhi, Mumbai, Pune)

---

###  In short:

> - **Continuous** = Numbers you can measure
> - **Categorical** = Names or groups you can label

# 21. What is feature scaling? How does it help in Machine Learning?

**Feature Scaling** is the process of **normalizing or standardizing** the range of independent variables (features) in your dataset.

- It ensures that all features are **on the same scale**.
- Common ranges: **0 to 1** or **mean = 0, std = 1**

---

###  Why is Feature Scaling Important?

- Some machine learning algorithms (like **KNN**, **SVM**, and models trained with **Gradient Descent**) are **sensitive to the scale** of features.
  
- Features with larger values can **dominate** others if not scaled.

- Helps the model **converge faster** and perform **better**.

---

###  Common Scaling Techniques:

| Method           | Description                              |
|------------------|------------------------------------------|
| **Min-Max Scaling** | Scales data to range [0, 1]               |
| **Standardization** | Scales data to mean = 0, std = 1          |

---

###  Example (Min-Max Scaling):

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(X)

###  In short:

> Feature scaling makes all features comparable, often improves accuracy for scale-sensitive models, and speeds up model training.
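
The two scaling formulas can also be written out by hand. This sketch uses a made-up feature vector `x` (an assumption) and applies `(x - min) / (max - min)` for min-max scaling and `(x - mean) / std` for standardization.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical feature values

# Min-max scaling: smallest value maps to 0, largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the population std (ddof=0)
standardized = (x - x.mean()) / x.std()

print(min_max)
print(standardized)
```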

# 22. How do we perform scaling in Python?

We use **scikit-learn**'s preprocessing tools to scale numerical features so that all values are on a similar scale.

---

###  1. Min-Max Scaling (0 to 1)

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(X)

### 2. Standard Scaling (Mean = 0, Std = 1)

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(X)

###  In short:

> Use `MinMaxScaler` or `StandardScaler` from `sklearn.preprocessing` to scale features for better model performance.
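
When the data is also split into training and test sets, a common practice is to fit the scaler on the training part only and reuse it on the test part, so the test data does not influence the scaling. A minimal sketch, assuming random made-up features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics, no refitting

print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

The scaled training columns have mean 0 exactly (up to rounding), while the test columns are only close to 0, because they are scaled with the training statistics.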

# 23. What is sklearn.preprocessing? 

`sklearn.preprocessing` is a **module in Scikit-learn** that provides tools to **prepare and transform your data** before training a machine learning model.

---

###  What Can You Do with It?

You can:
- **Scale features** (e.g., `MinMaxScaler`, `StandardScaler`)
- **Encode categorical data** (e.g., `LabelEncoder`, `OneHotEncoder`)
- **Normalize data** (e.g., `Normalizer`)

Missing values are handled by the separate `sklearn.impute` module (e.g., `SimpleImputer`).

---

###  Example:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(X)

###  In short:

> `sklearn.preprocessing` is used to scale, encode, and transform data before feeding it to a model.


# 24. How do we split data for model fitting (training and testing) in Python?

We use the `train_test_split()` function from **scikit-learn** to split the dataset into **training** and **testing** parts.

---

###  Why split the data?

- **Training set**: used to train the model.
- **Test set**: used to evaluate the model on unseen data.

---

###  Example:

    from sklearn.model_selection import train_test_split

    # X = features, y = target/label
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

###  Parameters:

| Parameter         | Description                         |
|-------------------|-------------------------------------|
| `test_size=0.2`   | 20% of data for testing             |
| `random_state=42` | Ensures the split is reproducible   |

###  In short:

> Use `train_test_split()` to divide your data into training and testing sets for building and evaluating models.
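
For classification with imbalanced classes, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. A small sketch with made-up labels (80% class 0, 20% class 1):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)    # 100 hypothetical samples
y = np.array([0] * 80 + [1] * 20)    # hypothetical imbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

test_counts = np.bincount(y_test)    # how many of each class landed in the test set
print(test_counts)
```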


# 25. Explain data encoding?

**Data Encoding** is the process of **converting categorical (text) data into numeric form** so that machine learning models can understand and use it.

Most ML algorithms work only with **numerical values**, not strings or labels.

---

###  Why is Encoding Needed?

- Categorical data like "Male"/"Female" or "Red"/"Blue" must be turned into numbers.
- Helps the model interpret and learn from the data.

---

###  Common Encoding Techniques:

| Technique              | Use Case                                                       |
|------------------------|----------------------------------------------------------------|
| **Label Encoding**     | For target labels (integers follow sorted order, not a ranking) |
| **One-Hot Encoding**   | For **nominal** data (no order)                                |
| **Ordinal Encoding**   | For **ordinal** data, when order matters (manually assigned)   |
| **Target Encoding**    | Uses target variable (e.g., mean)                              |
| **Frequency Encoding** | Based on count of categories                                   |

---

###  Example: One-Hot Encoding

    import pandas as pd

    df = pd.DataFrame({'color': ['red', 'blue', 'green']})
    encoded = pd.get_dummies(df['color'], dtype=int)  # dtype=int gives 0/1 instead of True/False
    print(encoded)

Output:

       blue  green  red
    0     0      0    1
    1     1      0    0
    2     0      1    0

###  In short:

> Data encoding converts text labels into numbers so machine learning models can use them.

