# **Feature Engineering**

## **Assignment Questions**

**Q1. What is a parameter?**
  - A **parameter** is a value that the model learns from the training data, such as the weights and bias of a linear model. Parameters determine how the model maps inputs to predictions, and they are adjusted during training to minimize the loss.
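
As a minimal sketch of what "learned from the training data" means, the snippet below fits scikit-learn's `LinearRegression` to toy data and prints its learned parameters. The data values are invented for illustration (generated from y = 3x + 2).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3x + 2 (values invented for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# The learned parameters: slope (coef_) and intercept (intercept_)
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```

The slope and intercept are not set by hand; `fit()` estimates them from `X` and `y`, which is what distinguishes parameters from hyperparameters chosen before training.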


**Q2. What is correlation? What does negative correlation mean?**
  - **Correlation** measures the strength and direction of the relationship between two variables, that is, how they tend to move together. In statistics and Machine Learning, it helps determine whether changes in one variable are associated with changes in another.
 ### **Negative correlation**
  - A negative correlation means that when one variable increases, the other tends to decrease. In other words, they move in opposite directions.

**For example:**
- The more time you spend exercising, the less body fat you might have.
- For a fixed distance, the faster a car travels, the less time it takes to reach its destination.



**Q3. Define Machine Learning. What are the main components in Machine Learning?**
 ### **Machine Learning (ML) Definition**
Machine Learning is a branch of artificial intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed. It allows systems to improve their performance over time by analyzing and learning from new data.

 ### **Main Components of Machine Learning**  
Machine Learning consists of several essential components:

1. **Data** – Raw information used for training the model. Data can be structured (tables, spreadsheets) or unstructured (images, text, videos). Quality data is crucial for accuracy.

2. **Features** – Specific attributes or variables used to make predictions. Selecting the right features improves model performance.

3. **Model** – The mathematical or statistical algorithm that learns patterns in the data. Examples include decision trees, neural networks, and support vector machines.

4. **Training** – The process of feeding data into the model so it can learn. The model adjusts its internal parameters based on patterns it identifies in the data.

5. **Loss Function** – A measure of how far the model's predictions are from the actual values. A lower loss indicates a better fit to the data.

6. **Optimization Algorithm** – Methods like gradient descent that adjust the model parameters to minimize the loss.

7. **Evaluation Metrics** – Measures such as accuracy, precision, recall, and F1-score used to assess model effectiveness.

8. **Deployment** – Applying the trained model to real-world scenarios where it makes predictions or automates tasks.

**Q4. How does loss value help in determining whether the model is good or not?**
  - A lower loss value generally indicates a better-performing model because it signifies that the model's predictions are closer to the actual target values on the training data. However, it's crucial to also consider the loss on a separate validation set to prevent overfitting. A low training loss but high validation loss suggests the model has merely memorized the training data and won't generalize well.
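
To make the comparison concrete, here is a small sketch that computes mean squared error (one common loss) on a training set and a validation set. All targets and predictions are hypothetical numbers invented for illustration; they do not come from a real model.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical targets and predictions, invented for illustration
y_train = [3.0, 5.0, 7.0, 9.0]
pred_train = [3.1, 4.9, 7.0, 9.1]   # predictions close to the training targets
y_val = [11.0, 13.0, 15.0]
pred_val = [9.0, 16.0, 12.0]        # predictions far from the validation targets

train_loss = mse(y_train, pred_train)
val_loss = mse(y_val, pred_val)

print("training loss:  ", train_loss)
print("validation loss:", val_loss)
```

A training loss near zero combined with a much larger validation loss, as in this made-up case, is the overfitting pattern described above.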

**Q5. What are continuous and categorical variables?**
  - In data science and machine learning, variables are broadly classified as **continuous** or **categorical** based on the type of values they represent.

 ### **Continuous Variables**
- These are **numeric** variables that can take **any value within a given interval**, so infinitely many values are possible.
- They are typically **measured** rather than counted.
- Examples:
  - Height (e.g., 170.5 cm)
  - Temperature (e.g., 36.7°C)
  - Time (e.g., 3.75 seconds)
  - Weight (e.g., 62.3 kg)
- Continuous variables can have **decimal or fractional values**. When the target variable is continuous, the task is a regression problem.

 ### **Categorical Variables**
- These are variables that represent **categories** or groups rather than numerical quantities.
- Observations are **counted** per category rather than measured on a numeric scale.
- Examples:
  - Gender (Male, Female, Non-binary)
  - Type of car (SUV, Sedan, Hatchback)
  - Color (Red, Blue, Green)
  - Education Level (High School, Bachelor's, Master's, PhD)
- Categorical variables can be **nominal** (unordered categories) or **ordinal** (ordered categories like "low", "medium", "high").

**Q6. How do we handle categorical variables in Machine Learning? What are the common techniques?**
   - Handling categorical variables is essential because most algorithms work only with numerical data. The categories must be converted into a numeric format before the model can use them, and the choice of encoding can affect accuracy.

 ### **Common Techniques for Handling Categorical Variables**
1. **Label Encoding**  
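   - Assigns an **integer label** to each category (e.g., "Red" → 0, "Blue" → 1, "Green" → 2).
   - Best suited to **ordinal categories** (e.g., "Low", "Medium", "High"), where the integer order is meaningful.
   - Works well with **tree-based models**, but for nominal categories it can mislead linear and distance-based models, which treat the integers as ordered magnitudes.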

2. **One-Hot Encoding**  
   - Creates **binary columns** for each category (e.g., "Red" → [1, 0, 0], "Blue" → [0, 1, 0]).
   - Useful for **nominal categorical variables** (no order).
   - Can increase dimensionality if there are many unique values.

3. **Binary Encoding**  
   - Converts categories into **binary numbers** and then into separate columns.
   - Reduces dimensionality compared to one-hot encoding.
   - Works well for **high-cardinality** categorical variables.

4. **Target Encoding**  
   - Replaces each category with the **mean target value** for that category (e.g., replacing "City" with its average house price).
   - Useful for **high-cardinality** features in supervised learning, especially regression.
   - Requires careful handling to prevent **data leakage**: compute the means on the training data only.

5. **Frequency Encoding**  
   - Replaces categories with their **frequency of occurrence** (e.g., "Category A" → 0.40 if it appears in 40% of the rows).
   - Works well with tree-based models.
   - Does not increase dimensionality, since each feature stays a single column.

6. **Embedding Techniques (for Deep Learning)**  
   - Uses neural networks to create **dense vector representations** of categorical variables.
   - Reduces dimensionality while maintaining information.
   - Especially useful for **natural language processing (NLP)** and complex datasets.
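
Three of the techniques above can be sketched with pandas alone. The DataFrame, its column names, and the category order are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue", "Red", "Blue"],
    "size": ["Low", "High", "Medium", "Low", "High", "Medium"],
})

# Label (ordinal) encoding with an explicit, meaningful order
size_order = {"Low": 0, "Medium": 1, "High": 2}
df["size_label"] = df["size"].map(size_order)

# Frequency encoding: replace each category with its share of the rows
color_freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(color_freq)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```

The explicit `size_order` mapping is used instead of an automatic label encoder so that the integers follow the real order of the categories rather than alphabetical order.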

**Q7. What do you mean by training and testing a dataset?**
  - In machine learning, **training** and **testing** a dataset are crucial steps in building a reliable model.
 ### **Training a Dataset**
- This is the phase where the model learns patterns from labeled data.
- The dataset is used to adjust the model’s parameters, helping it improve predictions.
- Example: If you’re training a model to recognize cats in images, it will analyze cat pictures and learn features like ears, whiskers, and fur patterns.

 ### **Testing a Dataset**
- Once trained, the model is tested on **new, unseen data** to evaluate its accuracy.
- Helps determine how well the model generalizes to different situations.
- Example: After training the cat-recognition model, you test it on fresh images to see if it correctly identifies cats.



**Q8. What is sklearn.preprocessing?**
  - `sklearn.preprocessing` is a module in **scikit-learn**, a popular Python library for machine learning. It provides classes and functions to **prepare and transform** data before feeding it into a machine learning model, for example scaling with `StandardScaler` and `MinMaxScaler` or encoding with `OneHotEncoder` and `LabelEncoder`.


**Q9. What is a Test set?**
  - A test set is a portion of a dataset used to evaluate the performance of a trained machine learning model. It contains unseen data that the model did not encounter during training, allowing us to measure how well the model generalizes to new inputs.


**Q10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?**
 ### **Splitting Data for Model Training & Testing in Python**
In machine learning, data is divided into training and testing sets so that the model is evaluated on data it did not see during training. We typically split the dataset using scikit-learn's `train_test_split()` function from `sklearn.model_selection`.
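
A minimal sketch of such a split is shown below. The arrays are placeholder data, and the 80/20 ratio and `random_state=42` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```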
 ### **Approach to Solving a Machine Learning Problem**
When tackling a machine learning problem, here’s a structured approach:

 ### **1. Define the Problem**
- Understand the problem statement.
- Identify the objective (classification, regression, clustering, etc.).
- Determine the key metrics to evaluate success.

 ### **2. Collect & Prepare the Data**
- Gather relevant data from sources (databases, CSV files, APIs).
- Handle missing values using **imputation techniques**.
- Perform exploratory data analysis (EDA) to understand distributions, correlations, and trends.

 ### **3. Preprocess & Transform Data**
- Standardize or normalize numerical features.
- Encode categorical variables (using Label Encoding, One-Hot Encoding, etc.).
- Engineer new features to create meaningful variables.

 ### **4. Choose a Suitable Model**
- Select appropriate algorithms (e.g., decision trees, neural networks, support vector machines).
- Consider simplicity and interpretability for deployment.

 ### **5. Train the Model**
- Split data into training and test sets.
- Use techniques like **cross-validation** to get a more reliable estimate of performance.
- Optimize hyperparameters using **GridSearchCV** or **RandomizedSearchCV**.

 ### **6. Evaluate Model Performance**
- Use metrics like **accuracy, precision, recall, F1-score, RMSE, or AUC-ROC** to measure effectiveness.
- Analyze the confusion matrix to assess misclassifications.

 ### **7. Improve & Fine-Tune**
- Experiment with different algorithms or architectures.
- Tune hyperparameters for better optimization.
- Use techniques like feature selection to reduce overfitting.

 ### **8. Deploy & Monitor**
- Convert the trained model into an API or integrate it into a system.
- Continuously monitor model predictions in real-world applications.
- Retrain the model when new data is available to maintain performance.
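
The preprocessing, training, tuning, and evaluation steps can be sketched in a few lines. This is only one possible arrangement: the Iris dataset, logistic regression, and the small grid of `C` values are choices made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Collect the data and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Preprocessing and model in one pipeline, so the scaler is fit on training folds only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the regularization strength C with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set
test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("test accuracy:", test_accuracy)
```

Wrapping the scaler and the classifier in a `Pipeline` keeps the test data out of every fitting step, which avoids a common source of data leakage.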


**Q11. Why do we have to perform EDA before fitting a model to the data?**
  - **Exploratory Data Analysis (EDA)** is a crucial step before fitting a machine learning model because it helps us understand the dataset, detect potential issues such as missing values, outliers, and skewed distributions, and make informed decisions about preprocessing. Skipping EDA can lead to poor model performance, inaccurate predictions, and misleading insights.
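
A few pandas calls cover the most common first checks. The DataFrame below is hypothetical and was built to contain a missing value in two columns and one extreme income.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values and an outlier, invented for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [30000, 42000, 58000, 61000, 1000000],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai"],
})

# Summary statistics for numeric columns (count, mean, std, min, quartiles, max)
print(df.describe())

# Number of missing values per column
missing = df.isna().sum()
print(missing)

# Frequency of each category
print(df["city"].value_counts())
```

Here `describe()` would reveal that the maximum income is far above the median, and `isna().sum()` shows which columns need imputation before modeling.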


**Q12. What is correlation?**
   - **Correlation** measures the strength and direction of the relationship between two variables, that is, how they tend to move together. In statistics and Machine Learning, it helps determine whether changes in one variable are associated with changes in another.

**Q13. What does negative correlation mean?**
   - A negative correlation means that when one variable increases, the other tends to decrease. In other words, they move in opposite directions.

  **For example:**
   - The more time you spend exercising, the less body fat you might have.
   - For a fixed distance, the faster a car travels, the less time it takes to reach its destination.


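**Q14. How can you find correlation between variables in Python?**
  - To find the correlation between variables in Python, we use libraries like Pandas, NumPy, or SciPy, for example `DataFrame.corr()` in Pandas, `numpy.corrcoef()` in NumPy, or `scipy.stats.pearsonr()` in SciPy.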
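
As a short sketch, the snippet below computes Pearson correlations with pandas and NumPy. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 8, 6, 5, 3],
})

# Pairwise correlation matrix (Pearson by default)
corr_matrix = df.corr()
print(corr_matrix)

# Correlation between two specific variables with NumPy
r = np.corrcoef(df["hours_studied"], df["hours_gaming"])[0, 1]
print("studied vs gaming:", r)
```

In this made-up data, `hours_studied` and `exam_score` are strongly positively correlated, while `hours_studied` and `hours_gaming` are strongly negatively correlated.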
  


**Q15. What is causation? Explain difference between correlation and causation with an example.**
  - Causation refers to a relationship where one event (the cause) directly brings about another event (the effect). In essence, a change in the first event produces a change in the second. It implies a direct mechanism or process linking the two.

  - **Difference between correlation and causation:**

  - **Correlation:** Correlation describes a statistical relationship between two variables. This means that as one variable changes, the other variable tends to change in a predictable way (either in the same direction for positive correlation, or in the opposite direction for negative correlation). However, correlation does not automatically mean that one variable is causing the other to change.

  - **Causation:** Causation goes a step further and indicates that the change in one variable is the direct result of the change in the other variable. There's an underlying mechanism at play.

  - **Example:** Ice cream sales and drowning incidents tend to rise together, so they are correlated, but eating ice cream does not cause drowning. Hot weather drives both. By contrast, pressing a car's accelerator causes the car to speed up, because there is a direct mechanism linking the two.
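
The simulation below is a rough sketch of correlation without causation. A hidden third variable (temperature) drives two quantities that never influence each other, here labeled ice cream sales and drownings. All coefficients and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden common cause: daily temperature (simulated)
temperature = rng.uniform(10, 35, size=500)

# Both quantities depend on temperature, but not on each other
ice_cream_sales = 2.0 * temperature + rng.normal(0, 3, size=500)
drownings = 0.5 * temperature + rng.normal(0, 1, size=500)

# The two series are strongly correlated...
r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print("correlation:", r)

# ...but once the simulated effect of temperature is removed, the link disappears
ice_resid = ice_cream_sales - 2.0 * temperature
drown_resid = drownings - 0.5 * temperature
r_resid = np.corrcoef(ice_resid, drown_resid)[0, 1]
print("correlation after removing temperature:", r_resid)
```

The strong correlation comes entirely from the shared cause, which is why a correlation alone cannot establish that one variable causes the other.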


**Q16. What is an Optimizer? What are different types of optimizers? Explain each with an example.**
  - An optimizer is an algorithm that adjusts the parameters of a machine learning model to minimize the loss function and improve its performance. It determines how the model learns by updating weights during training.
 ### Types of Optimizers
  Some common optimizers used in machine learning:
  - **1. Gradient Descent**
  - A basic optimization algorithm that moves the model weights in the direction of the negative gradient to minimize loss.
  - **Example:** In linear regression, gradient descent updates the coefficients to reduce prediction errors.


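```python
# Simple gradient descent update rule: weight <- weight - learning_rate * gradient
learning_rate = 0.01
weight = 2.0
gradient = -0.5  # example gradient of the loss with respect to the weight

weight = weight - learning_rate * gradient
print(weight)  # Updated weight
```

Output:

```
2.005
```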
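
A single update only shows the rule. The sketch below repeats it in a loop to fit a line by minimizing mean squared error. The data (generated from y = 2x + 1), the learning rate, and the number of iterations are all choices made for this illustration.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (invented for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # initial parameters
learning_rate = 0.05

for _ in range(2000):
    error = (w * x + b) - y             # prediction error for every sample
    grad_w = 2.0 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)       # d(MSE)/db
    w -= learning_rate * grad_w         # step against the gradient
    b -= learning_rate * grad_b

print("w:", w, "b:", b)
```

Because the gradient is computed over the whole dataset at every step, this is batch gradient descent. The learned `w` and `b` approach the values used to generate the data.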


  - **2. Stochastic Gradient Descent (SGD)**
  - Unlike standard gradient descent, SGD updates the weights using a single example (or a small mini-batch) at a time rather than the whole dataset.
  - **Example:** Widely used to train deep learning models, because each update is cheap to compute.

```python
from tensorflow.keras.optimizers import SGD

# SGD with momentum, which smooths successive updates
optimizer = SGD(learning_rate=0.01, momentum=0.9)
```

  - **3. Adam (Adaptive Moment Estimation)**
  - Combines momentum and adaptive learning rates for efficient optimization.
  - **Example:** Commonly used in deep learning models like CNNs and LSTMs.

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)
```

  - **4. RMSprop (Root Mean Square Propagation)**
  - Adjusts the learning rate of each parameter using a moving average of past squared gradients, which dampens rapid oscillations.
  - **Example:** Used in recurrent neural networks (RNNs).

```python
from tensorflow.keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.001)
```

**Q17. What is sklearn.linear_model?**
  - `sklearn.linear_model` is a module within the scikit-learn library, one of the most widely used machine learning libraries in Python. The module is dedicated to linear models, such as `LinearRegression`, `Ridge`, `Lasso`, and `LogisticRegression`.

**Q18. What does model.fit() do? What arguments must be given?**
   - The `model.fit()` method in scikit-learn is used to train a machine learning model. It takes in training data, adjusts the model’s parameters, and learns patterns to make predictions.

 ### **What does `model.fit()` do?**
- Takes input data (features) and target values (labels).
- Finds the best parameters (e.g., weights in linear models) using an optimization algorithm.
- Stores learned parameters for future predictions (`model.predict()`).

 ### **Required Arguments**
 The required arguments depend on the type of model. For supervised models they are:
 1. **`X` (Features/Input Data)** → Independent variables, typically a 2D array of shape `(n_samples, n_features)`. Categorical features must be encoded as numbers first.
 2. **`y` (Target/Labels)** → Dependent variable (what the model is predicting).

 Unsupervised estimators (e.g., clustering models or scalers) need only `X`.

 ### **Additional Arguments (Optional)**
 Some models allow extra parameters:
 - `sample_weight`: Weights for different samples, useful when some examples are more important.
 - `epochs` and `batch_size`: In deep learning models (e.g., TensorFlow/Keras), these set the number of passes over the training data and the number of samples per gradient update.
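
The sketch below calls `fit()` with and without `sample_weight` on `LinearRegression`. The data are invented, and the last point is a deliberate outlier that the weights switch off.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: the first three points follow y = 2x, the last one is an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 20.0])

# Required arguments only: X and y
plain = LinearRegression().fit(X, y)

# Optional sample_weight: a weight of 0 removes the outlier's influence
weights = np.array([1.0, 1.0, 1.0, 0.0])
weighted = LinearRegression().fit(X, y, sample_weight=weights)

print("slope without weights:", plain.coef_[0])
print("slope with weights:   ", weighted.coef_[0])
```

`fit()` returns the fitted estimator itself, which is why the call can be chained directly after the constructor.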


**Q19. What does model.predict() do? What arguments must be given?**
  - The `model.predict()` method in scikit-learn is used to make predictions based on the trained model. After the model has learned the patterns in the training data, this function applies those learned patterns to new input data.

 ### **What does `model.predict()` do?**
 - Takes new feature values (`X`) and applies the learned relationships from training.
 - Returns predicted values for the given input data.

 ### **Required Arguments**
 The method requires:
 1. **`X` (Feature/Input Data)** → The new data points for which predictions are needed.

 ### **Key Points**
- The input `X` passed to `model.predict()` must have the same number of features (columns), in the same order, as the training data.
- The method is used for both regression (continuous values) and classification (category predictions).
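
Here is a small sketch of `predict()` on a classifier, including what happens when the number of features does not match the training data. The training points are invented and form two well-separated groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data with two features and two well-separated classes
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_train, y_train)

# New data must have the same number of features (2) as the training data
X_new = np.array([[1.5, 1.0], [8.5, 8.5]])
predictions = model.predict(X_new)
print(predictions)

# Passing 3 features instead of 2 raises a ValueError
try:
    model.predict(np.array([[1.0, 2.0, 3.0]]))
    shape_error = False
except ValueError as exc:
    shape_error = True
    print("prediction failed:", exc)
```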




**Q20. What are continuous and categorical variables?**
  - In statistics and data science, variables can generally be classified into **continuous** and **categorical** types based on the nature of their values.

 ### **1️⃣ Continuous Variables**
 - Represent numerical values that can take any value within a given interval, so infinitely many values are possible.
 - These variables can be measured and have meaningful mathematical operations like addition, subtraction, and averaging.
 - **Examples**:
    - Height (e.g., 170.5 cm)
    - Weight (e.g., 65.2 kg)
    - Temperature (e.g., 25.3°C)
    - Income (e.g., ₹50,000)

 ### **2️⃣ Categorical Variables**
 - Represent discrete categories or labels, often describing characteristics or classifications.
 - Can be divided into **nominal** (no natural order) or **ordinal** (have a meaningful order).
 - **Examples**:
   - Nominal (No order):
     - Gender (Male, Female, Other)
     - Eye color (Blue, Brown, Green)
   - Ordinal (Has order):
     - Education level (High School, Bachelor’s, Master’s)
     - Rating (Poor, Average, Good, Excellent)

**Q21. What is feature scaling? How does it help in Machine Learning?**
   - Feature scaling is a technique used in machine learning to normalize or standardize the range of independent variables (features) in a dataset. It puts features on comparable scales so that no feature dominates the model merely because of its units.
 ### **Why Feature Scaling Matters**
- **Improves convergence** → Algorithms like gradient descent converge faster when features are on a similar scale.
- **Enhances performance** → Distance-based and margin-based models (e.g., KNN, SVM) are sensitive to feature scale and usually perform better on scaled features.
- **Prevents dominance** → Large-scale features can overpower smaller-scale ones, leading to biased results.
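
The dominance effect can be seen in a plain distance calculation. In this sketch the three people, and the feature ranges used for min-max scaling, are assumptions invented for illustration.

```python
import numpy as np

# Each point is [age in years, income in rupees]; values invented for illustration
a = np.array([25.0, 50000.0])
b = np.array([60.0, 50500.0])   # very different age, almost the same income
c = np.array([26.0, 52000.0])   # almost the same age, somewhat different income

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# On raw values, income differences swamp age differences
raw_ab = distance(a, b)
raw_ac = distance(a, c)

# Min-max scaling with assumed feature ranges (hypothetical: age 18-80, income 0-200000)
feature_min = np.array([18.0, 0.0])
feature_max = np.array([80.0, 200000.0])

def min_max(v):
    return (v - feature_min) / (feature_max - feature_min)

scaled_ab = distance(min_max(a), min_max(b))
scaled_ac = distance(min_max(a), min_max(c))

print("raw:   a-b =", raw_ab, " a-c =", raw_ac)
print("scaled: a-b =", scaled_ab, " a-c =", scaled_ac)
```

Before scaling, `b` looks closer to `a` than `c` does, purely because income is measured in large numbers. After scaling, the 35-year age gap counts, and `c` becomes the nearer neighbor, which would change the result of a model such as KNN.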


**Q22. How do we perform scaling in Python?**
   - In Python, we perform feature scaling with the scikit-learn library, which provides several preprocessing classes to normalize or standardize data.
 ### **Steps to Perform Scaling in Python**
   - Import Required Libraries

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
```

  - Create Sample Data

```python
X = np.array([[10], [20], [30], [40], [50]])  # Example feature values
```

 ### Apply Different Scaling Techniques
  - **Standardization (Z-score normalization):** rescales each feature to zero mean and unit variance.

```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Standardized:\n", X_scaled)
```

Output:

```
Standardized:
 [[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]
```


  - **Min-Max Scaling (Normalization):** rescales values to the range 0 to 1.

```python
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print("Min-Max Scaled:\n", X_scaled)
```

Output:

```
Min-Max Scaled:
 [[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]
```


  - **Robust Scaling:** uses the median and interquartile range, making it resistant to outliers.

```python
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
print("Robust Scaled:\n", X_scaled)
```

Output:

```
Robust Scaled:
 [[-1. ]
 [-0.5]
 [ 0. ]
 [ 0.5]
 [ 1. ]]
```


**Q23. What is sklearn.preprocessing?**
   - `sklearn.preprocessing` is a fundamental module within the scikit-learn (often abbreviated as sklearn) library in Python. It provides a wide array of tools and functions for data preprocessing, which is a crucial step in preparing raw data for machine learning algorithms.

**Q24. How do we split data for model fitting (training and testing) in Python?**
   - In Python, we can split data for training and testing using the `train_test_split` function from scikit-learn. This ensures the model learns from one portion of the data (training set) and is evaluated on another (testing set) to check its generalization ability.
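
For classification with imbalanced classes, the `stratify` argument of `train_test_split` keeps the class proportions the same in both sets. The sketch below uses placeholder data with an 80/20 class balance. The split ratio and `random_state` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples, 16 of class 0 and 4 of class 1
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# stratify=y preserves the 80/20 class ratio in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```

Without stratification, a small test set could end up with no examples of the rare class, which would make the evaluation unreliable.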


**Q25. Explain data encoding.**
   - Data encoding is the process of converting categorical variables into numerical formats so that machine learning models can understand and process them. Since most ML algorithms work with numerical data, encoding is crucial for handling text or categorical features.
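
As a brief sketch, scikit-learn's `OrdinalEncoder` and `OneHotEncoder` (both in `sklearn.preprocessing`) can perform this conversion. The category values and their order are assumptions made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns, one value per row
sizes = np.array([["Low"], ["High"], ["Medium"], ["Low"]])
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])

# Ordinal encoding with an explicit category order: Low < Medium < High
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_codes = ordinal.fit_transform(sizes)
print(size_codes.ravel())

# One-hot encoding: one column per category, in sorted order (Blue, Green, Red)
one_hot = OneHotEncoder(handle_unknown="ignore")
color_matrix = one_hot.fit_transform(colors).toarray()
print(one_hot.categories_[0])
print(color_matrix)
```

Setting `handle_unknown="ignore"` makes the encoder output an all-zero row for a category it did not see during fitting, instead of raising an error at prediction time.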