# Feature Engineering : Theoretical Questions

1. What is a parameter?
  - A parameter is like a variable that we use to give information to a function.
  - A parameter is a placeholder inside a function. It helps the function take input and do something useful with it.
  - (For Machine Learning) A parameter is something that a machine learning model learns by itself from the training data.
  - In other words, parameters are the internal values (such as weights and biases) that a model learns during training to make accurate predictions.
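
A minimal sketch of both meanings, using made-up data. The function `add_tax` and its numbers are hypothetical; `coef_` and `intercept_` are the attributes where scikit-learn's `LinearRegression` stores what it has learned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Programming sense: `price` and `rate` are parameters of the function
def add_tax(price, rate):
    return price * (1 + rate)

total = add_tax(100, 0.5)  # 100 and 0.5 are the arguments passed in

# Machine learning sense: the model learns its parameters from data
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # made-up inputs
y = np.array([3.0, 5.0, 7.0, 9.0])          # generated as y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # learned slope and intercept, close to 2 and 1
```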

2. What is correlation, and what does negative correlation mean?
  - Correlation is a way to measure how two things are related - basically, how one thing changes when another thing changes.
  - Correlation is usually reported as a coefficient (most often Pearson's r) between -1 and 1 :
      - +1 means a perfect positive relationship (both go up together).
      - -1 means a perfect negative relationship (one goes up, the other goes down).
      - 0 means no linear relationship (the variables can still be related in a non-linear way).
  - A negative correlation means : When one thing increases, the other tends to decrease.
  - Negative correlation = Opposite movement between two variables.
  - Example : As temperature increases, people buy less coffee. So there's a negative correlation between them.
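
To make the coffee example concrete, here is a small sketch with invented numbers (the temperatures and sales figures are assumptions, not real data). It uses NumPy's `corrcoef` to compute the correlation coefficient.

```python
import numpy as np

# Made-up daily observations: temperature (in °C) and cups of coffee sold
temperature = np.array([10, 15, 20, 25, 30])
coffee_sold = np.array([95, 80, 70, 50, 40])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the correlation between the two arrays
r = np.corrcoef(temperature, coffee_sold)[0, 1]
print(round(r, 3))  # negative and close to -1
```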

3. Define Machine Learning. What are the main components in Machine Learning?
  - Machine Learning (ML) is a type of technology where computers learn from data.
  - Instead of writing fixed rules, we give the computer lots of data and it figures out the patterns on its own.
  - Machine learning is teaching computers to learn from data and make predictions or decisions without being explicitly programmed for each task.
  - Main Components of Machine Learning :
      1. Data : The raw information (like images, numbers or text) that the machine uses to learn.
      2. Features : The input values or characteristics taken from the data.
      3. Model : The brain of the system. It uses features to make predictions.
      4. Algorithm : The method used to train the model. It decides how the model learns from the data.
      5. Training : Feeding data into the model so it can learn patterns.
      6. Prediction : Once trained, the model makes guesses or decisions on new data.
      7. Evaluation : Checking how good the model's predictions are.

4. How does loss value help in determining whether the model is good or not?
  - The loss value is like a score that tells us how wrong our model is.
  - When a model makes a prediction, we compare it with the actual answer.
  - A loss function turns the difference between the predictions and the actual values into a single number, called the loss.
  - The higher the loss, the worse the model is doing.
  - The lower the loss, the better the model fits the data it is evaluated on.
  - The goal is to minimize the loss, meaning the predictions get closer and closer to the actual values.
  - In simple words :    
      - Loss = Mistake.
      - Low loss = Good model.
      - High loss = Bad model (needs more training or fixing).
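
As an illustration, the sketch below computes one common loss, the mean squared error, for a good and a bad set of predictions. The helper `mean_squared_error` is written here by hand and all values are made up; other loss functions follow the same "lower is better" idea.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

actual = [3.0, 5.0, 7.0]
good_predictions = [3.0, 5.0, 8.0]   # off by 1 on one example
bad_predictions = [6.0, 1.0, 10.0]   # far off on every example

loss_good = mean_squared_error(actual, good_predictions)
loss_bad = mean_squared_error(actual, bad_predictions)
print(loss_good, loss_bad)  # the second value is much larger
```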

5. What are continuous and categorical variables?
  - In machine learning, we use different types of data and two common ones are :
  - Continuous Variables :    
      - These are numeric values that can be measured and can take any value within a range, so there are infinitely many possible values.
      - We can do math with them (like add, subtract, average).
      - They can have decimals or fractions.
      - Examples : Height (e.g., 5.6 feet), Temperature (e.g., 22.3°C), Salary (e.g., 45,000) etc.
  - Categorical Variables :    
      - These are labels or categories that represent groups or types.
      - They are not measured but classified.
      - Arithmetic on them is not meaningful, although we can count how often each category appears.
      - Examples : Gender (Male, Female), Colors (Red, Blue, Green), Country (India, USA, Japan) etc.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
  - Most machine learning models can't understand words or labels directly - they need numbers.
  - So, we convert categorical variables (like "Red", "Blue", "Green") into numbers using different techniques.
  - Common Techniques to Handle Categorical Variables :
  1. Label Encoding :    
      - Assigns a number to each category.
      - Example : Red = 0, Blue = 1, Green = 2
      - Good for ordered categories (like Low, Medium, High).
      - (Disadvantage) May confuse the model if the categories don't have a real order.
  2. One-Hot Encoding :    
      - Creates a new column for each category.
      - Puts 1 where the category exists and 0 elsewhere.
      - Good for unordered categories.
      - (Disadvantage) Can create many columns if there are lots of categories.
  3. Ordinal Encoding :    
      - Similar to label encoding but used when the categories have a natural order.
      - Example : Low = 1, Medium = 2, High = 3
  4. Target Encoding (Advanced) :    
      - Replaces categories with the average target value for that category.
      - Used in specific cases (e.g., Kaggle competitions), but needs careful handling to avoid overfitting.
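
Ordinal and target encoding can be sketched as follows. The column names and values are invented for illustration. Note that scikit-learn's `OrdinalEncoder` starts counting at 0, so Low, Medium, High become 0, 1, 2 rather than 1, 2, 3. The target encoding here is the naive version; in practice the category means would be computed on the training data only, to limit overfitting.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: `size` has a natural order, `city` does not, `price` is the target
df = pd.DataFrame({
    "size": ["Low", "High", "Medium", "Low"],
    "city": ["A", "B", "A", "B"],
    "price": [10, 40, 30, 20],
})

# Ordinal encoding: pass the order explicitly so that Low < Medium < High
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["size_encoded"] = ordinal.fit_transform(df[["size"]]).ravel()

# Target encoding: replace each city with the mean target value for that city
city_means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_means)
print(df)
```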

7. What do you mean by training and testing a dataset?
  - In machine learning, we usually split our data into two parts :
  - Training Dataset :    
      - This is the data the model learns from.
      - Example : If we want to teach a model to predict house prices, the training data will contain the inputs (house size, location) along with the correct answer (actual price).
  - Testing Dataset :
      - This data is not shown to the model during training.
      - It's used to check how well the model learned.
      - The goal is to see if the model can make accurate predictions on new data.

8. What is sklearn.preprocessing?
  - sklearn.preprocessing is a part of the scikit-learn library (also called sklearn) that helps us prepare our data before training a machine learning model.
  - Why it is needed :
      - Most machine learning models don't work well with raw data. We need to clean, scale or transform the data so the model can understand it better.
  - Common things we can do with sklearn.preprocessing :
  1. Scaling Features :    
      - Makes sure all numbers are on a similar scale.
      - Example : StandardScaler(), MinMaxScaler().
  2. Encoding Categorical Data :
      - Converts categories (like "Red", "Blue") into numbers.
      - Example : OneHotEncoder(), LabelEncoder().
  3. Normalization :
      - Normalizer() rescales each sample (row) to unit length; to squeeze each feature into the 0 to 1 range, use MinMaxScaler().
      - Helps in algorithms that use distance (like KNN).
  4. Binarization :
      - Turns data into 0s and 1s based on a threshold, using Binarizer().
  5. Handling Missing Values (via Imputation) :
      - Filling in missing values with a default like mean or median.
      - Done using SimpleImputer(), which is imported from sklearn.impute rather than sklearn.preprocessing.
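
A short sketch of imputation followed by binarization, on a tiny made-up array. The threshold of 15 is an arbitrary choice for illustration, and `SimpleImputer` is imported from `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

# Fill the missing value with the mean of its column: (1 + 3) / 2 = 2
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Values above the threshold become 1, all others become 0
binarizer = Binarizer(threshold=15.0)
X_binary = binarizer.fit_transform(X_filled)

print(X_filled)
print(X_binary)
```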

9. What is a Test set?
  - A test set is a portion of the dataset that is kept separate and used to evaluate how well a machine learning model works.
  - Why it is important :
      - After the model learns from the training set, we want to see if it can make good predictions on new, unseen data. That's where the test set comes in.

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
  - We usually use train_test_split from scikit-learn (sklearn) to split the dataset.

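```python
import numpy as np
from sklearn.model_selection import train_test_split

# X = input/features, y = output/target (small made-up arrays for illustration)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```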

  - test_size=0.2 means 20% for testing, 80% for training.
  - random_state=42 ensures results are reproducible (optional).
  - Steps to Approach a Machine Learning Problem :
       1. Understand the Problem : What are we trying to predict or classify?
       2. Collect the Data : Gather the relevant dataset.
       3. Explore the Data (EDA) : Check for patterns, missing values and outliers.
       4. Preprocess the Data : Clean it, handle missing values and convert categories into numbers.
       5. Split the Data : Use train_test_split to divide into training and test sets.
       6. Choose a Model : Example - Linear Regression, Decision Tree, etc.
       7. Train the Model : Fit it using the training data.
       8. Evaluate the Model : Use the test set to check accuracy, error or other metrics.
       9. Tune and Improve : Try different models or settings to get better results.
       10. Deploy the Model (Optional) : Use it in real applications to make predictions.
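
The steps above can be sketched end to end as follows. This is only an illustration under assumptions that are not part of these notes: the iris dataset bundled with scikit-learn, a `LogisticRegression` model and accuracy as the metric are arbitrary choices. The scaler is fitted after the split so that no information from the test set leaks into preprocessing.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Steps 1-2: the problem is predicting the iris species; the data ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Step 5: split before fitting anything, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Steps 6-7: choose and train a model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Step 8: evaluate on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.2f}")
```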

11. Why do we have to perform EDA before fitting a model to the data?
  - EDA (Exploratory Data Analysis) is like getting to know our data before teaching it to a machine.
  - It helps us understand the shape, quality and patterns in our dataset - so we can clean it, fix issues and choose the right model.
  - Reasons Why EDA is Important :
  1. Detect Missing Values : We might find empty or null entries that need to be filled or removed.
  2. Find Outliers or Errors : Strange or extreme values can confuse the model.
  3. Understand Feature Relationships : We can spot which features are important and how they relate to the target.
  4. Choose the Right Preprocessing : EDA tells us if we need scaling, encoding or transformation.
  5. Spot Data Imbalance : For classification, we might find that some categories appear more than others (class imbalance).
  6. Avoid Garbage In = Garbage Out : If our data is messy, our model will learn the wrong things.
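
A few of these checks can be sketched with pandas. The dataset below is invented and deliberately contains a missing value, an implausible age and imbalanced labels, so each check has something to find.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with a missing value, an extreme value and imbalanced labels
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 300],
    "income": [30, 45, 38, 60, 41, 52],
    "label": ["no", "no", "no", "no", "yes", "no"],
})

missing_per_column = df.isna().sum()          # count of missing values per column
summary = df.describe()                       # min, max, mean, quartiles of numeric columns
class_counts = df["label"].value_counts()     # how balanced the target classes are
correlations = df[["age", "income"]].corr()   # pairwise correlation of numeric features

print(missing_per_column)
print(summary)
print(class_counts)
print(correlations)
```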



12. What is correlation? (Same as Question 2.)

13. What does negative correlation mean? (Same as Question 2.)

14. How can you find correlation between variables in Python?
  - We usually use Pandas to calculate the correlation between columns in a dataset.

```python
import pandas as pd

# Sample data
data = {
    'Height': [150, 160, 170, 180],
    'Weight': [50, 60, 65, 80]
}

df = pd.DataFrame(data)

# Find the pairwise correlation between all numeric columns
correlation = df.corr()
print(correlation)
```

Output :

```
          Height    Weight
Height  1.000000  0.981156
Weight  0.981156  1.000000
```


15. What is causation? Explain difference between correlation and causation with an example.
  - Causation means one thing directly causes another to happen.
  - It's a cause-and-effect relationship.
  - Example of Causation : If we study more, we score higher on exams. Here, studying causes better results - that's causation.
  - Correlation vs. Causation :    
      - Correlation :
          - Meaning : Two things change together.
          - Example : Ice cream sales and drowning cases both increase in summer. Hot weather drives both, but neither one causes the other.
      - Causation :    
          - Meaning : One thing causes the other to change.
          - Example : More study hours -> higher marks.
      - Correlation is not causation. Correlation shows a relationship; causation shows direct influence.
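
The ice cream example can be simulated to show how a shared cause produces correlation without causation. Everything below is a hypothetical model with invented coefficients: temperature drives both quantities, and neither one appears in the formula for the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: temperature drives both quantities; neither affects the other
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(round(r, 2))  # clearly positive, even though there is no causal link between the two
```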

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
  - An optimizer is an algorithm used in machine learning to adjust the model's parameters (like weights) to reduce the loss/error during training.
  - Think of it like a guide helping the model learn faster and smarter by improving guesses with each step.
  - Common Types of Optimizers (with simple examples) :
  1. Gradient Descent
      - The most basic optimizer.
      - It updates weights slowly in the direction that reduces the error.
      - Example :
          - Imagine we are walking downhill to reach the lowest point (minimum loss). Each step we take is based on how steep the slope is.
          - (Concept only - not usually used directly) weights = weights - learning_rate * gradient
  2. Stochastic Gradient Descent (SGD)
      - Same as gradient descent, but updates weights after each data point (or small mini-batch) instead of after the whole dataset.
      - Faster and better for large datasets, but more "noisy."
      - Example :
          - Like taking small, fast steps with some wobble.
          - `from tensorflow.keras.optimizers import SGD`
          - `model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')` (here `model` is an already-built Keras model)
  3. Adam (Adaptive Moment Estimation)
      - Most commonly used in deep learning.
      - Combines the benefits of Momentum and RMSProp.
      - It's fast and usually works well out of the box.
      - Example :
          - Smart walking - adjusts our step size and direction using past knowledge.
          - `from tensorflow.keras.optimizers import Adam`
          - `model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')`
  4. RMSProp
      - Good for recurrent neural networks (RNNs).
      - Adapts learning rate for each weight based on recent updates.
  5. Adagrad
      - Adjusts learning rate individually for each parameter.
      - Slows down learning over time.

17. What is sklearn.linear_model?
  - sklearn.linear_model is a module in scikit-learn that provides tools to build linear models - models that try to make predictions based on a straight-line (or linear) relationship between input and output.
  - Common Models in sklearn.linear_model :
  1. LinearRegression
      - Used for predicting continuous values (e.g., house prices).
      - Equation (for one feature) : y = mx + b, where m is the slope and b is the intercept.
  2. LogisticRegression
      - Used for classification (e.g., yes/no, spam/not spam).
      - Predicts probability and turns it into class labels (0 or 1).
  3. Ridge & Lasso Regression
      - Variations of linear regression with regularization (to reduce overfitting).
      - Ridge uses L2 penalty, Lasso uses L1 penalty.
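
To see the effect of the penalties, the sketch below fits the three regressors on the same synthetic data, where only the first of three features actually matters. The data and the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Only the first feature matters in this made-up target
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

coefs = {}
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.5)):
    model.fit(X, y)
    coefs[type(model).__name__] = model.coef_
    print(type(model).__name__, np.round(model.coef_, 2))
```

Ridge shrinks the coefficients slightly, while Lasso tends to push the coefficients of the unused features all the way to zero.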

18. What does model.fit() do? What arguments must be given?
  - The .fit() function is used to train the model on our dataset.
  - It teaches the model how to make predictions by finding patterns in the input data.
  - model.fit(X, y) = Learn from inputs (X) and correct answers (y)
  - X = Features (input data like size, color, age, etc.)
  - y = Labels/target (what we are trying to predict)
  - Required Arguments :
      - X : The input features (usually a table or 2D array).
      - y : The target values (what the model should learn to predict). Unsupervised models, such as clustering, are fit on X alone.

19. What does model.predict() do? What arguments must be given?
  - The .predict() function is used after the model is trained.
  - It takes new input data and gives predictions based on what the model has learned.
  - Required Argument :
      - X : The new input data (same number of features as the training data).
  - Use Case :
      - In regression: it predicts numbers (e.g., price = 300.5).
      - In classification: it predicts class labels (e.g., cat or dog).
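
A minimal sketch of fit() followed by predict() on a classification task. The "hours studied" data is invented, and `LogisticRegression` is just one possible model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: hours studied -> failed (0) or passed (1)
X_train = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)        # learn from inputs and correct answers

# New inputs must have the same number of columns as X_train (one here)
X_new = np.array([[1.5], [8.5]])
predictions = model.predict(X_new)
print(predictions)  # one class label per row of X_new
```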

20. What are continuous and categorical variables? (Same as Question 5.)

21. What is feature scaling? How does it help in Machine Learning?
  - Feature scaling is a technique used to adjust the range of data values so that all features (columns) are on a similar scale - usually between 0 and 1 or with mean 0 and standard deviation 1.
  - Why we need it :    
      - Some machine learning models (like KNN, SVM or Gradient Descent-based models) are sensitive to the scale of features.
      - If one feature has values like 1 - 1000 and another has 0 - 1, the model might give more importance to the bigger one - even if both are equally important.
  - Common Feature Scaling Methods :
  1. Min-Max Scaling (Normalization) :
      - Scales values to a fixed range, usually 0 to 1.
  2. Standardization (Z-score Scaling) :    
      - Transforms data to have mean = 0 and std = 1.
  - How it helps :
      - Often improves accuracy for scale-sensitive models.
      - Speeds up training, because gradient descent converges faster on scaled features.
      - Helps models treat all features equally.
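
Both methods are simple formulas, sketched here by hand with NumPy on made-up values. The standard deviation used is the population one (NumPy's default), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: (x - min) / (max - min) maps the values into [0, 1]
min_max_scaled = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / std gives mean 0 and standard deviation 1
standardized = (values - values.mean()) / values.std()

print(min_max_scaled)  # 0, 0.25, 0.5, 0.75, 1
print(standardized)
```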

22. How do we perform scaling in Python?
  - We usually perform scaling using scikit-learn's preprocessing tools like StandardScaler or MinMaxScaler.
  1. Standard Scaling (Z-score Scaling)
      - This scales the data so that : Mean = 0 and Standard Deviation = 1.
  2. Min-Max Scaling (Normalization)
      - This scales the values to a range [0, 1].

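```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# X is our feature dataset (made-up example: two features on very different scales)
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Standard Scaling: each column ends up with mean 0 and standard deviation 1
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Min-Max Scaling (Normalization): each column is mapped to the range [0, 1]
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
```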

23. What is sklearn.preprocessing? (Same as Question 8.)

24. How do we split data for model fitting (training and testing) in Python? (Same as Question 10.)

25. Explain data encoding?
  - Data encoding is the process of converting categorical (text) data into numerical values so that machine learning models can understand and work with them.
  - Why we need it :
      - Most ML algorithms can't handle text directly - they only understand numbers.
      - So we convert things like "Red", "Blue" or "Yes", "No" into numbers.
  - Common Encoding Techniques :
  1. Label Encoding :    
      - Converts each category into a unique number.
      - Example : "Blue" -> 0, "Green" -> 1, "Red" -> 2 (scikit-learn's LabelEncoder numbers the categories in sorted order).
  2. One-Hot Encoding :    
      - Creates a new column for each category, using 1s and 0s.  

```python
# Label Encoding
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Classes are sorted alphabetically: Blue = 0, Green = 1, Red = 2
encoded = encoder.fit_transform(['Red', 'Blue', 'Green', 'Red'])
encoded
```

Output :

```
array([2, 0, 1, 2])
```

```python
# One-Hot Encoding with pandas
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
# get_dummies creates one 0/1 column per category
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
encoded
```

Output :

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
```
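
The same encoding can also be sketched with scikit-learn's `OneHotEncoder`, which is convenient inside a preprocessing pipeline. The `handle_unknown="ignore"` setting is an optional choice made here so that categories unseen during fitting are encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["Red"], ["Blue"], ["Green"]])  # one column, three rows

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(colors).toarray()  # the result is sparse by default

print(encoder.categories_)  # categories are sorted: Blue, Green, Red
print(one_hot)
```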
