# Feature Engineering

1. What is a parameter?

Ans- A parameter is a value or variable that helps define or control the behavior of a function, system, or process. The meaning of a parameter can vary depending on the context:
- In programming, a parameter is a variable used in a function definition to accept input values (called arguments) when the function is called.
- In mathematics, a parameter is a constant that defines a family of functions or changes the behavior of an equation without being the main variable.
- In statistics, a parameter refers to a value that describes a characteristic of a population, such as the population mean or standard deviation.
- In general usage, it can refer to any limit, boundary, or guideline within which something operates.
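The programming sense of the term can be illustrated with a short sketch. The function name `rectangle_area` and its values are hypothetical, chosen only for illustration.

```python
def rectangle_area(width, height=1.0):
    """width and height are parameters; height has a default value."""
    return width * height

# 3 and 4 are the arguments bound to the parameters at call time.
area = rectangle_area(3, 4)

# Omitting the second argument falls back to the default height of 1.0.
default_area = rectangle_area(3)

print(area, default_area)
```

The parameters exist in the function definition; the arguments are the concrete values supplied at each call.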

2. What is correlation?
What does negative correlation mean?

Ans- Correlation is a statistical measure that describes the strength and direction of a relationship between two variables. It helps to determine whether an increase or decrease in one variable corresponds to an increase or decrease in another.                         

A negative correlation means that as one variable increases, the other variable tends to decrease.



3. Define Machine Learning. What are the main components in Machine Learning?

Ans- Machine Learning (ML) is a branch of artificial intelligence (AI) that allows computers to learn from data and make decisions or predictions without being explicitly programmed. Instead of following fixed rules, ML models improve their performance as they are exposed to more data over time.
Main Components of Machine Learning:        
- Data: The raw information (text, numbers, images, etc.) used to train the model. Quality and quantity of data greatly affect performance.
- Model: A mathematical or computational structure that learns patterns from data and makes predictions or decisions.   
- Algorithm: A step-by-step method or procedure used to train the model. Examples: Linear Regression, Decision Trees, Neural Networks.
- Evaluation: Assessing how well the model performs using test data. Common metrics include accuracy, precision, and recall.

4. How does loss value help in determining whether the model is good or not?

Ans- The loss value is a numerical measure that shows how well or poorly a machine learning model is performing. It represents the difference between the model's predicted output and the actual target value.           

Key Points:
- Low Loss = Better Fit
 - A smaller loss value means the model's predictions are close to the actual values on the data the loss was computed on.
- High Loss = Poor Fit
 - A large loss value suggests the model's predictions are far from the actual results, meaning the model needs improvement.
- Used During Training
 - During training, the model adjusts its internal parameters to minimize the loss. This process is called optimization.
- Compare Training and Validation Loss
 - A low training loss alone is not enough. If the loss on held-out data is much higher than the training loss, the model is overfitting.

The loss value is a critical indicator of model quality. A lower loss on unseen data usually means a better model, while a higher loss signals that the model's predictions are off and need improvement.
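As a minimal sketch of how a loss value separates good predictions from poor ones, the snippet below computes mean squared error (one common loss among many) for two sets of predictions. The helper `mean_squared_error` and the sample numbers are assumptions made for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_predictions = [2.5, 5.0, 7.5]   # near the targets
far_predictions = [0.0, 9.0, 2.0]     # far from the targets

loss_close = mean_squared_error(y_true, close_predictions)
loss_far = mean_squared_error(y_true, far_predictions)

print("Loss for close predictions:", loss_close)
print("Loss for far predictions:", loss_far)
```

The predictions that sit closer to the targets produce the smaller loss.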


5. What are continuous and categorical variables?

Ans- Continuous Variables:
- These are numerical variables that can take any value within a range.
- They can be measured and often have decimal values.
- Example: Height (170.5 cm), Weight (65.2 kg), Temperature (36.6°C), Time (2.45 seconds)                   

Categorical Variables:
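- These are variables that represent categories or groups.
- They cannot be measured on a numeric scale, only classified into distinct categories.
- Example: Gender (male, female), Color (red, blue, green), Education level (high school, bachelor's, master's)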

6. How do we handle categorical variables in Machine Learning? What are the common techniques?

Ans- Most machine learning algorithms require numerical input, so categorical variables must be converted into numbers before they can be used to train a model.
Common Techniques to Handle Categorical Variables:
- Label Encoding: Assigns a unique integer to each category. The integers imply an order, so it suits target labels or ordered categories better than unordered features.
- One-Hot Encoding: Creates a new binary column for each category (0 or 1).
- Ordinal Encoding: Similar to label encoding but respects the order of categories.
- Frequency or Count Encoding: Replaces each category with its frequency in the dataset.
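Ordinal and frequency encoding can be sketched with plain pandas. The column names, the category order in `size_order`, and the sample rows are assumptions made for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Pune", "Delhi", "Pune", "Pune"],
})

# Ordinal encoding: an explicit mapping that preserves the category order.
size_order = {"small": 0, "medium": 1, "large": 2}
df_demo["size_encoded"] = df_demo["size"].map(size_order)

# Frequency encoding: replace each category with its relative frequency.
city_frequency = df_demo["city"].value_counts(normalize=True)
df_demo["city_frequency"] = df_demo["city"].map(city_frequency)

print(df_demo)
```

Writing the mapping by hand keeps the order under your control; an automatic encoder would otherwise sort the categories alphabetically.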

7. What do you mean by training and testing a dataset?

Ans- Training Dataset:
- This is the portion of the data used to train the machine learning model.
- The model learns patterns, relationships, and rules from this data.
- It adjusts its internal parameters to reduce error based on input-output pairs.

Testing Dataset:
- This is a separate portion of the data used to evaluate the model's performance.
- The model has not seen this data during training.
- It helps test how well the model generalizes to new, unseen data.


8. What is sklearn.preprocessing?

Ans- sklearn.preprocessing is a module in Scikit-learn (a popular Python machine learning library) that provides tools for preprocessing and transforming data before feeding it into a machine learning model.  

Common Classes in sklearn.preprocessing:
- StandardScaler
 - Scales features to have zero mean and unit variance.
 - Useful for algorithms like SVM or logistic regression.
- MinMaxScaler
 - Scales features to a given range (default is 0 to 1).
- LabelEncoder
 - Converts categorical labels (e.g., "male", "female") into numeric form.
- OneHotEncoder
 - Converts categorical features into a binary matrix (used for nominal data).
- Binarizer
 - Converts numerical values to binary (0 or 1) based on a threshold.
- PolynomialFeatures
 - Generates new features by combining existing ones with polynomial terms.
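The last two transformers in the list are less familiar than the scalers, so here is a minimal sketch of both on a single row. The sample values and the threshold of 2.5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])

# Binarizer: values above the threshold become 1, all others become 0.
binary = Binarizer(threshold=2.5).fit_transform(X_demo)

# PolynomialFeatures (degree 2): bias term, x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2).fit_transform(X_demo)

print(binary)
print(poly)
```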

9. What is a Test set?

Ans- A test set is a portion of the dataset that is kept separate from the training process and used to evaluate the final performance of a machine learning model.
- Purpose:
 - To check how well the trained model performs on unseen data.
 - Helps assess the model’s generalization ability.
- When It Is Used:
 - After the model is trained using the training set, the test set is used to measure accuracy, precision, recall, or other performance metrics.
- Not Used During Training:
 - The model does not learn from the test set. This ensures an unbiased evaluation.
- Typical Split:
 - Common practice is to split the dataset into 70% training and 30% testing (or 80/20), depending on the dataset size.

10. How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?

Ans- To split data into training and testing sets in Python, we use train_test_split() from the sklearn.model_selection module.

In [3]:
from sklearn.model_selection import train_test_split
import pandas as pd # Import pandas to potentially load data

# Suppose X = features, y = labels

# You need to load or create your data here.
# For demonstration, let's create some dummy data using pandas DataFrames.
# Replace this with your actual data loading/creation code.
data = {'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'feature2': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'label': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# Assign features to X and labels to y
X = df[['feature1', 'feature2']] # Features (all columns except 'label')
y = df['label'] # Labels (the 'label' column)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# You can now use X_train, X_test, y_train, and y_test for model training and evaluation.
print("Data split successfully!")
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Data split successfully!
Shape of X_train: (8, 2)
Shape of X_test: (2, 2)
Shape of y_train: (8,)
Shape of y_test: (2,)


How Do You Approach a Machine Learning Problem?
- Understand the Problem
 - Define the goal: classification, regression, clustering, etc.
 - Understand the data and business context.
- Collect and Explore the Data
 - Load data from files, APIs, or databases.
 - Perform exploratory data analysis (EDA): view summaries, distributions, correlations.
- Preprocess the Data
 - Handle missing values, duplicates, and outliers.
 - Encode categorical variables (e.g., with Label Encoding or One-Hot Encoding).
 - Scale or normalize numerical features (using StandardScaler, etc.).
- Split the Data
 - Use train_test_split() to divide data into training and testing sets.
- Choose a Model
 - Pick a suitable algorithm (e.g., Decision Tree, SVM, Linear Regression).
- Train the Model
 - Fit the model using the training data.
- Evaluate the Model
 - Use the test set to check accuracy, precision, recall, F1-score, etc.
- Tune Hyperparameters (if needed)
 - Use techniques like Grid Search or Random Search for optimization.
- Deploy the Model
 - Integrate the trained model into a real-world system or app.
- Monitor and Maintain
 - Continuously track performance and retrain as needed.
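The middle steps of this workflow (preprocess, split, choose, train, evaluate) can be sketched end to end. This is only an illustration: it uses synthetic data from `make_classification`, and logistic regression is an arbitrary model choice rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Split the data before any fitting, so the test set stays unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Preprocessing and model are chained so scaling is learned on training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)

# Evaluate on the held-out test set.
accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.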

11. Why do we have to perform EDA before fitting a model to the data?

Ans- Performing Exploratory Data Analysis (EDA) before fitting a model to data is essential for several key reasons. EDA helps you understand the data, uncover insights, and make informed decisions before modeling. Here's why it's important:

 Understand the Data:
- Data types: Identify categorical, numerical, datetime, or text data.
- Summary statistics: Learn about means, medians, standard deviations, and distributions.
- Data structure: Check dimensions, column names, and the basic layout.  

Detect and Handle Missing or Incorrect Data:
- Models don’t handle missing or corrupt data well.
- EDA reveals:
 - Missing values
 - Outliers or anomalies
 - Duplicates
 - Inconsistent formats

Reveal Patterns and Relationships:
- You can use visualizations (scatter plots, heatmaps, boxplots) to:
 - Spot trends
 - Identify correlations
 - Understand group distributions        

Assess Feature Importance and Redundancy:
- EDA helps detect:
 - Highly correlated features (multicollinearity)
 - Irrelevant or uninformative variables
- You can then reduce dimensionality, simplifying the model and improving performance.                 

Check Assumptions for Specific Models:
- Linear models, for instance, assume:
 - Linearity
 - Homoscedasticity
 - Normality of residuals
 - No multicollinearity
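A few pandas calls cover most of the checks listed above. The small DataFrame below is made up for illustration, with a missing value, a duplicate row, and an outlier planted on purpose.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Pune", "Delhi", "Pune", None, "Delhi"],
    "income": [30000, 52000, 41000, 1_000_000, 52000],
})

# Understand the data: column types and summary statistics.
print(df_demo.dtypes)
print(df_demo.describe())

# Detect missing values and duplicate rows.
missing_per_column = df_demo.isna().sum()
duplicate_rows = int(df_demo.duplicated().sum())
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)

# Reveal relationships between the numeric columns.
numeric_corr = df_demo[["age", "income"]].corr()
print(numeric_corr)
```

In the `describe()` output, the maximum income sits far above the other values, which is the kind of outlier EDA is meant to surface before modeling.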

12. What is correlation?


Ans- Correlation is a statistical measure that expresses the degree to which two variables move in relation to each other.
- Positive correlation: As one variable increases, the other tends to increase.
Example: Height and weight.
- Negative correlation: As one variable increases, the other tends to decrease.
Example: Speed and travel time.
- Zero (no) correlation: No consistent relationship between the variables.
Example: Shoe size and exam score.        
- Correlation ≠ Causation
Just because two variables are correlated doesn’t mean one causes the other.
- Outliers can distort correlation values.
- Non-linear relationships can have a low (Pearson) correlation coefficient even when the variables are strongly related.
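The caveat about non-linear relationships can be checked with a small sketch. Here `y` is fully determined by `x`, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear. The data is synthetic.

```python
import numpy as np

x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2             # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print("Pearson correlation:", r)
```

A scatter plot would show the parabola immediately, which is one reason to look at the data as well as the correlation coefficient.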



13. What does negative correlation mean?

Ans- A negative correlation means that as one variable increases, the other decreases — they move in opposite directions.      
A negative correlation coefficient lies between -1 and 0:
 - -1: Perfect negative linear correlation (e.g., a straight downward line)
 - -0.5: Moderate negative correlation
 - 0: No linear correlation



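14. How can you find correlation between variables in Python?

Ans- In pandas, the DataFrame.corr() method computes the pairwise correlation (Pearson by default) between all numeric columns.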

In [4]:
import pandas as pd

# Example data
data = {
    'Hours_Studied': [2, 4, 6, 8, 10],
    'Test_Score': [50, 60, 70, 80, 90],
    'Hours_Watching_TV': [10, 8, 6, 4, 2]
}

df = pd.DataFrame(data)


In [5]:
correlation_matrix = df.corr()
print(correlation_matrix)


                   Hours_Studied  Test_Score  Hours_Watching_TV
Hours_Studied                1.0         1.0               -1.0
Test_Score                   1.0         1.0               -1.0
Hours_Watching_TV           -1.0        -1.0                1.0


15. What is causation? Explain difference between correlation and causation with an example.

Ans- Causation means that one variable directly affects or causes a change in another.

 Correlation:
- Meaning: A relationship or pattern between two variables
- Directionality: No clear direction of effect
- Example: Ice cream sales ↑ and drowning ↑. Neither causes the other; both rise in hot weather, which is the common cause.

 Causation:
- Meaning: One variable directly influences the other
- Directionality: There is a clear "cause" and "effect"
- Example: Smoking → Lung disease
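The ice cream example can be simulated to show how a hidden common cause produces correlation without causation. Everything here is synthetic: the coefficients, the noise levels, and the hypothetical helper `residuals` are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Temperature is the hidden common cause of both quantities.
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# The two outcomes are strongly correlated with each other.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, predictor):
    """Remove the linear effect of predictor from values."""
    slope, intercept = np.polyfit(predictor, values, deg=1)
    return values - (slope * predictor + intercept)

# After removing the effect of temperature, the correlation largely vanishes.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("Raw correlation:", raw_corr)
print("Correlation after controlling for temperature:", partial_corr)
```

Neither variable appears in the formula for the other, so the raw correlation comes entirely from the shared dependence on temperature.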

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

Ans- An optimizer is an algorithm used during the training of machine learning (especially deep learning) models to adjust the model’s parameters (weights and biases) in order to minimize the loss function.

- Gradient Descent (GD):
 -  Basic idea: Update weights in the direction that minimizes loss.     


In [9]:
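import torch

model = torch.nn.Linear(2, 1)  # placeholder model, used only so the optimizers have parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)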


In [11]:
# Momentum:
# Adds memory of past gradients to accelerate learning and smooth out updates.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)


In [12]:
# RMSprop:
# Uses an adaptive learning rate for each parameter, based on recent gradients.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)


In [13]:
# Adam (Adaptive Moment Estimation):
# Combines Momentum + RMSprop
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


In [14]:
# Adagrad:
# Adapts the learning rate per parameter, scaling it down for frequently updated weights.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
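To make the update rules concrete, here is a minimal pure-Python sketch of plain gradient descent and of one common formulation of momentum, applied to the toy loss f(w) = (w - 3)^2. The loss, the function names, and the hyperparameters are assumptions chosen for illustration; this is not how any particular library implements its optimizers.

```python
def gradient(w):
    """Derivative of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=300):
    """Plain gradient descent: step against the gradient."""
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

def momentum_descent(w, lr=0.1, momentum=0.9, steps=300):
    """Momentum: accumulate past gradients in a velocity term."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + gradient(w)
        w -= lr * velocity
    return w

w_gd = gradient_descent(0.0)
w_momentum = momentum_descent(0.0)
print(w_gd, w_momentum)
```

Both variants approach the minimum at w = 3. The adaptive methods (RMSprop, Adam, Adagrad) additionally rescale the step for each parameter.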


17. What is sklearn.linear_model?

Ans- **sklearn.linear_model** is a module in the Scikit-learn library that provides various linear models for regression and classification tasks. These models assume a linear relationship between the input features and the target output.
- It includes models like LinearRegression, LogisticRegression, Ridge, Lasso, and ElasticNet.
- Used for both predicting continuous values (regression) and classifying categories (classification).
- Supports regularization techniques (L1, L2) to avoid overfitting.
- Can handle binary, multiclass, and multilabel classification tasks.
- Widely used for interpretable and efficient modeling on structured/tabular data.
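As a small sketch of the regularization point, the snippet below fits `LinearRegression` and `Ridge` (L2 regularization) to the same noiseless line y = 2x + 1. The data and the value `alpha=10.0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

# Ordinary least squares recovers the slope and intercept.
linear = LinearRegression().fit(X_demo, y_demo)

# Ridge adds an L2 penalty that shrinks the coefficient toward zero.
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)

print("LinearRegression slope:", linear.coef_[0], "intercept:", linear.intercept_)
print("Ridge slope:", ridge.coef_[0])
```

The Ridge slope comes out smaller than the ordinary least squares slope; this shrinkage is what regularization trades for lower variance on noisy data.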

18. What does model.fit() do? What arguments must be given?

Ans- The **model.fit()** method is used to train a machine learning model in Scikit-learn. It learns the relationship between the input data (features) and the target output by adjusting the model’s internal parameters.

 Arguments:
- X: The input features (independent variables), a 2D array or DataFrame with shape (n_samples, n_features)
- y: The target values (dependent variable), a 1D array or Series with shape (n_samples,). Unsupervised estimators such as clustering models do not need y.


In [15]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)  # X_train = features, y_train = target


19. What does model.predict() do? What arguments must be given?

Ans- model.predict() is used to make predictions after a model has been trained using fit().
- It takes new input data and returns the model’s predicted output based on what it has learned.

 Argument:
- The input features (same format as used in fit()) — a 2D array or DataFrame of shape (n_samples, n_features)


In [16]:
from sklearn.linear_model import LinearRegression

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)


20. What are continuous and categorical variables?

Ans- Continuous Variables:
- A continuous variable is a numerical variable that can take any value within a range — including decimals and fractions.
- Can take infinite possible values
- Arithmetic operations (e.g., average, sum) make sense

  Categorical Variables:
- A categorical variable represents categories or groups. It contains a fixed number of distinct values (usually labels or names).
- Values represent groups, not quantities
- Can be nominal (no order) or ordinal (has order, like education level)

21. What is feature scaling? How does it help in Machine Learning?

Ans- Feature scaling is the process of normalizing or standardizing the range of independent variables (features) in a dataset so that they are on a similar scale.       
- Algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and models trained with gradient descent are sensitive to feature scale.
- Scaling ensures that no single feature dominates due to its larger value range.
- It often speeds up convergence during training and can improve the accuracy of scale-sensitive models.
- It also improves the effectiveness of regularization (e.g., in Lasso or Ridge regression).

   Common Methods:
- Standardization: Scales features to have zero mean and unit variance.
- Min-Max Scaling: Rescales data to a [0, 1] range.




In [17]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
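The two methods reduce to simple column-wise formulas: standardization computes (x - mean) / std, and min-max scaling computes (x - min) / (max - min). The sketch below applies them by hand with NumPy and compares the results with Scikit-learn's scalers. The sample array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 4000.0]])

# Standardization: subtract the column mean, divide by the column standard deviation.
standardized = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling: map each column's minimum to 0 and its maximum to 1.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
min_max = (X_demo - col_min) / (col_max - col_min)

# Both manual results agree with the library transformers.
same_standard = np.allclose(standardized, StandardScaler().fit_transform(X_demo))
same_min_max = np.allclose(min_max, MinMaxScaler().fit_transform(X_demo))
print(same_standard, same_min_max)
```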


22. How do we perform scaling in Python?

Ans- In Python, feature scaling is most commonly done using Scikit-learn’s preprocessing tools.
Common Methods of Scaling:
- a) Standardization
- b) Min-Max Scaling
- c) Robust Scaling

In [18]:
"""StandardScaler"""
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
X = np.array([[1, 2000], [2, 3000], [3, 4000]])

# Initialize and fit the scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)


[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


In [19]:
"""MinMaxScaler"""
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)


[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]


In [20]:
"""RobustScaler """
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)


[[-1. -1.]
 [ 0.  0.]
 [ 1.  1.]]


23. What is sklearn.preprocessing?

Ans- sklearn.preprocessing is a module in Scikit-learn that provides tools for preprocessing and transforming data before feeding it into machine learning models.        
- Scaling features (e.g., StandardScaler, MinMaxScaler)
- Encoding categorical variables (e.g., OneHotEncoder, LabelEncoder)
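- Handling missing values: a few transformers (e.g., StandardScaler) tolerate NaN values, but imputation itself is provided by sklearn.impute (e.g., SimpleImputer)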
- Generating polynomial features
- Normalizing data (making data vectors unit length)

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
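Normalizing to unit length differs from the scalers: it works on rows (samples) rather than columns (features). Here is a minimal sketch with `Normalizer`, using made-up rows.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rows = np.array([[3.0, 4.0], [1.0, 0.0]])

# Each row is divided by its own L2 norm, so every row ends up with length 1.
unit_rows = Normalizer(norm="l2").fit_transform(rows)

print(unit_rows)
print(np.linalg.norm(unit_rows, axis=1))
```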


24. How do we split data for model fitting (training and testing) in Python?

Ans- To evaluate how well your model will perform on unseen data, you split your dataset into:
- Training set: Used to train (fit) the model
- Testing set: Used to test and evaluate model performance

In [24]:
from sklearn.model_selection import train_test_split
import pandas as pd # Ensure pandas is imported

# Suppose X = features, y = target

# You need to load or create your data here.
# For demonstration, let's create some dummy data using pandas DataFrames.
# Replace this with your actual data loading/creation code.
data = {'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'feature2': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'label': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# Assign features to X and labels to y from the DataFrame
X = df[['feature1', 'feature2']] # Features (all columns except 'label')
y = df['label'] # Labels (the 'label' column)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% data for testing
    random_state=42       # for reproducibility
)

# You can now use X_train, X_test, y_train, and y_test for model training and evaluation.
print("Data split successfully!")
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Data split successfully!
Shape of X_train: (8, 2)
Shape of X_test: (2, 2)
Shape of y_train: (8,)
Shape of y_test: (2,)


25. Explain data encoding.

Ans- Data encoding is the process of converting categorical data into numerical format so that machine learning models, which typically require numbers, can process the data.
- Most ML algorithms cannot work directly with text or categories.
- Models require numbers as input features.
- Encoding converts categories like "Red", "Blue", "Green" into numeric values.



In [25]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample categorical data
colors = ['Red', 'Blue', 'Green', 'Blue']

# Label Encoding
le = LabelEncoder()
labels = le.fit_transform(colors)
print(labels)  # Output: [2 0 1 0]

# One-Hot Encoding using pandas
df = pd.DataFrame({'Color': colors})
one_hot = pd.get_dummies(df['Color'])
print(one_hot)


[2 0 1 0]
    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3   True  False  False
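One-hot encoding can also be done with Scikit-learn's `OneHotEncoder`, which remembers the categories seen during fitting and can apply them to new data. This is a minimal sketch; `handle_unknown="ignore"` is one possible policy for unseen categories, and "Purple" is a made-up unseen value.

```python
from sklearn.preprocessing import OneHotEncoder

# The encoder expects 2D input: one row per sample, one column per feature.
colors_demo = [["Red"], ["Blue"], ["Green"], ["Blue"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors_demo).toarray()

# A category not seen during fitting is encoded as all zeros.
unseen = encoder.transform([["Purple"]]).toarray()

print(encoder.categories_[0])
print(encoded)
print(unseen)
```

Fitting on the training data and reusing the same encoder on the test data keeps the column layout consistent between the two.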
