# Q-1) What is a parameter?

Ans) In feature engineering, a parameter is a value that defines or influences the behavior of a feature transformation or extraction process.

-> User-defined or preset: For example, setting a threshold for binarizing a feature (e.g., values above 0.5 become 1, and below become 0).

-> Learned from data: Sometimes, parameters are estimated during model training (like the weights in a regression model), though these are often considered model parameters rather than feature engineering parameters.

-> Parameters in feature engineering are crucial because they control how raw data is transformed into features that a model can use effectively.

-> Adjusting these parameters can significantly affect the quality of the features and, subsequently, the model's performance.

-> Binning: Choosing the number of bins or the boundaries of bins when discretizing a continuous feature.

-> Normalization/Scaling: Deciding which scaling method to use (min-max, z-score) and calculating the necessary scaling factors (e.g., mean and standard deviation).
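
As a minimal sketch of the parameters described above (the sample values are invented for illustration), scikit-learn's `Binarizer`, `KBinsDiscretizer`, and `StandardScaler` show a user-set threshold, a user-set number of bins, and scaling factors that are estimated from the data:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer, StandardScaler

# Illustrative feature column (values invented for this example)
x = np.array([[0.1], [0.4], [0.6], [0.9]])

# User-defined parameter: the threshold used for binarization
binary = Binarizer(threshold=0.5).fit_transform(x)

# User-defined parameter: the number of equal-width bins
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
bins = binner.fit_transform(x)

# Parameters estimated from the data: mean and standard deviation
scaler = StandardScaler().fit(x)

print(binary.ravel())                # values above 0.5 become 1, the rest 0
print(bins.ravel())                  # bin index of each value
print(scaler.mean_, scaler.scale_)   # learned scaling factors
```

Changing `threshold` or `n_bins` changes the resulting feature, which is why such parameters are worth tuning.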

# Q-2) What is correlation? What does negative correlation mean?

Ans) Correlation is a statistical measure that describes the relationship between two features (variables).

->  It indicates how one feature changes in relation to another. Correlation values range from -1 to 1:

* 1 → Perfect positive correlation (both increase together).

* 0 → No linear correlation (the features may still be related in a non-linear way).

* -1 → Perfect negative correlation (one increases while the other decreases).

-> In feature engineering, correlation helps in feature selection by identifying redundant or irrelevant features.

-> A negative correlation means that as one feature increases, the other decreases. This is represented by a correlation value between -1 and 0.

-> Strong negative correlation might indicate redundant features that can be removed.

-> If two features are highly negatively correlated, keeping both might not add value to the model.


# Q-3) Define Machine Learning. What are the main components in Machine Learning?

Ans) Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed.

->  Instead of using predefined rules, ML models identify patterns and relationships within data to generalize and make informed predictions. The main components are:

* Dataset

-> A collection of data used for training and testing the model.

-> Can be structured (tables, databases) or unstructured (images, text).

* Features (Input Variables)

-> Independent variables used to train the model.

-> Feature engineering improves model performance by selecting and transforming relevant features.

* Model (Algorithm)

-> A mathematical structure that learns from data.

-> Examples: Linear Regression, Decision Trees, Neural Networks.

* Loss Function

-> Evaluates how well the model's predictions match the actual values.

-> Example: Mean Squared Error (MSE) for regression.

* Optimization Algorithm

-> Adjusts model parameters to minimize errors.
Example: Gradient Descent is commonly used in neural networks.

* Training Process

-> The phase where the model learns patterns from labeled data by adjusting its parameters.

* Evaluation Metrics

-> Used to assess model performance on unseen data.

-> Examples: Accuracy, Precision, Recall, F1 Score.

* Validation and Testing

-> Used to check how well the model generalizes to new data.

-> Helps detect overfitting by checking that the model isn't simply memorizing the training data.

* Deployment & Monitoring

-> Once trained, the model is deployed for real-world use.
Continuous monitoring ensures the model remains accurate over time.

# Q-4) How does loss value help in determining whether the model is good or not?

Ans) The loss value is a numerical measure of how well a machine learning model's predictions match the actual target values.

->  It helps determine whether the model is learning effectively or if it needs further improvements.

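-> A lower loss value means the model's predictions are closer to the actual targets.

-> A higher loss value suggests the model is performing poorly.

-> Loss should be compared between training and validation data: a low training loss with a much higher validation loss suggests overfitting.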

-> The model updates its parameters (weights) to minimize the loss during training.

-> This process is done using optimization algorithms like Gradient Descent.

-> Common loss functions for regression models:

* Mean Squared Error (MSE)
* Mean Absolute Error (MAE)

-> Common loss functions for classification models:

* Cross-Entropy Loss (Log Loss)
* Hinge Loss (for SVMs)
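
To make the link between loss and prediction quality concrete, here is a small sketch using `sklearn.metrics`. The targets and predictions are invented; the point is only that closer predictions yield a lower loss.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss

# Regression: compare predictions with the true targets (values invented)
y_true = np.array([3.0, 5.0, 7.0])
good_pred = np.array([2.9, 5.1, 7.2])
bad_pred = np.array([1.0, 8.0, 4.0])

mse_good = mean_squared_error(y_true, good_pred)
mse_bad = mean_squared_error(y_true, bad_pred)
mae_good = mean_absolute_error(y_true, good_pred)

# Classification: cross-entropy on predicted probabilities of class 1
labels = [0, 1, 1]
mostly_right = [0.1, 0.9, 0.8]
mostly_wrong = [0.9, 0.2, 0.1]
ce_good = log_loss(labels, mostly_right)
ce_bad = log_loss(labels, mostly_wrong)

print(mse_good, mse_bad)   # the better predictions give the smaller MSE
print(mae_good)
print(ce_good, ce_bad)     # confident wrong probabilities are penalized heavily
```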


# Q-5) What are continuous and categorical variables?

Ans) Variables (features) are broadly classified into two types: continuous and categorical.

-> A continuous variable can take any numerical value within a given range. These values are measurable and often have decimal points.

Example:

-> Height (in cm) → 175.5 cm

-> There are infinitely many possible values in a range (e.g., 0.1, 0.25, 0.333… up to 100).

-> A categorical variable represents distinct groups or categories, rather than numerical values.

**Types of Categorical Variables:**

* Nominal Variables (No Order):

-> Gender → Male, Female, Other

-> Blood Group → A, B, AB, O

* Ordinal Variables (Has Order/Ranking):

-> Education Level → High School, Bachelor's, Master's, PhD

-> Customer Satisfaction → Low, Medium, High

->  Categorical variables require encoding since ML models work with numbers, not text.

->  Understanding variable types helps in feature selection & transformation for better model performance.
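
One way to record these variable types in code is with pandas dtypes. The sketch below uses an invented three-row table and marks the ordinal column as an ordered categorical, so comparisons respect the declared ranking:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [175.5, 160.2, 182.0],          # continuous
    "blood_group": ["A", "O", "AB"],             # nominal
    "satisfaction": ["Low", "High", "Medium"],   # ordinal
})

# Nominal: categories without any order
df["blood_group"] = pd.Categorical(df["blood_group"])

# Ordinal: categories with an explicit order Low < Medium < High
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["Low", "Medium", "High"], ordered=True
)

print(df.dtypes)
print(df["satisfaction"].min())   # the lowest category under the declared order
```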



# Q-6) How do we handle categorical variables in Machine Learning? What are the common techniques?

Ans) Machine Learning models require numerical inputs, so categorical variables must be transformed into numerical values.

-> The process of converting categorical data into a format that ML models can understand is called encoding.

* One-Hot Encoding (OHE)

->  Converts each category into a separate binary (0 or 1) column.

->  Used for nominal (unordered) categories.

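* Label Encoding

-> Assigns a unique integer to each category.

->  The integers are arbitrary (scikit-learn's LabelEncoder sorts the categories and numbers them from 0), so it is best suited to target labels. Applied to a nominal feature, it can suggest an order that does not exist.

* Ordinal Encoding

->  Also maps categories to integers, but follows a meaningful order that you specify (e.g., Low < Medium < High), so it is used for ordinal features.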

*  Target Encoding (Mean Encoding)

-> Replaces categories with the mean of the target variable.

-> Used in supervised learning problems.
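
As a rough illustration of one-hot and target encoding, the sketch below uses plain pandas on an invented `city` / `price` table (both column names are hypothetical). In practice the category means for target encoding should be computed on the training data only, to avoid leaking the target into the features.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Oslo"],
    "price": [10.0, 20.0, 30.0, 40.0],   # target variable
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Target (mean) encoding: replace each category with the mean target value
city_means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_means)

print(one_hot)
print(df)
```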



# Q-7) What do you mean by training and testing a dataset?

Ans) In Machine Learning, a dataset is typically split into training and testing sets to evaluate a model's performance on unseen data.

* Training Dataset

->  The training dataset is used to train the machine learning model.

-> The model learns patterns, relationships, and features from this data.

-> It helps the model adjust its internal parameters (weights).

* Testing Dataset

->  The testing dataset is used to evaluate the model's performance.

->  It contains new data that the model has never seen before.

->  Helps check if the model generalizes well or if it is overfitting.

 A common split ratio:

* 80% Training Data
* 20% Testing Data
* Other ratios: 70-30, 90-10, 60-40 (depends on dataset size).

# Q-8) What is sklearn.preprocessing?

Ans) sklearn.preprocessing is a module in Scikit-Learn that provides various data preprocessing techniques to transform raw data into a format suitable for machine learning models.

->  It includes methods for scaling, normalization, and encoding. (Imputation of missing values lives in the separate sklearn.impute module.)

**Functions in sklearn.preprocessing:**

* Standardization (Z-score Scaling)

->  Ensures data has mean = 0 and standard deviation = 1

->  Helps models like Logistic Regression and SVM perform better.

*  Min-Max Scaling (Normalization)

->  Scales data between a fixed range (default: 0 to 1)

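->  Useful for distance-based models (e.g., KNN) and for neural networks, which train more reliably on inputs with a small, consistent range.

*  Label Encoding

->  Converts categorical labels into integer values.

->  LabelEncoder is intended for the target variable (y); for ordered input features, OrdinalEncoder is the usual choice.

**Why preprocessing helps:**

-> Improves numerical stability (important for algorithms trained with Gradient Descent).

->  Improves model accuracy (badly scaled data can lead to poor performance).

-> Provides encoders that turn categorical data into numbers; you still choose and apply the encoder yourself.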
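
A short sketch of two of these tools, `MinMaxScaler` and `LabelEncoder`, on invented values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Min-max scaling maps the smallest value to 0 and the largest to 1
ages = np.array([[20.0], [30.0], [40.0]])
scaled = MinMaxScaler().fit_transform(ages)   # default feature_range=(0, 1)

# Label encoding numbers the classes in sorted order: ham -> 0, spam -> 1
labels = ["spam", "ham", "spam"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(scaled.ravel())
print(encoder.classes_)
print(encoded)
```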

# Q-9) What is a Test set?

Ans) A test set is a subset of a dataset that is used to evaluate the performance of a trained machine learning model.

-> It contains new, unseen data that was not used during training.

->  Measures Model Performance → Helps check how well the model generalizes to new data.

->  Detects Overfitting: A large gap between training and test performance shows that the model is memorizing the training data.

-> Used for Final Evaluation: After training and validation, the test set provides the final accuracy, precision, recall, etc.



```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([10, 20, 30, 40, 50])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data:", X_train)
print("Test Data:", X_test)
```

```
Training Data: [[5]
 [3]
 [1]
 [4]]
Test Data: [[2]]
```


# Q-10) How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

Ans) In Machine Learning, we split the dataset into a training set (for model learning) and a test set (for evaluation).

-> This is done using the train_test_split() function from sklearn.model_selection.

*  Splitting Data Using train_test_split()

->  Default split: 75% training, 25% testing when neither test_size nor train_size is given; an 80-20 split requires test_size=0.2.

-> Setting random_state (e.g., random_state=42) makes the split reproducible.

* Customizing the Train-Test Split

->  Change test size: Use test_size=0.3 for 70-30 split, test_size=0.1 for 90-10 split.

->  Stratify split: If the dataset is imbalanced, use stratify=y.

->  Shuffle data: Enabled by default (shuffle=True), but can be disabled (shuffle=False).

* Splitting Data into Train, Validation, and Test Sets

->  Sometimes, we also use a validation set to tune hyperparameters.

->  Common split: 70% training, 15% validation, 15% testing.

*  Approaching a Machine Learning problem:

->  Understand the problem before jumping into coding.

->  Preprocess data to clean and structure it for models.

->  Train and test models using appropriate ML techniques.

->  Evaluate and optimize models for better accuracy.

->  Deploy the model and monitor its real-world performance.
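
The 70/15/15 train, validation, and test split mentioned above can be sketched with two calls to `train_test_split`. The data here is synthetic, and splitting in two stages is one common approach rather than the only one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, balanced binary labels
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# Step 1: hold out 30% of the data
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Step 2: split the held-out part in half -> 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used while tuning hyperparameters; the test set is kept aside for the final evaluation.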

# Q-11) Why do we have to perform EDA before fitting a model to the data?

Ans) Performing Exploratory Data Analysis (EDA) is a crucial step before fitting a machine learning model for several reasons.

-> It helps us understand the data better and ensure we are well-prepared to build an effective and reliable model.

->  Identify feature types (numerical, categorical, datetime) and their distributions.

->  Helps in selecting the appropriate model and preprocessing techniques (e.g., scaling, encoding).

-> Handle missing data appropriately before fitting the model.

->  Missing data can lead to biased or inaccurate model results if not treated.

->  Outliers can have a significant impact on certain models (e.g., Linear Regression, KNN).

->  Detect and handle outliers to prevent misleading results.

->  Identify correlations between features to better understand dependencies.

->  Highly correlated features can cause multicollinearity in models like Linear Regression.

->  Feature engineering opportunities arise from understanding feature relationships.
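
A few of these EDA checks can be sketched in pandas. The table below is invented, and the 1.5 × IQR rule is just one common convention for flagging outliers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [40_000, 52_000, 61_000, 58_000, 45_000, 900_000],  # last value is an outlier
    "city": ["A", "B", "A", "C", "B", "A"],
})

print(df.dtypes)            # feature types
print(df.describe())        # distributions of the numeric columns
missing = df.isna().sum()   # number of missing values per column

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# Pairwise correlation between numeric features (missing values are skipped)
corr = df[["age", "income"]].corr()

print(missing)
print(outliers)
print(corr)
```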



# Q-12) What is correlation?

Ans) Correlation refers to a statistical relationship or association between two or more variables.

-> It measures the extent to which one variable changes when another variable changes.

-> The correlation value lies between -1 and +1, and it indicates both the strength and direction of the relationship.

**Types of Correlation**

* Positive Correlation (0 < r ≤ +1)

-> Both variables increase or decrease together.

Example: As the number of study hours increases, the exam score tends to increase.

* Negative Correlation (-1 ≤ r < 0)

-> One variable increases while the other decreases.

Example: As the amount of sleep decreases, fatigue increases.

* No Correlation (r ≈ 0)

-> No linear relationship between the variables.

Example: Shoe size and intelligence likely have no correlation.

-> Correlation coefficient quantifies this relationship, with values between -1 and +1.

-> Pearson's correlation is used for linear relationships, and Spearman's rank correlation is used for monotonic relationships.
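
The difference between Pearson and Spearman correlation shows up on data that is monotonic but not linear. In this sketch (invented data, `y = x³`), the Spearman coefficient is 1 because the ranks agree perfectly, while the Pearson coefficient stays below 1:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y"] = df["x"] ** 3   # always increasing, but not a straight line

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(pearson)    # high, but below 1 because the relationship is curved
print(spearman)   # 1 because the ordering of x and y is identical
```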


# Q-13) What does negative correlation mean?

Ans) Negative correlation refers to a statistical relationship between two variables where as one variable increases, the other decreases.

->  In simpler terms, they move in opposite directions. If one variable goes up, the other tends to go down, and vice versa.

**Key Characteristics of Negative Correlation:**

* Inverse Relationship:

-> When one variable increases, the other decreases.

Example: Temperature and heating costs — As temperature increases, the need for heating decreases.

* Correlation Coefficient:

-> The correlation coefficient (r) for a negative correlation is less than 0 and no lower than -1.

-> A perfect negative correlation has an r value of -1.

-> A strong negative correlation has an r value closer to -1 (e.g., -0.8).

-> A weak negative correlation has an r value closer to 0 but still negative (e.g., -0.2).

-> Negative correlation means one variable increases while the other decreases.

-> Correlation coefficient for negative correlation is between 0 and -1.

-> A strong negative correlation has a coefficient close to -1, while a weak negative correlation is closer to 0.
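
As a quick numerical check of the temperature and heating-cost example, the sketch below computes r with `numpy.corrcoef`. The cost figures are invented:

```python
import numpy as np

temperature = np.array([0, 5, 10, 15, 20, 25])        # degrees Celsius
heating_cost = np.array([120, 100, 85, 60, 35, 10])   # invented monthly cost

# corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(temperature, heating_cost)[0, 1]

print(round(r, 3))   # close to -1: a strong negative correlation
```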


# Q-14) How can you find correlation between variables in Python?

Ans) Pandas provides the DataFrame.corr() method to calculate the pairwise correlation between the numerical columns of a DataFrame (Pearson correlation by default).



```python
import pandas as pd

data = {
    'age': [25, 30, 35, 40, 45],
    'height': [150, 160, 170, 180, 190],
    'weight': [55, 60, 65, 70, 75]
}

df = pd.DataFrame(data)

correlation_matrix = df.corr()

print(correlation_matrix)
```

```
        age  height  weight
age     1.0     1.0     1.0
height  1.0     1.0     1.0
weight  1.0     1.0     1.0
```

-> Every entry is 1.0 because each column in this toy dataset is an exact linear function of the others.


# Q-15) What is causation? Explain difference between correlation and causation with an example.

Ans) Causation refers to a cause-and-effect relationship between two variables, where one variable directly influences or causes a change in the other.

->  In other words, causation indicates that a change in one variable directly results in a change in another.

* Cause: The factor or action that brings about the change.

* Effect: The change that occurs as a result of the cause.

**Difference Between Correlation and Causation:**

* Correlation:

->  Correlation is a statistical relationship between two variables. It tells you how two variables move relative to each other, but it does not imply that one variable causes the other to change.

-> Does not imply a cause-effect relationship.

* Causation:

-> Causation indicates that one variable directly causes a change in another. If X causes Y, it means changes in X lead to changes in Y. Causation implies a cause-effect relationship.

-> Implies cause-and-effect relationship.


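-> Correlation can be misleading, making us think that one variable drives another when it does not. For example, ice cream sales and drowning incidents both rise in summer, so they are positively correlated, yet neither causes the other; hot weather drives both.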

-> Causation is harder to prove but much more valuable because it shows a true cause-and-effect relationship that we can use to predict or prevent outcomes.
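
A small simulation can illustrate how a shared cause produces correlation without causation. Everything here is synthetic: the variable names and coefficients are assumptions chosen for illustration, with temperature driving both of the other variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Temperature is the common cause of both variables
temperature = rng.uniform(10, 35, size=n)
ice_cream = 2.0 * temperature + rng.normal(0, 3, size=n)
drownings = 0.5 * temperature + rng.normal(0, 2, size=n)

# The two effects are strongly correlated with each other
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(values, driver):
    """Remove the part of `values` explained by a straight-line fit on `driver`."""
    slope, intercept = np.polyfit(driver, values, deg=1)
    return values - (slope * driver + intercept)

# After removing the effect of temperature, the correlation almost vanishes
adjusted_r = np.corrcoef(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)[0, 1]

print(round(raw_r, 2), round(adjusted_r, 2))
```

The raw correlation is high even though neither variable appears in the other's formula.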

# Q-16) What is an Optimizer? What are different types of optimizers? Explain each with an example.

Ans) In machine learning, an optimizer is an algorithm or method used to minimize (or maximize) a loss function by adjusting the parameters of a model (e.g., weights in a neural network).

->  The goal of an optimizer is to find the best parameters that reduce the error (or loss) during training, so the model can make accurate predictions on new data.

-> Optimizers play a crucial role in training machine learning models, especially in gradient-based optimization, where they adjust model parameters iteratively based on the gradient of the loss function.

* Gradient Descent

-> Gradient Descent (GD) is the most basic and widely used optimization algorithm.

->  It works by computing the gradient (or derivative) of the loss function with respect to the model parameters and then updating the parameters in the direction that minimizes the loss.

* Stochastic Gradient Descent (SGD)

-> Stochastic Gradient Descent (SGD) is a variant of Gradient Descent, where instead of using the entire dataset to compute the gradient, a single data point (or a small batch) is used at each iteration.

* Adam (Adaptive Moment Estimation)

-> Adam combines the benefits of Momentum and RMSProp (another adaptive learning rate optimizer).

-> It computes adaptive learning rates for each parameter by considering both the first moment (mean) and the second moment (uncentered variance) of the gradients.
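
The three optimizers can be compared on a toy problem. The sketch below fits a single weight `w` in the model `y_hat = w * x` with NumPy; the data, learning rates, and step counts are assumptions chosen so that each method converges, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x   # noiseless targets, so the best weight is exactly 3.0

def grad(w, xb, yb):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return np.mean(2.0 * (w * xb - yb) * xb)

def gradient_descent(lr=0.5, steps=200):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, x, y)   # gradient over the full dataset
    return w

def sgd(lr=0.1, steps=2000):
    w = 0.0
    for _ in range(steps):
        i = rng.integers(len(x))                      # pick one random sample
        w -= lr * grad(w, x[i:i + 1], y[i:i + 1])     # gradient from that sample only
    return w

def adam(lr=0.05, steps=1000, beta1=0.9, beta2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w, x, y)
        m = beta1 * m + (1 - beta1) * g        # first moment (running mean)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_gd, w_sgd, w_adam = gradient_descent(), sgd(), adam()
print(w_gd, w_sgd, w_adam)   # all three approach the true weight 3.0
```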

# Q-17) What is sklearn.linear_model ?

Ans) In scikit-learn, sklearn.linear_model is a module that contains various linear models for regression and classification tasks.

->  These models are based on the concept of linear relationships, meaning they aim to predict the target variable as a linear combination of the input features.

-> This module includes commonly used linear algorithms such as Linear Regression, Logistic Regression, Ridge Regression, Lasso, and others.

**Classes in sklearn.linear_model**

* LinearRegression (ordinary least squares) is a sensible baseline for regression problems when there is no reason to regularize the model.

* LogisticRegression is a common first choice for classification tasks (binary and multiclass).

* Ridge adds L2 regularization and is useful when you have multicollinearity or many features that are highly correlated.

* Lasso adds L1 regularization, which can shrink some coefficients to exactly zero, so it is a good fit when you need feature selection.

* ElasticNet combines L1 and L2 regularization, which is useful when you have many correlated features and still want a sparse model.
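
To show how these classes differ in practice, here is a sketch on synthetic data in which the second feature is irrelevant. The `alpha` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first and third features matter; the second is irrelevant
y = 4.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(ols.coef_.round(2))     # close to the true coefficients
print(ridge.coef_.round(2))   # slightly shrunk toward zero
print(lasso.coef_.round(2))   # irrelevant feature removed, others shrunk
```

Lasso trades some accuracy in the remaining coefficients for a sparser model.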

# Q-18) What does model.fit() do? What arguments must be given?

Ans) The fit() method is used to train a machine learning model. It adjusts the model's internal parameters (like weights in a linear model) based on the provided training data.

->  In other words, it learns the patterns from the input data; the fitted model can then make predictions or classifications on new data via predict().

-> Learns from the training data (X_train as features and y_train as target labels).

-> Adjusts its internal parameters (like coefficients, weights, or other model-specific parameters) to minimize the error or loss function.

**Arguments for model.fit()**

* X (features): A 2D array or matrix representing the input features of your dataset.

-> This is the data that will be used to train the model.

-> Shape: (n_samples, n_features) where n_samples is the number of data points (rows), and n_features is the number of features (columns).

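* y (target): A 1D array or vector representing the target values (labels) corresponding to the input data (X). It is required for supervised models; unsupervised estimators such as KMeans are fitted with X alone.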

-> This is the output or the label that the model tries to predict.

-> Shape: (n_samples,), where n_samples is the number of data points.
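
A minimal example of `fit()` with the shapes described above, using `LinearRegression` on invented data that follows `y = 2x + 1`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3], [4]])   # shape (n_samples=4, n_features=1)
y_train = np.array([3, 5, 7, 9])           # shape (4,), follows y = 2x + 1

model = LinearRegression()
model.fit(X_train, y_train)                # learns coef_ and intercept_

print(X_train.shape, y_train.shape)
print(model.coef_, model.intercept_)       # approximately [2.] and 1.0
```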

# Q-19) What does model.predict() do? What arguments must be given?

Ans) The predict() method is used to make predictions based on the trained model.

->  After the model has been trained using the fit() method, the predict() method is called to generate predictions (outputs) for new, unseen data.

-> It takes the input features as an argument and returns the predicted output.

-> The model uses the parameters it learned during training (such as weights and biases) to compute the predictions for the new data.

**Arguments for model.predict()**

-> The predict() method generally requires the following argument:

* X (features): A 2D array (or a similar structure) representing the input data for which you want to generate predictions.

-> This input data should have the same number of features as the training data (i.e., X_train).

-> Shape: (n_samples, n_features) where n_samples is the number of data points you want predictions for, and n_features is the number of features that the model was trained on.
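
The sketch below calls `predict()` on a model fitted to invented data (`y = 2x + 1`) and also shows what happens when the number of features does not match the training data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on one feature; the data follows y = 2x + 1
model = LinearRegression().fit(np.array([[1], [2], [3], [4]]), np.array([3, 5, 7, 9]))

X_new = np.array([[5], [6]])          # shape (n_samples=2, n_features=1)
predictions = model.predict(X_new)    # one prediction per row

# Passing a different number of features than used in fit() raises ValueError
try:
    model.predict(np.array([[5, 6]]))   # 2 features instead of 1
    mismatch_error = False
except ValueError:
    mismatch_error = True

print(predictions)
print(mismatch_error)
```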

# Q-20) What are continuous and categorical variables?

Ans) Variables are broadly classified as continuous or categorical. These classifications help determine the types of models, analysis techniques, and preprocessing steps to apply to the data.

* Continuous Variables

-> Continuous variables are variables that can take any value within a given range or scale.

->  They are numerical and often represent quantities or measurements. Since these values are not restricted to specific, distinct values, they can take on an infinite number of possibilities within the range.

-> Can take any value (including decimals) within a range.

-> Represent measurements or magnitudes (counts are strictly discrete, although they are often treated as continuous when their range is large).

-> Can be manipulated mathematically (addition, subtraction, multiplication, etc.).

* Categorical Variables

-> Categorical variables are variables that represent categories or groups.

->  They take on a limited, fixed number of values and are typically used to describe qualities or characteristics rather than quantities.

-> Categorical variables can be either nominal (no specific order) or ordinal (have a specific order).

* Nominal:

-> No intrinsic order or ranking between the categories.

Examples: Gender, Color (red, blue, green)

* Ordinal:

-> Categories have a specific order or ranking.

Examples: Education level (High School < Bachelor's < Master's)

**Key characteristics of categorical variables:**

-> Limited, distinct categories (often strings or labels).

-> Arithmetic operations (e.g., averaging) are not meaningful, even when the categories are coded as numbers.

-> Used to represent qualitative data.

# Q-21) What is feature scaling? How does it help in Machine Learning?

Ans) Feature scaling is the process of normalizing or standardizing the range of independent variables (features) in your dataset.

-> This is important because the scale of the features can significantly affect the performance of certain machine learning models.

->  In feature scaling, you ensure that all features are on a similar scale, so that no one feature dominates the learning process due to its larger values.

-> Improves Model Performance: Many machine learning algorithms, especially those that rely on distance or gradients, are sensitive to the scale of features.

-> If one feature has a much larger scale than others, it could disproportionately influence the model's performance.

-> Ensures Fair Contribution: When features have different scales, models might give more importance to features with larger ranges (such as income or age in raw form).

->  Feature scaling puts all features on a comparable scale, so that none dominates simply because of its units.

-> Scaling is especially important for algorithms like KNN, SVM, and those using gradient-based optimization.

-> Tree-based algorithms are generally not sensitive to feature scaling.
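
The effect on distance-based methods can be seen by comparing Euclidean distances before and after scaling. The three data points below are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), income (currency units); values invented
people = np.array([[25.0, 50_000.0],
                   [60.0, 51_000.0],
                   [26.0, 58_000.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(a - b)

# Raw distances: income differences swamp age differences
raw_0_1 = dist(people[0], people[1])   # 35-year age gap, similar income
raw_0_2 = dist(people[0], people[2])   # 1-year age gap, different income

# After standardization both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
scaled_0_1 = dist(scaled[0], scaled[1])
scaled_0_2 = dist(scaled[0], scaled[2])

print(raw_0_1, raw_0_2)
print(scaled_0_1, scaled_0_2)
```

On the raw data the 35-year age gap barely registers next to the income difference; after scaling the two distances are of similar size.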

# Q-22) How do we perform scaling in Python?

Ans) In Python, we can perform feature scaling using the scikit-learn library, which provides a set of utilities for scaling and normalizing features.

->  Three common approaches are Standardization, Min-Max Scaling, and Robust Scaling:

-> Use StandardScaler() for Z-score standardization.

-> Use MinMaxScaler() for scaling to a specified range (e.g., [0, 1]).

-> Use RobustScaler() when your data contains outliers and you want to use the median and IQR for scaling.

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[22, 40000],
                 [35, 80000],
                 [60, 120000],
                 [45, 75000]])

scaler = StandardScaler()

scaled_data = scaler.fit_transform(data)

print("Standardized Data:\n", scaled_data)
```

```
Standardized Data:
 [[-1.3307975  -1.36602321]
 [-0.3956425   0.04406526]
 [ 1.4027325   1.45415374]
 [ 0.3237075  -0.13219579]]
```
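
The other two scalers follow the same pattern. This sketch uses invented values with one outlier, and fits each scaler on the training data only before transforming the test data, which is the usual way to avoid leaking test information:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

train = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])  # last value is an outlier
test = np.array([[25.0]])

minmax = MinMaxScaler().fit(train)   # learns min and max from the training data
robust = RobustScaler().fit(train)   # learns median and IQR from the training data

print(minmax.transform(test))   # squeezed near 0 because the outlier sets the max
print(robust.transform(test))   # (25 - median) / IQR, unaffected by the outlier
```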


# Q-23) What is sklearn.preprocessing?

Ans) sklearn.preprocessing is a module in scikit-learn (a Python library for machine learning) that provides various functions and classes to prepare and scale your data before feeding it into a machine learning model.

->  Preprocessing refers to the operations you apply to your data to make it more suitable for learning algorithms, such as scaling, normalizing, encoding categorical variables, and more.

* Feature Scaling: Normalizing or standardizing features to ensure that they are on a similar scale.

* Encoding Categorical Variables: Converting categorical variables into numerical format.

* Handling Missing Data: Filling missing values is done with the imputers in the separate sklearn.impute module (e.g., SimpleImputer), typically alongside these preprocessing steps.

* Transforming Data: Applying mathematical transformations like logarithmic scaling.
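
As a small sketch of the last two items, the example below fills a missing value with `SimpleImputer` (which lives in `sklearn.impute`) and then applies a log transform through `FunctionTransformer`. The input values are invented:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [np.nan], [99.0]])

# Fill the missing value with the column mean (the mean of 1 and 99 is 50)
filled = SimpleImputer(strategy="mean").fit_transform(X)

# Apply a log transform; log1p(x) computes log(1 + x)
logged = FunctionTransformer(np.log1p).fit_transform(filled)

print(filled.ravel())
print(logged.ravel())
```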

# Q-24) How do we split data for model fitting (training and testing) in Python?

Ans) To split data for model fitting (training and testing) in Python, we generally use train_test_split() from sklearn.model_selection.

-> This function randomly splits a dataset into two subsets: one for training the model and another for testing the model's performance.

-> This helps ensure that the model is evaluated on data it has not seen during training.

**Steps for splitting data:**

* Import Required Libraries: You need to import train_test_split from sklearn.model_selection.

* Prepare Your Data: You should have your features (X) and labels/target (y) separated into variables.

* Split the Data: Use the train_test_split() function to randomly split the data into training and testing sets.

**Syntax:**

```python
from sklearn.model_selection import train_test_split

# X (features) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


-> Use train_test_split() to divide your data into training and testing subsets.

-> You can control the split ratio using the test_size parameter (e.g., 0.2 for an 80-20 split).

-> Setting random_state makes the split reproducible.

-> Use stratify=y for classification problems with class imbalance, so that both subsets keep the original class proportions.
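
To see what stratification does, the sketch below splits a synthetic dataset in which only 10% of the labels are positive:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)   # imbalanced: 10% positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep the 10% positive rate of the full dataset
print(y_train.mean(), y_test.mean())
```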

# Q-25) Explain data encoding?

Ans) Data encoding is the process of converting categorical variables (variables with labels or categories) into a numerical format so that machine learning algorithms can work with them.

->  Machine learning models typically expect numerical input, and encoding helps in converting these categorical variables into numerical values that can be fed into the model.

-> There are different types of encoding techniques, each with its use case.

*  Label Encoding (Integer Encoding)

-> Label Encoding converts each category into a unique integer value. Each label in a categorical column is assigned an integer starting from 0.

*  One-Hot Encoding

-> One-Hot Encoding creates new binary columns for each unique category.

-> Each column represents one category, and it will have a 1 if the instance belongs to that category and a 0 otherwise.

*  Ordinal Encoding

-> Ordinal Encoding is similar to Label Encoding but is specifically used when the categories have an ordinal relationship (e.g., small, medium, large).

-> The order of the categories is important in this encoding.
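
The difference between ordinal and label encoding can be sketched with scikit-learn. The size categories come from the example above; the sample rows are invented:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Ordinal encoding with an explicit order: small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes).ravel()

# Label encoding numbers the categories in sorted (alphabetical) order instead:
# large -> 0, medium -> 1, small -> 2
label_codes = LabelEncoder().fit_transform(sizes.ravel())

print(ordinal_codes)   # codes follow the declared order
print(label_codes)     # codes follow alphabetical order
```

Only the ordinal codes preserve the intended ranking.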
