# Feature Engineering

## Q.1. What is a parameter?

--> A model parameter is a configuration variable that is internal to the model and whose value can be estimated or learned from the given training data. Parameters define the skill of the model on a specific problem and are not set manually by the practitioner. Examples include the coefficients in linear regression or the weights and biases in a neural network.
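
As a minimal sketch of what "learned from the training data" means, the example below fits a scikit-learn `LinearRegression` to toy data generated from y = 3x + 2 (values chosen purely for illustration) and reads back the learned coefficient and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3x + 2 (illustrative values, not from the text).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# The learned parameters: one coefficient per feature plus an intercept.
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The practitioner never sets `coef_` or `intercept_` directly; they are estimated by `fit()`.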

## Q.2. What is correlation? What does negative correlation mean?

--> Correlation is a statistical measure that describes the strength and direction of the linear relationship between two variables. It quantifies the extent to which changes in one variable are associated with changes in the other. The relationship is expressed by the Pearson correlation coefficient (r), which ranges from -1 to +1: values near -1 or +1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.

--> Negative correlation (or inverse correlation) means that as one variable increases, the other variable generally decreases, and vice versa. The correlation coefficient for a negative correlation is a value below 0, with -1 indicating a perfect negative relationship. An example is the relationship between price and demand: as the price of a commodity increases, its demand tends to decrease.

## Q.3. Define Machine Learning. What are the main components in Machine Learning?

--> Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on building systems that can learn from data and improve with experience without being explicitly programmed.

--> The main components in machine learning typically involve:

> Data:
>> High-quality, relevant data is the foundation for training ML models.

> Algorithm:
>> A set of rules and procedures used to solve a specific problem or perform a task.

> Model:
>> The output or result of applying an algorithm to a dataset after training, which is used to make predictions.

> Loss Function/Objective:
>> A mathematical function that measures the error between the model's predictions and the actual values. The goal is to minimize this loss during training.

> Optimization:
>> The process of adjusting the model's internal parameters (weights and biases) to minimize the loss function.

## Q.4. How does loss value help in determining whether the model is good or not?

--> The loss value is a numerical metric that quantifies the difference (error) between a model's predicted values and the actual true values (ground truth). A lower loss value indicates that the model's predictions are closer to the actual values. During training, the goal is to minimize this loss. A low training loss alone does not prove the model is good, however: if the loss on held-out (validation or test) data is much higher than the training loss, the model is overfitting. A high loss on both training and held-out data indicates that the model is underfitting.
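
To make the idea concrete, here is a small sketch that computes mean squared error, one common loss for regression, for two hypothetical sets of predictions. The numbers and the helper name `mean_squared_error` are made up for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0, 9.0]
close_preds = [2.5, 5.5, 7.0, 9.0]  # predictions from a hypothetical good model
far_preds = [0.0, 0.0, 0.0, 0.0]    # predictions from a hypothetical poor model

loss_close = mean_squared_error(y_true, close_preds)
loss_far = mean_squared_error(y_true, far_preds)
print(loss_close, loss_far)  # the smaller loss belongs to the better predictions
```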

## Q.5. What are continuous and categorical variables?

--> Continuous variables are numerical data that can take any value within a given range, with infinitely many possible intermediate values. Examples include height, weight, temperature, and time, all of which are measured rather than counted.

--> Categorical variables represent distinct groups or categories and are descriptive rather than numerical. Examples include hair color, dog breed, or education level.

## Q.6. How do we handle categorical variables in Machine Learning? What are the common techniques?

--> Categorical variables must be converted to a numerical format because most machine learning algorithms require numerical input. Common techniques include: 

> One-Hot Encoding:
>> Creates a new binary (0 or 1) column for each category present in the original feature. It is suitable for nominal data where no inherent order exists between categories.

>Label Encoding:
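>> Assigns a unique integer value to each category. This is best used for ordinal data where the order of categories matters (e.g., "small", "medium", "large" encoded as 0, 1, 2), provided the integers are assigned in that order. scikit-learn's LabelEncoder, for example, assigns integers in sorted (alphabetical) order, so the mapping usually has to be specified explicitly. Using it for nominal data can imply an order that does not exist.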

>Frequency Encoding:
>> Replaces each category with its frequency or count in the dataset.

>Target Encoding:
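>> Replaces a category with the mean of the target variable for that category. The means should be computed on the training data only; otherwise information about the target leaks into the features.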
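
The four techniques above can be sketched with pandas alone. The DataFrame, its column names, and the `size_order` mapping are hypothetical; in a real project the frequency and target statistics would be computed on the training split only.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red", "blue"],
    "size": ["small", "large", "medium", "small", "large", "medium"],
    "bought": [1, 0, 1, 0, 1, 1],  # hypothetical binary target
})

# One-hot encoding: one 0/1 column per category of a nominal feature.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Label-style encoding of an ordinal feature with an explicit order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Frequency encoding: replace each category with its count in the data.
df["color_freq"] = df["color"].map(df["color"].value_counts())

# Target encoding: replace each category with the mean target for that category.
df["color_target"] = df["color"].map(df.groupby("color")["bought"].mean())

print(one_hot.join(df[["size_encoded", "color_freq", "color_target"]]))
```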

## Q.7. What do you mean by training and testing a dataset?

--> Training refers to the process of using a large, labeled portion of data to teach a machine learning model to recognize patterns and adjust its internal parameters (weights and biases).

--> Testing involves using a separate, unseen subset of the data to evaluate the performance and generalization ability of the trained model. This step ensures the model hasn't simply memorized the training data (overfitting) and can make accurate predictions on new data.
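
The following sketch illustrates why a separate test set matters. It uses synthetic data from `make_classification` with deliberately noisy labels (all settings here are illustrative assumptions) and an unrestricted decision tree, which can memorize the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly flipped labels (illustrative values).
X, y = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree can grow until it fits every training sample.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

A large gap between the two scores is the typical signature of overfitting; the training score alone would not reveal it.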

## Q.8. What is sklearn.preprocessing?

--> The sklearn.preprocessing module in the scikit-learn Python library provides utility functions and transformer classes that change raw feature vectors into a representation more suitable for downstream machine learning estimators. It includes tools for standardization, scaling features to a range, normalization, binarization, and encoding categorical features. Imputation of missing values is handled by the separate sklearn.impute module.

## Q.9. What is a Test set?

--> A test set is a dataset that is independent of the data used for training and validation. It is used only at the very end of the model development process to provide an unbiased evaluation of the final model's performance on unseen, real-world data. It helps assess how well the model generalizes and is crucial for detecting issues like overfitting.

## Q.10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

--> The data is typically split using the train_test_split() function from the sklearn.model_selection module in the scikit-learn library.

```python
from sklearn.model_selection import train_test_split

# x is the feature matrix and y the target vector, both defined beforehand.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
```

--> This function takes the feature matrix (x) and target vector (y) as input. Key parameters include test_size (the proportion of data for the test set, commonly 0.2 or 0.3) and random_state (ensures the split is the same every time the code runs for reproducibility).
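
For a version that runs on its own, the sketch below builds a tiny synthetic `x` and `y` (assumed values, 10 samples) and checks the resulting shapes. The `stratify=y` argument is an optional extra that keeps class proportions equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features, balanced binary labels.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# 80% of the rows go to training, 20% to testing.
print(x_train.shape, x_test.shape)
```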

--> A structured approach to a machine learning problem typically involves several key steps:

>Define the problem:
>> Understand the goal and the desired outcome (e.g., classification, regression).

>Gather the data:
>> Collect relevant and high-quality data from various sources.

>Explore and visualize data (EDA):
>> Analyze the data to find patterns, check for missing values or outliers, and understand relationships between variables.

>Prepare the data:
>> Clean the data, handle missing values, encode categorical variables, and scale numerical features (data preprocessing).

>Select a model and train it:
>>Choose an appropriate algorithm and fit it to the training data.

>Fine-tune the model:
>>Optimize hyperparameters and use techniques like cross-validation to improve performance.

>Evaluate the model:
>>Assess the final model's performance on the test set using relevant metrics.

>Deploy and monitor:
>>Launch the model in a real-world environment and monitor its performance over time.
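
The middle steps of this workflow (prepare, train, fine-tune, evaluate) can be sketched in a few lines. This is only one possible arrangement, assuming the bundled iris dataset, a `StandardScaler` plus `LogisticRegression` pipeline, and a small illustrative grid for `C`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Gather the data and hold out a test set before any tuning.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Prepare + model: scaling and the classifier live in one pipeline,
# so the scaler is fitted only on the training folds.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Fine-tune: 5-fold cross-validation over a small grid of C values.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate: the test set is used once, at the end.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```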

## Q.11. Why do we have to perform EDA before fitting a model to the data?

--> Exploratory Data Analysis (EDA) is crucial because it helps to:

>Understand the data:
>> Provides insight into the structure, variables, and potential issues within the dataset.

>Identify issues:
>> Helps in detecting missing values, outliers, or errors in the data that could negatively impact model performance.

>Formulate hypotheses:
>> Allows practitioners to uncover patterns and relationships that can guide model selection and feature engineering decisions.

>Verify assumptions:
>> Checks if the data meets the assumptions required by certain statistical procedures or machine learning algorithms.

>Prepare data effectively:
>> Informs the necessary data cleaning and preprocessing steps needed before training the model.
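
A few of these checks can be sketched with pandas. The DataFrame below is entirely hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing age, a missing city, and an extreme income.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29],
    "income": [40_000, 52_000, 61_000, 58_000, 1_000_000, 45_000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", None],
})

# Understand the data: summary statistics for the numeric columns.
summary = df.describe()

# Identify issues: count missing values per column.
missing_per_column = df.isna().sum()

# Identify issues: flag incomes outside 1.5 * IQR of the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print(outliers)
```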

## Q.12. What is correlation?

--> Correlation is a statistical measure that describes the strength and direction of the linear relationship between two variables. It quantifies the extent to which changes in one variable are associated with changes in the other. The relationship is expressed by the Pearson correlation coefficient (r), which ranges from -1 to +1: values near -1 or +1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.

## Q.13. What does negative correlation mean?

--> Negative correlation (or inverse correlation) means that as one variable increases, the other variable generally decreases, and vice versa. The correlation coefficient for a negative correlation is a value below 0, with -1 indicating a perfect negative relationship. An example is the relationship between price and demand: as the price of a commodity increases, its demand tends to decrease.

## Q.14. How can you find correlation between variables in Python?

--> Correlation between variables in Python can be found using functions from libraries like pandas and numpy.

>Pandas:
>>The .corr() method on a DataFrame computes the correlation matrix, showing the correlation coefficient between all pairs of columns.

>Numpy:
>>The numpy.corrcoef() function can be used to calculate the Pearson product-moment correlation coefficient for two variables.

>Visualization:
>>Libraries like matplotlib or seaborn can be used to create scatter plots or heatmaps of the correlation matrix, which visually represent the relationships.
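
A short sketch of the pandas and NumPy options, using made-up price, demand, and advertising figures:

```python
import numpy as np
import pandas as pd

# Hypothetical data: demand falls as price rises.
df = pd.DataFrame({
    "price": [10, 12, 14, 16, 18, 20],
    "demand": [100, 92, 85, 70, 66, 50],
    "ad_spend": [5, 7, 6, 9, 8, 10],
})

# pandas: full correlation matrix (Pearson by default).
corr_matrix = df.corr()

# NumPy: corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r = np.corrcoef(df["price"], df["demand"])[0, 1]

print(corr_matrix.round(2))
print(f"price vs demand: r = {r:.2f}")
```

The resulting `corr_matrix` can be passed to a plotting function such as seaborn's `heatmap` for a visual overview.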

## Q.15. What is causation? Explain difference between correlation and causation with an example.

--> Causation indicates that one event is the direct result of the occurrence of another event; a true cause-and-effect relationship. 

Correlation means that two or more variables are statistically related or associated, but it does not automatically imply that one causes the other. 

Example: Sales of ice cream and sales of sunscreen are highly correlated because both tend to increase during the summer months. However, increased ice cream sales do not cause increased sunscreen sales. The underlying cause for both is a third variable: hotter weather/season. 
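
The ice cream and sunscreen example can be simulated. In this sketch all numbers are invented: temperature drives both sales figures, and neither sales figure influences the other. The raw correlation is high, but it largely disappears once the linear effect of temperature is removed from both series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily temperatures; both sales series depend only on temperature.
temperature = rng.uniform(10, 40, size=500)
ice_cream = 20 * temperature + rng.normal(0, 50, size=500)
sunscreen = 8 * temperature + rng.normal(0, 30, size=500)

# Raw correlation between the two sales series.
r_raw = np.corrcoef(ice_cream, sunscreen)[0, 1]

def residuals(values, driver):
    """Remove the best-fit linear effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)

# Correlation after controlling for temperature (a partial correlation).
r_partial = np.corrcoef(
    residuals(ice_cream, temperature), residuals(sunscreen, temperature)
)[0, 1]

print(f"raw r = {r_raw:.2f}, after controlling for temperature r = {r_partial:.2f}")
```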

## Q.16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

--> An optimizer is an algorithm or method used to minimize the error (loss function) of a machine learning model by iteratively adjusting the model's learnable parameters (weights and biases). 

Types of optimizers include:

>Gradient Descent (Batch Gradient Descent):
>>Calculates the gradient of the loss function using the entire training dataset and makes one parameter update per pass over the data. Each update is accurate but slow and computationally expensive for large datasets.

>Stochastic Gradient Descent (SGD):
>>Updates the model parameters after every individual training example. Each update is cheap and needs little memory, so progress is faster on large datasets, though the updates are noisy.

>Mini-Batch Gradient Descent:
>>Splits the training data into small batches and performs an update for each batch, balancing the stable gradient estimates of Batch Gradient Descent with the frequent, cheap updates of SGD.

>Adam (Adaptive Moment Estimation):
>>One of the most popular optimizers, it computes adaptive learning rates for each parameter by storing decaying averages of past gradients and squared gradients. 
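
The three gradient descent variants differ only in how many samples feed each update, which a single function with a `batch_size` argument can show. This is a minimal NumPy sketch for linear regression with mean squared error; the function name `fit_linear`, the learning rate, and the synthetic data (true slope 4, intercept 1) are assumptions for illustration. Adam is not implemented here; it would replace the plain `lr * gradient` step with per-parameter adaptive steps.

```python
import numpy as np

def fit_linear(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ~ X @ w + b by gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w + b - y[idx]
            # Gradients of the mean squared error on this batch.
            grad_w = 2.0 * X[idx].T @ error / len(idx)
            grad_b = 2.0 * error.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data: y = 4x + 1 plus a little noise.
data_rng = np.random.default_rng(42)
X = data_rng.uniform(-1, 1, size=(200, 1))
y = 4.0 * X[:, 0] + 1.0 + data_rng.normal(0, 0.1, size=200)

w_batch, b_batch = fit_linear(X, y, batch_size=len(X))  # batch gradient descent
w_sgd, b_sgd = fit_linear(X, y, batch_size=1)           # stochastic gradient descent
w_mini, b_mini = fit_linear(X, y, batch_size=32)        # mini-batch gradient descent

for name, w, b in [("batch", w_batch, b_batch), ("sgd", w_sgd, b_sgd), ("mini", w_mini, b_mini)]:
    print(f"{name}: w={w[0]:.2f}, b={b:.2f}")
```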

## Q.17. What is sklearn.linear_model ?

--> sklearn.linear_model is a module within the scikit-learn library that contains estimator classes for linear models. Linear models assume that the target variable can be predicted from a linear function of the input features. The module covers both regression (e.g., LinearRegression, Ridge, Lasso) and classification (e.g., LogisticRegression) tasks.

## Q.18. What does model.fit() do? What arguments must be given?

--> The model.fit() method is used to train a machine learning model in scikit-learn. It adjusts the internal parameters of the model based on the provided data to learn underlying patterns. 

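For supervised estimators, the two required arguments are:

X: The feature matrix (input data), where each row represents a sample and each column represents a feature.

y: The target vector (labels or target values) corresponding to the samples in X.

Unsupervised estimators and transformers, such as clustering models or scalers, need only X.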

## Q.19. What does model.predict() do? What arguments must be given?

--> The model.predict() method is used to make predictions on new, unseen data using the patterns learned during the fit() process. 

The primary argument that must be given is:

X_test: The feature matrix of the new input data for which you want predictions. This data should be in the same format and have the same features as the data used for training. 
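
Putting `fit()` and `predict()` together, here is a small sketch with `LogisticRegression` from `sklearn.linear_model`. The "hours studied vs. pass/fail" data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> fail (0) or pass (1).
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)  # learn the parameters from X and y

# New samples must have the same number of features as the training data.
X_new = np.array([[1.5], [8.5]])
predictions = model.predict(X_new)          # predicted class labels
probabilities = model.predict_proba(X_new)  # one probability per class

print(predictions)
print(probabilities.round(3))
```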

## Q.20. What are continuous and categorical variables?

--> Continuous variables are numerical data that can take any value within a given range, with infinitely many possible intermediate values. Examples include height, weight, temperature, and time, all of which are measured rather than counted.

--> Categorical variables represent distinct groups or categories and are descriptive rather than numerical. Examples include hair color, dog breed, or education level.

## Q.21. What is feature scaling? How does it help in Machine Learning?

--> Feature scaling is a data preprocessing technique used to standardize the range of independent features or variables in a dataset. 

It helps in machine learning by:

>Ensuring fairness:
>>Prevents features with larger magnitudes from dominating the learning process or objective function.

>Improving performance:
>>Many algorithms, such as models trained with gradient descent, k-nearest neighbors (KNN), and support vector machines (SVM), perform much better or converge faster when features are on a similar scale. Tree-based models are largely unaffected by feature scale.

>Optimizing algorithms:
>>Helps optimization algorithms work more efficiently by avoiding issues caused by different scales of features.

## Q.22. How do we perform scaling in Python?

--> Scaling in Python is commonly performed using classes from the sklearn.preprocessing module.

>Standardization (StandardScaler):
>>Transforms data to have a mean of zero and a unit variance.

>Normalization to a range (MinMaxScaler):
>>Rescales features to a fixed range, typically [0, 1].

These classes use fit() on the training data to learn the scaling parameters (mean, standard deviation, min/max values) and then transform() to apply the same scaling to both training and test data. Fitting on the training data only keeps information from the test set out of the scaling parameters. The fit_transform() method can be used as a shortcut on the training data.
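
A minimal sketch of both scalers on made-up arrays, following the fit-on-train, transform-on-both pattern:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

# Standardization: learn mean and standard deviation from the training data.
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)  # reuse the training statistics

# Min-max scaling: learn min and max from the training data.
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_mm)
```

Because the test set is transformed with the training minimum and maximum, test values can fall outside [0, 1], as the second test row does here.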

## Q.23. What is sklearn.preprocessing?

--> The sklearn.preprocessing module in the scikit-learn Python library provides utility functions and transformer classes that change raw feature vectors into a representation more suitable for downstream machine learning estimators. It includes tools for standardization, scaling features to a range, normalization, binarization, and encoding categorical features. Imputation of missing values is handled by the separate sklearn.impute module.

## Q.24. How do we split data for model fitting (training and testing) in Python?

--> The data is typically split using the train_test_split() function from the sklearn.model_selection module in the scikit-learn library.

```python
from sklearn.model_selection import train_test_split

# x is the feature matrix and y the target vector, both defined beforehand.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
```

--> This function takes the feature matrix (x) and target vector (y) as input. Key parameters include test_size (the proportion of data for the test set, commonly 0.2 or 0.3) and random_state (ensures the split is the same every time the code runs for reproducibility).

## Q.25. Explain data encoding?

--> Data encoding is the process of converting data from one form to another. In machine learning, it refers specifically to converting categorical data (labels or categories) into a numerical format that algorithms can process. Common methods include One-Hot Encoding and Label Encoding; the right choice depends on whether the categorical data is nominal (unordered) or ordinal (ordered).