# Feature Engineering

**Q1. What is a parameter?**

-->In machine learning, a parameter refers to an internal variable of a model that is learned from the training data. These parameters are crucial for a model's ability to make predictions and are adjusted during the training process to minimize the model's loss function. Unlike hyperparameters, parameters are not set by the user but are learned by the algorithm from the data.
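As a rough illustration of the distinction (assuming scikit-learn and NumPy are installed; the toy data is made up for this sketch), the code below fits a linear regression and reads back what it learned. `fit_intercept` is a hyperparameter set by the user, while `coef_` and `intercept_` are parameters learned from the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (hypothetical, for illustration only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 2 * X.ravel() + 1

# fit_intercept is a hyperparameter: we choose it before training
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ and intercept_ are parameters: the algorithm learned them from the data
slope = model.coef_[0]
intercept = model.intercept_
print(f"learned slope: {slope:.2f}, learned intercept: {intercept:.2f}")
```

Because the data lies exactly on a line, the learned slope and intercept should recover the values used to generate it.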




**Q2. What is correlation? What does negative correlation mean?**

--> Correlation expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It's a common tool for describing simple relationships without making a statement about cause and effect.

Negative correlation, also known as inverse correlation, describes a relationship between two variables where an increase in one variable corresponds to a decrease in the other.
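A small sketch of a negative correlation, assuming NumPy is available. The temperature and heating cost figures are invented for illustration; `np.corrcoef` computes the Pearson correlation coefficient.

```python
import numpy as np

# Hypothetical data: as outdoor temperature rises, heating cost falls
temperature = np.array([0, 5, 10, 15, 20, 25])
heating_cost = np.array([120, 100, 85, 60, 45, 20])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(temperature, heating_cost)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r close to -1 indicates a strong negative linear relationship, and a value close to 0 indicates little or no linear relationship.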

**Q3. Define machine learning. What are the main components of machine learning?**

-->Machine learning (ML) is a branch of artificial intelligence (AI) focused on enabling computers and machines to imitate the way that humans learn, to perform tasks autonomously, and to improve their performance and accuracy through experience and exposure to more data.

Here are the key machine learning lifecycle components:

* Representation: This refers to the way knowledge is represented for ML purposes. Some examples include decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, model ensembles, and various others.

* Abstraction: Abstraction simplifies the representation of a problem, allowing for more efficient problem-solving with reduced memory and computation requirements. Examples of data abstraction are decreasing the spatial and temporal resolution or dividing continuous variables into meaningful ranges that align with specific goals.

* Evaluation: Every ML project needs a method for evaluating hypotheses. Some examples are accuracy, precision and recall, squared error, KL divergence (relative entropy), and others.

* Generalization: Generalization is crucial for a model to effectively handle new, unfamiliar data that comes from the same distribution as the data used to train the model. It allows teams to gain a deeper understanding of overfitting and assess the quality of a model.

* Data Storage: This one might easily get forgotten among the components of machine learning, but where your data resides is very important. Common storage solutions for machine learning include object storage, distributed file systems, and cloud-based storage.

**Q4. How does loss value help in determining whether the model is good or not?**

-->A model's loss value, or error, provides a crucial indicator of its performance. A lower loss value generally signifies a model that is making more accurate predictions and is, therefore, considered better. The loss function quantifies the discrepancy between the model's predictions and the actual values, acting as a guide for adjusting the model's parameters to minimize this difference.
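To make this concrete, here is a minimal sketch that computes mean squared error (one common loss function) for two hypothetical sets of predictions. The helper name and the numbers are assumptions made for this example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0, 9.0]
preds_model_a = [2.5, 5.5, 7.0, 9.5]   # close to the targets
preds_model_b = [1.0, 8.0, 4.0, 12.0]  # far from the targets

loss_a = mean_squared_error(y_true, preds_model_a)
loss_b = mean_squared_error(y_true, preds_model_b)
print(f"model A loss: {loss_a}, model B loss: {loss_b}")
```

Model A has the lower loss, so on this data it is the better of the two. In practice the comparison is most meaningful when the loss is measured on data the model was not trained on.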

**Q5. What are continuous and categorical variables?**

-->Continuous variables are typically numerical and can have values that can be measured with a high degree of precision (e.g., height, weight, temperature).


Categorical variables, on the other hand, are not numerical and represent different categories or groups (e.g., gender, eye color, race, city of residence).

**Q6. How do we handle categorical variables in machine learning? What are common techniques?**

-->Categorical variables in machine learning are handled through encoding techniques, converting them into numerical representations that models can process. Common methods include one-hot encoding for nominal variables, ordinal encoding for variables with inherent order, and target/frequency encoding for certain scenarios.
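A minimal sketch of two of these techniques using pandas. The column names and the `size_order` mapping are hypothetical, chosen only to show a nominal and an ordinal variable side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],      # nominal: no natural order
    "size": ["small", "large", "medium", "small"],  # ordinal: small < medium < large
})

# One-hot encode the nominal column: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal-encode the ordered column with an explicit mapping
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

encoded = pd.concat([df[["size_encoded"]], one_hot], axis=1)
print(encoded)
```

Writing the ordinal mapping by hand keeps the intended order explicit rather than relying on alphabetical sorting.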

**Q7. What do you mean by training and testing a dataset?**

-->In machine learning, training means using one portion of the dataset to fit a model so that it learns patterns and can make predictions, while testing means using a separate portion to evaluate the model's performance on unseen data. Essentially, the training data is the "schoolbook" the model uses to learn, and the testing data is the "exam" to see how well it understands the material.

**Q8. What is sklearn.preprocessing?**

-->The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, learning algorithms benefit from standardization of the data set.

**Q9. What is a test set?**

-->A test set is a separate portion of the dataset that is held back from the training process. It's used to evaluate the model's performance on unseen data after it has been trained, providing an unbiased measure of how well it generalizes. This helps determine how the model would perform in a real-world scenario, where it encounters data it hasn't seen before.

**Q10. How do we split data for model fitting (training and testing) in Python?
How do you approach a machine learning problem?**

-->To split data for model fitting in Python, use the train_test_split function from scikit-learn's sklearn.model_selection module. This function randomly divides the data into training and testing sets, typically in an 80-20 or 70-30 split. The training set is used to train the model, and the testing set is used to evaluate its performance.

A common machine learning approach involves several steps: data collection, data preparation, model selection, model training, model evaluation, parameter tuning, and making predictions.
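The steps above can be sketched end to end with scikit-learn. This is only one possible arrangement: the bundled iris dataset, the choice of `LogisticRegression`, and the `max_iter` value are assumptions made to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Data collection: a small dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# 2. Data preparation: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3-4. Model selection and training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 5. Evaluation on data the model has not seen
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.2f}")
```

Parameter tuning would typically follow, for example by comparing several settings on a validation split before the final check on the test set.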



**Q11. Why do we have to perform EDA before fitting a model to the data?**

-->Exploratory Data Analysis (EDA) is crucial before fitting a model because it helps reveal data characteristics, potential issues, and relationships, which are essential for building accurate and reliable models. EDA enables data cleaning, preprocessing, and feature engineering, ultimately leading to better model performance.

**Q12. What is correlation?**

-->Correlation is a statistical measure of the relationship between two variables. It is most informative when the variables are linearly related. The relationship can be visualized in a scatterplot, which lets us assess at a glance whether, and how strongly, the variables are correlated.

**Q13. What does negative correlation mean?**

-->Negative correlation, also known as inverse correlation, describes a relationship between two variables where an increase in one variable corresponds to a decrease in the other.

**Q14. How can you find correlation between variables in Python?**

-->In Python, the pandas library provides a built-in DataFrame method called corr() that returns a correlation matrix. A correlation matrix is a table that shows the correlation coefficient between each pair of variables in the DataFrame.

```python
# Example

import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate correlation matrix
print(df.corr())
```

```
                 Temperature  Ice_Cream_Sales
Temperature         1.000000         0.923401
Ice_Cream_Sales     0.923401         1.000000
```


**Q15. What is causation? Explain the difference between correlation and causation with an example.**

-->Causation signifies a direct relationship where one event or factor actively generates or brings about another. It's the fundamental link between a cause, the initiating action or circumstance, and its resulting effect. This connection implies a genuine and non-accidental influence, where the cause demonstrably leads to the outcome. Furthermore, the cause typically precedes the effect in time.

A crucial distinction lies between causation and correlation; while correlation indicates a tendency for two things to occur together, causation asserts a direct influence. For instance, while ice cream sales and crime rates might both rise in summer (correlation), it's the warmer weather that likely influences both independently (potential underlying cause), rather than one directly causing the other. Establishing causation necessitates rigorous investigation, often involving controlled experiments and the careful elimination of alternative explanations, making it a cornerstone of understanding in various disciplines.
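The ice cream and crime example can be simulated to show how a shared cause produces correlation without causation. Everything in this sketch is hypothetical: the coefficients, noise levels, and the `residuals` helper are assumptions chosen for illustration, and removing a linear trend is just one simple way to control for a third variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulation: temperature drives both quantities independently
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 10, n)
crime_rate = 2 * temperature + rng.normal(0, 5, n)

# The two outcomes are strongly correlated even though neither causes the other
raw_corr = np.corrcoef(ice_cream_sales, crime_rate)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y using a least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# After removing the effect of temperature, the correlation largely disappears
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(crime_rate, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, after controlling for temperature: {adjusted_corr:.2f}")
```

The raw correlation is high, but once temperature is accounted for the remaining correlation is close to zero, which is what we would expect when a third variable drives both.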



**Q16. What is an optimizer? What are the different types of optimizers? Explain each with an example.**

-->In deep learning, an optimizer is an algorithm that adjusts a neural network's weights to minimize the loss function, thereby improving the model's performance. It iteratively modifies the model's parameters (like weights and biases) to reduce errors and enhance the model's ability to make accurate predictions.

1. Stochastic Gradient Descent (SGD):

   Explanation: SGD updates model parameters based on the gradient of the loss function calculated on a random subset of the training data (a mini-batch).

   Example: Imagine training a simple linear regression model. SGD would randomly select a few data points, calculate the gradient of the loss function (mean squared error) with respect to the model's parameters, and then update the parameters based on this gradient. This process is repeated for multiple mini-batches.

2. Adam (Adaptive Moment Estimation):

   Explanation: Adam combines the benefits of momentum and RMSprop by using moving averages of the gradients and of the squared gradients (first and second moments) to adaptively adjust the learning rate for each parameter.

   Example: In the same linear regression example, Adam would not only use the current gradient like SGD but also consider the history of gradients to smooth out the updates and potentially converge faster.

3. RMSprop (Root Mean Square Propagation):

   Explanation: RMSprop also uses adaptive learning rates. It keeps an exponentially decaying average of the squared gradients and divides each parameter's update by the square root of that average.

   Example: In the linear regression example, RMSprop would keep track of the squared gradients and adjust the learning rate for each parameter based on the recent history of these squared gradients.

4. Adagrad (Adaptive Gradient Algorithm):

   Explanation: Adagrad adapts the learning rate for each parameter based on the accumulated history of its squared gradients.

   Example: In linear regression, Adagrad would give larger effective learning rates to parameters whose accumulated gradients are small and smaller effective learning rates to parameters whose accumulated gradients are large. Because the accumulated sum only grows, the effective learning rate shrinks as training proceeds.
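The four update rules above can be sketched in NumPy on the same linear regression example. This is a minimal illustration rather than a reference implementation: the `train` helper, the toy data, and the learning rates are assumptions chosen for the demo, and `beta1`, `beta2`, `rho`, and `eps` are set to commonly quoted default values.

```python
import numpy as np

def train(optimizer, lr, steps=2000, batch_size=8, seed=0):
    """Fit y = w*x + b by mini-batch gradient steps with the chosen update rule."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, 200)
    y = 3.0 * x + 0.5 + rng.normal(0, 0.05, 200)  # toy data, true w=3, b=0.5

    theta = np.zeros(2)  # parameters [w, b]
    m = np.zeros(2)      # moving average of gradients (Adam)
    v = np.zeros(2)      # accumulated or averaged squared gradients
    beta1, beta2, rho, eps = 0.9, 0.999, 0.9, 1e-8

    for t in range(1, steps + 1):
        idx = rng.integers(0, len(x), batch_size)  # random mini-batch
        err = theta[0] * x[idx] + theta[1] - y[idx]
        # Gradient of the mean squared error with respect to [w, b]
        grad = np.array([np.mean(2 * err * x[idx]), np.mean(2 * err)])

        if optimizer == "sgd":
            theta -= lr * grad
        elif optimizer == "adagrad":
            v += grad ** 2                       # sum over the whole history
            theta -= lr * grad / (np.sqrt(v) + eps)
        elif optimizer == "rmsprop":
            v = rho * v + (1 - rho) * grad ** 2  # exponentially decaying average
            theta -= lr * grad / (np.sqrt(v) + eps)
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2
            m_hat = m / (1 - beta1 ** t)         # bias correction
            v_hat = v / (1 - beta2 ** t)
            theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return theta

settings = [("sgd", 0.1), ("adagrad", 0.5), ("rmsprop", 0.01), ("adam", 0.01)]
results = {name: train(name, lr) for name, lr in settings}
for name, (w, b) in results.items():
    print(f"{name:8s} w={w:.2f} b={b:.2f}")
```

All four should end up near the true values w=3 and b=0.5 on this easy problem. The differences between optimizers matter more on harder, higher-dimensional loss surfaces.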

**Q17. What is sklearn.linear_model?**

-->sklearn.linear_model is a module of the scikit-learn library that contains classes for fitting linear models, such as LinearRegression, Ridge, Lasso, and LogisticRegression. The term linear model implies that the model is specified as a linear combination of features.

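**Q18. What does model.fit() do? What arguments must be given?**

-->model.fit() trains a machine learning model on the data passed to it. During training, the model adjusts its internal parameters (weights and biases) to minimize the loss function, for example with an optimization technique like gradient descent.

The required arguments are the training features (commonly named X) and, for supervised learning, the corresponding target values (commonly named y). In Keras, fit() also accepts optional arguments such as epochs (the number of iterations over the entire dataset) and batch_size.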

**Q19. What does model.predict() do? What arguments must be given?**

-->model.predict() is used to generate predictions from the trained model based on new input data. It does not require true labels and does not compute any metrics.

Arguments for model.predict():

The primary and most crucial argument for model.predict() is the input data on which you want to make predictions. The format of this input data typically depends on the library and the model's input requirements, but common formats include:

1. NumPy arrays: This is a very common format, especially in scikit-learn and for simpler models in TensorFlow/Keras. The array should have a shape where each row represents a single data sample, and each column represents a feature.

2. TensorFlow Datasets or Tensors: In TensorFlow/Keras, you can also pass tf.data.Dataset objects or tf.Tensor objects directly. This is often more efficient for large datasets and when leveraging TensorFlow's data pipeline capabilities.

3. Pandas DataFrames: Some libraries or specific model implementations might also accept Pandas DataFrames as input.
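A short sketch of the fit and predict pattern with scikit-learn, here using a pandas DataFrame as input. The data (hours studied versus passing an exam) and the column name are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> passed the exam (1) or not (0)
X_train = pd.DataFrame({"hours_studied": [1, 2, 3, 4, 6, 7, 8, 9]})
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)  # fit needs both features and labels

# predict needs only features, with the same columns used in training
X_new = pd.DataFrame({"hours_studied": [0, 10]})
predictions = model.predict(X_new)
print(predictions)
```

The returned array has one prediction per input row, in the same order as the rows of `X_new`.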

**Q20. What are continuous and categorical variables?**

-->Continuous variables are typically numerical and can have values that can be measured with a high degree of precision (e.g., height, weight, temperature).


Categorical variables, on the other hand, are not numerical and represent different categories or groups (e.g., gender, eye color, race, city of residence).

**Q21. What is feature scaling? How does it help in machine learning?**

-->Feature scaling is a crucial data preprocessing technique in machine learning that standardizes or normalizes the range of independent variables (features) in your dataset to a similar scale. This process aims to ensure that no single feature unduly influences the learning algorithm simply because its values are much larger than others.

Feature scaling offers several benefits that can significantly improve the performance and training of machine learning models:

1. Improved Algorithm Performance: Many machine learning algorithms are sensitive to the magnitude of input features. Algorithms that rely on distance calculations (like K-Nearest Neighbors, K-Means, Support Vector Machines) can be heavily influenced by features with larger values. Without scaling, these algorithms might incorrectly weigh features with larger ranges as more important.

Example: Consider a dataset with 'age' (ranging from 0 to 100) and 'income' (ranging from $20,000 to $200,000). Without scaling, the income feature would dominate distance calculations due to its larger range, potentially leading to a model that primarily considers income and neglects the importance of age.

2. Faster Convergence of Gradient Descent: Gradient descent is a common optimization algorithm used to train many machine learning models (like linear regression, logistic regression, and neural networks). When features have significantly different scales, the cost function's contours can become elongated, leading to oscillations during gradient descent and a slower convergence to the optimal solution. Feature scaling helps to make the contours more spherical, allowing for larger and more efficient steps towards the minimum.

3. Prevention of Numerical Instability: In some calculations, large differences in feature scales can lead to numerical instability issues, such as overflow or underflow. Scaling helps to mitigate these problems by keeping the values within a manageable range.

4. Equal Contribution of Features: Feature scaling ensures that all features contribute more equally to the model's learning process. This prevents features with larger magnitudes from dominating the learning and potentially biasing the model.

5. Improved Model Interpretability: When features are on a similar scale, the coefficients of linear models become directly comparable, so comparing the impact of different features is more meaningful. Tree-based models split on thresholds and are largely insensitive to feature scale, so this benefit applies mainly to coefficient-based models.
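The age and income example from the first benefit can be checked numerically. In this sketch the three people and the `euclidean` and `min_max_scale` helpers are hypothetical; the feature ranges are the ones quoted in that example (age 0 to 100, income $20,000 to $200,000).

```python
import numpy as np

# Hypothetical people described by (age in years, income in dollars)
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 51_000.0])  # very different age, similar income
c = np.array([26.0, 60_000.0])  # similar age, different income

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(p - q))

# Feature ranges used for min-max scaling
mins = np.array([0.0, 20_000.0])
maxs = np.array([100.0, 200_000.0])

def min_max_scale(p):
    """Map each feature to the 0-1 range."""
    return (p - mins) / (maxs - mins)

# Unscaled: the income axis dominates, so b looks closer to a than c does
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Scaled: age and income contribute on comparable terms
scaled_ab = euclidean(min_max_scale(a), min_max_scale(b))
scaled_ac = euclidean(min_max_scale(a), min_max_scale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

Before scaling, the 40-year age gap between a and b barely registers next to the income difference. After scaling, the nearest neighbour of a flips from b to c.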

**Q22. How do we perform scaling in Python?**

-->To scale data in Python, you'll primarily use the sklearn.preprocessing module. You choose a scaler (like StandardScaler for mean 0, std 1, or MinMaxScaler for a 0-1 range). You create an instance of the scaler, fit it to your training data to learn the scaling parameters, and then use the same fitted scaler to transform your training, validation, and test sets. This ensures consistent scaling and prevents data leakage.
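A minimal sketch of that workflow, assuming scikit-learn and NumPy are installed. The feature values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: [age, income]
X_train = np.array(
    [[25, 40_000], [35, 60_000], [45, 80_000], [55, 120_000]], dtype=float
)
X_test = np.array([[30, 50_000], [60, 150_000]], dtype=float)

# Fit on the training data only, then reuse the fitted scaler everywhere
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # uses the training mean and std

# MinMaxScaler follows the same fit/transform pattern
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

Note that the transformed test set will generally not have mean 0 or fall exactly within 0 to 1, because the scaling parameters come from the training data.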

**Q23. What is sklearn.preprocessing?**

-->The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, learning algorithms benefit from standardization of the data set.

**Q24. How do we split data for model fitting (training and testing) in Python?**

-->In Python, we split data for model training and evaluation using the train_test_split function from the sklearn.model_selection module. You feed it your features (X) and the corresponding target variable (y). A crucial parameter is test_size, which dictates the proportion of your data allocated to the test set (common values are around 0.2 to 0.3). For consistent results across runs, setting the random_state to a specific integer is highly recommended. The function returns four key datasets: the training features (X_train), the testing features (X_test), the training target variable (y_train), and the testing target variable (y_test). For datasets with uneven class representation in the target variable, using the stratify=y argument ensures that both the training and testing sets maintain similar class proportions, leading to a more reliable evaluation. This split allows you to train your model on one part of the data and then assess its generalization ability on the unseen test set.
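The call described above might look like the following sketch. The imbalanced toy dataset is made up to show the effect of `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 samples of class 0, 20 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the rows go to the test set
    random_state=42,  # fixed seed for reproducible splits
    stratify=y,       # keep the 80/20 class ratio in both parts
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

With stratification, both parts keep the 4-to-1 ratio of class 0 to class 1 found in the full dataset.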

**Q25. Explain data encoding.**

-->Data encoding converts categorical (non-numerical) data into numerical formats that machine learning models can understand. Common techniques include:

* Label Encoding: Assigns a unique number to each category (e.g., Red=0, Blue=1). Good for target labels and binary categories; for ordered features, the assigned numbers should follow the category order.

* One-Hot Encoding: Creates binary columns for each category (e.g., "Red" becomes [1, 0, 0]). Best for nominal data.

* Ordinal Encoding: Explicitly maps ordered categories to numbers (e.g., Low=1, Medium=2, High=3).

* Binary Encoding, Hashing Encoding, Target Encoding: More advanced methods for specific situations (high cardinality, tree-based models).
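The first three techniques can be sketched with scikit-learn's encoders. The category values here are hypothetical. Note that scikit-learn orders categories alphabetically by default, so the position of the 1 in a one-hot vector may differ from the illustrative `[1, 0, 0]` above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["blue"], ["green"], ["blue"]])
levels = np.array([["low"], ["high"], ["medium"], ["low"]])

# Label encoding: integers assigned in sorted (alphabetical) order
label_codes = LabelEncoder().fit_transform(colors.ravel())

# One-hot encoding: one binary column per category (sparse output by default)
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Ordinal encoding: pass the category order explicitly
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
level_codes = ordinal.fit_transform(levels).ravel()

print(label_codes)
print(one_hot)
print(level_codes)
```

Passing `categories` to `OrdinalEncoder` is what preserves the Low < Medium < High order. Without it, the encoder would sort the names alphabetically.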