# Feature Engineering

##1. What is a parameter?

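-> In programming, a parameter is a variable listed inside the parentheses of a function definition. It acts as a placeholder for the values (called arguments) passed into the function when it is called. In machine learning, the term usually refers to a model parameter: an internal value, such as a weight or bias, that the model learns from the training data.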

##2. What is correlation? What does negative correlation mean?

-> Correlation is a statistical measure that shows how two variables move in relation to each other.

A negative correlation, also known as an inverse correlation, describes a relationship in which an increase in one variable is associated with a decrease in the other, and vice versa. In simpler terms, when one goes up, the other tends to go down. Example: as time spent exercising increases, body fat percentage tends to decrease.

##3. Define Machine Learning. What are the main components in Machine Learning?

-> Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and make predictions or decisions without being explicitly programmed.

Main Components of Machine Learning:
1. Data

  The foundation of ML. Algorithms learn patterns from structured or unstructured data.

2. Model

  A mathematical representation created by the algorithm that captures the patterns in the data.

3. Algorithm

  The method or set of rules the model follows to learn from data (e.g., Linear Regression, Decision Trees, etc.).

4. Training

  The process of feeding data to the model so it can learn patterns.

5. Testing

  After training, the model is tested on new/unseen data to check how well it performs.

6. Features

  The input variables used by the model to make predictions (e.g., age, income, height).

7. Labels (Target)

  The output the model is trying to predict (e.g., price, yes/no, category).
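
As a rough illustration, the components above can be mapped onto a few lines of scikit-learn. The iris dataset and the decision tree are choices made for this sketch only; any dataset and algorithm would fill the same roles.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data: features (X) and labels (y); iris is only a stand-in dataset
X, y = load_iris(return_X_y=True)

# Hold back unseen rows for the testing step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Algorithm: a decision tree learner; training turns it into a fitted model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Testing: compare predictions on unseen rows with the true labels
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```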

##4. How does loss value help in determining whether the model is good or not?

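-> The loss value is a key indicator of model performance. The loss function quantifies the difference between the model's predictions and the actual values, and the goal during training is to minimize it. A lower loss generally means the predictions are closer to the targets, while a higher loss means the model is making larger errors. To judge whether a model is actually good, compare the loss on the training data with the loss on held-out (validation or test) data: a low training loss combined with a much higher validation loss is a sign of overfitting.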
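
To make the idea concrete, here is a minimal sketch that computes one common loss, mean squared error (MSE), for two sets of predictions. The numbers are made up for illustration, and MSE is just one possible loss function.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
close_pred = np.array([2.8, 5.1, 7.2, 8.9])   # predictions near the true values
far_pred = np.array([5.0, 2.0, 9.0, 6.0])     # predictions far from the true values

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return float(np.mean((y_true - y_pred) ** 2))

loss_close = mse(y_true, close_pred)
loss_far = mse(y_true, far_pred)
print(loss_close, loss_far)
```

The predictions that sit closer to the true values produce the smaller loss, which is the signal training tries to drive down.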

##5. What are continuous and categorical variables?

-> Continuous variables represent measurable quantities that can take on any value within a range, while categorical variables represent data that can be divided into distinct groups or categories.

Continuous variables are numerical and can be measured with a high degree of precision (e.g., height, weight).

Categorical variables, on the other hand, are typically not numerical and represent different categories or groups (e.g., gender, eye color, race, city of residence).

##6. How do we handle categorical variables in Machine Learning? What are the common techniques?

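-> Categorical variables need to be converted into numerical representations because most machine learning algorithms only accept numerical inputs. This is done through encoding techniques, each suited to different types of categorical data and model requirements. Common techniques include one-hot encoding (one 0/1 column per category, for nominal data), ordinal encoding (integers that follow the natural order of the categories), and label encoding (integer codes, typically for the target variable).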
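
A minimal sketch of two such encodings follows. The column names, the values, and the ordering of education levels are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data with one nominal and one ordinal column
df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "education": ["Bachelor", "High School", "PhD", "Master"],
})

# One-hot encoding: one 0/1 column per category, suited to nominal data
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: integers that follow an explicit, meaningful order
levels = ["High School", "Bachelor", "Master", "PhD"]
encoder = OrdinalEncoder(categories=[levels])
education_codes = encoder.fit_transform(df[["education"]]).ravel()

print(one_hot)
print(education_codes)
```

One-hot encoding avoids implying an order between colors, while the ordinal codes deliberately keep the order of the education levels.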

##7. What do you mean by training and testing a dataset?

-> A dataset is a collection of data used to train and evaluate a machine learning model. It's usually divided into two main parts: training set and testing set.

1. Training Dataset

  - Used to teach the model.

  - The model learns patterns from this data.

  - It sees both input features and the correct answers (labels).

2. Testing Dataset

  - Used to evaluate how well the model has learned.

  - This data is not shown to the model during training.

  - Helps test the model’s ability to generalize to unseen data.  

##8. What is sklearn.preprocessing?


-> The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, learning algorithms benefit from standardization of the data set.

##9. What is a Test set?


-> A test set is a portion of the dataset that is used to evaluate the performance of a trained machine learning model. It contains unseen data that the model did not learn from during training.

##10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

-> In Python, you typically split data for training and testing using the train_test_split function from the scikit-learn library. This function randomly divides your data into two subsets: one for training the model and the other for evaluating its performance. The typical approach to a Machine Learning problem involves data collection, preparation, model selection, training, evaluation, and potentially parameter tuning before making predictions.

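```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X is your feature matrix and y is your target variable.
# The iris dataset is used here only as stand-in data.
X, y = load_iris(return_X_y=True)

# 80% training, 20% testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```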

How to Approach a Machine Learning Problem:

1. Understand the Problem:

- What is the goal? Is it a regression (predicting continuous values) or classification (predicting categories)?

- Identify the input variables (features) and the target variable (what you want to predict).

2. Collect and Clean Data:

- Collect data from reliable sources.

- Clean the data: Handle missing values, duplicates, and outliers.

- Transform features if necessary (e.g., encoding categorical variables, scaling numerical data).

3. Explore Data (EDA):

- Use visualization and statistical summaries to understand relationships between features and the target.

- Look for patterns, correlations, or potential issues in the data.

4. Split the Data:

- Split into training and test sets to ensure the model is evaluated properly on unseen data.

5. Choose a Model:

- Select a machine learning model based on the problem (e.g., linear regression, decision trees, random forests, etc.).

- Consider trying different models and comparing their performance.

6. Train the Model:

- Use the training data to teach the model by fitting it.

- Fine-tune model hyperparameters using cross-validation if needed.

7. Evaluate the Model:

- Use the test set to check the model's performance (accuracy, precision, recall, RMSE, etc.).

- Look at evaluation metrics to determine how well the model generalizes to new data.

8. Improve the Model:

- If performance is poor, try different models, use more features, tune hyperparameters, or gather more data.

9. Deploy and Monitor:

- Once satisfied with the model, deploy it for real-world use.

- Continuously monitor the model's performance, as data might change over time (model drift).
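
The steps above can be sketched end to end as follows. This is one possible arrangement, not a prescribed recipe: the breast cancer dataset, the logistic regression model, and the 5-fold setting are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a ready-made classification dataset stands in for collected data
X, y = load_breast_cancer(return_X_y=True)

# Step 4: keep 20% of the rows aside as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 5: scaling + model in one pipeline, so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 6: 5-fold cross-validation on the training set, then a final fit
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)

# Step 7: evaluate once on the held-out test set
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the model in a pipeline is a design choice that keeps preprocessing inside each cross-validation fold, which avoids leaking information from validation rows into training.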



##11. Why do we have to perform EDA before fitting a model to the data?

-> Exploratory Data Analysis (EDA) is done before fitting a model for several reasons:

1. Understand the Data

  - EDA helps you get a sense of the dataset: its structure, features, and relationships between variables.

  - Without EDA, you may not know if there are outliers, missing values, or other data quality issues that could negatively affect your model.


2. Feature Relationships

- EDA helps identify relationships between input features and the target variable.

- You can visualize correlations to see which features are likely relevant for predicting the target and which may be irrelevant.

3. Detect Outliers

- Outliers can skew the model's predictions, so detecting them early is crucial.

- EDA lets you identify if there are any extreme values in your data that could impact the model.

4. Identify Data Imbalance
- Class imbalance (for classification problems) can occur if one class is significantly more frequent than the other.

- EDA helps you identify this imbalance so you can apply techniques (like oversampling or undersampling) to balance the dataset.

5. Feature Engineering
- EDA helps you create new features or decide which existing features to drop.

- For example, if you find that the age feature is skewed, you might decide to transform it into a new feature like age group (e.g., 0-18, 19-30, etc.).

6. Choose the Right Model
- Based on the insights from EDA, you can choose an appropriate model.

- For example, if your data has linear relationships, you might choose linear regression. If it has complex, non-linear relationships, you might opt for tree-based models like Random Forest or XGBoost.
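
A few of these checks can be sketched with pandas. The iris dataset is only a stand-in, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Iris is a stand-in dataset; as_frame=True returns a pandas DataFrame
df = load_iris(as_frame=True).frame

summary = df.describe()                             # ranges, means, quartiles per column
missing = df.isna().sum()                           # missing values per column
class_counts = df["target"].value_counts()          # class balance of the label
correlations = df.corr()["target"].drop("target")   # linear relation of each feature to the label

# Simple IQR rule to flag potential outliers in one feature
col = df["sepal width (cm)"]
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

print(summary, missing, class_counts, correlations, len(outliers), sep="\n\n")
```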



##12. What is correlation?

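-> Correlation is a statistical measure that describes the relationship between two variables. It shows whether and how strongly the variables are related to each other. The most common measure, the Pearson correlation coefficient, ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).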

##13. What does negative correlation mean?

-> A negative correlation means that as one variable increases, the other variable decreases, and vice versa. In other words, the two variables move in opposite directions.



##14. How can you find correlation between variables in Python?

-> A common option is NumPy's corrcoef() function: pass in two arrays of data, one for each variable, and it returns a correlation matrix. This is a square matrix whose diagonal elements are always 1 and whose off-diagonal elements are the correlations between the different variables. The (Pearson) correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. For data held in a pandas DataFrame, the corr() method returns the same kind of matrix for all numeric columns.
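
A short sketch of both approaches, using made-up values for exercise time and body fat (the variable names and numbers are illustrative assumptions). It also recomputes the coefficient from the covariance and standard deviations to show that the definition matches.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values: weekly exercise hours and body fat percentage
exercise_hours = np.array([1, 2, 3, 4, 5, 6])
body_fat_pct = np.array([30, 28, 27, 24, 22, 21])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
corr_matrix = np.corrcoef(exercise_hours, body_fat_pct)
r = corr_matrix[0, 1]

# Same value from the definition: covariance / (std_x * std_y)
r_manual = np.cov(exercise_hours, body_fat_pct)[0, 1] / (
    np.std(exercise_hours, ddof=1) * np.std(body_fat_pct, ddof=1)
)

# pandas computes the full correlation matrix of a DataFrame
df = pd.DataFrame({"exercise_hours": exercise_hours, "body_fat_pct": body_fat_pct})
print(df.corr())
print(round(r, 3), round(r_manual, 3))
```

The coefficient here is strongly negative, matching the inverse relationship in the data.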

##15. What is causation? Explain difference between correlation and causation with an example.

-> Causation (or causal relationship) refers to a situation where one variable directly influences or causes a change in another variable. In other words, X causes Y.  Causal relationships imply a cause-and-effect scenario where the change in one variable leads to the change in the other variable.

1. Correlation Example:
Ice Cream Sales & Drowning Incidents

- Observation: There’s a correlation between ice cream sales and drowning incidents. Both increase in the summer.

- Correlation: As ice cream sales go up, drowning incidents also increase.

  - But: This does not mean that eating ice cream causes drowning. The true underlying factor is the season (summer), when more people swim and also eat ice cream.

  - Conclusion: This is a spurious correlation. Ice cream sales and drowning incidents are both influenced by warmer weather, but one doesn't cause the other.

2. Causation Example:

Smoking & Lung Cancer

- Observation: There is strong evidence that smoking causes lung cancer.

- Causation: The chemicals in cigarettes directly damage lung tissue, increasing the risk of cancer.

  - Conclusion: Smoking is a cause of lung cancer, not just correlated with it.
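
The ice cream example can be imitated with simulated data. In this sketch every number is synthetic and the coefficients are arbitrary assumptions: temperature drives both quantities, neither influences the other, and yet they end up strongly correlated until temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data (assumption): temperature drives both quantities independently
temperature = rng.normal(0.0, 1.0, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0.0, 1.0, n)
drownings = 1.5 * temperature + rng.normal(0.0, 1.0, n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after removing its best straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation once the effect of temperature is removed from both variables
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, controlling for temperature: {adjusted_corr:.2f}")
```

The raw correlation is high, while the correlation of the residuals is close to zero, which is what a confounding variable looks like in data.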



##16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

-> An optimizer in machine learning is an algorithm or method used to minimize or maximize a function (typically a loss function) by adjusting the model's parameters (weights and biases) during training. The goal is to find the optimal values for these parameters to minimize the error between the predicted and actual values.

Types of Optimizers

1. Gradient Descent (GD)
How it works: The optimizer adjusts parameters by taking steps proportional to the negative of the gradient (partial derivatives) of the loss function with respect to the model parameters. It moves downhill on the loss curve to find the minimum.

2. Stochastic Gradient Descent (SGD)
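How it works: Unlike batch gradient descent, SGD updates the model parameters after each individual data point (or, in the common mini-batch variant, after each small batch). Each update is much cheaper to compute, though noisier, which usually makes training faster on large datasets.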

3. Momentum
How it works: Momentum helps accelerate gradient descent by adding a "momentum" term that takes into account the previous updates to smooth out oscillations. It speeds up convergence by accumulating previous gradients and using them to influence the current update.

4. Adagrad
How it works: Adagrad adapts the learning rate based on the frequency of updates. It gives larger updates to less frequent features and smaller updates to frequently occurring features.

5. RMSprop (Root Mean Square Propagation)
How it works: RMSprop is similar to Adagrad but solves its problem of rapidly decaying learning rates. It divides the gradient by the square root of a moving average of its recent squared magnitudes.

6. Adam (Adaptive Moment Estimation)
How it works: Adam combines the advantages of both Momentum and RMSprop. It keeps track of both the first moment (mean) and the second moment (uncentered variance) of the gradients, adapting the learning rate for each parameter.
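
As an example for each optimizer, the update rules above can be written in a few lines of NumPy. This is a simplified sketch on a toy one-parameter problem (finding the value w that minimizes the mean squared distance to five made-up numbers). The function names and hyperparameter values are illustrative assumptions, not library code.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data; the loss minimum is their mean, 3.0

def grad(w, x=data):
    """Gradient of the loss mean((w - x)^2) with respect to w."""
    return 2.0 * np.mean(w - x)

def gradient_descent(w=0.0, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)                      # full-data gradient every step
    return w

def sgd(w=0.0, lr=0.01, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for x_i in rng.permutation(data):
            w -= lr * grad(w, x_i)             # gradient from a single sample
    return w

def momentum(w=0.0, lr=0.1, beta=0.9, steps=500):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)   # decaying sum of past gradients
        w -= lr * velocity
    return w

def adagrad(w=0.0, lr=1.0, eps=1e-8, steps=500):
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        grad_sq_sum += g * g                   # only grows, so the step keeps shrinking
        w -= lr * g / (np.sqrt(grad_sq_sum) + eps)
    return w

def rmsprop(w=0.0, lr=0.02, rho=0.9, eps=1e-8, steps=500):
    avg_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1 - rho) * g * g   # moving average instead of a sum
        w -= lr * g / (np.sqrt(avg_sq) + eps)
    return w

def adam(w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

optimizers = [gradient_descent, sgd, momentum, adagrad, rmsprop, adam]
results = {fn.__name__: float(fn()) for fn in optimizers}
for name, w in results.items():
    print(f"{name:>16}: w = {w:.4f}")
```

All six variants should end near w = 3. SGD and RMSprop with a constant learning rate hover around the minimum rather than settling on it exactly, which is why learning rate schedules are often used with them in practice.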

##17. What is sklearn.linear_model ?

-> The sklearn.linear_model module in Scikit-learn (a popular machine learning library) provides various linear models for regression and classification tasks, for example LinearRegression, Ridge, Lasso, and LogisticRegression. These models are based on the concept of a linear relationship between the input features and the output target.

##18. What does model.fit() do? What arguments must be given?

-> In Scikit-learn, the model.fit() method is used to train the model on the given data. It learns the relationship between the input features (X) and the target variable (y) by adjusting the internal parameters (e.g., weights and biases for linear models).

- In Regression: It learns how to predict a continuous target variable based on input features.

- In Classification: It learns how to classify data points into predefined categories.

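Essentially, the fit() method finds the model parameters (e.g., weights) that minimize the loss function for the given data. For supervised estimators it takes two required arguments: X, the input features as a 2D array or DataFrame of shape (n_samples, n_features), and y, the target values of shape (n_samples,). Unsupervised estimators and transformers need only X.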
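
A minimal sketch of fit() with a linear model. The data is made up so that it follows y = 2x + 1 exactly, which makes the learned parameters easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (n_samples, n_features)
y = np.array([3.0, 5.0, 7.0, 9.0])           # shape (n_samples,)

model = LinearRegression()
model.fit(X, y)   # learns the slope and intercept from X and y

print(model.coef_, model.intercept_)
```

After fitting, the learned slope is stored in coef_ and the learned intercept in intercept_.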

##19. What does model.predict() do? What arguments must be given?

-> The model.predict() method in Scikit-learn is used to make predictions on new, unseen data after a model has been trained using the fit() method. Essentially, it takes the input features and outputs the predicted target values based on the relationships learned during training.

- For Regression: The model predicts continuous values (e.g., prices, quantities).

- For Classification: The model predicts discrete class labels (e.g., category 1, category 2).

In other words, after training the model with the training data (using fit()), you can use predict() to apply the model to make predictions for new data.

Arguments for model.predict()

The model.predict() method typically requires one argument:

 1. X (Features):

- This is the input data you want to make predictions on. It should have the same structure as the training data used in fit().

- Shape: X should be a 2D array or DataFrame with shape (n_samples, n_features), where:

  - n_samples is the number of new data points you want to predict for.

  - n_features is the number of features (the same as in the training data).

- Example: If you are predicting house prices, X might contain the features such as square footage, number of rooms, etc., for the new houses you want to predict prices for.

Syntax of model.predict()

    predictions = model.predict(X)
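
Here is a small sketch of the house price example. The feature values and prices are invented for illustration, and LinearRegression is just one model that could be used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [square footage, rooms] -> price (made-up numbers)
X_train = np.array([[800, 2], [1000, 3], [1200, 3], [1500, 4]])
y_train = np.array([170_000, 215_000, 255_000, 320_000])

model = LinearRegression().fit(X_train, y_train)

# New houses must use the same two features, in the same column order
X_new = np.array([[1100, 3], [1400, 4]])     # shape (2, 2)
predictions = model.predict(X_new)           # one predicted price per row

# A single house still needs a 2D input, hence the reshape
single = model.predict(np.array([1100, 3]).reshape(1, -1))

print(predictions.shape, predictions, single)
```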


##20. What are continuous and categorical variables?

-> In data analysis and machine learning, variables (also called features or columns) are usually categorized as continuous or categorical, depending on the type of data they contain.

1. Continuous Variables

  Definition:
A continuous variable is a numerical variable that can take any value within a range. These values are typically measurable and can include decimals.

Examples:

- Height (e.g., 170.2 cm)

- Weight (e.g., 65.5 kg)

- Temperature (e.g., 22.3°C)

- Age (e.g., 21.5 years)

- Salary, price, distance, time, etc.

Key characteristics:

- Can take infinitely many possible values within a given range.

- Can be mathematically operated on (addition, mean, standard deviation).

- Often used in regression problems.

2. Categorical Variables
  
  Definition:
A categorical variable contains values that represent categories or groups. These values are usually labels or names, not numbers used for calculations.

Examples:

- Gender (Male, Female, Other)

- Color (Red, Green, Blue)

- Country (India, USA, UK)

- Product type (Laptop, Phone, Tablet)

Key characteristics:

- Values represent categories, not quantities.

- Can be:

  - Nominal: No natural order (e.g., Color: Red, Blue, Green)

  - Ordinal: Has a logical order (e.g., Education Level: High School < Bachelor < Master < PhD)

- Often used in classification problems.


##21. What is feature scaling? How does it help in Machine Learning?

-> Feature scaling is a data preprocessing technique used in Machine Learning to bring all the input features (variables) onto a similar scale, so that no single feature dominates the others just because of its range of values.

In many ML algorithms, the model measures distances or optimizes weights during training. If the features are on very different scales, the model might:

- Give more importance to features with larger numerical ranges,

- Struggle to converge properly, or

- Make poor predictions because features with large values dominate the others.

Benefits of Feature Scaling
- Speeds up training of models

- Helps gradient descent converge faster

- Improves model performance (especially for distance-based models)

- Prevents bias toward features with larger values
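
The effect on distance-based models can be seen in a small sketch. The ages and incomes are made up, and the helper income_share is a hypothetical function written for this example. It measures how much of the squared Euclidean distance between two people comes from income alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [age in years, income in dollars]
X = np.array([
    [25, 40_000],
    [45, 42_000],
    [35, 90_000],
    [50, 60_000],
    [23, 75_000],
], dtype=float)

def income_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to income."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()

X_scaled = StandardScaler().fit_transform(X)

share_raw = income_share(X[0], X[1])
share_scaled = income_share(X_scaled[0], X_scaled[1])
print(f"income share of distance: raw {share_raw:.4f}, scaled {share_scaled:.4f}")
```

Before scaling, the 20-year age gap between the first two people is almost invisible next to a modest income difference. After scaling, both features contribute in proportion to how unusual the difference is.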

##22. How do we perform scaling in Python?

-> To scale features in Python, we commonly use Scikit-learn's preprocessing module. The most used scaling methods are:

- Min-Max Scaling

```
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Example data
X = np.array([[1], [5], [10], [15]])

# Apply Min-Max Scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)

```
- Standardization (Z-score Scaling)

```
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example data
X = np.array([[1], [5], [10], [15]])

# Apply Standardization: (x - mean) / standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```
- Robust Scaling

```
import numpy as np
from sklearn.preprocessing import RobustScaler

# Example data with one outlier (100)
X = np.array([[1], [5], [10], [100]])

# Apply Robust Scaling: (x - median) / interquartile range
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```
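
When the data is split into training and testing sets, a common practice is to fit the scaler on the training rows only and then reuse it on the test rows. The sketch below assumes a made-up single-feature dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
X = np.arange(1, 21, dtype=float).reshape(-1, 1)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit on test rows

print(scaler.mean_, scaler.scale_)
```

Calling fit_transform on the test set instead would let information about the test data influence preprocessing, which makes the evaluation look better than it should.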



##23. What is sklearn.preprocessing?

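-> sklearn.preprocessing is a module in Scikit-learn that provides tools to prepare and transform raw data into a format that can be used effectively by machine learning models, such as scalers (StandardScaler, MinMaxScaler, RobustScaler) and encoders (OneHotEncoder, OrdinalEncoder, LabelEncoder).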

##24. How do we split data for model fitting (training and testing) in Python?

-> To split data into training and testing sets in Python, you can use the train_test_split function from the scikit-learn library. This function randomly divides your dataset into training and testing sets, allowing you to train your model on a portion of the data and evaluate its performance on unseen data.
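
For classification problems with imbalanced classes, train_test_split also accepts a stratify argument that keeps the class proportions the same in both subsets. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both subsets keep the original 80/20 class ratio
print(np.bincount(y_train), np.bincount(y_test))
```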

##25. Explain data encoding?

-> Data encoding is the process of converting categorical data (text or labels) into a numeric format so it can be used by machine learning algorithms.

Why Is Encoding Important?
- Many datasets have categorical features like Gender, Country, Color, etc.

- Most algorithms work with numerical inputs only.

- Encoding helps translate labels into numbers while preserving meaning.
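
As one small example, scikit-learn's LabelEncoder maps each distinct label to an integer and can map the integers back. The country labels below are made up. LabelEncoder is intended for target labels; for input features, one-hot or ordinal encoding is usually the better fit because plain integer codes can suggest an order that does not exist.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target labels
labels = ["India", "USA", "UK", "India", "UK"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # classes are sorted, then numbered from 0

print(list(encoder.classes_), codes)
print(list(encoder.inverse_transform(codes)))
```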