# Feature-Engineering



---


**Question 1:** What is a parameter?

**Answer:** A parameter is a value that a model learns from the training data during training, such as the weights and bias of a linear regression. Parameters are internal variables that the model adjusts in order to make accurate predictions, and together they determine the model's behavior.
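
As a minimal sketch of what "learned from the training data" means, the snippet below fits scikit-learn's `LinearRegression` on hypothetical toy data generated from y = 2x + 1. The data and variable names are assumptions made for illustration; the learned parameters are exposed through the estimator's `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from y = 2x + 1 (no noise),
# so the learned values are easy to check by eye.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

# The parameters are not set by hand; they are estimated during fit().
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On this noise-free data the learned slope and intercept should come out very close to 2 and 1.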



---


**Question 2:** What is correlation? What does negative correlation mean?


**Answer:** Correlation is a statistical measure that describes the relationship between two variables. It indicates the strength and direction of the linear relationship between the variables.

A negative correlation means if one variable increases, the other variable tends to decrease, and vice versa.
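
To make the idea concrete, here is a small sketch using NumPy's `corrcoef`. The two arrays are hypothetical values chosen so that one goes down as the other goes up; they are not taken from any real dataset.

```python
import numpy as np

# Hypothetical data: as hours of TV increase, the exam score tends to decrease.
hours_tv = np.array([1, 2, 3, 4, 5])
exam_score = np.array([90, 85, 70, 65, 50])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# Pearson correlation between the two arrays.
r = np.corrcoef(hours_tv, exam_score)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A coefficient close to -1 indicates a strong negative linear relationship, a value close to +1 a strong positive one, and a value near 0 little or no linear relationship.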



---


**Question 3:**  Define Machine Learning. What are the main components in Machine Learning?


**Answer:** Machine Learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from data and improve from experience without being explicitly programmed.

**Main components in Machine Learning:**

`Data`: The foundational element in ML, consisting of features (input variables) and labels (output variables). Quality and quantity of data are crucial for building effective models.

`Algorithms`: Mathematical methods used to build models, such as Linear Regression, Decision Trees, and Neural Networks. These algorithms learn patterns from data.

`Model`: The mathematical representation built by training an algorithm on data. It predicts or classifies based on learned patterns.

`Training Data`: A dataset used to teach the algorithm by providing examples of input-output pairs.

`Testing Data`: A separate dataset used to evaluate how well the trained model performs on unseen data.

`Features`: Input variables that help predict the outcome. Examples: age, income, or house size.

`Labels`: The target outcome or result the model is trying to predict.



---


**Question 4:**  How does loss value help in determining whether the model is good or not?

**Answer:** The loss value helps determine how well the model is performing. It measures the difference between the model's predictions and the actual values. A lower loss value indicates better model performance, meaning the model's predictions are closer to the actual outcomes. Conversely, a higher loss value suggests that the model's predictions are farther off, indicating it needs improvement.

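MSE, MAE, and RMSE are common loss functions for regression. R^2 and Adjusted R^2 are evaluation metrics rather than losses: for them, a higher value indicates a better fit.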
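
The sketch below computes these quantities with `sklearn.metrics` for a handful of hypothetical true and predicted values. The numbers are made up for illustration; the point is only that MSE, MAE, and RMSE shrink as predictions get closer to the actual values, while R^2 moves toward 1.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit

print(f"MSE={mse:.3f}, MAE={mae:.3f}, RMSE={rmse:.3f}, R^2={r2:.3f}")
```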



---


**Question 5:** What are continuous and categorical variables?

**Answer:** `Continuous variables` are quantitative variables that can take an infinite number of values within a given range.

`Categorical variables` are qualitative variables that represent categories or groups.



---


**Question 6:** How do we handle categorical variables in Machine Learning? What are the common techniques?

**Answer:** To handle categorical variables in Machine Learning, we need to convert them into numerical values since most algorithms require numerical input. The common techniques are:

1. `One-Hot Encoding`: Converts each category into a new binary column (0 or 1).

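2. `Label Encoding`: Converts each category into a unique integer. Because integers imply an order, this is best suited to ordinal categories or to models that are not sensitive to that order, such as decision trees.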

3. `Target Encoding`: Involves encoding categories based on the mean of the target variable for each category.

These techniques help the model understand categorical data, allowing it to make accurate predictions.
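
The three techniques can be sketched as follows with pandas and scikit-learn. The `City` and `Price` columns are hypothetical, and the target encoding here is a plain per-category mean computed with `groupby`, which is an assumption made for brevity; in practice the means would be computed on the training data only to avoid leaking the target.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset with one categorical feature and a numeric target
df_cat = pd.DataFrame({
    "City": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "Price": [100, 200, 120, 150],
})

# 1. One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df_cat["City"], prefix="City")

# 2. Label encoding: one integer per category (assigned in sorted order)
label_codes = LabelEncoder().fit_transform(df_cat["City"])

# 3. Target encoding: replace each category with the mean target for that category
city_means = df_cat.groupby("City")["Price"].mean()
target_encoded = df_cat["City"].map(city_means)

print(one_hot)
print(label_codes)
print(target_encoded)
```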



---


**Question 7:** What do you mean by training and testing a dataset?

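**Answer:**  **a)** **Training** means fitting the ML model on one part of the dataset (the training set) so that it learns patterns from that data.

**b)** **Testing** means evaluating the trained model on a separate part of the dataset (the test set) that the model did not see during training.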



---


**Question 8:** What is sklearn.preprocessing?

**Answer:** sklearn.preprocessing is a module in scikit-learn (a popular Python machine learning library) that provides several utilities and functions to transform and scale data to prepare it for machine learning algorithms.

Key features of sklearn.preprocessing include:
1. Scaling and Normalization
2. Encoding Categorical Data



---


**Question 9:**  What is a Test set?

**Answer:** A test set is a portion of the dataset that is used to evaluate the performance of a machine learning model after it has been trained.



---


**Question 10:** a) How do we split data for model fitting (training and testing) in Python?

 b) How do you approach a Machine Learning problem?

**Answer:** **b)** A typical approach to a Machine Learning problem:
1. Define the Problem: Identify the objective (classification, regression, etc.), inputs (features), and outputs (target variable).

2. Collect and Prepare Data: Gather relevant data, clean it (handle missing values, outliers), and engineer features.

3. Explore Data (EDA): Visualize and analyze relationships between variables, identify patterns, and calculate statistics.

4. Choose the Model: Select an appropriate model based on the problem (e.g., logistic regression for classification).

5. Split the Data: Divide the dataset into training and testing sets (typically an 80/20 or 70/30 split).

6. Train the Model: Fit the selected model to the training data and tune hyperparameters.

7. Evaluate the Model: Assess performance using metrics (accuracy, RMSE, etc.) and ensure the model is not overfitting or underfitting.

8. Optimize the Model: Fine-tune hyperparameters and improve features if necessary.

In [3]:
# a.
# Importing necessary libraries
from sklearn.model_selection import train_test_split
import pandas as pd

data1 = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5],'Feature2': [5, 4, 3, 2, 1],'Target': [0, 1, 0, 1, 0]})

# Features (X) and target (y)
X = data1[['Feature1', 'Feature2']]  # Independent variables (features)
y = data1['Target']  # Dependent variable (target)

# Splitting the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the split datasets
print(f"Training Features Shape: {X_train.shape}")
print(f"Testing Features Shape: {X_test.shape}")
print(f"Training Target Shape: {y_train.shape}")
print(f"Testing Target Shape: {y_test.shape}")


Training Features Shape: (4, 2)
Testing Features Shape: (1, 2)
Training Target Shape: (4,)
Testing Target Shape: (1,)




---


**Question 11:**  Why do we have to perform EDA before fitting a model to the data?

**Answer:** EDA (Exploratory Data Analysis) is crucial before fitting a model because it helps you:

1. Understand the distribution of data.
2. Identify missing values, duplicate values and outliers.
3. Explore relationships between features and the target variable.
4. Inform feature engineering and selection.
5. Choose the right model based on data patterns.
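
A few of these checks can be sketched with pandas as shown below. The small DataFrame is hypothetical and was built to contain one missing value, one duplicated row, and one extreme age; the 1.5 × IQR rule used for outliers is one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, a duplicate row, and an extreme age
df_eda = pd.DataFrame({
    "Age": [23, 25, np.nan, 30, 25, 95],
    "Salary": [45000, 50000, 60000, 70000, 50000, 65000],
})

summary = df_eda.describe()                      # distribution of each numeric column
missing_counts = df_eda.isna().sum()             # missing values per column
duplicate_rows = int(df_eda.duplicated().sum())  # number of fully duplicated rows

# Flag outliers in Age with the 1.5 * IQR rule
q1, q3 = df_eda["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df_eda[(df_eda["Age"] < q1 - 1.5 * iqr) | (df_eda["Age"] > q3 + 1.5 * iqr)]

print(summary)
print(missing_counts)
print("Duplicate rows:", duplicate_rows)
print(outliers)
```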

**Question 12:**

**Answer:**

**Question 13:**

**Answer:**



---


**Question 14:**  How can you find correlation between variables in Python?

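**Answer:** Use the pandas `DataFrame.corr()` method, which returns the pairwise correlation between columns (Pearson by default):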

In [4]:
data2 = {'Age': [23, 25, 28, 30, 22, 27],'Salary': [45000, 50000, 60000, 70000, 40000, 65000],'Experience (Years)': [1, 2, 4, 5, 1, 3]}

# Creating a DataFrame
df = pd.DataFrame(data2)

# Finding the correlation between all numerical columns
correlation_matrix = df.corr()

# Displaying the correlation matrix
print(correlation_matrix)

                         Age    Salary  Experience (Years)
Age                 1.000000  0.966521            0.987105
Salary              0.966521  1.000000            0.931589
Experience (Years)  0.987105  0.931589            1.000000




---


**Question 15:** What is causation? Explain the difference between correlation and causation with an example.

**Answer:** Causation refers to a relationship between two variables where one variable directly affects or causes changes in the other. In other words, causation implies that changes in one variable result in changes in another.

**Difference between the two:**

**Correlation:**

Measures the strength and direction of a relationship between two variables.

Does not imply that one variable causes the other.

Symmetric relationship: A ↔ B.

Example: Ice cream sales and drowning incidents are correlated because both increase in summer, but neither causes the other; hot weather drives both.

**Causation:**

Indicates a cause-and-effect relationship where one variable directly affects the other.

Implies that changes in one variable result in changes in the other.

Asymmetric relationship: A → B.

Example: Eating contaminated food causes food poisoning.
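
The ice cream example can be illustrated with a small simulation. Everything below is synthetic: the temperature range, the coefficients, and the noise levels are assumptions chosen only to show how a shared cause (temperature) can make two otherwise unrelated variables look correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: both quantities depend on temperature, not on each other.
temperature = rng.uniform(10, 35, size=500)
ice_cream = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1.5, size=500)

# Raw correlation between the two effects
r = np.corrcoef(ice_cream, drownings)[0, 1]


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after removing the influence of temperature from both variables
r_partial = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation: {r:.2f}")
print(f"after controlling for temperature: {r_partial:.2f}")
```

The raw correlation is clearly positive, while the correlation that remains after accounting for temperature is close to zero, which is what we would expect when neither variable causes the other.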

---
**Question 16:**  What is an Optimizer? What are different types of optimizers? Explain each with an example.

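**Answer:** An optimizer is an algorithm that adjusts a model's parameters during training in order to minimize the loss function (or, equivalently, to maximize an objective).

**Gradient Descent (GD):**

Description: A simple optimization algorithm that updates the model parameters by moving in the direction opposite to the gradient (derivative) of the loss function. Each parameter is updated using its own partial derivative: $\theta_j := \theta_j - \alpha \, \partial J / \partial \theta_j$, where $J$ is the loss and $\alpha$ is the learning rate. For a linear hypothesis $h_\theta(x) = \theta_1 x + \theta_0$, the parameters being adjusted are $\theta_0$ and $\theta_1$.

Example: In linear regression, Gradient Descent iteratively adjusts the coefficients of the features to minimize the loss function, typically the Mean Squared Error (MSE).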
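
A minimal sketch of batch gradient descent for a one-feature linear regression is shown below. The toy data, the learning rate of 0.05, and the iteration count are assumptions picked so that this small example converges; they are not general recommendations.

```python
import numpy as np

# Hypothetical toy data generated from theta1 = 2, theta0 = 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

theta0, theta1 = 0.0, 0.0   # initial parameter values
learning_rate = 0.05        # step size (assumed; too large a value diverges)

for _ in range(5000):
    error = (theta1 * x + theta0) - y   # prediction minus actual value
    grad0 = 2 * error.mean()            # d(MSE)/d(theta0)
    grad1 = 2 * (error * x).mean()      # d(MSE)/d(theta1)
    # Move each parameter against its gradient
    theta0 -= learning_rate * grad0
    theta1 -= learning_rate * grad1

print(f"theta0={theta0:.3f}, theta1={theta1:.3f}")
```

After enough iterations the parameters approach the values used to generate the data, and the MSE approaches zero.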

---
**Question 17:**  What is sklearn.linear_model?


**Answer:** `sklearn.linear_model` is a module in the scikit-learn library that provides linear models for regression and classification tasks, such as `LinearRegression`, `Ridge`, `Lasso`, and `LogisticRegression`. These models assume a linear relationship between the input features (independent variables) and the output target (dependent variable).



---


**Question 18:**  What does model.fit() do? What arguments must be given?


**Answer:** The `fit()` method trains the model on the provided training data. It learns the patterns or relationships in the data so that the model can later make predictions.

Arguments for `model.fit()`:

`X` (Training features): The input data that the model learns from. It is usually a 2D array or a DataFrame of shape (n_samples, n_features).

`y` (Target labels): The true values corresponding to the rows of `X`. It is usually a 1D array or a Series. Unsupervised estimators, such as scalers or clustering models, do not require `y`.
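
As a hedged illustration, the snippet below calls `fit()` on scikit-learn's `LogisticRegression` with hypothetical training data. The arrays are made up; the point is the shape of the arguments (2D features, 1D labels) and the fact that `fit()` returns the fitted estimator itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: 6 samples, 1 feature, binary labels
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # shape (6, 1)
y_train = np.array([0, 0, 0, 1, 1, 1])                          # shape (6,)

clf = LogisticRegression()
fitted = clf.fit(X_train, y_train)  # learns coefficients from X_train and y_train

print(fitted is clf)   # fit() returns the same estimator object
print(clf.classes_)    # class labels seen during training
print(clf.coef_.shape) # one coefficient per feature
```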



---


**Question 19:**  What does model.predict() do? What arguments must be given?


**Answer:** The `predict()` method generates predictions from a trained model. After the model has been fitted, this method takes new data (for example, the test set) and returns the predicted outputs.

Arguments for `model.predict()`:

`X` (Test features): The input data on which predictions are to be made. It must have the same features (columns) as the data used in `fit()`.
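
A short sketch of `predict()` with `LinearRegression` follows. The training data is hypothetical and generated from y = 2x + 1, so the predictions for new inputs are easy to sanity-check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data generated from y = 2x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_train, y_train)

# New, unseen inputs with the same number of features as X_train
X_new = np.array([[5.0], [6.0]])
predictions = model.predict(X_new)  # one prediction per row of X_new

print(predictions)
```

For this noise-free data the predictions should be very close to 11 and 13.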

**Question 20:**

**Answer:**

---
**Question 21:**  What is feature scaling? How does it help in Machine Learning?


**Answer:** Feature scaling refers to the process of normalizing or standardizing the range of independent variables or features in a dataset. It puts features on a comparable scale so that features with larger numerical ranges do not dominate the learning process, which matters most for distance-based and gradient-based algorithms.

Types of Feature Scaling:

1. Normalization (Min-Max Scaling)
2. Standardization (Z-score Scaling)
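
Both formulas are simple enough to compute by hand, as sketched below with NumPy on a hypothetical `age` array. Min-max scaling uses (x - min) / (max - min), and standardization uses (x - mean) / std. The population standard deviation (NumPy's default, `ddof=0`) is used here because that is what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0])

# Normalization (min-max scaling): maps values into the range [0, 1]
min_max = (age - age.min()) / (age.max() - age.min())

# Standardization (z-score scaling): mean 0, standard deviation 1
z_score = (age - age.mean()) / age.std()

print(min_max)
print(z_score)
```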



---


**Question 22:**  How do we perform scaling in Python?

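**Answer:** Use `StandardScaler` for standardization and `MinMaxScaler` for normalization, both from `sklearn.preprocessing`: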

In [5]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Creating a small dataset
data3 = {'Age': [25, 30, 35, 40, 45],'Income': [40000, 50000, 60000, 70000, 80000],'Score': [88, 92, 85, 79, 95]}

df = pd.DataFrame(data3)

# Standardizing the data (mean = 0, std = 1)
scaler_standard = StandardScaler()
df_standardized = df.copy()
df_standardized[['Age', 'Income', 'Score']] = scaler_standard.fit_transform(df[['Age', 'Income', 'Score']])

# Normalizing the data (range 0-1)
scaler_normalize = MinMaxScaler()
df_normalized = df.copy()
df_normalized[['Age', 'Income', 'Score']] = scaler_normalize.fit_transform(df[['Age', 'Income', 'Score']])

# Displaying the results
print("Original Data:\n", df)
print("\nStandardized Data:\n", df_standardized)
print("\nNormalized Data:\n", df_normalized)


Original Data:
    Age  Income  Score
0   25   40000     88
1   30   50000     92
2   35   60000     85
3   40   70000     79
4   45   80000     95

Standardized Data:
         Age    Income     Score
0 -1.414214 -1.414214  0.035944
1 -0.707107 -0.707107  0.754829
2  0.000000  0.000000 -0.503220
3  0.707107  0.707107 -1.581547
4  1.414214  1.414214  1.293993

Normalized Data:
     Age  Income   Score
0  0.00    0.00  0.5625
1  0.25    0.25  0.8125
2  0.50    0.50  0.3750
3  0.75    0.75  0.0000
4  1.00    1.00  1.0000


**Question 23:**

**Answer:**

**Question 24:**

**Answer:**

---


**Question 25:**  Explain data encoding?

**Answer:** Data encoding is the process of converting categorical data into numerical data.

Types of data encoding:

1. **One-Hot Encoding**: Converts categorical variables into binary columns, with 1 for the category's presence and 0 for absence.

2. **Ordinal Encoding**: Assigns integer values to categories with a meaningful order.

3. **Target Encoding**: Replaces categories with the mean of the target variable for each category.
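
The first two encodings can be sketched with scikit-learn's `OrdinalEncoder` and `OneHotEncoder`. The `sizes` column and its small < medium < large ordering are hypothetical; the order is passed explicitly through the `categories` argument, since otherwise the encoder sorts categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature with a natural order
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Ordinal encoding with an explicit order: small=0, medium=1, large=2
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)

# One-hot encoding: one binary column per category (columns sorted alphabetically)
one_hot = OneHotEncoder()
size_one_hot = one_hot.fit_transform(sizes).toarray()

print(size_codes.ravel())
print(one_hot.categories_)
print(size_one_hot)
```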