# Feature Engineering - 5

## Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

**Answer:**

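- **Label Encoding:** Assigns each category a unique integer with no meaningful order behind the numbers. scikit-learn's `LabelEncoder` sorts the categories alphabetically and is intended mainly for target labels. Example: [red, green, blue] → [2, 1, 0], because blue=0, green=1, red=2.
- **Ordinal Encoding:** Assigns integers that follow a meaningful order you specify. Used for ordinal (ordered) data. Example: [small, medium, large] → [0, 1, 2].

Choose ordinal encoding when the categories have a natural order that the model should see. Use label encoding when the integers are only identifiers, such as for a target column or unordered categories. For unordered input features, keep in mind that many models treat the integer codes as ordered, which is why one-hot encoding is often preferred there.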
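The difference is easiest to see on an ordered feature. The following is a minimal sketch, assuming scikit-learn's `LabelEncoder` and `OrdinalEncoder`; the `sizes` array is made-up illustration data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up sample of an ordered feature
sizes = np.array(['small', 'large', 'medium', 'small'])

# LabelEncoder sorts classes alphabetically: large=0, medium=1, small=2
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder accepts the intended order explicitly: small=0, medium=1, large=2
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

print(label_codes)    # [2 0 1 2]
print(ordinal_codes)  # [0. 2. 1. 0.]
```

Only the second encoding preserves small < medium < large; the first one places "small" above "large".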

## Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

**Answer:**

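Target Guided Ordinal Encoding ranks the categories by the mean of the target variable within each category and assigns integers in that rank order. Example: for a feature 'City' and target 'Sales', compute the average sales per city, sort the cities by that average, and encode the lowest as 0, the next as 1, and so on. It is useful when a categorical variable has a strong relationship with the target. Compute the mapping on the training data only, so that target information from validation or test rows does not leak into the feature.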
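One way to sketch the procedure with pandas is shown below. The city names and sales figures are hypothetical, and `city_rank` is just an illustrative variable name.

```python
import pandas as pd

# Hypothetical data: 'City' is the feature, 'Sales' is the target
df = pd.DataFrame({
    'City': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Sales': [100, 300, 200, 120, 280, 220],
})

# Mean target per category, sorted from lowest to highest
city_means = df.groupby('City')['Sales'].mean().sort_values()

# Rank 0 goes to the city with the lowest mean sales
city_rank = {city: rank for rank, city in enumerate(city_means.index)}
df['City_encoded'] = df['City'].map(city_rank)

print(city_rank)  # {'A': 0, 'C': 1, 'B': 2}
```

In a real project the `city_rank` mapping would be built from the training split and then applied unchanged to the other splits.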

## Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

**Answer:**

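Covariance measures how two variables vary together around their means. A positive value means that above-average values of one variable tend to occur with above-average values of the other; a negative value means one tends to be above its mean when the other is below; a value near zero indicates no linear relationship. Its magnitude depends on the units of both variables, so it shows the direction of a relationship but not its strength. It is important for understanding relationships between variables and is used in portfolio theory, PCA, etc.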

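Sample covariance formula:

$$
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
$$

where $\bar{X}$ and $\bar{Y}$ are the sample means and $n$ is the number of observations.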
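The formula can be checked against NumPy on a small example. This is a sketch with made-up values for `x` and `y`; `np.cov` divides by n - 1 by default, which matches the formula above.

```python
import numpy as np

# Made-up sample values
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of products of deviations from the means, divided by n - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values equal 14/3, since the deviation products are 6, 0, -1 and 9.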

## Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic']
})

# Fit a separate encoder per column so each fitted mapping stays available
encoders = {}
for col in ['Color', 'Size', 'Material']:
    encoders[col] = LabelEncoder()
    data[col + '_encoded'] = encoders[col].fit_transform(data[col])

print(data)
```

**Explanation:**

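`LabelEncoder` sorts the unique values of each column alphabetically and assigns integers in that order, so the result is deterministic. 'Color': blue=0, green=1, red=2. 'Size': large=0, medium=1, small=2. 'Material': metal=0, plastic=1, wood=2. Note that the codes for 'Size' do not reflect the natural order small < medium < large; an ordinal encoder with an explicit category order would be needed for that.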
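To confirm which integer each category received, the fitted encoder can be inspected. A minimal sketch, assuming scikit-learn's `classes_` attribute and `inverse_transform` method; `mapping` is an illustrative name.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['small', 'medium', 'large'])

# classes_ holds the categories in code order (index = assigned integer)
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}

print(mapping)  # {'large': 0, 'medium': 1, 'small': 2}
print(codes)    # [2 1 0]

# inverse_transform turns integer codes back into the original labels
print(le.inverse_transform([0, 1, 2]).tolist())  # ['large', 'medium', 'small']
```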

## Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 45, 35, 50],
    'Income': [40000, 50000, 80000, 60000, 90000],
    'Education': [12, 16, 18, 14, 20]
})

# DataFrame.cov() returns the pairwise sample covariance (divides by n - 1)
cov_matrix = data.cov()
print(cov_matrix)
```

**Interpretation:**

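The covariance matrix is symmetric: the diagonal holds each variable's variance, and each off-diagonal entry is the covariance of one pair of variables. Positive values indicate that as one variable increases, the other tends to increase; negative values indicate an inverse relationship. In this sample all three off-diagonal entries are positive (Age and Income: 215000, Age and Education: 30, Income and Education: 60000), so higher Age goes with higher Income and higher Education. The Income entries are large only because Income is measured in much larger units; covariance magnitudes are not comparable across pairs, so use correlation to judge the strength of a relationship.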
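Because covariance depends on measurement units, its size alone says little. The sketch below reuses the Age and Income values from this question and rescales Income to thousands. The covariance shrinks by a factor of 1000 while the correlation stays the same.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 45, 35, 50],
    'Income': [40000, 50000, 80000, 60000, 90000],
})

# Same relationship, two different units for Income
cov_dollars = data['Age'].cov(data['Income'])
cov_thousands = data['Age'].cov(data['Income'] / 1000)

# Correlation is unit-free, so rescaling does not change it
corr_dollars = data['Age'].corr(data['Income'])
corr_thousands = data['Age'].corr(data['Income'] / 1000)

print(cov_dollars, cov_thousands)
print(corr_dollars, corr_thousands)
```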

## Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

**Answer:**

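- Gender: One-hot encoding (two categories, no order). With only two categories, a single 0/1 column is enough.
- Education Level: Ordinal encoding (ordered categories: High School < Bachelor's < Master's < PhD).
- Employment Status: One-hot encoding as the safer default (few categories). The levels could also be read as ordered by degree of employment (Unemployed < Part-Time < Full-Time), in which case ordinal encoding is defensible.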
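These choices could be applied roughly as follows. This is a sketch on a small made-up DataFrame, assuming `pd.get_dummies` for the one-hot columns and scikit-learn's `OrdinalEncoder` for the ordered one; the rows and the `Education_encoded` column name are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows using the category values from the question
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['High School', 'PhD', "Bachelor's", "Master's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

# Ordinal encoding with the order stated explicitly
education_order = ['High School', "Bachelor's", "Master's", 'PhD']
encoder = OrdinalEncoder(categories=[education_order])
df['Education_encoded'] = encoder.fit_transform(df[['Education Level']]).ravel()

# One-hot encoding for the unordered variables
encoded = pd.get_dummies(df, columns=['Gender', 'Employment Status'])

print(encoded)
```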

## Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Temperature': [30, 25, 28, 22, 35],
    'Humidity': [60, 55, 65, 50, 70],
    'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind': ['North', 'South', 'East', 'West', 'North']
})

# Covariance needs numbers, so the nominal columns are label-encoded first.
# The resulting codes are arbitrary (alphabetical), not a real ordering.
data['Weather_encoded'] = LabelEncoder().fit_transform(data['Weather'])
data['Wind_encoded'] = LabelEncoder().fit_transform(data['Wind'])

cov_matrix = data[['Temperature', 'Humidity', 'Weather_encoded', 'Wind_encoded']].cov()
print(cov_matrix)
```

**Interpretation:**

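The covariance values show how each variable pair changes together. The covariance between Temperature and Humidity is 36.25 in this sample, which is positive, so the two tend to increase together. The entries involving Weather_encoded and Wind_encoded should not be interpreted as real relationships: both variables are nominal, their integer codes come from alphabetical order, and a different assignment of codes can change the sign and size of those covariances.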