# **Feature Engineering 5**

### Q1. Difference between Ordinal Encoding and Label Encoding

**Ordinal Encoding**:
- Ordinal encoding is used for categorical features where there is a meaningful order or ranking between the categories. For instance, "Size" might have categories "Small", "Medium", and "Large". In ordinal encoding, these categories could be encoded as 0, 1, and 2, respectively.
- **Example**: A feature "Educational Level" with categories ["High School", "Bachelor's", "Master's", "PhD"]. Here, "High School" < "Bachelor's" < "Master's" < "PhD". Therefore, ordinal encoding is appropriate.

**Label Encoding**:
- Label encoding assigns a unique integer to each category of a categorical feature without any consideration of the order or ranking between the categories. This method is generally used for categorical features where no ordinal relationship exists.
- **Example**: A feature "Color" with categories ["Red", "Green", "Blue"]. Since there is no intrinsic ordering, label encoding can be used.

**When to choose one over the other**:
- Use **ordinal encoding** when the categorical feature has an inherent order.
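- Use **label encoding** when the categorical feature has no intrinsic order and the model will not misread the integer codes as a ranking (for example, tree-based models, or when encoding a target column). For linear or distance-based models, one-hot encoding is usually the safer choice for unordered features.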
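
The difference can be sketched with scikit-learn as follows. This is a minimal illustration, not a prescribed workflow: the sample values are made up, `OrdinalEncoder` is given the category order explicitly, and `LabelEncoder` is left to pick its own (alphabetical) order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordered feature: pass the intended ranking explicitly via `categories`.
sizes = np.array([["Small"], ["Large"], ["Medium"], ["Small"]])  # illustrative data
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(sizes).ravel()
print(size_codes)

# Unordered feature: LabelEncoder sorts the class names and numbers them,
# so the resulting integers carry no meaningful ranking.
colors = ["Red", "Green", "Blue", "Green"]  # illustrative data
label = LabelEncoder()
color_codes = label.fit_transform(colors)
print(color_codes, list(label.classes_))
```

With the explicit order, "Small" maps to 0 and "Large" to 2. For the colors, the codes simply follow the sorted class names stored in `label.classes_`.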


### Q2. Target Guided Ordinal Encoding

**Target Guided Ordinal Encoding**:
- This method involves encoding the categorical variables based on the relationship between the categories and the target variable.
- Categories are sorted by the mean (or median) of the target variable, and then an ordinal value is assigned to each category based on this ordering.

**Example**:
- Suppose you have a dataset with a categorical variable "Neighborhood" and a target variable "House Price". 
- Calculate the mean house price for each neighborhood.
- Sort neighborhoods by mean house price.
- Assign ordinal values based on this sorted order.

**When to use**:
- Use target guided ordinal encoding when the categorical variable has no natural order but you believe that the categories have different levels of impact on the target variable. For instance, in predicting house prices, neighborhoods might not have a natural order, but some neighborhoods might consistently have higher prices than others.
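
The steps in the example above can be sketched in pandas. The neighborhood names and prices below are invented for illustration, and the column names `Neighborhood` and `HousePrice` are assumptions.

```python
import pandas as pd

# Illustrative data: three hypothetical neighborhoods with made-up prices.
df = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "HousePrice": [300_000, 150_000, 220_000, 320_000, 170_000, 200_000],
})

# 1. Mean target value per category, sorted from lowest to highest.
mean_price = df.groupby("Neighborhood")["HousePrice"].mean().sort_values()

# 2. Assign ranks 0, 1, 2, ... following that sorted order.
rank_map = {name: rank for rank, name in enumerate(mean_price.index)}

# 3. Replace each category with its rank.
df["Neighborhood_encoded"] = df["Neighborhood"].map(rank_map)
print(rank_map)
print(df)
```

In practice the mapping would normally be learned on the training split only and then applied to validation and test data, since it uses the target and can otherwise leak information.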

### Q3. Covariance

**Definition**:
- Covariance measures how two random variables vary together. Its sign indicates the direction of the linear relationship: a positive covariance means the variables tend to increase together, while a negative covariance means one tends to increase as the other decreases. Its magnitude depends on the units of the variables, so it does not by itself indicate how strong the relationship is.

**Importance**:
- Covariance is important in statistical analysis as it helps in understanding the relationship between variables, which is crucial for predictive modeling, portfolio theory in finance, and in multivariate statistics.
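
For a sample of n paired observations, the sample covariance is the sum of the products of deviations from the means, divided by n - 1. A minimal sketch of that calculation, checked against `numpy.cov`, is shown below; the arrays `x` and `y` are arbitrary illustrative values.

```python
import numpy as np

# Arbitrary illustrative samples.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Sample covariance: sum of (x_i - mean(x)) * (y_i - mean(y)), divided by n - 1.
n = len(x)
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```

Both values agree because `np.cov` also uses the n - 1 denominator by default.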


### Q4. Label Encoding with scikit-learn


```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'small', 'large', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood', 'plastic']
}

# Initialize label encoders for each feature
label_encoders = {}
encoded_data = {}

# Encode each column
for column in data:
    le = LabelEncoder()
    encoded_data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Convert to DataFrame for better readability
encoded_df = pd.DataFrame(encoded_data)
print(encoded_df)
```

```
   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     2         0
4      2     0         2
5      0     1         1
```


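`LabelEncoder` numbers the categories in alphabetical order, so the codes do not reflect any semantic ordering:

- **Color**: Red -> 2, Green -> 1, Blue -> 0
- **Size**: Small -> 2, Medium -> 1, Large -> 0
- **Material**: Wood -> 2, Plastic -> 1, Metal -> 0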
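
Rather than reading the mapping off the printed table, it can be recovered from the fitted encoder. The sketch below refits an encoder on the `Material` values only; `classes_` and `inverse_transform` are standard `LabelEncoder` members, and the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

materials = ['wood', 'metal', 'plastic', 'metal', 'wood', 'plastic']

le = LabelEncoder()
codes = le.fit_transform(materials)

# classes_ holds the sorted category names; the position is the integer code.
mapping = {str(cls): code for code, cls in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels.
decoded = [str(label) for label in le.inverse_transform(codes)]
print(decoded)
```

Keeping the fitted encoders around (as the `label_encoders` dictionary does) is what makes this decoding possible later.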

### Q5. Covariance Matrix Calculation

Assuming we have a dataset with `Age`, `Income`, and `Education Level`:

```python
import pandas as pd

# Sample data
data = {
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 100000, 75000, 120000, 45000],
    'Education Level': [12, 16, 14, 18, 12]  # assuming years of education
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Calculate the sample covariance matrix (pandas divides by n - 1)
cov_matrix = df.cov()
print(cov_matrix)
```

```
                      Age        Income  Education Level
Age                 141.8  3.815000e+05             30.7
Income           381500.0  1.032500e+09          83500.0
Education Level      30.7  8.350000e+04              6.8
```


**Interpretation**:
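- The positive covariance between Age and Income (381500.0) suggests that as age increases, income tends to increase.
- The positive covariance between Income and Education Level (83500.0) indicates that higher income is associated with more years of education.
- The positive covariance between Age and Education Level (30.7) suggests that these variables also tend to increase together. It is numerically smaller mainly because both variables are measured on much smaller scales than Income; covariance magnitudes depend on units, so they cannot be compared across pairs to judge how strong a relationship is.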
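
Because covariance depends on the units of each variable, a scale-free view is often easier to interpret. The sketch below, using the same sample data, computes the Pearson correlation matrix with `DataFrame.corr()` and checks one entry by dividing the covariance by the product of the standard deviations. It is offered as a companion view, not as part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 100000, 75000, 120000, 45000],
    'Education Level': [12, 16, 14, 18, 12],  # assuming years of education
})

cov_matrix = df.cov()
corr_matrix = df.corr()  # Pearson correlation, always between -1 and 1

# Correlation = covariance / (std of first variable * std of second variable).
age_income_corr = cov_matrix.loc['Age', 'Income'] / (
    df['Age'].std() * df['Income'].std()
)

print(corr_matrix.round(3))
print(round(age_income_corr, 3))
```

The correlations can be compared directly across pairs, which the raw covariances do not allow.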

### Q6. Encoding Methods for Categorical Variables

- **Gender (Male/Female)**:
  - **Encoding Method**: Label Encoding or Binary Encoding.
  - **Reason**: Only two categories, so label encoding is sufficient (e.g., Male=0, Female=1).

- **Education Level (High School/Bachelor's/Master's/PhD)**:
  - **Encoding Method**: Ordinal Encoding.
  - **Reason**: The categories have a natural order (High School < Bachelor's < Master's < PhD).

- **Employment Status (Unemployed/Part-Time/Full-Time)**:
  - **Encoding Method**: Ordinal Encoding or One-Hot Encoding.
  - **Reason**: If the order is meaningful (Unemployed < Part-Time < Full-Time), ordinal encoding is appropriate. If you do not want the model to assume an ordinal relationship, use one-hot encoding instead.
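
One way these three choices could look in code is sketched below. The rows are invented, the new column names are illustrative, and one-hot encoding is picked for Employment Status purely to show that option.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented sample rows for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories, so a simple 0/1 mapping is enough.
df["Gender_encoded"] = df["Gender"].map({"Male": 0, "Female": 1})

# Education Level: ordinal encoding with the order stated explicitly.
edu_order = ["High School", "Bachelor's", "Master's", "PhD"]
edu_encoder = OrdinalEncoder(categories=[edu_order])
df["Education_encoded"] = edu_encoder.fit_transform(df[["Education Level"]]).ravel()

# Employment Status: one-hot encoding, which avoids imposing an order.
employment_dummies = pd.get_dummies(df["Employment Status"], prefix="Employment")

encoded = pd.concat(
    [df[["Gender_encoded", "Education_encoded"]], employment_dummies], axis=1
)
print(encoded)
```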



### Q7. Covariance Calculation for Mixed Variables

Covariance is defined for numeric variables, so the categorical variables must be encoded as numbers first.



```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {
    'Temperature': [70, 65, 80, 75, 68],
    'Humidity': [30, 40, 35, 45, 50],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Label Encoding for categorical variables
le_weather = LabelEncoder()
le_wind = LabelEncoder()

df['Weather Condition'] = le_weather.fit_transform(df['Weather Condition'])
df['Wind Direction'] = le_wind.fit_transform(df['Wind Direction'])

# Calculate covariance matrix
cov_matrix = df.cov()
print(cov_matrix)
```

```
                   Temperature  Humidity  Weather Condition  Wind Direction
Temperature              35.30    -11.25               3.00           -2.05
Humidity                -11.25     62.50              -3.75            3.75
Weather Condition         3.00     -3.75               1.00            0.25
Wind Direction           -2.05      3.75               0.25            1.30
```


**Interpretation**:
- The diagonal entries are variances: the covariance of Temperature with itself (35.30) is simply the variance of Temperature. It is always non-negative and describes the spread of that variable, not a relationship between variables.
- Negative covariance (-11.25) between Temperature and Humidity suggests that as temperature increases, humidity tends to decrease.
- Covariances involving the label-encoded variables (Weather Condition, Wind Direction) depend on the arbitrary integer codes assigned by the encoder, so they should be interpreted with caution, if at all.
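
One possible alternative, sketched below under the assumption that the categories are unordered, is to one-hot encode them before computing the covariance matrix. Each indicator column is then 0 or 1, so its covariance with a numeric variable no longer depends on an arbitrary integer code. This swaps in `pd.get_dummies` in place of the `LabelEncoder` used above.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': [70, 65, 80, 75, 68],
    'Humidity': [30, 40, 35, 45, 50],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North'],
})

# One indicator column per category; dtype=float keeps the columns numeric.
encoded = pd.get_dummies(
    df, columns=['Weather Condition', 'Wind Direction'], dtype=float
)

cov_matrix = encoded.cov()

# Covariance of each numeric variable with each weather indicator.
weather_cols = [c for c in cov_matrix.columns if c.startswith('Weather Condition_')]
print(cov_matrix.loc[['Temperature', 'Humidity'], weather_cols])
```

A positive entry for an indicator means the numeric variable tends to be higher when that category is present. With only five rows, any such reading is illustrative rather than conclusive.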
