Q1: What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

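Ordinal Encoding assigns integers to categories so that the integer order matches the categories' inherent order. Label Encoding assigns an integer to each category without regard to any order; in scikit-learn, LabelEncoder numbers the classes in sorted order and is intended for target labels rather than input features.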

Example:

Ordinal Encoding: Used for ordered categories like "Education Level" (High School=1, Bachelor's=2, Master's=3, PhD=4).
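Label Encoding: Used for unordered categories like "Color" (Red=0, Blue=1, Green=2). The integers are identifiers only, so a model that treats them as magnitudes may infer an order that does not exist.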
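
The contrast can be sketched with scikit-learn. This is a minimal illustration, not the only way to do it; the variable names (`education`, `levels`, `colors`) are made up for the example. Note that scikit-learn's codes are zero-based and that LabelEncoder numbers categories alphabetically, so the integers differ from the illustrative mappings above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample data for illustration.
education = pd.DataFrame(
    {"Education Level": ["Bachelor's", "PhD", "High School", "Master's"]}
)
colors = ["Red", "Blue", "Green"]

# Ordinal encoding: the category order is stated explicitly,
# so High School=0 < Bachelor's=1 < Master's=2 < PhD=3.
levels = ["High School", "Bachelor's", "Master's", "PhD"]
ordinal = OrdinalEncoder(categories=[levels])
education_codes = ordinal.fit_transform(education).ravel()

# Label encoding: codes follow the sorted order of the labels
# (Blue=0, Green=1, Red=2), which carries no meaning.
color_codes = LabelEncoder().fit_transform(colors)

print(education_codes)  # [1. 3. 0. 2.]
print(color_codes)      # [2 0 1]
```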

Q2: Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Target Guided Ordinal Encoding assigns integers to categories according to their relationship with the target variable, typically by ranking the categories by the mean or median of the target within each category.

Example:
In a project predicting house prices:

Feature: "Neighborhood"
Target: "House Price"
Encode neighborhoods based on their average house prices, e.g., [Neighborhood A: 300K → 1, Neighborhood B: 500K → 2, Neighborhood C: 700K → 3].
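
The steps above can be sketched in pandas. The data below is invented so that the per-neighborhood means come out to 300, 500, and 700 (in thousands), and the names `houses`, `mean_price`, and `mapping` are assumptions for the example. In a real project the mapping would normally be learned from the training split only, so the target does not leak into validation data.

```python
import pandas as pd

# Hypothetical toy data; prices are in thousands.
houses = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "Price": [280, 520, 690, 320, 480, 710],
})

# Step 1: mean of the target for each category (A=300, B=500, C=700).
mean_price = houses.groupby("Neighborhood")["Price"].mean()

# Step 2: rank the categories by that mean; the lowest mean gets 1.
rank = mean_price.rank(method="dense")
mapping = {name: int(r) for name, r in rank.items()}

# Step 3: replace each category with its rank.
houses["Neighborhood_encoded"] = houses["Neighborhood"].map(mapping)

print(mapping)  # {'A': 1, 'B': 2, 'C': 3}
```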

Q3: Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance measures the degree to which two variables change together.

Positive covariance: Variables increase together.
Negative covariance: One variable increases as the other decreases.
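Formula (sample covariance, where μₓ and μᵧ are the sample means of X and Y and n is the number of observations):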
Cov(X, Y) = Σ((Xᵢ - μₓ) * (Yᵢ - μᵧ)) / (n - 1)

Importance:
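Covariance shows the direction of the linear relationship between two variables. Its magnitude depends on the variables' units, so it is usually normalized to a correlation when comparing strength. The covariance matrix is also the starting point for dimensionality reduction techniques like PCA.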
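
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with NumPy. The arrays `x` and `y` are made-up values for illustration.

```python
import numpy as np

# Hypothetical sample values.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])

# Direct translation of the formula: sum of products of deviations
# from the mean, divided by (n - 1).
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix and also divides by (n - 1)
# by default; the off-diagonal entry is Cov(X, Y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values should agree up to floating-point rounding, and the positive sign matches the fact that `y` rises as `x` rises.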



Q4: For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.

In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Dataset
data = {
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic']
}
df = pd.DataFrame(data)

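# Label encoding: fit a fresh encoder on each column.
# Codes follow the sorted order of each column's categories
# (e.g. blue=0, green=1, red=2).
df_encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))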

print("Original DataFrame:")
print(df)
print("\nLabel Encoded DataFrame:")
print(df_encoded)


Original DataFrame:
   Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic

Label Encoded DataFrame:
   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1


Q5: Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education Level.


In [2]:
import numpy as np
import pandas as pd

# Dataset
data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [40000, 50000, 60000, 70000, 80000],
    'Education Level': [1, 2, 2, 3, 3]
}
df = pd.DataFrame(data)

# Covariance matrix
cov_matrix = df.cov()
print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
                       Age       Income  Education Level
Age                  62.50     125000.0             6.25
Income           125000.00  250000000.0         12500.00
Education Level       6.25      12500.0             0.70


Q6: You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?


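Gender: One-Hot Encoding (nominal category with no order; with only two values, a single 0/1 column is enough).
Education Level: Ordinal Encoding (the levels have an inherent order: High School < Bachelor's < Master's < PhD).
Employment Status: One-Hot Encoding (treated as nominal here; Ordinal Encoding is also defensible if the categories are read as increasing hours worked).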
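
One way these choices might look in code is sketched below, using `pandas.get_dummies` for the one-hot columns and scikit-learn's `OrdinalEncoder` for Education Level. The three sample rows and the names `people`, `levels`, and `encoded` are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["Bachelor's", "PhD", "High School"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# Ordinal encoding for Education Level, with the order given explicitly.
levels = ["High School", "Bachelor's", "Master's", "PhD"]
edu_encoder = OrdinalEncoder(categories=[levels])
people["Education Level"] = edu_encoder.fit_transform(
    people[["Education Level"]]
).ravel()

# One-hot encoding for the two nominal columns.
encoded = pd.get_dummies(
    people, columns=["Gender", "Employment Status"], dtype=int
)

print(encoded)
```

For a linear model, passing `drop_first=True` to `get_dummies` would reduce Gender to a single 0/1 column and avoid a redundant indicator.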

Q7: You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.


In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Dataset
data = {
    'Temperature': [25, 30, 35, 40],
    'Humidity': [60, 65, 70, 75],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny'],
    'Wind Direction': ['North', 'South', 'East', 'West']
}
df = pd.DataFrame(data)

# Encoding categorical variables (the integer codes are arbitrary:
# LabelEncoder assigns them in alphabetical order)
df['Weather Condition'] = LabelEncoder().fit_transform(df['Weather Condition'])
df['Wind Direction'] = LabelEncoder().fit_transform(df['Wind Direction'])

# Covariance matrix
cov_matrix = df.cov()
print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
                   Temperature   Humidity  Weather Condition  Wind Direction
Temperature          41.666667  41.666667           0.833333        3.333333
Humidity             41.666667  41.666667           0.833333        3.333333
Weather Condition     0.833333   0.833333           0.916667        0.166667
Wind Direction        3.333333   3.333333           0.166667        1.666667
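
Interpretation: Temperature and Humidity have a positive covariance of 41.67, so they rise together. It equals each variable's own variance because, in this sample, Humidity is exactly Temperature + 35. The entries involving Weather Condition and Wind Direction are computed from integer codes assigned alphabetically, so their sign and size reflect the coding rather than any real relationship. The sketch below illustrates this by recoding Wind Direction with a different, equally arbitrary mapping (`reversed_codes` is a hypothetical choice for the example).

```python
import pandas as pd

temperature = pd.Series([25, 30, 35, 40])
wind = pd.Series(["North", "South", "East", "West"])

# Alphabetical codes, as LabelEncoder assigns them.
alphabetical = {"East": 0, "North": 1, "South": 2, "West": 3}
# A different but equally valid assignment (hypothetical).
reversed_codes = {"East": 3, "North": 2, "South": 1, "West": 0}

cov_alpha = temperature.cov(wind.map(alphabetical))
cov_reversed = temperature.cov(wind.map(reversed_codes))

# Same data, different codes: the covariance changes sign.
print(cov_alpha, cov_reversed)
```

With the alphabetical codes the covariance is about 3.33, matching the matrix above; with the reversed codes it is about -3.33. Covariance is therefore only meaningful here for the Temperature and Humidity pair.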
