Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

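Label Encoding assigns a unique integer to each category in a categorical variable without considering any order; the integers are arbitrary identifiers (scikit-learn's LabelEncoder, for example, numbers the categories in sorted order). This method is suitable for nominal (unordered) categorical variables.

Ordinal Encoding also assigns unique integers to categories, but the integers follow a meaningful order that you specify, so a larger code means a higher rank. This method is suitable for ordinal (ordered) categorical variables.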

Example:

Label Encoding:
Variable: Color (red, green, blue)
Encoded: red -> 0, green -> 1, blue -> 2 (any assignment works; the numbers carry no meaning)

Ordinal Encoding:
Variable: Size (small, medium, large)
Encoded: small -> 0, medium -> 1, large -> 2

When to Choose:

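Label Encoding: When the categorical variable is nominal, e.g., color of a car (red, green, blue), and the model will not read the integers as magnitudes (tree-based models, for example). Otherwise one-hot encoding is the safer choice for nominal variables.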

Ordinal Encoding: When the categorical variable is ordinal, e.g., rating of a product (poor, average, excellent).
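
The difference can be sketched with scikit-learn. This is a minimal illustration on made-up data, not the only way to do it: LabelEncoder numbers the Color categories in sorted order, while OrdinalEncoder receives the intended Size order through its `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up sample data for illustration
df = pd.DataFrame({
    "Color": ["red", "green", "blue", "green"],
    "Size": ["small", "large", "medium", "small"],
})

# LabelEncoder numbers categories in sorted (alphabetical) order
color_codes = LabelEncoder().fit_transform(df["Color"])

# OrdinalEncoder takes the intended order explicitly via `categories`
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = size_encoder.fit_transform(df[["Size"]]).ravel().astype(int)

print(color_codes.tolist())  # [2, 1, 0, 1]
print(size_codes.tolist())   # [0, 2, 1, 0]
```

The Color codes are identifiers only, whereas the Size codes preserve small < medium < large.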

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

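Target Guided Ordinal Encoding assigns ordinal labels to categories based on their relationship with the target variable: compute the mean (or median) of the target for each category, sort the categories by that value, and assign integers in that order. It is useful for nominal variables with many categories, such as a neighborhood column in a house-price model, where one-hot encoding would add many columns. The mapping should be learned from the training data only, so that target values from the test set do not leak into the features.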

Example:



In [2]:
import pandas as pd

# Sample data
data = {
    'Neighborhood': ['A', 'B', 'C', 'A', 'C', 'B', 'A', 'B', 'C'],
    'Price': [250000, 300000, 400000, 260000, 390000, 310000, 255000, 305000, 405000]
}

df = pd.DataFrame(data)

# Display the data
print("Original DataFrame:")
print(df)

# Calculate the mean price for each neighborhood
mean_prices = df.groupby('Neighborhood')['Price'].mean()

# Order neighborhoods by mean price
mean_prices = mean_prices.sort_values()

# Create a mapping of neighborhood to ordinal value based on mean price
ordinal_mapping = {neighborhood: idx for idx, neighborhood in enumerate(mean_prices.index, 1)}

# Apply the mapping to the DataFrame
df['Neighborhood_Ordinal'] = df['Neighborhood'].map(ordinal_mapping)

# Display the transformed DataFrame
print("\nTransformed DataFrame with Target Guided Ordinal Encoding:")
print(df)


Original DataFrame:
  Neighborhood   Price
0            A  250000
1            B  300000
2            C  400000
3            A  260000
4            C  390000
5            B  310000
6            A  255000
7            B  305000
8            C  405000

Transformed DataFrame with Target Guided Ordinal Encoding:
  Neighborhood   Price  Neighborhood_Ordinal
0            A  250000                     1
1            B  300000                     2
2            C  400000                     3
3            A  260000                     1
4            C  390000                     3
5            B  310000                     2
6            A  255000                     1
7            B  305000                     2
8            C  405000                     3
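
In practice the mapping is usually learned on training data and then applied to new rows. The sketch below assumes a hypothetical `new` DataFrame containing a neighborhood ("D") that never appeared in training; with `Series.map`, such unseen categories come out as NaN and need a separate decision (a default rank, for example).

```python
import pandas as pd

# Training data: used to learn the category order
train = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "C", "B"],
    "Price": [250000, 300000, 400000, 260000, 390000, 310000],
})

# Hypothetical new rows; "D" was never seen during training
new = pd.DataFrame({"Neighborhood": ["C", "A", "D"]})

# Rank neighborhoods by mean price, starting at 1
order = train.groupby("Neighborhood")["Price"].mean().sort_values().index
mapping = {name: rank for rank, name in enumerate(order, start=1)}

# Unseen categories have no entry in the mapping and become NaN
new["Neighborhood_Ordinal"] = new["Neighborhood"].map(mapping)
print(new)
```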


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance measures the degree to which two variables change together. It indicates whether an increase in one variable tends to be associated with an increase or decrease in another variable.

- Positive Covariance: Indicates that the variables tend to increase together.
- Negative Covariance: Indicates that one variable tends to increase as the other decreases.

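Importance: Covariance is important because it helps understand the relationship between variables and is a key component in the calculation of correlation and in constructing the covariance matrix for multivariate analysis.

Calculation: For n paired observations (x_i, y_i) with sample means mean_x and mean_y, the sample covariance is Cov(X, Y) = sum of (x_i - mean_x) * (y_i - mean_y) over all i, divided by (n - 1). Dividing by n instead gives the population covariance. The magnitude of the result depends on the units of both variables, so only its sign is directly interpretable.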
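
As a sanity check on the definition, the sample covariance can be computed by hand as the sum of the products of the deviations from each mean, divided by n - 1, and compared with NumPy's `np.cov`. The data and the helper name `sample_covariance` are made up for illustration.

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])

def sample_covariance(a, b):
    """Sum of products of deviations from the means, divided by n - 1."""
    return float(np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1))

manual = sample_covariance(x, y)

# np.cov returns the 2x2 covariance matrix and divides by n - 1 by default
library = float(np.cov(x, y)[0, 1])

print(manual, library)
```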

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
}
df = pd.DataFrame(data)

# Initialize label encoders
le_color = LabelEncoder()
le_size = LabelEncoder()
le_material = LabelEncoder()

# Fit and transform the data
df['Color_encoded'] = le_color.fit_transform(df['Color'])
df['Size_encoded'] = le_size.fit_transform(df['Size'])
df['Material_encoded'] = le_material.fit_transform(df['Material'])

print(df)


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green   small     wood              1             2                 2
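4    red   large    metal              2             0                 0

Explanation: LabelEncoder sorts each column's categories alphabetically and numbers them from 0. Color becomes blue -> 0, green -> 1, red -> 2; Size becomes large -> 0, medium -> 1, small -> 2; Material becomes metal -> 0, plastic -> 1, wood -> 2. The codes are identifiers only: for Size, the alphabetical numbering does not match the natural order small < medium < large.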
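
To see which integer each category received, the fitted encoder can be inspected. A small sketch using the Color values from this question: `classes_` lists the categories in the order of their codes, and `inverse_transform` converts codes back to labels.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ is sorted; the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(codes).tolist()
print(restored == colors)  # True
```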


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [4]:
import pandas as pd

# Sample data
data = {
    'Age': [25, 32, 47, 51, 62],
    'Income': [50000, 60000, 120000, 100000, 110000],
    'Education': [12, 16, 16, 18, 14]
}
df = pd.DataFrame(data)

# Covariance matrix
cov_matrix = df.cov()
print(cov_matrix)


                Age       Income  Education
Age           221.3     408500.0       12.9
Income     408500.0  970000000.0    33000.0
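Education      12.9      33000.0        5.2

Interpretation: The diagonal holds the variances (Age 221.3, Income 970,000,000, Education 5.2). All off-diagonal values are positive, so in this sample Age, Income, and Education tend to rise together. The magnitudes are not comparable across pairs because they depend on units: Age-Income (408,500) looks far larger than Age-Education (12.9) mainly because Income is measured in much larger numbers. Dividing by the standard deviations gives correlations of roughly 0.88 (Age-Income), 0.38 (Age-Education), and 0.46 (Income-Education), which show that Age and Income have the strongest linear relationship.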
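
Because covariance depends on the units of each variable, a correlation matrix puts the three relationships on a common scale from -1 to 1. A short sketch using pandas' `DataFrame.corr` on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "Income": [50000, 60000, 120000, 100000, 110000],
    "Education": [12, 16, 16, 18, 14],
})

# Pearson correlation: covariance divided by the product of the standard deviations
corr_matrix = df.corr()
print(corr_matrix.round(2))

# The same value derived by hand from the covariance matrix
cov_matrix = df.cov()
age_income = cov_matrix.loc["Age", "Income"] / (df["Age"].std() * df["Income"].std())
```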


Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?


Gender: Label Encoding or One-Hot Encoding

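Reason: Gender has only two categories, so label encoding yields a single 0/1 column, which is compact and implies no meaningful order. One-hot encoding carries the same information in two columns and is the safer choice if more categories may appear later.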

Education Level: Ordinal Encoding

Reason: Education levels have a natural order, making ordinal encoding suitable.

Employment Status: One-Hot Encoding

Reason: Employment status categories have no natural order, so one-hot encoding is appropriate to avoid implying any ordinal relationship.
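
The three choices can be sketched together as follows. The rows are made up, and the column names `Gender_encoded` and `Education_encoded` are illustrative; the Education order is passed explicitly to OrdinalEncoder, and `pd.get_dummies` produces the one-hot columns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample rows
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Binary variable: a single 0/1 column is enough
df["Gender_encoded"] = (df["Gender"] == "Male").astype(int)

# Ordered variable: state the order explicitly
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
encoder = OrdinalEncoder(categories=[education_order])
df["Education_encoded"] = (
    encoder.fit_transform(df[["Education Level"]]).ravel().astype(int)
)

# Unordered variable: one indicator column per category
employment = pd.get_dummies(df["Employment Status"], prefix="Employment", dtype=int)

encoded = pd.concat([df[["Gender_encoded", "Education_encoded"]], employment], axis=1)
print(encoded)
```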

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [5]:
import pandas as pd

# Sample data
data = {
    'Temperature': [20, 22, 23, 21, 19],
    'Humidity': [30, 35, 33, 31, 29],
    'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind': ['North', 'South', 'East', 'West', 'North']
}
df = pd.DataFrame(data)

# Covariance matrix for continuous variables
cov_matrix = df[['Temperature', 'Humidity']].cov()
print(cov_matrix)


             Temperature  Humidity
Temperature         2.50      3.25
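Humidity            3.25      5.80

Interpretation: The covariance between Temperature and Humidity is 3.25, which is positive, so in this sample warmer readings tend to come with higher humidity. The variances are 2.50 for Temperature and 5.80 for Humidity, which gives a correlation of about 0.85. Weather Condition and Wind Direction are nominal, so covariance is not defined for them directly. They must first be converted to numbers, for example with one-hot encoding; a covariance computed on integer label codes would depend on the arbitrary numbering of the categories.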
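
One way to bring a categorical variable into the covariance matrix is to one-hot encode it first. The sketch below does this for Weather only (Wind could be handled the same way); each indicator's covariance with Temperature then shows whether that condition coincides with above- or below-average temperatures in this small sample.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [20, 22, 23, 21, 19],
    "Humidity": [30, 35, 33, 31, 29],
    "Weather": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
})

# One 0/1 indicator column per weather condition
indicators = pd.get_dummies(df["Weather"], prefix="Weather", dtype=float)

# Covariance matrix over the continuous variables plus the indicators
numeric = pd.concat([df[["Temperature", "Humidity"]], indicators], axis=1)
cov_matrix = numeric.cov()
print(cov_matrix.round(2))
```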
