Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Ordinal Encoding and Label Encoding both convert categorical variables into integers, but they differ in whether those integers are meant to carry an order.

Ordinal Encoding: Assigns a unique integer to each category so that the integers reflect the order or rank of the categories. For example, "low," "medium," and "high" might be encoded as 0, 1, and 2 respectively. Ordinal encoding is suitable when the categories have a clear ordering, and the order must be specified explicitly rather than left to the encoder's default.

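Label Encoding: Assigns a unique integer to each category without considering any order or ranking; the integers serve only as identifiers. For example, "red," "green," and "blue" might be encoded as 0, 1, and 2 respectively. Label encoding is typically used for target labels, or for nominal features fed to tree-based models. For linear or distance-based models the integers can be misread as an order, so one-hot encoding is often preferred for nominal features. In short, choose ordinal encoding for a feature such as "low/medium/high" and label encoding for class labels such as "red/green/blue."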
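
The difference can be seen with scikit-learn. The sketch below is illustrative only: the `levels` array is made-up data, and it assumes `OrdinalEncoder` is given the category order explicitly, while `LabelEncoder` falls back to its default sorted (alphabetical) order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up example values for an ordered feature
levels = np.array(["low", "high", "medium", "low"])

# OrdinalEncoder takes an explicit category order and expects 2-D input
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform(levels.reshape(-1, 1)).ravel()

# LabelEncoder sorts the classes, so strings are numbered alphabetically
label = LabelEncoder()
label_codes = label.fit_transform(levels)

print(ordinal_codes)  # low=0, medium=1, high=2
print(label_codes)    # high=0, low=1, medium=2
```

With the explicit order, "high" receives the largest code; with the alphabetical default it receives the smallest, which is why label encoding should not be relied on to preserve rank.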

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Target Guided Ordinal Encoding ranks the categories of a feature by the mean of the target variable within each category, then replaces each category with its rank. The encoded values therefore reflect each category's relationship with the target variable.

Example: In a binary classification problem where you are predicting whether a customer will buy a product or not, you might use Target Guided Ordinal Encoding to encode the categories of a feature such as "Education Level." This encoding could help the model capture the relationship between education level and the likelihood of buying the product.
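
A minimal sketch of this idea with pandas is shown below. The data, the column names `education` and `bought`, and the choice of rank 0 for the lowest mean are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: education level and whether the customer bought (1) or not (0)
df = pd.DataFrame({
    "education": ["High School", "PhD", "Bachelor's", "Master's", "High School",
                  "Bachelor's", "PhD", "High School", "Master's", "Bachelor's"],
    "bought":    [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Mean of the target within each category, sorted from lowest to highest
target_means = df.groupby("education")["bought"].mean().sort_values()

# Rank 0 goes to the category with the lowest mean target
rank_map = {category: rank for rank, category in enumerate(target_means.index)}
df["education_encoded"] = df["education"].map(rank_map)

print(rank_map)
```

In a real project the mapping should be learned from the training split only and then applied to validation and test data, since it is computed from the target.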

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between variables. If covariance is positive, it means that as one variable increases, the other also tends to increase. If covariance is negative, it means that as one variable increases, the other tends to decrease. Covariance is important in statistical analysis as it helps understand how changes in one variable are associated with changes in another.

Covariance is calculated using the formula:
$$\operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
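
As a quick check, the formula can be evaluated by hand and compared with NumPy. The arrays `x` and `y` are made-up values. Note that `np.cov` divides by n - 1 by default (the sample covariance), so `bias=True` is passed to match the division by n used here.

```python
import numpy as np

# Made-up example values
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Direct implementation of the formula: average product of deviations
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / len(x)

# np.cov returns a 2x2 matrix; bias=True divides by n instead of n - 1
cov_numpy = np.cov(x, y, bias=True)[0, 1]

# Default behaviour: sample covariance, dividing by n - 1
cov_sample = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy, cov_sample)
```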

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.



In [4]:
from sklearn.preprocessing import LabelEncoder

data = {'Color': ['red', 'green', 'blue', 'green'],
        'Size': ['small', 'medium', 'large', 'medium'],
        'Material': ['wood', 'metal', 'plastic', 'metal']}

# Fit a separate encoder per column so each mapping can be inverted later
encoders = {}
for col in data:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])
print(data)


{'Color': array([2, 1, 0, 1], dtype=int64), 'Size': array([2, 1, 0, 1], dtype=int64), 'Material': array([2, 0, 1, 0], dtype=int64)}

LabelEncoder sorts the categories of each column alphabetically and numbers them from 0. Color is therefore encoded as blue=0, green=1, red=2; Size as large=0, medium=1, small=2; and Material as metal=0, plastic=1, wood=2. The codes for Size do not follow the natural small < medium < large order, so an ordinal encoder with an explicit category order would suit that column better.


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [5]:
import numpy as np

# Example data for Age, Income, and Education level
age = np.array([30, 40, 50, 35, 45])
income = np.array([50000, 60000, 70000, 55000, 65000])
education_level = np.array([12, 16, 18, 14, 20])

# Create a matrix where each row represents a variable (Age, Income, Education level)
data_matrix = np.array([age, income, education_level])

# Calculate the covariance matrix
covariance_matrix = np.cov(data_matrix)

print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
[[6.25e+01 6.25e+04 2.25e+01]
 [6.25e+04 6.25e+07 2.25e+04]
 [2.25e+01 2.25e+04 1.00e+01]]

Interpretation: The diagonal entries are the variances of Age (62.5), Income (6.25e7), and Education level (10). All off-diagonal entries are positive, so in this example data the three variables tend to increase together. The magnitudes are not directly comparable, because covariance depends on the units of each variable; the large values involving Income come from its scale, not from a stronger relationship. np.cov divides by n - 1 by default, so these are sample covariances.
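
Because covariance is scale-dependent, it can help to look at the correlation matrix as well. The following sketch reuses the same example arrays and uses `np.corrcoef`, which is an addition to the original answer rather than part of it.

```python
import numpy as np

# Same example data as in the cell above
age = np.array([30, 40, 50, 35, 45])
income = np.array([50000, 60000, 70000, 55000, 65000])
education_level = np.array([12, 16, 18, 14, 20])

# Each row is a variable, matching the layout used for np.cov
correlation_matrix = np.corrcoef([age, income, education_level])

print(np.round(correlation_matrix, 2))
```

Correlations lie between -1 and 1, so the strength of each pairwise relationship can be compared directly regardless of units.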


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

For the given categorical variables:

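Gender: There is no inherent order between Male and Female, so label encoding is appropriate. With only two categories, the resulting 0/1 column is equivalent to one-hot encoding with one column dropped.
Education Level: There is an inherent order (High School < Bachelor's < Master's < PhD), so ordinal encoding fits; target guided ordinal encoding is an alternative when the target variable is available.
Employment Status: If treated as having no inherent order, like Gender, label encoding is acceptable for tree-based models, but with three categories one-hot encoding is safer for linear or distance-based models because it avoids implying an order. If the categories are instead viewed as ranked by hours worked (Unemployed < Part-Time < Full-Time), ordinal encoding is also defensible.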
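
One possible way to apply per-column encodings in scikit-learn is sketched below. The rows are hypothetical, and the sketch assumes one-hot encoding for the two columns without an order and an explicit `education_order` for the ordered one; other combinations are equally possible.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows; the column names mirror the question
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Employment Status": ["Unemployed", "Full-Time", "Part-Time", "Full-Time"],
})

# Assumed ranking of the education categories, lowest to highest
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

encoder = ColumnTransformer(
    transformers=[
        ("education", OrdinalEncoder(categories=[education_order]), ["Education Level"]),
        ("nominal", OneHotEncoder(), ["Gender", "Employment Status"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = encoder.fit_transform(df)
print(encoded.shape)  # one ordinal column plus one column per nominal category
```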

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.


In [6]:
import pandas as pd

# Sample data
data = {
    "Temperature": [25, 28, 30, 22, 27],
    "Humidity": [60, 65, 70, 55, 63],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North"]
}

# Create DataFrame
df = pd.DataFrame(data)

# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=["Weather Condition", "Wind Direction"])

# Calculate covariance between continuous variables
covariance_continuous = df_encoded[["Temperature", "Humidity"]].cov()

print("Covariance Matrix for Continuous Variables (Temperature and Humidity):")
print(covariance_continuous)


Covariance Matrix for Continuous Variables (Temperature and Humidity):
             Temperature  Humidity
Temperature         9.30     16.95
Humidity           16.95     31.30

Interpretation: Temperature and Humidity have variances of 9.30 and 31.30, and their covariance of 16.95 is positive, so in this sample higher temperatures coincide with higher humidity. The code above one-hot encodes "Weather Condition" and "Wind Direction" but reports only the continuous pair. Covariance is defined for numeric variables, so covariances involving the categorical variables are meaningful only through their indicator (dummy) columns.
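
To cover every pair, the covariance can be taken over the one-hot encoded frame as a whole. This is a sketch using the same sample data; passing `dtype=float` to `pd.get_dummies` is a choice made here so the indicator columns are numeric.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    "Temperature": [25, 28, 30, 22, 27],
    "Humidity": [60, 65, 70, 55, 63],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# One indicator column per category, stored as floats
df_encoded = pd.get_dummies(
    df, columns=["Weather Condition", "Wind Direction"], dtype=float
)

# Covariance between every pair of numeric and indicator columns
full_cov = df_encoded.cov()

# Rows for the continuous variables against all columns
print(full_cov.loc[["Temperature", "Humidity"]].round(2))
```

A positive covariance between a continuous variable and an indicator column means the variable tends to be above its mean when that category occurs. With only five rows, these values are illustrative rather than reliable.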
