Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

In [1]:
# Ordinal Encoding:
# Assigns integer values to categories that have a meaningful order, and the integers follow that order.
# Example: ["Low", "Medium", "High"] → [0, 1, 2]

# Label Encoding:
# Assigns an arbitrary integer to each category without assuming any order.
# scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order and is intended for target labels.
# Example: ["Cat", "Dog", "Fish"] → [0, 1, 2]

# When to use:
# Use Ordinal Encoding when the categorical variable has a clear ranking (like size or satisfaction).
# Use Label Encoding for the target variable, or for unordered features only when the model is not misled by arbitrary integer codes (tree-based models are less sensitive to them than linear models).
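
The difference described above can be sketched with scikit-learn's OrdinalEncoder and LabelEncoder. The sample lists and variable names here are illustrative assumptions, not part of the assignment data. The third result shows what happens when LabelEncoder is applied to an ordered variable: the codes follow alphabetical order, not the ranking.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordered feature: pass the ranking explicitly so the codes respect it.
levels = [["Low"], ["Medium"], ["High"], ["Medium"]]
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform(levels).ravel().astype(int).tolist()

# Unordered labels: LabelEncoder numbers the categories in sorted order.
pets = ["Cat", "Dog", "Fish", "Dog"]
label_codes = LabelEncoder().fit_transform(pets).tolist()

# LabelEncoder on the ordered feature: alphabetical codes (High=0, Low=1, Medium=2).
flat_levels = [row[0] for row in levels]
alphabetical_codes = LabelEncoder().fit_transform(flat_levels).tolist()

print(ordinal_codes)       # codes follow Low < Medium < High
print(label_codes)         # codes follow alphabetical order of the pet names
print(alphabetical_codes)  # ranking is lost
```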

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

In [2]:
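# Definition:
# Target guided ordinal encoding ranks the categories by the mean of the target variable
# and assigns each category its rank as an integer code.
# Example use: a high-cardinality feature such as city in a churn model, where one-hot encoding would add too many columns.
# Compute the category means on the training data only, otherwise the encoding leaks the target.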
import pandas as pd

# Sample data
df = pd.DataFrame({
    'City': ['A', 'B', 'C', 'A', 'C', 'B'],
    'Churn': [1, 0, 1, 1, 0, 0]
})

# Target guided ordinal encoding
mean_churn = df.groupby('City')['Churn'].mean().sort_values()
encoding = {k: i for i, k in enumerate(mean_churn.index)}
df['City_encoded'] = df['City'].map(encoding)

print(df)


  City  Churn  City_encoded
0    A      1             2
1    B      0             0
2    C      1             1
3    A      1             2
4    C      0             1
5    B      0             0
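
In a real project the mapping would usually be learned on the training split and then applied to new data. The sketch below assumes a hypothetical helper `fit_target_ordinal`, a made-up test frame, and a fallback code of -1 for categories that never appeared in training; none of these come from the assignment.

```python
import pandas as pd


def fit_target_ordinal(frame, column, target):
    """Rank the categories of `column` by the mean of `target` (lowest mean gets 0)."""
    means = frame.groupby(column)[target].mean().sort_values()
    return {category: rank for rank, category in enumerate(means.index)}


train = pd.DataFrame({
    "City": ["A", "B", "C", "A", "C", "B"],
    "Churn": [1, 0, 1, 1, 0, 0],
})
test = pd.DataFrame({"City": ["B", "A", "D"]})  # "D" is unseen in training

mapping = fit_target_ordinal(train, "City", "Churn")

# Unseen categories map to NaN; replace them with -1 (an assumed convention).
test["City_encoded"] = test["City"].map(mapping).fillna(-1).astype(int)
print(mapping)
print(test)
```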


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

In [4]:
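# Covariance measures the direction of the linear relationship between two variables.

# Positive covariance: the variables tend to increase together.

# Negative covariance: one tends to increase as the other decreases.

# Why it matters: it is the basis of correlation and of methods such as PCA. Its magnitude depends on the units of the variables, so it shows direction, not strength.

# Formula (sample covariance): Cov(x, y) = (1 / (n - 1)) * Σ[(xi - x̄)(yi - ȳ)]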
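
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with NumPy's `np.cov`, which also divides by n - 1 by default. The arrays `x` and `y` are made-up illustration values.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
n = len(x)

# Sum of products of deviations from the means, divided by n - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(x, y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```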

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [5]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green'],
    'Size': ['small', 'medium', 'large', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood']
})

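# Applying Label Encoding
# LabelEncoder sorts each column's categories alphabetically and numbers them from 0.
# Iterate over a copy of the column names because the loop adds new columns.
for col in list(df.columns):
    df[col + '_encoded'] = LabelEncoder().fit_transform(df[col])

print(df)
# Explanation of the output:
# Color: blue=0, green=1, red=2; Size: large=0, medium=1, small=2; Material: metal=0, plastic=1, wood=2.
# The codes are arbitrary identifiers; for Size they do not follow the natural small < medium < large order.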


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green  medium     wood              1             1                 2
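
To read the codes in the output, it can help to inspect the fitted encoder. This sketch uses the `classes_` attribute and `inverse_transform` method of LabelEncoder on the Color values; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ lists the categories in code order: index 0 is code 0, and so on.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the integer codes.
decoded = encoder.inverse_transform(codes).tolist()

print(mapping)
print(decoded)
```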


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [7]:
import numpy as np
import pandas as pd

# Sample data
data = {
    'Age': [25, 32, 47, 51, 62],
    'Income': [50000, 60000, 80000, 82000, 90000],
    'Education': [12, 16, 16, 18, 20]
}
df = pd.DataFrame(data)

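# Covariance matrix
cov_matrix = df.cov()
print(cov_matrix)
# Interpretation: the diagonal holds each variable's variance; off-diagonal entries show how two variables vary together.
# All covariances here are positive, so Age, Income, and Education tend to increase together.
# The magnitudes depend on the units (Income is in the tens of thousands), so a larger value does not mean a stronger relationship; use correlation to compare strength.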

                Age       Income  Education
Age           221.3     245300.0       40.8
Income     245300.0  278800000.0    44800.0
Education      40.8      44800.0        8.8
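
Because covariance magnitudes depend on units, one way to compare the pairs is to rescale the covariance matrix into a correlation matrix by dividing each entry by the product of the two standard deviations. The sketch below does this for the same sample data and checks it against pandas' `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "Income": [50000, 60000, 80000, 82000, 90000],
    "Education": [12, 16, 16, 18, 20],
})

cov_matrix = df.cov()

# Standard deviations are the square roots of the diagonal (the variances).
std = np.sqrt(np.diag(cov_matrix))

# corr(i, j) = cov(i, j) / (std_i * std_j)
corr_from_cov = cov_matrix / np.outer(std, std)

print(corr_from_cov)
```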


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In [8]:
# Variable -> encoding method -> reason
# Gender -> label encoding (a single 0/1 column) or one-hot encoding -> only two values, so neither choice implies a false ordering

# Education Level -> ordinal encoding -> the categories have a natural order (High School < Bachelor's < Master's < PhD), which ordinal encoding preserves

# Employment Status -> one-hot encoding -> more than two values and no clear order between the categories, so one-hot encoding avoids implying one
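
The choices above could be applied roughly as follows. The four sample rows are made up for illustration, and using `pd.get_dummies` for the one-hot step is one option among several (scikit-learn's OneHotEncoder would also work).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Employment Status": ["Unemployed", "Full-Time", "Part-Time", "Full-Time"],
})

# Ordinal encoding with an explicit ranking for Education Level.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
encoder = OrdinalEncoder(categories=[education_order])
df["Education_encoded"] = (
    encoder.fit_transform(df[["Education Level"]]).ravel().astype(int)
)

# One-hot encoding for the unordered variables.
encoded = pd.get_dummies(
    df.drop(columns=["Education Level"]),
    columns=["Gender", "Employment Status"],
    dtype=int,
)

print(encoded)
```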

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [9]:
# Covariance requires numerical values, so categorical variables need to be encoded first
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
df = pd.DataFrame({
    'Temperature': [20, 25, 30, 35, 40],
    'Humidity': [30, 45, 50, 60, 70],
    'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy'],
    'Wind': ['North', 'South', 'East', 'West', 'North']
})

# Label encoding for categories
df['Weather_encoded'] = LabelEncoder().fit_transform(df['Weather'])
df['Wind_encoded'] = LabelEncoder().fit_transform(df['Wind'])

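# Covariance matrix
cov = df[['Temperature', 'Humidity', 'Weather_encoded', 'Wind_encoded']].cov()
print(cov)
# Interpretation: Temperature and Humidity have a positive covariance (118.75), so they rise together in this sample.
# Covariances involving Weather_encoded and Wind_encoded are not meaningful: both variables are nominal,
# and their integer codes follow alphabetical order, so a different labeling would change the values.
# The 1.110223e-16 entry is floating-point noise for a covariance of zero.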


                  Temperature  Humidity  Weather_encoded  Wind_encoded
Temperature      6.250000e+01    118.75     1.110223e-16          1.25
Humidity         1.187500e+02    230.00    -1.500000e+00          3.25
Weather_encoded  1.110223e-16     -1.50     7.000000e-01          0.15
Wind_encoded     1.250000e+00      3.25     1.500000e-01          1.30
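
To illustrate why covariances with label-encoded nominal variables should not be interpreted, this sketch recomputes the Humidity/Weather covariance under two different integer labelings. The `reordered` mapping is an arbitrary alternative chosen for illustration; the data is the same sample as above.

```python
import pandas as pd

df = pd.DataFrame({
    "Humidity": [30, 45, 50, 60, 70],
    "Weather": ["Sunny", "Cloudy", "Rainy", "Sunny", "Rainy"],
})

# The alphabetical labeling that LabelEncoder produces.
alphabetical = {"Cloudy": 0, "Rainy": 1, "Sunny": 2}
# An equally valid labeling of the same categories (assumed for illustration).
reordered = {"Sunny": 0, "Cloudy": 1, "Rainy": 2}

cov_alphabetical = df["Humidity"].cov(df["Weather"].map(alphabetical))
cov_reordered = df["Humidity"].cov(df["Weather"].map(reordered))

# Same data, different arbitrary codes, different covariance (even the sign changes).
print(cov_alphabetical, cov_reordered)
```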
