Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Ordinal Encoding:

Ordinal Encoding is used specifically for categorical variables where the categories have an inherent order or ranking.
It assigns numerical values to categories based on their order or level.
For example, education levels such as "High School," "Bachelor's," and "Master's" might be encoded as 1, 2, and 3, respectively.

Label Encoding:

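Label Encoding is a more general method for converting categorical variables into numerical format.
It assigns a unique integer to each category without considering any inherent order.
For example, colors like "Red," "Green," and "Blue" might be encoded as 1, 2, and 3. These integers are arbitrary identifiers, but a model may still read them as a ranking.

Choosing between them:

Use Ordinal Encoding when the category order is real and should be passed to the model, as with education level. Use Label Encoding when the categories have no order and the integers serve only as identifiers, for example when encoding a target column of class labels such as "Red," "Green," and "Blue." For a nominal input feature, one-hot encoding is usually safer than label encoding because it does not imply an order.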
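The contrast between the two encodings can be sketched with scikit-learn. This is a minimal illustration, not a prescribed workflow: the sample values are made up, and scikit-learn numbers categories from 0 rather than from 1 as in the examples above.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal feature: the category order is passed explicitly (illustrative data).
education = [["Bachelor's"], ["High School"], ["Master's"], ["High School"]]
ordinal = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's"]])
education_codes = ordinal.fit_transform(education).ravel()

# Nominal labels: LabelEncoder sorts the categories alphabetically,
# so the resulting integers carry no ranking.
colors = ["Red", "Green", "Blue", "Green"]
color_codes = LabelEncoder().fit_transform(colors)

print(education_codes)
print(color_codes)
```

With the explicit order, "High School" maps to the lowest code and "Master's" to the highest, while the color codes simply follow alphabetical order.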

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.


Target Guided Ordinal Encoding is a technique that assigns ordinal labels to categories based on their relationship with the target variable in a machine learning project. Instead of relying on the inherent order of the categories, this method uses the information from the target variable to create an ordinal encoding that aligns with the target's behavior.

Consider a dataset with a "City" feature and a binary target variable indicating whether a customer made a purchase (1) or not (0). Applying Target Guided Ordinal Encoding to the "City" feature involves calculating the mean purchase rate for each city, sorting the cities by that rate, and assigning ordinal labels in that order, so cities with higher purchase rates get higher labels. This is most useful when a nominal feature has many categories and one-hot encoding would create too many columns.
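The City example can be sketched with pandas as follows. The city names and purchase values are invented for illustration, and the column names `Purchased` and `City_encoded` are assumptions. In a real project the mapping would be computed on the training split only, so that target information does not leak into validation data.

```python
import pandas as pd

# Hypothetical data: one row per customer.
df = pd.DataFrame({
    "City": ["Austin", "Austin", "Boston", "Boston", "Boston", "Chicago", "Chicago"],
    "Purchased": [1, 1, 0, 1, 0, 0, 0],
})

# Step 1: mean purchase rate per city.
mean_by_city = df.groupby("City")["Purchased"].mean()

# Step 2: order the cities from lowest to highest purchase rate.
ordered_cities = mean_by_city.sort_values().index

# Step 3: assign ordinal labels 1, 2, 3, ... in that order.
city_to_rank = {city: rank for rank, city in enumerate(ordered_cities, start=1)}
df["City_encoded"] = df["City"].map(city_to_rank)

print(city_to_rank)
print(df)
```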

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?



Covariance is a statistical measure that describes the extent to which two variables change together. It captures the direction of the relationship between two variables. A positive covariance means that as one variable increases, the other tends to increase as well. A negative covariance means that as one variable increases, the other tends to decrease. A covariance near zero indicates no linear relationship.

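In statistical analysis, covariance is essential for understanding whether variables are associated and in which direction. However, it is not standardized: its value depends on the units of the variables, which makes it difficult to compare the strength of relationships across different pairs of variables. This limitation led to the development of correlation, which is a normalized version of covariance.

Calculation: for n paired observations, the sample covariance is cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1). Subtract each variable's mean from its values, multiply the paired deviations, add the products, and divide by n - 1 (or by n for a population covariance).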
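As a minimal sketch of the calculation, the sample covariance can be computed by hand and compared against NumPy. The arrays `x` and `y` and the helper `sample_covariance` are illustrative names, not part of the assignment data.

```python
import numpy as np

# Illustrative paired observations.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])


def sample_covariance(a, b):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(a)
    return np.sum((a - a.mean()) * (b - b.mean())) / (n - 1)


cov_manual = sample_covariance(x, y)
# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree because `np.cov` also divides by n - 1 by default.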

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [2]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
data = {
    'Color': ['red', 'green', 'blue', 'red', 'green'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
}

df = pd.DataFrame(data)

for column in df.columns:
    # Fit a fresh encoder per column; LabelEncoder sorts the categories
    # alphabetically and numbers them from 0.
    df[column+'_encoded'] = LabelEncoder().fit_transform(df[column])

print(df)


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3    red  medium     wood              2             1                 2
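4  green   small    metal              1             2                 0

Explanation: LabelEncoder sorts the categories of each column alphabetically and numbers them from 0. Color becomes blue = 0, green = 1, red = 2; Size becomes large = 0, medium = 1, small = 2; and Material becomes metal = 0, plastic = 1, wood = 2. The codes carry no meaning beyond identity. For Size in particular, the alphabetical codes do not follow the natural order small < medium < large, so an ordinal encoding with an explicit category order would suit that column better.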
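To see which integer was assigned to which category, the fitted encoder can be inspected. The sketch below uses only the Color values from the dataset above; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "red", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted categories; the position of each one is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

Keeping one fitted encoder per column makes this lookup possible later, for instance when decoding model predictions.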


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [5]:
import numpy as np
data = np.array([
    [25, 50000, 12],
    [30, 60000, 16],
    [28, 55000, 14],
    [35, 75000, 18],
    [40, 80000, 20]
])
covariance_matrix = np.cov(data, rowvar=False)
print(covariance_matrix)

[[3.530e+01 7.575e+04 1.850e+01]
 [7.575e+04 1.675e+08 4.000e+04]
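 [1.850e+01 4.000e+04 1.000e+01]]

Interpretation: Rows and columns follow the order Age, Income, Education level. The diagonal holds the sample variances: 35.3 for Age, 1.675e+08 for Income, and 10 for Education level. Every off-diagonal entry is positive (Age-Income 75,750; Age-Education 18.5; Income-Education 40,000), so in this sample all three variables tend to rise together. The magnitudes are not comparable across pairs because covariance scales with the units of each variable; the Income entries are large only because income is measured in much larger numbers than age or years of education.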
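Because the covariance values depend on units, a unit-free comparison of the same data can be obtained from the correlation matrix. This is a supplementary sketch using `np.corrcoef` on the data from the cell above; it is not part of the original question.

```python
import numpy as np

# Same data as above: columns are Age, Income, Education level.
data = np.array([
    [25, 50000, 12],
    [30, 60000, 16],
    [28, 55000, 14],
    [35, 75000, 18],
    [40, 80000, 20],
])

# Correlation rescales each covariance by the two standard deviations,
# so every entry lies between -1 and 1.
correlation_matrix = np.corrcoef(data, rowvar=False)
print(np.round(correlation_matrix, 3))
```

All three pairwise correlations come out strongly positive, which confirms that the differences in covariance magnitude are driven by scale rather than by strength of association.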


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

Gender (Nominal):

Encoding Method: One-hot encoding.
Explanation: Since gender has no inherent order, one-hot encoding is appropriate. It creates binary columns (e.g., "Male" and "Female") to represent the categories without introducing spurious ordinal relationships.

Education Level (Ordinal):

Encoding Method: Ordinal encoding.
Explanation: Education level has a meaningful order (e.g., High School < Bachelor's < Master's < PhD). Ordinal encoding preserves this order, assigning numerical labels based on the ranking of categories.

Employment Status (Nominal):

Encoding Method: One-hot encoding.
Explanation: Employment status is likely nominal since there's no inherent order. One-hot encoding is suitable to represent different employment statuses with binary columns, avoiding the introduction of false ordinal relationships.

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [11]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = {
    'Temperature': [25, 20, 22, 28, 30],
    'Humidity': [60, 70, 75, 50, 65],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

df = pd.DataFrame(data)
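# One-hot encode the categorical columns (sparse_output needs scikit-learn >= 1.2).
categorical_columns = ['Weather Condition', 'Wind Direction']
encoder = OneHotEncoder(sparse_output=False)
encoded_categorical = encoder.fit_transform(df[categorical_columns])
encoded_columns = encoder.get_feature_names_out(categorical_columns)
encoded_df = pd.concat([df[['Temperature', 'Humidity']], pd.DataFrame(encoded_categorical, columns=encoded_columns)], axis=1)
# Columns are variables and rows are observations, hence rowvar=False.
covariance_matrix = np.cov(encoded_df, rowvar=False)
print(covariance_matrix)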


[[ 1.70e+01 -2.50e+01 -1.25e+00  5.00e-01  7.50e-01 -7.50e-01  1.25e+00
  -1.25e+00  7.50e-01]
 [-2.50e+01  9.25e+01  1.50e+00  3.00e+00 -4.50e+00  2.75e+00 -7.50e-01
   1.50e+00 -3.50e+00]
 [-1.25e+00  1.50e+00  2.00e-01 -1.00e-01 -1.00e-01 -5.00e-02 -1.00e-01
   2.00e-01 -5.00e-02]
 [ 5.00e-01  3.00e+00 -1.00e-01  3.00e-01 -2.00e-01  1.50e-01  5.00e-02
  -1.00e-01 -1.00e-01]
 [ 7.50e-01 -4.50e+00 -1.00e-01 -2.00e-01  3.00e-01 -1.00e-01  5.00e-02
  -1.00e-01  1.50e-01]
 [-7.50e-01  2.75e+00 -5.00e-02  1.50e-01 -1.00e-01  2.00e-01 -1.00e-01
  -5.00e-02 -5.00e-02]
 [ 1.25e+00 -7.50e-01 -1.00e-01  5.00e-02  5.00e-02 -1.00e-01  3.00e-01
  -1.00e-01 -1.00e-01]
 [-1.25e+00  1.50e+00  2.00e-01 -1.00e-01 -1.00e-01 -5.00e-02 -1.00e-01
   2.00e-01 -5.00e-02]
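 [ 7.50e-01 -3.50e+00 -5.00e-02 -1.00e-01  1.50e-01 -5.00e-02 -1.00e-01
  -5.00e-02  2.00e-01]]

Interpretation: Rows and columns follow the order Temperature, Humidity, Weather Condition (Cloudy, Rainy, Sunny), Wind Direction (East, North, South, West). The covariance between Temperature and Humidity is -25, so in this sample humidity tends to be lower when temperature is higher. For a one-hot column, the sign shows whether the numeric variable sits above or below its mean when that category is present: for example, the Humidity-Sunny covariance is -4.5, meaning the sunny days here are less humid than average, while the Temperature-North covariance is 1.25, meaning the days with a north wind are warmer than average. Covariances between two one-hot columns mostly reflect how often the categories co-occur and have little direct meaning. With only five observations, none of these values should be read as more than a description of this sample.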
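The unlabeled NumPy output is hard to read, so one alternative is to compute the same matrix with pandas, which keeps row and column names. This sketch swaps in `pd.get_dummies` and `DataFrame.cov` for the `OneHotEncoder` and `np.cov` calls above; it is an assumption about presentation, not a change to the method.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25, 20, 22, 28, 30],
    "Humidity": [60, 70, 75, 50, 65],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Rainy"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# One-hot encode the categorical columns as floats, keeping readable names.
encoded = pd.get_dummies(
    df, columns=["Weather Condition", "Wind Direction"], dtype=float
)

# DataFrame.cov uses the same n - 1 denominator as np.cov by default.
labeled_cov = encoded.cov()

# Individual pairs can now be looked up by name.
temp_humidity = labeled_cov.loc["Temperature", "Humidity"]
humidity_sunny = labeled_cov.loc["Humidity", "Weather Condition_Sunny"]

print(labeled_cov.round(2))
```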
