#### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Ans: **Ordinal encoding** is used when the categorical data has an inherent order or hierarchy. Each category is assigned an integer based on its rank. For example, in a survey question about education level, "High School" is assigned a lower value than "Bachelor's Degree" because it is lower in the educational hierarchy.

**Label encoding**, on the other hand, assigns a unique integer to each category without considering any order or hierarchy in the data. For example, in a dataset of colors, "Blue" might be assigned a value of 1, "Green" a value of 2, and "Red" a value of 3. Choose ordinal encoding when the ranking carries information the model should see, as with education level; choose label encoding when the categories are purely nominal, as with colors, keeping in mind that the resulting integers do not express a real ordering.
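
The contrast can be sketched with scikit-learn. This is a minimal illustration rather than a definitive recipe: the "Master's Degree" category and the sample values are assumptions added for the example, and scikit-learn numbers categories from 0 rather than from 1.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal encoding: the category order is stated explicitly, lowest to highest.
# "Master's Degree" is a hypothetical extra level added for illustration.
education = [["Bachelor's Degree"], ["High School"], ["Master's Degree"]]
ordinal = OrdinalEncoder(
    categories=[["High School", "Bachelor's Degree", "Master's Degree"]]
)
education_codes = ordinal.fit_transform(education).ravel().astype(int).tolist()
print(education_codes)  # [1, 0, 2]

# Label encoding: codes follow sorted (alphabetical) order and carry no ranking.
colors = ["Red", "Blue", "Green", "Blue"]
color_codes = LabelEncoder().fit_transform(colors).tolist()
print(color_codes)  # [2, 0, 1, 0]
```

With the ordinal encoder the numeric order matches the educational hierarchy, while the color codes only reflect alphabetical sorting.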

#### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Ans: **Target Guided Ordinal Encoding** is a technique used to encode categorical variables by ordering the categories according to their impact on the target variable.

The categories are ranked in order of their mean target value, and then each category is assigned a numerical value based on its rank.

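Eg: Suppose we are working on a project to predict customer churn in a subscription-based business. We could use target guided ordinal encoding to encode the type of subscription plan a customer is using, since customers on some plans may churn at a higher rate than customers on others. The plan with the lowest mean churn rate would receive the lowest rank, and the plan with the highest mean churn rate the highest.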
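
The ranking step can be sketched with pandas as follows. The plan names and churn values are made up for illustration, and the variable names are hypothetical; in a real project the ranks would be computed on the training data only, to avoid leaking target information into validation or test data.

```python
import pandas as pd

# Hypothetical churn data: plan names and outcomes are made up for illustration.
df = pd.DataFrame({
    "plan": ["basic", "premium", "basic", "standard",
             "premium", "standard", "basic", "standard"],
    "churn": [1, 0, 1, 0, 0, 1, 1, 0],
})

# Mean target value (churn rate) per category, sorted from lowest to highest.
mean_churn = df.groupby("plan")["churn"].mean().sort_values()

# Assign each category its rank in that ordering, starting at 1.
plan_rank = {plan: rank for rank, plan in enumerate(mean_churn.index, start=1)}
df["plan_encoded"] = df["plan"].map(plan_rank)

print(mean_churn)
print(plan_rank)  # {'premium': 1, 'standard': 2, 'basic': 3}
```

Here "premium" has the lowest churn rate and gets rank 1, while "basic" has the highest and gets rank 3.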

#### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

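Ans: **Covariance** is a measure of how two variables in a dataset vary together. It measures the joint variability of two random variables: a positive value means the variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value near zero means there is no linear relationship. Its magnitude depends on the units of the variables, so it indicates the direction of a relationship rather than its strength.

Covariance is an important concept in statistical analysis because it helps to identify relationships and dependencies between variables, and because it is the quantity that correlation is built from: dividing the covariance by the product of the two standard deviations gives the correlation coefficient.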

Covariance is calculated using the formula:

Cov(X,Y) = E[(X - E[X]) * (Y - E[Y])]

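where X and Y are two random variables, and E[X] and E[Y] are their respective expected values. For a sample of n paired observations, the covariance is estimated by summing the products of the deviations from the sample means and dividing by n - 1.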
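
As a minimal sketch of that calculation in plain Python, the function below computes the sample covariance with the n - 1 denominator, which is also the default used by pandas. The function name and the data values are hypothetical and chosen only for illustration.

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means, divided by n - 1.
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


# Hypothetical values chosen for illustration.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
cov_hours_score = sample_covariance(hours, score)
print(cov_hours_score)  # 10.25
```

The result is positive, so the two made-up variables tend to increase together.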

#### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [10]:
from sklearn.preprocessing import LabelEncoder

#Sample dataset
data = [['red', 'medium', 'wood'],
        ['blue', 'small', 'metal'],
        ['green', 'large', 'plastic'],
        ['green', 'medium', 'wood'],
        ['red', 'large', 'metal'],
        ['blue', 'medium', 'plastic']]


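le = LabelEncoder()

# Encode one column at a time; the encoder is refit for each column,
# so codes are assigned independently per column in sorted order.
for i in range(len(data[0])):
    integer_encoded = le.fit_transform([row[i] for row in data])
    for j in range(len(integer_encoded)):
        # int() keeps the printed list free of NumPy scalar wrappers
        data[j][i] = int(integer_encoded[j])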

print(data)

[[2, 1, 2], [0, 2, 0], [1, 0, 1], [1, 1, 2], [2, 0, 0], [0, 1, 1]]


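The output shows the encoded rows, with the columns in their original order: Color, Size, and Material. LabelEncoder sorts the categories of each column alphabetically and numbers them from 0, so Color maps to blue = 0, green = 1, red = 2; Size maps to large = 0, medium = 1, small = 2; and Material maps to metal = 0, plastic = 1, wood = 2. The first row, ['red', 'medium', 'wood'], therefore becomes [2, 1, 2]. Note that the codes for Size do not follow the natural order small < medium < large, so ordinal encoding would be a better fit for that column if the order matters.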
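
One way to inspect these mappings directly is to keep a separate fitted encoder per column. The sketch below uses a shortened copy of the sample data; the `columns`, `encoders`, and `mappings` names are hypothetical and introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

columns = ["Color", "Size", "Material"]
rows = [["red", "medium", "wood"],
        ["blue", "small", "metal"],
        ["green", "large", "plastic"]]

# Keep one fitted encoder per column so each mapping stays available.
encoders = {}
for i, name in enumerate(columns):
    encoders[name] = LabelEncoder().fit([row[i] for row in rows])

# classes_ lists the categories in code order: index 0 is code 0, and so on.
mappings = {
    name: {str(label): code for code, label in enumerate(enc.classes_)}
    for name, enc in encoders.items()
}
print(mappings)

# inverse_transform converts codes back to the original labels.
size_labels = encoders["Size"].inverse_transform([0, 1, 2]).tolist()
print(size_labels)  # ['large', 'medium', 'small']
```

Keeping the encoders around also makes it possible to apply the same mapping to new data with `transform`.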

#### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [12]:
import pandas as pd

# Sample data
data = {
    'age': [25, 35, 45, 55, 65],
    'income': [40000, 60000, 80000, 100000, 120000],
    'education': [12, 14, 16, 18, 20]
}
df = pd.DataFrame(data)

cov_matrix = df.cov()
print(cov_matrix)

                age        income  education
age           250.0  5.000000e+05       50.0
income     500000.0  1.000000e+09   100000.0
education      50.0  1.000000e+05       10.0

The diagonal entries are the variances of each variable: 250 for age, 1.0e+09 for income, and 10 for education. Every off-diagonal entry is positive, so age, income, and education level tend to increase together in this sample. The magnitudes are not directly comparable because covariance depends on the units of measurement; the income entries are large only because income is measured in large numbers. Dividing each covariance by the product of the two standard deviations gives a correlation of 1 for every pair (for example, 500000 / (sqrt(250) * sqrt(1.0e+09)) = 1), so in this sample the three variables are perfectly linearly related.


#### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

Ans:
1. For the categorical variable "Gender" with two categories (Male/Female), we can use binary encoding, which for two categories is equivalent to label encoding: we replace one category with 0 and the other with 1. A single 0/1 column captures all the information without implying any order.

2. For the categorical variable "Education Level" with multiple categories (High School/Bachelor's/Master's/PhD), we can use ordinal encoding, where we assign each category a rank based on their level of education. For example, we can assign High School a rank of 1, Bachelor's a rank of 2, Master's a rank of 3, and PhD a rank of 4. This method preserves the order of the categories and allows us to maintain the information about the level of education.

3. For the categorical variable "Employment Status" with multiple categories (Unemployed/Part-Time/Full-Time), we can use one-hot encoding to convert the categories into binary variables. This method allows us to capture the information about each category without assuming any order or hierarchy among them.
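
The three choices above can be sketched with pandas. The sample rows are made up for illustration, and the choice of which gender maps to 0 or 1 is arbitrary.

```python
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["Bachelor's", "PhD", "High School"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# Binary encoding for the two-category variable (assignment is arbitrary).
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# Ordinal encoding with an explicit rank order.
education_rank = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
df["Education Level"] = df["Education Level"].map(education_rank)

# One-hot encoding: one indicator column per employment status.
df = pd.get_dummies(df, columns=["Employment Status"], dtype=int)

print(df)
```

After this step every column is numeric, and only "Education Level" carries an ordering.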

#### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [13]:
import pandas as pd

#Sample Data
data = {
    "Temperature": [25, 22, 20, 18, 23, 21, 24, 19, 20, 22],
    "Humidity": [60, 55, 50, 45, 65, 70, 75, 55, 50, 60],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Rainy", "Sunny", "Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North", "East", "West", "North", "South", "East"]
}

df = pd.DataFrame(data)

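# Covariance is defined only for numeric columns; numeric_only=True skips
# the string columns, which otherwise raise an error in pandas 2.0+.
cov_matrix = df.cov(numeric_only=True)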

print(cov_matrix)

             Temperature   Humidity
Temperature     4.933333  14.555556
Humidity       14.555556  89.166667

The covariance between Temperature and Humidity is about 14.56. It is positive, so in this sample higher temperatures tend to coincide with higher humidity. The diagonal entries are the variances: about 4.93 for Temperature and 89.17 for Humidity. Dividing the covariance by the product of the two standard deviations gives a correlation of roughly 0.69, a fairly strong positive linear relationship. "Weather Condition" and "Wind Direction" do not appear in the matrix because covariance is only defined for numeric variables. They would have to be encoded first, and for nominal categories such as these a covariance computed on arbitrary integer codes would not be meaningful.


