### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

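A1. **Label Encoding** assigns each unique category a distinct integer, without regard to any order among the categories (scikit-learn's `LabelEncoder`, for example, numbers the labels in sorted order).
It is suitable for categorical features whose categories have no meaningful order, with the caveat that models which treat the integers as magnitudes may still read an order into them.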
- **Example**: For a feature "Color" with values ["Red", "Green", "Blue"], Label Encoding might map "Red" to 0, "Green" to 1, and "Blue" to 2. This is appropriate when colors have no inherent order.

**Ordinal Encoding** converts categorical values into numerical values while preserving the order or ranking of the categories.
It is ideal for categorical features with a meaningful order.
- **Example**: For a feature "Size" with values ["Small", "Medium", "Large"], Ordinal Encoding might map "Small" to 1, "Medium" to 2, and "Large" to 3. This reflects the inherent order in the sizes.

**Choosing One Over the Other**:
- **Label Encoding** should be used when categories do not have a natural order.
- **Ordinal Encoding** should be used when categories have a natural, meaningful order.
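
The difference can be sketched in a few lines, assuming scikit-learn is installed. The sample values are illustrative. `LabelEncoder` numbers the labels in sorted order, while `OrdinalEncoder` accepts an explicit `categories` list so that the integers follow the intended ranking.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative sample of an ordered feature
sizes = np.array(["Small", "Large", "Medium", "Small"])

# LabelEncoder sorts labels alphabetically: Large=0, Medium=1, Small=2
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder with an explicit order: Small=0, Medium=1, Large=2
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

print(label_codes)    # [2 0 1 2]
print(ordinal_codes)  # [0. 2. 1. 0.]
```

Note that `OrdinalEncoder` starts counting at 0 and returns floats, so "Small" becomes 0.0 here rather than 1; only the relative order matters.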

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

A2. **Target Guided Ordinal Encoding** assigns ordinal values to categorical features based on the relationship between the feature and the target variable.

To apply Target Guided Ordinal Encoding:
1. **Calculate the Mean of the Target Variable**: Compute the mean target value for each category.
2. **Rank Categories**: Sort categories based on their target mean values.
3. **Assign Ordinal Values**: Assign integers based on the rank.

**For Example**:
In a project predicting house prices with a categorical feature "Condition" (Excellent, Good, Fair, Poor), calculate the mean house price for each condition. If "Excellent" has the highest mean price and "Poor" has the lowest, assign higher ordinal values to "Excellent" and lower values to "Poor".
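
The three steps can be sketched with pandas as follows. This is a minimal illustration, not a definitive implementation: the prices are made-up numbers, and the column name `Condition_Encoded` is an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical sample; the prices are invented for illustration
df = pd.DataFrame({
    "Condition": ["Excellent", "Good", "Fair", "Poor",
                  "Good", "Excellent", "Poor", "Fair"],
    "Price": [500, 350, 250, 150, 370, 520, 130, 230],
})

# Step 1: mean of the target for each category
mean_price = df.groupby("Condition")["Price"].mean()

# Step 2: rank categories by their mean target, lowest first
ranked = mean_price.sort_values().index

# Step 3: assign integers by rank (1 = lowest mean price)
mapping = {category: rank for rank, category in enumerate(ranked, start=1)}
df["Condition_Encoded"] = df["Condition"].map(mapping)

print(mapping)  # {'Poor': 1, 'Fair': 2, 'Good': 3, 'Excellent': 4}
```

In a real project the mapping should be computed on the training split only and then applied to validation and test data, so that target information does not leak into the evaluation.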

### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

A3. **Covariance** measures how two variables change together. A positive covariance indicates that as one variable increases, the other tends to increase as well, while a negative covariance indicates an inverse relationship.

**Importance of covariance**:
- **Relationship Analysis**: Helps in understanding the direction of the relationship between two variables.
- **Correlation Calculation**: Used to compute correlation coefficients, which standardize covariance.

**Calculation**:
For two variables $X$ and $Y$:
$$
\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}) (Y_i - \bar{Y})
$$
where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, and $n$ is the number of observations.
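
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with `np.cov`, which also divides by n - 1 by default. The two small arrays are arbitrary illustrative values.

```python
import numpy as np

# Arbitrary illustrative data
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of products of deviations from the means, divided by n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```

Here the deviation products are 6, 0, -1 and 9, which sum to 14, so both values equal 14/3.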

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.


In [4]:
#A4.
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
})

# Initialize LabelEncoder
le_color = LabelEncoder()
le_size = LabelEncoder()
le_material = LabelEncoder()

# Apply Label Encoding
data['Color_Encoded'] = le_color.fit_transform(data['Color'])
data['Size_Encoded'] = le_size.fit_transform(data['Size'])
data['Material_Encoded'] = le_material.fit_transform(data['Material'])

print(data)

   Color    Size Material  Color_Encoded  Size_Encoded  Material_Encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green   small     wood              1             2                 2
4    red  medium    metal              2             1                 0


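**Explanation**: `LabelEncoder` assigns each unique category an integer according to the sorted (alphabetical) order of the labels, and each `*_Encoded` column holds the result. Here `Color` maps to blue=0, green=1, red=2; `Size` maps to large=0, medium=1, small=2; and `Material` maps to metal=0, plastic=1, wood=2. The integers are arbitrary identifiers: for `Size` they do not follow the small < medium < large ranking, so an ordinal encoder with an explicit category order would be needed to preserve it.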
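
To see which integer corresponds to which category, the fitted encoder can be inspected. The sketch below uses the `classes_` attribute and `inverse_transform` from scikit-learn's `LabelEncoder`; the variable names are chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["red", "green", "blue", "green", "red"])

# classes_ holds the labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform turns codes back into the original labels
decoded = le.inverse_transform([2, 0])
print(decoded)  # ['red' 'blue']
```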

### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.


In [3]:
#A5.
import numpy as np

# Sample data
data = np.array([
    [25, 50000, 2],
    [30, 55000, 3],
    [35, 60000, 4],
    [40, 65000, 3],
    [45, 70000, 4]
])

# Calculate covariance matrix
cov_matrix = np.cov(data, rowvar=False)
print(cov_matrix)

[[6.25e+01 6.25e+04 5.00e+00]
 [6.25e+04 6.25e+07 5.00e+03]
 [5.00e+00 5.00e+03 7.00e-01]]


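**Interpretation**: The rows and columns follow the order Age, Income, Education level. The diagonal values are the variances of each variable (62.5 for Age, 6.25e+07 for Income, 0.7 for Education level). The off-diagonal values are the covariances between pairs of variables: 62,500 for Age and Income, 5 for Age and Education level, and 5,000 for Income and Education level. All three are positive, so in this sample each pair of variables tends to increase together; a negative value would indicate an inverse relationship. The magnitudes depend on the units of each variable, so they cannot be compared directly to judge which relationship is strongest.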
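
Because covariance depends on units, it can help to standardize it into a correlation matrix. The sketch below reuses the same sample data with `np.corrcoef`; it is an optional follow-up step, not part of the original question.

```python
import numpy as np

# Same sample data as above: Age, Income, Education level
data = np.array([
    [25, 50000, 2],
    [30, 55000, 3],
    [35, 60000, 4],
    [40, 65000, 3],
    [45, 70000, 4]
])

# Correlation = covariance divided by the product of standard deviations
corr_matrix = np.corrcoef(data, rowvar=False)
print(np.round(corr_matrix, 3))
```

In this sample Age and Income are perfectly linearly related (correlation 1.0), while Education level correlates at roughly 0.756 with each of them.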

### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

A6. The choice depends on whether each variable is binary, ordered, or unordered:
- **Gender**: **Label Encoding**. For a binary category such as Male/Female, a single 0/1 column is simple and effective. One-Hot Encoding also works but is less necessary.
- **Education Level**: **Ordinal Encoding**. There is a natural order (High School < Bachelor's < Master's < PhD), and Ordinal Encoding captures this hierarchy.
- **Employment Status**: **One-Hot Encoding**. With more than two categories and no inherent order assumed among them, One-Hot Encoding avoids introducing a false ordinal relationship.

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

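A7. The code below computes the covariance between the two continuous variables, then one-hot encodes the two categorical variables and computes the covariance matrix of the resulting indicator columns: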


In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Continuous variables
temperature = np.array([30, 32, 35, 28, 33])
humidity = np.array([70, 65, 60, 75, 68])

# Sample covariance between Temperature and Humidity (off-diagonal entry)
cov_temp_humidity = np.cov(temperature, humidity)[0, 1]
print(cov_temp_humidity)

# Categorical variables
data = pd.DataFrame({
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
})

# One-hot encode; columns come out as Cloudy, Rainy, Sunny, East, North, South, West
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data)

# Covariance matrix of the seven indicator columns
cov_matrix_cat = np.cov(encoded_data, rowvar=False)
print(cov_matrix_cat)

-14.2
[[ 0.3  -0.1  -0.2  -0.1   0.05  0.15 -0.1 ]
 [-0.1   0.2  -0.1   0.2  -0.1  -0.05 -0.05]
 [-0.2  -0.1   0.3  -0.1   0.05 -0.1   0.15]
 [-0.1   0.2  -0.1   0.2  -0.1  -0.05 -0.05]
 [ 0.05 -0.1   0.05 -0.1   0.3  -0.1  -0.1 ]
 [ 0.15 -0.05 -0.1  -0.05 -0.1   0.2  -0.05]
 [-0.1  -0.05  0.15 -0.05 -0.1  -0.05  0.2 ]]




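**Interpretation**: The covariance between Temperature and Humidity is -14.2. The negative sign means that in this sample higher temperatures tend to come with lower humidity. The 7x7 matrix covers the one-hot indicator columns in the order Cloudy, Rainy, Sunny, East, North, South, West. Its diagonal holds each indicator's variance, and indicators from the same variable always have negative covariance because the categories are mutually exclusive. A positive entry across variables, such as 0.2 between Rainy and East, means the two categories occur together more often than independence would suggest. With only five observations, these values are illustrative rather than conclusive.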
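
To cover every pair of variables in a single matrix, including continuous against categorical, one option is to one-hot encode inside a DataFrame and call `DataFrame.cov`. This is a sketch of one possible approach using `pd.get_dummies`, with the same sample values as above.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 32, 35, 28, 33],
    "Humidity": [70, 65, 60, 75, 68],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# Replace each categorical column with float indicator columns
encoded = pd.get_dummies(
    df, columns=["Weather Condition", "Wind Direction"], dtype=float
)

# Sample covariance (n - 1 denominator) between every pair of columns
cov_all = encoded.cov()
print(cov_all.round(2))
```

The result is a 9x9 labeled matrix, so an entry such as `cov_all.loc["Temperature", "Weather Condition_Rainy"]` can be read directly by name.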