Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Both methods map categories to integers. The difference is whether the integers carry a meaningful order.

Label Encoding: In label encoding, each unique category is assigned a unique integer label with no regard for any order among the categories. scikit-learn's LabelEncoder, for instance, sorts the categories alphabetically, so "Red," "Green," and "Blue" become 2, 1, and 0. It is intended mainly for encoding a target variable. When it is applied to a nominal feature, models that treat the numbers as magnitudes (linear or distance-based models, for example) may read an order into the codes that does not exist, and when it is applied to ordinal data the alphabetical codes may scramble the real order.

Ordinal Encoding: Ordinal encoding assigns integer labels based on the ordinal relationship between categories. For instance, for education levels like "High School," "Bachelor's," "Master's," and "PhD," you could encode them as 0, 1, 2, and 3 to reflect the increasing educational levels.

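Example: Consider a dataset of movie ratings: "Bad," "Average," "Good," and "Excellent." These ratings have an inherent order, so ordinal encoding (Bad = 0, Average = 1, Good = 2, Excellent = 3) is the natural choice when they are used as a feature. Label encoding would also produce the integers 0 to 3, but in alphabetical order (Average = 0, Bad = 1, Excellent = 2, Good = 3), which loses the ranking. Label encoding is the better fit when the order is irrelevant, for example when the ratings are the class labels a classifier has to predict.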
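The contrast on the movie-rating example can be sketched with scikit-learn. This is a minimal illustration, assuming the four ratings above arrive in an arbitrary order; the variable names are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ratings in arbitrary order (illustrative values only).
ratings = ["Good", "Bad", "Excellent", "Average"]

# LabelEncoder sorts the classes alphabetically, ignoring the real ranking.
label_codes = LabelEncoder().fit_transform(ratings)

# OrdinalEncoder accepts an explicit category order (worst to best here).
order = [["Bad", "Average", "Good", "Excellent"]]
ordinal_encoder = OrdinalEncoder(categories=order)
ordinal_codes = ordinal_encoder.fit_transform(
    np.array(ratings).reshape(-1, 1)
).ravel()

print("Label codes:  ", label_codes)    # alphabetical codes
print("Ordinal codes:", ordinal_codes)  # codes that follow the ranking
```

With the alphabetical codes, "Excellent" (2) ends up below "Good" (3); with the explicit order, the codes increase with the rating.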

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Target Guided Ordinal Encoding assigns each category of a categorical variable an ordinal rank based on the mean (or median) of the target variable within that category. The categories are sorted by that statistic and numbered in order, so the encoded feature has a monotonic relationship with the target. It is useful when a feature has many categories with no obvious order of their own but a clear relationship with the target.

Example: In a credit risk analysis, you might use Target Guided Ordinal Encoding for the "Risk Level" category. You would compute the default rate for each risk level and rank the levels by it, so that levels associated with higher default rates receive higher ordinal ranks (0 to 3, for example).
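The ranking step can be sketched with pandas. This is a minimal sketch on made-up data; the column names `risk_level` and `default` and all values are assumptions for illustration, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical toy data: risk level per applicant and whether they defaulted (1) or not (0).
df = pd.DataFrame({
    "risk_level": ["low", "medium", "high", "very_high", "low",
                   "medium", "high", "very_high", "low", "high"],
    "default":    [0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Mean of the target within each category, sorted from lowest to highest.
category_means = df.groupby("risk_level")["default"].mean().sort_values()

# Rank the categories by that mean: lowest mean -> 0, highest mean -> k - 1.
rank_map = {category: rank for rank, category in enumerate(category_means.index)}
df["risk_level_encoded"] = df["risk_level"].map(rank_map)

print(category_means)
print(rank_map)
```

Because the mapping is learned from the target, it would normally be fitted on the training split only and then applied to validation and test data, to avoid leaking target information.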

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance measures the degree to which two variables change together. A positive covariance indicates that both variables increase or decrease together, while a negative covariance indicates that one variable increases as the other decreases. Covariance is important because it provides insights into the relationship between variables, helping us understand their dependencies.

Covariance is calculated using the formula:

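$$\operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})$$

where the sum runs over i = 1 to n, x̄ and ȳ are the sample means of X and Y, and n is the number of data points.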
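As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with NumPy's `np.cov`, which also divides by n − 1 by default. The values of `x` and `y` are made up for illustration.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of products of deviations from the means, divided by n - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 matrix; the off-diagonal entry is Cov(x, y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```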
                     


Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

from sklearn.preprocessing import LabelEncoder

#  data
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']
materials = ['wood', 'metal', 'plastic']

# Create LabelEncoder instance 
color_encoder = LabelEncoder()
size_encoder = LabelEncoder()
material_encoder = LabelEncoder()

# Fit and transform 
encoded_colors = color_encoder.fit_transform(colors)
encoded_sizes = size_encoder.fit_transform(sizes)
encoded_materials = material_encoder.fit_transform(materials)

# encoded values
print("Encoded Colors:", encoded_colors)
print("Encoded Sizes:", encoded_sizes)
print("Encoded Materials:", encoded_materials)

# Reverse transform to see the original labels
original_colors = color_encoder.inverse_transform(encoded_colors)
original_sizes = size_encoder.inverse_transform(encoded_sizes)
original_materials = material_encoder.inverse_transform(encoded_materials)

#  original labels
print("\nOriginal Colors:", original_colors, ", Encoded respectively as:", encoded_colors)
print("Original Sizes:", original_sizes, ", Encoded respectively as:", encoded_sizes)
print("Original Materials:", original_materials, ", Encoded respectively as:", encoded_materials)


Encoded Colors: [2 1 0]
Encoded Sizes: [2 1 0]
Encoded Materials: [2 0 1]

Original Colors: ['red' 'green' 'blue'] , Encoded respectively as: [2 1 0]
Original Sizes: ['small' 'medium' 'large'] , Encoded respectively as: [2 1 0]
Original Materials: ['wood' 'metal' 'plastic'] , Encoded respectively as: [2 0 1]


1. **Encoded Colors, Sizes, and Materials:**
   - These arrays hold the integer code assigned to each category, in the same order as the input lists.
   - For the "Colors" variable: "red" is encoded as 2, "green" as 1, and "blue" as 0.
   - For "Sizes": "small" is encoded as 2, "medium" as 1, and "large" as 0.
   - For "Materials": "wood" is encoded as 2, "metal" as 0, and "plastic" as 1.

2. **Original Colors, Sizes, and Materials:**
   - These arrays are the result of `inverse_transform`, which maps the codes back to the original categorical labels.
   - They confirm that the encoding is reversible.

**Explanation:**
- Each categorical variable (colors, sizes, materials) has three distinct categories.
- LabelEncoder sorts the categories alphabetically and numbers them from 0 in that sorted order. The codes do not depend on the order in which the categories appear in the input.
- For "Colors," the sorted order is blue, green, red, so "blue" is 0, "green" is 1, and "red" is 2.
- For "Sizes," the sorted order is large, medium, small, so "large" is 0 and "small" is 2. The codes run opposite to the natural size order, which is why an ordinal encoder with an explicit order is usually preferred for a variable like this.
- For "Materials," the sorted order is metal, plastic, wood, giving 0, 1, and 2.
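The mapping a fitted encoder uses can be read from its `classes_` attribute, where the position of each category is its code. A minimal sketch for the "Sizes" variable (the `mapping` dictionary is just a helper for this illustration):

```python
from sklearn.preprocessing import LabelEncoder

size_encoder = LabelEncoder()
size_encoder.fit(["small", "medium", "large"])

# classes_ holds the sorted categories; the index of each one is its code.
mapping = {str(label): code for code, label in enumerate(size_encoder.classes_)}
print(mapping)
```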


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

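The covariance matrix for Age, Income, and Education level is a 3×3 symmetric matrix. Each diagonal entry is the variance of one variable, and each off-diagonal entry is the covariance of one pair of variables. The question gives no data values, so the interpretation below is general rather than numerical, and it assumes Education level is expressed as a number (for example, years of schooling or an ordinal code).

 **Calculate Covariance:** Calculate the covariance between each pair of variables (Age-Income, Age-Education, Income-Education) using the formula:

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})$$

   - X and Y are any two of the variables (Age, Income, Education).
   - X_i and Y_i are the individual data points of X and Y respectively.
   - x̄ and ȳ are the means of X and Y respectively.
   - n is the number of data points.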

Interpretation of the results:
- Positive Covariance: A positive covariance between two variables (e.g., Age and Income) indicates that when one variable increases, the other tends to increase as well. For example, if Age increases, Income tends to increase too.
- Negative Covariance: A negative covariance between two variables (e.g., Age and Education) indicates that when one variable increases, the other tends to decrease. For example, if Age increases, Education level tends to decrease.
- Close to Zero Covariance: A covariance close to zero suggests that there is little to no linear relationship between the variables.
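Since the question supplies no data, the sketch below uses made-up values to show how the matrix could be computed with pandas. Education is given as years of schooling so that it is numeric; every number here is hypothetical.

```python
import pandas as pd

# Hypothetical values for illustration only.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Income": [32000, 45000, 81000, 90000, 60000],
    "Education": [12, 16, 18, 16, 14],  # years of schooling
})

# DataFrame.cov() returns the sample covariance matrix (divides by n - 1).
cov_matrix = df.cov()
print(cov_matrix)
```

The magnitudes depend on the units of each variable (Income in currency dominates here), so the signs are easier to interpret than the sizes; correlation is the usual choice when a scale-free comparison is needed.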



Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

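One-hot encoding works for all three variables, "Gender," "Education Level," and "Employment Status": it creates a binary column for each unique category and preserves the distinction between categories without introducing an artificial order. "Education Level" does have a natural order (High School < Bachelor's < Master's < PhD), so ordinal encoding (0, 1, 2, 3) is a reasonable alternative for it when the model should see that ranking. "Gender" has only two categories here, so a single 0/1 column carries the same information as two one-hot columns.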
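A minimal sketch of these encodings with pandas and scikit-learn follows. The three rows are hypothetical, and the ordinal step for "Education Level" is shown as an optional alternative that keeps the natural order; one-hot encoding that column as well would simply mean adding it to the `columns` list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["Bachelor's", "PhD", "High School"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# One-hot encode the columns treated as nominal.
encoded = pd.get_dummies(df, columns=["Gender", "Employment Status"])

# Optional alternative for the ordered column: explicit ordinal codes.
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
encoder = OrdinalEncoder(categories=education_order)
encoded["Education Level"] = encoder.fit_transform(df[["Education Level"]]).ravel()

print(encoded)
```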

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

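Covariance is defined for numeric variables, so Temperature and Humidity are the only pair here for which it can be computed directly. The categorical variables, "Weather Condition" and "Wind Direction," are nominal, so any pair that involves them needs a different treatment, described in the interpretation below.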


 **Calculate Covariance:** Calculate the covariance between two numeric variables using the formula:

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})$$

   - X and Y are the two variables (here, Temperature and Humidity).
   - X_i and Y_i are the individual data points of X and Y respectively.
   - x̄ and ȳ are the means of X and Y respectively.
   - n is the number of data points.

 **Interpretation:**

- **Covariance between Continuous Variables (Temperature and Humidity):**
  The covariance between Temperature and Humidity indicates the direction of their linear relationship:
   - Positive Covariance: A positive covariance means that as Temperature increases, Humidity tends to increase as well. The correlation between the two variables is then also positive, because covariance and correlation always share the same sign.
   - Negative Covariance: A negative covariance means that as Temperature increases, Humidity tends to decrease, and the correlation is negative.

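- **Covariance between Continuous and Categorical Variables (Temperature and Weather Condition, Humidity and Weather Condition, Temperature and Wind Direction, Humidity and Wind Direction):**
  Weather Condition and Wind Direction are nominal, so they have no numeric values to put into the covariance formula. Assigning arbitrary integer codes (for example Sunny = 0, Cloudy = 1, Rainy = 2) would produce a number, but its sign and size would depend on the coding chosen, so it would not describe a real relationship. Two workable options are to one-hot encode the categorical variable and compute the covariance with each binary column, or to compare the continuous variable across categories (group means, or an ANOVA-style test).

For example, comparing the mean Temperature for Sunny, Cloudy, and Rainy days shows how Temperature varies across weather conditions. The same reasoning applies to a pair of two categorical variables (Weather Condition and Wind Direction), where a contingency table is the usual tool instead of covariance.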
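No data is given in the question, so the sketch below uses made-up observations to show the covariance for the continuous pair and a group-mean comparison for a continuous/categorical pair. All values are hypothetical.

```python
import pandas as pd

# Hypothetical observations for illustration.
df = pd.DataFrame({
    "Temperature": [30.0, 28.0, 22.0, 20.0, 18.0, 25.0],
    "Humidity": [40.0, 45.0, 70.0, 80.0, 85.0, 60.0],
    "Weather Condition": ["Sunny", "Sunny", "Cloudy", "Rainy", "Rainy", "Cloudy"],
})

# Covariance is well defined for the two continuous variables.
temp_humidity_cov = df["Temperature"].cov(df["Humidity"])

# For a nominal variable, compare the continuous variable across groups instead.
mean_temp_by_weather = df.groupby("Weather Condition")["Temperature"].mean()

print(temp_humidity_cov)
print(mean_temp_by_weather)
```

In this toy data the covariance is negative (warmer observations are drier), and the group means show Temperature falling from Sunny to Cloudy to Rainy.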

