## Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

### Ans :
Difference Between Ordinal Encoding and Label Encoding

* Label Encoding assigns a unique integer to each category without implying any order or ranking. It’s typically used for nominal (unordered) data. The integers are arbitrary (scikit-learn's `LabelEncoder` assigns them in sorted order of the category names), so a model that treats the codes as magnitudes may read an order into them that isn't there.

* Ordinal Encoding assigns integers to categories that have a meaningful order or ranking (e.g., "Low", "Medium", "High"), so that the numeric order matches the category order.

When to Choose:

* Label Encoding: Use for nominal data where there’s no natural order.
  Example: Colors (Red, Blue, Green), label encoded as 0, 1, 2.
* Ordinal Encoding: Use for ordinal data where there is a clear ranking.
  Example: Education levels (High School, Bachelor's, Master's), ordinal encoded as 0, 1, 2.

In short, Label Encoding is for unordered categories, while Ordinal Encoding is for ordered categories.
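
The contrast can be sketched with scikit-learn, assuming `LabelEncoder` for the nominal colors and `OrdinalEncoder` with an explicit `categories` list for the ranked education levels. The variable names are illustrative, not part of the question.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Nominal feature: LabelEncoder assigns codes in sorted (alphabetical) order.
colors = ["Red", "Blue", "Green"]
color_codes = LabelEncoder().fit_transform(colors)

# Ordinal feature: pass the ranking explicitly so the codes follow it.
education_order = ["High School", "Bachelor's", "Master's"]
education = [["Master's"], ["High School"], ["Bachelor's"]]  # 2D: one column
education_codes = OrdinalEncoder(categories=[education_order]).fit_transform(education)

print(color_codes)              # codes follow alphabetical order of the names
print(education_codes.ravel())  # codes follow education_order
```

Here the color codes carry no meaning beyond identity, while the education codes preserve the ranking given in `education_order`.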

## Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

### Ans :
Target Guided Ordinal Encoding: Target Guided Ordinal Encoding is a method in which the categories of a feature are ordered by the mean of the target variable within each category, and each category is then replaced by its rank in that ordering.

1. For each category, calculate the mean of the target variable.
2. Sort the categories by these means.
3. Assign ordinal values in that order, so categories with higher target means get higher ordinal values.

Usage: This technique is useful when the categorical feature has a clear relationship with the target variable, for example in regression tasks or with a binary target such as churn.

Example:

In a customer churn prediction project, you might have a "contract type" feature with categories like "Month-to-month", "One year", and "Two years". First compute the average churn rate for each contract type:

* Month-to-month: 0.7 churn rate
* One year: 0.3 churn rate
* Two years: 0.1 churn rate

Then rank the categories by churn rate and assign higher values to higher rates: Two years = 0, One year = 1, Month-to-month = 2.
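
A minimal sketch of these steps with pandas, using a small made-up dataset (the column names `contract_type` and `churn` and the values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data: 1 = churned, 0 = stayed.
df = pd.DataFrame({
    "contract_type": ["Month-to-month", "Month-to-month", "Month-to-month",
                      "One year", "One year", "One year",
                      "Two years", "Two years", "Two years"],
    "churn": [1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Step 1: mean of the target for each category.
churn_rate = df.groupby("contract_type")["churn"].mean()

# Steps 2-3: sort ascending by mean and use the position as the ordinal code.
ordered = churn_rate.sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered)}

df["contract_type_encoded"] = df["contract_type"].map(mapping)
print(mapping)
```

In a real project the mapping would normally be computed on the training split only and then applied to validation and test data, so that target information does not leak into evaluation.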

## Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

### Ans :
Covariance: Covariance is a measure of the joint variability of two variables. A positive covariance indicates that the variables tend to increase or decrease together; a negative covariance indicates that one tends to increase while the other decreases.

Why covariance is important:
* Helps in understanding relationships: It shows the direction in which two variables move relative to each other, which is useful when studying patterns or trends in data. Its magnitude depends on the units of the variables, so on its own it does not show how strong the relationship is.
* Basis for correlation: Covariance is the foundation for calculating correlation, which divides the covariance by the product of the two standard deviations to normalize it to a scale from -1 to 1.

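How is Covariance Calculated?
The formula for the covariance between two variables $X$ and $Y$ is:

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

Where:
* $X_i$ and $Y_i$ are individual data points,
* $\bar{X}$ and $\bar{Y}$ are the mean values of $X$ and $Y$,
* $n$ is the number of data points.

This is the population form; the sample covariance divides by $n - 1$ instead of $n$.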

In short, covariance measures how two variables vary together, helping to understand their relationship.
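
As a small worked example of the formula, the sketch below computes the covariance by hand with NumPy and compares it with `np.cov`. The arrays `x` and `y` are made-up values; `bias=True` is passed so that `np.cov` uses the same 1/n divisor as the formula.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Population covariance: average product of deviations from the means (1/n).
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov divides by n - 1 by default; bias=True switches to 1/n.
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree, and the positive sign reflects that `y` increases whenever `x` does.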

## Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

### Ans :


```python
from sklearn.preprocessing import LabelEncoder

# Sample values for each categorical variable
color = ['red', 'green', 'blue', 'red', 'blue']
size = ['small', 'medium', 'large', 'medium', 'small']
material = ['wood', 'metal', 'plastic', 'wood', 'metal']

# fit_transform refits the encoder on each call, so one instance can be reused
label = LabelEncoder()

encoded_color = label.fit_transform(color)
encoded_size = label.fit_transform(size)
encoded_material = label.fit_transform(material)

print("Encoded color", encoded_color)
print("Encoded size", encoded_size)
print("Encoded material", encoded_material)
```


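```text
Encoded color [2 1 0 2 0]
Encoded size [2 1 0 1 2]
Encoded material [2 0 1 2 0]
```

`LabelEncoder` sorts the distinct categories alphabetically and numbers them from 0, so:

* Color: blue = 0, green = 1, red = 2
* Size: large = 0, medium = 1, small = 2
* Material: metal = 0, plastic = 1, wood = 2

Each output array lists the codes for the input values in their original order. For Size, the alphabetical codes do not follow the natural order small < medium < large.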
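
To check which integer was assigned to which category, one option is to inspect the fitted encoder's `classes_` attribute. The sketch below does this for the Size values; the `mapping` and `decoded` names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

size = ['small', 'medium', 'large', 'medium', 'small']

encoder = LabelEncoder()
codes = encoder.fit_transform(size)

# classes_ holds the sorted categories; a category's index is its code.
mapping = {str(category): index for index, category in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping a separate encoder per column makes this kind of lookup and decoding possible later, since a reused encoder only remembers the last column it was fitted on.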


## Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

### Ans :


```python
import numpy as np

data = np.array([
    [25, 50000, 2],  # Age = 25, Income = 50000, Education Level = 2 (e.g., Bachelor's)
    [30, 60000, 3],  # Age = 30, Income = 60000, Education Level = 3 (e.g., Master's)
    [35, 70000, 3],  # Age = 35, Income = 70000, Education Level = 3 (e.g., Master's)
    [40, 80000, 4],  # Age = 40, Income = 80000, Education Level = 4 (e.g., PhD)
])

# Calculate covariance matrix (rowvar=False: each column is a variable)
cov_matrix = np.cov(data, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)
```

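```text
Covariance Matrix:
[[4.16666667e+01 8.33333333e+04 5.00000000e+00]
 [8.33333333e+04 1.66666667e+08 1.00000000e+04]
 [5.00000000e+00 1.00000000e+04 6.66666667e-01]]
```

Interpretation (rows and columns are ordered Age, Income, Education Level):

* The diagonal entries are the variances: about 41.67 for Age, 1.67 × 10^8 for Income, and 0.67 for Education Level.
* All off-diagonal entries are positive, so in this sample Age, Income, and Education Level tend to increase together: Cov(Age, Income) ≈ 83,333, Cov(Age, Education Level) = 5, and Cov(Income, Education Level) = 10,000.
* The magnitudes reflect the units of each variable (Income is in the tens of thousands), so they cannot be compared directly to judge the strength of each relationship; correlation is needed for that.
* `np.cov` divides by n - 1 by default, so these are sample covariances.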
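
Because covariance depends on units, a unit-free comparison can be sketched with `np.corrcoef`, which rescales each covariance by the two standard deviations. The data is the same illustrative sample used in the answer.

```python
import numpy as np

# Same illustrative data: columns are Age, Income, Education Level.
data = np.array([
    [25, 50000, 2],
    [30, 60000, 3],
    [35, 70000, 3],
    [40, 80000, 4],
])

# Correlation matrix: covariance normalized to the range [-1, 1].
corr_matrix = np.corrcoef(data, rowvar=False)

print(np.round(corr_matrix, 3))
```

In this sample Income is an exact linear function of Age, so their correlation is 1, while Education Level correlates with both at roughly 0.95.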


## Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

### Ans :
Encoding Method Selection
1. Gender (Male/Female):

* Encoding Method: Label Encoding
* Why: Since Gender has only two categories, label encoding (Male = 0, Female = 1) is simple and efficient.

2. Education Level (High School/Bachelor's/Master's/PhD):

* Encoding Method: Ordinal Encoding
* Why: Education Level has a natural order (High School < Bachelor's < Master's < PhD), so ordinal encoding (assigning numeric values based on order) makes sense.

3. Employment Status (Unemployed/Part-Time/Full-Time):

* Encoding Method: One-Hot Encoding
* Why: Employment Status has more than two categories and is treated here as nominal, with no order the model should rely on. One-hot encoding creates a separate binary column for each category, so the model treats them as distinct.
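
The three choices can be sketched together as follows. The sample rows are hypothetical, and Gender is encoded with an explicit dictionary so that the codes match the Male = 0, Female = 1 convention stated above (scikit-learn's `LabelEncoder` would assign them alphabetically instead).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample rows for the three variables in the question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories, so a single 0/1 column is enough.
gender_encoded = df["Gender"].map({"Male": 0, "Female": 1})

# Education Level: pass the ranking explicitly so the codes follow it.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_encoder = OrdinalEncoder(categories=[education_order])
education_encoded = education_encoder.fit_transform(df[["Education Level"]]).ravel()

# Employment Status: one binary column per category (sorted alphabetically).
status_encoder = OneHotEncoder()
status_encoded = status_encoder.fit_transform(df[["Employment Status"]]).toarray()

print(gender_encoded.tolist())
print(education_encoded)
print(status_encoder.categories_[0])
print(status_encoded)
```

The one-hot columns follow the order in `status_encoder.categories_`, which is useful when naming the resulting features.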

## Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

### Ans :

To calculate the covariance between each pair of variables, we need numerical representations for both continuous and categorical variables. Here's a simple way to calculate and interpret the covariance:

Steps:
1. Temperature and Humidity are continuous variables, so we calculate their covariance directly.
2. Weather Condition and Wind Direction are categorical variables, so we first need to encode them (e.g., using Label Encoding or One-Hot Encoding) before calculating the covariance.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample data (Temperature, Humidity, Weather Condition, Wind Direction)
# Mixing numbers and strings makes NumPy store every value as a string
data = np.array([
    [30, 80, 'Sunny', 'North'],
    [25, 70, 'Cloudy', 'South'],
    [20, 90, 'Rainy', 'East'],
    [15, 85, 'Sunny', 'West'],
])

# Encoding categorical variables
label_encoder = LabelEncoder()
weather_encoded = label_encoder.fit_transform(data[:, 2])  # Weather Condition
wind_encoded = label_encoder.fit_transform(data[:, 3])  # Wind Direction

# Combine the data with encoded values (numeric columns converted back to float)
encoded_data = np.column_stack((
    data[:, 0].astype(float),
    data[:, 1].astype(float),
    weather_encoded,
    wind_encoded,
))

# Calculate covariance matrix (rowvar=False: each column is a variable)
cov_matrix = np.cov(encoded_data, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)
```


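```text
Covariance Matrix:
[[ 41.66666667 -29.16666667  -0.83333333  -3.33333333]
 [-29.16666667  72.91666667   4.58333333  -4.16666667]
 [ -0.83333333   4.58333333   0.91666667   0.16666667]
 [ -3.33333333  -4.16666667   0.16666667   1.66666667]]
```

Interpretation (rows and columns are ordered Temperature, Humidity, Weather Condition, Wind Direction):

* Temperature and Humidity: the covariance is about -29.17, so in this sample humidity tends to be lower when temperature is higher.
* Diagonal entries: the variances are about 41.67 for Temperature and 72.92 for Humidity.
* Pairs involving Weather Condition or Wind Direction: these values depend on the integer codes that label encoding assigns (alphabetical order, e.g. Cloudy = 0, Rainy = 1, Sunny = 2). Because those codes are arbitrary for nominal categories, the sign and size of these covariances should not be read as real relationships.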
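
One-hot encoding, mentioned in the steps as an alternative, avoids the arbitrary integer codes. The sketch below builds a 0/1 indicator for each weather category by hand and computes its sample covariance with Temperature, reusing the sample values from the answer; the `cov_by_category` name is illustrative.

```python
import numpy as np

temperature = np.array([30.0, 25.0, 20.0, 15.0])
weather = np.array(['Sunny', 'Cloudy', 'Rainy', 'Sunny'])

# One 0/1 indicator per weather category (a manual one-hot encoding).
cov_by_category = {}
for category in np.unique(weather):
    indicator = (weather == category).astype(float)
    # Sample covariance (n - 1) between temperature and this indicator.
    cov_by_category[str(category)] = float(np.cov(temperature, indicator)[0, 1])

print(cov_by_category)
```

A positive value means that temperatures in that category are above the overall sample mean on average, and a negative value means they are below it, which is easier to interpret than a covariance with label-encoded integers.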
