### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Answer - 

**Ordinal Encoding vs. Label Encoding:**
Ordinal Encoding and Label Encoding are both techniques used in machine learning to convert categorical data into numerical format for model training. However, they are used in different scenarios and have distinct approaches.

1. **Ordinal Encoding:**
   - In ordinal encoding, categories are assigned numerical values based on their order or rank.
   - It's suitable when the categorical variable has an inherent order or hierarchy.
   - Example: Education levels (High School < Bachelor's < Master's < PhD) can be encoded as (1 < 2 < 3 < 4).

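2. **Label Encoding:**
   - In label encoding, each category is assigned a unique integer label, and the numeric order of the labels carries no meaning.
   - It's used when there is no inherent order among categories, or when encoding the target variable.
   - Example: Colors (Red, Blue, Green) can be encoded as (0, 1, 2). Because a model may read these integers as a ranking, choose ordinal encoding whenever a real order exists, as with the education levels above.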
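
The difference can be sketched with scikit-learn. This is a minimal illustration rather than a definitive recipe: the sample values are made up, and note that `OrdinalEncoder` numbers categories from 0, not from 1 as in the example above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal feature: pass the category order explicitly so the codes follow the ranking.
education = np.array([["Bachelor's"], ["PhD"], ["High School"], ["Master's"]])
order = [["High School", "Bachelor's", "Master's", "PhD"]]
ordinal_codes = OrdinalEncoder(categories=order).fit_transform(education).ravel()
print(ordinal_codes)  # zero-based codes: [1. 3. 0. 2.]

# Nominal labels: LabelEncoder numbers the sorted unique values.
colors = ["Red", "Blue", "Green"]
label_codes = LabelEncoder().fit_transform(colors)
print(label_codes)  # Blue=0, Green=1, Red=2, so the result is [2 0 1]
```

The alphabetical numbering from `LabelEncoder` is why it is a poor fit for ordered features: the codes would not follow the real ranking.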



### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Answer - 

**Target Guided Ordinal Encoding is a technique used in machine learning to encode categorical variables based on their relationship with the target variable. It's particularly useful when dealing with categorical features that have a strong influence on the target variable.**

Here's how Target Guided Ordinal Encoding works:

1. **Calculate Target Mean/Median per Category:** For each category in the categorical feature, calculate the mean or median of the target variable for instances belonging to that category.

2. **Sort Categories:** Sort the categories by their calculated means/medians, from lowest to highest.

3. **Assign Ordinal Labels:** Assign consecutive integer labels in the sorted order, so the category with the lowest mean/median gets the lowest label and the category with the highest mean/median gets the highest label.

This encoding technique takes advantage of the relationship between the categorical feature and the target variable, so the integer codes themselves carry predictive signal. Because the labels are derived from the target, compute them on the training data only and reuse the same mapping for validation and test data; otherwise target information leaks into the features.

**Example Scenario: Loan Approval Prediction**
Suppose you're working on a machine learning project to predict whether a loan application should be approved or not. One of the features is "Education Level," and you suspect that the education level might have an impact on loan approval. You can use Target Guided Ordinal Encoding in the following way:

1. **Calculate Mean Approval Rate for Each Education Level:**
   - High School: 0.40 (40% approval rate)
   - Bachelor's: 0.60 (60% approval rate)
   - Master's: 0.75 (75% approval rate)
   - PhD: 0.85 (85% approval rate)

2. **Sort Education Levels Based on Approval Rates (highest first):**
   - PhD (85%)
   - Master's (75%)
   - Bachelor's (60%)
   - High School (40%)

3. **Assign Ordinal Labels:**
   - PhD: 4
   - Master's: 3
   - Bachelor's: 2
   - High School: 1
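
The three steps can be sketched with pandas. The toy data below is hypothetical (it does not reproduce the approval rates above), and the column names `education` and `approved` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: 1 = approved, 0 = rejected.
df = pd.DataFrame({
    "education": ["High School", "High School", "High School",
                  "Bachelor's", "Bachelor's",
                  "Master's", "Master's", "Master's",
                  "PhD", "PhD"],
    "approved": [0, 0, 1, 1, 0, 1, 1, 0, 1, 1],
})

# Step 1: mean of the target per category.
means = df.groupby("education")["approved"].mean()

# Step 2: sort categories from lowest to highest mean.
ordered = means.sort_values().index

# Step 3: assign 1, 2, 3, ... following the sorted order.
mapping = {category: rank for rank, category in enumerate(ordered, start=1)}
df["education_encoded"] = df["education"].map(mapping)
print(mapping)
```

In a real project the mapping would typically be learned from the training split only and then applied unchanged to new data.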



### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

**Covariance** is a statistical measure that quantifies the degree to which two random variables change together. It indicates whether an increase in one variable corresponds to an increase, decrease, or no change in another variable. In other words, it measures the directional relationship between two variables. 

**Importance of Covariance in Statistical Analysis:**
Covariance plays a crucial role in statistical analysis for several reasons:

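1. **Relationship Assessment:** The sign of the covariance shows whether two variables tend to move in the same direction (positive) or in opposite directions (negative).

2. **Portfolio Diversification:** In finance, the variance of a portfolio depends on the covariances between its assets, so combining assets with low or negative covariance reduces overall risk.

3. **Dimensionality Reduction:** Principal Component Analysis (PCA) derives its components from the eigenvectors of the covariance matrix.

4. **Linear Regression:** In simple linear regression, the least-squares slope equals cov(X, Y) divided by the variance of X.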

**Calculation of Covariance:**
The formula to calculate the covariance between two variables X and Y in a dataset of n data points is as follows:

```
cov(X, Y) = Σ((X_i - X̄) * (Y_i - Ȳ)) / (n - 1)
```

Where:
- `X_i` and `Y_i` are the individual data points of X and Y.
- `X̄` is the mean of variable X.
- `Ȳ` is the mean of variable Y.
- `n` is the number of data points.

Note that dividing by `(n - 1)` instead of `n` in the formula is known as Bessel's correction and is used to provide an unbiased estimate of the population covariance from a sample.
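
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with `np.cov`, which also divides by `(n - 1)` by default. The values of `x` and `y` are made up for illustration.

```python
import numpy as np

# Illustrative values, not taken from any dataset in this document.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of products of deviations, divided by (n - 1).
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 matrix; entry [0, 1] is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]
print(cov_manual, cov_numpy)
```

Here the deviation products sum to 14, so both approaches give 14 / 3.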


### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic']
}

df = pd.DataFrame(data)

# Encode each original column; fit_transform refits the encoder for every column
encoder = LabelEncoder()
for col in list(df.columns):
    df[col + '_encoded'] = encoder.fit_transform(df[col])

df
```

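|   | Color | Size | Material | Color_encoded | Size_encoded | Material_encoded |
|---|-------|------|----------|---------------|--------------|------------------|
| 0 | red   | small  | wood    | 2 | 2 | 2 |
| 1 | green | medium | metal   | 1 | 1 | 0 |
| 2 | blue  | large  | plastic | 0 | 0 | 1 |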
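
To explain the output: `LabelEncoder` sorts the unique values of each column alphabetically and numbers them from 0, which is why `blue` is 0 and `red` is 2, and why `large` (0) ends up below `small` (2). The integers are identifiers only and say nothing about size or preference. The sketch below, which fits one encoder per column (a variation on the code above, not the original approach), prints each mapping via the `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic'],
})

# Fit one encoder per column and record its category -> integer mapping.
mappings = {}
for col in df.columns:
    enc = LabelEncoder().fit(df[col])
    # classes_ holds the sorted unique labels; the position is the assigned code.
    mappings[col] = {str(label): code for code, label in enumerate(enc.classes_)}

print(mappings)
```

Keeping a separate fitted encoder per column also makes it possible to call `inverse_transform` later, which a single reused encoder cannot do for earlier columns.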


### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

```python
import numpy as np

# Sample dataset
age = [25, 30, 40, 35, 28]
income = [50000, 60000, 80000, 70000, 55000]
# Assuming ordinal values for Education Level: 1 = High School, 2 = Bachelor's, 3 = Master's, 4 = PhD
education_level = [2, 3, 2, 4, 1]

# Combine variables into a matrix (one row per variable)
data_matrix = np.array([age, income, education_level])

# Calculate covariance matrix
covariance_matrix = np.cov(data_matrix)

# Print the covariance matrix
print("Covariance Matrix:")
print(covariance_matrix)
```


```
Covariance Matrix:
[[3.53e+01 7.15e+04 2.20e+00]
 [7.15e+04 1.45e+08 4.75e+03]
 [2.20e+00 4.75e+03 1.30e+00]]
```
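
Interpretation: the diagonal entries are the variances of Age (35.3), Income (1.45e8), and Education Level (1.3). Every off-diagonal entry is positive, so in this sample the three variables tend to increase together. The magnitudes are not comparable, because covariance depends on the units of each variable; the large Age-Income value (71,500) mostly reflects the scale of income. One way to compare strengths is to rescale the matrix to correlations, as in this sketch.

```python
import numpy as np

age = [25, 30, 40, 35, 28]
income = [50000, 60000, 80000, 70000, 55000]
education_level = [2, 3, 2, 4, 1]

data_matrix = np.array([age, income, education_level])
cov = np.cov(data_matrix)

# Divide each covariance by the product of the two standard deviations.
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

print(np.round(corr, 2))
```

On this data the Age-Income correlation is close to 1, while Education Level correlates only weakly with Age (about 0.32) and Income (about 0.35). With five data points these figures are illustrative only.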


### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

For the given categorical variables "Gender," "Education Level," and "Employment Status," the choice of encoding method depends on the nature of the variables and their potential impact on the machine learning model. Here's a recommendation for each variable:

1. **Gender (Binary Categorical Variable - Male/Female):**
   For binary categorical variables like "Gender," where there are only two possible values, you can use **Label Encoding**. Since there are only two categories (Male and Female), you can assign 0 to Male and 1 to Female. This approach simplifies the representation of binary data without introducing extra dimensions.

   ```plaintext
   Male   -> 0
   Female -> 1
   ```

2. **Education Level (Ordinal Categorical Variable - High School/Bachelor's/Master's/PhD):**
   For ordinal categorical variables like "Education Level," where there is an inherent order among the categories, **Ordinal Encoding** is recommended. Each level is mapped to an integer that preserves the ranking.

   ```plaintext
   High School -> 1
   Bachelor's  -> 2
   Master's    -> 3
   PhD         -> 4
   ```

3. **Employment Status (Ordinal Categorical Variable - Unemployed/Part-Time/Full-Time):**
   For "Employment Status," the categories have a natural order (Unemployed < Part-Time < Full-Time), so **Ordinal Encoding** with an explicitly specified order is a reasonable choice. Note that scikit-learn's `LabelEncoder` would number the categories alphabetically (Full-Time = 0, Part-Time = 1, Unemployed = 2), which does not match this order. If the gaps between levels are unclear and a target variable is available, **Target Guided Ordinal Encoding** is an alternative; it assigns labels based on the relationship with the target.

   ```plaintext
   Unemployed  -> 0
   Part-Time   -> 1
   Full-Time   -> 2
   ```
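
One possible way to apply explicit orders like these is an ordered pandas `Categorical`, sketched below. The sample rows are hypothetical, and the resulting codes are zero-based, so add 1 if the numbering should start at 1.

```python
import pandas as pd

# Hypothetical sample rows for the three variables in the question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# Explicit category orders; the position in each list becomes the integer code.
orders = {
    "Gender": ["Male", "Female"],
    "Education Level": ["High School", "Bachelor's", "Master's", "PhD"],
    "Employment Status": ["Unemployed", "Part-Time", "Full-Time"],
}

for col, order in orders.items():
    cat = pd.Categorical(df[col], categories=order, ordered=True)
    df[col + "_encoded"] = cat.codes

print(df)
```

Spelling out the order keeps the codes independent of alphabetical sorting, which matters for the two ordinal variables.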


### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.


```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Temperature': [25.0, 22.5, 28.7, 30.2, 21.8],
    'Humidity': [60.0, 75.2, 52.1, 45.8, 70.3],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'East', 'West', 'South', 'North']
}

df = pd.DataFrame(data)

# Label-encode the two categorical columns (integers follow alphabetical order)
encoder = LabelEncoder()
df['encoder_weather'] = encoder.fit_transform(df['Weather Condition'])
df['encoder_wind'] = encoder.fit_transform(df['Wind Direction'])

# Covariance of the numeric columns only; the raw string columns are skipped
df.cov(numeric_only=True)
```


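|                 | Temperature | Humidity | encoder_weather | encoder_wind |
|-----------------|-------------|----------|-----------------|--------------|
| Temperature     | 13.793      | -44.0515 | 2.725           | 3.455        |
| Humidity        | -44.0515    | 149.717  | -9.925          | -11.64       |
| encoder_weather | 2.725       | -9.925   | 1.0             | 0.5          |
| encoder_wind    | 3.455       | -11.64   | 0.5             | 1.3          |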
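
Interpretation: the Temperature-Humidity covariance is negative (-44.05), so in this sample warmer readings coincide with lower humidity. The covariances involving `encoder_weather` and `encoder_wind` should be read with caution, because they depend on the arbitrary integers assigned to nominal categories. The sketch below illustrates this by comparing the alphabetical numbering with a second, equally arbitrary numbering (an assumption made purely for illustration).

```python
import numpy as np

temperature = np.array([25.0, 22.5, 28.7, 30.2, 21.8])
weather = ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy']

# Alphabetical codes, as LabelEncoder would assign them.
alphabetical = {'Cloudy': 0, 'Rainy': 1, 'Sunny': 2}
# A different but equally valid numbering (hypothetical, for comparison).
alternative = {'Sunny': 0, 'Cloudy': 1, 'Rainy': 2}

def cov_with_codes(mapping):
    """Covariance between temperature and the weather codes under a mapping."""
    codes = np.array([mapping[w] for w in weather], dtype=float)
    return np.cov(temperature, codes)[0, 1]

cov_alpha = cov_with_codes(alphabetical)
cov_alt = cov_with_codes(alternative)
print(cov_alpha, cov_alt)
```

The first numbering reproduces the 2.725 in the table, while the second gives a negative value, so the sign itself is an artifact of the encoding and not a property of the weather.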
