Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

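**Ordinal Encoding** assigns integers to categories according to a defined order. Use it when the categorical variable has a meaningful ranking (e.g., ratings like "low," "medium," "high"), so that the integer order mirrors the category order.

**Label Encoding** assigns a unique integer to each category without reference to any inherent order; scikit-learn's `LabelEncoder`, for instance, numbers the categories in sorted (alphabetical) order. It is suitable for nominal data where categories have no ranking (e.g., colors like "red," "blue," "green"), with the caveat that models which treat the integers as magnitudes may read an order into them.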

**Example:**

* Use Ordinal Encoding for a feature like "Education Level" (e.g., High School = 1, Bachelor's = 2, Master's = 3).
* Use Label Encoding for a feature like "Fruit" (e.g., Apple = 0, Banana = 1, Cherry = 2) since there is no order among the fruits.
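The two examples above can be sketched with scikit-learn. This is a minimal illustration, not the only way to do it: the sample values are made up, and note that `OrdinalEncoder` numbers categories from 0 rather than from 1 as in the bullet above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal feature: pass the order explicitly so the integers follow it.
education = np.array([["Bachelor's"], ["High School"], ["Master's"]])
ordinal_encoder = OrdinalEncoder(
    categories=[["High School", "Bachelor's", "Master's"]]
)
# Codes start at 0: High School = 0, Bachelor's = 1, Master's = 2.
education_codes = ordinal_encoder.fit_transform(education).ravel()

# Nominal values: LabelEncoder sorts the classes alphabetically,
# so the integers carry no ranking of their own.
fruit = ["Banana", "Cherry", "Apple"]
label_encoder = LabelEncoder()
fruit_codes = label_encoder.fit_transform(fruit)

print(education_codes)
print(fruit_codes)
print(label_encoder.classes_)
```

The explicit `categories` list is what makes the first encoding ordinal; without it, `OrdinalEncoder` would also fall back to sorted order.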


Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.


**Target Guided Ordinal Encoding** orders the categories of a feature by a summary of the target variable, typically its mean or median within each category, and then assigns integer ranks in that order. The encoding therefore captures the relationship between the categorical feature and the target variable.

**Example:** In a project predicting house prices, you might have a categorical feature "Neighborhood." You could calculate the average house price for each neighborhood, rank the neighborhoods by that average, and replace each neighborhood with its rank. This can improve model performance by giving the model a more informative representation of the feature, provided the averages are computed on the training data only so that the target does not leak into validation or test rows.
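A minimal pandas sketch of the neighborhood example follows. It uses one common form of the technique, which ranks categories by their mean target value; the neighborhood names, prices, and the column name `Neighborhood_encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; neighborhood names and prices are made up.
train = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "Price": [300_000, 150_000, 220_000, 320_000, 170_000, 200_000],
})

# 1. Mean target per category, computed on training rows only.
mean_price = train.groupby("Neighborhood")["Price"].mean()

# 2. Sort categories by mean price and number them 1..k (lowest mean = 1).
ordered = mean_price.sort_values().index
rank_map = {name: rank for rank, name in enumerate(ordered, start=1)}

# 3. Replace each category with its rank.
train["Neighborhood_encoded"] = train["Neighborhood"].map(rank_map)

print(rank_map)
print(train)
```

The same `rank_map` would then be applied to validation or test rows, so the encoding never sees their prices.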

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?


**Covariance** is a measure of the degree to which two variables change together. It indicates the direction of the linear relationship between the variables.

* Positive Covariance: Both variables increase together.
* Negative Covariance: One variable increases while the other decreases.
* Zero Covariance: No linear relationship exists.

**Importance:** Covariance is crucial in understanding relationships between variables, which is essential for regression analysis, portfolio theory in finance, and feature selection in machine learning.
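**Calculation:** For $n$ paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$, the sample covariance is

$$\operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The sketch below applies this formula by hand and compares it with `np.cov`, which also divides by $n-1$ by default. The helper name `sample_covariance` and the data are illustrative only.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Made-up data in which y rises with x, so the covariance is positive.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

manual = sample_covariance(x, y)
library = np.cov(x, y)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix

print(manual, library)
```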

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and output.


In [1]:
import pandas as pd  
from sklearn.preprocessing import LabelEncoder  

# Sample dataset  
data = {  
    'Color': ['red', 'green', 'blue', 'red', 'blue'],  
    'Size': ['small', 'medium', 'large', 'medium', 'small'],  
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']  
}  

df = pd.DataFrame(data)  

# Fit a separate LabelEncoder per column so each mapping can be inverted later
encoders = {}
for column in df.columns:
    encoders[column] = LabelEncoder()
    # Classes are numbered in sorted order, e.g. blue = 0, green = 1, red = 2
    df[column] = encoders[column].fit_transform(df[column])

print(df)

   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      2     1         2
4      0     2         0


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Spending Score.


| Age | Income | Spending Score |
|-----|--------|----------------|
| 25 | 50000 | 60 |
| 30 | 60000 | 70 |
| 35 | 70000 | 80 |
| 40 | 80000 | 90 |
| 45 | 90000 | 100 |

In [3]:
import pandas as pd

# Sample data  
data = {  
    'Age': [25, 30, 35, 40, 45],  
    'Income': [50000, 60000, 70000, 80000, 90000],  
    'Spending Score': [60, 70, 80, 90, 100]  
}  

df = pd.DataFrame(data)  

# Calculate covariance matrix  
cov_matrix = df.cov()  
print(cov_matrix)

                     Age       Income  Spending Score
Age                 62.5     125000.0           125.0
Income          125000.0  250000000.0        250000.0
Spending Score     125.0     250000.0           250.0


Q6. You are working on a machine learning project involving several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you choose for each variable, and why?


* Gender: Use Binary Encoding, i.e. a single 0/1 column, since there are only two categories (Male = 0, Female = 1).
* Education Level: Use Ordinal Encoding because the levels have a natural order (e.g., High School = 1, Bachelor's = 2, Master's = 3, PhD = 4).
* Employment Status: Use One-Hot Encoding, treating the statuses as categories without a strict natural order; this creates a separate binary column for each status.
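The three choices above can be sketched with pandas alone. The rows below are made up for illustration, and the integer mappings follow the ones listed in the bullets.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# Gender: two categories, so a single 0/1 column is enough.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# Education Level: explicit mapping that preserves the natural order.
education_order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
df["Education Level"] = df["Education Level"].map(education_order)

# Employment Status: one binary column per status.
df = pd.get_dummies(df, columns=["Employment Status"], dtype=int)

print(df)
```

`pd.get_dummies` names the new columns `Employment Status_<value>`; scikit-learn's `OneHotEncoder` would be an alternative when the encoding has to be reused on new data.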

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity," and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.


| Temperature | Humidity | Weather Condition | Wind Direction |
|-------------|----------|-------------------|----------------|
| 30 | 70 | Sunny | North |
| 25 | 80 | Cloudy | South |
| 20 | 90 | Rainy | East |
| 35 | 60 | Sunny | West |
| 28 | 75 | Cloudy | |

In [4]:
# Sample data: numeric columns only, since covariance needs numeric values
data = {  
    'Temperature': [30, 25, 20, 35, 28],  
    'Humidity': [70, 80, 90, 60, 75],  
}  

df = pd.DataFrame(data)  

# Calculate covariance matrix  
cov_matrix = df.cov()  
print(cov_matrix)

             Temperature  Humidity
Temperature         31.3     -62.5
Humidity           -62.5     125.0
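
**Interpretation:** The covariance between Temperature and Humidity is -62.5, so in this sample higher temperatures go with lower humidity. The diagonal entries (31.3 and 125.0) are the variances of each variable. The categorical columns are not in the matrix because covariance is defined for numeric values only.

One way to bring a categorical column in, sketched here as an assumption rather than as part of the original answer, is to one-hot encode it first. Only "Weather Condition" is used, because the last row of the table has no Wind Direction value.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 25, 20, 35, 28],
    "Humidity": [70, 80, 90, 60, 75],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
})

# One indicator column (0.0 or 1.0) per weather condition.
weather = pd.get_dummies(df["Weather Condition"], dtype=float)

# Combine the numeric columns with the indicators and take the covariance.
numeric = pd.concat([df[["Temperature", "Humidity"]], weather], axis=1)
cov_matrix = numeric.cov()

print(cov_matrix.round(2))
```

A positive covariance between Temperature and the `Sunny` indicator means sunny rows tend to be warmer than average in this sample; the magnitudes depend on the units, so they are not comparable across pairs.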
