# Q-1

### Ordinal Encoding:
### Ordinal encoding is used when the categorical variable has an inherent order or hierarchy among its categories. It assigns integer labels to the categories based on their order or level of preference. The numerical labels preserve the ordinal relationship between the categories.
### Suppose we have a categorical variable "Education Level" with categories "High School," "Bachelor's Degree," "Master's Degree," and "Ph.D." In ordinal encoding, we can assign integer labels 0, 1, 2, and 3 to represent the increasing level of education, respectively.
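### A minimal sketch of the mapping described above, assuming the data lives in a pandas Series. The sample values and the name `education_order` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
education = pd.Series(
    ["Bachelor's Degree", "High School", "Ph.D.", "Master's Degree", "High School"]
)

# Explicit mapping that follows the natural order of the categories.
education_order = {
    "High School": 0,
    "Bachelor's Degree": 1,
    "Master's Degree": 2,
    "Ph.D.": 3,
}

# Replace each category with its integer rank.
education_encoded = education.map(education_order)
print(education_encoded.tolist())  # [1, 0, 3, 2, 0]
```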
### Label Encoding:
### Label encoding, on the other hand, is a more generic encoding technique that assigns a unique integer to each category of the categorical variable, without considering any ordering or hierarchy between the categories. The assignment carries no meaning of its own; scikit-learn's LabelEncoder, for example, simply numbers the categories in sorted (alphabetical) order.
### Consider a categorical variable "City" with categories "New York," "London," and "Paris." In label encoding, we can assign integer labels 0, 1, and 2 to represent the cities, respectively.
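### As a small illustration, the sketch below label-encodes the cities with scikit-learn's `LabelEncoder`. The repeated "London" entry is added only to show that equal categories get equal labels. Because this encoder numbers categories alphabetically, the integers differ from the 0, 1, 2 assignment in the text, which is fine since the labels are arbitrary.

```python
from sklearn.preprocessing import LabelEncoder

cities = ["New York", "London", "Paris", "London"]

city_encoder = LabelEncoder()
city_encoded = city_encoder.fit_transform(cities)

# classes_ holds the categories in the order of their integer labels.
print(city_encoder.classes_.tolist())  # ['London', 'New York', 'Paris']
print(city_encoded.tolist())           # [1, 0, 2, 0]
```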

# Q-2

### Target Guided Ordinal Encoding is a technique used to encode categorical variables based on their relationship with the target variable in a supervised machine learning setting. It assigns ordinal labels to the categories of the variable, considering the likelihood of the target variable being a specific value for each category.
### Here's how Target Guided Ordinal Encoding works:
### 1. Calculate the mean (or median/mode) of the target variable for each category.
### 2. Sort the categories by that statistic.
### 3. Assign ordinal labels (for example 1, 2, 3, ...) following the sorted order.
### 4. Replace the categorical variable with the encoded labels.

### Here's an example to illustrate the usage of Target Guided Ordinal Encoding:
### Suppose we have a dataset with a categorical variable "Education Level" (categories: High School, Bachelor's Degree, Master's Degree, Ph.D.) and a binary target variable indicating whether a person is likely to default on a loan (Yes or No).
### To apply Target Guided Ordinal Encoding, we calculate the default rate (proportion of defaults) for each category of "Education Level." We sort the categories based on their default rate and assign ordinal labels accordingly. The category with the highest default rate can be assigned the highest label, indicating a higher risk of default.
### Example encoding (assuming, purely for illustration, that the observed default rate rises with education level):

### High School: 1 (Lowest risk of default)
### Bachelor's Degree: 2
### Master's Degree: 3
### Ph.D.: 4 (Highest risk of default)

### In this scenario, Target Guided Ordinal Encoding orders the "Education Level" categories by their observed default rate, so the encoded variable carries information about each category's association with the target. Because the labels are derived from the target, they should be computed on the training data only, to avoid leaking target information into evaluation.
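### The four steps can be sketched with pandas as follows. The rows below are invented toy data, chosen so that the default rate rises with education level as in the example encoding; the column names `education` and `default` are assumptions, not part of any real dataset.

```python
import pandas as pd

# Invented toy data: 1 = default, 0 = no default.
df = pd.DataFrame({
    "education": ["High School"] * 3 + ["Bachelor's Degree"] * 4
                 + ["Master's Degree"] * 2 + ["Ph.D."] * 3,
    "default":   [0, 0, 0] + [1, 0, 0, 0] + [1, 0] + [1, 1, 0],
})

# Step 1: mean of the target (the default rate) per category.
default_rate = df.groupby("education")["default"].mean()

# Step 2: sort the categories from lowest to highest default rate.
ordered_categories = default_rate.sort_values().index

# Step 3: assign ordinal labels 1, 2, 3, ... in that order.
mapping = {category: rank for rank, category in enumerate(ordered_categories, start=1)}

# Step 4: replace the categorical variable with the encoded labels.
df["education_encoded"] = df["education"].map(mapping)

print(mapping)
```

### With this toy data the mapping is High School: 1, Bachelor's Degree: 2, Master's Degree: 3, Ph.D.: 4. A different dataset could produce a different order, since the labels follow the data rather than the category names.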

# Q-3

### Covariance is a measure of how two variables vary together. It quantifies the extent to which changes in one variable correspond to changes in another. Its sign gives the direction (positive or negative) of the linear relationship between the two variables. Its magnitude depends on the units of the variables, so it is not a standardized measure of strength; correlation is used for that.
### Covariance is important in statistical analysis for several reasons:
### Relationship Assessment: Covariance allows us to assess the direction of the relationship between two variables. A positive covariance indicates that the variables tend to move in the same direction (increase or decrease together), while a negative covariance indicates an inverse relationship.
### Dependency Identification: Covariance helps to identify linear dependency between variables. A covariance close to zero suggests little or no linear relationship, but it does not prove independence, because the variables may still be related nonlinearly. A covariance that is clearly nonzero indicates that the variables are linearly dependent.
### Feature Selection: In machine learning and feature selection tasks, covariance (usually after scaling it to correlation) can be used to identify highly correlated features. Such features may provide redundant or similar information, and removing some of them can simplify the model and improve interpretability.
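### One caveat is worth illustrating: a covariance of zero only rules out a linear relationship. In the sketch below, `y` is completely determined by `x` (the values are chosen for illustration), yet their sample covariance is zero.

```python
import numpy as np

# y depends on x exactly (y = x squared), but not linearly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y).
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)
```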
### Covariance is calculated using the following formula:

### cov(X, Y) = Σ((Xᵢ - μₓ)(Yᵢ - μᵧ))/(N-1)

### Where:

### X and Y are the two variables of interest.
### Xᵢ and Yᵢ are the individual observations of X and Y.
### μₓ and μᵧ are the sample means of X and Y, respectively.
### N is the number of observations. Dividing by N-1 rather than N gives the unbiased sample covariance.
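### As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with `np.cov`, which also divides by N-1 by default. The two small arrays are made-up values.

```python
import numpy as np

# Made-up observations for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Sum of products of deviations from the means, divided by N - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Off-diagonal entry of the covariance matrix computed by NumPy.
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)  # 1.5 1.5
```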

# Q-4

In [2]:
from sklearn.preprocessing import LabelEncoder

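# LabelEncoder numbers categories in sorted (alphabetical) order.
# One encoder per feature keeps each fitted mapping available afterwards.
color_encoder = LabelEncoder()
size_encoder = LabelEncoder()
material_encoder = LabelEncoder()

color = ['red','green','blue']
size = ['small','medium','large']
material = ['wood','metal','plastic']

color_encoded = color_encoder.fit_transform(color)
size_encoded = size_encoder.fit_transform(size)
material_encoded = material_encoder.fit_transform(material)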

print("color encoded: ", color_encoded)
print("size encoded: ", size_encoded)
print("material encoded:", material_encoded)

color encoded:  [2 1 0]
size encoded:  [2 1 0]
material encoded: [2 0 1]
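
### Note that the alphabetical labels give small=2, medium=1, large=0, which reverses the natural order of the sizes. If that order matters to the model, one option is scikit-learn's `OrdinalEncoder` with an explicit category order. The sketch below is one possible setup; the DataFrame layout is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["small", "medium", "large"],
    "material": ["wood", "metal", "plastic"],
})

# One list of categories per column, in the order the integers should follow.
# Only "size" has a natural order; the other two are listed alphabetically.
encoder = OrdinalEncoder(
    categories=[
        ["blue", "green", "red"],
        ["small", "medium", "large"],
        ["metal", "plastic", "wood"],
    ],
    dtype=int,
)

# Each row is one item; columns are color, size, material.
encoded = encoder.fit_transform(df)
print(encoded)
```

### Here the size column becomes small=0, medium=1, large=2, while color and material keep the same integers as in the label-encoded output.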


# Q-5

In [3]:
import numpy as np

# Define the data for each variable
age = [30, 40, 35, 45, 50]
income = [50000, 60000, 55000, 70000, 80000]
education = [12, 16, 14, 18, 20]

# Stack the variables as rows (np.cov treats each row as a variable by default)
data = np.array([age, income, education])

# Calculate the sample covariance matrix (normalized by N - 1)
covariance_matrix = np.cov(data)

# Print the covariance matrix
print(covariance_matrix)


[[6.250e+01 9.375e+04 2.500e+01]
 [9.375e+04 1.450e+08 3.750e+04]
 [2.500e+01 3.750e+04 1.000e+01]]
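
### All entries are positive, so age, income, and education level tend to increase together in this sample. The magnitudes are hard to compare because income is measured in much larger units than the other two variables. A scale-free view is the correlation matrix, sketched below with `np.corrcoef` on the same data.

```python
import numpy as np

age = [30, 40, 35, 45, 50]
income = [50000, 60000, 55000, 70000, 80000]
education = [12, 16, 14, 18, 20]

# Rows are variables, matching the layout used for np.cov above.
data = np.array([age, income, education])

# Correlation = covariance divided by the product of standard deviations.
correlation_matrix = np.corrcoef(data)
print(np.round(correlation_matrix, 3))
```

### In this sample, age and education are perfectly correlated (education rises by exactly 0.4 for each year of age), and income has a correlation of about 0.98 with both.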


# Q-6

### Gender:
### Since the "Gender" variable has only two categories ("Male" and "Female"), binary (0/1) encoding or label encoding is sufficient. Both produce a single column, and the two values imply no meaningful order.
### Education Level:
### The "Education Level" variable has multiple categories with a natural order ("High School" < "Bachelor's" < "Master's" < "PhD"), so ordinal encoding is the natural choice. One-hot encoding is an alternative when the model should not assume equal spacing between levels.
### Employment Status:
### The "Employment Status" variable also has multiple categories ("Unemployed," "Part-Time," "Full-Time"). One-hot encoding is the safer default because it imposes no order. Ordinal encoding is reasonable only if the categories are treated as increasing levels of employment.
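### A minimal sketch of these choices with pandas, using a few invented rows. The column names and the particular pairing of encoders (binary for gender, ordinal for education, one-hot for employment status) are one reasonable option, not the only valid one.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "education_level": ["High School", "Master's", "PhD"],
    "employment_status": ["Unemployed", "Full-Time", "Part-Time"],
})

# Binary encoding for the two-category variable.
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})

# Ordinal encoding that follows the natural order of education levels.
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["education_level"] = df["education_level"].map(education_order)

# One-hot encoding: one indicator column per employment status.
encoded = pd.get_dummies(df, columns=["employment_status"])
print(encoded.columns.tolist())
```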

# Q-7

In [4]:
import numpy as np

# Define the data for the continuous variables
temperature = [20, 25, 30, 35, 40]
humidity = [50, 55, 60, 65, 70]

# Stack the continuous variables as rows (np.cov treats each row as a variable)
continuous_data = np.array([temperature, humidity])

# Calculate the sample covariance matrix (normalized by N - 1)
covariance_matrix = np.cov(continuous_data)

# Print the covariance matrix
print(covariance_matrix)

[[62.5 62.5]
 [62.5 62.5]]
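
### All four entries are equal because, in this sample, every humidity value is exactly the temperature plus 30. Adding a constant does not change how a variable varies, so the covariance between the two equals the variance of either one. Writing T for temperature and H for humidity (symbols introduced here for convenience), the arithmetic works out as:

$$
\begin{aligned}
\operatorname{cov}(T, H) &= \operatorname{cov}(T, T + 30) = \operatorname{var}(T) \\
&= \frac{(-10)^2 + (-5)^2 + 0^2 + 5^2 + 10^2}{5 - 1} = \frac{250}{4} = 62.5
\end{aligned}
$$

### The positive value means temperature and humidity rise together in this data, and since the covariance equals both variances, the correlation is 62.5 / (√62.5 · √62.5) = 1, a perfect positive linear relationship.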
