### Question 1

In [None]:
'''
Ordinal encoding is used when the categories have an inherent order, e.g. 'small' < 'medium' < 'large'. The integer codes are assigned so that they preserve that order.

Label encoding simply assigns a unique integer to each unique category, with no order implied, e.g. 'red', 'blue', 'green'.

'''
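
The difference can be sketched with scikit-learn. The sample lists `sizes` and `colors` below are made up for illustration; the explicit `categories` argument is what tells `OrdinalEncoder` the intended order, while `LabelEncoder` simply numbers the sorted category names.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, chosen only for illustration
sizes = ["small", "large", "medium", "small"]
colors = ["red", "blue", "green", "red"]

# Ordinal encoding: the order is stated explicitly, so the codes follow
# small < medium < large
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal_encoder.fit_transform([[s] for s in sizes]).ravel()

# Label encoding: codes follow the sorted category names and carry no meaning
label_encoder = LabelEncoder()
color_codes = label_encoder.fit_transform(colors)

print(size_codes)
print(color_codes)
```

Here the size codes can be compared meaningfully (a larger code means a larger size), whereas the color codes are only identifiers.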

### Question 2

In [None]:
'''
Target Guided Ordinal Encoding is a technique used to encode categorical variables based on their relationship with the target variable in a supervised machine learning project. 
It assigns ordinal labels to the categories based on the target variable's mean or median value for each category. This encoding captures the monotonic relationship between the 
categorical variable and the target variable, making it useful when the categorical variable has predictive power for the target variable.

Here's an example to illustrate the application of Target Guided Ordinal Encoding:

Suppose you are working on a credit card default prediction project, and one of the categorical variables in the dataset is "Education Level" with categories "High School,"
"College," and "Graduate." The target variable is "Default" indicating whether a customer defaults on their credit card payment.

To apply Target Guided Ordinal Encoding, you would calculate the default rate (the mean of the binary "Default" target) for each category of the "Education Level" variable. 
The categories are then assigned ordinal labels in order of their default rates. For instance, if the default rates were 0.2, 0.4, and 0.6 for the "High School," "College," 
and "Graduate" categories, respectively, you would assign the labels 1, 2, and 3 in ascending order of default rate.

By encoding the "Education Level" variable using Target Guided Ordinal Encoding, the model can capture the information about the target variable's behavior for each category. 
This encoding technique takes advantage of the relationship between the categorical variable and the target variable, potentially improving the model's predictive performance.

Target Guided Ordinal Encoding can be particularly useful when there is a clear monotonic relationship between the categorical variable and the target variable. It allows the 
model to leverage this relationship during training, enhancing its ability to make accurate predictions. However, it's important to note that this encoding technique assumes 
a monotonic relationship and may not be suitable for all categorical variables or datasets.

'''
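
A minimal sketch of the steps above with pandas. The DataFrame is synthetic and built so that the default rates come out to the 0.2, 0.4, and 0.6 used in the example; the column names `education_level` and `default` are assumptions, not taken from a real dataset.

```python
import pandas as pd

# Synthetic data: 5 customers per category, with 1, 2 and 3 defaults respectively
df = pd.DataFrame({
    "education_level": ["High School"] * 5 + ["College"] * 5 + ["Graduate"] * 5,
    "default": [1, 0, 0, 0, 0,
                1, 1, 0, 0, 0,
                1, 1, 1, 0, 0],
})

# Step 1: mean of the target per category (the default rate)
default_rate = df.groupby("education_level")["default"].mean()

# Step 2: order the categories by that rate and assign ranks 1, 2, 3, ...
ordered_categories = default_rate.sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered_categories, start=1)}

# Step 3: replace each category with its rank
df["education_encoded"] = df["education_level"].map(mapping)

print(default_rate)
print(mapping)
```

In a real project the mapping would normally be learned from the training split only and then applied to validation and test data, so that information from the target does not leak into evaluation.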

### Question 3

In [None]:
'''
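Covariance is a statistical measure of how two variables vary together. It is positive when the variables tend to increase together, negative when one tends to 
decrease as the other increases, and close to zero when there is no linear relationship between them.

The formula to calculate the sample covariance between two variables X and Y is as follows: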

Cov(X, Y) = Σ [(Xᵢ - X̄) * (Yᵢ - Ȳ)] / (n - 1)

where Xᵢ and Yᵢ are the individual values of X and Y, X̄ and Ȳ are the means of X and Y, 
and n is the number of observations.

In this formula, (Xᵢ - X̄) represents the deviation of each X value from its mean, and 
(Yᵢ - Ȳ) represents the deviation of each Y value from its mean. The covariance is 
obtained by summing the product of these deviations and dividing by (n - 1).

'''
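
The formula translates almost line by line into plain Python. This is a sketch for illustration: the function name `sample_covariance` and the values of `x` and `y` are made up.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by (n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum the products of the deviations from each mean
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (n - 1)


# Hypothetical values
x = [2, 4, 6, 8]
y = [1, 3, 2, 6]

print(sample_covariance(x, y))
```

For these values the deviations multiply to 6, 0, -1 and 9, so the result is 14 / 3. The covariance of a variable with itself, `sample_covariance(x, x)`, is its sample variance.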

### Question 4

In [5]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Create a sample dataset
data = {
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
}

df = pd.DataFrame(data)

# Perform label encoding
label_encoder = LabelEncoder()

df_encoded = df.copy()
for column in df_encoded.columns:
    df_encoded[column] = label_encoder.fit_transform(df_encoded[column])

print(df_encoded)


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      2     2         2
4      0     1         0


In [None]:
'''
Explanation:
In the code, we import the necessary libraries and create a sample dataset with three categorical variables: Color, Size, and Material. We then use the LabelEncoder 
class to encode each column, mapping each unique category in a column to an integer label. LabelEncoder assigns the labels in sorted (alphabetical) order of the category names.

The resulting df_encoded DataFrame contains the encoded values for each categorical variable. 
For example, in the "Color" column, "blue" is encoded as 0, "green" as 1, and "red" as 2. Similarly, the "Size" column is encoded as 0 for "large", 1 for "medium", 
and 2 for "small". The "Material" column is encoded as 0 for "metal", 1 for "plastic", and 2 for "wood".

Label encoding is useful because it gives machine learning algorithms a numerical representation of the categories. However, the integer labels imply an order that is 
arbitrary. For "Size", the alphabetical codes (large=0, medium=1, small=2) even reverse the natural order, and "Color" and "Material" have no order at all, so a model 
that treats the codes as magnitudes may be misled.

'''
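
Because the loop above refits a single `LabelEncoder` for every column, only the last column's mapping is still available afterwards. One possible variation, sketched here, keeps one encoder per column in a dictionary (the name `encoders` is an assumption) so the mappings can be inspected and reversed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same sample dataset as in the cell above
df = pd.DataFrame({
    "Color": ["red", "green", "blue", "red", "blue"],
    "Size": ["small", "medium", "large", "small", "medium"],
    "Material": ["wood", "metal", "plastic", "wood", "metal"],
})

# Keep a separate fitted encoder for each column
encoders = {}
df_encoded = df.copy()
for column in df.columns:
    encoders[column] = LabelEncoder()
    df_encoded[column] = encoders[column].fit_transform(df[column])

# classes_ lists the categories in the order of their integer codes
for column, encoder in encoders.items():
    mapping = {str(category): code for code, category in enumerate(encoder.classes_)}
    print(column, mapping)

# The stored encoder can also map the codes back to the original labels
original_sizes = encoders["Size"].inverse_transform(df_encoded["Size"])
print(list(original_sizes))
```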

### Question 5

In [6]:
import numpy as np

# Create a sample dataset
age = [25, 30, 35, 28, 42]
income = [50000, 60000, 70000, 55000, 80000]
education_level = [12, 16, 14, 18, 15]

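# Create a matrix with the variables: each row is one variable and each
# column is one observation, which is the layout np.cov expects by default
data = np.array([age, income, education_level])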

# Calculate the covariance matrix
covariance_matrix = np.cov(data)

print(covariance_matrix)


[[4.45e+01 8.00e+04 1.00e+00]
 [8.00e+04 1.45e+08 1.25e+03]
 [1.00e+00 1.25e+03 5.00e+00]]
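
The raw array has no labels: its rows and columns follow the order in which the variables were stacked (age, income, education_level). As a reading aid, the same matrix can be computed with pandas, which keeps the variable names. This is only an alternative sketch, not part of the original answer.

```python
import pandas as pd

# Same sample values as in the cell above
df = pd.DataFrame({
    "age": [25, 30, 35, 28, 42],
    "income": [50000, 60000, 70000, 55000, 80000],
    "education_level": [12, 16, 14, 18, 15],
})

# DataFrame.cov() uses the same (n - 1) denominator as np.cov by default
cov = df.cov()
print(cov)
```

The diagonal entries are the variances of each variable, and the positive off-diagonal entries indicate that each pair tends to increase together in this sample. The magnitudes depend on the units of the variables, so the age/income covariance (80000) cannot be compared directly with the age/education covariance (1).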


### Question 6

In [None]:
'''
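For the Gender column: binary encoding (a single 0/1 column), assuming it has two categories with no order.

For Education Level: ordinal encoding, because the levels have a natural order.

For Employment Status: one-hot encoding, because the categories are nominal with no inherent order.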

'''
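
A minimal sketch of the three choices with pandas. The category values (for example "Bachelor's" or "Part-Time") and the order in `education_order` are assumptions made for illustration; they are not given in the answer above.

```python
import pandas as pd

# Hypothetical rows; the category names are assumed for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "Bachelor's", "PhD", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

encoded = pd.DataFrame(index=df.index)

# Binary encoding: two categories map to a single 0/1 column
encoded["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})

# Ordinal encoding: integer codes follow the assumed order of the levels
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_codes = {level: code for code, level in enumerate(education_order)}
encoded["Education Level"] = df["Education Level"].map(education_codes)

# One-hot encoding: one 0/1 column per employment status
status_dummies = pd.get_dummies(df["Employment Status"], prefix="Employment", dtype=int)
encoded = pd.concat([encoded, status_dummies], axis=1)

print(encoded)
```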

### Question 7

In [8]:
import numpy as np

# Create a sample dataset
temperature = [25, 28, 30, 22, 24]
humidity = [60, 65, 70, 55, 58]
weather_condition = ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny']
wind_direction = ['North', 'South', 'East', 'West', 'North']

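# Create a matrix with the numeric variables only. Covariance is defined for
# numeric data, so weather_condition and wind_direction are left out here
data = np.array([temperature, humidity])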

# Calculate the covariance matrix
covariance_matrix = np.cov(data)

print(covariance_matrix)


[[10.2 18.9]
 [18.9 35.3]]
