In [None]:
Ordinal Encoding: In ordinal encoding, each category is assigned an integer that reflects its position
in a known order, for example starting from 1. This encoding preserves the ordinal relationship between
categories, so it is appropriate only when the categories have an inherent order. For example, if you have
a categorical feature "Size" with categories "Small," "Medium," and "Large," you might encode them as
1, 2, and 3, respectively.

Label Encoding: Label encoding, on the other hand, also converts categorical values into integers,
but the integers are arbitrary identifiers and are not chosen to reflect any order between the categories.
It simply assigns a unique integer to each category. For example, if you have categories "Red," "Green,"
and "Blue," label encoding might assign them integers 1, 2, and 3, respectively. Note that a model which
treats these integers as magnitudes can still read an order into them, even though none was intended.

When to choose one over the other:

Use ordinal encoding when the categorical feature has an inherent order or hierarchy among its categories.
For example, size categories like "Small," "Medium," and "Large" can be encoded as 1, 2, and 3 to preserve
the order.
Use label encoding when the categorical feature has no meaningful order and the integers are only needed
as identifiers, for example when encoding a target variable or when the model does not interpret the
integers as magnitudes (tree-based models are a common case). For example, colors do not have a natural
order, so any integers assigned to them are arbitrary. If the model could mistake those integers for an
order, one-hot encoding is the safer choice.
Example:

Ordinal Encoding: If you have a dataset with a categorical feature "Education Level" with categories
"High School," "Bachelor's," "Master's," and "PhD," you might use ordinal encoding to encode them as 
1, 2, 3, and 4, respectively, to preserve the order.
Label Encoding: If you have a dataset with a categorical feature "Country" with categories "USA,"
"Canada," "France," and "Japan," you might use label encoding to encode them as 1, 2, 3, and 4,
respectively, since there is no inherent order among these countries.
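
The two examples above can be sketched with scikit-learn as follows. This is a minimal illustration, not the only way to do it: the sample values are made up, and scikit-learn numbers categories from 0 rather than from 1 as in the text. `OrdinalEncoder` is given the category order explicitly, while `LabelEncoder` simply numbers the classes in sorted order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordered feature: pass the order explicitly so the integers follow it.
education = np.array([["Bachelor's"], ["PhD"], ["High School"], ["Master's"]])
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_codes = ordinal_encoder.fit_transform(education).ravel()

# Unordered values: LabelEncoder sorts the classes and numbers them from 0,
# so the resulting integers carry no meaning beyond identity.
countries = ["USA", "Canada", "France", "Japan"]
label_encoder = LabelEncoder()
country_codes = label_encoder.fit_transform(countries)

print(education_codes)         # [1. 3. 0. 2.]
print(label_encoder.classes_)  # ['Canada' 'France' 'Japan' 'USA']
print(country_codes)           # [3 0 1 2]
```

Adding 1 to the codes reproduces the 1-based numbering used in the text.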

In [None]:
For each category in the categorical variable, calculate the mean or median of the target variable.
This means you need to have the target variable as a part of your dataset during encoding.

Sort the categories based on their mean or median values, assigning a rank or ordinal number to each
category. The category with the lowest mean or median value gets the lowest rank, and so on.

Replace the categories in the original categorical variable with their respective ranks.

Use the encoded variable in your machine learning model.
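
The steps above (often called target-guided ordinal encoding) can be sketched with pandas as follows. The column names `city` and `target` and the values are hypothetical, chosen only for illustration, and the mean is used rather than the median.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a numeric target.
df = pd.DataFrame({
    "city": ["A", "B", "C", "A", "B", "C", "A"],
    "target": [10, 30, 20, 14, 34, 22, 12],
})

# Step 1: mean of the target for each category (A: 12, B: 32, C: 21).
category_means = df.groupby("city")["target"].mean()

# Step 2: sort categories by their mean and assign ranks starting at 1.
ordered_categories = category_means.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ordered_categories, start=1)}

# Step 3: replace each category with its rank.
df["city_encoded"] = df["city"].map(rank_map)

print(rank_map)  # {'A': 1, 'C': 2, 'B': 3}
print(df)
```

Because the ranks are derived from the target, they would normally be computed on the training data only and then applied to validation or test data, so that information from the target does not leak into evaluation.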

In [None]:
Covariance is a statistical measure that describes the relationship between two random variables.
It indicates the degree to which two variables change together. In other words, it measures the extent 
to which two variables tend to move in the same direction (positive covariance) or in opposite directions
(negative covariance).

Covariance is important in statistical analysis for several reasons:

Relationship between variables: Covariance helps to understand the relationship between two variables. 
If the covariance is positive, it indicates that as one variable increases, the other variable also tends
to increase. If the covariance is negative, it indicates that as one variable increases, the other 
variable tends to decrease.

Direction of relationship: Covariance can help determine the direction of the relationship between 
variables. A positive covariance suggests a positive relationship, while a negative covariance suggests
a negative relationship.

Strength of relationship: The magnitude of the covariance depends on the units and scale of the
variables, so on its own it is a poor indicator of how strong the relationship is. Rescaling a variable
(for example, from meters to centimeters) changes the covariance without changing the relationship.
To compare strength, use a standardized measure such as correlation.

Use in other statistical measures: Covariance is used in calculating other statistical measures such as 
correlation, which is a standardized measure of the relationship between two variables.
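
As a small sketch of the points above, the snippet below computes the sample covariance by hand, cov(X, Y) = sum((x_i - mean(x)) * (y_i - mean(y))) / (n - 1), and compares it with the correlation. The height and weight values are made up for illustration, and `sample_covariance` is a helper defined here, not a library function.

```python
import numpy as np

# Made-up measurements for illustration.
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([50.0, 60.0, 64.0, 76.0, 80.0])

def sample_covariance(x, y):
    """Sample covariance with the usual n - 1 denominator."""
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1))

# The same relationship, with height expressed in meters and in centimeters.
cov_m = sample_covariance(height_m, weight_kg)
cov_cm = sample_covariance(height_m * 100, weight_kg)

# Correlation divides the covariance by both standard deviations,
# which removes the dependence on units.
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_m * 100, weight_kg)[0, 1]

print("Covariance (m):", cov_m)
print("Covariance (cm):", cov_cm)
print("Correlation (m):", corr_m)
print("Correlation (cm):", corr_cm)
```

The covariance is positive in both cases, but it grows by a factor of 100 when the unit changes, while the correlation stays the same.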

In [None]:
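import pandas as pd
from sklearn.preprocessing import LabelEncoder


# Build a DataFrame so that the columns can be iterated over and replaced.
data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
})

label_encoder = LabelEncoder()

# fit_transform refits the encoder on each column, numbering the sorted
# category names from 0.
for column in data.columns:
    data[column] = label_encoder.fit_transform(data[column])

print(data)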


In [2]:
import numpy as np

data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 70000, 80000, 90000],
    'Education': [12, 14, 16, 18, 20]
}

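# Stack the variables as rows: by default np.cov treats each row as a
# variable and each column as an observation.
dataset = np.array([data['Age'], data['Income'], data['Education']])

# 3x3 sample covariance matrix; the diagonal holds each variable's variance.
covariance_matrix = np.cov(dataset)

print(covariance_matrix)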


[[6.25e+01 1.25e+05 2.50e+01]
 [1.25e+05 2.50e+08 5.00e+04]
 [2.50e+01 5.00e+04 1.00e+01]]


In [None]:
Gender (Binary):

Encoding Method: Use label encoding or one-hot encoding.
Explanation: Since gender is binary (Male/Female), you can use label encoding (assigning 0 or 1)
or one-hot encoding (creating two binary columns).

Education Level (Ordinal):

Encoding Method: Use ordinal encoding.
Explanation: Education level has a clear order (High School < Bachelor's < Master's < PhD), so ordinal
encoding (assigning integer values based on the order) is appropriate.

Employment Status (Nominal):

Encoding Method: Use one-hot encoding.
Explanation: Employment status is nominal (no inherent order), so one-hot encoding
(creating binary columns for each category) is suitable to avoid introducing unintended ordinality.
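
The three choices above can be sketched with pandas as follows. The rows are hypothetical, and so are the employment status values ("Employed," "Unemployed," "Self-Employed"), which are assumed here only for illustration.

```python
import pandas as pd

# Hypothetical rows; the employment categories are assumed for this sketch.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "High School", "PhD", "Master's"],
    "Employment Status": ["Employed", "Unemployed", "Self-Employed", "Employed"],
})

# Binary feature: a single 0/1 column is enough.
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})

# Ordinal feature: integers follow the stated order of the levels.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_map = {level: rank for rank, level in enumerate(education_order)}
df["Education Level"] = df["Education Level"].map(education_map)

# Nominal feature: one 0/1 column per category, so no order is implied.
df = pd.get_dummies(df, columns=["Employment Status"], dtype=int)

print(df)
```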

In [None]:
import numpy as np


data = {
    'Temperature': [25, 28, 22, 20, 30],
    'Humidity': [50, 60, 45, 55, 65],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

temperature = np.array(data['Temperature'], dtype=float)

# Both variables are numeric, so the sample covariance can be computed directly.
cov_temp_humidity = np.cov(temperature, data['Humidity'])[0, 1]


def indicator_covariances(values, numeric):
    """Covariance between `numeric` and a 0/1 indicator for each category in `values`."""
    values = np.array(values)
    return {
        category: float(np.cov(numeric, (values == category).astype(float))[0, 1])
        for category in sorted(set(values.tolist()))
    }


# Covariance is not defined for raw category labels, and np.cov requires both
# inputs to have the same length, so each category is first turned into a
# 0/1 indicator (one-hot) variable over all observations.
cov_temp_weather = indicator_covariances(data['Weather Condition'], temperature)
cov_temp_wind = indicator_covariances(data['Wind Direction'], temperature)

print("Covariance between Temperature and Humidity:", cov_temp_humidity)
print("Covariance between Temperature and Weather Condition:", cov_temp_weather)
print("Covariance between Temperature and Wind Direction:", cov_temp_wind)
