# Feature Engineering-5 Assignment

# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

# Answer-1-Ordinal encoding and label encoding are both techniques used to convert categorical data into numerical format, but they differ in how they handle the encoding of the categorical variables.

# Ordinal Encoding: Ordinal encoding is used when the categorical data has an inherent order or ranking among the categories. Each category is assigned a unique integer based on its position in that order. For instance, categories like 'low,' 'medium,' and 'high' might be encoded as 0, 1, and 2, respectively. Ordinal encoding implies a ranked relationship between categories.

# Label Encoding: Label encoding, on the other hand, is a more general technique for converting categorical data into numerical form. It assigns a unique integer to each category without considering any inherent order or hierarchy, so the integers carry no ranking implication. For instance, scikit-learn's LabelEncoder assigns integers in sorted (alphabetical) order, so 'blue,' 'green,' and 'red' would be encoded as 0, 1, and 2, respectively.

# Example: Let's consider a dataset containing the education level of individuals.

# Ordinal Encoding: Suppose the education levels are 'High School,' 'Associate's Degree,' 'Bachelor's Degree,' 'Master's Degree,' and 'Ph.D.' In this case, since there's a clear order among the education levels, you might choose to use ordinal encoding to represent these levels with integer values such as 0, 1, 2, 3, and 4, respectively, to maintain the hierarchy of education.

# Label Encoding: If you have a different categorical feature in the same dataset, such as 'Color' with categories 'Red,' 'Green,' and 'Blue,' and there is no inherent order among these colors, label encoding would be more appropriate. Each color category would be assigned a different integer without implying any ordinal relationship among them.
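
# The contrast above can be sketched with scikit-learn. This is a minimal illustration, not the only way to do it: the sample rows are invented, and the variable names (`education_order`, `education`, `colors`) are assumptions made for the example.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Education levels listed from lowest to highest (order taken from the example above).
education_order = ["High School", "Associate's Degree", "Bachelor's Degree",
                   "Master's Degree", "Ph.D."]

# Illustrative values only.
education = [["Bachelor's Degree"], ["High School"], ["Ph.D."], ["Master's Degree"]]
colors = ["Red", "Green", "Blue", "Green"]

# OrdinalEncoder accepts the category order explicitly, so the integers follow the ranking.
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_encoded = ordinal_encoder.fit_transform(education).ravel()

# LabelEncoder sorts the categories alphabetically; the integers carry no ranking.
label_encoder = LabelEncoder()
colors_encoded = label_encoder.fit_transform(colors)

print(education_encoded)               # [2. 0. 4. 3.]
print(colors_encoded)                  # [2 1 0 1]
print(label_encoder.classes_.tolist()) # ['Blue', 'Green', 'Red']
```

# Passing the order to OrdinalEncoder is what keeps 'High School' below 'Ph.D.'; without it, the categories would be sorted alphabetically as well.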

# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

# Answer-2-Target Guided Ordinal Encoding assigns ordinal labels to the categories of a feature according to their relationship with the target variable. Each category is ranked by an aggregate of the target (such as its mean) within that category, so the resulting integers reflect how each category is associated with the target.

# Here is how Target Guided Ordinal Encoding works:

# Calculate the Mean/Median/Another Aggregation by Category: For each category in the categorical variable, calculate the mean, median, or other relevant aggregation of the target variable. For instance, if predicting customer churn, calculate the average churn rate for each category of a feature (like customer type or service usage).
# Order the Categories Based on the Aggregated Values: Sort the categories by their aggregated values in ascending or descending order. This ordering reflects the relationship between each category and the target variable.
# Assign Ordinal Labels: Assign ordinal values to the categories according to that order. The category with the highest average (or lowest, depending on the goal) is assigned the highest label, and so on.

# Example: Consider a machine learning project that predicts the risk level of loans for a financial institution. You have a categorical variable "Credit Score Category" with labels like 'Poor,' 'Fair,' 'Good,' 'Very Good,' and 'Excellent.'

# To use Target Guided Ordinal Encoding in this scenario:

# Calculate the average default rates (target variable) for each 'Credit Score Category.'
# Order the categories based on the average default rates (from highest to lowest or vice versa).
# Assign ordinal labels based on the ordered default rates to represent the risk level: for example, 'Poor' might be labeled as 4, 'Fair' as 3, 'Good' as 2, 'Very Good' as 1, and 'Excellent' as 0, indicating decreasing risk.
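
# The three steps can be sketched in pandas as follows. The loan records are invented purely for illustration (1 = default, 0 = repaid), and the column names are assumptions; in a real project the mapping would normally be learned on the training split only, so that target information does not leak into validation data.

```python
import pandas as pd

# Hypothetical default outcomes per category (values invented for illustration).
defaults_by_category = {
    "Poor": [1, 1, 1, 1],
    "Fair": [1, 1, 1, 0],
    "Good": [1, 1, 0, 0],
    "Very Good": [1, 0, 0, 0],
    "Excellent": [0, 0, 0, 0],
}
loans = pd.DataFrame(
    [(category, outcome)
     for category, outcomes in defaults_by_category.items()
     for outcome in outcomes],
    columns=["credit_score_category", "default"],
)

# Step 1: aggregate the target (mean default rate) within each category.
default_rate = loans.groupby("credit_score_category")["default"].mean()

# Step 2: order the categories from lowest to highest default rate.
ordered_categories = default_rate.sort_values().index

# Step 3: assign each category its rank in that order.
encoding = {category: rank for rank, category in enumerate(ordered_categories)}
loans["credit_score_encoded"] = loans["credit_score_category"].map(encoding)

print(encoding)
# {'Excellent': 0, 'Very Good': 1, 'Good': 2, 'Fair': 3, 'Poor': 4}
```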

# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

# Answer-3-Covariance is a statistical measure that describes the relationship between two random variables. It measures how changes in one variable are associated with changes in another variable. In essence, it quantifies the degree to which two variables change together.

# Importance of Covariance in Statistical Analysis:

# Relationship between Variables: Covariance indicates whether the two variables tend to increase or decrease together. A positive covariance suggests that as one variable increases, the other tends to increase as well. A negative covariance indicates that as one variable increases, the other tends to decrease.

# Direction and Scale: The sign of the covariance gives the direction (positive or negative) of the linear relationship between the variables. Its magnitude, however, depends on the units of both variables, so it cannot be read directly as the strength of the relationship. Normalizing the covariance by the two standard deviations gives the correlation coefficient, which is unit-free and lies between -1 and 1.
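
# How covariance is calculated: for n paired observations (x_i, y_i) with sample means x̄ and ȳ, the sample covariance is the sum of the products of the deviations from the means, divided by n - 1. The population version divides by n instead.

$$
\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$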

In [1]:
import numpy as np

In [2]:
data_X = np.array([1, 2, 3, 4, 5])
data_Y = np.array([5, 4, 3, 2, 1])

In [3]:
# np.cov returns the 2x2 sample covariance matrix; [0, 1] is the covariance of X with Y
covariance = np.cov(data_X, data_Y)[0, 1]

In [4]:
print(f"The covariance between X and Y is: {covariance}")

The covariance between X and Y is: -2.5
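
# As a check on the value above, the same number can be reproduced by hand from the deviations of each variable from its mean. This is a small sketch; the names `dev_X`, `dev_Y`, and `manual_covariance` are introduced here for illustration.

```python
import numpy as np

data_X = np.array([1, 2, 3, 4, 5])
data_Y = np.array([5, 4, 3, 2, 1])

# Deviation of each observation from its variable's mean.
dev_X = data_X - data_X.mean()
dev_Y = data_Y - data_Y.mean()

# Sample covariance: sum of the products of deviations divided by n - 1.
n = len(data_X)
manual_covariance = (dev_X * dev_Y).sum() / (n - 1)

print(manual_covariance)                     # -2.5
print(np.cov(data_X, data_Y)[0, 1])          # -2.5
# Population covariance (divide by n) for comparison.
print(np.cov(data_X, data_Y, ddof=0)[0, 1])  # -2.0
```

# The negative sign reflects that Y falls by one unit each time X rises by one.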


# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [5]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

In [6]:
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
}

In [7]:
df = pd.DataFrame(data)

In [8]:
label_encoder = LabelEncoder()

In [9]:
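for col in df.columns:
    if df[col].dtype == 'object':  # Only encode string (categorical) columns
        # fit_transform refits the encoder on each column and assigns
        # integer codes in sorted (alphabetical) order of the category names
        df[col + '_encoded'] = label_encoder.fit_transform(df[col])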

In [10]:
df

   Color    Size  Material  Color_encoded  Size_encoded  Material_encoded
0    red   small      wood              2             2                 2
1  green  medium     metal              1             1                 0
2   blue   large   plastic              0             0                 1
3  green   small      wood              1             2                 2
4    red   large     metal              2             0                 0
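
# To explain the output: LabelEncoder sorts each column's categories alphabetically and numbers them from 0. The sketch below makes the mapping explicit by fitting one encoder per column and reading its `classes_` attribute; the `encoders` and `mappings` dictionaries are names introduced for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal'],
})

# One fitted encoder per column, so each mapping can be inspected or inverted later.
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}

# classes_ holds the sorted categories; a category's position is its integer code.
mappings = {
    col: {label: code for code, label in enumerate(encoder.classes_.tolist())}
    for col, encoder in encoders.items()
}

for col, mapping in mappings.items():
    print(col, mapping)
# Color {'blue': 0, 'green': 1, 'red': 2}
# Size {'large': 0, 'medium': 1, 'small': 2}
# Material {'metal': 0, 'plastic': 1, 'wood': 2}
```

# Note that Size comes out as large=0, medium=1, small=2, which does not follow its natural order. The codes are identifiers only and should not be read as rankings.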


# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [11]:
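# Answer-5- No Age, Income, or Education level data is given, so the covariance
# matrix is computed on the label-encoded columns from Q4 instead. Those integer
# codes are arbitrary, so the values only illustrate the mechanics.

In [12]:
# numeric_only=True skips the original string columns, which have no covariance
covariance_matrix = df.cov(numeric_only=True)
print(covariance_matrix)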

                  Color_encoded  Size_encoded  Material_encoded
Color_encoded              0.70          0.25              0.00
Size_encoded               0.25          1.00              0.75
Material_encoded           0.00          0.75              1.00
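
# For the variables the question actually names, a sketch could look like the following. All values are hypothetical and invented purely for illustration, as are the `people` DataFrame and the `education_rank` mapping; Education level is given an explicit ordinal code before the covariance is computed.

```python
import pandas as pd

# Hypothetical values, invented purely for illustration.
people = pd.DataFrame({
    "Age": [25, 32, 41, 50, 58],
    "Income": [30000, 42000, 55000, 61000, 70000],
    "Education": ["High School", "Bachelor's", "Bachelor's", "Master's", "PhD"],
})

# Education is ordinal, so map it to integers that follow its natural order.
education_rank = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
people["Education_encoded"] = people["Education"].map(education_rank)

# Sample covariance matrix (n - 1 denominator) of the three numeric columns.
covariance_matrix = people[["Age", "Income", "Education_encoded"]].cov()
print(covariance_matrix)
```

# When reading such a matrix, the diagonal holds each variable's variance and the off-diagonal entries give the direction of each pairwise relationship. Because Income is measured in much larger units than Age or the education code, its entries dominate in size; comparing strengths across pairs calls for the correlation matrix instead.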




# Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

# Answer-6-In a machine learning project involving categorical variables like "Gender," "Education Level," and "Employment Status," the choice of encoding method for each variable depends on the nature of the data within each categorical feature.

# Here are the encoding methods and explanations for each variable:

# Gender (Binary Categorical Variable - Nominal):
# Encoding Method: Label Encoding or One-Hot Encoding can be used for "Gender."
# Explanation: Since "Gender" has only two categories here (Male/Female), a single binary column (0 and 1), produced by either Label Encoding or One-Hot Encoding with one column dropped, represents this nominal variable without implying any order or hierarchy.

# Education Level (Categorical Variable with Order):
# Encoding Method: Ordinal Encoding with an explicitly specified category order should be used for "Education Level."
# Explanation: "Education Level" contains categories with a natural order (High School, Bachelor's, Master's, PhD). Assigning 0, 1, 2, 3 in that order preserves the ordinal relationship. A plain label encoder that sorts categories alphabetically would not preserve it, so the order has to be supplied.

# Employment Status (Categorical Variable without Order):
# Encoding Method: One-Hot Encoding is suitable for "Employment Status."
# Explanation: Treating "Employment Status" (Unemployed, Part-Time, Full-Time) as unordered, One-Hot Encoding creates one binary column per category, which prevents the model from assuming any order or distance between the statuses.
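
# The three choices can be sketched together as follows. The rows are hypothetical, and the names `people`, `education_order`, and `encoded` are introduced for this example only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows for illustration.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "High School", "PhD", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories, so a single 0/1 column is enough.
people["Gender_encoded"] = (people["Gender"] == "Male").astype(int)

# Education Level: ordinal, so the category order is passed explicitly.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
people["Education_encoded"] = (
    ordinal_encoder.fit_transform(people[["Education Level"]]).ravel().astype(int)
)

# Employment Status: treated as nominal, so one binary column per category.
employment_dummies = pd.get_dummies(
    people["Employment Status"], prefix="Employment", dtype=int
)

encoded = pd.concat(
    [people[["Gender_encoded", "Education_encoded"]], employment_dummies], axis=1
)
print(encoded)
```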

# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

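# Answer-7-Covariance is defined for numeric variables, so it is computed here for "Temperature" and "Humidity" only. "Weather Condition" and "Wind Direction" are nominal and are left out of the calculation. The sample values are illustrative.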

In [13]:
data = {
    'Temperature': [25, 27, 22, 20, 24],
    'Humidity': [60, 65, 70, 55, 58],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

In [14]:
df = pd.DataFrame(data)
continuous_variables = df[['Temperature', 'Humidity']]
covariance_matrix = continuous_variables.cov()

In [15]:
covariance_matrix

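             Temperature  Humidity
Temperature         7.30      4.55
Humidity            4.55     35.30

# Interpretation: The diagonal entries are the sample variances of Temperature (7.30) and Humidity (35.30). The off-diagonal entry, 4.55, is positive, so in this small sample higher temperatures tend to coincide with higher humidity. Because covariance depends on the units of both variables, its size alone does not show how strong the relationship is.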
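
# One way to bring the categorical variables into the covariance matrix is to one-hot encode them first. This is a sketch of that option, not the only valid approach; the names `encoded` and `full_covariance` are introduced for illustration.

```python
import pandas as pd

data = {
    'Temperature': [25, 27, 22, 20, 24],
    'Humidity': [60, 65, 70, 55, 58],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}
df = pd.DataFrame(data)

# Replace each nominal column with one 0/1 indicator column per category.
encoded = pd.get_dummies(
    df, columns=['Weather Condition', 'Wind Direction'], dtype=float
)

# Sample covariance between every pair of numeric and indicator columns.
full_covariance = encoded.cov()

# Show how the two continuous variables co-vary with every column.
print(full_covariance.loc[['Temperature', 'Humidity']].round(2))
```

# The covariance between a continuous variable and an indicator column is positive when that variable is above its overall mean, on average, within the category, and negative when it is below. For example, the single Rainy observation is cooler than the sample mean, so the Temperature and Rainy indicator covariance is negative.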


# Assignment Completed