## Feature Engineering-4

In [2]:
# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose
# one over the other.

# Ans:

# Label Encoding:
# Assigns a unique integer to each category.
# The integers are arbitrary codes (scikit-learn's LabelEncoder assigns them in sorted order), so they do not
# reflect any order or ranking among the categories.
# Can be useful for nominal categorical data where there's no inherent order, and for encoding target labels.

# Ordinal Encoding:
# Assigns integers to categories based on their order or rank.
# Preserves the order information.
# Suitable for ordinal categorical data where the order matters.

# Example:

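# Consider an "Education Level" variable with categories: "High School", "Bachelor's Degree" and "Master's Degree".
# Ordinal encoding, given the order explicitly, would assign:
# High School = 0
# Bachelor's Degree = 1
# Master's Degree = 2
# Label encoding sorts the names instead, giving Bachelor's Degree = 0, High School = 1, Master's Degree = 2,
# which loses the ranking.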


# When to Choose One Over the Other:

# Label Encoding: Use when the categorical variable is nominal (no order). For example, "City"
# (New York, London, Tokyo) or "Product Category" (Electronics, Clothing, Books).

# Ordinal Encoding: Use when the categorical variable is ordinal (has a meaningful order). For example, 
# "Customer Satisfaction" (Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied) or "Size" 
# (Small, Medium, Large).   
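
The difference can be seen on a small sample. This is a minimal sketch, assuming scikit-learn is installed; the `levels` list is made-up illustration data, not part of the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values for the "Education Level" variable
levels = ["High School", "Master's Degree", "Bachelor's Degree", "High School"]

# LabelEncoder sorts the categories, so the codes carry no rank
label_codes = LabelEncoder().fit_transform(levels)

# OrdinalEncoder accepts an explicit order, listed from lowest to highest
order = [["High School", "Bachelor's Degree", "Master's Degree"]]
ordinal_codes = OrdinalEncoder(categories=order).fit_transform(
    np.array(levels).reshape(-1, 1)
).ravel()

print(label_codes)    # codes follow sorted (alphabetical) order
print(ordinal_codes)  # codes follow the order given in `order`
```

With the sorted codes, "High School" (1) lands between the two degrees, while the ordinal codes keep High School < Bachelor's Degree < Master's Degree.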



In [3]:
# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
# a machine learning project.

# Ans:

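# Target Guided Ordinal Encoding is an encoding technique that assigns values to categorical data based on
# its relationship with the dependent (target) feature. For each category we compute a statistic of the target,
# such as the mean or median, and then either use that statistic as the encoded value or rank the categories by it.
# It is typically useful when a feature has many categories and one-hot encoding would create too many columns.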

# Example:
# Let's say we have a dataset with the features city and rent, with the values
# ['BBSR', 'CTC', 'BBSR', 'Delhi', 'CTC', 'Delhi'] and [10000, 8000, 12000, 16000, 6000, 18000] respectively.

# If we replace each city with the mean rent of that city, the values would be:
# BBSR: (10000+12000)/2 = 11000
# CTC: (8000+6000)/2 = 7000
# Delhi: (16000+18000)/2 = 17000
# Ranking the cities by these means gives the ordinal codes CTC = 0, BBSR = 1, Delhi = 2.
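
The same calculation can be sketched with pandas. The column names `city_mean_encoded` and `city_rank_encoded` are illustrative choices; in a real project the means would be computed on the training split only, so that the target does not leak into validation data.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['BBSR', 'CTC', 'BBSR', 'Delhi', 'CTC', 'Delhi'],
    'rent': [10000, 8000, 12000, 16000, 6000, 18000],
})

# Mean rent per city: the target statistic that guides the encoding
city_mean = df.groupby('city')['rent'].mean()

# Option 1: replace each city with its mean rent
df['city_mean_encoded'] = df['city'].map(city_mean)

# Option 2: rank the cities by mean rent to get ordinal codes (0 = lowest mean)
city_rank = city_mean.rank(method='dense').astype(int) - 1
df['city_rank_encoded'] = df['city'].map(city_rank)

print(df)
```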

In [4]:
# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

# Ans:

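# Covariance is a measure of how two variables vary together. A positive covariance means they tend to
# increase together, a negative covariance means one tends to decrease as the other increases, and a value
# near zero means there is no linear relationship.
# It is important because it shows the direction of the relationship between features and is the basis for
# correlation and for techniques such as PCA. Its magnitude depends on the units of the variables, so it
# shows direction but not strength.

# Formula of covariance:
# Let's say we have two variables, x and y, and we are calculating for sample data:
# cov(x, y) = Σ[(xi - x̄)(yi - ȳ)] / (N - 1), where i runs from 1 to N
# x̄: Sample mean of x
# ȳ: Sample mean of y
# N: Number of observations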
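
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with NumPy. The arrays `x` and `y` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])

n = len(x)

# Sum of products of deviations from the means, divided by N - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree because `np.cov` also divides by N - 1 by default.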

In [None]:
# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,large), and 
# Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and 
# explain the output.

# Ans:


import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Step 1: Create a sample dataset
data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
})


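# Step 2: Keep one fitted encoder per column so the codes can be mapped back later
label_encoders = {}

# Step 3: Fit a separate LabelEncoder on each column and replace the strings with integer codes
for column in data.columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Step 4: Display the encoded DataFrame
print(data)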


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     2         0
4      2     0         2


In [6]:
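# Explanation of the Output:
# Label Encoding converts each category into a unique integer value.
# LabelEncoder sorts the categories, so for strings the codes follow alphabetical order:
# Color: blue = 0, green = 1, red = 2
# Size: large = 0, medium = 1, small = 2
# Material: metal = 0, plastic = 1, wood = 2
# Note that Size ends up as large < medium < small, which does not match its natural order.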
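
The mapping used for each column can be read from the fitted encoder's `classes_` attribute. A minimal, self-contained sketch that rebuilds the same sample data (the `mappings` dictionary is an illustrative helper, not part of the original answer):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
})

# classes_ holds the sorted categories; a category's position is its integer code
mappings = {}
for column in data.columns:
    le = LabelEncoder().fit(data[column])
    mappings[column] = {str(category): code for code, category in enumerate(le.classes_)}

print(mappings)
```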

In [7]:
# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
# level. Interpret the results.


# Ans:

import pandas as pd

# Step 1: Create a dataset
data = pd.DataFrame({
    'Age': [25, 32, 40, 50, 28],
    'Income': [50, 60, 80, 100, 55],  # in thousands
    'Education': [16, 18, 14, 12, 16]  # Years of schooling
})

# Step 2: Calculate the Covariance Matrix
cov_matrix = data.cov()

# Step 3: Display the results
print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
              Age  Income  Education
Age        102.00  208.75      -19.0
Income     208.75  430.00      -41.0
Education  -19.00  -41.00        5.2


In [8]:
# Interpretation:
# The diagonal holds the variances: Age = 102.0, Income = 430.0, Education = 5.2.
# Only the sign of a covariance is directly interpretable; its magnitude depends on the units of both variables.

# Age vs. Income (208.75) → Positive Covariance
# Older people tend to have higher incomes in this dataset.
# As Age increases, Income also tends to increase.

# Age vs. Education (-19.0) → Negative Covariance
# Indicates that older individuals in this sample tend to have fewer years of education (possibly due to
# earlier generations having different education systems).

# Income vs. Education (-41.0) → Negative Covariance
# In this sample, more years of education go together with lower income.
# This could be due to job market differences (e.g. some highly educated people work in lower-paying fields),
# but with only five rows it is a property of the sample, not evidence of a general trend.
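
Because covariance depends on units, the strength of these relationships is easier to compare after scaling to correlation. A sketch using the same dataset; `age_income_corr` is an illustrative name for the hand-computed value.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 32, 40, 50, 28],
    'Income': [50, 60, 80, 100, 55],  # in thousands
    'Education': [16, 18, 14, 12, 16]  # Years of schooling
})

cov_matrix = data.cov()

# Pearson correlation: covariance divided by the product of the standard deviations
corr_matrix = data.corr()

# The same number by hand for one pair
age_income_corr = cov_matrix.loc['Age', 'Income'] / (data['Age'].std() * data['Income'].std())

print(corr_matrix.round(3))
```

Every entry of the correlation matrix lies between -1 and 1, so the three pairs can be compared directly.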

In [9]:
# Q6. You are working on a machine learning project with a dataset containing several categorical
# variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
# and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
# each variable, and why?

# Ans:

# Gender → Label Encoding (0 = Female, 1 = Male) (since it's binary).
# Education Level → Ordinal Encoding (High School < Bachelor's < Master's < PhD).
# Employment Status → One-Hot Encoding (since the categories are treated as nominal and there are only three
# of them, i.e. low cardinality).
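
The three choices above could be applied as follows. This is a sketch on a made-up four-row sample; the rows are hypothetical and only the column names and categories come from the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['PhD', 'High School', "Master's", "Bachelor's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

# Binary variable: a single 0/1 column is enough
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})

# Ordinal variable: pass the order explicitly, lowest to highest
education_order = [['High School', "Bachelor's", "Master's", 'PhD']]
encoder = OrdinalEncoder(categories=education_order)
df['Education Level'] = encoder.fit_transform(df[['Education Level']]).ravel().astype(int)

# Nominal variable: one indicator column per category
df = pd.get_dummies(df, columns=['Employment Status'], dtype=int)

print(df)
```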

In [10]:
# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
# categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
# East/West). Calculate the covariance between each pair of variables and interpret the results.

# Ans:

# Import the packages
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample dataset
data = pd.DataFrame({
    'Temperature': [30, 25, 28, 22, 35],
    'Humidity': [70, 65, 80, 60, 75],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
})

# Label-encode the categorical variables
# (sorted order: Cloudy=0, Rainy=1, Sunny=2 and East=0, North=1, South=2, West=3)
label_encoders = {}

for column in ['Weather Condition', 'Wind Direction']:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Covariance of every pair of columns (the categorical columns are now integer codes)
cov_matrix = data.cov()

# Display the results
print("Covariance Matrix:")
print(cov_matrix)

Covariance Matrix:
                   Temperature  Humidity  Weather Condition  Wind Direction
Temperature              24.50     27.50              -0.25           -3.75
Humidity                 27.50     62.50              -1.25           -8.75
Weather Condition        -0.25     -1.25               0.70            0.15
Wind Direction           -3.75     -8.75               0.15            1.30


In [11]:
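# Interpretation
# Temperature vs. Humidity (27.50, Positive)
# Higher temperatures tend to occur together with higher humidity in this dataset.

# Covariances involving Weather Condition and Wind Direction
# These are computed from the label-encoded integers (Cloudy=0, Rainy=1, Sunny=2; East=0, North=1, South=2,
# West=3). Both variables are nominal, so the codes are arbitrary and the covariances change if the categories
# are numbered differently. They should not be read as real relationships.

# Temperature vs. Weather Condition (-0.25, Slightly negative)
# Close to zero. It does not show that Sunny days are warmer: in this sample the mean temperature is 26 on
# Sunny days and 31.5 on Rainy days.

# Humidity vs. Weather Condition (-1.25, Negative)
# Does not mean that Rainy conditions have lower humidity: the Rainy rows have the highest humidity values
# in this sample (80 and 75).

# Temperature vs. Wind Direction (-3.75, Negative)
# Reflects only the numbering of the directions; it does not identify which wind directions are linked to
# lower temperatures.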
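
One way to relate the continuous variables to a nominal one without relying on integer codes is to compare group means. A sketch on the same sample data; this swaps covariance for a per-category summary and is offered as an alternative, not as the method the question asks for.

```python
import pandas as pd

data = pd.DataFrame({
    'Temperature': [30, 25, 28, 22, 35],
    'Humidity': [70, 65, 80, 60, 75],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
})

# Covariance is meaningful for the two continuous variables
numeric_cov = data[['Temperature', 'Humidity']].cov()

# For a nominal variable, compare the mean of each continuous variable per category
weather_means = data.groupby('Weather Condition')[['Temperature', 'Humidity']].mean()

print(numeric_cov)
print(weather_means)
```

The group means do not depend on how the categories are numbered, so they stay the same under any encoding.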
