# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Label Encoding


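Label Encoding is a technique where each category is assigned a unique integer value. The integers are arbitrary identifiers and are not meant to express any order or relationship between categories; they simply convert categorical data into numerical format. A model that treats its inputs as magnitudes may still read the codes as ordered, so the technique is best kept for target labels or for models that are insensitive to the numeric order. In scikit-learn, LabelEncoder is intended for encoding target labels.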




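Example: Suppose you have a feature called "Color" with categories ["Red", "Green", "Blue"]. Label encoding might map these categories to integers as follows (the specific integers are arbitrary; scikit-learn's LabelEncoder, for instance, assigns them in alphabetical order):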


Red: 0

Green: 1

Blue: 2

Ordinal Encoding


Ordinal Encoding is similar to label encoding but is specifically used when there is an inherent order or ranking among the categories. Each category is assigned an integer value that reflects its order.





Example: Suppose you have a feature called "Education Level" with categories ["High School", "Associate's", "Bachelor's", "Master's", "PhD"]. Ordinal encoding might map these categories to integers as follows:

High School: 1

Associate's: 2

Bachelor's: 3

Master's: 4

PhD: 5

In [1]:
# Choosing Between Ordinal and Label Encoding
# Label Encoding:

# When to Use: When there is no inherent order among the categories. For example, encoding colors like "Red", "Green", "Blue" where the order is not meaningful.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
df = pd.DataFrame(data)

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding
df['Color_encoded'] = label_encoder.fit_transform(df['Color'])

print(df)


   Color  Color_encoded
0    Red              2
1  Green              1
2   Blue              0
3  Green              1
4    Red              2


In [2]:
# Ordinal Encoding:

# When to Use: When there is a clear, meaningful order among the categories. For example, in educational levels where there is a progression from "High School" to "PhD".
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data
data = {'Education Level': ['High School', "Associate's", "Bachelor's", "Master's", 'PhD']}
df = pd.DataFrame(data)

# Initialize the OrdinalEncoder with the categories listed from lowest to highest
ordinal_encoder = OrdinalEncoder(categories=[['High School', "Associate's", "Bachelor's", "Master's", 'PhD']])

# Apply ordinal encoding; fit_transform returns a 2-D array, so flatten it into one column
df['Education_Level_encoded'] = ordinal_encoder.fit_transform(df[['Education Level']]).ravel()

print(df)


  Education Level  Education_Level_encoded
0     High School                      0.0
1     Associate's                      1.0
2      Bachelor's                      2.0
3        Master's                      3.0
4             PhD                      4.0
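
An ordinal encoder with a fixed category order will normally raise an error when it meets a level it was not told about. The sketch below shows one way to handle that with OrdinalEncoder's handle_unknown and unknown_value options. The level "Diploma" and the fallback code -1 are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit order, lowest to highest
levels = ["High School", "Associate's", "Bachelor's", "Master's", "PhD"]

train = pd.DataFrame({"Education Level": ["PhD", "High School", "Master's"]})
# "Diploma" is a hypothetical level that is not in the declared order
new = pd.DataFrame({"Education Level": ["Bachelor's", "Diploma"]})

encoder = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",  # do not raise on unseen levels
    unknown_value=-1,                    # assumed code reserved for unseen levels
)
encoder.fit(train)

train_codes = encoder.transform(train).ravel()  # codes follow the declared order
new_codes = encoder.transform(new).ravel()      # the unseen level becomes -1
print(train_codes)
print(new_codes)
```

Whether -1 is a sensible stand-in depends on the model; a tree-based model can usually cope with it, while a linear model would read it as "below High School".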


# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

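Target Guided Ordinal Encoding converts a categorical variable into ordinal numbers by ranking its categories according to their relationship with the target variable, typically the mean of the target within each category. The category with the lowest mean target gets the lowest rank, the next lowest gets the next rank, and so on. The resulting order carries information about the target, which can sometimes lead to better model performance, especially when the target variable is continuous or ordinal.

A typical use is a nominal feature with many categories, such as the neighborhood of a house when predicting its price, where one-hot encoding would add a large number of columns. Because the mapping is derived from the target, it should be computed on the training data only; otherwise target information leaks into the validation or test data.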

In [3]:
import pandas as pd

# Sample data
data = {
    'Neighborhood': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'HousePrice': [300000, 200000, 400000, 350000, 210000, 420000, 320000, 180000, 410000, 360000]
}

df = pd.DataFrame(data)

# Step 1: Calculate the mean target value for each category
mean_prices = df.groupby('Neighborhood')['HousePrice'].mean()

# Step 2: Rank the categories based on the mean target value
mean_prices_sorted = mean_prices.sort_values()
ordinal_mapping = {k: i for i, k in enumerate(mean_prices_sorted.index, 1)}

# Step 3: Map the categories to ordinal values
df['Neighborhood_Encoded'] = df['Neighborhood'].map(ordinal_mapping)

# Display the result
print(df)


  Neighborhood  HousePrice  Neighborhood_Encoded
0            A      300000                     2
1            B      200000                     1
2            C      400000                     3
3            A      350000                     2
4            B      210000                     1
5            C      420000                     3
6            A      320000                     2
7            B      180000                     1
8            C      410000                     3
9            A      360000                     2
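
The same idea can be sketched with a separate fitting step, so that the mapping is learned from training rows only and then reused on new rows. The helper name fit_target_ordinal, the unseen neighborhood "D", and the fallback code 0 are assumptions for illustration, not part of any library.

```python
import pandas as pd

train = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "HousePrice": [300000, 200000, 400000, 350000, 210000, 420000],
})
# Hypothetical new rows; "D" never appears in the training data
test = pd.DataFrame({"Neighborhood": ["C", "A", "D"]})


def fit_target_ordinal(frame, column, target):
    """Return {category: rank}, ranked by mean target (1 = lowest mean)."""
    means = frame.groupby(column)[target].mean().sort_values()
    return {category: rank for rank, category in enumerate(means.index, start=1)}


# Learn the mapping from the training data only
mapping = fit_target_ordinal(train, "Neighborhood", "HousePrice")

train["Neighborhood_Encoded"] = train["Neighborhood"].map(mapping)
# Unseen categories map to NaN; here they receive 0 as an assumed fallback code
test["Neighborhood_Encoded"] = test["Neighborhood"].map(mapping).fillna(0).astype(int)

print(mapping)
print(test)
```

Other fallbacks are possible, for example the median rank; the right choice depends on how the model should treat categories it has never seen.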


# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that indicates the relationship between two variables. It shows how much two variables change together.   

Positive covariance: If two variables tend to move in the same direction, their covariance is positive.   

Negative covariance: If two variables tend to move in opposite directions, their covariance is negative.   

Zero covariance: If there is no linear relationship between the two variables, their covariance is zero.   

Importance of Covariance in Statistical Analysis


Covariance is crucial in statistical analysis for several reasons:

Understanding relationships: It helps identify whether two variables tend to increase together or move in opposite directions, which is essential in fields like finance, economics, and science.

Portfolio management: In finance, covariance is used to assess the risk of a portfolio by measuring how asset returns move together.

Correlation analysis: Covariance shows the direction of a relationship, but its size depends on the units of the variables. Correlation normalizes covariance by the two standard deviations and provides a standardized measure of the strength of the relationship.

Regression analysis: Covariance is a fundamental concept in regression analysis. In simple linear regression, for example, the slope equals Cov(X, Y) / Var(X).
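
The link between covariance and correlation can be checked numerically. The sketch below uses made-up numbers: it divides the sample covariance by the product of the two sample standard deviations and compares the result with NumPy's own correlation coefficient. It also shows that rescaling one variable changes the covariance but not the correlation.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

cov_xy = np.cov(x, y)[0, 1]  # sample covariance (divides by n - 1)
corr_from_cov = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
corr_numpy = np.corrcoef(x, y)[0, 1]

# Rescaling x changes the covariance but leaves the correlation unchanged
cov_scaled = np.cov(100 * x, y)[0, 1]
corr_scaled = np.corrcoef(100 * x, y)[0, 1]

print(cov_xy, corr_from_cov, corr_numpy)
print(cov_scaled, corr_scaled)
```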

Calculating Covariance

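The formula for calculating the sample covariance between two variables X and Y is:

Cov(X, Y) = Σ[(Xi - X̄)(Yi - Ȳ)] / (n - 1)

where X̄ and Ȳ are the sample means of X and Y, and n is the number of paired observations. Dividing by n - 1 gives the sample covariance, which is what pandas' DataFrame.cov computes by default; dividing by n instead gives the population covariance.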
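
As a minimal sketch, the formula can be written out directly and compared with NumPy. The function name sample_covariance and the data are illustrative assumptions.

```python
import numpy as np


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must have the same length, at least 2")
    return np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

manual = sample_covariance(x, y)
reference = np.cov(x, y)[0, 1]  # np.cov also divides by n - 1 by default

print(manual, reference)
```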

# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [4]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

df = pd.DataFrame(data)

# Apply label encoding to each categorical column.
# A fresh LabelEncoder is fitted per column so that no encoder is overwritten by the next column.
df_encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))

# Display the encoded DataFrame
print(df_encoded)


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     1         0
4      2     2         2


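Label encoding converts each category in a categorical variable to a unique integer. LabelEncoder sorts the categories and numbers them from 0, so for string data the codes follow alphabetical order, not the order in which the values appear. In the output above: blue = 0, green = 1, red = 2 for Color; large = 0, medium = 1, small = 2 for Size; and metal = 0, plastic = 1, wood = 2 for Material.

These integers are identifiers only. Color and Material have no natural order, so a model that treats the codes as magnitudes may infer a spurious ranking. Size does have a natural order (small < medium < large), but the alphabetical codes reverse it, so an ordinal encoder with an explicit category order would suit that column better.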
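
To see exactly which integer each category received, one option is to keep one fitted encoder per column and inspect its classes_ attribute. The sketch below does that for two of the columns; the dictionary name encoders is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue", "green", "red"],
    "Size": ["small", "medium", "large", "medium", "small"],
})

# Keep one fitted encoder per column so each mapping can be inspected later
encoders = {}
encoded = pd.DataFrame(index=df.index)
for column in df.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(df[column])

# classes_ is sorted; a label's position in it is its integer code
size_mapping = {str(label): code for code, label in enumerate(encoders["Size"].classes_)}

# inverse_transform recovers the original labels from the codes
decoded_colors = encoders["Color"].inverse_transform(encoded["Color"])

print(size_mapping)
print(list(decoded_colors))
```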

# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level.


In [5]:
import pandas as pd

# Sample data
data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 70000, 80000, 90000],
    'Education Level': [16, 18, 16, 20, 22]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate the covariance matrix
cov_matrix = df.cov()

# Display the covariance matrix
print(cov_matrix)


                      Age       Income  Education Level
Age                  62.5     125000.0             17.5
Income           125000.0  250000000.0          35000.0
Education Level      17.5      35000.0              6.8
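
When reading a covariance matrix like this one, two things help: the diagonal holds each variable's variance, and the off-diagonal values depend on the units. The sketch below, which assumes Income is recorded in single currency units, re-expresses it in thousands and shows that the Age-Income covariance shrinks by the same factor.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 60000, 70000, 80000, 90000],
    "Education Level": [16, 18, 16, 20, 22],
})

cov_matrix = df.cov()

# Diagonal entries of the covariance matrix are the column variances
variances = df.var()

# Expressing Income in thousands divides its covariances by 1000
df_thousands = df.assign(Income=df["Income"] / 1000)
cov_thousands = df_thousands.cov()

age_income = cov_matrix.loc["Age", "Income"]
age_income_k = cov_thousands.loc["Age", "Income"]

print(age_income, age_income_k)
print(variances)
```

A large covariance therefore does not by itself indicate a strong relationship; it may only reflect a variable measured on a large scale.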


# Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

Gender (Male/Female):

This is a binary categorical variable with only two distinct values. A suitable encoding method is Binary Encoding or One-Hot Encoding. Binary Encoding uses a single column with values 0 and 1 (for example Male = 0, Female = 1) and is usually preferred, because that one column carries all the information. One-Hot Encoding also works, but it creates two columns, one for each category (Male and Female), and either column can be derived from the other; that redundancy can cause collinearity in linear models unless one column is dropped.

Education Level (High School/Bachelor's/Master's/PhD): 

This is an ordinal categorical variable because the categories have a meaningful order. For ordinal variables, you can use Ordinal Encoding. This method assigns an integer to each category based on its order (e.g., High School = 1, Bachelor's = 2, Master's = 3, PhD = 4). This encoding method preserves the order of the categories, which can be useful for some models.

Employment Status (Unemployed/Part-Time/Full-Time): 

This variable is treated here as nominal. The categories could arguably be ranked by hours worked, but the spacing between them is not clearly meaningful, so One-Hot Encoding is the safer choice. It creates a separate binary column for each category, with a value of 1 indicating the presence of that category and 0 otherwise, and it avoids any assumption about the order or distance between the categories. If the ranking matters for the problem, Ordinal Encoding (Unemployed < Part-Time < Full-Time) is a reasonable alternative.



Gender: Binary Encoding or One-Hot Encoding


Education Level: Ordinal Encoding



Employment Status: One-Hot Encoding
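
The three choices can be combined in one preprocessing step. The following is a minimal sketch using scikit-learn's ColumnTransformer; the sample rows are invented, and drop="if_binary" is used here as one way to get a single 0/1 column for Gender.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows covering every category from the question
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "Bachelor's", "Master's", "PhD"],
    "Employment Status": ["Unemployed", "Part-Time", "Full-Time", "Full-Time"],
})

# Explicit order for the ordinal variable, lowest to highest
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

preprocessor = ColumnTransformer(
    transformers=[
        # Binary variable: one 0/1 column instead of two
        ("gender", OneHotEncoder(drop="if_binary"), ["Gender"]),
        # Ordinal variable: integers that follow the declared order
        ("education", OrdinalEncoder(categories=[education_order]), ["Education Level"]),
        # Nominal variable: one indicator column per category
        ("employment", OneHotEncoder(), ["Employment Status"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocessor.fit_transform(df)
print(encoded.shape)
print(encoded)
```

The result has five columns: one for Gender, one for Education Level, and three for Employment Status (in alphabetical order of the categories).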

# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [6]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {
    'Temperature': [30, 22, 25, 28, 32],
    'Humidity': [80, 65, 70, 75, 85],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Encode categorical variables using LabelEncoder
label_encoder = LabelEncoder()
df['Weather Condition Encoded'] = label_encoder.fit_transform(df['Weather Condition'])
df['Wind Direction Encoded'] = label_encoder.fit_transform(df['Wind Direction'])

# Calculate the covariance matrix
covariance_matrix = df[['Temperature', 'Humidity', 'Weather Condition Encoded', 'Wind Direction Encoded']].cov()

# Print the covariance matrix
print("Covariance Matrix:\n", covariance_matrix)


Covariance Matrix:
                            Temperature  Humidity  Weather Condition Encoded  \
Temperature                      15.80     31.25                       1.00   
Humidity                         31.25     62.50                       1.25   
Weather Condition Encoded         1.00      1.25                       1.00   
Wind Direction Encoded           -0.45     -1.25                       0.25   

                           Wind Direction Encoded  
Temperature                                 -0.45  
Humidity                                    -1.25  
Weather Condition Encoded                    0.25  
Wind Direction Encoded                       1.30  


Explanation:

Label Encoding:

We use LabelEncoder from sklearn.preprocessing to convert the categorical variables into integers, which it assigns in alphabetical order. "Weather Condition" is encoded as Cloudy = 0, Rainy = 1, Sunny = 2, and "Wind Direction" as East = 0, North = 1, South = 2, West = 3.

Covariance Calculation:

The covariance matrix contains the covariance between every pair of the selected variables (both continuous and encoded categorical), with each variable's variance on the diagonal.

Interpreting Results:

Temperature and Humidity have a positive covariance (31.25), so in this sample higher temperatures come with higher humidity.
The covariances that involve an encoded categorical variable need careful interpretation. Their sign and size depend on the integer codes, which were assigned arbitrarily, so they do not describe a real direction or strength of association, and no covariance implies causation.
For that reason these values are at most a preliminary check. Further analysis (like comparing group means, ANOVA, or regression on one-hot encoded variables) is more appropriate for understanding how a categorical variable relates to a continuous one.
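
The dependence on the integer codes can be demonstrated directly. The sketch below computes the covariance between Temperature and Weather Condition under two codings: the alphabetical one and a second, hypothetical relabelling. It then shows group means, which do not depend on any coding.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 22, 25, 28, 32],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
})

# Two equally arbitrary integer codings of the same nominal variable
alphabetical = {"Cloudy": 0, "Rainy": 1, "Sunny": 2}  # alphabetical order
alternative = {"Sunny": 0, "Cloudy": 1, "Rainy": 2}   # hypothetical relabelling

cov_alphabetical = df["Temperature"].cov(df["Weather Condition"].map(alphabetical))
cov_alternative = df["Temperature"].cov(df["Weather Condition"].map(alternative))

# Group means compare the categories without relying on any integer coding
group_means = df.groupby("Weather Condition")["Temperature"].mean()

print(cov_alphabetical, cov_alternative)
print(group_means)
```

The covariance changes sign between the two codings even though the data are identical, which is why it should not be read as evidence of a positive or negative relationship.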