Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Ans:
    
Ordinal Encoding and Label Encoding are both techniques used to convert categorical data into numerical data, but they serve different purposes and are used in different scenarios.

Ordinal Encoding

Ordinal Encoding is used when the categorical data has an inherent order or ranking. In this encoding, the categories are converted into integers based on their order.

Label Encoding

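Label Encoding is used when the categorical data does not have an inherent order. Each unique category is assigned a unique integer, but the integers are arbitrary identifiers and do not represent any order or ranking. scikit-learn's LabelEncoder, for example, numbers the categories in sorted (alphabetical) order.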

Choosing Between Ordinal Encoding and Label Encoding

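Use Ordinal Encoding when:

1. The categories have a meaningful order.

2. The numerical relationship between the categories matters for the model.

Use Label Encoding when:

1. The categories have no meaningful order and each one only needs a distinct integer identifier.

2. You are encoding the target labels of a classification problem, which is what scikit-learn's LabelEncoder is designed for.

Example: Education Level (High School < Bachelor's < Master's < PhD) is ranked, so Ordinal Encoding fits. A binary variable such as Gender (Male/Female) has no ranking, so Label Encoding is enough.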
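
The difference can be seen on a small example. The sketch below is illustrative only: the `sizes` array is made up, and it assumes scikit-learn's `OrdinalEncoder` is given the ranking explicitly through its `categories` parameter, while `LabelEncoder` is left to number the values in sorted order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data (not from the assignment)
sizes = np.array(["small", "large", "medium", "small"])

# OrdinalEncoder: the ranking small < medium < large is passed explicitly,
# so the integers follow that order. It expects a 2D array of features.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

# LabelEncoder: categories are numbered in sorted (alphabetical) order,
# so large=0, medium=1, small=2 and the ranking is not preserved.
label = LabelEncoder()
label_codes = label.fit_transform(sizes)

print(ordinal_codes)  # [0. 2. 1. 0.]
print(label_codes)    # [2 0 1 2]
```

With the ordinal codes, "large" is the largest value; with the label codes it is the smallest, which would mislead a model that reads the integers as magnitudes.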

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Ans:
    
Target Guided Ordinal Encoding is a technique used to convert categorical variables into numerical values based on the relationship between the categories and the target variable. This method is particularly useful when the categorical variables do not have an inherent order, but you want to leverage the information from the target variable to create a meaningful order.

How Target Guided Ordinal Encoding Works

1. Calculate the Mean Target Value for Each Category:
For each category of the categorical variable, calculate the mean (or median) of the target variable. This step helps to understand the relationship between the categories and the target variable.

2. Sort the Categories:
Sort the categories based on the calculated mean target values.

3. Assign Ordinal Values:
Assign ordinal values (e.g., 1, 2, 3, etc.) to the categories based on the sorted order. The category with the lowest mean target value gets the smallest ordinal value, and so on.
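
The steps above can be sketched with pandas. This is a minimal illustration, not a reference implementation: the `city` feature, the `price` target, and all values are hypothetical. Such an encoding might be used for a nominal feature like a city or neighbourhood name when predicting a numeric target such as price.

```python
import pandas as pd

# Hypothetical data: 'city' is a nominal feature, 'price' is the numeric target
df = pd.DataFrame({
    "city": ["A", "B", "C", "A", "B", "C"],
    "price": [100, 300, 200, 120, 280, 220],
})

# Mean target value per category
mean_price = df.groupby("city")["price"].mean()

# Categories sorted by mean target value, lowest first
ordered_categories = mean_price.sort_values().index

# Ordinal values 1, 2, 3, ... assigned in the sorted order
mapping = {category: rank for rank, category in enumerate(ordered_categories, start=1)}
df["city_encoded"] = df["city"].map(mapping)

print(mapping)  # {'A': 1, 'C': 2, 'B': 3}
print(df)
```

Because the mapping is derived from the target, it is usually computed on the training split only and then applied to validation and test data, so that target information does not leak into evaluation.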

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Ans:
    
Covariance measures how two random variables change together: whether, and in which direction, they tend to move in tandem.

Importance of Covariance in Statistical Analysis

Understanding Relationships:

Covariance helps in understanding the direction of the linear relationship between two variables. If the covariance is positive, it indicates that as one variable increases, the other variable also tends to increase. If the covariance is negative, it indicates that as one variable increases, the other tends to decrease.

Feature Selection:

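In machine learning and data analysis, covariance can indicate which features vary together with the target variable, which can inform feature selection. It also underlies dimensionality reduction methods such as PCA, which operates on the covariance matrix. Because covariance depends on the units of the variables, its magnitude alone does not show how strong a relationship is.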

Portfolio Management:

In finance, covariance is used to measure how the returns of two assets move together, which is crucial for portfolio diversification and risk management.
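
How Covariance is Calculated

For n paired observations, the sample covariance is the sum of the products of each variable's deviations from its mean, divided by n - 1:

cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1)

The sketch below applies this formula to made-up numbers (`x` and `y` are illustrative, not from any dataset in this notebook) and compares the result with `np.cov`, which uses the same n - 1 divisor by default.

```python
import numpy as np

# Illustrative paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Deviations of each observation from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of products of deviations divided by n - 1
manual_cov = (dx * dy).sum() / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```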

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Initialize LabelEncoders for each categorical feature
label_encoder_color = LabelEncoder()
label_encoder_size = LabelEncoder()
label_encoder_material = LabelEncoder()

# Fit and transform the categorical data
df['Color_encoded'] = label_encoder_color.fit_transform(df['Color'])
df['Size_encoded'] = label_encoder_size.fit_transform(df['Size'])
df['Material_encoded'] = label_encoder_material.fit_transform(df['Material'])

print(df)


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green  medium    metal              1             1                 0
4    red   small     wood              2             2                 2
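
To explain the output: each LabelEncoder numbers the categories of its column in sorted order, so Color becomes blue=0, green=1, red=2; Size becomes large=0, medium=1, small=2; and Material becomes metal=0, plastic=1, wood=2. The integers are identifiers only. For Size in particular, the codes do not follow the small < medium < large ranking. A short sketch, using the Color column as an example, shows how the mapping can be read from the fitted encoder's `classes_` attribute and reversed with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted categories; a category's position is its code
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```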


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [2]:
import numpy as np
import pandas as pd

# Sample dataset
data = {
    'Age': [25, 30, 45, 50, 35],
    'Income': [50000, 60000, 80000, 90000, 70000],
    'Education_Level': [12, 14, 16, 18, 15]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Calculating the covariance matrix
cov_matrix = np.cov(df, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
[[1.075e+02 1.625e+05 2.250e+01]
 [1.625e+05 2.500e+08 3.500e+04]
 [2.250e+01 3.500e+04 5.000e+00]]
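
Interpretation: the diagonal entries are the variances of Age (107.5), Income (2.5e+08), and Education_Level (5.0). All off-diagonal entries are positive, so in this sample Age, Income, and Education_Level tend to increase together. The magnitudes are not comparable across pairs because they depend on the units of each variable (Income is in the tens of thousands). One way to compare strengths, sketched below as a supplement rather than part of the original answer, is to rescale the covariances to correlations.

```python
import numpy as np
import pandas as pd

# Same sample dataset as in the cell above
df = pd.DataFrame({
    'Age': [25, 30, 45, 50, 35],
    'Income': [50000, 60000, 80000, 90000, 70000],
    'Education_Level': [12, 14, 16, 18, 15]
})

cov = df.cov()    # labeled covariance matrix (n - 1 divisor)
corr = df.corr()  # Pearson correlation matrix

# Correlation is covariance divided by the product of the standard deviations
std = np.sqrt(np.diag(cov))
manual_corr = cov / np.outer(std, std)

print(corr.round(3))
```

On this sample every pairwise correlation is above 0.9, which indicates strong positive linear relationships, although five observations are too few to generalize from.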


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

Ans:

Gender has two categories with no ranking, so label encoding it to 0/1 is sufficient. Education Level has a clear order (High School < Bachelor's < Master's < PhD), so an ordinal mapping preserves that ranking. Employment Status is treated as nominal here and one-hot encoded, so the model does not assume an order between its categories.

In [3]:
from sklearn.preprocessing import LabelEncoder

# Example data
gender = ['Male', 'Female', 'Female', 'Male']

# Initialize and fit LabelEncoder
label_encoder = LabelEncoder()
gender_encoded = label_encoder.fit_transform(gender)

print(gender_encoded)  


[1 0 0 1]


In [4]:
# Example data
education = ['High School', "Bachelor's", "Master's", 'PhD']

# Ordinal mapping: integers follow the ranking of the education levels
education_mapping = {'High School': 1, "Bachelor's": 2, "Master's": 3, 'PhD': 4}
education_encoded = [education_mapping[edu] for edu in education]

print(education_encoded)


[1, 2, 3, 4]


In [5]:
import pandas as pd

# Example data
employment_status = ['Unemployed', 'Part-Time', 'Full-Time', 'Part-Time']

# Creating DataFrame
df = pd.DataFrame({'Employment_Status': employment_status})

# One-Hot Encoding (dtype=int yields 0/1 columns; pandas 2.x defaults to booleans)
employment_encoded = pd.get_dummies(df['Employment_Status'], dtype=int)

print(employment_encoded)


   Full-Time  Part-Time  Unemployed
0          0          0           1
1          0          1           0
2          1          0           0
3          0          1           0


Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [6]:
import pandas as pd
import numpy as np

# Example data
data = {
    'Temperature': [30, 25, 20, 35, 28],
    'Humidity': [70, 80, 90, 60, 75],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

# Creating a DataFrame
df = pd.DataFrame(data)

# One-Hot Encoding for categorical variables (dtype=int keeps the dummy columns numeric)
df_encoded = pd.get_dummies(df, columns=['Weather Condition', 'Wind Direction'], dtype=int)

# Calculating the covariance matrix
cov_matrix = df_encoded.cov()

print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
                          Temperature  Humidity  Weather Condition_Cloudy  \
Temperature                     31.30    -62.50                     -0.55   
Humidity                       -62.50    125.00                      1.25   
Weather Condition_Cloudy        -0.55      1.25                      0.30   
Weather Condition_Rainy         -1.90      3.75                     -0.10   
Weather Condition_Sunny          2.45     -5.00                     -0.20   
Wind Direction_East             -1.90      3.75                     -0.10   
Wind Direction_North             0.70     -1.25                      0.05   
Wind Direction_South            -0.65      1.25                      0.15   
Wind Direction_West              1.85     -3.75                     -0.10   

                          Weather Condition_Rainy  Weather Condition_Sunny  \
Temperature                                 -1.90                     2.45   
Humidity                                     3.75     