Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Ordinal Encoding
Purpose: Used for categorical features where the categories have a meaningful order or ranking.
How it works: Assigns integer values to the categories based on their order.
Example: Suppose you have a feature representing education levels: ["High School", "Bachelor's", "Master's", "PhD"]. Ordinal Encoding might assign:
High School: 1
Bachelor’s: 2
Master’s: 3
PhD: 4
Label Encoding
Purpose: Used for categorical features where the categories do not have a meaningful order.
How it works: Assigns an integer to each category; the numbers are identifiers only and carry no ranking.
Example: Suppose you have a feature representing colors: ["Red", "Green", "Blue"]. scikit-learn's LabelEncoder numbers the classes in sorted order, so it assigns:
Blue: 0
Green: 1
Red: 2
When to Choose One Over the Other
Ordinal Encoding: Use this when the categorical feature has a natural order. For example, education levels, customer satisfaction ratings (e.g., “Poor”, “Average”, “Good”, “Excellent”), or any other feature where the order matters.
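Label Encoding: Use this when the categorical feature does not have a natural order. For example, colors, types of animals, or any other feature where the order does not matter. For input features of models that read magnitudes (linear or distance-based models), one-hot encoding is usually the safer choice, because integer codes can be misread as a ranking.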
Example Scenario
Ordinal Encoding: If you are working on a model to predict salaries based on education levels, you would use Ordinal Encoding because education levels have a natural order.
Label Encoding: If you are working on a model to classify images of different animals, you would use Label Encoding for the target labels because the types of animals do not have a natural order.
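
The contrast above can be sketched with scikit-learn. This is a minimal illustration, not the only way to do it; the sample rows are made up. Note that OrdinalEncoder produces 0-based codes, so its output is one less than the 1 to 4 numbering used above, and that LabelEncoder assigns codes in sorted order of the class names, giving Blue 0, Green 1, Red 2.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up sample rows, for illustration only
education = pd.DataFrame({"Education": ["Bachelor's", "PhD", "High School", "Master's"]})
colors = ["Red", "Green", "Blue", "Green"]

# Ordinal encoding: the order is passed explicitly, lowest level first
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_codes = ordinal_encoder.fit_transform(education).ravel()

# Label encoding: codes follow the sorted class names, not any ranking
label_encoder = LabelEncoder()
color_codes = label_encoder.fit_transform(colors)

print(education_codes)                  # codes follow education_order (0-based)
print(list(label_encoder.classes_))     # position in this list is the code
print(color_codes)
```

The key design difference is that the ordinal encoder is told the order, while the label encoder derives its codes from the labels themselves.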

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Calculate the Mean of the Target for Each Category: For each category in the categorical variable, calculate the mean of the target variable.
Sort the Categories: Sort the categories based on the calculated means.
Assign Ordinal Values: Assign ordinal values to the categories based on their sorted order.
Example
Let’s say you are working on a customer churn prediction project, and you have a categorical variable Customer_Type with values like Regular, VIP, and Occasional. The target variable is Churn (1 for churned, 0 for not churned).

Calculate the Mean of Churn for Each Category:
Regular: Mean Churn = 0.3
VIP: Mean Churn = 0.1
Occasional: Mean Churn = 0.5
Sort the Categories:
VIP (0.1)
Regular (0.3)
Occasional (0.5)
Assign Ordinal Values:
VIP = 1
Regular = 2
Occasional = 3
Now, the Customer_Type variable is encoded as follows:

VIP -> 1
Regular -> 2
Occasional -> 3
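
The three steps can be sketched in pandas as follows. The data here is hypothetical: it is constructed with ten customers per type purely so that the churn rates match the example (VIP 0.1, Regular 0.3, Occasional 0.5). The column name Customer_Type_Encoded is also an assumption made for this sketch.

```python
import pandas as pd

# Hypothetical data built to reproduce the churn rates in the example
df = pd.DataFrame({
    "Customer_Type": ["Regular"] * 10 + ["VIP"] * 10 + ["Occasional"] * 10,
    "Churn": [1] * 3 + [0] * 7 + [1] * 1 + [0] * 9 + [1] * 5 + [0] * 5,
})

# Step 1: mean of the target for each category
mean_churn = df.groupby("Customer_Type")["Churn"].mean()

# Step 2: sort the categories by that mean, lowest first
ordered_categories = mean_churn.sort_values().index

# Step 3: assign 1, 2, 3, ... in the sorted order
mapping = {category: rank for rank, category in enumerate(ordered_categories, start=1)}

df["Customer_Type_Encoded"] = df["Customer_Type"].map(mapping)
print(mapping)
```

In a real project the mapping would normally be learned from the training split only and then applied to validation and test data, since it is derived from the target.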
When to Use It
You might use Target Guided Ordinal Encoding when a categorical variable has no inherent order but its categories differ noticeably in their average target value, so that ranking them by that average gives the model a useful signal. For example:

Customer Churn Prediction: Encoding customer types based on their likelihood to churn.
Credit Risk Assessment: Encoding employment types based on their default rates.
Stock Price Prediction: Encoding market sectors based on their average returns.

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

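Definition
Covariance measures how two variables vary together. It is positive when the variables tend to be above or below their means at the same time, negative when one tends to be above its mean while the other is below, and close to zero when there is no linear relationship.

Importance in Statistical Analysis
Covariance helps in understanding the relationship between variables. It is used in fields such as finance, economics, and data science to: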

Identify Relationships: Determine whether variables move together or inversely.
Portfolio Theory: Assess the risk and return in finance by understanding how asset prices move together.
Principal Component Analysis (PCA): Reduce dimensionality; PCA derives its components from the covariance matrix, combining correlated variables into fewer uncorrelated ones.

Calculation
Covariance between two variables $X$ and $Y$ is calculated using the formula:
$$\mathrm{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
Where:

$X_i$ and $Y_i$ are the individual sample points.
$\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, respectively.
$n$ is the number of data points.
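
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with NumPy. The values of x and y are arbitrary illustrative numbers, and sample_covariance is a helper name introduced only for this example.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator, as in the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Arbitrary illustrative values
x = [2, 4, 6, 8]
y = [1, 3, 2, 5]

manual = sample_covariance(x, y)
# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y)
from_numpy = np.cov(x, y)[0, 1]

print(manual, from_numpy)
```

np.cov uses the same n - 1 denominator by default, so the two values should agree up to floating-point rounding.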

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.DataFrame({
    'color': ('red', 'green', 'blue'),
    'Size': ('small', 'medium', 'large'),
    'Material': ('wood', 'metal', 'plastic'),
})

In [3]:
df

   color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic


In [8]:
encoder = LabelEncoder()

In [16]:
encoder.fit_transform(df['color'])

array([2, 1, 0])

In [13]:
encoder.fit_transform(df['Size'])

array([2, 1, 0])

In [18]:
encoder.fit_transform(df['Material'])

array([2, 0, 1])
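
To explain the arrays above: LabelEncoder sorts the distinct values of a column and uses each value's position in that sorted list as its code. For color, the sorted classes are blue, green, red, so red, green, blue becomes [2, 1, 0]; the same logic gives [2, 1, 0] for Size and [2, 0, 1] for Material. The codes are identifiers only: the result for Size does not reflect small < medium < large. The sketch below makes the mapping explicit through the fitted encoder's classes_ attribute; the mappings dictionary is a name introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ('red', 'green', 'blue'),
    'Size': ('small', 'medium', 'large'),
    'Material': ('wood', 'metal', 'plastic'),
})

mappings = {}
for column in ['color', 'Size', 'Material']:
    # Fit a fresh encoder per column so each mapping is kept separately
    encoder = LabelEncoder()
    encoder.fit(df[column])
    # classes_ holds the sorted labels; the index of a label is its code
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
```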

Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [19]:
import pandas as pd

# Sample data
data = {
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 80000, 60000, 90000, 45000],
    'Education': [12, 16, 14, 18, 12]
}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate covariance matrix
covariance_matrix = df.cov()

print(covariance_matrix)


                Age       Income  Education
Age           141.8     228750.0       30.7
Income     228750.0  375000000.0    50000.0
Education      30.7      50000.0        6.8
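
Interpretation: the diagonal entries are the variances of each variable (Age 141.8, Income 375,000,000, Education 6.8), and the off-diagonal entries are the pairwise covariances. All of them are positive, so in this sample older people tend to have higher income and more years of education, and income rises with education. The magnitudes depend on the units of each variable, so they cannot be compared with each other directly. One common way to put them on the same scale is to divide each covariance by the product of the two standard deviations, which gives the correlation; a sketch of that normalization:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 80000, 60000, 90000, 45000],
    'Education': [12, 16, 14, 18, 12],
})

cov = df.cov()

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov))

# Divide each covariance by the product of the two standard deviations
corr = cov / np.outer(std, std)

print(corr.round(3))
```

The result should match df.corr(), and every value lies between -1 and 1 regardless of the original units.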


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In [20]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample DataFrame
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['Bachelor\'s', 'Master\'s', 'PhD', 'High School'],
    'Employment Status': ['Full-Time', 'Part-Time', 'Unemployed', 'Full-Time']
}
df = pd.DataFrame(data)

# Gender: nominal with two categories, so a single 0/1 column is enough
# (LabelEncoder codes the classes in sorted order: Female=0, Male=1)
label_encoder = LabelEncoder()
df['Gender_Label'] = label_encoder.fit_transform(df['Gender'])

# Education Level: the levels have a natural order, so map them to 1..4 in that order
education_order = ['High School', 'Bachelor\'s', 'Master\'s', 'PhD']
df['Education_Level_Ordinal'] = df['Education Level'].apply(lambda x: education_order.index(x) + 1)

# Employment Status: treated as nominal here, so use one indicator column per category
# (sparse_output replaced the older sparse argument in scikit-learn 1.2)
one_hot_encoder = OneHotEncoder(sparse_output=False)
employment_status_encoded = one_hot_encoder.fit_transform(df[['Employment Status']])
employment_status_df = pd.DataFrame(employment_status_encoded, columns=one_hot_encoder.get_feature_names_out(['Employment Status']))
df = pd.concat([df, employment_status_df], axis=1)

print(df)


   Gender Education Level Employment Status  Gender_Label  \
0    Male      Bachelor's         Full-Time             1   
1  Female        Master's         Part-Time             0   
2  Female             PhD        Unemployed             0   
3    Male     High School         Full-Time             1   

   Education_Level_Ordinal  Employment Status_Full-Time  \
0                        2                          1.0   
1                        3                          0.0   
2                        4                          0.0   
3                        1                          1.0   

   Employment Status_Part-Time  Employment Status_Unemployed  
0                          0.0                           0.0  
1                          1.0                           0.0  
2                          0.0                           1.0  
3                          0.0                           0.0  




Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [22]:
import pandas as pd

# Sample data
data = {
    'Temperature': [30, 25, 27, 29, 31],
    'Humidity': [70, 65, 75, 80, 60],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

df = pd.DataFrame(data)

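# Covariance is defined only for numeric variables, so the matrix is
# computed for the two continuous columns; the categorical columns are left out
cov_matrix = df[['Temperature', 'Humidity']].cov()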
print(cov_matrix)


             Temperature  Humidity
Temperature          5.8      -2.5
Humidity            -2.5      62.5
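
Interpretation: the variance of Temperature is 5.8 and the variance of Humidity is 62.5. Their covariance is -2.5, so in this sample humidity tends to be slightly lower when temperature is higher; relative to the two variances the relationship is weak (a correlation of roughly -0.13). Weather Condition and Wind Direction are categorical, so they have no covariance until they are converted to numbers. One possible approach, shown here as an assumption rather than the only option, is to one-hot encode them with pd.get_dummies and compute the covariance matrix over all resulting columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': [30, 25, 27, 29, 31],
    'Humidity': [70, 65, 75, 80, 60],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North'],
})

# One 0/1 indicator column per category, e.g. 'Weather Condition_Sunny'
encoded = pd.get_dummies(
    df, columns=['Weather Condition', 'Wind Direction'], dtype=float
)

# Covariance between every pair of numeric and indicator columns
full_cov = encoded.cov()

print(full_cov.round(2))
```

A positive covariance between Temperature and an indicator column means that category was observed on warmer-than-average days in the sample. With only five rows, these values are illustrative rather than reliable estimates.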
