In [None]:
# Q1

"""
Difference Between Ordinal Encoding and Label Encoding:
Definitions:

Label Encoding:
Label encoding converts a categorical variable into numerical form by assigning each unique category an integer code, starting from 0. The codes only identify the categories
and carry no ranking, so the approach is intended for categorical data that does not have any ordinal relationship.

Ordinal Encoding:
Ordinal encoding, on the other hand, is specifically designed for categorical variables that have a clear order or ranking among the categories. Similar to label encoding, it
assigns integer values to categories, but these integers reflect the inherent order of the categories.

Key Differences:

Nature of Data:

Label Encoding: Used for nominal data where there is no intrinsic ordering (e.g., colors like red, blue, green).
Ordinal Encoding: Used for ordinal data where there is a meaningful order (e.g., ratings like low, medium, high).

Interpretation of Values:

Label Encoding: The assigned integer values do not imply any rank or order; they are merely identifiers.
Ordinal Encoding: The assigned integer values represent the rank or order of the categories.

Use Cases:

Label Encoding: Suitable when the model will not interpret the integer codes as an ordering of the categories.
Ordinal Encoding: Appropriate when the model can use the order of the categories to improve performance.

Example Scenario:
When to Use Label Encoding:
Suppose you have a dataset with a feature representing different types of fruits: {Apple, Banana, Cherry}. Since there is no natural ordering among these fruits, label encoding
 would be appropriate. You might encode them as follows:

Apple = 0
Banana = 1
Cherry = 2

When to Use Ordinal Encoding:
Consider a survey response feature with levels of satisfaction: {Unsatisfied, Neutral, Satisfied}. Here, there is a clear order in terms of satisfaction levels. Using ordinal
encoding would make sense in this case:

Unsatisfied = 0
Neutral = 1
Satisfied = 2

With ordinal encoding, the integer codes tell the model that “Satisfied” ranks above “Neutral,” which in turn ranks above “Unsatisfied.”
"""

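The difference between the two encodings can be sketched with scikit-learn. This is a minimal illustration under stated assumptions: the sample lists `fruits` and `satisfaction` are made up for the example, and the category order passed to `OrdinalEncoder` is taken from the survey scenario above.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample data for illustration only.
fruits = ["Apple", "Banana", "Cherry", "Banana"]
satisfaction = [["Neutral"], ["Satisfied"], ["Unsatisfied"], ["Neutral"]]

# LabelEncoder assigns codes by the sorted (alphabetical) order of the labels,
# so the integers are identifiers only.
fruit_codes = LabelEncoder().fit_transform(fruits)

# OrdinalEncoder accepts an explicit category order per feature,
# so the integers follow the intended ranking.
encoder = OrdinalEncoder(categories=[["Unsatisfied", "Neutral", "Satisfied"]])
satisfaction_codes = encoder.fit_transform(satisfaction)

print(fruit_codes)                 # [0 1 2 1]
print(satisfaction_codes.ravel())  # [1. 2. 0. 1.]
```

Without the explicit `categories` argument, `OrdinalEncoder` would also fall back to sorted order, which would place "Neutral" before "Satisfied" before "Unsatisfied" and lose the ranking.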
In [None]:
# Q2

"""

Target Guided Ordinal Encoding
Target Guided Ordinal Encoding is a technique used in machine learning to convert categorical variables into ordinal integers, where the order is derived from the relationship
between the categories and the target variable. This method is particularly useful when you want the information contained in the target variable to decide how categories are
ranked, including for features whose categories have no obvious order of their own.

Step-by-Step Explanation
1. Understanding Categorical Variables
Categorical variables are non-numeric data types that represent groups or categories. For example, a feature like “Education Level” might have categories such as “High School,”
“Bachelor’s,” and “Master’s.” In many machine learning algorithms, especially those that rely on numerical input, these categorical variables need to be converted into a numerical
format.

2. The Concept of Ordinal Encoding
Ordinal encoding assigns an integer value to each category based on its rank or order. For instance, if we were encoding education levels, we might assign:

High School = 1
Bachelor’s = 2
Master’s = 3

However, traditional ordinal encoding does not take into account how these categories relate to the target variable (the outcome we are trying to predict).

3. Incorporating Target Variable Information
Target Guided Ordinal Encoding enhances traditional ordinal encoding by using the mean of the target variable for each category to determine the order. Instead of relying on a
predefined ranking, the categories are sorted by their mean target value and assigned integers according to that rank.

Example Process:
Calculate Mean Target Value: For each category in the feature, calculate the mean of the target variable.

Suppose our target variable is “Income” and the per-category means are as follows:
High School: mean Income = $30,000
Bachelor’s: mean Income = $50,000
Master’s: mean Income = $70,000

Assign Encoded Values: Sort the categories by their mean target value and assign each one its rank.

High School → 1
Bachelor’s → 2
Master’s → 3

Using the means themselves as the encoded values (High School → 30000, and so on) is a different technique, usually called mean (target) encoding.

Use Encoded Values in Model: These encoded values can now be used as features in machine learning models.

4. When to Use Target Guided Ordinal Encoding
This method is particularly useful in scenarios where:

The categorical feature has a meaningful relationship with the target variable.
You want to preserve some ordinal nature while also leveraging statistical relationships.
You are working with tree-based models (like decision trees or random forests) where capturing this relationship can improve model performance.
"""

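The process above can be sketched in pandas. This is a minimal sketch, not a definitive implementation: the six-row dataset is hypothetical, chosen so that the per-category means come out to the $30,000 / $50,000 / $70,000 used in the explanation, and the column name `Education_Level_encoded` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Education_Level": ["High School", "Bachelor's", "Master's",
                        "High School", "Master's", "Bachelor's"],
    "Income": [28000, 48000, 72000, 32000, 68000, 52000],
})

# Step 1: mean of the target for each category, sorted from lowest to highest.
means = df.groupby("Education_Level")["Income"].mean().sort_values()

# Step 2: assign each category its rank in that ordering (starting at 1).
mapping = {category: rank for rank, category in enumerate(means.index, start=1)}

# Step 3: replace the categories with their ranks.
df["Education_Level_encoded"] = df["Education_Level"].map(mapping)

print(mapping)
print(df)
```

In practice the mapping would normally be computed on the training split only and then applied to validation and test data, so that target information from held-out rows does not leak into the feature.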

In [None]:
# Q3
"""
Definition of Covariance
Covariance is a statistical measure that indicates the extent to which two random variables change together. It quantifies the degree to which changes in one variable are associated with changes in another variable. If the covariance is positive, it means that as one variable increases, the other tends to increase as well. Conversely, if the covariance is negative, it implies that as one variable increases, the other tends to decrease.

Mathematically, covariance measures the joint variability of two variables and provides insight into their linear relationship. However, it does not standardize this relationship (unlike correlation), so its magnitude depends on the scale of the variables.

Importance of Covariance in Statistical Analysis
Covariance plays a crucial role in various aspects of statistical analysis and data science:

Understanding Relationships Between Variables: Covariance helps determine whether two variables are positively or negatively related. This information is foundational for understanding how variables interact and influence each other.

Foundation for Correlation: While covariance itself is not standardized, it serves as a precursor to calculating correlation coefficients. Correlation normalizes covariance by dividing it by the product of the standard deviations of both variables, providing a dimensionless measure bounded between -1 and 1.

Portfolio Risk Management in Finance: In finance, covariance is used to analyze how different assets move relative to each other. For example, when constructing an investment portfolio, understanding covariances between asset returns helps minimize risk through diversification.

Principal Component Analysis (PCA): Covariance matrices are central to PCA, a dimensionality reduction technique widely used in machine learning and data analysis. PCA finds the directions (principal components) along which the data varies most by computing the eigenvectors of the data's covariance matrix.

Regression Analysis: Covariance appears directly in regression estimates. In simple linear regression, for example, the least-squares slope equals Cov(X, Y) divided by Var(X).

Signal Processing and Machine Learning Applications: Covariance matrices are used in algorithms like Kalman filters and Gaussian processes for modeling uncertainty and dependencies between variables.

How Covariance Is Calculated
The formula for calculating covariance depends on whether you are working with a population or a sample:

1. Population Covariance Formula:
For two random variables X and Y with population size N, the population covariance Cov(X, Y) is calculated as:

Cov(X, Y) = (1 / N) · Σ_{i=1}^{N} (X_i − μ_X)(Y_i − μ_Y)

Where:
X_i and Y_i are individual data points from X and Y,
μ_X and μ_Y are the population means of X and Y,
N is the total number of observations.

2. Sample Covariance Formula:
For sample data of size n, the sample covariance s_XY is given by:

s_XY = (1 / (n − 1)) · Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)

Where:
X_i and Y_i are individual sample points,
X̄ and Ȳ are the sample means,
n − 1 applies Bessel’s correction, which removes the bias of the estimate (the effect is most noticeable in small samples).
"""

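Both formulas can be checked numerically with NumPy. This is a small sketch on made-up arrays `x` and `y` (assumptions for illustration); it computes the covariance by hand and compares the result with `np.cov`, which divides by n − 1 by default and by n when `bias=True`.

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Deviations from the means.
dx = x - x.mean()
dy = y - y.mean()

# Population covariance: divide the sum of products by N.
pop_cov = (dx * dy).sum() / len(x)

# Sample covariance: divide by n - 1 (Bessel's correction).
sample_cov = (dx * dy).sum() / (len(x) - 1)

print(pop_cov)     # 3.5
print(sample_cov)  # 14 / 3, roughly 4.667

# np.cov returns the full 2x2 matrix; the off-diagonal entry is Cov(x, y).
print(np.isclose(sample_cov, np.cov(x, y)[0, 1]))          # True
print(np.isclose(pop_cov, np.cov(x, y, bias=True)[0, 1]))  # True
```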

In [1]:
# Q4

from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = {
    'Color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'small', 'large', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic']
}

df = pd.DataFrame(data)

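# Fit one LabelEncoder per column and keep it so the codes can be decoded later.
# LabelEncoder assigns codes in sorted (alphabetical) order, e.g. blue=0, green=1, red=2.
label_encoders = {}
for column in df.columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le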

print(df)


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     2         2
4      2     0         0
5      0     1         1
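
The stored encoders can be used to inspect the mapping and to recover the original labels. The sketch below is a standalone illustration on a small hypothetical `sizes` series rather than the full dataframe; it shows that the codes follow alphabetical order (large=0, medium=1, small=2), not the natural small < medium < large order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample for illustration only.
sizes = pd.Series(['small', 'medium', 'large', 'small'])

le = LabelEncoder()
codes = le.fit_transform(sizes)

# classes_ holds the labels in the order of their integer codes.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'large': 0, 'medium': 1, 'small': 2}

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(codes)
print(list(decoded))  # ['small', 'medium', 'large', 'small']
```

If the order of `Size` matters to the model, an explicit mapping or an `OrdinalEncoder` with a specified category order would be the safer choice.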


In [2]:
# Q5

import pandas as pd

data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Income': [30000, 35000, 50000, 60000, 80000, 85000, 90000, 110000],
    'Education_Level': [12, 14, 16, 14, 18, 16, 18, 20]
}

df = pd.DataFrame(data)

# Sample covariance matrix (pandas divides by n - 1 by default)
cov_matrix = df.cov()

print(cov_matrix)


                           Age        Income  Education_Level
Age                 150.000000  3.428571e+05        28.571429
Income           342857.142857  8.000000e+08     67142.857143
Education_Level      28.571429  6.714286e+04         6.857143
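
Because the three variables are on very different scales, the raw covariances are hard to compare. As described in Q3, dividing each covariance by the product of the two standard deviations gives the correlation. The sketch below rebuilds the same dataframe and performs that normalization by hand; the variable names `std` and `corr` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Income': [30000, 35000, 50000, 60000, 80000, 85000, 90000, 110000],
    'Education_Level': [12, 14, 16, 14, 18, 16, 18, 20],
})

cov = df.cov()

# Standard deviations are the square roots of the diagonal (the variances).
std = np.sqrt(np.diag(cov))

# Divide each covariance by the product of the two standard deviations.
corr = cov / np.outer(std, std)

print(corr)

# The result should match pandas' built-in correlation matrix.
print(np.allclose(corr.to_numpy(), df.corr().to_numpy()))  # True
```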


In [None]:
# Q6
"""
For encoding categorical variables in a machine learning project, we need to choose encoding methods based on the nature of the categorical data. Here’s how we can encode each
variable:

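1. Gender (Male/Female) → Binary Encoding (Label Encoding)
Encoding Method: Label Encoding (or Binary Encoding)

Why?

"Gender" has only two categories (Male/Female), so a single 0/1 column is sufficient.

Example:

Female → 0

Male → 1

Which category receives 0 is arbitrary; scikit-learn's LabelEncoder assigns codes alphabetically, which gives the mapping above.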

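2. Education Level (High School/Bachelor’s/Master’s/PhD) → One-Hot Encoding
Encoding Method: One-Hot Encoding (Ordinal Encoding is a valid alternative)

Why?

"Education Level" has more than two categories. The categories do have a natural order (High School < Bachelor's < Master's < PhD), so Ordinal Encoding is a reasonable choice when that ranking should be preserved.

Integer codes (e.g., High School = 0, Bachelor's = 1) also imply that the levels are equally spaced, which a linear model would take literally. One-Hot Encoding avoids that assumption, at the cost of discarding the order.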

One-Hot Encoding creates separate binary columns for each category:

High School → [1, 0, 0, 0]

Bachelor's → [0, 1, 0, 0]

Master's → [0, 0, 1, 0]

PhD → [0, 0, 0, 1]

3. Employment Status (Unemployed/Part-Time/Full-Time) → Ordinal or One-Hot Encoding
Encoding Method:

One-Hot Encoding (if the employment status categories are independent)

Ordinal Encoding (if there's a ranking in employment levels)

Why?

If we consider "Unemployed < Part-Time < Full-Time" as a meaningful order, Ordinal Encoding can be used:

Unemployed → 0

Part-Time → 1

Full-Time → 2

If there's no meaningful ranking, One-Hot Encoding is better to avoid implying a false ordinal relationship.
"""

In [8]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Education_Level': ['High School', "Bachelor's", "Master's", "PhD", "Bachelor's"],
    'Employment_Status': ['Unemployed', 'Part-Time', 'Full-Time', 'Unemployed', 'Full-Time']
}

df = pd.DataFrame(data)

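# Gender: two categories, so one 0/1 column (LabelEncoder sorts alphabetically: Female=0, Male=1)
le_gender = LabelEncoder()
df['Gender'] = le_gender.fit_transform(df['Gender'])

# Education_Level: one binary column per category
df = pd.get_dummies(df, columns=['Education_Level'])

# Employment_Status: explicit mapping so the codes follow Unemployed < Part-Time < Full-Time
employment_mapping = {'Unemployed': 0, 'Part-Time': 1, 'Full-Time': 2}
df['Employment_Status'] = df['Employment_Status'].map(employment_mapping)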

print(df)


   Gender  Employment_Status  Education_Level_Bachelor's  \
0       1                  0                       False   
1       0                  1                        True   
2       0                  2                       False   
3       1                  0                       False   
4       1                  2                        True   

   Education_Level_High School  Education_Level_Master's  Education_Level_PhD  
0                         True                     False                False  
1                        False                     False                False  
2                        False                      True                False  
3                        False                     False                 True  
4                        False                     False                False  


In [9]:
# Q7

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Temperature': [30, 25, 28, 32, 35, 22, 20, 27, 31, 29],
    'Humidity': [70, 80, 75, 65, 60, 85, 90, 78, 68, 72],
    'Weather_Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Sunny', 'Cloudy', 'Rainy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind_Direction': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South']
}

df = pd.DataFrame(data)

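# Label-encode the categorical columns so that df.cov() can include them.
# The codes are assigned alphabetically, so covariances involving these two
# nominal columns depend on that arbitrary coding and should be read with caution.
label_encoders = {}
for column in ['Weather_Condition', 'Wind_Direction']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le

# Sample covariance matrix (pandas divides by n - 1 by default)
cov_matrix = df.cov()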

print("Covariance Matrix:\n", cov_matrix)


Covariance Matrix:
                    Temperature   Humidity  Weather_Condition  Wind_Direction
Temperature          20.988889 -41.966667           2.677778        0.722222
Humidity            -41.966667  84.677778          -5.366667       -1.500000
Weather_Condition     2.677778  -5.366667           0.766667       -0.166667
Wind_Direction        0.722222  -1.500000          -0.166667        1.166667
