# **ASSIGNMENT**

**Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.**

Ordinal Encoding and Label Encoding are both techniques used for encoding categorical variables into numerical representations. However, they differ in the way they assign numerical values to categories.

1. Label Encoding:
   Label Encoding assigns a unique integer to each category in a categorical variable. The integers are identifiers only and carry no ranking; scikit-learn's `LabelEncoder`, for instance, numbers the categories from 0 in sorted (alphabetical) order. For example:

   Original Categorical Variable:
   - Category A
   - Category B
   - Category C

   Label Encoded Variable:
   - Category A: 0
   - Category B: 1
   - Category C: 2

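   Label Encoding is generally used when the categorical variable has no inherent order or hierarchy, that is, for nominal variables. Because the codes are still integers, a model that treats its inputs as magnitudes (such as linear regression) may read an order into them that does not exist, so the encoding is safest for target labels and for tree-based models.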

2. Ordinal Encoding:
   Ordinal Encoding assigns numerical values to categories based on their order or ranking. The assigned values reflect the relative position or importance of the categories. For example:

   Original Categorical Variable:
   - Category A
   - Category B
   - Category C

   Ordinal Encoded Variable:
   - Category A: 1
   - Category B: 2
   - Category C: 3

   Ordinal Encoding is useful when the categorical variable has an inherent order or hierarchy. It preserves the ordinal relationship between the categories. This encoding is often used for encoding variables with ordered levels, such as ratings (e.g., low, medium, high) or educational degrees (e.g., primary, secondary, tertiary).

Example scenario:
Suppose a dataset has a "Size" feature for t-shirt sizes: "Small," "Medium," and "Large." These sizes have an inherent order (Small < Medium < Large), so Ordinal Encoding is the better choice, with values assigned in that order (e.g., 1, 2, 3). A "Color" feature with values such as "Red," "Green," and "Blue" has no such order, so Label Encoding can assign arbitrary labels like 0, 1, and 2 to its categories.

In summary, the choice between Label Encoding and Ordinal Encoding depends on the nature of the categorical variable and whether there is an inherent order or hierarchy among the categories. Label Encoding is appropriate for nominal variables without an order, while Ordinal Encoding is suitable for encoding variables with ordered levels.
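
The difference can be seen with scikit-learn's two encoders. This is a minimal sketch on a made-up `sizes` array; note that `OrdinalEncoder` numbers categories from 0 rather than 1, and that the order is supplied explicitly through its `categories` argument.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data, not taken from a real dataset.
sizes = np.array(["Small", "Large", "Medium", "Small"])

# LabelEncoder sorts the categories alphabetically: Large=0, Medium=1, Small=2.
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder accepts an explicit order, so Small=0 < Medium=1 < Large=2.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

print(label_codes)    # [2 0 1 2]
print(ordinal_codes)  # [0. 2. 1. 0.]
```

The label codes follow alphabetical order, which does not match the size order; the ordinal codes do.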

**Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.**

Target Guided Ordinal Encoding is a technique used to encode categorical variables by taking into account the relationship between the categories and the target variable. It assigns ordinal numerical values to the categories based on their impact or influence on the target variable.

Here's how Target Guided Ordinal Encoding works:

1. Calculate the mean or median target value for each category: For each category in the categorical variable, compute the mean or median value of the target variable. This represents the average target value associated with that category.

2. Order the categories based on the mean or median target value: Sort the categories based on their mean or median target values in ascending or descending order. The ordering reflects the impact or influence of each category on the target variable.

3. Assign ordinal numerical values to the categories: Assign ordinal numerical values to the categories based on their ordered positions. The category with the highest mean or median target value typically receives the highest value, and the category with the lowest mean or median target value receives the lowest value.

Example scenario:
Consider a machine learning project to predict customer churn at a telecom company. One of the features is "Subscription Type," which represents the subscription plans available to customers. We want to encode this categorical variable in a way that captures its relationship with the target variable (churn).

Using Target Guided Ordinal Encoding:
1. Calculate the mean or median churn rate for each subscription type.
2. Order the subscription types based on their mean or median churn rates.
3. Assign ordinal values to the subscription types based on their ordered positions.

Encoded Subscription Type:
- Plan A: 3
- Plan B: 2
- Plan C: 1

In this example, Target Guided Ordinal Encoding assigns numerical values based on the churn rate associated with each subscription type. The encoding reflects the ordering of subscription types in terms of their impact on the target variable (churn).

Target Guided Ordinal Encoding is useful when there is a meaningful relationship between the categorical variable and the target variable. Because the codes are derived from the target, they let the model pick up how strongly each category is associated with the outcome. It is most often applied to nominal variables, especially those with many categories, where the target differs clearly across categories. The mapping should be computed on the training data only; otherwise information about the target leaks into the validation and test sets.
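
The three steps can be sketched with pandas. The churn data below is hypothetical and chosen only so that the resulting ranks match the example above; the column names are likewise illustrative.

```python
import pandas as pd

# Hypothetical churn data: 1 = churned, 0 = stayed.
df = pd.DataFrame({
    "subscription_type": ["Plan A"] * 3 + ["Plan B"] * 3 + ["Plan C"] * 3,
    "churn": [1, 1, 1, 1, 0, 1, 0, 0, 1],
})

# Step 1: mean churn rate per category.
churn_rate = df.groupby("subscription_type")["churn"].mean()

# Step 2: sort the categories by churn rate, lowest first.
ordered = churn_rate.sort_values().index

# Step 3: map each category to its rank (1 = lowest churn rate).
mapping = {category: rank for rank, category in enumerate(ordered, start=1)}
df["subscription_encoded"] = df["subscription_type"].map(mapping)

print(mapping)  # {'Plan C': 1, 'Plan B': 2, 'Plan A': 3}
```

In a real project the `mapping` would be fitted on the training split and then applied unchanged to the other splits.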

**Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?**

Covariance is a statistical measure that quantifies the relationship between two variables. It measures how changes in one variable correspond to changes in another variable. Specifically, covariance indicates whether the variables tend to vary together (covary) or vary in opposite directions.

Covariance is important in statistical analysis for several reasons:

1. Relationship between variables: Covariance helps in understanding the relationship between two variables. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that they tend to vary in opposite directions.

2. Linearity assessment: Covariance is a measure of linear association. A covariance close to zero suggests a weak or absent linear relationship, whereas a clearly nonzero covariance indicates some linear association. Because its magnitude depends on the units of both variables, covariance alone does not show how strong that association is.

3. Variable selection: Covariance can assist in selecting relevant variables for analysis. By examining the covariances between variables and the target variable, one can identify variables that have a strong relationship with the outcome of interest.

4. Portfolio diversification: In finance, covariance is used to assess the diversification of a portfolio. Covariance between different assets helps determine the extent to which their returns move together or in opposite directions. A portfolio with assets that have low or negative covariance provides better diversification and reduces overall risk.

Covariance is calculated using the following formula:

cov(X, Y) = Σ((Xᵢ - X̄)(Yᵢ - Ȳ))/(n-1)

where:
- X and Y are the variables of interest
- Xᵢ and Yᵢ are the individual data points of X and Y
- X̄ and Ȳ are the means of X and Y, respectively
- n is the number of data points

The formula sums the products of the deviations of X and Y from their means and divides by (n-1), which gives the sample covariance. The sign of the result gives the direction of the relationship; the magnitude depends on the units of X and Y, so it is not a standardized measure of strength.
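
As a quick check of the formula, the sketch below computes the covariance by hand and compares it with `numpy.cov`, which also divides by (n-1) by default. The arrays `x` and `y` are made-up values.

```python
import numpy as np

# Illustrative data only.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sum of products of deviations from the means, divided by (n - 1).
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(X, Y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```

Both values agree (14/3 for this data), confirming that the library call implements the same formula.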


**Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.**

In [1]:
from sklearn.preprocessing import LabelEncoder 

In [2]:
import pandas as pd

data = {
    'Color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'Size': ['small', 'medium', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'plastic', 'wood', 'metal']
}

dataset = pd.DataFrame(data)

In [3]:
dataset

|   | Color | Size | Material |
|---|-------|------|----------|
| 0 | red | small | wood |
| 1 | green | medium | metal |
| 2 | blue | medium | plastic |
| 3 | green | large | plastic |
| 4 | red | small | wood |
| 5 | blue | medium | metal |


In [4]:
encoder=LabelEncoder()

In [5]:
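# fit_transform refits the encoder on each column, so the previous mapping is
# overwritten each time; use one LabelEncoder per column if it is needed later.
dataset["Encoded_color"] = encoder.fit_transform(dataset["Color"])
dataset["Encoded_Size"] = encoder.fit_transform(dataset["Size"])
dataset["Encoded_Material"] = encoder.fit_transform(dataset["Material"])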

In [6]:
dataset

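|   | Color | Size | Material | Encoded_color | Encoded_Size | Encoded_Material |
|---|-------|------|----------|---------------|--------------|------------------|
| 0 | red | small | wood | 2 | 2 | 2 |
| 1 | green | medium | metal | 1 | 1 | 0 |
| 2 | blue | medium | plastic | 0 | 1 | 1 |
| 3 | green | large | plastic | 1 | 0 | 1 |
| 4 | red | small | wood | 2 | 2 | 2 |
| 5 | blue | medium | metal | 0 | 1 | 0 |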


In the encoded dataset, each categorical variable has a companion column of integer labels. `LabelEncoder` sorts the categories of a column alphabetically and numbers them from 0, which explains the values below.

For the 'Color' column:

- 'blue' is encoded as 0
- 'green' is encoded as 1
- 'red' is encoded as 2

For the 'Size' column:

- 'large' is encoded as 0
- 'medium' is encoded as 1
- 'small' is encoded as 2

For the 'Material' column:

- 'metal' is encoded as 0
- 'plastic' is encoded as 1
- 'wood' is encoded as 2

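The label encoding process assigns a unique integer label to each category within a feature, so categorical data can be passed to algorithms that require numeric input. It suits variables with no inherent order, such as Color and Material. Size does have an order, and the alphabetical codes (large = 0, medium = 1, small = 2) happen to run in reverse, so an ordinal encoding with an explicit category order would represent it more faithfully.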
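
To inspect the mapping behind the codes, one option is to keep a separate fitted encoder per column. The sketch below rebuilds the same data; `classes_` and `inverse_transform` belong to scikit-learn's `LabelEncoder`, while the `encoders` dictionary is just an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

dataset = pd.DataFrame({
    "Color": ["red", "green", "blue", "green", "red", "blue"],
    "Size": ["small", "medium", "medium", "large", "small", "medium"],
    "Material": ["wood", "metal", "plastic", "plastic", "wood", "metal"],
})

# One fitted encoder per column, so each mapping can be inspected or inverted later.
encoders = {column: LabelEncoder().fit(dataset[column]) for column in dataset.columns}

for column, enc in encoders.items():
    # classes_ is sorted; the position of each class is its integer label.
    print(column, enc.classes_.tolist())

# Encode a column, then recover the original strings from the codes.
codes = encoders["Color"].transform(dataset["Color"])
restored = encoders["Color"].inverse_transform(codes)
```

Keeping the encoders around also makes it possible to apply the same mapping to new data with `transform`.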

**Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.**

In [7]:
import pandas as pd

data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 70000, 80000, 90000],
    'Education Level': [1, 2, 3, 2, 3]
}

dataset1 = pd.DataFrame(data)


In [8]:
dataset1

|   | Age | Income | Education Level |
|---|-----|--------|-----------------|
| 0 | 25 | 50000 | 1 |
| 1 | 30 | 60000 | 2 |
| 2 | 35 | 70000 | 3 |
| 3 | 40 | 80000 | 2 |
| 4 | 45 | 90000 | 3 |


In [9]:
dataset1.cov()

|   | Age | Income | Education Level |
|---|-----|--------|-----------------|
| Age | 62.5 | 125000.0 | 5.0 |
| Income | 125000.0 | 250000000.0 | 10000.0 |
| Education Level | 5.0 | 10000.0 | 0.7 |


Interpreting the results:

- Covariance between Age and Income: The value of 125000.0 is positive, so Age and Income move in the same direction: as Age increases, Income tends to increase as well.
- Covariance between Age and Education Level: The value of 5.0 is also positive, so older individuals in this sample tend to have a higher Education Level code. The number looks small only because Education Level is measured on a 1 to 3 scale; covariance depends on the units of both variables, so its size alone does not indicate a weak relationship. Education Level is also an ordinal code rather than a true quantity, so the value should be read with caution.
- Covariance between Income and Education Level: The value of 10000.0 is positive, so higher Income tends to go with a higher Education Level.
- Diagonal entries: 62.5, 250000000.0, and 0.7 are the variances of Age, Income, and Education Level, respectively.
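
Because covariance values carry the units of both variables, dividing each entry by the product of the two standard deviations gives a unit-free correlation that is easier to compare across pairs. A minimal sketch, rebuilding the same DataFrame:

```python
import numpy as np
import pandas as pd

dataset1 = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 60000, 70000, 80000, 90000],
    "Education Level": [1, 2, 3, 2, 3],
})

cov = dataset1.cov()
std = np.sqrt(np.diag(cov))  # standard deviation of each column

# corr(X, Y) = cov(X, Y) / (std(X) * std(Y)), which removes the units.
corr = cov / np.outer(std, std)

print(corr.round(3))
```

The result matches `dataset1.corr()`. For this data, Age and Income have a correlation of 1.0 because Income rises by exactly 10000 for every 5 years of Age.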


**Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?**

When encoding categorical variables in a machine learning project, the choice of encoding method depends on the nature of the variable and its relationship with the target variable.

1. "Gender" (Male/Female):
   For the "Gender" variable, you can use Label Encoding or One-Hot Encoding.
   - Label Encoding: Gender has no natural order, but because it has only two categories, assigning 0 and 1 to represent Male and Female, respectively, produces a single binary indicator and implies no ranking.
   - One-Hot Encoding: This creates one binary column per category, representing the presence (1) or absence (0) of that category. With two categories the columns are redundant, so one of them can be dropped.

2. "Education Level" (High School/Bachelor's/Master's/PhD):
   For the "Education Level" variable, you can use Ordinal Encoding or One-Hot Encoding.
   - Ordinal Encoding: Education levels have a natural order (High School < Bachelor's < Master's < PhD), so we can assign increasing integers to the categories and preserve that order. This is the default choice here.
   - One-Hot Encoding: If the model should not assume that the steps between levels are equally spaced, we can instead create one binary column per education level, at the cost of discarding the order.

3. "Employment Status" (Unemployed/Part-Time/Full-Time):
   For the "Employment Status" variable, you can use One-Hot Encoding.
   - One-Hot Encoding: Since there is no inherent order or ordinal relationship among employment status categories, one-hot encoding is suitable. It will create binary columns for each employment status category, indicating the presence (1) or absence (0) of that category.

**Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.**

In [10]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

data = {
    'Temperature': [25.5, 28.2, 22.7, 20.1, 24.8],
    'Humidity': [65, 70, 75, 80, 85],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

dataset2 = pd.DataFrame(data)



In [11]:
dataset2

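|   | Temperature | Humidity | Weather Condition | Wind Direction |
|---|-------------|----------|-------------------|----------------|
| 0 | 25.5 | 65 | Sunny | North |
| 1 | 28.2 | 70 | Cloudy | South |
| 2 | 22.7 | 75 | Rainy | East |
| 3 | 20.1 | 80 | Sunny | West |
| 4 | 24.8 | 85 | Cloudy | North |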


In [12]:
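# numeric_only=True skips the string columns; pandas 2.0+ raises an error without it
dataset2.cov(numeric_only=True)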

|   | Temperature | Humidity |
|---|-------------|----------|
| Temperature | 9.273 | -11.875 |
| Humidity | -11.875 | 62.5 |


The covariance measures the linear relationship between two variables. A positive covariance indicates that the variables tend to change together in the same direction, while a negative covariance indicates that they tend to change in opposite directions. A covariance value close to zero suggests no significant linear relationship between the variables.

Interpreting the results:

- Covariance between Temperature and Humidity: The value of -11.875 is negative, which suggests that as Temperature increases, Humidity tends to decrease, and vice versa. With only five observations, this is a tentative pattern.
- Covariance involving Weather Condition or Wind Direction: These variables are categorical with no numeric scale, so covariance is not defined for them and they do not appear in the matrix above. Covariance is computed between numeric variables.
- Diagonal entries: 9.273 and 62.5 are the variances of Temperature and Humidity, respectively.


--------------------------