Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

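Label Encoding is a simple technique where each unique category is assigned an integer value. The integers are arbitrary (scikit-learn's `LabelEncoder` assigns them in sorted, i.e. alphabetical, order), so they carry no ranking. It is suited to nominal categories and, in scikit-learn, is intended mainly for encoding target labels.

Ordinal Encoding is used when the categories have a clear ordinal relationship (an order or ranking, though not necessarily a consistent interval between them). In this technique, you assign integer values to categories in a way that reflects their order, for example low = 0, medium = 1, high = 2. Choose Ordinal Encoding for a feature such as education level (High School < Bachelor's < Master's), and Label Encoding for something like a color or a class label, where no order exists.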
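
To make the contrast concrete, here is a minimal sketch that encodes the same column both ways. The `sizes` data frame and its values are hypothetical and only illustrate the idea; `OrdinalEncoder` is given the intended order explicitly through its `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature: small < medium < large
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# LabelEncoder sorts classes alphabetically: large=0, medium=1, small=2
label_codes = LabelEncoder().fit_transform(sizes['size'])

# OrdinalEncoder with an explicit category list preserves the intended order
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordinal_codes = ordinal.fit_transform(sizes[['size']]).ravel()

print(label_codes)    # [2 0 1 2]
print(ordinal_codes)  # [0. 2. 1. 0.]
```

With the label codes, `small` ends up larger than `large`, which is why the explicit ordering matters when a model treats the codes as magnitudes.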

In [None]:
## When the data has a rank or order, we use Ordinal Encoding.

In [3]:
import pandas as pd

In [1]:
from sklearn.preprocessing import LabelEncoder

In [2]:
encoder = LabelEncoder()

In [4]:
color = [ 'red', 'blue', 'green', 'red' ]

df = pd.DataFrame({
    'color': color
 })

Unnamed: 0,color
0,red
1,blue
2,green
3,red


In [17]:
encoded = encoder.fit_transform( df[ 'color' ] )

encoded_df = pd.DataFrame({
    'Label_color': encoded
})

pd.concat( [ df , encoded_df ], axis=1 )


Unnamed: 0,color,Label_color
0,red,2
1,blue,0
2,green,1
3,red,2


Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

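Target Guided Ordinal Encoding is a technique used to encode categorical variables by considering the target variable's relationship with each category. For each category, a summary of the target (typically the mean) is computed, the categories are ranked by that value, and each category is replaced with its rank. It is useful when a feature has many categories and no natural order, for example a city column when predicting house prices, so that cities with higher average prices receive higher codes.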
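
The idea can be sketched with pandas alone. The `df_tg` data frame, the `city` and `price` columns, and their values are hypothetical, and the mean is assumed as the target summary.

```python
import pandas as pd

# Hypothetical data: 'city' is the categorical feature, 'price' the target
df_tg = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Pune', 'Delhi', 'Mumbai', 'Pune'],
    'price': [200, 500, 300, 240, 460, 320],
})

# Mean of the target for each category, sorted from lowest to highest
category_means = df_tg.groupby('city')['price'].mean().sort_values()

# Rank 0 goes to the category with the lowest mean target, and so on
rank_map = {category: rank for rank, category in enumerate(category_means.index)}

df_tg['city_encoded'] = df_tg['city'].map(rank_map)
print(rank_map)  # {'Delhi': 0, 'Pune': 1, 'Mumbai': 2}
```

Because the codes are derived from the target, the mapping would normally be learned on the training split only and then applied to validation or test data, to avoid leaking target information.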

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

**Covariance** is a statistical measure that quantifies the degree to which two variables change together. It indicates whether an increase in one variable corresponds to an increase, decrease, or no change in another variable. In other words, covariance measures the direction of the linear relationship between two variables.

**Importance of Covariance in Statistical Analysis:**

Covariance is important in statistical analysis for several reasons:

1. **Relationship Assessment:** Covariance helps in understanding how two variables are related to each other. A positive covariance suggests that as one variable increases, the other tends to increase as well, while a negative covariance indicates an inverse relationship.

2. **Portfolio Management:** In finance, covariance is crucial for managing portfolios. It helps assess the risk and diversification benefits of combining different assets. Assets with low or negative covariance can help reduce overall portfolio risk.

3. **Linear Regression:** In simple linear regression, the slope (beta coefficient) of the regression line equals the covariance of the independent and dependent variables divided by the variance of the independent variable, i.e. `cov(X, Y) / var(X)`.

4. **Multivariate Analysis:** Covariance is a fundamental concept in multivariate analysis, where relationships between multiple variables are studied simultaneously.

5. **Dimensionality Reduction:** Techniques like Principal Component Analysis (PCA) use covariance to find the principal components, which are orthogonal linear combinations of variables that capture the most variance in the data.

**Calculation of Covariance:**

The covariance between two variables X and Y is calculated using the following formula:

```plaintext
cov(X, Y) = Σ [(xᵢ - μₓ) * (yᵢ - μᵧ)] / (n - 1)
```

Where:
- `xᵢ` and `yᵢ` are individual data points of X and Y.
- `μₓ` and `μᵧ` are the sample means of X and Y, respectively.
- `n` is the number of data points.

The formula involves subtracting the mean of each variable from its data points, multiplying the differences, and then summing up the products. Dividing by `(n - 1)` instead of `n` corrects for bias and provides an unbiased estimate of the population covariance.
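
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with `np.cov`, which also divides by `(n - 1)` by default. The arrays `x` and `y` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical sample values
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)
# Sum of products of deviations from the sample means, divided by (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(X, Y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values should agree up to floating-point rounding.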

It's important to note that the magnitude of the covariance doesn't have a standardized interpretation; it's influenced by the scales of the variables. Therefore, the covariance is often normalized to obtain the **correlation coefficient**, which is a standardized measure of the linear relationship between two variables and ranges between -1 and 1.

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [19]:
df = pd.DataFrame({
    'Color': [ 'red', 'green', 'blue' ],
    'Size': [ 'small', 'medium', 'large' ],
    'Material': [ 'wood', 'metal', 'plastic' ]
})

df

Unnamed: 0,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic


In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
# Create an instance of Label Encoder 
encoder = LabelEncoder()

In [25]:
# Fit and transform each column; the encoder is refit on every call
df[ 'Color_Encoded'] = encoder.fit_transform( df[ 'Color'] )

In [27]:
df[ 'Size_Encoded'] = encoder.fit_transform( df[ 'Size'] )

In [28]:
df[ 'Material_Encoded'] = encoder.fit_transform( df[ 'Material'] )

In [29]:
df

Unnamed: 0,Color,Size,Material,Color_Encoded,Size_Encoded,Material_Encoded
0,red,small,wood,2,2,2
1,green,medium,metal,1,1,0
2,blue,large,plastic,0,0,1
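
Explanation of the output: `LabelEncoder` sorts the categories of each column alphabetically and numbers them from 0, so blue = 0, green = 1, red = 2; large = 0, medium = 1, small = 2; and metal = 0, plastic = 1, wood = 2. Note that the Size codes do not follow the natural small < medium < large order. The sketch below recovers these mappings from the fitted encoders' `classes_` attribute; `df_q4`, `encoders`, and `mappings` are names introduced here, and one encoder is kept per column so that each mapping stays available.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_q4 = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic'],
})

encoders = {}
mappings = {}
for column in ['Color', 'Size', 'Material']:
    enc = LabelEncoder()
    df_q4[column + '_Encoded'] = enc.fit_transform(df_q4[column])
    encoders[column] = enc
    # classes_ holds the sorted categories; the position in the array is the code
    mappings[column] = {str(label): code for code, label in enumerate(enc.classes_)}

print(mappings)
```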


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [31]:
# Sample data
data = {
    'Age': [25, 30, 22, 35, 28, 40, 32, 29, 26, 38],
    'Income': [50000, 60000, 45000, 75000, 55000, 80000, 65000, 60000, 52000, 72000],
    'Education_Level': [12, 16, 10, 18, 14, 20, 14, 16, 12, 18]
}

df = pd.DataFrame( data )

df

Unnamed: 0,Age,Income,Education_Level
0,25,50000,12
1,30,60000,16
2,22,45000,10
3,35,75000,18
4,28,55000,14
5,40,80000,20
6,32,65000,14
7,29,60000,16
8,26,52000,12
9,38,72000,18


In [33]:
df.cov()

Unnamed: 0,Age,Income,Education_Level
Age,33.388889,65111.11,17.222222
Income,65111.111111,132044400.0,34444.444444
Education_Level,17.222222,34444.44,10.0
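
Interpretation: every off-diagonal entry is positive, so in this sample Age, Income, and Education_Level tend to increase together, while the diagonal entries are the variances of each variable. The magnitudes are driven by the units (Income is in the tens of thousands), so they cannot be compared directly. One way to compare them, sketched here under the assumption that a linear relationship is of interest, is to normalize to the correlation matrix; `df_q5` is a name introduced for this example.

```python
import pandas as pd

df_q5 = pd.DataFrame({
    'Age': [25, 30, 22, 35, 28, 40, 32, 29, 26, 38],
    'Income': [50000, 60000, 45000, 75000, 55000, 80000, 65000, 60000, 52000, 72000],
    'Education_Level': [12, 16, 10, 18, 14, 20, 14, 16, 12, 18],
})

# Pearson correlation: covariance divided by the product of standard deviations
corr = df_q5.corr()
print(corr.round(3))
```

Each entry of the correlation matrix lies between -1 and 1, which makes the three pairwise relationships directly comparable.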


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

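Education Level and Employment Status have a natural ranking (High School < Bachelor's < Master's < PhD and Unemployed < Part-Time < Full-Time), so Ordinal Encoding is used for them. Gender is nominal with only two categories, so a single 0/1 indicator (one-hot encoding with one column dropped) is sufficient and implies no order.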
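
A minimal sketch of these choices follows. The rows of `df_q6` are hypothetical; only the category names come from the question, and the output column names are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows using the categories named in the question
df_q6 = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['High School', "Master's", 'PhD', "Bachelor's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

# One ordered category list per column, lowest rank first
ordered_columns = ['Education Level', 'Employment Status']
ordinal = OrdinalEncoder(categories=[
    ['High School', "Bachelor's", "Master's", 'PhD'],
    ['Unemployed', 'Part-Time', 'Full-Time'],
])
encoded = ordinal.fit_transform(df_q6[ordered_columns])
df_q6['Education_Encoded'] = encoded[:, 0].astype(int)
df_q6['Employment_Encoded'] = encoded[:, 1].astype(int)

# Gender has no order: one 0/1 indicator column is enough for two categories
df_q6['Gender_Male'] = (df_q6['Gender'] == 'Male').astype(int)

print(df_q6)
```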

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [37]:
# Sample data
data = {
    'Temperature': [25, 28, 22, 20, 30, 26, 24, 27, 23, 29],
    'Humidity': [60, 65, 70, 75, 55, 62, 68, 58, 63, 70],
    'Weather_Condition': ['Sunny', 'Cloudy', 'Cloudy', 'Rainy', 'Sunny', 'Rainy', 'Cloudy', 'Sunny', 'Rainy', 'Cloudy'],
    'Wind_Direction': ['North', 'South', 'East', 'West', 'North', 'West', 'South', 'East', 'North', 'South']
}

In [38]:
df = pd.DataFrame( data )

df

Unnamed: 0,Temperature,Humidity,Weather_Condition,Wind_Direction
0,25,60,Sunny,North
1,28,65,Cloudy,South
2,22,70,Cloudy,East
3,20,75,Rainy,West
4,30,55,Sunny,North
5,26,62,Rainy,West
6,24,68,Cloudy,South
7,27,58,Sunny,East
8,23,63,Rainy,North
9,29,70,Cloudy,South


In [39]:
## Covariance matrix for numerical data 
df[ [ 'Temperature', 'Humidity' ] ].cov()

Unnamed: 0,Temperature,Humidity
Temperature,10.266667,-12.155556
Humidity,-12.155556,38.266667

Interpretation: the covariance between Temperature and Humidity is negative (-12.16), so in this sample higher temperatures tend to go with lower humidity. The diagonal entries (10.27 and 38.27) are the variances of Temperature and Humidity.


In [45]:
## Perform label encoding 
from sklearn.preprocessing import LabelEncoder

In [46]:
encoder = LabelEncoder()

In [49]:
df[ 'En_Wheather' ] = encoder.fit_transform( df[ 'Weather_Condition'] )
df[ 'En_Wind' ] = encoder.fit_transform( df[ 'Wind_Direction'] )

In [50]:
df

Unnamed: 0,Temperature,Humidity,Weather_Condition,Wind_Direction,En_Wheather,En_Wind
0,25,60,Sunny,North,2,1
1,28,65,Cloudy,South,0,2
2,22,70,Cloudy,East,0,0
3,20,75,Rainy,West,1,3
4,30,55,Sunny,North,2,1
5,26,62,Rainy,West,1,3
6,24,68,Cloudy,South,0,2
7,27,58,Sunny,East,2,0
8,23,63,Rainy,North,1,1
9,29,70,Cloudy,South,0,2


In [51]:
## Covariance matrix for the label-encoded categorical data
df[[ 'En_Wheather', 'En_Wind' ] ].cov() 

Unnamed: 0,En_Wheather,En_Wind
En_Wheather,0.766667,-0.277778
En_Wind,-0.277778,1.166667
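
Interpretation: Weather Condition and Wind Direction are nominal, so the integer codes are arbitrary and the covariance computed from them has no real meaning. The sketch below illustrates this by recomputing the covariance with a different, equally valid numbering of the weather categories; the `relabelled` mapping is a hypothetical alternative chosen for the demonstration.

```python
import pandas as pd

weather = pd.Series(['Sunny', 'Cloudy', 'Cloudy', 'Rainy', 'Sunny',
                     'Rainy', 'Cloudy', 'Sunny', 'Rainy', 'Cloudy'])
wind = pd.Series(['North', 'South', 'East', 'West', 'North',
                  'West', 'South', 'East', 'North', 'South'])

# Same alphabetical codes that LabelEncoder produces
wind_codes = wind.map({'East': 0, 'North': 1, 'South': 2, 'West': 3})
alphabetical = weather.map({'Cloudy': 0, 'Rainy': 1, 'Sunny': 2})

# Hypothetical alternative numbering of the same categories
relabelled = weather.map({'Sunny': 0, 'Cloudy': 1, 'Rainy': 2})

cov_alphabetical = alphabetical.cov(wind_codes)
cov_relabelled = relabelled.cov(wind_codes)
print(cov_alphabetical, cov_relabelled)
```

The first value reproduces the -0.277778 shown above, while the relabelled version comes out positive, so the sign depends only on how the categories happen to be numbered.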
