ASSIGNMENT: FE-5

1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you 
might choose one over the other.

Ordinal encoding and label encoding are two commonly used techniques for encoding categorical variables in machine learning.

Ordinal encoding is a technique in which the categories of a categorical variable are assigned numerical values based on their order or rank. For example, if we have a categorical variable "size" with categories "small", "medium", and "large", we can assign the values 1, 2, and 3 respectively, based on their order.

Label encoding, on the other hand, assigns each category a unique integer without regard to any meaningful order. For example, scikit-learn's LabelEncoder numbers the categories in alphabetical order, so the same "size" variable would be encoded as "large" = 0, "medium" = 1, and "small" = 2, which does not reflect the actual sizes.

Choose ordinal encoding when the categorical variable has a natural ordering or hierarchy, such as "low", "medium", and "high" or "small", "medium", and "large". In this case, the encoding captures the relationship between the categories and can help the model make better predictions. Label encoding can be used when the variable has no natural ordering and we only need a unique integer for each category, for example when encoding a target variable. Keep in mind that the integers still look ordered to a model, so for nominal input features one-hot encoding is often a safer choice.

For instance, consider the "education level" variable with categories "high school", "some college", "bachelor's degree", "master's degree" and "PhD". This variable has a clear order or hierarchy, and ordinal encoding can be used to encode this variable with numerical values of 1 to 5 based on their order. On the other hand, if we have a categorical variable like "favorite color" with categories "red", "blue", and "green", which do not have a natural ordering or hierarchy, label encoding can be used.
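
The difference can be checked with a minimal sketch using scikit-learn, assuming the "size" example above. The sample values are illustrative, and scikit-learn numbers categories from 0 rather than from 1 as in the prose.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# illustrative sample of the "size" variable
sizes = np.array(["small", "large", "medium", "small"])

# OrdinalEncoder with an explicit order: small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

# LabelEncoder has no notion of order; it sorts the categories alphabetically
label = LabelEncoder()
label_codes = label.fit_transform(sizes)

print(ordinal_codes)   # codes follow the declared order
print(label_codes)     # codes follow alphabetical order
print(label.classes_)  # the alphabetical category list
```

With the declared order, "small" gets the lowest code and "large" the highest, while the alphabetical codes put "large" first.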

2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in 
a machine learning project

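Target Guided Ordinal Encoding encodes a categorical variable using information from the target variable. Each category is ranked by the mean (or median) of the target within that category, and the rank becomes its encoded value. It is useful when a nominal variable has many categories, so that one-hot encoding would create too many columns, and the categories differ in how they relate to the target, for example encoding "city" by its purchase rate as in the code below. The ranking should be computed on the training data only, so that no target information leaks from the test set.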

In [5]:
import pandas as pd
import numpy as np

# create a sample dataframe
df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Los Angeles', 'Houston'],
    'purchase': [0, 1, 1, 0, 1, 0, 0]
})

# calculate the mean purchase rate for each city category
city_mean = df.groupby('city')['purchase'].mean()


# rank the categories based on their mean purchase rate in descending order
city_order = city_mean.sort_values(ascending=False).index

# assign numerical values to the categories based on their rank
city_dict = {city: rank for rank, city in enumerate(city_order, 1)}

# apply the encoding to the original dataframe
df['city_encoded'] = df['city'].map(city_dict)

df


          city  purchase  city_encoded
0     New York         0             3
1  Los Angeles         1             2
2      Chicago         1             1
3      Houston         0             4
4     New York         1             3
5  Los Angeles         0             2
6      Houston         0             4
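
In the output above, New York and Los Angeles have the same mean purchase rate (0.5) but receive different codes. One possible variation, sketched here as an assumption rather than as part of the original method, is to use a dense rank so that tied categories share a code, and to learn the mapping on training rows only. The `new_rows` frame and the city "Boston" are hypothetical and only illustrate an unseen category.

```python
import pandas as pd

# training data: same values as the example above
train = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Los Angeles', 'Houston'],
    'purchase': [0, 1, 1, 0, 1, 0, 0]
})

# hypothetical new rows; 'Boston' never appears in the training data
new_rows = pd.DataFrame({'city': ['Chicago', 'Boston']})

# mean purchase rate per city, learned from the training rows only
city_mean = train.groupby('city')['purchase'].mean()

# dense rank: the highest mean gets 1, and tied cities share the same rank
city_rank = city_mean.rank(method='dense', ascending=False).astype(int)

train['city_encoded'] = train['city'].map(city_rank)
new_rows['city_encoded'] = new_rows['city'].map(city_rank)  # unseen city becomes NaN

print(city_rank)
print(new_rows)
```

Unseen categories come out as NaN and need an explicit policy, such as a default rank, before the data is passed to a model.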


3.  Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that describes how two random variables vary together. Its sign gives the direction of their linear relationship: whether above-average values of one variable tend to occur with above-average or below-average values of the other. Its magnitude depends on the units of the variables, so on its own it does not show how strong the relationship is.

Covariance is important in statistical analysis because it helps us to understand the relationship between two variables. It can be used to identify whether two variables have a positive, negative, or no linear relationship. A positive covariance indicates that the two variables tend to move in the same direction, while a negative covariance indicates that they tend to move in opposite directions. A covariance of zero indicates that there is no linear relationship between the two variables. It does not by itself mean that they are independent, because they may still be related nonlinearly.
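
One caveat about a covariance of zero can be checked numerically. In this small sketch the values are illustrative: y is completely determined by x, yet the covariance is zero because the relationship is not linear.

```python
import numpy as np

# illustrative values, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # y depends entirely on x, but not linearly

# off-diagonal entry of the 2x2 covariance matrix
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)
```

So zero covariance rules out a linear trend, not every kind of dependence.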

Covariance is calculated by taking the product of the deviations of two variables from their means, and then averaging them. The formula for covariance is:

cov(X,Y) = E[(X - E[X]) * (Y - E[Y])]

Where X and Y are the two variables, E[X] and E[Y] are their respective means, and E[] denotes the expected value.
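
As a worked example of the formula, the sketch below uses the Age and Income values from question 5 of this assignment. It computes both the average over n (the population form of the formula) and the version divided by n - 1, which is what NumPy and pandas return by default for a sample.

```python
import numpy as np

# Age and Income values from question 5
age = np.array([19, 45, 34, 23], dtype=float)
income = np.array([15000, 100000, 60000, 25000], dtype=float)

# products of the deviations from the means
dev_products = (age - age.mean()) * (income - income.mean())

population_cov = dev_products.mean()              # divides by n
sample_cov = dev_products.sum() / (len(age) - 1)  # divides by n - 1

print(population_cov)
print(sample_cov)
print(np.cov(age, income)[0, 1])  # NumPy default matches the n - 1 version
```

The n - 1 result is the value that appears in the df.cov() output of question 5.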

In practice, the covariances for a dataset are usually collected in a covariance matrix, which contains the covariance between every pair of variables. The diagonal of the matrix contains the variance of each variable (i.e., the covariance of a variable with itself), and the off-diagonal elements contain the covariances between pairs of variables.
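
The structure of the matrix can be verified with a short sketch. It uses the Temperature and Humidity values from question 7 of this assignment and checks that the diagonal equals the variances and that the matrix is symmetric.

```python
import numpy as np
import pandas as pd

# Temperature and Humidity values from question 7
df_num = pd.DataFrame({
    'Temperature': [28, 26, 30, 27, 29, 25, 31, 24],
    'Humidity': [60, 65, 55, 62, 58, 63, 53, 68]
})

cov_matrix = df_num.cov()

# the diagonal holds the variances of the individual columns
diagonal_is_variance = np.allclose(np.diag(cov_matrix), df_num.var())

# cov(X, Y) equals cov(Y, X), so the matrix is symmetric
is_symmetric = np.allclose(cov_matrix, cov_matrix.T)

print(cov_matrix)
print(diagonal_is_variance, is_symmetric)
```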

It's important to note that covariance is affected by the scale of the variables. When the variables are measured in different units or have different scales, the size of the covariance is hard to interpret. To overcome this issue, a standardized measure such as correlation is often used. Correlation divides the covariance by the product of the two standard deviations, which scales it to lie between -1 and 1 and makes it easier to interpret and compare across different datasets.
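
The effect of scale can be seen in a small sketch, again assuming the Age and Income values from question 5. Expressing income in thousands changes the covariance by a factor of 1000 but leaves the correlation unchanged.

```python
import numpy as np

age = np.array([19, 45, 34, 23], dtype=float)
income = np.array([15000, 100000, 60000, 25000], dtype=float)
income_thousands = income / 1000  # same information, different unit

cov_raw = np.cov(age, income)[0, 1]
cov_scaled = np.cov(age, income_thousands)[0, 1]

corr_raw = np.corrcoef(age, income)[0, 1]
corr_scaled = np.corrcoef(age, income_thousands)[0, 1]

# correlation = covariance / (std of x * std of y), using the same n - 1 convention
corr_manual = cov_raw / (age.std(ddof=1) * income.std(ddof=1))

print(cov_raw, cov_scaled)
print(corr_raw, corr_scaled, corr_manual)
```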

4.  For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, 
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. 
Show your code and explain the output

In [8]:
import pandas as pd

data={ 'Color':['red','green','blue'],'Size':['small','medium','large'],'Material':['wood','metal','plastic']}

df=pd.DataFrame(data)
df

   Color    Size  Material
0    red   small      wood
1  green  medium     metal
2   blue   large   plastic


In [9]:
from sklearn.preprocessing import LabelEncoder


In [13]:

# create an instance of LabelEncoder
le = LabelEncoder()

# fit and transform each column in turn; the encoder is refit on every column,
# so afterwards `le` only remembers the classes of the last column (Material)
df[['Color', 'Size', 'Material']] = df[['Color', 'Size', 'Material']].apply(lambda col: le.fit_transform(col))


In [14]:
df

   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1

LabelEncoder sorts the categories of each column alphabetically and numbers them from 0. Color becomes blue = 0, green = 1, red = 2; Size becomes large = 0, medium = 1, small = 2; Material becomes metal = 0, plastic = 1, wood = 2. The integers are only identifiers and do not reflect any real order (for example, "small" gets the largest Size code).


In [15]:
# another example

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# create the DataFrame
data = {
    'Color': ['red', 'green', 'blue', 'green', 'blue', 'red', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small', 'large', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'plastic', 'metal', 'metal']
}
df = pd.DataFrame(data)

# create an instance of LabelEncoder
le = LabelEncoder()

# encode the categorical variables
df['Color'] = le.fit_transform(df['Color'])
df['Size'] = le.fit_transform(df['Size'])
df['Material'] = le.fit_transform(df['Material'])

# print the encoded DataFrame
print(df)


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     1         2
4      0     2         1
5      2     0         0
6      2     1         0
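
Reusing a single LabelEncoder works for producing the codes, but the fitted mapping of the earlier columns is lost. A possible variation, sketched here as an assumption about what might be convenient, keeps one encoder per column so the mappings can be inspected and reversed. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels; for input features OrdinalEncoder or OneHotEncoder is the usual choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_raw = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic']
})

# one fitted encoder per column, stored by column name
encoders = {}
encoded = df_raw.copy()
for col in df_raw.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(df_raw[col])
    encoders[col] = enc

# category -> integer mapping for every column
mappings = {
    col: {str(cls): i for i, cls in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}

# the stored encoder can turn the integers back into labels
decoded_color = encoders['Color'].inverse_transform(encoded['Color'])

print(mappings)
print(list(decoded_color))
```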


5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education 
level. Interpret the results.

In [18]:
import pandas as pd

data=  {
          'Age':[19,45,34,23],
          'Income':[15000,100000,60000,25000],
          'Education': ['12th','MBA','Btech','BSc']
}

df=pd.DataFrame(data)

df

   Age  Income  Education
0   19   15000       12th
1   45  100000        MBA
2   34   60000      Btech
3   23   25000        BSc


In [19]:
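# Education is a text column, so restrict the calculation to the numeric columns
# (pandas 2.0+ raises an error here unless numeric_only=True is passed)
df.cov(numeric_only=True)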


               Age        Income
Age     136.916667      450000.0
Income    450000.0  1483333000.0


In [21]:
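# The positive Age-Income covariance (450000) suggests that, in this sample, income tends to rise with age.
# Education is text, so it is left out of the covariance matrix unless it is first encoded as numbers.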

In [22]:
# correlation puts the relationship on a -1 to 1 scale; numeric columns only
df.corr(numeric_only=True)


             Age    Income
Age          1.0  0.998539
Income  0.998539       1.0

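
To bring Education level into the covariance matrix, it has to be encoded as a number first. The sketch below is one way to do that. The ranking of the four qualifications is an assumption made for illustration (the assignment does not define it), and treating BSc and Btech as the same level is a judgment call.

```python
import pandas as pd

df_edu = pd.DataFrame({
    'Age': [19, 45, 34, 23],
    'Income': [15000, 100000, 60000, 25000],
    'Education': ['12th', 'MBA', 'Btech', 'BSc']
})

# ASSUMPTION: this ordering is not given in the assignment
# (12th < bachelor's degrees < MBA, with BSc and Btech at the same level)
education_rank = {'12th': 0, 'BSc': 1, 'Btech': 1, 'MBA': 2}
df_edu['Education_rank'] = df_edu['Education'].map(education_rank)

cov_with_education = df_edu[['Age', 'Income', 'Education_rank']].cov()
print(cov_with_education)
```

Under this assumed ranking, Education has a positive covariance with both Age and Income in this four-row sample. A different ranking would change the numbers, so the result should be read with that in mind.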

6. You are working on a machine learning project with a dataset containing several categorical 
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), 
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for 
each variable, and why?

Gender: Binary encoding can be used because the variable has only two categories (Male and Female). It encodes the variable as a single 0/1 column, which preserves the information without adding extra columns to the dataset.

Education Level: Ordinal encoding can be used as the variable has an inherent order to it (High School < Bachelor's < Master's < PhD). Ordinal encoding will assign a numerical value to each category based on their order, which will be useful in retaining the order of the categories.

Employment Status: One-Hot encoding can be used as the variable has three categories (Unemployed/Part-Time/Full-Time) with no inherent order. One-Hot encoding will create a binary column for each category and assign a value of 1 or 0, which will help in preserving the information without implying any order to the categories.
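
The three choices above could be applied as in the following sketch. The four rows are hypothetical and exist only to make the example runnable; the column names follow the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical rows, for illustration only
people = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['High School', 'PhD', "Bachelor's", "Master's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time']
})

# Gender: a single 0/1 column (1 = Male, 0 = Female)
gender_encoded = (people['Gender'] == 'Male').astype(int)

# Education Level: ordinal encoding with the order stated explicitly
edu_order = [['High School', "Bachelor's", "Master's", 'PhD']]
edu_encoder = OrdinalEncoder(categories=edu_order)
education_encoded = edu_encoder.fit_transform(people[['Education Level']]).ravel()

# Employment Status: one binary column per category
employment_dummies = pd.get_dummies(people['Employment Status'], prefix='Employment', dtype=int)

encoded_people = pd.concat(
    [gender_encoded.rename('Gender_encoded'),
     pd.Series(education_encoded, name='Education_encoded'),
     employment_dummies],
    axis=1
)
print(encoded_people)
```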

7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two 
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [23]:
import pandas as pd

# create a DataFrame with the given data
data = {
    'Temperature': [28, 26, 30, 27, 29, 25, 31, 24],
    'Humidity': [60, 65, 55, 62, 58, 63, 53, 68],
    'Weather Condition': ['Sunny', 'Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy', 'Rainy', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'North', 'South', 'West', 'West', 'East']
}
df = pd.DataFrame(data)

# calculate the covariance matrix
# (the two categorical columns are text, so only numeric columns are used)
cov_matrix = df.cov(numeric_only=True)

# print the covariance matrix
print(cov_matrix)


             Temperature   Humidity
Temperature          6.0 -12.000000
Humidity           -12.0  25.428571




In [24]:
df.corr(numeric_only=True)


             Temperature   Humidity
Temperature          1.0  -0.971504
Humidity       -0.971504        1.0

The covariance of -12 between Temperature and Humidity means that hotter days in this sample tend to be less humid, and the correlation of -0.97 shows that this linear relationship is strong. Weather Condition and Wind Direction are text columns, so they are left out of the matrix; they must be converted to numbers before a covariance can be computed for them.


In [25]:
# another way

In [46]:
import numpy as np
import pandas as pd

# create example data
data = {
    'Temperature': [25, 28, 23, 27, 26, 24],
    'Humidity': [60, 50, 70, 65, 55, 75],
    'Weather Condition': ['Sunny', 'Cloudy', 'Sunny', 'Rainy', 'Cloudy', 'Rainy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North', 'East']
}

# create DataFrame from data
df = pd.DataFrame(data)

df

   Temperature  Humidity  Weather Condition  Wind Direction
0           25        60              Sunny           North
1           28        50             Cloudy           South
2           23        70              Sunny            East
3           27        65              Rainy            West
4           26        55             Cloudy           North
5           24        75              Rainy            East


In [47]:
from sklearn.preprocessing import OneHotEncoder


In [48]:
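# `sparse_output` replaces the old `sparse` argument (renamed in scikit-learn 1.2, removed in 1.4)
ohe = OneHotEncoder(sparse_output=False)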
encoded_data = ohe.fit_transform(df[['Weather Condition', 'Wind Direction']])

# convert the encoded data back to a DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=ohe.get_feature_names_out(['Weather Condition', 'Wind Direction']))

# combine the encoded data with the original dataset
df1 = pd.concat([df[['Temperature', 'Humidity']], encoded_df], axis=1)





In [49]:
df1

,Temperature,Humidity,Weather Condition_Cloudy,Weather Condition_Rainy,Weather Condition_Sunny,Wind Direction_East,Wind Direction_North,Wind Direction_South,Wind Direction_West
0,25,60,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,28,50,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,23,70,0.0,0.0,1.0,1.0,0.0,0.0,0.0
3,27,65,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,26,55,1.0,0.0,0.0,0.0,1.0,0.0,0.0
5,24,75,0.0,1.0,0.0,1.0,0.0,0.0,0.0


In [50]:
df1.cov()

,Temperature,Humidity,Weather Condition_Cloudy,Weather Condition_Rainy,Weather Condition_Sunny,Wind Direction_East,Wind Direction_North,Wind Direction_South,Wind Direction_West
Temperature,3.5,-13.5,0.6,-2.2204460000000003e-17,-0.6,-0.8,-2.775558e-17,0.5,0.3
Humidity,-13.5,87.5,-4.0,3.0,1.0,4.0,-2.0,-2.5,0.5
Weather Condition_Cloudy,0.6,-4.0,0.266667,-0.1333333,-0.133333,-0.133333,0.06666667,0.133333,-0.066667
Weather Condition_Rainy,-2.2204460000000003e-17,3.0,-0.133333,0.2666667,-0.133333,0.066667,-0.1333333,-0.066667,0.133333
Weather Condition_Sunny,-0.6,1.0,-0.133333,-0.1333333,0.266667,0.066667,0.06666667,-0.066667,-0.066667
Wind Direction_East,-0.8,4.0,-0.133333,0.06666667,0.066667,0.266667,-0.1333333,-0.066667,-0.066667
Wind Direction_North,-2.775558e-17,-2.0,0.066667,-0.1333333,0.066667,-0.133333,0.2666667,-0.066667,-0.066667
Wind Direction_South,0.5,-2.5,0.133333,-0.06666667,-0.066667,-0.066667,-0.06666667,0.166667,-0.033333
Wind Direction_West,0.3,0.5,-0.066667,0.1333333,-0.066667,-0.066667,-0.06666667,-0.033333,0.166667


In [51]:
df1.corr()

,Temperature,Humidity,Weather Condition_Cloudy,Weather Condition_Rainy,Weather Condition_Sunny,Wind Direction_East,Wind Direction_North,Wind Direction_South,Wind Direction_West
Temperature,1.0,-0.771429,0.621059,9.193520000000001e-17,-0.621059,-0.828079,2.2983800000000002e-17,0.654654,0.392792
Humidity,-0.7714286,1.0,-0.828079,0.621059,0.20702,0.828079,-0.4140393,-0.654654,0.130931
Weather Condition_Cloudy,0.621059,-0.828079,1.0,-0.5,-0.5,-0.5,0.25,0.632456,-0.316228
Weather Condition_Rainy,9.193520000000001e-17,0.621059,-0.5,1.0,-0.5,0.25,-0.5,-0.316228,0.632456
Weather Condition_Sunny,-0.621059,0.20702,-0.5,-0.5,1.0,0.25,0.25,-0.316228,-0.316228
Wind Direction_East,-0.8280787,0.828079,-0.5,0.25,0.25,1.0,-0.5,-0.316228,-0.316228
Wind Direction_North,2.2983800000000002e-17,-0.414039,0.25,-0.5,0.25,-0.5,1.0,-0.316228,-0.316228
Wind Direction_South,0.6546537,-0.654654,0.632456,-0.3162278,-0.316228,-0.316228,-0.3162278,1.0,-0.2
Wind Direction_West,0.3927922,0.130931,-0.316228,0.6324555,-0.316228,-0.316228,-0.3162278,-0.2,1.0
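
To help interpret the covariances that involve the one-hot columns, the sketch below checks one entry by hand. For a 0/1 indicator, the sample covariance with a numeric variable equals n1 * (group mean - overall mean) / (n - 1), where n1 is the number of rows in the group. Its sign therefore shows whether the group's average lies above or below the overall average. The example uses the Temperature and Weather Condition values from the second dataset above.

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    'Temperature': [25, 28, 23, 27, 26, 24],
    'Weather Condition': ['Sunny', 'Cloudy', 'Sunny', 'Rainy', 'Cloudy', 'Rainy']
})

# 0/1 indicator for Cloudy rows, as produced by one-hot encoding
is_cloudy = (weather['Weather Condition'] == 'Cloudy').astype(float)

# covariance as pandas computes it (n - 1 denominator)
cov_direct = weather['Temperature'].cov(is_cloudy)

# the same value from the group mean and the overall mean
n = len(weather)
n_cloudy = int(is_cloudy.sum())
cloudy_mean = weather.loc[is_cloudy == 1, 'Temperature'].mean()
overall_mean = weather['Temperature'].mean()
cov_from_means = n_cloudy * (cloudy_mean - overall_mean) / (n - 1)

print(cov_direct, cov_from_means)
```

Both values match the Temperature and Weather Condition_Cloudy entry of the covariance matrix: Cloudy days in this sample are warmer than average, so the covariance is positive. With only six rows, these figures describe the sample and should not be read as general patterns.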
