Q1: What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Label Encoding:

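Definition: Assigns a unique integer to each category. The integers carry no intended meaning; scikit-learn's LabelEncoder simply numbers the classes in sorted (alphabetical) order.
Use Case: Suitable when there is no ordinal relationship between categories. scikit-learn intends LabelEncoder for the target column; when it is applied to a nominal input feature, a model may still treat the integers as ordered, so One-Hot Encoding is often the safer choice there.
Example: Encoding colors (Red, Green, Blue).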
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})
label_encoder = LabelEncoder()
data['Color_Encoded'] = label_encoder.fit_transform(data['Color'])
print(data)

Ordinal Encoding:

Definition: Assigns integers to categories in a way that reflects the natural order.
Use Case: Suitable when there is a clear order in the categories.
Example: Encoding education levels (High School, Bachelor's, Master's, PhD).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

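data = pd.DataFrame({'Education': ["High School", "Bachelor's", "Master's", "PhD"]})
# Pass the categories in their natural order so the codes follow it (High School=0 ... PhD=3)
ordinal_encoder = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's", "PhD"]])
# fit_transform returns a 2D array; ravel() flattens it to one value per row
data['Education_Encoded'] = ordinal_encoder.fit_transform(data[['Education']]).ravel()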
print(data)
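
The difference between the two encoders shows most clearly when the rows are not already in rank order. The sketch below reuses the education levels from the example above; the variable names (`levels`, `label_codes`, `ordinal_codes`) are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ["High School", "Bachelor's", "Master's", "PhD"]
education = pd.DataFrame({"Education": ["PhD", "High School", "Master's", "Bachelor's"]})

# LabelEncoder sorts the classes alphabetically, so the codes ignore the real ranking.
label_codes = LabelEncoder().fit_transform(education["Education"])

# OrdinalEncoder with an explicit category list follows the order we specify.
ordinal_codes = (
    OrdinalEncoder(categories=[levels])
    .fit_transform(education[["Education"]])
    .ravel()
)

print(label_codes)    # alphabetical: Bachelor's=0, High School=1, Master's=2, PhD=3
print(ordinal_codes)  # ranked: High School=0, Bachelor's=1, Master's=2, PhD=3
```

With Label Encoding, "Bachelor's" receives a lower code than "High School", which is why Ordinal Encoding is the better fit when the ranking matters.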


Q2: Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.
Target Guided Ordinal Encoding:

Definition: Ranks the categories by the mean of the target variable within each category and assigns integers in that rank order.
Use Case: Useful when you want to encode a categorical feature in a way that considers its relationship with the target variable.
Example: Predicting house prices based on neighborhood.
import pandas as pd

# Example data
data = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Price': [200000, 300000, 400000, 250000, 350000, 450000]
})

# Calculate mean price for each neighborhood
mean_price = data.groupby('Neighborhood')['Price'].mean().sort_values()
# Create a dictionary for encoding
encoding = {key: idx for idx, key in enumerate(mean_price.index)}
# Apply encoding
data['Neighborhood_Encoded'] = data['Neighborhood'].map(encoding)
print(data)
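
In a project, the mapping would normally be learned from the training data only and then reused on new data, so that target values from the test set do not leak into the feature. The sketch below assumes a hypothetical train/test split and uses -1 as an arbitrary sentinel for neighborhoods never seen in training; both choices are illustrative, not part of the original exercise.

```python
import pandas as pd

train = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "Price": [200000, 300000, 400000, 250000, 350000, 450000],
})
test = pd.DataFrame({"Neighborhood": ["C", "A", "D"]})  # "D" never appears in train

# Rank neighborhoods by mean training price (lowest mean -> 0).
ranking = train.groupby("Neighborhood")["Price"].mean().sort_values().index
encoding = {name: rank for rank, name in enumerate(ranking)}

# Apply the training mapping; unseen categories become -1 (an arbitrary sentinel).
test["Neighborhood_Encoded"] = (
    test["Neighborhood"].map(encoding).fillna(-1).astype(int)
)
print(test)
```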


Q3: Define covariance and explain why it is important in statistical analysis. How is covariance calculated?
Covariance:

Definition: Measures the degree to which two variables change together.
Importance: Indicates the direction of the linear relationship between variables (positive covariance indicates that the variables increase together, while negative covariance indicates one increases as the other decreases).
Calculation:

Covariance is calculated as:
$$
\mathrm{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
$$

where $X_i$ and $Y_i$ are individual data points, $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, and $n$ is the number of observations.
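
As a quick check of the formula, the sample covariance can be computed by hand and compared with NumPy. The arrays `x` and `y` below are made-up values used only for illustration.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of products of deviations, divided by n - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(x, y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```

Both values should agree, since `np.cov` also divides by n - 1 by default.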

4.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Size': ['Small', 'Medium', 'Large'],
    'Material': ['Wood', 'Metal', 'Plastic']
})

label_encoder = LabelEncoder()
data['Color_Encoded'] = label_encoder.fit_transform(data['Color'])
data['Size_Encoded'] = label_encoder.fit_transform(data['Size'])
data['Material_Encoded'] = label_encoder.fit_transform(data['Material'])

print(data)
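
Reusing a single LabelEncoder works for producing the codes, but each new fit overwrites the classes learned for the previous column. A possible variation, sketched below, keeps one encoder per column in a dictionary (the name `encoders` is illustrative) so that codes can later be mapped back to their labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "Color": ["Red", "Green", "Blue"],
    "Size": ["Small", "Medium", "Large"],
    "Material": ["Wood", "Metal", "Plastic"],
})

# Fit a separate encoder for every categorical column.
encoders = {}
for column in ["Color", "Size", "Material"]:
    encoders[column] = LabelEncoder()
    data[column + "_Encoded"] = encoders[column].fit_transform(data[column])

# Each encoder remembers its own classes, so codes can be mapped back to labels.
decoded_sizes = encoders["Size"].inverse_transform(data["Size_Encoded"])

print(data)
print(list(decoded_sizes))
```

Note that the codes follow alphabetical order (Large=0, Medium=1, Small=2), so they do not reflect the natural ordering of Size.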


5.
import pandas as pd
import numpy as np

# Example data
data = pd.DataFrame({
    'Age': [23, 45, 31, 34],
    'Income': [50000, 80000, 75000, 90000],
    'Education': [12, 16, 14, 18]
})

cov_matrix = data.cov()
print(cov_matrix)
Interpretation:

Age and Income: Covariance is positive (108750.0), indicating that Age and Income tend to increase together.
Age and Education: Covariance is positive (15.666667), indicating a tendency to increase together.
Income and Education: Covariance is positive (41666.666667), indicating a tendency to increase together.
Note: The magnitudes are not comparable across pairs, because covariance scales with the units of each variable (Income is measured in tens of thousands). Only the sign can be read directly.
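
Because covariance depends on the units of each variable, its magnitude is hard to compare across pairs. One common way to put the pairs on the same scale is the Pearson correlation, which divides each covariance by the product of the two standard deviations. The sketch below applies pandas' `DataFrame.corr` to the same example data.

```python
import pandas as pd

data = pd.DataFrame({
    "Age": [23, 45, 31, 34],
    "Income": [50000, 80000, 75000, 90000],
    "Education": [12, 16, 14, 18],
})

# Pearson correlation: covariance rescaled to the range [-1, 1].
corr_matrix = data.corr()
print(corr_matrix.round(3))
```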

Q6: Encoding methods for variables:
Gender (Male/Female): Label Encoding.
Reason: Only two categories, so a single 0/1 column is simple and effective.

Education Level (High School/Bachelor's/Master's/PhD): Ordinal Encoding.
Reason: There is an inherent order in education levels.

Employment Status (Unemployed/Part-Time/Full-Time): One-Hot Encoding.
Reason: No inherent order, and we want to avoid introducing a false ordinal relationship.
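
The three choices above can be sketched with pandas alone. The rows in `df` are invented sample values, and the 0/1 assignment for Gender is an arbitrary convention.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education": ["PhD", "High School", "Master's"],
    "Employment": ["Full-Time", "Unemployed", "Part-Time"],
})

# Binary variable: a single 0/1 column is enough.
df["Gender_Encoded"] = df["Gender"].map({"Female": 0, "Male": 1})

# Ordered variable: integer codes follow the stated ranking.
levels = ["High School", "Bachelor's", "Master's", "PhD"]
df["Education_Encoded"] = pd.Categorical(
    df["Education"], categories=levels, ordered=True
).codes

# Nominal variable: one indicator column per category.
employment_dummies = pd.get_dummies(df["Employment"], prefix="Employment", dtype=int)

encoded = pd.concat(
    [df[["Gender_Encoded", "Education_Encoded"]], employment_dummies], axis=1
)
print(encoded)
```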

Q7: Calculate the covariance between each pair of variables in a dataset with "Temperature", "Humidity", "Weather Condition" (Sunny/Cloudy/Rainy), and "Wind Direction" (North/South/East/West).
Calculation:

Continuous Variables: Calculate covariance directly.
Categorical Variables: Encode them first.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data
data = pd.DataFrame({
    'Temperature': [30, 25, 20, 15, 10],
    'Humidity': [80, 70, 90, 60, 50],
    'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind': ['North', 'South', 'East', 'West', 'North']
})

# Encode categorical variables
weather_encoder = LabelEncoder()
data['Weather_Encoded'] = weather_encoder.fit_transform(data['Weather'])

wind_encoder = LabelEncoder()
data['Wind_Encoded'] = wind_encoder.fit_transform(data['Wind'])

# Calculate covariance
cov_matrix = data[['Temperature', 'Humidity', 'Weather_Encoded', 'Wind_Encoded']].cov()
print(cov_matrix)



                 Temperature  Humidity  Weather_Encoded  Wind_Encoded
Temperature            62.50      87.5             2.50         -1.25
Humidity               87.50     250.0             5.00        -10.00
Weather_Encoded         2.50       5.0             1.00          0.25
Wind_Encoded           -1.25     -10.0             0.25          1.30


Interpretation:

Temperature and Humidity: Covariance is 87.5, indicating that the two tend to increase together in this sample.
Temperature and Weather: Covariance is 2.5, a slight positive value.
Temperature and Wind: Covariance is -1.25, a slight negative value.
Humidity and Weather: Covariance is 5.0, a slight positive value.
Humidity and Wind: Covariance is -10.0, a negative value.
Weather and Wind: Covariance is 0.25, close to zero.
Note: Weather and Wind are nominal, and LabelEncoder assigns their codes alphabetically (Cloudy=0, Rainy=1, Sunny=2; East=0, North=1, South=2, West=3). Any covariance involving these columns depends on that arbitrary numbering, so only the Temperature and Humidity value describes a real linear relationship.
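
Since integer codes for a nominal variable are arbitrary, one possible alternative is to one-hot encode it and look at the covariance of each indicator column with the continuous variables. The sketch below does this for Weather only, using the same example data; it is an illustration of the idea rather than the required answer.

```python
import pandas as pd

data = pd.DataFrame({
    "Temperature": [30, 25, 20, 15, 10],
    "Humidity": [80, 70, 90, 60, 50],
    "Weather": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
})

# One indicator column per weather condition, so no artificial ordering is imposed.
indicators = pd.get_dummies(data["Weather"], prefix="Weather", dtype=float)
numeric = pd.concat([data[["Temperature", "Humidity"]], indicators], axis=1)

cov_matrix = numeric.cov()
print(cov_matrix.round(2))
```

Each indicator's covariance with Temperature shows whether that condition tends to occur on warmer or cooler days in the sample.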