Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Answer:

**Ordinal encoding:**
Used for categorical data with a natural order, such as temperature categories like cool, mild, and hot. Ordinal encoding assigns integers that preserve this order (for example cool → 0, mild → 1, hot → 2), so a model that expects numerical input can make use of the ranking.

**Label encoding:**
Used for categorical data without a natural order, such as color categories like red, orange, blue, and white. Label encoding assigns a unique integer to each category, but the integers only identify the categories and carry no ranking.

Choosing Between Ordinal Encoding and Label Encoding:

**Label Encoding:** Use when the categorical feature is nominal, meaning that there is no meaningful order between categories.

For example:
City Names (New York, London, Paris) → There is no inherent order to city names.

**Ordinal Encoding:** Use when the categorical feature is ordinal, meaning that the categories have a natural, meaningful order.

For example:
Satisfaction Levels (Low, Medium, High) → There is a clear order from low to high.
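
As a rough illustration of the two cases above, the sketch below uses scikit-learn's `OrdinalEncoder` with an explicit category order for the satisfaction levels and `LabelEncoder` for the city names. The small sample lists are made up for this illustration. Note that scikit-learn documents `LabelEncoder` as intended for target labels, and that it numbers the classes in sorted (alphabetical) order, not in any meaningful order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal feature: pass the order explicitly so Low < Medium < High is preserved
satisfaction = np.array([["Low"], ["High"], ["Medium"], ["Low"]])  # made-up sample
ordinal_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
satisfaction_codes = ordinal_encoder.fit_transform(satisfaction).ravel()

# Nominal feature: the integers are only identifiers (assigned alphabetically)
cities = ["New York", "London", "Paris", "London"]  # made-up sample
label_encoder = LabelEncoder()
city_codes = label_encoder.fit_transform(cities)

print(satisfaction_codes)      # codes follow the order Low=0, Medium=1, High=2
print(label_encoder.classes_)  # classes in the order of their integer codes
print(city_codes)
```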


Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Answer: Target Guided Ordinal Encoding is a technique where categorical variables are encoded based on the relationship between each category and the target variable. Instead of assigning arbitrary integers, this method orders the categories by a summary of the target within each category, usually the mean of the target, and assigns integers according to that order.


How it Works:

1. Group by Category: First, you group the dataset by the categorical feature you want to encode.

2. Calculate the Mean of the Target: For each category, calculate the mean of the target variable. For a regression task this is the average target value; for binary classification it is the proportion of positive outcomes (e.g., the probability of churn in a churn prediction problem).

3. Sort and Assign Ranks: Sort the categories by the calculated target mean and assign ordinal values based on the ranking.

4. Encode Categories: Replace each category with the corresponding ordinal value.

Example:

In [3]:
import pandas as pd

# create a small dataset
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 150, 300]
})

# calculate the mean of price for each city
mean_price = df.groupby('city')['price'].mean().to_dict()

# replace each city with its mean price
# (this example stops at the mean; it does not convert the means to ranks)
df['city_encoded'] = df['city'].map(mean_price)
df

,city,price,city_encoded
0,New York,200,175.0
1,London,150,150.0
2,Paris,300,300.0
3,Tokyo,250,250.0
4,New York,150,175.0
5,Paris,300,300.0
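
The example above replaces each city with its mean price, which covers steps 1 and 2. A minimal sketch of steps 3 and 4 (sorting the categories by target mean and assigning ranks) could look like the following. The names `rank_map` and `city_rank` are introduced here for illustration only. In a real project the means would normally be computed on the training split only, to avoid leaking target information into the validation data.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 150, 300],
})

# Steps 1-2: mean of the target per category
mean_price = df.groupby('city')['price'].mean()

# Step 3: sort categories by target mean and assign ranks 0, 1, 2, ...
ordered_cities = mean_price.sort_values().index
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}

# Step 4: replace each category with its rank
df['city_rank'] = df['city'].map(rank_map)
print(rank_map)
print(df)
```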


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Answer: Covariance is a statistical measure that indicates the direction of the linear relationship between two variables. It helps to determine whether two variables move together in the same direction (positive covariance) or in opposite directions (negative covariance). Essentially, it measures how much two random variables vary together.

Importance of Covariance in Statistical Analysis:

1. Identifying Relationships: Covariance helps identify whether variables have a positive or negative relationship. This information is useful when understanding how variables are related and whether changes in one variable are associated with changes in another.

**Positive Covariance:** When one variable increases, the other also tends to increase.

**Negative Covariance:** When one variable increases, the other tends to decrease.

**Zero Covariance:** Indicates no linear relationship between the variables (a nonlinear relationship may still exist).

2. Basis for Correlation: While covariance gives the direction of the relationship, its magnitude depends on the units of the two variables, so it is hard to read as a measure of strength. Correlation, which is the covariance divided by the product of the two standard deviations, is unit-free and lies between -1 and 1, making the strength of the relationship easier to interpret.

3. Feature Selection: In machine learning or regression models, the covariance matrix (or, more commonly, the correlation matrix derived from it) can help detect multicollinearity between features. If two features vary together very strongly, they may carry redundant information.

Covariance between two variables X and Y is calculated using the following formula:

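$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y, and n is the number of observations.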
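
To make the formula above concrete, here is a small sketch that computes the sample covariance by hand and compares it with NumPy's `np.cov`, which also divides by n - 1 by default. The arrays `x` and `y` are made-up values used only for illustration.

```python
import numpy as np

# Made-up sample values, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of products of deviations from the means, divided by n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```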

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

Answer:

In [7]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'size': ['small', 'medium', 'large'],
    'material': ['wood', 'metal', 'plastic']
})

# fit_transform refits the encoder on each column, so every column gets its own 0-based codes
encoder = LabelEncoder()
df['color_encoded'] = encoder.fit_transform(df['color'])
df['size_encoded'] = encoder.fit_transform(df['size'])
df['material_encoded'] = encoder.fit_transform(df['material'])
df

,color,size,material,color_encoded,size_encoded,material_encoded
0,red,small,wood,2,2,2
1,green,medium,metal,1,1,0
2,blue,large,plastic,0,0,1
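
Explanation of the output: `LabelEncoder` sorts the distinct values of a column and numbers them from 0, so the codes follow alphabetical order (blue → 0, green → 1, red → 2). For Size this gives large → 0, medium → 1, small → 2, which does not match the natural order small < medium < large. The sketch below is one possible way to inspect these mappings; it keeps a separate encoder per column (the `encoders` and `mappings` dictionaries are names introduced here) so each fitted mapping stays available for later use, such as `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'size': ['small', 'medium', 'large'],
    'material': ['wood', 'metal', 'plastic'],
})

encoders = {}  # one fitted encoder per column
mappings = {}  # label -> integer code, per column
for column in ['color', 'size', 'material']:
    column_encoder = LabelEncoder()
    df[column + '_encoded'] = column_encoder.fit_transform(df[column])
    encoders[column] = column_encoder
    # classes_ lists the labels in the order of their integer codes
    mappings[column] = {label: code for code, label in enumerate(column_encoder.classes_)}

print(mappings)
```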


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [11]:
# Let’s assume we have the following data for Age, Income, and Education Level:
import pandas as pd
df = pd.DataFrame({
    'Age':[25,30,35,40,45],
    'Income':[50000,60000,75000,90000,100000],
    'Education_level':[12,16,18,20,22]
})

# Calculate the sample covariance matrix (pandas divides by n - 1)
covariance_matrix = df.cov()
covariance_matrix

,Age,Income,Education_level
Age,62.5,162500.0,30.0
Income,162500.0,425000000.0,77500.0
Education_level,30.0,77500.0,14.8
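
Interpretation: the diagonal entries are the variances of each variable, and all off-diagonal covariances are positive, so in this sample Age, Income, and Education level tend to rise together. The raw magnitudes mostly reflect the units (Income is in the tens of thousands), so they do not indicate which relationship is strongest. One way to compare strengths is to rescale the covariance matrix into a correlation matrix, sketched below on the same assumed data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 75000, 90000, 100000],
    'Education_level': [12, 16, 18, 20, 22],
})

cov = df.cov()

# Correlation = covariance divided by the product of the two standard deviations
std = np.sqrt(np.diag(cov))
manual_corr = cov / np.outer(std, std)

# pandas computes the same (Pearson) correlation directly
corr = df.corr()
print(corr)
```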


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

Answer: The dataset contains the categorical variables "Gender," "Education Level," and "Employment Status," and the encoding method for each should be chosen based on the nature of the variable:

1. Gender (Male/Female):
Here I use Label Encoding because Gender is binary in this dataset, so a single integer column is enough to distinguish the two categories:

Male → 0

Female → 1

2. Education Level (High School, Bachelor’s, Master’s, PhD):
Here I use Ordinal Encoding because the levels of education have a natural order (High School < Bachelor's < Master's < PhD), and the integer codes should capture this order:

 High School → 0

 Bachelor’s → 1

 Master’s → 2

 PhD → 3

3. Employment Status (Unemployed, Part-Time, Full-Time):
Here I use One-Hot Encoding because Employment Status is treated as nominal (no natural order is assumed among "Unemployed," "Part-Time," and "Full-Time"), so One-Hot Encoding avoids introducing an artificial ranking. It creates three columns:

Employment_Unemployed → 1/0

Employment_Part-Time → 1/0

Employment_Full-Time → 1/0


Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

Answer:

In [24]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    "Temperature": [25, 30, 35, 40, 45],
    "Humidity": [60, 65, 70, 75, 80],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# Label-encode "Weather Condition" (integer codes follow alphabetical order of the labels)
label_encoder = LabelEncoder()
df['Weather Condition_encoded'] = label_encoder.fit_transform(df['Weather Condition'])

# One-hot encode "Wind Direction" (one 0/1 column per direction)
onehot_encoder = OneHotEncoder()
encoded_data = onehot_encoder.fit_transform(df[['Wind Direction']])
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=onehot_encoder.get_feature_names_out(['Wind Direction']))

# Swap the original text columns for their encoded versions
df = pd.concat([df, encoded_df], axis=1)
df = df.drop(['Wind Direction', 'Weather Condition'], axis=1)

# Calculate the covariance matrix of all numeric columns
covariance_matrix = df.cov()
covariance_matrix



,Temperature,Humidity,Weather Condition_encoded,Wind Direction_East,Wind Direction_North,Wind Direction_South,Wind Direction_West
Temperature,62.5,62.5,-2.5,-2.775558e-17,-5.5511150000000004e-17,-1.25,1.25
Humidity,62.5,62.5,-2.5,-2.775558e-17,-5.5511150000000004e-17,-1.25,1.25
Weather Condition_encoded,-2.5,-2.5,1.0,0.0,0.0,-0.25,0.25
Wind Direction_East,-2.775558e-17,-2.775558e-17,0.0,0.2,-0.1,-0.05,-0.05
Wind Direction_North,-5.5511150000000004e-17,-5.5511150000000004e-17,0.0,-0.1,0.3,-0.1,-0.1
Wind Direction_South,-1.25,-1.25,-0.25,-0.05,-0.1,0.2,-0.05
Wind Direction_West,1.25,1.25,0.25,-0.05,-0.1,-0.05,0.2
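
Interpretation: Temperature and Humidity have a covariance of 62.5, equal to each variable's variance, because in this sample Humidity is exactly Temperature plus a constant, so the two rise together perfectly. Entries such as -2.775558e-17 are floating-point rounding and can be read as zero. The covariances involving the encoded categorical columns should be read with care: for a label-encoded nominal variable, the value depends on which integer each category happens to receive. The sketch below illustrates this with two equally valid labelings of "Weather Condition"; `mapping_a` mirrors the alphabetical codes and `mapping_b` is an arbitrary alternative introduced here.

```python
import pandas as pd

temperature = pd.Series([25, 30, 35, 40, 45])
humidity = pd.Series([60, 65, 70, 75, 80])
weather = pd.Series(["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"])

# Alphabetical codes, as LabelEncoder assigns them
mapping_a = {"Cloudy": 0, "Rainy": 1, "Sunny": 2}
# An arbitrary alternative labeling of the same categories
mapping_b = {"Sunny": 0, "Cloudy": 1, "Rainy": 2}

cov_a = temperature.cov(weather.map(mapping_a))
cov_b = temperature.cov(weather.map(mapping_b))

# Correlation between the two continuous variables
corr_temp_humidity = temperature.corr(humidity)

print(cov_a, cov_b, corr_temp_humidity)
```

The covariance with the label-encoded column changes sign between the two labelings, which suggests it reflects the coding choice and not a property of the weather data.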
