Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.


In [None]:
"""
Label Encoding: Assigns each category a unique integer without regard to any inherent order (scikit-learn's
LabelEncoder numbers the classes in sorted, i.e. alphabetical, order). It suits target labels and nominal data,
but the integers can unintentionally imply an ordinal relationship to a model.

Ordinal Encoding: Assigns integers that follow the inherent order of the categories. Suitable for ordinal
data where the categories have a meaningful sequence, such as small < medium < large.

Choose Ordinal Encoding for ordinal data with a clear sequence (e.g. shirt size or education level) and Label
Encoding for data without one (e.g. the class labels of a classification target).
Be cautious with Label Encoding on input features: a model may read the integers as an order that does not exist.
"""
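The difference can be seen on a small example. This is a minimal sketch using scikit-learn's LabelEncoder and OrdinalEncoder; the `sizes` list and the chosen category order are illustrative assumptions, not data from the assignment.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, chosen only for illustration
sizes = ["small", "large", "medium", "small"]

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2
label_codes = LabelEncoder().fit_transform(sizes)
print(label_codes)    # [2 0 1 2]

# OrdinalEncoder follows the order we pass in: small=0, medium=1, large=2
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform([[s] for s in sizes]).ravel()
print(ordinal_codes)  # [0. 2. 1. 0.]
```

With LabelEncoder, "small" receives the largest code only because of alphabetical sorting, which is why it is a poor fit for ordered features.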

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.


In [None]:
"""
Target Guided Ordinal Encoding is a technique where you encode a categorical variable based on its
relationship with the target variable, so the order of the codes comes from the data rather than from the labels.
The steps are:

1. Calculate the mean (or median) of the target variable for each category.
2. Order the categories based on these values.
3. Replace each category with its rank in that order, so the encoded variable has a monotonic relationship
   with the target.


Example:
Use this technique when we want the encoding to capture how each category relates to the target variable.
For instance, in predicting resale prices, we might encode the "Condition" of a product so that the codes
rank the conditions by their average price.
"""
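The three steps above can be sketched with pandas. The column names, category labels, and prices below are made-up values for illustration only.

```python
import pandas as pd

# Hypothetical resale data, invented for this sketch
df = pd.DataFrame({
    "condition": ["poor", "good", "fair", "good",
                  "poor", "excellent", "fair", "excellent"],
    "price": [100, 300, 200, 340, 120, 500, 220, 460],
})

# Step 1: mean of the target for each category
means = df.groupby("condition")["price"].mean()

# Step 2: order the categories by that mean, lowest first
ordered = means.sort_values().index

# Step 3: replace each category with its rank in that order
mapping = {category: rank for rank, category in enumerate(ordered)}
df["condition_encoded"] = df["condition"].map(mapping)

print(mapping)
```

In a real project the mapping would normally be learned on the training split only and then applied to validation and test data, since it is derived from the target.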

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

In [None]:
"""
Covariance measures how two variables change together. A positive covariance means they tend to increase or decrease
together, while a negative covariance means one tends to increase when the other decreases. A covariance near zero
indicates no linear relationship. It is important in statistical analysis because it underlies correlation, regression
coefficients, portfolio risk estimates, and multivariate methods such as PCA.

Calculation:
For n paired observations (x_i, y_i), the sample covariance is

    cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1)

that is, the average product of each point's deviations from the two means (dividing by n instead gives the
population covariance). Its magnitude depends on the units of both variables, which makes it hard to interpret,
so the correlation coefficient, cov(X, Y) / (std(X) * std(Y)), is often used instead as a standardized measure
of the linear relationship between the variables.
"""
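As a quick check, the sample covariance (with the n - 1 denominator) can be computed by hand and compared with NumPy. The arrays `x` and `y` are arbitrary illustrative values.

```python
import numpy as np

# Arbitrary example values, not taken from any dataset
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

# Dividing by both sample standard deviations gives the correlation coefficient
corr = cov_manual / (x.std(ddof=1) * y.std(ddof=1))

print(cov_manual, cov_numpy, corr)
```

Both covariance values agree, and the correlation always falls between -1 and 1 regardless of the units of `x` and `y`.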

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.


In [2]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd


# Sample Data
data={'Color':['red','green','blue'],
      'Size':['small','medium','large'],
      'Material':['wood','metal','plastic']}

# Dataframe
df=pd.DataFrame(data)

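# Use one LabelEncoder per column so each fitted mapping stays available via .classes_
color_encoder = LabelEncoder()
size_encoder = LabelEncoder()
material_encoder = LabelEncoder()

# Perform label encoding; classes are numbered in alphabetical order
df['color_encoded'] = color_encoder.fit_transform(df['Color'])           # blue=0, green=1, red=2
df['size_encoded'] = size_encoder.fit_transform(df['Size'])              # large=0, medium=1, small=2
df['material_encoded'] = material_encoder.fit_transform(df['Material'])  # metal=0, plastic=1, wood=2

# Note: the Size codes follow the alphabet, not the natural order small < medium < large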

df.head()

   Color    Size Material  color_encoded  size_encoded  material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.


In [14]:
import numpy as np

# Example data (random placeholders, so the numbers change on every run; replace with your actual data)
age = np.random.randint(20, 70, size=100)
income = np.random.randint(20000, 100000, size=100)
education_level = np.random.randint(1, 5, size=100)  # Assuming ordinal values

# Create a data matrix
data = np.vstack((age, income, education_level)).T

# Calculate the covariance matrix
cov_matrix = np.cov(data, rowvar=False)

print("Covariance Matrix:")
print(cov_matrix)

Covariance Matrix:
[[2.30474747e+02 4.15593061e+04 1.51515152e-02]
 [4.15593061e+04 5.26033151e+08 4.62580808e+01]
 [1.51515152e-02 4.62580808e+01 1.30050505e+00]]
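To interpret the matrix: the diagonal holds the variances (about 230 for age, 5.26e8 for income, 1.30 for education level), and the off-diagonal entries are the pairwise covariances. All three are positive, but their sizes mostly reflect the very different scales of the variables. One way to compare them is to rescale the covariance matrix into a correlation matrix. The sketch below does this for the values printed above; because the inputs were generated independently at random, any relationship it shows is sampling noise.

```python
import numpy as np

# Covariance matrix copied from the printed output above
# (row/column order: age, income, education level)
cov_matrix = np.array([
    [2.30474747e+02, 4.15593061e+04, 1.51515152e-02],
    [4.15593061e+04, 5.26033151e+08, 4.62580808e+01],
    [1.51515152e-02, 4.62580808e+01, 1.30050505e+00],
])

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix))

# Divide each covariance by the product of the two standard deviations
corr_matrix = cov_matrix / np.outer(std, std)

print(np.round(corr_matrix, 3))
```

On this scale the age-income correlation is roughly 0.12 and the other two are close to zero, so none of the pairs shows a meaningful linear relationship.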


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?


In [None]:
"""
1-Gender (Male/Female):
For binary categorical variables like "Gender," we can use Label Encoding. Since there are only two categories,
we can encode them as 0 and 1 (or vice versa). There's no inherent order between male and female, so label
encoding is appropriate.


2-Education Level (High School/Bachelor's/Master's/PhD):
We can use Ordinal Encoding. The education levels have a clear order from "High School" to "PhD," and ordinal 
encoding will capture this order effectively. Assigning numerical values based on the educational hierarchy 
will maintain the meaningful relationship between the categories.


3-Employment Status (Unemployed/Part-Time/Full-Time):
For categorical variables without a clear-cut order, like "Employment Status," it's safer to use One-Hot Encoding.
One-hot encoding creates a binary column for each category, representing the presence or absence of that category.
"Unemployed," "Part-Time," and "Full-Time" could be ranked by hours worked, but the gaps between them are not
comparable, so one-hot encoding prevents the model from assuming an ordinal relationship that may not hold.
"""
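The three choices above might look like this in code. This is a minimal sketch: the three sample rows and the new column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows, for illustration only
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

# Gender: binary, so a single 0/1 column is enough (Female=0, Male=1)
df["Gender_encoded"] = (df["Gender"] == "Male").astype(int)

# Education Level: ordinal, so pass the order explicitly
edu_order = [["High School", "Bachelor's", "Master's", "PhD"]]
edu_encoder = OrdinalEncoder(categories=edu_order)
df["Education_encoded"] = edu_encoder.fit_transform(df[["Education Level"]]).ravel()

# Employment Status: one binary column per category
employment_dummies = pd.get_dummies(df["Employment Status"], prefix="Employment", dtype=int)

encoded = pd.concat([df[["Gender_encoded", "Education_encoded"]], employment_dummies], axis=1)
print(encoded)
```

Passing the category order to OrdinalEncoder keeps the codes stable even when a level such as "Bachelor's" is missing from the sample.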

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [12]:
# Import libraries
import pandas as pd
import numpy as np

data={'Temperature': np.random.randint(20,35,50),
      'Humidity':np.random.randint(35,55,50),
      'Weather Condition':np.random.choice(a=['Sunny','Cloudy','Rainy'],size=50,p=[0.5, 0.3, 0.2]),
      'Wind Direction':np.random.choice(a=['North','South','East','West'],size=50,p=[0.3, 0.3, 0.2,0.2])}


df=pd.DataFrame(data)

# Calculate covariance between Temperature and Humidity
covariance_temp_humidity = df['Temperature'].cov(df['Humidity'])

# Perform a cross-tabulation between Weather Condition and Wind Direction
cross_tab = pd.crosstab(df['Weather Condition'], df['Wind Direction'])

print("Covariance between Temperature and Humidity:", covariance_temp_humidity)
print()
print("\nCross-Tabulation between Weather Condition and Wind Direction:")
print()
print(cross_tab)

Covariance between Temperature and Humidity: 1.9338775510204076


Cross-Tabulation between Weather Condition and Wind Direction:

Wind Direction     East  North  South  West
Weather Condition                          
Cloudy                4      6      4     4
Rainy                 3      1      4     4
Sunny                 4      7      4     5
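
Covariance is only defined for numeric variables, which is why the categorical pair is summarised with a cross-tabulation above. If a covariance for every pair is still wanted, one possible approach is to one-hot encode the categorical columns first and then call `DataFrame.cov()`. The sketch below assumes that approach; the seed and the uniform category probabilities are arbitrary choices, so its numbers will not match the output above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "Temperature": rng.integers(20, 35, 50),
    "Humidity": rng.integers(35, 55, 50),
    "Weather Condition": rng.choice(["Sunny", "Cloudy", "Rainy"], size=50),
    "Wind Direction": rng.choice(["North", "South", "East", "West"], size=50),
})

# One indicator column (0.0 / 1.0) per category
encoded = pd.get_dummies(
    df, columns=["Weather Condition", "Wind Direction"], dtype=float
)

# Pairwise sample covariances between all numeric and indicator columns
cov_matrix = encoded.cov()
print(cov_matrix.round(2))
```

Each entry involving an indicator column describes how the presence of that category varies with the other column. For two categorical variables, a cross-tabulation or a chi-square test is usually easier to interpret than these covariances.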
