In [None]:
#Q1

In [None]:
# Difference Between Ordinal Encoding and Label Encoding
# Ordinal Encoding and Label Encoding are techniques used to convert categorical data into numerical data. While they may seem similar, they serve different purposes and are used in different contexts.

# Ordinal Encoding
# Ordinal Encoding is used for categorical data that has a meaningful order or ranking. It assigns integer values to categories based on their order. This encoding method is useful when the categorical variable has a clear, ordered relationship among its categories.

# Example
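# Consider a survey that asks respondents to rate their satisfaction on a scale: 'Poor', 'Fair', 'Good', 'Very Good', 'Excellent'. A natural encoding is Poor = 1, Fair = 2, Good = 3, Very Good = 4, Excellent = 5, which preserves the ranking.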

# Label Encoding
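# Label Encoding assigns each category a unique integer without regard to any order; the integers are arbitrary identifiers (scikit-learn's LabelEncoder assigns them in sorted order of the category names). It is typically applied to nominal data, where the categories have no meaningful order. Because the integers still look ordered to a model, label-encoded nominal features suit target labels or tree-based models better than linear models, where one-hot encoding is usually the safer choice.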

# Example
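# Consider a dataset of animals with species names: 'Cat', 'Dog', 'Rabbit', 'Fish'. One possible encoding is Cat = 0, Dog = 1, Fish = 2, Rabbit = 3; the numbers identify the species but carry no ranking.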
    
# When to Choose One Over the Other

# When to Choose Ordinal Encoding
# Ordered Categories: Use ordinal encoding when the categorical variable has an inherent order or ranking.
# Example: Survey responses (e.g., 'Poor', 'Fair', 'Good', 'Very Good', 'Excellent'), educational levels (e.g., 'High School', 'Bachelor's', 'Master's', 'PhD').

# When to Choose Label Encoding
# Unordered Categories: Use label encoding when the categorical variable is nominal and does not have a meaningful order.
# Example: Species names (e.g., 'Cat', 'Dog', 'Rabbit', 'Fish'), colors (e.g., 'Red', 'Blue', 'Green').
# Practical Example of Choosing One Over the Other
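# A minimal sketch of the choice, using pandas only. The sample values and variable names (satisfaction, species, and so on) are made up for illustration; scikit-learn's OrdinalEncoder and LabelEncoder would be the usual alternatives.

```python
import pandas as pd

# Ordered variable: spell out the ranking explicitly, lowest to highest.
satisfaction = pd.Series(['Good', 'Poor', 'Excellent', 'Fair', 'Very Good'])
satisfaction_order = ['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']
satisfaction_encoded = pd.Series(
    pd.Categorical(satisfaction, categories=satisfaction_order, ordered=True).codes
)

# Unordered variable: any unique integer per category will do.
# pandas numbers the categories in sorted (alphabetical) order here.
species = pd.Series(['Cat', 'Dog', 'Rabbit', 'Fish', 'Dog'])
species_encoded = species.astype('category').cat.codes

print(satisfaction_encoded.tolist())  # codes follow the stated ranking
print(species_encoded.tolist())       # codes are identifiers only
```

# The satisfaction codes can be compared (a higher code means a better rating); the species codes cannot.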

In [None]:
#Q2

In [None]:
# Target Guided Ordinal Encoding
# Target Guided Ordinal Encoding is a technique used to encode categorical variables by sorting the categories according to the mean (or median) of the target variable. This method creates an ordinal relationship between the categories based on their relationship with the target variable, which can be particularly useful in supervised learning tasks.

# Steps in Target Guided Ordinal Encoding
# 1. Calculate the Mean (or Median) of the Target Variable for Each Category: For each category in the categorical variable, compute the mean (or median) of the target variable.
# 2. Sort the Categories: Sort the categories based on the computed means (or medians).
# 3. Assign Ordinal Values: Assign ordinal values to the categories based on the sorted order.
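# When to Use Target Guided Ordinal Encoding
# Predictive Power: When the categorical variable has a significant impact on the target variable, and you want to capture this relationship in the encoding.
# High Cardinality: When dealing with high cardinality categorical variables where one-hot encoding would be impractical due to the large number of categories.
# Caution: Because the encoding is derived from the target, compute the mapping on the training data only and reuse it on validation and test data; otherwise target information leaks into the features.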
# Example Scenario
# Let's consider a scenario where you are working on a machine learning project to predict house prices. You have a categorical variable neighborhood and a continuous target variable price. 
# You want to encode the neighborhood variable using target guided ordinal encoding to capture the relationship between the neighborhood and the house prices.

In [1]:
import pandas as pd

# Sample dataset
data = {
    'neighborhood': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'B', 'A'],
    'price': [200000, 150000, 300000, 250000, 120000, 350000, 220000, 330000, 130000, 240000]
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)
# Calculate the mean price for each neighborhood
mean_price = df.groupby('neighborhood')['price'].mean()
print("\nMean Price by Neighborhood:")
print(mean_price)
# Sort neighborhoods by mean price
sorted_neighborhoods = mean_price.sort_values().index
neighborhood_mapping = {neighborhood: idx for idx, neighborhood in enumerate(sorted_neighborhoods, 1)}

print("\nNeighborhood Mapping:")
print(neighborhood_mapping)
# Apply the mapping to encode the neighborhood variable
df['neighborhood_encoded'] = df['neighborhood'].map(neighborhood_mapping)

print("\nEncoded Data:")
print(df)


Original Data:
  neighborhood   price
0            A  200000
1            B  150000
2            C  300000
3            A  250000
4            B  120000
5            C  350000
6            A  220000
7            C  330000
8            B  130000
9            A  240000

Mean Price by Neighborhood:
neighborhood
A    227500.000000
B    133333.333333
C    326666.666667
Name: price, dtype: float64

Neighborhood Mapping:
{'B': 1, 'A': 2, 'C': 3}

Encoded Data:
  neighborhood   price  neighborhood_encoded
0            A  200000                     2
1            B  150000                     1
2            C  300000                     3
3            A  250000                     2
4            B  120000                     1
5            C  350000                     3
6            A  220000                     2
7            C  330000                     3
8            B  130000                     1
9            A  240000                     2


In [None]:
#Q3

In [None]:
# Covariance is a statistical measure of how two random variables vary together. If larger values of one variable tend to occur with larger values of the other, and smaller values with smaller values, the covariance is positive. If larger values of one variable tend to occur with smaller values of the other, the covariance is negative. The sign gives the direction of the linear relationship; the magnitude depends on the units of the variables, so covariance is not a standardized measure of strength.

# Importance of Covariance in Statistical Analysis
# Understanding Relationships: Covariance helps in understanding the relationship between two variables. A positive covariance indicates that the variables tend to increase together, while a negative covariance indicates that as one variable increases, the other tends to decrease.
# Multivariate Data Analysis: Covariance is a fundamental concept in multivariate statistics. For example, Principal Component Analysis (PCA) reduces the dimensionality of data by projecting it onto the eigenvectors of the covariance matrix, which yields a new set of uncorrelated variables.
# Portfolio Management: In finance, covariance is used to measure how the returns of two stocks move together. This information is critical for portfolio diversification and risk management.
# Machine Learning: In machine learning, understanding the covariance between features can help in feature selection and in understanding feature interactions.
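# The definition above corresponds to the sample covariance formula, cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1). A minimal sketch that evaluates it directly on a small made-up sample; the name manual_cov is introduced here for illustration.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10])

# Sample covariance: sum of the products of deviations from the means,
# divided by n - 1 (the same default normalization np.cov uses).
manual_cov = np.sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)

print("Manual covariance:", manual_cov)
```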

In [2]:
import numpy as np

# Define the data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10])

# Calculate the covariance matrix
cov_matrix = np.cov(X, Y)

# Extract the covariance value
covariance = cov_matrix[0, 1]

print("Covariance:", covariance)


Covariance: 5.0


In [None]:
#Q4

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample dataset
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}
df = pd.DataFrame(data)

# Fit a separate LabelEncoder per column so each fitted mapping stays available
# (for example, for inverse_transform). LabelEncoder numbers categories in sorted
# (alphabetical) order, so for Size: 'large' -> 0, 'medium' -> 1, 'small' -> 2.
encoders = {}
for column in ['Color', 'Size', 'Material']:
    encoders[column] = LabelEncoder()
    df[f'{column}_encoded'] = encoders[column].fit_transform(df[column])

# Display the original and encoded data
print("Original Data:")
print(df[['Color', 'Size', 'Material']])
print("\nLabel Encoded Data:")
print(df[['Color_encoded', 'Size_encoded', 'Material_encoded']])


Original Data:
   Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic
3  green  medium    metal
4    red   small     wood

Label Encoded Data:
   Color_encoded  Size_encoded  Material_encoded
0              2             2                 2
1              1             1                 0
2              0             0                 1
3              1             1                 0
4              2             2                 2


In [None]:
#Q5

In [5]:
import pandas as pd
import numpy as np

# Create a sample dataset
data = {
    'Age': [25, 30, 45, 35, 50],
    'Income': [50000, 60000, 80000, 75000, 90000],
    'Education_Level': [12, 14, 16, 15, 18]  # Assuming education level as years of education
}
df = pd.DataFrame(data)

# Display the dataset
print("Sample Dataset:")
print(df)



Sample Dataset:
   Age  Income  Education_Level
0   25   50000               12
1   30   60000               14
2   45   80000               16
3   35   75000               15
4   50   90000               18


In [6]:
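# Calculate the covariance matrix. np.cov treats each row as a variable, hence the
# transpose; rows and columns follow the order Age, Income, Education_Level.
cov_matrix = np.cov(df.T)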

# Display the covariance matrix
print("Covariance Matrix:")
print(cov_matrix)


Covariance Matrix:
[[1.075e+02 1.600e+05 2.250e+01]
 [1.600e+05 2.550e+08 3.500e+04]
 [2.250e+01 3.500e+04 5.000e+00]]


In [None]:
# The covariance matrix provides insights into the relationships between the variables:

# Positive Covariance: Every off-diagonal entry is positive, so Age, Income, and Education Level tend to increase together in this sample. The diagonal entries are the variances of each variable (107.5, 2.55e8, and 5.0).
# Magnitude of Covariance: The magnitude depends on the units of the variables, so it cannot be compared across pairs as a measure of strength. The entries involving Income are the largest (1.6e5 with Age, 3.5e4 with Education Level) mainly because Income is measured on a much larger scale. To compare the strength of the relationships, standardize the covariance into a correlation coefficient.
# These insights can help in understanding the underlying relationships in the data and guide further analysis or model building.
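# To compare the strength of the relationships on a common scale, the covariances can be standardized into Pearson correlations. A minimal sketch on the same sample values; the names df_demo and corr_matrix are introduced here for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25, 30, 45, 35, 50],
    'Income': [50000, 60000, 80000, 75000, 90000],
    'Education_Level': [12, 14, 16, 15, 18],
})

# Pearson correlation divides each covariance by the two standard deviations,
# so every entry lies in [-1, 1] regardless of the original units.
corr_matrix = df_demo.corr()

print(corr_matrix.round(3))
```

# On this sample all three pairwise correlations come out above 0.96, so the three relationships are of similar strength despite the very different covariance magnitudes.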

In [None]:
#Q6

In [None]:
# Gender: One-hot encoding to avoid implying any ordinal relationship between Male and Female.
# Education Level: Ordinal encoding to capture the intrinsic order in educational qualifications.
# Employment Status: One-hot encoding to treat each employment category as distinct and unrelated.
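# The one-hot choices for Gender and Employment Status can be sketched with pandas.get_dummies. The sample rows and the category values ('Employed', 'Unemployed', 'Student') are assumptions for illustration, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
people = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Employment Status': ['Employed', 'Unemployed', 'Student', 'Employed'],
})

# One 0/1 indicator column per category; no order is implied between categories.
one_hot = pd.get_dummies(people, columns=['Gender', 'Employment Status'], dtype=int)

print(one_hot)
```

# Each row has exactly one 1 among the Gender columns and one among the Employment Status columns.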

In [7]:
# Explicit mapping from lowest to highest qualification
education_mapping = {
    'High School': 1,
    "Bachelor's": 2,
    "Master's": 3,
    'PhD': 4
}

df = pd.DataFrame({
    'Education Level': ["Bachelor's", 'PhD', "Master's", 'High School', "Bachelor's"]
})

df['Education_Level_Encoded'] = df['Education Level'].map(education_mapping)
print(df)


  Education Level  Education_Level_Encoded
0      Bachelor's                        2
1             PhD                        4
2        Master's                        3
3     High School                        1
4      Bachelor's                        2
