In [None]:
"""Q.1
Ordinal Encoding and Label Encoding are both techniques used to convert categorical data into numerical format, but they are applied under different circumstances and have distinct characteristics:
Use Case
    Ordinal Encoding: Appropriate when there's a meaningful order or hierarchy among the categories.
    Label Encoding: Suitable when there's no inherent order or hierarchy among the categories. Typically used for binary or nominal variables.
Encoding Method
    Ordinal Encoding: Assigns integer labels based on the order or rank of categories. Lower integers represent lower ranks, higher integers represent higher ranks.
    Label Encoding: Assigns unique integer labels to each category without considering order or ranking. Each category is treated as a separate entity.
Example
    Ordinal Encoding: Encoding education levels: "High School" < "Associate's" < "Bachelor's" < "Master's" < "Ph.D."
    Label Encoding: Encoding gender: "Male" (0) and "Female" (1).
Preserves Order
    Ordinal Encoding: Preserves the meaningful order or hierarchy among categories.
    Label Encoding: Does not preserve order or ranking; treats categories as separate entities.
Application
    Ordinal Encoding: Typically used for variables with ordinal data, like education levels, satisfaction ratings, or temperature levels.
    Label Encoding: Useful for encoding nominal variables or binary categories, where there's no natural order.
Potential Issues
    Ordinal Encoding: May introduce ordinal assumptions that do not exist in the data.
    Label Encoding: May not be suitable for variables with an implicit order, leading to loss of information.
Implementation in Python
    Ordinal Encoding: Can be implemented using custom mappings or libraries like scikit-learn's OrdinalEncoder.
    Label Encoding: Implemented using libraries like scikit-learn's LabelEncoder.

When to Choose One Over the Other:
Choose Ordinal Encoding when you have categorical data with a meaningful order or hierarchy, and preserving this order is important for the analysis or machine learning task. For example, when encoding education levels like "High School" < "Associate's" < "Bachelor's" < "Master's" < "Ph.D."
Choose Label Encoding when the categorical data lacks a clear order, and you want to convert categories into numerical values while treating them as distinct entities. For example, when encoding binary variables like "Yes" or "No," or when encoding nominal variables with no inherent ranking.
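
The difference between the two encodings can be sketched without any library. In the minimal example below, `label_encode` and `ordinal_encode` are hypothetical helper names, not scikit-learn functions. The first numbers categories in sorted order (one common convention for label encoding), while the second follows a ranking supplied by the caller.

```python
def label_encode(values):
    """Assign an integer to each distinct value, numbering them in sorted order."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def ordinal_encode(values, order):
    """Assign integers that follow the caller-supplied ranking in `order`."""
    mapping = {cat: code for code, cat in enumerate(order)}
    return [mapping[v] for v in values], mapping


education = ["Bachelor's", "High School", "Ph.D.", "Master's", "Associate's"]
# Explicit ranking, lowest to highest, taken from the example above
education_order = ["High School", "Associate's", "Bachelor's", "Master's", "Ph.D."]

label_codes, label_map = label_encode(education)
ordinal_codes, ordinal_map = ordinal_encode(education, education_order)

print(label_codes)    # codes follow alphabetical order of the labels
print(ordinal_codes)  # codes follow the education hierarchy
```

With the sorted numbering, "High School" receives a larger code than "Bachelor's", which is why this style of encoding should not be read as a ranking.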

In [None]:
"""Q.2
Target Guided Ordinal Encoding is a technique used to encode categorical variables by considering the relationship between the category and the target variable in a supervised machine learning setting. It assigns ordinal values to categories based on the likelihood or probability of a particular category resulting in a specific target outcome. This technique is particularly useful when you have categorical features with a large number of categories and you want to capture the impact of each category on the target variable in a more meaningful way.
Here's how Target Guided Ordinal Encoding works:
1. Calculate Target Statistics: For each unique category within the categorical feature, calculate summary statistics based on the target variable. Common statistics include the mean, median, or mode of the target variable for each category.
2. Order Categories: Sort the categories based on their calculated statistics. Categories with higher mean (or other chosen statistics) of the target variable will be assigned higher values, indicating a stronger relationship with the target.
3. Encode Categories: Replace the original categorical values with the ordered numeric values based on their calculated statistics. The resulting ordinal values can then be used as features in your machine learning model.
4. Handle Unseen Categories: If a category appears in the test dataset but was not present in the training dataset, it has no learned mapping. You can handle it by assigning a default value or using a suitable imputation technique. Compute the statistics on the training data only, so that target values from the test data do not leak into the encoding.
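
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names `fit_target_guided_ordinal` and `transform`, the categories "A" to "D", the alphabetical tie-break, and the default code 0 for unseen categories are all hypothetical choices made for the example.

```python
from collections import defaultdict


def fit_target_guided_ordinal(categories, targets):
    """Return {category: rank}, ranking categories by mean target (1 = lowest mean)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        totals[cat] += y
        counts[cat] += 1
    # Step 1: mean of the target for each category
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    # Step 2: order categories by that mean (ties broken alphabetically, an assumption)
    ordered = sorted(means, key=lambda cat: (means[cat], cat))
    # Step 3: assign consecutive integer ranks
    return {cat: rank for rank, cat in enumerate(ordered, start=1)}


def transform(categories, mapping, default=0):
    """Step 4: encode values, sending categories unseen during fitting to `default`."""
    return [mapping.get(cat, default) for cat in categories]


train_categories = ["A", "B", "C", "A", "B", "C", "C"]
train_targets = [1, 0, 0, 1, 1, 0, 1]

mapping = fit_target_guided_ordinal(train_categories, train_targets)
encoded_test = transform(["A", "C", "D"], mapping)  # "D" was never seen in training

print(mapping)
print(encoded_test)
```

The mapping is fitted on training data only and then reused for new data, which keeps test targets out of the encoding.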

Here's an example of when you might use Target Guided Ordinal Encoding:
Scenario: Predicting Customer Churn in a Telecom Company
In a machine learning project aimed at predicting customer churn in a telecom company, you have a categorical feature called "Contract Type" that represents the type of contract each customer has. The possible values for this feature are "Month-to-Month," "One Year," and "Two Year."
You know that the contract type is likely to have a significant impact on customer churn, and there's a meaningful hierarchy: "Two Year" contracts are likely to have a lower churn rate than "One Year" contracts, which, in turn, are likely to have a lower churn rate than "Month-to-Month" contracts.
To leverage this ordinal relationship between contract types and churn rates, you can use Target Guided Ordinal Encoding:

In [75]:
import pandas as pd
data=pd.DataFrame({
    'Contract Type':['Month-to-Month','One Year','Two Year','Month-to-Month','Two Year'],
    'Churn':[1,0,1,0,0]
})
mean_churn_rate=data.groupby('Contract Type')['Churn'].mean()  # Calculate mean churn rate for each contract type
data['Contract Type_encoded']=data['Contract Type'].map(mean_churn_rate) # Map each contract type to its mean churn rate; ranking these means would give integer ordinal codes
data  

Unnamed: 0,Contract Type,Churn,Contract Type_encoded
0,Month-to-Month,1,0.5
1,One Year,0,0.0
2,Two Year,1,0.5
3,Month-to-Month,0,0.5
4,Two Year,0,0.5


In [None]:
"""Q.3
Covariance is a statistical measure that quantifies the degree to which two random variables change together. In other words, it measures the joint variability of two variables. Specifically, covariance indicates whether an increase in one variable corresponds to an increase, decrease, or no change in another variable.
Here's a simplified explanation of covariance:
* If the covariance between two variables is positive, it means that when one variable tends to be above its mean, the other variable also tends to be above its mean. In other words, they move in the same direction.
* If the covariance is negative, it indicates that when one variable tends to be above its mean, the other variable tends to be below its mean. They move in opposite directions.
* If the covariance is close to zero, it suggests that there is little to no linear relationship between the two variables. Because the magnitude of covariance depends on the units of the variables, its size alone does not show how strong the relationship is; the correlation coefficient, which rescales covariance to the range -1 to 1, is used for that.
Covariance is an important concept in statistical analysis as:
1.Relationship Assessment: Covariance helps assess whether there is a linear relationship between two variables. It provides an initial indication of whether changes in one variable might be associated with changes in another.
2.Portfolio Diversification: In finance, covariance is used to assess the diversification benefits of adding different assets to an investment portfolio. Negative covariances between assets can reduce overall portfolio risk.
3.Machine Learning: Covariance can be used in machine learning algorithms, such as principal component analysis (PCA) and linear regression, to understand and model relationships between features or variables.
4.Multivariate Analysis: In multivariate statistics, covariance matrices are used to analyze the relationships between multiple variables simultaneously. This is important in fields like economics, social sciences, and biology.
The formula to calculate the sample covariance between two variables X and Y is as follows:
Cov(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
Where:
Cov(X, Y) is the covariance between X and Y.
X_i and Y_i are the individual data points.
\bar{X} and \bar{Y} are the means (average values) of X and Y, respectively.
n is the number of data points.
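
The sample covariance formula can be checked by hand with a few lines of plain Python. This is a minimal sketch: `sample_covariance` is a hypothetical helper name, and the small data lists are made up for illustration.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations from the mean, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


x = [1, 2, 3, 4]
y_up = [2, 4, 6, 8]    # rises with x, so the covariance is positive
y_down = [8, 6, 4, 2]  # falls as x rises, so the covariance is negative

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
print(cov_up, cov_down)
```

The two results have the same magnitude and opposite signs, matching the description of positive and negative covariance above.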

In [5]:
#Q.4
import pandas as pd
data=pd.DataFrame({            # Create a DataFrame with categorical variables
     'Color':['red', 'green', 'blue'], 
     'Size':['small', 'medium','large'],
     'Material': ['wood', 'metal', 'plastic']})
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()  # Initialize LabelEncoder
data['Color_encoded'] = encoder.fit_transform(data['Color'])  # Encode each categorical column separately
data['Size_encoded'] = encoder.fit_transform(data['Size'])
data['Material_encoded'] = encoder.fit_transform(data['Material'])
print(data) # Display the encoded DataFrame

   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1


In [None]:
"""Explanation of the output:
The original DataFrame data contains three categorical columns: 'Color,' 'Size,' and 'Material.'
We use LabelEncoder to encode each of these columns individually, creating three new columns in the DataFrame: 'Color_encoded,' 'Size_encoded,' and 'Material_encoded.'
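The fit_transform method of LabelEncoder assigns a unique integer to each unique category in each column, numbering the categories in sorted (alphabetical) order. For example:
In the 'Color' column, 'red' is encoded as 2, 'green' as 1, and 'blue' as 0.
In the 'Size' column, 'small' is encoded as 2, 'medium' as 1, and 'large' as 0.
In the 'Material' column, 'wood' is encoded as 2, 'metal' as 0, and 'plastic' as 1.
Because the integers follow alphabetical order, the codes for 'Size' do not reflect the natural small < medium < large order; an ordinal encoder with an explicit category order would be needed to preserve it.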

In [17]:
#Q.5
import numpy as np
data=np.array([   # Create a hypothetical dataset with Age, Income, and Education level
    [30, 50000, 12],
    [35, 60000, 16],
    [28, 45000, 14],
    [40, 70000, 18],
    [32, 55000, 15]
])
# Calculate the covariance matrix using NumPy
cov_matrix = np.cov(data,rowvar=False)
print(cov_matrix)

[[2.200e+01 4.500e+04 9.250e+00]
 [4.500e+04 9.250e+07 1.875e+04]
 [9.250e+00 1.875e+04 5.000e+00]]


In [None]:
"""Interpretation:
Age vs. Age (variance): The diagonal element (22) represents the variance of the Age variable, indicating how much Age varies from its mean. In this case, Age has a variance of 22.
Income vs. Income (variance): The diagonal element (9.25e+07) represents the variance of the Income variable. It is far larger than the variance of Age mainly because Income is measured in much larger units, so the two variances are not directly comparable.
Education level vs. Education level (variance): The diagonal element (5) represents the variance of the Education level variable, indicating the variance in education levels in the dataset.
Age vs. Income (covariance): The off-diagonal element (4.5e+04) represents the covariance between Age and Income. A positive covariance suggests that as Age tends to increase, Income also tends to increase.
Age vs. Education level (covariance): The off-diagonal element (9.25) represents the covariance between Age and Education level. A positive covariance indicates that there is a positive relationship between Age and Education level, suggesting that, on average, older individuals tend to have higher education levels.
Income vs. Education level (covariance): The off-diagonal element (1.875e+04) represents the covariance between Income and Education level. A positive covariance suggests that, on average, higher incomes tend to be associated with higher education levels.
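
Since the covariances above are on very different scales, one way to compare them is to rescale each entry by the standard deviations of the two variables, which gives the correlation matrix. The sketch below does this in plain Python using the values printed in the covariance matrix above; `correlation_from_covariance` is a hypothetical helper name.

```python
import math

names = ["Age", "Income", "Education"]
# Values copied from the covariance matrix printed above
cov_matrix = [
    [22.0, 45000.0, 9.25],
    [45000.0, 9.25e7, 18750.0],
    [9.25, 18750.0, 5.0],
]


def correlation_from_covariance(cov):
    """Divide each covariance by the product of the two standard deviations."""
    size = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(size)] for i in range(size)]


corr = correlation_from_covariance(cov_matrix)
for name, row in zip(names, corr):
    print(name, [round(value, 3) for value in row])
```

After rescaling, all three pairs show a strong positive correlation, even though their raw covariances differ by several orders of magnitude.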

In [37]:
#Q.6
import pandas as pd
data=pd.DataFrame({
    "Gender":["Male","Female","Female","Male"], 
    "Education Level": ["High School","Bachelor's","Masters","PhD"],
    "Employment Status" :["Unemployed","Part-Time","Full-Time","Full-Time"]
})
data

Unnamed: 0,Gender,Education Level,Employment Status
0,Male,High School,Unemployed
1,Female,Bachelor's,Part-Time
2,Female,Masters,Full-Time
3,Male,PhD,Full-Time


In [38]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,OrdinalEncoder
lab_encoder=LabelEncoder()
ohe_encoder=OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
ord_encoder=OrdinalEncoder(categories=[["High School","Bachelor's","Masters","PhD"]])
data['Gender_encoded']=lab_encoder.fit_transform(data["Gender"])
data['Education Level_encoded']=ord_encoder.fit_transform(data[['Education Level']])
employment_encoded=ohe_encoder.fit_transform(data[['Employment Status']])
employment_encoded_df = pd.DataFrame(employment_encoded,columns=ohe_encoder.get_feature_names_out(['Employment Status']))
data = pd.concat([data, employment_encoded_df], axis=1)
data



Unnamed: 0,Gender,Education Level,Employment Status,Gender_encoded,Education Level_encoded,Employment Status_Full-Time,Employment Status_Part-Time,Employment Status_Unemployed
0,Male,High School,Unemployed,1,0.0,0.0,0.0,1.0
1,Female,Bachelor's,Part-Time,0,1.0,0.0,1.0,0.0
2,Female,Masters,Full-Time,0,2.0,1.0,0.0,0.0
3,Male,PhD,Full-Time,1,3.0,1.0,0.0,0.0


In [None]:
""" Encoding used for:
*Gender (Binary Categorical): Label Encoding is appropriate for "Gender" because, in this dataset, it is a binary categorical variable with only two categories (Male/Female). Label Encoding assigns one integer to each category, and with only two values a single 0/1 column does not impose a misleading order. LabelEncoder numbers the categories alphabetically, so here Female is encoded as 0 and Male as 1.
*Education Level (Ordinal Categorical): Ordinal Encoding is suitable for "Education Level" because it's an ordinal categorical variable with multiple categories that have a clear order or hierarchy (e.g., High School, Bachelor's, Master's, PhD). Ordinal Encoding captures the ordinal relationships while preserving the order. You can assign integers according to the educational hierarchy.
*Employment Status (Nominal Categorical): One-Hot Encoding (OHE) is appropriate for "Employment Status" because it's a nominal categorical variable with multiple categories (e.g., Unemployed, Part-Time, Full-Time) without a clear ordinal relationship. OHE creates binary columns for each category, effectively treating each category as independent. This allows the model to consider all possible categories without imposing any ordinal assumptions.
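
To make the one-hot idea concrete, here is a minimal library-free sketch. `one_hot_encode` is a hypothetical helper, not the scikit-learn API; it orders the columns by sorted category name, which happens to match the column order in the output above.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


employment = ["Unemployed", "Part-Time", "Full-Time", "Full-Time"]
categories, rows = one_hot_encode(employment)

print(categories)
for value, row in zip(employment, rows):
    print(value, row)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.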

In [72]:
#Q.7
import numpy as np
# Considering only continuous variables to find the covariance between them.
temperature = [20, 22, 25, 18, 23]
humidity = [40, 45, 50, 35, 48]
# Calculate the covariance
covariance_matrix = np.cov(temperature, humidity)
# The covariance between Temperature and Humidity
print(covariance_matrix)
print("Covariance between Temperature and Humidity:", covariance_matrix[0, 1])

[[ 7.3 16.3]
 [16.3 37.3]]
Covariance between Temperature and Humidity: 16.3


In [None]:
"""Interpretation:
1.Covariance Matrix:
The element at row 1, column 1 (7.3) represents the variance of the "Temperature" variable. It indicates how much the "Temperature" values vary from their mean within the dataset.
The element at row 2, column 2 (37.3) represents the variance of the "Humidity" variable. It indicates the variance in humidity levels within the dataset.
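The element at row 1, column 2 (16.3) represents the covariance between "Temperature" and "Humidity." A positive covariance suggests that, on average, when "Temperature" is above its mean, "Humidity" tends to be above its mean, and vice versa. The magnitude (16.3) depends on the units of both variables, so on its own it does not measure the strength of the relationship. Dividing by the product of the standard deviations, 16.3 / sqrt(7.3 * 37.3), gives a correlation of about 0.99, which indicates a strong positive linear relationship in this small sample.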
The element at row 2, column 1 (16.3) is the same covariance as above since covariance matrices are symmetric.
2.Covariance between Temperature and Humidity (covariance_matrix[0, 1]):
The value 16.3 is the off-diagonal element described above: the covariance between "Temperature" and "Humidity." Its positive sign shows that the two variables tend to move in the same direction; its magnitude depends on the units and should not be read as the strength of the relationship.
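
The unit dependence of covariance can be illustrated with the same data. The sketch below assumes the temperatures are in degrees Celsius (the original does not state the unit) and converts them to Fahrenheit; `sample_covariance` and `correlation` are hypothetical helper names written for this example.

```python
import math


def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


temperature_c = [20, 22, 25, 18, 23]  # assumed to be degrees Celsius
humidity = [40, 45, 50, 35, 48]
temperature_f = [c * 9 / 5 + 32 for c in temperature_c]  # same readings in Fahrenheit

cov_c = sample_covariance(temperature_c, humidity)
cov_f = sample_covariance(temperature_f, humidity)
corr_c = correlation(temperature_c, humidity)
corr_f = correlation(temperature_f, humidity)

print(cov_c, cov_f)    # covariance changes with the unit
print(corr_c, corr_f)  # correlation does not
```

Changing the unit multiplies the covariance by 1.8 while leaving the correlation unchanged, which is why correlation is the better measure of strength.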