In [1]:
# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
# might choose one over the other.

# Ordinal Encoding and Label Encoding are both techniques used in categorical variable encoding, but they differ in their application and suitability depending on the nature of the categorical data.

# ### Ordinal Encoding:

# - **Definition:** Ordinal Encoding assigns a unique integer value to each category in a categorical feature. The assigned integers are ordered based on the ordinal relationship between the categories.

# - **Example:** Suppose you have a categorical feature "Education Level" with categories "High School", "College", and "Postgraduate". Ordinal Encoding might assign integers like 0, 1, and 2 respectively. These integers reflect an inherent order or ranking among the categories.

# - **Suitability:** Ordinal Encoding is appropriate when there is a clear ordering or ranking among the categories. It preserves the ordinal relationship between categories, which can be important in scenarios where the categorical variable represents levels or grades.

# ### Label Encoding:

# - **Definition:** Label Encoding also assigns a unique integer to each category, but unlike Ordinal Encoding, it does not consider any ordinal relationship. The integers are assigned arbitrarily, typically starting from 0; scikit-learn's `LabelEncoder`, for example, numbers the categories in sorted (alphabetical) order.

# - **Example:** Consider a categorical feature "City" with categories "New York", "London", and "Paris". Label Encoding might assign integers like 0, 1, and 2 respectively. These integers serve as arbitrary labels without any inherent order or meaning.

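# - **Suitability:** Label Encoding is suitable when there is no intrinsic order or ranking among the categories and the model will not read the integers as magnitudes, for example tree-based models or an encoded target variable. For linear or distance-based models, arbitrary integer codes can imply a spurious order, so one-hot encoding is usually the safer choice for nominal features.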

# ### When to Choose One Over the Other:

# - **Ordinal Encoding:** Choose Ordinal Encoding when the categorical variable has a meaningful order or hierarchy that should be preserved and utilized by the model. For example, in "Education Level", the model should understand that "Postgraduate" is higher than "College" and "College" is higher than "High School".

# - **Label Encoding:** Choose Label Encoding when the categorical variable does not have a natural order or when the order is not important for the model. For instance, in "City" names, the model does not need to interpret any ranking or hierarchy; it just needs to differentiate between different cities.

# In summary, the choice between Ordinal Encoding and Label Encoding depends on whether the categorical variable exhibits an ordinal relationship among its categories. If there is a meaningful order, Ordinal Encoding should be used to retain and utilize this information. If no such order exists or is irrelevant, Label Encoding provides a straightforward way to convert categories into numerical form for machine learning algorithms.
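
# A minimal sketch of the difference, assuming scikit-learn is installed. The sample values are illustrative, and the category order passed to `OrdinalEncoder` is supplied by hand as an assumption about how the levels rank.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data (not from a real dataset)
education = np.array([["College"], ["High School"], ["Postgraduate"], ["College"]])
cities = ["New York", "London", "Paris", "London"]

# OrdinalEncoder: the rank order is stated explicitly via `categories`
ordinal_encoder = OrdinalEncoder(
    categories=[["High School", "College", "Postgraduate"]]
)
education_codes = ordinal_encoder.fit_transform(education).ravel()
print(education_codes)

# LabelEncoder: codes follow the sorted order of the labels, which carries no meaning
label_encoder = LabelEncoder()
city_codes = label_encoder.fit_transform(cities)
print(list(label_encoder.classes_))
print(city_codes)
```

# Here the education codes respect High School < College < Postgraduate, while the city codes only reflect alphabetical order.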

In [None]:
# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
# a machine learning project.

# Target Guided Ordinal Encoding is a technique used for encoding categorical variables where the categories are assigned ordinal values based on the target variable's mean or median. This method leverages the relationship between the categorical feature and the target variable to create informative and predictive ordinal labels.

# ### Steps in Target Guided Ordinal Encoding:

# 1. **Calculate Mean/Median of Target Variable:**
#    - Group the categorical variable by its categories.
#    - Compute the mean or median of the target variable for each category.

# 2. **Assign Ordinal Labels:**
#    - Sort the categories based on their mean or median target value.
#    - Assign ordinal labels (integers) starting from 0 or 1 to these sorted categories.

# 3. **Map Categories to Labels:**
#    - Replace the categorical values with their corresponding ordinal labels.

# ### Example:

# Suppose you have a dataset with a categorical feature "Education Level" and a binary target variable "Income" (0 for low income, 1 for high income). You want to predict income based on education level. Here's how Target Guided Ordinal Encoding might work:

# - **Dataset Example:**
#   - Education Level: ["High School", "College", "Postgraduate", "High School", "Postgraduate", "College"]
#   - Income: [0, 1, 1, 0, 1, 0]

# - **Step-by-Step Process:**
#   1. **Calculate Mean/Median Income for Each Education Level:**
#      - High School: Mean Income = 0.0
#      - College: Mean Income = 0.5
#      - Postgraduate: Mean Income = 1.0

#   2. **Assign Ordinal Labels:**
#      - Sort Education Levels based on mean income: ["High School", "College", "Postgraduate"]
#      - Assign labels: {"High School": 0, "College": 1, "Postgraduate": 2}

#   3. **Transform Data:**
#      - Replace "High School" with 0, "College" with 1, and "Postgraduate" with 2.

# - **Transformed Dataset:**
#   - Education Level (Transformed): [0, 1, 2, 0, 2, 1]
#   - Income: [0, 1, 1, 0, 1, 0]
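
# The three steps can be sketched with pandas as follows. This is one possible implementation under the assumption that the mean is used as the ordering statistic; the column name "Education Level (Transformed)" is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Education Level": ["High School", "College", "Postgraduate",
                        "High School", "Postgraduate", "College"],
    "Income": [0, 1, 1, 0, 1, 0],
})

# Step 1: mean of the target for each category
category_means = df.groupby("Education Level")["Income"].mean()

# Step 2: sort categories by that mean and number them from 0
ordered_categories = category_means.sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered_categories)}

# Step 3: replace each category with its rank
df["Education Level (Transformed)"] = df["Education Level"].map(mapping)

print(mapping)
print(df)
```

# In a real project the mapping would normally be computed from the training split only and then applied to validation and test data.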

# ### When to Use Target Guided Ordinal Encoding:

# - **Predictive Modeling:** Use Target Guided Ordinal Encoding when there is a clear relationship between the categorical variable and the target variable. It helps capture this relationship by assigning ordinal labels that reflect the predictive power of each category on the target variable.
  
# - **Ordinal Relationships:** If the categorical feature has an ordinal nature but lacks explicit numerical order, using the target variable's information to create ordinal labels can enhance predictive accuracy.

# - **Example Use Case:** In a marketing campaign prediction project, you might have a categorical feature "Customer Segment" and a target variable indicating whether a customer responded positively to a campaign. Target Guided Ordinal Encoding could help assign ordinal labels to customer segments based on their response rates, thereby creating a more predictive feature for the model.

# In summary, Target Guided Ordinal Encoding is a useful technique in machine learning when you want to incorporate the predictive power of categorical variables into ordinal labels that align with the target variable's outcomes. It can improve predictive accuracy by leveraging the relationship between the feature and the target, provided the category-to-label mapping is computed on the training data only so that target information does not leak into validation or test sets.

In [None]:
# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

# **Covariance** is a statistical measure that describes how two random variables vary together. It is important in statistical analysis because its sign shows the direction of the linear relationship between the variables (whether they tend to move in the same direction or in opposite directions). Its magnitude depends on the units of the variables, so it is not by itself a measure of how strong the relationship is.

# ### Importance of Covariance:

# 1. **Relationship Insight:** Covariance provides insights into whether two variables tend to increase or decrease together (positive covariance) or move in opposite directions (negative covariance).

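# 2. **Quantifying Dependency:** It quantifies the direction of the linear relationship between variables. Because its magnitude depends on the units of measurement, strength is usually judged from the correlation coefficient, which is the covariance divided by the product of the two standard deviations.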

# 3. **Variable Selection:** In multivariate analysis, covariance helps in selecting variables that have significant relationships with each other, which is crucial for modeling and prediction tasks.

# ### Calculation of Covariance:

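# The population covariance \( \text{Cov}(X, Y) \) between two random variables \( X \) and \( Y \) can be calculated using the formula below. The sample covariance divides by \( n - 1 \) instead of \( n \), which is the default in `np.cov`: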

# \[ \text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

# Where:
# - \( n \) is the number of observations (samples).
# - \( x_i \) and \( y_i \) are the individual observations of variables \( X \) and \( Y \), respectively.
# - \( \bar{x} \) and \( \bar{y} \) are the mean (average) values of \( X \) and \( Y \), calculated as \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \) and \( \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \).
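
# As a rough illustration, the formula can be written directly in plain Python. The function name `covariance`, its `ddof` parameter, and the sample values are hypothetical; `ddof=0` divides by n as in the formula above, and `ddof=1` gives the n - 1 version often used for samples.

```python
def covariance(x, y, ddof=0):
    """Covariance of two equal-length sequences.

    ddof=0 divides by n; ddof=1 divides by n - 1.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (n - ddof)


# Hypothetical sample values
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 64, 68]

cov_population = covariance(hours_studied, exam_score)       # divides by n
cov_sample = covariance(hours_studied, exam_score, ddof=1)   # divides by n - 1
print(cov_population, cov_sample)
```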

# ### Interpretation of Covariance:

# - **Positive Covariance:** Indicates that as one variable increases, the other tends to increase as well.
# - **Negative Covariance:** Indicates that as one variable increases, the other tends to decrease.
# - **Zero Covariance:** Implies that there is no linear relationship between the variables.

# ### Limitations of Covariance:

# - **Scale Dependency:** Covariance is sensitive to the scale of variables. Therefore, comparing covariances directly between variables with different scales can be misleading.
  
# - **Only Linear Relationship:** Covariance measures only linear relationships. It may not capture complex relationships that are nonlinear.
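
# The scale dependency can be seen in a small sketch, assuming NumPy is available. The height and weight values are made up; the same heights are expressed once in centimetres and once in metres.

```python
import numpy as np

# Hypothetical measurements
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([50.0, 58.0, 65.0, 74.0, 80.0])

height_m = height_cm / 100.0  # same data, different unit

# Off-diagonal entry of the 2x2 matrix is the covariance / correlation of the pair
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

print(cov_cm, cov_m)      # covariance changes by a factor of 100
print(corr_cm, corr_m)    # correlation is the same in both units
```

# Changing the unit rescales the covariance but leaves the correlation untouched, which is why correlation is preferred when comparing relationships across differently scaled variables.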

# In summary, covariance is a fundamental measure in statistics used to understand the relationship between variables. It helps in identifying patterns, dependencies, and interactions among variables, which is essential for various statistical analyses, including regression modeling, dimensionality reduction, and hypothesis testing.

In [None]:
# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
# large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
# Show your code and explain the output.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample dataset
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['medium', 'small', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

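# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply LabelEncoder to each column. Each call to fit_transform refits the
# encoder, so after these three lines it only remembers the Material classes.
df['Color_encoded'] = label_encoder.fit_transform(df['Color'])
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
df['Material_encoded'] = label_encoder.fit_transform(df['Material'])

# Explanation of the output: LabelEncoder numbers the classes in sorted
# (alphabetical) order, so
#   Color:    blue -> 0, green -> 1, red -> 2
#   Size:     large -> 0, medium -> 1, small -> 2
#   Material: metal -> 0, plastic -> 1, wood -> 2
# The Size codes therefore do not follow small < medium < large.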

print("\nEncoded DataFrame:")
print(df)
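
# If the mappings need to be inspected or reversed later, one option is to keep a separate fitted encoder per column. The sketch below assumes the same sample data; the `encoders` dictionary is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['medium', 'small', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood'],
})

# One fitted encoder per column, so every mapping stays available
encoders = {}
for column in ['Color', 'Size', 'Material']:
    encoders[column] = LabelEncoder()
    df[column + '_encoded'] = encoders[column].fit_transform(df[column])

# Inspect the learned label -> code mapping for each column
for column, encoder in encoders.items():
    print(column, {label: code for code, label in enumerate(encoder.classes_)})

# Recover the original labels from the integer codes
decoded_colors = list(encoders['Color'].inverse_transform(df['Color_encoded']))
print(decoded_colors)
```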


In [None]:
# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
# level. Interpret the results.

import numpy as np

# Example dataset (hypothetical values)
# Each row represents a sample, and columns represent variables Age, Income, Education Level
data = np.array([
    [30, 50000, 3],    # Sample 1
    [35, 60000, 2],    # Sample 2
    [25, 40000, 1],    # Sample 3
    [40, 70000, 2],    # Sample 4
    [28, 45000, 1]     # Sample 5
])

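# Calculate covariance matrix (rowvar=False treats each column as a variable;
# np.cov divides by n - 1 by default, giving the sample covariance)
covariance_matrix = np.cov(data, rowvar=False)

# Interpretation (rows and columns are Age, Income, Education Level):
# - Diagonal entries are the variances: 35.3 (Age), 1.45e8 (Income), 0.7 (Education Level).
# - Cov(Age, Income) = 71500, Cov(Age, Education Level) = 2.15 and
#   Cov(Income, Education Level) = 4500 are all positive, so in this sample
#   each pair of variables tends to rise together.
# - The magnitudes reflect the units (Income is in the tens of thousands), so
#   they cannot be compared directly as a measure of strength.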

print("Covariance Matrix:")
print(covariance_matrix)
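
# A labeled version of the same matrix can be easier to read. The sketch below assumes pandas is available and reuses the hypothetical values from this cell; `DataFrame.cov()` also divides by n - 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[30, 50000, 3], [35, 60000, 2], [25, 40000, 1], [40, 70000, 2], [28, 45000, 1]],
    columns=["Age", "Income", "Education Level"],
)

# Sample covariance with row and column labels
cov_labeled = df.cov()
print(cov_labeled)

# The diagonal of the covariance matrix holds the per-column sample variances
variances = df.var()
print(variances)
```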


In [None]:
# Q6. You are working on a machine learning project with a dataset containing several categorical
# variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
# and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
# each variable, and why?

import pandas as pd

# Sample dataset
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Education Level': ['Bachelor\'s', 'PhD', 'Master\'s', 'High School'],
    'Employment Status': ['Full-Time', 'Part-Time', 'Unemployed', 'Full-Time']
}

df = pd.DataFrame(data)

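# One-Hot Encoding for categorical variables.
# Rationale: one-hot columns impose no order on the categories, which suits a
# nominal variable such as Gender. Education Level has a natural order
# (High School < Bachelor's < Master's < PhD), so ordinal encoding is a common
# alternative for it; Employment Status can be treated either way, depending on
# whether the model should see Unemployed < Part-Time < Full-Time as a ranking.
df_encoded = pd.get_dummies(df, columns=['Gender', 'Education Level', 'Employment Status'])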

print("Encoded DataFrame:")
print(df_encoded)
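
# As a possible alternative, Education Level can be encoded ordinally while the other columns stay one-hot. This is a sketch, assuming the ranking in `education_order` (a name introduced here for illustration) is the intended one.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Education Level': ["Bachelor's", 'PhD', "Master's", 'High School'],
    'Employment Status': ['Full-Time', 'Part-Time', 'Unemployed', 'Full-Time'],
})

# Assumed ranking of education levels, lowest to highest
education_order = ['High School', "Bachelor's", "Master's", 'PhD']

df_mixed = df.copy()
# Ordinal codes: the position of each value in education_order
df_mixed['Education Level'] = pd.Categorical(
    df_mixed['Education Level'], categories=education_order, ordered=True
).codes

# One-hot encode the remaining nominal columns
df_mixed = pd.get_dummies(df_mixed, columns=['Gender', 'Employment Status'])
print(df_mixed)
```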


In [None]:
# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
# categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
# East/West). Calculate the covariance between each pair of variables and interpret the results.

import numpy as np
import pandas as pd

# Sample dataset (hypothetical numerical values)
data = {
    'Temperature': [25, 30, 22, 28, 27],
    'Humidity': [50, 60, 45, 55, 52],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

df = pd.DataFrame(data)

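# Numerical encoding for categorical variables
# (codes are assigned in alphabetical order and carry no magnitude, so they are
# left out of the covariance calculation below)
df['Weather Condition'] = df['Weather Condition'].astype('category').cat.codes
df['Wind Direction'] = df['Wind Direction'].astype('category').cat.codes

# Calculate covariance matrix for continuous variables
# Result: Var(Temperature) = 9.3, Var(Humidity) = 31.3, Cov(Temperature, Humidity) = 16.8.
# The positive covariance means warmer observations in this sample also tend to be more humid.
covariance_matrix = np.cov(df[['Temperature', 'Humidity']], rowvar=False)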

print("Covariance Matrix:")
print(covariance_matrix)
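
# To cover the categorical variables as well, one option is to replace each category with a 0/1 indicator column and compute the covariance over all columns. This is a sketch of that idea rather than the only valid approach; it assumes pandas and reuses the sample values from this cell.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': [25, 30, 22, 28, 27],
    'Humidity': [50, 60, 45, 55, 52],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North'],
})

# Turn each category into a 0/1 indicator column
indicators = pd.get_dummies(
    df, columns=['Weather Condition', 'Wind Direction'], dtype=float
)

# Sample covariance between every pair of columns
full_cov = indicators.cov()
print(full_cov.round(2))
```

# For an indicator column, a negative covariance with a continuous variable means that observations in that category sit below the overall sample mean of that variable, and a positive covariance means they sit above it.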
