                                         Feature Engineering-5

Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

"""Ordinal encoding and label encoding are both techniques used in data preprocessing when 
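   working with categorical variables, but they differ in how the integers are chosen and what they represent.

   1.Label Encoding:

    In label encoding, each category is assigned a unique integer value.
    scikit-learn's LabelEncoder assigns these integers in sorted (alphabetical) order of the labels,
    so the integers don't reflect any inherent order in the categories.
    It is suitable for nominal data where there is no meaningful order among the categories, and
    scikit-learn intends it mainly for target labels. For nominal input features, some models may
    read the integers as a ranking, so One-Hot Encoding is often the safer choice."""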
    
from sklearn.preprocessing import LabelEncoder

data = ['red', 'green', 'blue', 'green']
label_encoder = LabelEncoder()
# Classes are sorted alphabetically: blue -> 0, green -> 1, red -> 2
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)  # [2 1 0 1]


"""2.Ordinal Encoding:

   In ordinal encoding, the categories are assigned integer values based on a specified order or ranking.
   This is suitable for ordinal data where there is a meaningful order among the categories."""
   
from sklearn.preprocessing import OrdinalEncoder

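# One categorical feature; each inner list is one row
data = [['low'], ['medium'], ['high'], ['medium']]
# Pass the categories in their real-world order so that low < medium < high
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_data = ordinal_encoder.fit_transform(data)
print(encoded_data)  # low -> 0.0, medium -> 1.0, high -> 2.0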
   
    
"""When to choose one over the other:

Use label encoding when the categories have no inherent order, and you just want to represent them with unique integers.
Use ordinal encoding when there is a meaningful order or ranking among the categories, and you want to preserve
that information."""
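The difference is easiest to see when both encoders are applied to the same ordered categories. The following is a minimal sketch with made-up size values (the variable names are illustrative only): LabelEncoder numbers the categories alphabetically, while OrdinalEncoder follows the order passed in `categories`.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ['small', 'large', 'medium', 'small']

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder with an explicit order: small=0, medium=1, large=2
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordinal_codes = ordinal_encoder.fit_transform([[s] for s in sizes]).ravel()

print(label_codes)    # [2 0 1 2]
print(ordinal_codes)  # [0. 2. 1. 0.]
```

Only the second encoding preserves small < medium < large, which is why ordinal encoding is preferred when the order carries information.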

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

"""Target Guided Ordinal Encoding is a method used for encoding categorical variables based on 
the mean of the target variable for each category. The categories are ranked by that mean, so the
resulting integers capture the relationship between the categories and the target variable.
Note that this ranking can differ from any natural order of the categories.

1.Calculate the Mean of the Target Variable for Each Category:

    Group the data by the categorical variable and compute the mean of
    the target variable for each group.

2.Order Categories Based on Target Variable Means:

    Sort the categories by their target means and assign each category
    an integer rank (0 for the lowest mean, 1 for the next, and so on).

3.Map Ordinal Values to Categories:

    Replace each category in the original dataset with its rank.

Example:

    Suppose you have a variable "Education Level" with categories 'High School', 'Bachelor's', 'Master's',
    and 'Ph.D.', and a binary target variable indicating whether a person defaults on a loan
    (1 for default, 0 for no default)."""
    
import pandas as pd


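data = {'Education Level': ['High School', "Bachelor's", "Master's", 'Ph.D.', "Bachelor's", 'High School'],
        'Default': [1, 0, 0, 1, 0, 1]}

df = pd.DataFrame(data)

# Steps 1 and 2: mean default rate per category, sorted from lowest to highest
# (mergesort is stable, so tied categories keep their alphabetical order)
education_means = df.groupby('Education Level')['Default'].mean().sort_values(kind='mergesort')

# Step 3: assign ranks 0, 1, 2, ... in that order and map them back onto the data
ordinal_mapping = {education: i for i, education in enumerate(education_means.index)}
df['Education Level Ordinal'] = df['Education Level'].map(ordinal_mapping)

print(df[['Education Level', 'Education Level Ordinal']])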


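"""In this example, 'Bachelor's' and 'Master's' both have a mean default rate of 0, while 'High School' and 'Ph.D.'
both have a mean default rate of 1. The categories with the lowest mean receive the lowest ordinal values and those
with the highest mean receive the highest. The order within each tied pair depends only on how the sort breaks ties,
not on the target."""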
        
        
"""When to use Target Guided Ordinal Encoding:
    
    1.Target Guided Ordinal Encoding is beneficial when there is a clear ordinal relationship between categories,
      and this relationship is related to the target variable.
        
    2.It can be applied when the ordinal variable represents levels of expertise, education, 
    or any other ordinal characteristic where the order is meaningful in terms of the target variable."""   
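In a project, the mapping is usually learned from the training data only and then applied to new data, so that target values from the test set do not leak into the features. The following is a minimal sketch under stated assumptions: `fit_target_ordinal_mapping` is a hypothetical helper written for this illustration, the `City` data is made up, and mapping unseen categories to -1 is an arbitrary choice.

```python
import pandas as pd


def fit_target_ordinal_mapping(frame, column, target):
    """Hypothetical helper: rank the categories of `column` by the mean of `target`."""
    means = frame.groupby(column)[target].mean().sort_values(kind='mergesort')
    return {category: rank for rank, category in enumerate(means.index)}


# Made-up training data with a binary target
train = pd.DataFrame({
    'City': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Default': [1, 0, 0, 1, 1, 0],
})
# New data; 'D' never appeared during training
test = pd.DataFrame({'City': ['C', 'A', 'D']})

# Learn the mapping on the training data only
mapping = fit_target_ordinal_mapping(train, 'City', 'Default')

# Apply it to new data; unseen categories become -1 (an illustrative convention)
test['City Ordinal'] = test['City'].map(mapping).fillna(-1).astype(int)

print(mapping)  # {'C': 0, 'B': 1, 'A': 2}
print(test)
```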
    
    

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that describes the degree to which two random variables change together.
It captures the direction of the linear relationship between two variables. A positive covariance indicates that the two
variables tend to move in the same direction, while a negative covariance indicates that the two variables tend
to move in opposite directions. A zero covariance indicates that the two variables are not linearly related.

Covariance is important in statistical analysis for several reasons. First,
it can be used to assess the direction of the relationship between two variables.
Second, it is the basis of the correlation coefficient, which standardizes covariance so that the strength of
relationships can be compared between different pairs of variables. Covariance itself depends on the scale of
the variables, so raw covariances are not directly comparable.
Third, it can be used to control for the effects of extraneous variables in statistical models.

How is covariance calculated?

The covariance between two random variables X and Y can be calculated using the following formula:


cov(X, Y) = E[(X - μ_X)(Y - μ_Y)]


where:

 E[] denotes the expected value
 μ_X is the mean of X
 μ_Y is the mean of Y

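In practice, the covariance is often estimated from a sample using the following formula, which divides
by n - 1 rather than n (Bessel's correction) so that the estimate is unbiased:


s_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}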


where:

 n is the sample size
 x_i is the ith observation of X
 y_i is the ith observation of Y
 x̄ is the sample mean of X
 ȳ is the sample mean of Y
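
The sample formula can be checked numerically. The following is a minimal sketch with made-up data: NumPy's `np.cov` divides by n - 1 by default, and passing `ddof=0` switches to dividing by n.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])
n = len(x)

# Sum of products of deviations from the sample means
cross_deviation_sum = np.sum((x - x.mean()) * (y - y.mean()))

sample_cov = cross_deviation_sum / (n - 1)   # unbiased sample estimate
population_cov = cross_deviation_sum / n     # divides by n instead

print(sample_cov, np.cov(x, y)[0, 1])
print(population_cov, np.cov(x, y, ddof=0)[0, 1])
```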

The covariance is not a unitless measure: its units are the product of the units of X and Y, so its
magnitude depends on the scale of the two variables.
However, the covariance can be standardized by dividing it by the standard deviations of the two variables.
This standardized measure is called the correlation coefficient.


ρ_{xy} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}


where:

 ρ_{xy} is the correlation coefficient between X and Y
 σ_X is the standard deviation of X
 σ_Y is the standard deviation of Y

The correlation coefficient is a dimensionless measure that ranges from -1 to 1. A correlation
coefficient of -1 indicates a perfect negative linear relationship, a correlation coefficient of 0 indicates
no linear relationship, and a correlation coefficient of 1 indicates a perfect positive linear relationship.
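
The relationship between covariance and correlation can be sketched in a few lines. The data below is made up for illustration; the example also rescales x to show that covariance changes with the units of measurement while correlation does not.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Correlation = covariance divided by the product of the sample standard deviations
cov_xy = np.cov(x, y)[0, 1]
rho = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(rho, np.corrcoef(x, y)[0, 1])

# Rescaling x (for example, changing its units) scales the covariance
# by the same factor but leaves the correlation unchanged
x_scaled = 100 * x
cov_scaled = np.cov(x_scaled, y)[0, 1]
rho_scaled = np.corrcoef(x_scaled, y)[0, 1]
print(cov_scaled, rho_scaled)
```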

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

from sklearn.preprocessing import LabelEncoder
import pandas as pd


data = {'Color': ['red', 'green', 'blue', 'red', 'blue'],
        'Size': ['small', 'medium', 'large', 'small', 'medium'],
        'Material': ['wood', 'metal', 'plastic', 'wood', 'plastic']}

df = pd.DataFrame(data)


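# Fit a separate LabelEncoder per column so each mapping can be inspected or inverted later
encoders = {}
for column in ['Color', 'Size', 'Material']:
    encoders[column] = LabelEncoder()
    df[column + '_encoded'] = encoders[column].fit_transform(df[column])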


print("Original Dataset:")
print(df[['Color', 'Size', 'Material']])
print("\nEncoded Dataset:")
print(df[['Color_encoded', 'Size_encoded', 'Material_encoded']])




Original Dataset:
   Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic
3    red   small     wood
4   blue  medium  plastic

Encoded Dataset:
   Color_encoded  Size_encoded  Material_encoded
0              2             2                 2
1              1             1                 0
2              0             0                 1
3              2             2                 2
4              0             1                 1
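
Explanation of the output: LabelEncoder sorts the distinct values of each column alphabetically and numbers them from 0. For Color this gives blue=0, green=1, red=2; for Size, large=0, medium=1, small=2; for Material, metal=0, plastic=1, wood=2. The Size codes therefore do not follow small < medium < large. The mapping can be read from a fitted encoder's `classes_` attribute; a minimal sketch that re-creates the Size column:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ['small', 'medium', 'large', 'small', 'medium']
encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the sorted categories; the position of each one is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'large': 0, 'medium': 1, 'small': 2}
print(codes)    # [2 1 0 2 1]

# inverse_transform recovers the original labels from the codes
print(encoder.inverse_transform(codes))
```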


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

import pandas as pd

data = {'Age': [25, 30, 35, 40, 45],
        'Income': [50000, 60000, 75000, 90000, 80000],
        'Education Level': [12, 16, 14, 18, 20]}

df = pd.DataFrame(data)


covariance_matrix = df.cov()


print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
                      Age       Income  Education Level
Age                  62.5     112500.0             22.5
Income           112500.0  255000000.0          37500.0
Education Level      22.5      37500.0             10.0


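Interpretation:

   The diagonal entries are the variances of each variable: 62.5 for Age, 255000000 for Income, and 10 for
   Education Level. The off-diagonal entries are the covariances between pairs of variables.

   1.Age and Income:

    The covariance between Age and Income is 112500. This positive covariance suggests a positive relationship between
    Age and Income, meaning that, on average, as Age increases, Income tends to increase.

   2.Age and Education Level:

    The covariance between Age and Education Level is 22.5. This positive covariance indicates a positive relationship,
    suggesting that, on average, as Age increases, Education Level tends to increase.

   3.Income and Education Level:

    The covariance between Income and Education Level is 37500. This positive covariance implies a positive relationship,
    suggesting that, on average, as Income increases, Education Level tends to increase.

   The magnitudes are not comparable with each other, because covariance depends on the units of the variables:
   the values involving Income are large mainly because Income is measured on a much larger scale.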
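
To compare the strength of these relationships, the covariances can be standardized into correlations. The following is a minimal sketch that reuses the data from this question and pandas' `DataFrame.corr()`, which computes Pearson correlations by default.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45],
                   'Income': [50000, 60000, 75000, 90000, 80000],
                   'Education Level': [12, 16, 14, 18, 20]})

# Each entry is cov(X, Y) / (std(X) * std(Y)), so all values lie between -1 and 1
correlation_matrix = df.corr()
print(correlation_matrix.round(3))

# Example: Age vs Education Level is 22.5 / (sqrt(62.5) * sqrt(10)) = 0.9
age_education_corr = correlation_matrix.loc['Age', 'Education Level']
```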

Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In a machine learning project with categorical variables like "Gender," "Education Level," and "Employment Status,"
the choice of encoding method depends on the nature of each variable and the relationships you want to capture. 

1.Gender:

   Encoding Method: Use Label Encoding or One-Hot Encoding.
   Reasoning:
   If there are only two categories (Male/Female), you can use Label Encoding, where Male is encoded as 0 and Female as 1.
   With only two categories, the integers cannot imply a misleading order.
   If there are more than two categories or you want to avoid implying ordinal relationships, use One-Hot Encoding.
   This creates binary columns for each category (e.g., Male and Female columns with 0s and 1s).

2.Education Level:

   Encoding Method: Use Ordinal Encoding or One-Hot Encoding.
   Reasoning:
   If there is a clear ordinal relationship among education levels (e.g., High School < Bachelor's < Master's < PhD),
   use Ordinal Encoding to capture this relationship.
   If there is no inherent order or the order is not meaningful, use One-Hot Encoding to create binary columns
   for each education level.

3.Employment Status:

   Encoding Method: Use One-Hot Encoding.
   Reasoning:
   Employment status typically doesn't have a meaningful order, and each category is independent. One-Hot Encoding
   is suitable as it creates binary columns for each employment status, representing their presence or absence.



import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = {'Gender': ['Male', 'Female', 'Male', 'Female'],
        'Education Level': ['High School', "Bachelor's", "Master's", 'PhD'],
        'Employment Status': ['Unemployed', 'Full-Time', 'Part-Time', 'Full-Time']}

df = pd.DataFrame(data)

# One-hot encode the nominal variables; drop_first=True drops one redundant column per variable,
# and dtype=int gives 0/1 columns instead of booleans on newer pandas versions
df_encoded = pd.get_dummies(df, columns=['Gender', 'Employment Status'], drop_first=True, dtype=int)

# Ordinal-encode Education Level using its natural order
ordinal_encoder = OrdinalEncoder(categories=[['High School', "Bachelor's", "Master's", 'PhD']])
df_encoded['Education Level'] = ordinal_encoder.fit_transform(df[['Education Level']]).ravel()

print(df_encoded)


   Education Level  Gender_Male  Employment Status_Part-Time  \
0              0.0            1                            0   
1              1.0            0                            0   
2              2.0            1                            1   
3              3.0            0                            0   

   Employment Status_Unemployed  
0                             1  
1                             0  
2                             0  
3                             0  


Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

"""To calculate the covariance between each pair of variables in a dataset with two continuous variables 
("Temperature" and "Humidity") and two categorical variables ("Weather Condition" and "Wind Direction"), 
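you can use the covariance matrix. However, covariance is only defined for numeric variables, so the
categorical columns are excluded from the matrix unless they are first encoded as numbers. Even then,
covariances involving arbitrary integer codes are hard to interpret."""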


import pandas as pd


data = {'Temperature': [25, 22, 28, 20, 30],
        'Humidity': [50, 60, 45, 75, 40],
        'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
        'Wind Direction': ['North', 'South', 'East', 'West', 'North']}

df = pd.DataFrame(data)


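# Covariance is only defined for numeric columns; numeric_only=True skips the categorical ones
# (newer pandas versions raise an error on string columns without it)
covariance_matrix = df.cov(numeric_only=True)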


print("Covariance Matrix:")
print(covariance_matrix)




Covariance Matrix:
             Temperature  Humidity
Temperature         17.0     -55.0
Humidity           -55.0     192.5


Interpretation:

The categorical variables "Weather Condition" and "Wind Direction" do not appear in the matrix,
because covariance is only computed for the numeric columns.

1.Temperature and Temperature:

The covariance of Temperature with itself is the variance of Temperature, which is 17.0. This indicates
the degree to which individual temperature values vary from the mean temperature.

2.Temperature and Humidity:

The covariance between Temperature and Humidity is -55.0. The negative value suggests an inverse relationship:
as Temperature tends to increase, Humidity tends to decrease.

3.Humidity and Temperature:

The covariance between Humidity and Temperature is the same as the one between Temperature and Humidity (-55.0),
because covariance is symmetric: cov(X, Y) = cov(Y, X).

4.Humidity and Humidity:

The covariance of Humidity with itself is the variance of Humidity, which is 192.5.
This indicates the degree to which individual humidity values vary from the mean humidity.
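
Two structural properties of a covariance matrix, symmetry and variances on the diagonal, can be checked directly. The following is a minimal sketch that reuses the two numeric columns from this question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Temperature': [25, 22, 28, 20, 30],
                   'Humidity': [50, 60, 45, 75, 40]})

cov = df.cov()

# Symmetry: cov(Temperature, Humidity) equals cov(Humidity, Temperature)
is_symmetric = np.allclose(cov.values, cov.values.T)

# The diagonal entries are the sample variances (ddof=1) of each column
diagonal_is_variance = np.allclose(np.diag(cov.values), df.var().values)

print(is_symmetric, diagonal_is_variance)
```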


