## Feature Engineering 4
**By Shahequa Modabbera**

### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Ans)
Ordinal encoding and label encoding are two common techniques used to convert categorical data into numerical data for machine learning algorithms. 

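Label encoding assigns each unique value in a categorical feature an integer label, without regard to any order among the values. For example, a categorical feature called "color" with the values "red", "blue", and "green" could be encoded as 0, 1, and 2, respectively; which value gets which number is arbitrary. (scikit-learn's LabelEncoder numbers the values in sorted order and is intended mainly for encoding target labels.)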

Ordinal encoding, on the other hand, is a technique that assigns a numerical value to each unique value in a categorical feature based on their order or rank. For example, if we have a categorical feature called "education level" with values "high school", "some college", "bachelor's degree", and "master's degree", we can encode them as 1, 2, 3, and 4, respectively, since we can assume that someone with a master's degree has a higher education level than someone with a high school degree.

The main difference between label encoding and ordinal encoding is that label encoding does not consider any order or hierarchy among the unique values of the categorical feature, while ordinal encoding does.

Which technique to choose depends on the nature of the categorical feature and the problem at hand. If there is a clear order or hierarchy among the categories, ordinal encoding is the better choice. For example, "education level" has a natural order, and encoding it ordinally lets the model use that order. If there is no inherent order, as with "color", label encoding can be used, since it makes no claim about order. Note, however, that the resulting integers are still numbers: linear and distance-based models will treat 2 as greater than 0, so for unordered input features one-hot encoding is usually the safer choice, while tree-based models are generally less sensitive to the arbitrary numbering.
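The contrast between the two encoders can be sketched with scikit-learn as follows. The sample values are hypothetical and chosen only for illustration. Note that OrdinalEncoder starts its codes at 0 rather than 1, and needs the rank order passed explicitly through `categories`; otherwise it also falls back to sorted order.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, chosen only for illustration.
education = [["bachelor's degree"], ["high school"],
             ["master's degree"], ["some college"]]
colors = ["red", "blue", "green", "blue"]

# OrdinalEncoder: pass the rank order explicitly so the codes follow it.
education_order = ["high school", "some college",
                   "bachelor's degree", "master's degree"]
ordinal = OrdinalEncoder(categories=[education_order])
education_codes = ordinal.fit_transform(education).ravel()

# LabelEncoder: codes follow the sorted (alphabetical) order of the labels.
label = LabelEncoder()
color_codes = label.fit_transform(colors)

print(education_codes)  # [2. 0. 3. 1.]
print(label.classes_)   # ['blue' 'green' 'red']
print(color_codes)      # [2 0 1 0]
```

The education codes respect the stated ranking, while the color codes reflect nothing more than alphabetical order.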

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Target Guided Ordinal Encoding is a type of ordinal encoding technique that replaces the categories with a number based on the target variable's mean or median value for each category. It is useful when there is a strong relationship between the categorical variable and the target variable. 

Here is an example of how to use Target Guided Ordinal Encoding in a machine learning project. Suppose you are working on a project that involves predicting the salary of employees in a company based on their job titles. You have a dataset that includes job titles and corresponding salaries for a sample of employees in the company. One way to use Target Guided Ordinal Encoding is to group the job titles by their mean salary and encode them accordingly.

First, calculate the mean salary for each job title in the dataset. Then, sort the job titles by their mean salary, assigning a unique number to each job title based on their position in the sorted list. For example, suppose we have the following job titles and mean salaries:

- Manager: $100,000
- Sales Associate: $50,000
- Engineer: $75,000
- Receptionist: $30,000

In this case, we could encode the job titles using the following mapping:

- Manager: 4
- Engineer: 3
- Sales Associate: 2
- Receptionist: 1

We assigned the number 4 to Manager because it has the highest mean salary, followed by 3 for Engineer, 2 for Sales Associate, and 1 for Receptionist. 

This encoding approach can be useful when the job title is an important feature for predicting salaries, and there is a clear relationship between job title and salary. By encoding the job title based on its relationship with salary, the resulting encoding can be a more informative feature for machine learning algorithms to use in predicting salary.
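The steps above can be sketched in pandas. The individual salaries below are hypothetical, picked so that the per-title means match the figures in the example; the column names and the `mapping` variable are likewise introduced here for illustration.

```python
import pandas as pd

# Hypothetical data: two employees per job title, with salaries chosen so
# the per-title means match the example (100k, 75k, 50k, 30k).
df = pd.DataFrame({
    "job_title": ["Manager", "Manager", "Engineer", "Engineer",
                  "Sales Associate", "Sales Associate",
                  "Receptionist", "Receptionist"],
    "salary": [95_000, 105_000, 70_000, 80_000,
               45_000, 55_000, 28_000, 32_000],
})

# Step 1: mean salary per title, sorted from lowest to highest.
mean_salary = df.groupby("job_title")["salary"].mean().sort_values()

# Step 2: rank 1 goes to the lowest mean, rank 4 to the highest.
mapping = {title: rank
           for rank, title in enumerate(mean_salary.index, start=1)}

# Step 3: replace each title with its rank.
df["job_title_encoded"] = df["job_title"].map(mapping)

print(mapping)
# {'Receptionist': 1, 'Sales Associate': 2, 'Engineer': 3, 'Manager': 4}
```

Because the mapping is derived from the target, it should be computed on the training data only and then applied unchanged to validation and test data; otherwise information about the target leaks into the feature.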

### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure of how two variables vary together. A positive covariance means the variables tend to move in the same direction, a negative covariance means they tend to move in opposite directions, and a covariance near zero means there is no linear relationship between them (a nonlinear relationship may still exist).

Covariance is important in statistical analysis because it helps us understand the relationship between two variables. For example, if we are analyzing the relationship between a person's age and their income, we can use covariance to see whether the two move together. A positive covariance suggests that as age increases, income also tends to increase. A negative covariance suggests that as age increases, income tends to decrease. Because the magnitude of covariance depends on the units of the variables, it indicates the direction of a relationship but not its strength; correlation, which rescales covariance by the two standard deviations, is used for that.

Covariance is calculated by taking, for each observation, the product of the two variables' deviations from their respective means, summing these products over all observations, and dividing by the number of observations. The formula for the population covariance is:

cov(X,Y) = (1/n) * Σ[(xi - x̄)(yi - ȳ)]

where X and Y are the two variables being analyzed, n is the number of observations, xi and yi are the individual values of the variables, x̄ and ȳ are the means of the variables, and Σ denotes the sum over all n observations.

When the covariance is estimated from a sample, the sum is divided by n - 1 instead of n:

cov(X,Y) = (1/(n - 1)) * Σ[(xi - x̄)(yi - ȳ)]

This sample version is what pandas' DataFrame.cov() and NumPy's np.cov() compute by default.
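As a minimal sketch, the formula can be evaluated by hand with NumPy and compared against `np.cov`. The age and income values are hypothetical. The snippet computes both the version that divides by n and the version that divides by n - 1, since `np.cov` uses n - 1 unless `bias=True` is passed.

```python
import numpy as np

# Hypothetical values, for illustration only.
x = np.array([25, 32, 41, 50, 58], dtype=float)                      # age
y = np.array([30_000, 42_000, 50_000, 61_000, 67_000], dtype=float)  # income

n = len(x)

# Product of the deviations from the means, one term per observation.
products = (x - x.mean()) * (y - y.mean())

cov_population = products.sum() / n     # divides by n
cov_sample = products.sum() / (n - 1)   # divides by n - 1

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(x, y).
print(cov_population, np.cov(x, y, bias=True)[0, 1])
print(cov_sample, np.cov(x, y)[0, 1])
```

Each printed pair should agree up to floating-point rounding.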

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# create a sample dataset
data = {'color': ['red', 'green', 'blue', 'blue', 'red', 'green'],
        'size': ['small', 'medium', 'medium', 'large', 'small', 'medium'],
        'material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic']}

df = pd.DataFrame(data)

# initialize label encoder
le = LabelEncoder()

# label encode the categorical columns
df['color'] = le.fit_transform(df['color'])
df['size'] = le.fit_transform(df['size'])
df['material'] = le.fit_transform(df['material'])

# print the encoded dataset
print(df)

   color  size  material
0      2     2         2
1      1     1         0
2      0     1         1
3      0     0         2
4      2     2         0
5      1     1         1


In this code, we first import the LabelEncoder class from scikit-learn's preprocessing module and then create a sample dataset with three categorical columns: color, size, and material.

We then initialize a LabelEncoder object and call its fit_transform() method on each column in turn. Each call refits the encoder, learning that column's unique categories and replacing them with integer codes.

Finally, we print the encoded dataset. The codes start at 0 and increase by 1 for each unique category, with the categories numbered in sorted (alphabetical) order rather than in order of appearance.

For example, the color column is encoded as blue = 0, green = 1, red = 2. The size column is encoded as large = 0, medium = 1, small = 2, which happens to be the reverse of the natural order small < medium < large; this is an accident of alphabetical sorting, not something the encoder knows about. The material column is encoded as metal = 0, plastic = 1, wood = 2.
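Reusing a single encoder means only the last column's mapping survives after the loop of `fit_transform()` calls. One possible variation, sketched below on the same sample data, keeps a separate encoder per column so that each mapping can be inspected or inverted later. The `encoders` and `mappings` dictionaries are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue", "blue", "red", "green"],
    "size": ["small", "medium", "medium", "large", "small", "medium"],
    "material": ["wood", "metal", "plastic", "wood", "metal", "plastic"],
})

encoders = {}   # one fitted LabelEncoder per column
mappings = {}   # category -> integer code, per column

for column in df.columns:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder
    # classes_ holds the categories in sorted order; the index is the code.
    mappings[column] = {str(label): code
                        for code, label in enumerate(encoder.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
# color {'blue': 0, 'green': 1, 'red': 2}
# size {'large': 0, 'medium': 1, 'small': 2}
# material {'metal': 0, 'plastic': 1, 'wood': 2}
```

With the encoders kept, `encoders["color"].inverse_transform(df["color"])` recovers the original labels.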

### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [5]:
import numpy as np
import pandas as pd

# Generate random sample data for age (continuous variable)
age = np.random.normal(loc=30, scale=5, size=100)

# Generate random sample data for income (continuous variable)
income = np.random.normal(loc=50000, scale=10000, size=100)

# Generate random sample data for education level (categorical variable)
education_levels = ['High School', "Bachelor's", "Master's", 'PhD']
education = np.random.choice(education_levels, size=100)

from sklearn.preprocessing import LabelEncoder

# Convert categorical variable to numerical variable
le = LabelEncoder()
education_level_numerical = le.fit_transform(education)

# Create DataFrame
data = pd.DataFrame({'age': age, 'education_level': education_level_numerical, 'income': income})

# calculate the covariance matrix
cov_matrix = data.cov()

print(cov_matrix)

                         age  education_level        income
age                23.518886        -0.025768  2.828212e+03
education_level    -0.025768         1.404444 -5.159082e+02
income           2828.212013      -515.908154  8.033428e+07


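Interpretation:
- The diagonal entries are the variances of each variable: about 23.5 for age, 1.4 for education level, and 8.03e+07 for income.
- The covariance between age and income is positive (about 2828). The covariances between age and education level (about -0.026) and between income and education level (about -516) are negative.
- These magnitudes depend on the units of each variable, so they cannot be compared directly. Relative to the standard deviations, all three are small (the corresponding correlations are roughly 0.07, -0.004, and -0.05), which indicates essentially no linear relationship. That is expected here, because the three variables were generated independently at random.
- LabelEncoder numbers the education levels alphabetically (Bachelor's = 0, High School = 1, Master's = 2, PhD = 3), not by rank, so the sign of a covariance involving education_level should not be read as the effect of "more education".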
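Since raw covariances are hard to compare across variables with different units, one common follow-up is to rescale the covariance matrix into a correlation matrix by dividing each entry by the product of the two standard deviations. The sketch below does this for the matrix printed above; the numbers are copied from that output rather than regenerated, because the random data was created without a fixed seed.

```python
import numpy as np
import pandas as pd

labels = ["age", "education_level", "income"]

# Values copied from the printed covariance matrix.
cov_matrix = pd.DataFrame(
    [[23.518886, -0.025768, 2828.212013],
     [-0.025768, 1.404444, -515.908154],
     [2828.212013, -515.908154, 8.033428e7]],
    index=labels, columns=labels,
)

# Standard deviations are the square roots of the diagonal (the variances).
std = np.sqrt(np.diag(cov_matrix.to_numpy()))

# corr(i, j) = cov(i, j) / (std_i * std_j)
corr_matrix = cov_matrix / np.outer(std, std)

print(corr_matrix.round(3))
```

When the original DataFrame is still available, `data.corr()` gives the same result directly.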

### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

For the "Gender" variable, we would use binary encoding since there are only two unique values (Male/Female). Binary encoding would create one binary column to represent the variable.

For the "Education Level" variable, we would use ordinal encoding since there is a clear order to the categories (High School < Bachelor's < Master's < PhD). This method would assign numerical values to each category based on their order.

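For the "Employment Status" variable, we would use one-hot encoding, treating the categories as nominal. This method creates one binary column per unique value, so no ranking among the categories is implied. If the analysis does treat the categories as ordered by hours worked (Unemployed < Part-Time < Full-Time), ordinal encoding would be a reasonable alternative.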
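The three choices can be sketched together in pandas. The four sample rows and the output column names (`Gender_Male`, the `Employment` prefix) are hypothetical and introduced here only for illustration.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: a single 0/1 column is enough for two categories.
encoded = pd.DataFrame({"Gender_Male": (df["Gender"] == "Male").astype(int)})

# Education Level: integer codes that follow the stated rank order.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
encoded["Education Level"] = pd.Categorical(
    df["Education Level"], categories=education_order, ordered=True
).codes

# Employment Status: one binary column per category (one-hot).
status_dummies = pd.get_dummies(
    df["Employment Status"], prefix="Employment", dtype=int
)
encoded = pd.concat([encoded, status_dummies], axis=1)

print(encoded)
```

In a scikit-learn pipeline, the same idea would typically be expressed with OrdinalEncoder and OneHotEncoder inside a ColumnTransformer, so that the fitted mappings can be reused on new data.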

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [3]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# create a sample dataset
weather_condition = ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy', 'Cloudy', 'Sunny']
wind_direction = ['North', 'South', 'East', 'West', 'West', 'North', 'East']
temperature = [25, 20, 18, 22, 19, 23, 24]
humidity = [50, 65, 70, 60, 75, 55, 62]

# convert categorical variables to numerical variables
le = LabelEncoder()
weather_condition_numerical = le.fit_transform(weather_condition)
wind_direction_numerical = le.fit_transform(wind_direction)

# Calculate covariance matrix
data = pd.DataFrame({"Temperature": temperature, "Humidity": humidity, "Weather": weather_condition_numerical, "Wind": wind_direction_numerical})
cov_matrix = data.cov()

print("Covariance Matrix:")
print(cov_matrix)

Covariance Matrix:
             Temperature   Humidity   Weather      Wind
Temperature     6.952381 -19.785714  1.071429 -0.785714
Humidity      -19.785714  72.952381 -1.738095  2.619048
Weather         1.071429  -1.738095  0.809524 -0.071429
Wind           -0.785714   2.619048 -0.071429  1.619048


Interpretation:
- Temperature and humidity: the covariance is -19.79, indicating a negative relationship. In this sample, higher temperatures tend to go with lower humidity.
- Temperature and weather condition: 1.07 (positive).
- Temperature and wind direction: -0.79 (negative).
- Humidity and weather condition: -1.74 (negative).
- Humidity and wind direction: 2.62 (positive).
- Weather condition and wind direction: -0.07 (close to zero).

Only the first of these is directly interpretable. Weather condition and wind direction are nominal variables, and LabelEncoder assigned their codes alphabetically (Cloudy = 0, Rainy = 1, Sunny = 2; East = 0, North = 1, South = 2, West = 3), so the sign and size of any covariance involving them depend on that arbitrary numbering. With only seven observations, all of these estimates are also very noisy.
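To illustrate how much the covariance with a nominal variable depends on the numbering, the sketch below recomputes the temperature and weather covariance under two codings: the alphabetical one that LabelEncoder produced, and an equally arbitrary alternative (the `relabelled` mapping is hypothetical, introduced here for the comparison).

```python
import pandas as pd

# Same sample data as in the cell above.
temperature = pd.Series([25, 20, 18, 22, 19, 23, 24])
weather = pd.Series(["Sunny", "Cloudy", "Rainy", "Sunny",
                     "Rainy", "Cloudy", "Sunny"])

# The coding LabelEncoder produced (sorted order).
alphabetical = {"Cloudy": 0, "Rainy": 1, "Sunny": 2}
# A hypothetical alternative coding, no more or less valid than the first.
relabelled = {"Sunny": 0, "Cloudy": 1, "Rainy": 2}

# Series.cov computes the sample covariance (divides by n - 1).
cov_alphabetical = temperature.cov(weather.map(alphabetical))
cov_relabelled = temperature.cov(weather.map(relabelled))

print(round(cov_alphabetical, 3))  # 1.071
print(round(cov_relabelled, 3))    # -2.071
```

The sign flips even though the underlying data is identical, which is why covariances involving label-encoded nominal variables should not be interpreted as relationships.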