### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

__Ordinal Encoding__ is a method of converting categorical data into numerical data by assigning a unique integer value to each category based on their order or rank. For example, if we have a categorical variable called "Size" with categories "Small," "Medium," and "Large," we could assign the values of 1, 2, and 3 to these categories, respectively, based on their order.

__Label Encoding__ is a method of converting categorical data into numerical data by assigning a unique integer value to each category without any particular order. For example, if we have a categorical variable called "Color" with categories "Red," "Green," and "Blue," we could assign the values of 1, 2, and 3 to these categories, respectively, without any specific order. The integers act only as identifiers; their numeric order carries no meaning.

For example, if we are working with a dataset that contains a categorical variable called "Education Level" with categories "High School," "College," and "Graduate School," Ordinal Encoding would be the better choice, as the categories have a natural order. However, if we have a categorical variable called "Car Make" with categories such as "Ford," "Chevrolet," and "Toyota," Label Encoding would be the more appropriate choice, because the makes have no ranking. Keep in mind that some models may still interpret the integer codes as an order, even though none is intended.
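The difference can be sketched with scikit-learn. This is a minimal illustration, assuming a small made-up DataFrame; the column values and variable names are hypothetical. Note that scikit-learn numbers categories from 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "Education Level": ["College", "High School", "Graduate School", "College"],
    "Car Make": ["Ford", "Toyota", "Chevrolet", "Ford"],
})

# Ordinal encoding: the category order is stated explicitly
education_order = ["High School", "College", "Graduate School"]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_codes = ordinal_encoder.fit_transform(df[["Education Level"]]).ravel()

# Label encoding: codes follow the sorted (alphabetical) order of the labels,
# which has no meaning for car makes
label_encoder = LabelEncoder()
car_make_codes = label_encoder.fit_transform(df["Car Make"])

print(education_codes)  # High School=0, College=1, Graduate School=2
print(car_make_codes)   # Chevrolet=0, Ford=1, Toyota=2
```

`OrdinalEncoder` works on 2-D feature input and accepts the order through `categories`, while `LabelEncoder` works on a 1-D array and is primarily intended for target labels.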

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

__Target Guided Ordinal Encoding__ is a method of converting categorical data into numerical data by assigning ordinal values based on the relationship between the categories and the target variable. 

The steps involved in Target Guided Ordinal Encoding are as follows:

- Calculate the mean or median of the target variable for each category in the categorical variable.
- Order the categories based on the mean or median of the target variable in ascending or descending order.
- Assign an ordinal value to each category based on their order.

For example, suppose we have a dataset containing information about employees, including their age, education level, and salary, and we want to predict whether they will leave the company. The categorical variable is "education level," which has categories such as "High School," "College," and "Graduate School." The target is the binary "left the company" indicator (1 = left, 0 = stayed).

To use Target Guided Ordinal Encoding, we would follow these steps:

- Calculate the mean of the target (the proportion of employees who left) for each education level category.
- Order the categories by that proportion in ascending or descending order.
- Assign an ordinal value to each category based on its position in that order, such as 1 for the category with the lowest attrition rate and 3 for the category with the highest.

By using Target Guided Ordinal Encoding, we capture the relationship between education level and the target directly in the encoded values, which may help the model predict whether an employee will leave the company. The category statistics should be computed on the training data only, since using the full dataset leaks target information into the feature.
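The three steps can be sketched in pandas as follows. This is an illustrative sketch only: the `employees` DataFrame, its values, and the `left_company` target column are made up for the example.

```python
import pandas as pd

# Hypothetical data: left_company is the target (1 = left, 0 = stayed)
employees = pd.DataFrame({
    "education_level": ["High School", "College", "Graduate School",
                        "High School", "College", "Graduate School",
                        "High School", "College"],
    "left_company": [1, 0, 0, 1, 1, 0, 1, 0],
})

# Step 1: mean of the target for each category
target_means = employees.groupby("education_level")["left_company"].mean()

# Step 2: order the categories by that mean (ascending)
ordered_categories = target_means.sort_values().index

# Step 3: assign ranks 1, 2, 3, ... in that order
mapping = {category: rank
           for rank, category in enumerate(ordered_categories, start=1)}
employees["education_encoded"] = employees["education_level"].map(mapping)

print(mapping)
```

In a real project the mapping would be learned from the training split and then applied to validation and test data with the same `map` call.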

### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

__Covariance__ is a measure of the linear relationship between two variables. It measures how two variables change together, and whether they have a positive or negative relationship. Specifically, covariance measures the extent to which changes in one variable are associated with changes in another variable.

__Importance__:<br>
Covariance is important in statistical analysis because it helps to identify patterns and relationships between variables. Its sign shows whether two variables tend to move in the same direction (positive) or in opposite directions (negative). A covariance near zero indicates no linear relationship, although it does not by itself prove that the variables are independent.

The formula for calculating the covariance between two variables X and Y is:

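Cov(x,y) = (1/n) * Σ[(xi - x_mean)*(yi - y_mean)]

- where xi and yi are the individual values of the two variables,
- x_mean and y_mean are their respective means,
- n is the number of data points.

This is the population covariance. For a sample, the sum is divided by n - 1 instead of n, which is what pandas' `DataFrame.cov()` computes by default.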
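The formula can be checked numerically. The sketch below uses two short made-up arrays and compares the hand calculation with NumPy's `np.cov`, for both the population (divide by n) and sample (divide by n - 1) versions.

```python
import numpy as np

# Made-up values for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Deviations of each value from its mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Population covariance: divide the sum of products by n
population_cov = np.sum(x_dev * y_dev) / len(x)

# Sample covariance: divide by n - 1
sample_cov = np.sum(x_dev * y_dev) / (len(x) - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(x, y).
# ddof=0 gives the population version, the default gives the sample version.
print(population_cov, np.cov(x, y, ddof=0)[0, 1])
print(sample_cov, np.cov(x, y)[0, 1])
```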

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
import pandas as pd

df=pd.DataFrame({
    'Color':['red','green','blue'],
    'Size':['small','medium','large'],
    'Material':['wood','metal','plastic']
})

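from sklearn.preprocessing import LabelEncoder

encoder=LabelEncoder()

# Encode categorical variables.
# LabelEncoder expects a 1-D Series (df['Color']), not a one-column
# DataFrame (df[['Color']]), which triggers a DataConversionWarning.
df['Color']=encoder.fit_transform(df['Color'])
df['Size']=encoder.fit_transform(df['Size'])
df['Material']=encoder.fit_transform(df['Material'])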


In [2]:
df

|   | Color | Size | Material |
|---|---|---|---|
| 0 | 2 | 2 | 2 |
| 1 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 |


In this output, each categorical variable has been replaced with numerical labels. LabelEncoder sorts the categories alphabetically and numbers them from 0, so the "Color" labels 0, 1, and 2 correspond to the original values "blue", "green", and "red", respectively. Similarly, the "Size" labels 0, 1, and 2 correspond to "large", "medium", and "small", and the "Material" labels 0, 1, and 2 correspond to "metal", "plastic", and "wood". Note that the alphabetical codes for "Size" do not reflect its natural order (small < medium < large).
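Reusing a single encoder for every column means that only the last fitted mapping is retained. One possible variation, sketched below, keeps one `LabelEncoder` per column so the mappings can be inspected and reversed; the `encoders`, `encoded`, and `mappings` names are chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue"],
    "Size": ["small", "medium", "large"],
    "Material": ["wood", "metal", "plastic"],
})

# Keep a separate fitted encoder for each column
encoders = {}
encoded = df.copy()
for column in df.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(df[column])

# classes_ lists the original labels in code order (index = integer code)
mappings = {
    column: {str(label): code for code, label in enumerate(enc.classes_)}
    for column, enc in encoders.items()
}
print(mappings["Color"])

# inverse_transform recovers the original labels from the codes
original_sizes = encoders["Size"].inverse_transform(encoded["Size"])
print(list(original_sizes))
```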

### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [20]:
import pandas as pd

data = pd.DataFrame({
    'age': [30, 40, 50, 35, 45, 55],
    'income': [50000, 60000, 70000, 45000, 55000, 65000],
    'education': [12, 16, 20, 14, 18, 22]
})

# Sample covariance matrix (pandas divides by n - 1)
matrix=data.cov()

In [21]:
matrix

|   | age | income | education |
|---|---|---|---|
| age | 87.5 | 72500.0 | 35.0 |
| income | 72500.0 | 87500000.0 | 29000.0 |
| education | 35.0 | 29000.0 | 14.0 |


This matrix shows the covariances between each pair of variables. The diagonal elements of the matrix represent the variances of each variable, and the off-diagonal elements represent the covariances between pairs of variables.

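For example, the covariance between Age and Income is 72500.0, which indicates a positive relationship: as Age increases, Income tends to increase as well. Similarly, the covariance between Age and Education level is 35.0, which is also positive. The two magnitudes cannot be compared directly, because covariance depends on the units of the variables (Income is measured in much larger numbers than Education). Dividing each covariance by the product of the two standard deviations gives the correlation: about 0.83 for Age and Income, and 1.0 for Age and Education, since in this sample Education is exactly 0.4 times Age.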
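Because covariance depends on the units of each variable, the strength of the relationships is easier to compare on the correlation scale. A short sketch using the same data and pandas' `DataFrame.corr()`:

```python
import pandas as pd

data = pd.DataFrame({
    "age": [30, 40, 50, 35, 45, 55],
    "income": [50000, 60000, 70000, 45000, 55000, 65000],
    "education": [12, 16, 20, 14, 18, 22],
})

# Pearson correlation: covariance divided by the product of standard deviations
corr_matrix = data.corr()
print(corr_matrix.round(4))

# Equivalent manual calculation for one pair
cov_age_income = data.cov().loc["age", "income"]
manual_corr = cov_age_income / (data["age"].std() * data["income"].std())
print(round(manual_corr, 4))
```

Every entry of the correlation matrix lies between -1 and 1, so pairs measured in different units can be compared directly.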

### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

__"Gender": Binary Encoding__<br>
Binary encoding would be a suitable choice for "Gender" since there are only two categories (Male/Female). A single column is enough: one category is mapped to 0 and the other to 1 (for example, Female = 0 and Male = 1). A second column would be redundant, because it would always be the complement of the first.

__"Education Level": Ordinal Encoding__<br>
Ordinal encoding would be a suitable choice for "Education Level" since there is a natural order to the categories (High School < Bachelor's < Master's < PhD). Ordinal encoding will assign a unique integer value to each category based on their order. 

__"Employment Status": One-Hot Encoding__<br>
One-hot encoding would be a suitable choice for "Employment Status" when the categories (Unemployed/Part-Time/Full-Time) are treated as nominal, that is, when we do not want the model to assume an order among them. One-hot encoding creates one column per category, with values of 0 or 1 indicating the absence or presence of that category. This would result in three columns, one for each category.
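The three choices can be sketched together on a small made-up DataFrame. The rows, the `Gender_Male` column name, and the `encoded` frame are assumptions for illustration, not part of the original dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "High School", "PhD", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

encoded = pd.DataFrame(index=df.index)

# Gender: one 0/1 column is enough for two categories
encoded["Gender_Male"] = (df["Gender"] == "Male").astype(int)

# Education Level: ordinal codes in an explicitly stated order
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
encoded["Education Level"] = (
    ordinal_encoder.fit_transform(df[["Education Level"]]).ravel().astype(int)
)

# Employment Status: one 0/1 column per category
employment_dummies = pd.get_dummies(
    df["Employment Status"], prefix="Employment", dtype=int
)
encoded = pd.concat([encoded, employment_dummies], axis=1)

print(encoded)
```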

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/ East/West). Calculate the covariance between each pair of variables and interpret the results.

In [17]:
import pandas as pd

# Create a sample dataset with two continuous variables and two categorical variables
data = pd.DataFrame({
    'Temperature': [20, 22, 25, 18, 21, 23, 24, 19, 20, 22],
    'Humidity': [45, 50, 55, 60, 65, 70, 75, 80, 85, 90],
    'Weather Condition': ["Sunny", "Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy", "Rainy"],
    'Wind Direction': ["North", "South", "East", "West", "North", "South", "East", "West", "North", "South"]
})

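# Calculate the covariance matrix for the numeric columns only;
# covariance is undefined for the string-valued categorical columns,
# and recent pandas versions raise an error if they are included
cov_matrix = data[['Temperature', 'Humidity']].cov()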

# Print the covariance matrix
cov_matrix

|   | Temperature | Humidity |
|---|---|---|
| Temperature | 4.933333 | -1.666667 |
| Humidity | -1.666667 | 229.166667 |


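In this example, the covariance between temperature and humidity is -1.666667, which indicates a negative relationship between the two variables: as temperature increases, humidity tends to decrease, and vice versa. However, the magnitude of the covariance is small compared to the variances of the individual variables, which are 4.933333 (temperature) and 229.166667 (humidity); the corresponding correlation is about -0.05. Therefore, we can conclude that the linear relationship between temperature and humidity is very weak. The matrix contains only the two continuous variables because covariance is defined for numeric data; "Weather Condition" and "Wind Direction" are nominal categories, so they have no covariance with other variables unless they are first encoded numerically.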
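As a hedged sketch of how the mixed column types might be handled, the code below selects the numeric columns with `select_dtypes` before computing the covariance, and summarizes the continuous variables per category with a group mean as one simple way to look at the categorical columns. The group-mean step is an addition for illustration, not a covariance.

```python
import pandas as pd

data = pd.DataFrame({
    "Temperature": [20, 22, 25, 18, 21, 23, 24, 19, 20, 22],
    "Humidity": [45, 50, 55, 60, 65, 70, 75, 80, 85, 90],
    "Weather Condition": ["Sunny", "Sunny", "Cloudy", "Rainy", "Sunny",
                          "Cloudy", "Rainy", "Sunny", "Cloudy", "Rainy"],
    "Wind Direction": ["North", "South", "East", "West", "North",
                       "South", "East", "West", "North", "South"],
})

# Keep only the numeric columns; the string columns are dropped
numeric = data.select_dtypes(include="number")
cov_matrix = numeric.cov()

# Unit-free strength of the linear relationship
corr = numeric.corr().loc["Temperature", "Humidity"]
print(round(corr, 4))

# For the categorical columns, compare group means instead of covariances
mean_by_weather = data.groupby("Weather Condition")[["Temperature", "Humidity"]].mean()
print(mean_by_weather)
```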