**Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.**

Ordinal Encoding and Label Encoding are both techniques used to convert categorical data into numerical form, but they differ in their application and suitability for different types of categorical variables.

#### Ordinal Encoding:

- Ordinal Encoding is used when the categorical variable has an inherent order or ranking among its categories.
- It assigns numerical values to the categories based on their order, preserving the ordinal relationship.
- Typically, the categories are mapped to integer values starting from 0 to N-1, where N is the number of unique categories.
- Ordinal Encoding is suitable for ordinal data, where the categories have a meaningful ranking.
- Example: If we have a categorical variable representing education level with categories "High School," "Bachelor's," "Master's," and "Ph.D.," we can use ordinal encoding to map them to 0, 1, 2, and 3, respectively.

#### Label Encoding:

- Label Encoding is used when the categorical variable is nominal, meaning there is no inherent order or ranking among the categories.
- It assigns a unique integer label to each category, creating a nominal-to-numeric mapping.
- The integers are arbitrary identifiers and do not represent any ordinal relationship among the categories, although some models (for example, linear models) may still read them as ordered.
- It is suitable for nominal data, where the categories do not have any meaningful ranking.
- Example: If we have a categorical variable representing colors with categories "Red," "Blue," and "Green," label encoding maps them to the integers 0, 1, and 2. scikit-learn's `LabelEncoder` assigns the integers in sorted order of the labels, giving Blue = 0, Green = 1, and Red = 2.

In short, when the categories carry a rank, such as education level or contract type, we use ordinal encoding; when they have no specific rank, such as colors, types of furniture, or shapes, we use label encoding.
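
As a minimal sketch of the difference in scikit-learn (the sample values are made up for illustration): `OrdinalEncoder` accepts an explicit category order, while `LabelEncoder` assigns integers in sorted order of the labels and is documented as intended for target labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, for illustration only
education = np.array([["Master's"], ["High School"], ["Ph.D."], ["Bachelor's"]])
colors = ["Red", "Blue", "Green", "Blue"]

# Ordinal encoding: the category order is supplied explicitly
ordinal = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's", "Ph.D."]])
education_codes = ordinal.fit_transform(education).ravel()

# Label encoding: integers follow the sorted order of the labels
label = LabelEncoder()
color_codes = label.fit_transform(colors)

print(education_codes)   # [2. 0. 3. 1.]
print(label.classes_)    # ['Blue' 'Green' 'Red']
print(color_codes)       # [2 0 1 0]
```

The ordinal codes follow the ranking we passed in, whereas the label codes only identify the categories.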

**Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.**

Target Guided Ordinal Encoding is a technique used to encode categorical variables by creating ordinal labels based on the target variable. It is particularly useful when the target variable is binary (e.g., 0 or 1) or ordinal (e.g., low, medium, high). The main idea behind Target Guided Ordinal Encoding is to encode the categorical variable in such a way that the encoding reflects the relationship between the categorical variable and the target variable, thereby capturing useful information for predictive modeling.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

1. **Calculate Mean/Median/Other Aggregate of the Target Variable for Each Category**: For each unique category in the categorical variable, calculate the mean, median, or another appropriate aggregate of the target variable. This step creates a mapping from each category to the corresponding aggregate of the target variable.
2. **Order the Categories Based on the Aggregate Values**: Order the categories by the calculated mean/median/aggregate value. The idea is to assign higher values to categories that have a higher aggregate target value, and lower values to those with a lower one.
3. **Encode the Categories**: Assign ordinal labels to the categories based on their order. The category with the highest aggregate value gets the highest label, and the one with the lowest aggregate value gets the lowest label.

In [1]:
#For Example

import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

df.head()

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180


In [2]:
mean_values = df.groupby('city')['price'].mean().to_dict()
mean_values

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [3]:
df['city_encoded'] = df['city'].map(mean_values)

In [4]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


Here we had data on the city and the cost of living (price) for different people. We had to convert the categorical city column to numerical values, so Target Guided Ordinal Encoding is a reasonable choice: each city is replaced by the mean price observed for it.
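
The example above maps each city directly to its mean price. Steps 2 and 3 of the procedure, ordering the categories and assigning integer ranks, could be sketched as follows. The names `ordered_cities`, `rank_map`, and `city_rank` are illustrative and not part of the original example.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Step 1: mean of the target for each category
mean_values = df.groupby('city')['price'].mean()

# Step 2: order the categories from lowest to highest mean
ordered_cities = mean_values.sort_values().index

# Step 3: assign ordinal labels 0..N-1 following that order
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}
df['city_rank'] = df['city'].map(rank_map)

print(rank_map)
# {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}
```

In a real project the mapping would normally be computed on the training split only and then applied to validation and test data, so that target information does not leak into evaluation.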

**Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?**

Covariance is a statistical measure that quantifies the degree to which two variables change together. It indicates the direction of the relationship between two variables and whether they tend to increase or decrease simultaneously. A positive covariance suggests that the variables tend to increase or decrease together, while a negative covariance suggests that as one variable increases, the other decreases.

Importance of Covariance in Statistical Analysis:

1. **Relationship Assessment**: Covariance helps in understanding the relationship between two variables. If the covariance is positive, it indicates a positive correlation, suggesting that the variables move in the same direction. If it is negative, it indicates a negative correlation, suggesting that the variables move in opposite directions.

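2. **Data Understanding**: Covariance provides insight into the direction of association between variables, which is crucial for understanding the underlying patterns in the data. Its magnitude depends on the units of the variables, so the strength of an association is usually judged with the correlation coefficient instead.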

3. **Portfolio Diversification**: In finance, covariance is used to analyze the risk and return of portfolios. Covariance between the returns of different assets helps investors diversify their portfolio to manage risk effectively.

4. **Modeling**: Covariance is a fundamental component in various statistical models and machine learning algorithms, such as linear regression, principal component analysis (PCA), and factor analysis.

![Covar.png](attachment:Covar.png)
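
As a minimal sketch of the calculation, assuming the usual sample covariance $\operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ (the normalization that `DataFrame.cov` and `np.cov` use by default), the value can be computed by hand and checked against NumPy. The numbers are made up for illustration.

```python
import numpy as np

# Made-up sample values, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sum of products of deviations from the means, divided by n - 1
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values should agree, and the positive sign reflects that `x` and `y` tend to rise together in this sample.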

In [5]:
import seaborn as sns
df = sns.load_dataset("penguins")
df.head(3)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female


In [6]:
# numeric_only=True restricts the calculation to the numeric columns
df.cov(numeric_only=True)


Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,29.807054,-2.534234,50.375765,2605.591912
bill_depth_mm,-2.534234,3.899808,-16.21295,-747.370093
flipper_length_mm,50.375765,-16.21295,197.731792,9824.416062
body_mass_g,2605.591912,-747.370093,9824.416062,643131.077327


In [7]:
# numeric_only=True restricts the calculation to the numeric columns
df.corr(method='pearson', numeric_only=True)


Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.235053,0.656181,0.59511
bill_depth_mm,-0.235053,1.0,-0.583851,-0.471916
flipper_length_mm,0.656181,-0.583851,1.0,0.871202
body_mass_g,0.59511,-0.471916,0.871202,1.0


In [8]:
# numeric_only=True restricts the calculation to the numeric columns
df.corr(method='spearman', numeric_only=True)


Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
bill_length_mm,1.0,-0.221749,0.672772,0.5838
bill_depth_mm,-0.221749,1.0,-0.523267,-0.432372
flipper_length_mm,0.672772,-0.523267,1.0,0.839974
body_mass_g,0.5838,-0.432372,0.839974,1.0


**Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.**

In [9]:
import pandas as pd

#For example creating the above Dataset
df = pd.DataFrame({"Color" : ['red','green','blue','red','green'],
            "Size" : ['small','large','medium','medium','medium'],
            "Material" : ['wood','wood','plastic','plastic','metal']})

df.head()

Unnamed: 0,Color,Size,Material
0,red,small,wood
1,green,large,wood
2,blue,medium,plastic
3,red,medium,plastic
4,green,medium,metal


In [10]:
#For Color & Material Column we will use Label Encoding
from sklearn.preprocessing import LabelEncoder

#creating a separate encoder object for each column so that one fitted mapping does not overwrite the other
encode= LabelEncoder()
encodee = LabelEncoder()
df['color_encoded'] = encode.fit_transform(df['Color'])
df['material_encoded']= encodee.fit_transform(df['Material'])
df

Unnamed: 0,Color,Size,Material,color_encoded,material_encoded
0,red,small,wood,2,2
1,green,large,wood,1,2
2,blue,medium,plastic,0,1
3,red,medium,plastic,2,1
4,green,medium,metal,1,0


In [11]:
encode.transform(['red'])

array([2])

In [12]:
encodee.transform(['plastic'])

array([1])

In [13]:
#For Size Column we will use Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

#passing the category order explicitly: small < medium < large
encode_obj = OrdinalEncoder(categories=[['small','medium','large']])
df['size_encoded'] = encode_obj.fit_transform(df[['Size']]).ravel()
df

Unnamed: 0,Color,Size,Material,color_encoded,material_encoded,size_encoded
0,red,small,wood,2,2,0.0
1,green,large,wood,1,2,2.0
2,blue,medium,plastic,0,1,1.0
3,red,medium,plastic,2,1,1.0
4,green,medium,metal,1,0,1.0


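Here we have used Label Encoding for the Color and Material columns and Ordinal Encoding for the Size column. `LabelEncoder` assigns integers in sorted order of the labels, so blue = 0, green = 1, red = 2 and metal = 0, plastic = 1, wood = 2. The ordinal encoder follows the order we passed in, so small = 0, medium = 1, large = 2. If additional data such as price had been given, we could have used target-guided encoding for the Material column.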

**Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.**

In [14]:
import pandas as pd

# Sample dataset as a dictionary
df = pd.DataFrame({
    'Age': [30, 40, 22, 35, 21],
    'Income': [50000, 60000, 39000, 55000, 35000],
    'Education Level': ['Bachelor', 'Master', 'High School', 'Bachelor', 'High School']
    })

df

Unnamed: 0,Age,Income,Education Level
0,30,50000,Bachelor
1,40,60000,Master
2,22,39000,High School
3,35,55000,Bachelor
4,21,35000,High School


In [15]:
# numeric_only=True skips the non-numeric Education Level column
df.cov(numeric_only=True)


Unnamed: 0,Age,Income
Age,67.3,85900.0
Income,85900.0,111700000.0


In [16]:
#So converting the Education Level using ordinal encoding

from sklearn.preprocessing import OrdinalEncoder

encoding_obj = OrdinalEncoder(categories=[['High School','Bachelor','Master']])
df['education_encoded'] = encoding_obj.fit_transform(df[['Education Level']])

df

Unnamed: 0,Age,Income,Education Level,education_encoded
0,30,50000,Bachelor,1.0
1,40,60000,Master,2.0
2,22,39000,High School,0.0
3,35,55000,Bachelor,1.0
4,21,35000,High School,0.0


In [17]:
df.drop(columns='Education Level',inplace = True)
df

Unnamed: 0,Age,Income,education_encoded
0,30,50000,1.0
1,40,60000,2.0
2,22,39000,0.0
3,35,55000,1.0
4,21,35000,0.0


In [18]:
#Covariance Matrix
df.cov()

Unnamed: 0,Age,Income,education_encoded
Age,67.3,85900.0,6.65
Income,85900.0,111700000.0,8450.0
education_encoded,6.65,8450.0,0.7


Each row and column in the matrix corresponds to one of these variables, and the elements represent the covariance between pairs of variables.

The covariance matrix is symmetric, as the covariance between X and Y is the same as the covariance between Y and X. 

Also, the diagonal elements represent the variance of each variable, which is the covariance of a variable with itself.

Interpretation of the covariance matrix:
- The variance of Age is approximately 67.30.
- The variance of Income is approximately 111,700,000.0.
- The variance of education_encoded is approximately 0.70.


Additionally, the off-diagonal elements represent the covariance between pairs of variables:

- The covariance between Age and Income is approximately 85,900.0. It indicates a positive relationship, suggesting that as Age increases, Income tends to increase as well (they move together).
- The covariance between Age and education_encoded is approximately 6.65. It indicates a positive relationship, suggesting that as Age increases, education_encoded tends to increase as well (they move together).
- The covariance between Income and education_encoded is approximately 8,450.0. It indicates a positive relationship, suggesting that higher Income is associated with higher education_encoded (they move together).
- The covariance matrix provides insights into the relationships between the variables and the variability of each variable.
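
Because covariance depends on the units of each variable, the magnitudes above (for example 85,900 versus 6.65) are not directly comparable. As a rough sketch, dividing each covariance by the product of the two standard deviations gives the Pearson correlation, which is unit-free. The frame below repeats the sample used in this answer with the education level already encoded; the names `data`, `std`, and `corr_manual` are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample as above, with Education Level encoded as 0/1/2
data = pd.DataFrame({
    'Age': [30, 40, 22, 35, 21],
    'Income': [50000, 60000, 39000, 55000, 35000],
    'education_encoded': [1.0, 2.0, 0.0, 1.0, 0.0]
})

cov_matrix = data.cov()
std = np.sqrt(np.diag(cov_matrix))   # standard deviation of each variable

# corr(X, Y) = cov(X, Y) / (std_X * std_Y)
corr_manual = cov_matrix / np.outer(std, std)

print(corr_manual.round(3))
```

The result should match `data.corr()`, with every entry between -1 and 1.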

**Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?**



**Gender (Binary Categorical Variable: Male/Female)**:
Since "Gender" is a binary categorical variable with only two possible values (Male and Female), the preferred encoding method is Label Encoding. In label encoding, we can assign 0 to one category (e.g., Male) and 1 to the other category (e.g., Female). Label encoding is suitable for binary categorical variables as it allows us to represent the categories as numerical values, which is useful for various machine learning algorithms.

**Education Level (Ordinal Categorical Variable: High School/Bachelor's/Master's/PhD)**:
"Education Level" is an ordinal categorical variable with a clear ordering of categories. In this case, the recommended encoding method is Ordinal Encoding. Ordinal encoding assigns a unique integer value to each category based on its order. For example, we can encode High School as 0, Bachelor's as 1, Master's as 2, and PhD as 3. By using ordinal encoding, we preserve the ordinal relationship between the categories, which is important when certain categories have a natural order.

**Employment Status (Nominal Categorical Variable: Unemployed/Part-Time/Full-Time)**:
"Employment Status" is a nominal categorical variable with no inherent order among its categories. For nominal categorical variables, the preferred encoding method is One-Hot Encoding. One-hot encoding creates binary columns for each category, where a value of 1 represents the presence of the category, and 0 represents its absence. For example, we can create three columns (Unemployed, Part-Time, Full-Time) where the corresponding category is encoded as 1 and the others as 0. One-hot encoding is useful for preventing any ordinality between categories, as they are treated as distinct and independent.


In summary:

- Label Encoding for Gender, as it is a binary categorical variable
- Ordinal Encoding for Education Level, as it is an ordinal categorical variable
- One-Hot Encoding for Employment Status, as it is treated here as a nominal categorical variable
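
The three choices could be sketched together as follows. The rows are hypothetical, and `pd.get_dummies` is used for the one-hot step as one convenient option among several.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical rows, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ["Bachelor's", 'PhD', 'High School', "Master's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time']
})

# Gender: label encoding (sorted order, so Female = 0 and Male = 1)
df['gender_encoded'] = LabelEncoder().fit_transform(df['Gender'])

# Education Level: ordinal encoding with an explicit order
education_order = [['High School', "Bachelor's", "Master's", 'PhD']]
ordinal = OrdinalEncoder(categories=education_order)
df['education_encoded'] = ordinal.fit_transform(df[['Education Level']]).ravel()

# Employment Status: one-hot encoding, one indicator column per category
dummies = pd.get_dummies(df['Employment Status'], prefix='employment', dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df.drop(columns=['Gender', 'Education Level', 'Employment Status']))
```

Which gender value receives 0 or 1 is arbitrary; here it simply follows the sorted order that `LabelEncoder` uses.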

**Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/ East/West). Calculate the covariance between each pair of variables and interpret the results.**

In [19]:
import pandas as pd


df = pd.DataFrame({
    'Temperature': [25, 20, 30, 22, 28],
    'Humidity': [60, 70, 55, 75, 65],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
})

df

Unnamed: 0,Temperature,Humidity,Weather Condition,Wind Direction
0,25,60,Sunny,North
1,20,70,Cloudy,South
2,30,55,Rainy,East
3,22,75,Sunny,West
4,28,65,Cloudy,North


In [20]:
#Converting Categorical Variables to Numerical as covariance only works on numerical data
#Using OHE for Weather Condition & Wind Direction as they are nominal variables with no inherent order

from sklearn.preprocessing import OneHotEncoder

encodeee= OneHotEncoder()
values = encodeee.fit_transform(df[['Weather Condition','Wind Direction']]).toarray()

In [21]:
encoded_df = pd.DataFrame(values,columns=encodeee.get_feature_names_out())
df = pd.concat([df,encoded_df],axis =1)
df.drop(columns=['Weather Condition','Wind Direction'],inplace = True)
df

Unnamed: 0,Temperature,Humidity,Weather Condition_Cloudy,Weather Condition_Rainy,Weather Condition_Sunny,Wind Direction_East,Wind Direction_North,Wind Direction_South,Wind Direction_West
0,25,60,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,20,70,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,30,55,0.0,1.0,0.0,1.0,0.0,0.0,0.0
3,22,75,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,28,65,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [22]:
print("Covariance Matrix : ")
df.cov()

Covariance Matrix : 


Unnamed: 0,Temperature,Humidity,Weather Condition_Cloudy,Weather Condition_Rainy,Weather Condition_Sunny,Wind Direction_East,Wind Direction_North,Wind Direction_South,Wind Direction_West
Temperature,17.0,-26.25,-0.5,1.25,-0.75,1.25,0.75,-1.25,-0.75
Humidity,-26.25,62.5,1.25,-2.5,1.25,-2.5,-1.25,1.25,2.5
Weather Condition_Cloudy,-0.5,1.25,0.3,-0.1,-0.2,-0.1,0.05,0.15,-0.1
Weather Condition_Rainy,1.25,-2.5,-0.1,0.2,-0.1,0.2,-0.1,-0.05,-0.05
Weather Condition_Sunny,-0.75,1.25,-0.2,-0.1,0.3,-0.1,0.05,-0.1,0.15
Wind Direction_East,1.25,-2.5,-0.1,0.2,-0.1,0.2,-0.1,-0.05,-0.05
Wind Direction_North,0.75,-1.25,0.05,-0.1,0.05,-0.1,0.3,-0.1,-0.1
Wind Direction_South,-1.25,1.25,0.15,-0.05,-0.1,-0.05,-0.1,0.2,-0.05
Wind Direction_West,-0.75,2.5,-0.1,-0.05,0.15,-0.05,-0.1,-0.05,0.2
