**Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.**


*Ordinal Encoding -* Encodes categorical data that has an intrinsic order or ranking. Each category is assigned an integer according to its position in that order.<br>
*Label Encoding -* Assigns a unique integer to each category without regard to any ranking. The integers are usually assigned in alphabetical order (as scikit-learn's `LabelEncoder` does) or by category frequency.

For example:
Ordinal Encoding is used when the categories have a clear order or ranking, such as Education level (High School < Undergraduate < Graduate < Post Graduate < PhD).
Label Encoding is used when there is no ranking and the categories only need to be converted to numbers. With alphabetical ordering, blue is encoded as 0, green as 1, and red as 2.
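
The contrast can be sketched with scikit-learn. The sample values and the variable names (`people`, `education_codes`, `color_codes`) are illustrative assumptions, not data from the questions. `OrdinalEncoder` is given the ranking explicitly, while `LabelEncoder` sorts the categories alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal encoding: the ranking is passed explicitly, lowest to highest.
education_order = ["High School", "Undergraduate", "Graduate", "Post Graduate", "PhD"]
people = pd.DataFrame({"Education": ["Graduate", "High School", "PhD"]})  # illustrative sample
ordinal = OrdinalEncoder(categories=[education_order])
education_codes = ordinal.fit_transform(people[["Education"]]).ravel()
print(education_codes)  # [2. 0. 4.]

# Label encoding: codes follow the sorted (alphabetical) order of the labels.
colors = ["red", "green", "blue"]
label = LabelEncoder()
color_codes = label.fit_transform(colors)
print(color_codes)  # [2 1 0]
```

Here the education codes carry meaning (a larger code is a higher level), whereas the color codes are only identifiers.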

**Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.**

Target Guided Ordinal Encoding encodes a categorical variable based on its relationship with the target variable. For each category we compute the mean (or median) of the target, rank the categories by that value, and replace each category with its rank. This gives the encoded feature a monotonic relationship with the target, which can improve the predictive power of a model. It is most useful for a nominal feature with many categories, where one-hot encoding would create too many columns: for example, encoding a city or neighbourhood column by average sale price when predicting house prices.
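
A minimal sketch of the idea with pandas, assuming a hypothetical `city` feature and a numeric `price` target (both invented for illustration):

```python
import pandas as pd

# Hypothetical data: city is the categorical feature, price the numeric target.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Pune"],
    "price": [200, 500, 300, 220, 480, 310],
})

# 1. Mean of the target for each category.
category_means = df.groupby("city")["price"].mean()

# 2. Rank the categories by that mean (0 = lowest mean target).
ordered = category_means.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ordered)}

# 3. Replace each category with its rank.
df["city_encoded"] = df["city"].map(rank_map)
print(rank_map)
```

In a real project the mapping should be computed on the training split only and then applied to validation and test data, otherwise information about the target leaks into the features.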

**Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?**


Covariance is a statistical measure of the linear relationship between two random variables. It tells you how much the two variables change together. A positive covariance indicates that the variables tend to move in the same direction, while a negative covariance indicates that they tend to move in opposite directions.

Covariance is an important concept in statistical analysis because it can be used to:

- Identify relationships between variables
- Predict the value of one variable based on the value of another variable
- Build statistical models that explain the relationships between variables
- Assess the risk of investing in a portfolio of assets

Covariance is calculated using the following formula:

> Cov(X, Y) = E[(X - E[X]) (Y - E[Y])]

where:

- X and Y are the two random variables
- E[X] and E[Y] are the expected values of X and Y
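
On a finite sample the expectations are replaced by sample means, and the sum of products is usually divided by n - 1 (the default in both `numpy.cov` and pandas' `DataFrame.cov`). A small sketch with made-up values for `x` and `y`:

```python
import numpy as np

# Illustrative values, not taken from the datasets in this notebook.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of the products of deviations, divided by n - 1.
dx = x - x.mean()
dy = y - y.mean()
cov_manual = (dx * dy).sum() / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]
print(cov_manual, cov_numpy)
```

The two values agree, which confirms that the library call implements the same formula with the n - 1 denominator.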

**Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.**

In [2]:
import pandas as pd
df = pd.DataFrame({
    'Color':['red','green','blue'],
    'Size':['small','medium','large'],
    'Material':['wood','metal','plastic']
})
df

| | Color | Size | Material |
|---|---|---|---|
| 0 | red | small | wood |
| 1 | green | medium | metal |
| 2 | blue | large | plastic |


In [3]:
from sklearn.preprocessing import LabelEncoder

# Fit a fresh LabelEncoder on each column. fit_transform sorts the column's
# categories alphabetically and returns their integer codes (starting at 0).
for col in df.columns:
    df[col] = LabelEncoder().fit_transform(df[col])

df

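| | Color | Size | Material |
|---|---|---|---|
| 0 | 2 | 2 | 2 |
| 1 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 |

`LabelEncoder` sorts each column's categories alphabetically and numbers them from 0. Color becomes blue = 0, green = 1, red = 2; Size becomes large = 0, medium = 1, small = 2; Material becomes metal = 0, plastic = 1, wood = 2. The codes are only identifiers: for Size, the alphabetical order reverses the natural small < medium < large ranking, so the numbers should not be read as magnitudes.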
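
To make the mapping explicit rather than inferring it from the output, one option is to keep a fitted encoder per column and read its `classes_` attribute. This is a sketch; the `encoders` and `mappings` dictionaries are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic'],
})

encoders = {}  # one fitted encoder per column, so each mapping can be recovered
mappings = {}
for col in df.columns:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc
    # classes_ holds the sorted categories; a category's position is its code
    mappings[col] = {str(label): code for code, label in enumerate(enc.classes_)}

print(mappings)
```

Keeping the encoders also allows decoding later, for example `encoders['Color'].inverse_transform([2])` returns the original label for code 2.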


**Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.**


In [48]:
df2 = pd.DataFrame({
    'Age':[23,15,41,25,33,14,12],
    'Income':[87000,7700,98000,18000,48000,1900,500],
    'Education level':['Graduate','High School','Post Graduate','Graduate','Post Graduate','High School','High School']
})
df2

| | Age | Income | Education level |
|---|---|---|---|
| 0 | 23 | 87000 | Graduate |
| 1 | 15 | 7700 | High School |
| 2 | 41 | 98000 | Post Graduate |
| 3 | 25 | 18000 | Graduate |
| 4 | 33 | 48000 | Post Graduate |
| 5 | 14 | 1900 | High School |
| 6 | 12 | 500 | High School |


In [49]:
from sklearn.preprocessing import OrdinalEncoder

# Covariance needs numeric columns, so encode Education level using its natural order:
# High School = 0, Graduate = 1, Post Graduate = 2.
or_encoder = OrdinalEncoder(categories=[['High School','Graduate','Post Graduate']])

# fit_transform returns a 2-D float array; ravel() flattens it into a single column.
df2['Education level'] = or_encoder.fit_transform(df2[['Education level']]).ravel()
df2

| | Age | Income | Education level |
|---|---|---|---|
| 0 | 23 | 87000 | 1.0 |
| 1 | 15 | 7700 | 0.0 |
| 2 | 41 | 98000 | 2.0 |
| 3 | 25 | 18000 | 1.0 |
| 4 | 33 | 48000 | 2.0 |
| 5 | 14 | 1900 | 0.0 |
| 6 | 12 | 500 | 0.0 |


In [50]:
df2.cov()

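| | Age | Income | Education level |
|---|---|---|---|
| Age | 115.571429 | 353533.3 | 9.380952 |
| Income | 353533.333333 | 1687520000.0 | 28866.666667 |
| Education level | 9.380952 | 28866.67 | 0.809524 |

The diagonal entries are the variances of each variable (Age 115.57, Income about 1.69 × 10^9, Education level 0.81). All off-diagonal entries are positive, so in this sample age, income, and education level tend to rise together. The magnitudes cannot be compared across pairs, because covariance carries the units of both variables: the Age-Income value (353533.3) is large mainly because Income is measured on a much larger scale, not because that relationship is stronger than the Age-Education one (9.38). With only seven rows, these values are rough indications rather than firm conclusions.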
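
To compare the strength of these relationships on a common scale, the covariances can be normalised into correlations. A minimal sketch that re-creates the encoded data from the cell above (the names `cov`, `corr`, and `age_income_corr` are introduced here for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({
    'Age': [23, 15, 41, 25, 33, 14, 12],
    'Income': [87000, 7700, 98000, 18000, 48000, 1900, 500],
    # Already ordinal-encoded: High School = 0, Graduate = 1, Post Graduate = 2
    'Education level': [1.0, 0.0, 2.0, 1.0, 2.0, 0.0, 0.0],
})

cov = df2.cov()    # sample covariance matrix (n - 1 denominator)
corr = df2.corr()  # Pearson correlation by default

# Correlation is covariance divided by the product of the standard deviations.
std = df2.std()
age_income_corr = cov.loc['Age', 'Income'] / (std['Age'] * std['Income'])

print(corr.round(3))
```

Each correlation lies between -1 and 1 regardless of units, so the three pairs can be compared directly.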


**Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?**

**Gender**: One-hot encoding. Gender is a nominal variable, meaning that the categories have no inherent order. One-hot encoding will create two new binary variables, one for each gender. This will allow the machine learning model to learn the relationship between gender and the target variable without assuming that there is an order to the categories.

**Education Level**: Ordinal encoding. Education level is an ordinal variable, meaning that the categories have a natural order. Ordinal encoding will assign each category a numerical value in accordance with its order. This will allow the machine learning model to learn the relationship between education level and the target variable while taking into account the order of the categories.

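**Employment Status**: One-hot encoding. Employment status is treated here as a nominal variable: although the categories could be ranked by hours worked, the steps between them are not clearly meaningful for most targets, so one-hot encoding avoids imposing an order.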

In short, one-hot encoding suits Gender and Employment Status because it represents the categories without implying an order, and it is the most common encoding for nominal variables. Ordinal encoding suits Education Level because it preserves the ranking High School < Bachelor's < Master's < PhD.

**Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.**

In [5]:
df1 = pd.DataFrame({
    'Temperature':[23,15,41,25,33,14,12],
    'Humidity':[87,77,98,58,68,89,85],
    'Weather Condition':['Rainy','Foggy','Cloudy','Foggy','Windy','Windy','Foggy'],
    'Wind Direction':['North','South','East','West','South','East','North']
})
df1

Unnamed: 0,Temperature,Humidity,Weather Condition,Wind Direction
0,23,87,Rainy,North
1,15,77,Foggy,South
2,41,98,Cloudy,East
3,25,58,Foggy,West
4,33,68,Windy,South
5,14,89,Windy,East
6,12,85,Foggy,North


In [6]:
# One hot encoding
from sklearn.preprocessing import OneHotEncoder
encoder1 = OneHotEncoder()
w_encoded = encoder1.fit_transform(df1[['Weather Condition']]).toarray() 
weather_encoded = pd.DataFrame(w_encoded,columns=encoder1.get_feature_names_out())

In [7]:
wi_encoded = encoder1.fit_transform(df1[['Wind Direction']]).toarray()
wind_encoded = pd.DataFrame(wi_encoded,columns=encoder1.get_feature_names_out())

In [8]:
df1 = pd.concat([df1,wind_encoded,weather_encoded],axis=1)

In [9]:
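# drop() returns a new DataFrame, so assign it back; otherwise the original
# string columns are still present when cov() is called.
df1 = df1.drop(['Weather Condition','Wind Direction'],axis=1)
df1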

| | Temperature | Humidity | Wind Direction_East | Wind Direction_North | Wind Direction_South | Wind Direction_West | Weather Condition_Cloudy | Weather Condition_Foggy | Weather Condition_Rainy | Weather Condition_Windy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 87 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 15 | 77 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 41 | 98 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 25 | 58 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 33 | 68 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 14 | 89 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 12 | 85 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [10]:
df1.cov()


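| | Temperature | Humidity | Wind Direction_East | Wind Direction_North | Wind Direction_South | Wind Direction_West | Weather Condition_Cloudy | Weather Condition_Foggy | Weather Condition_Rainy | Weather Condition_Windy |
|---|---|---|---|---|---|---|---|---|---|---|
| Temperature | 115.571429 | 7.904762 | 1.404762 | -1.928571 | 0.238095 | 0.285714 | 2.952381 | -2.97619 | -0.047619 | 0.071429 |
| Humidity | 7.904762 | 185.904762 | 4.404762 | 1.904762 | -2.595238 | -3.714286 | 2.952381 | -3.47619 | 1.119048 | -0.595238 |
| Wind Direction_East | 1.404762 | 4.404762 | 0.238095 | -0.095238 | -0.095238 | -0.047619 | 0.119048 | -0.142857 | -0.047619 | 0.071429 |
| Wind Direction_North | -1.928571 | 1.904762 | -0.095238 | 0.238095 | -0.095238 | -0.047619 | -0.047619 | 0.02381 | 0.119048 | -0.095238 |
| Wind Direction_South | 0.238095 | -2.595238 | -0.095238 | -0.095238 | 0.238095 | -0.047619 | -0.047619 | 0.02381 | -0.047619 | 0.071429 |
| Wind Direction_West | 0.285714 | -3.714286 | -0.047619 | -0.047619 | -0.047619 | 0.142857 | -0.02381 | 0.095238 | -0.02381 | -0.047619 |
| Weather Condition_Cloudy | 2.952381 | 2.952381 | 0.119048 | -0.047619 | -0.047619 | -0.02381 | 0.142857 | -0.071429 | -0.02381 | -0.047619 |
| Weather Condition_Foggy | -2.97619 | -3.47619 | -0.142857 | 0.02381 | 0.02381 | 0.095238 | -0.071429 | 0.285714 | -0.071429 | -0.142857 |
| Weather Condition_Rainy | -0.047619 | 1.119048 | -0.047619 | 0.119048 | -0.047619 | -0.02381 | -0.02381 | -0.071429 | 0.142857 | -0.047619 |
| Weather Condition_Windy | 0.071429 | -0.595238 | 0.071429 | -0.095238 | 0.071429 | -0.047619 | -0.047619 | -0.142857 | -0.047619 | 0.238095 |

The diagonal entries are variances (Temperature 115.57, Humidity 185.90). Temperature and Humidity have a covariance of 7.90, which is small relative to those variances (a correlation of about 0.05), so the two show almost no linear relationship in this sample.

For the one-hot columns, the sign shows whether a numeric variable is above or below its overall average when that category is present. Temperature has a positive covariance with Cloudy (2.95) and a negative one with Foggy (-2.98): the cloudy day was warmer than average and the foggy days were cooler. Humidity is above average with East and North winds (4.40, 1.90) and below average with South and West winds (-2.60, -3.71).

Covariances between two indicator columns of the same variable are negative, because only one of them can be 1 in a given row. With seven observations, none of these values supports a firm conclusion.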
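
The reading of the one-hot covariances can be checked numerically. For an indicator column, the sample covariance with a numeric variable equals n_k × (group mean - overall mean) / (n - 1), where n_k is the number of rows in that category. The sketch below verifies this for Temperature and Weather Condition; it uses `pd.get_dummies` instead of `OneHotEncoder` for brevity, and the variable names are introduced here for illustration.

```python
import pandas as pd

temperature = pd.Series([23, 15, 41, 25, 33, 14, 12], name='Temperature')
weather = pd.Series(['Rainy', 'Foggy', 'Cloudy', 'Foggy', 'Windy', 'Windy', 'Foggy'],
                    name='Weather Condition')

dummies = pd.get_dummies(weather, dtype=float)  # one 0/1 column per category
n = len(temperature)
overall_mean = temperature.mean()

results = {}
for category in dummies.columns:
    # Covariance as pandas computes it (n - 1 denominator).
    cov_pandas = temperature.cov(dummies[category])
    # Same quantity from the group-mean identity.
    in_group = weather == category
    n_k = in_group.sum()
    group_mean = temperature[in_group].mean()
    cov_identity = n_k * (group_mean - overall_mean) / (n - 1)
    results[category] = (cov_pandas, cov_identity)

for category, (cov_pandas, cov_identity) in results.items():
    print(f"{category}: cov = {cov_pandas:.6f}, identity = {cov_identity:.6f}")
```

Both columns of the printout match, which shows that each covariance is a scaled difference between a category's mean and the overall mean.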
