## Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.


#### <b><u>Difference Between Ordinal Encoding and Label Encoding</u></b>:

Label Encoding assigns a numerical value to each category arbitrarily. The numbers have no inherent meaning, but they are simple to work with. Ordinal Encoding is a slightly more advanced form of label encoding: the numbers are assigned according to an order or hierarchy among the categories.

#### <b><u>Example</u></b>:

If we have to assign labels to the colors in a feature of a dataset, the colors have no natural order, so we can assign numbers to them arbitrarily and Label Encoding can be used. However, if we are dealing with cuts of diamonds, we want the numbers to follow cut quality, so that worse cuts consistently receive lower (or consistently higher) numbers. In this case, we will use Ordinal Encoding.
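A minimal sketch of the two cases, assuming a small made-up DataFrame (the rows and the variable names `color_codes`, `cut_order`, and `cut_codes` are illustrative, not from a real dataset). Note that scikit-learn's `LabelEncoder` assigns codes in sorted label order, which is arbitrary with respect to meaning, while `OrdinalEncoder` accepts an explicit order through its `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample: 'color' has no natural order, 'cut' does.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "cut": ["Good", "Ideal", "Fair", "Premium"],
})

# Label encoding: codes follow the sorted order of the labels
# (blue=0, green=1, red=2), which carries no meaning of its own.
color_codes = LabelEncoder().fit_transform(df["color"])

# Ordinal encoding: we state the order explicitly, worst cut first.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_encoder = OrdinalEncoder(categories=[cut_order])
cut_codes = cut_encoder.fit_transform(df[["cut"]]).ravel()

print(color_codes)  # codes for red, green, blue, green
print(cut_codes)    # position of each cut in cut_order
```

With the explicit order, a larger code always means a better cut, which is the property the diamond example relies on.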

## Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project. 

#### <b><u>How Target Guided Ordinal Encoding Works</u></b>:

Target encoding replaces each unique categorical value with a number derived from the target, typically the mean of the target over the rows in that category.

Say we are given a data set for predicting a house’s price range from its features, where the price ranges are 1, 2, 3, and 4 and the features include basic housing attributes such as square footage, along with color (red, yellow, or blue). If red houses fall on average at 3.35, red houses sit slightly above a 3 but well below a 4. We then replace every occurrence of the value red with 3.35, as that is the mean target value. This takes label and ordinal encoding to the next level: we introduce meaningful numbers to take the place of colors, as opposed to arbitrary numbers. Also, if blue houses fall at 3.34, or even at 3.35 like red, we have no problem and can assign the number 3.34 or 3.35 to blue. This “double-labeling” (that’s what I have decided to call it) would be impossible with label or ordinal encoding, where each category gets its own distinct code.
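The house-color idea can be sketched on a tiny made-up table (the rows and the column names `color` and `price_range` are hypothetical). The sketch shows two variants: replacing each color with its mean target value, as described above, and ranking the colors by that mean to get integer ordinal codes. In a real project the means would normally be computed on the training split only, so that the target does not leak into the validation data.

```python
import pandas as pd

# Hypothetical data: house color and a price-range label from 1 to 4.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "yellow", "yellow"],
    "price_range": [3, 4, 3, 3, 1, 2],
})

# Mean target value per category (the "target guided" step).
mean_by_color = df.groupby("color")["price_range"].mean()

# Variant 1: replace each category with its mean target value.
df["color_mean_encoded"] = df["color"].map(mean_by_color)

# Variant 2: rank the categories by mean target and use the rank
# as an integer ordinal code (1 = lowest mean price range).
rank_by_color = mean_by_color.rank(method="dense").astype(int)
df["color_rank_encoded"] = df["color"].map(rank_by_color)

print(df)
```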




In [11]:
import seaborn as sns
import pandas as pd

df_tips=sns.load_dataset('tips')
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [24]:
df_tips['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [25]:
'''
In the code below, we replace the categorical 'time' feature with the average 'tip' value
of the corresponding time. So, in this case, our target is 'tip' and we are using Target Encoding.

The 'time' feature has two unique values, 'Dinner' and 'Lunch'. We take all the 'tip' values at 'Dinner'
and calculate their mean, do the same for 'Lunch', and then replace 'Dinner' and 'Lunch' with their
corresponding mean values.
'''

df_tips_encoded = df_tips.copy()

# Group by the unique values of the 'time' feature and calculate the mean 'tip' of each group.
# 'time' is a categorical column; observed=True keeps only the categories that actually occur.

mean_series = df_tips_encoded.groupby('time', observed=True)['tip'].mean()

df_tips_encoded['time'] = df_tips_encoded['time'].map(mean_series)

df_tips_encoded.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,3.10267,2
1,10.34,1.66,Male,No,Sun,3.10267,3
2,21.01,3.5,Male,No,Sun,3.10267,3
3,23.68,3.31,Male,No,Sun,3.10267,2
4,24.59,3.61,Female,No,Sun,3.10267,4


In [28]:
# Concatenating the original and encoded columns side by side, just for ease of comparison.

pd.concat([df_tips['time'], df_tips_encoded['time']], axis=1)

Unnamed: 0,time,time.1
0,Dinner,3.10267
1,Dinner,3.10267
2,Dinner,3.10267
3,Dinner,3.10267
4,Dinner,3.10267
...,...,...
239,Dinner,3.10267
240,Dinner,3.10267
241,Dinner,3.10267
242,Dinner,3.10267


## Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?


#### <b><u>Covariance</u></b>:

Covariance in statistics measures the extent to which two variables vary together linearly. It reveals whether two variables tend to move in the same or in opposite directions.


#### <b><u>Importance in Statistical Analysis</u></b>:


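Covariance tells us about the direction of the relationship between two random variables. When the covariance between two variables is positive, they tend to move in the same direction. Conversely, a negative covariance signifies that the variables tend to move in opposite directions, and a covariance near zero indicates no clear linear relationship. Its magnitude depends on the units of the variables, so it shows direction rather than strength; dividing by the two standard deviations gives the correlation coefficient, which is unit-free.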


#### <b><u>Covariance Formula</u></b>:

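$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$


Where:

    Xᵢ and Yᵢ represent the observed values of X and Y.
    X̄ and Ȳ denote their respective means.
    n is the number of observations.
    Dividing by n - 1 gives the sample covariance; the population covariance divides by n instead.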
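As a quick check of the formula, here is a small sketch that computes the sample covariance by hand and compares it with NumPy. The arrays `x` and `y` are made-up values chosen only for illustration.

```python
import numpy as np

# Made-up observations of two variables.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of the products of deviations from the means, divided by n - 1.
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; it also divides by n - 1 by default.
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

The two values agree, and the positive sign reflects that `y` tends to increase as `x` increases in this sample.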


## Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.


In [32]:
import pandas as pd

df_wood = pd.DataFrame(
    {'Color': ['red','green','blue'],
     'Size': ['small','medium','large'],
     'Material' : ['wood','metal','plastic']
    })

df_wood

Unnamed: 0,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic


In [39]:
from sklearn.preprocessing import LabelEncoder

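# LabelEncoder assigns integer codes following the sorted (alphabetical) order of the labels.
# Each fit_transform call refits the encoder, so it only remembers the mapping of the last column.
encoder = LabelEncoder()

df_wood_encoded = df_wood.copy()

df_wood_encoded['Color'] = encoder.fit_transform(df_wood_encoded['Color'])
df_wood_encoded['Size'] = encoder.fit_transform(df_wood_encoded['Size'])
df_wood_encoded['Material'] = encoder.fit_transform(df_wood_encoded['Material'])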

df_wood_encoded

Unnamed: 0,Color,Size,Material
0,2,2,2
1,1,1,0
2,0,0,1
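To explain the output: `LabelEncoder` sorts the labels of a column alphabetically and numbers them from 0. That is why `blue`, `green`, `red` become 0, 1, 2; `large`, `medium`, `small` become 0, 1, 2; and `metal`, `plastic`, `wood` become 0, 1, 2. In particular, the codes for Size do not follow the natural small < medium < large order. The sketch below makes the mapping visible by fitting one encoder per column and reading its `classes_` attribute; the `mappings` dictionary is just an illustrative helper.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_wood = pd.DataFrame({
    "Color": ["red", "green", "blue"],
    "Size": ["small", "medium", "large"],
    "Material": ["wood", "metal", "plastic"],
})

# Fit a separate encoder per column and record its label -> code mapping.
mappings = {}
for column in df_wood.columns:
    column_encoder = LabelEncoder().fit(df_wood[column])
    # classes_ holds the labels in sorted order; the position is the code.
    mappings[column] = {
        str(label): code for code, label in enumerate(column_encoder.classes_)
    }

for column, mapping in mappings.items():
    print(column, mapping)
```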


## Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.


<table>
    <tr>
        <th></th>
        <th>Age</th>
        <th>Income</th>
        <th>Educational Level</th>
    </tr>
    <tr>
        <th>Age</th>
        <td>VAR(Age)</td>
        <td>COV(Age, Income)</td>
        <td>COV(Age, Educational Level)</td>
    </tr>
    <tr>
        <th>Income</th>
        <td>COV(Income, Age)</td>
        <td>VAR(Income)</td>
        <td>COV(Income, Educational Level)</td>
    </tr>
    <tr>
        <th>Educational Level</th>
        <td>COV(Educational Level, Age)</td>
        <td>COV(Educational Level, Income)</td>
        <td>VAR(Educational Level)</td>
    </tr>
</table>
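The question gives no data, so the sketch below uses a small made-up sample (the values in `df_people` are hypothetical, with Education Level expressed as years of schooling) to show how such a matrix could be computed and read. The diagonal holds the variance of each variable, the off-diagonal entries hold the pairwise covariances, and the matrix is symmetric because COV(X, Y) = COV(Y, X). A positive entry means the two variables tend to increase together and a negative entry means one tends to decrease as the other increases. The magnitudes depend on the units, so entries involving Income (in currency units) will be much larger than the others without implying a stronger relationship.

```python
import pandas as pd

# Hypothetical sample; the numbers are for illustration only.
df_people = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38],
    "Income": [30000, 42000, 65000, 72000, 50000],
    "Education Level": [12, 16, 18, 16, 14],  # years of schooling
})

# Sample covariance matrix (pandas divides by n - 1).
cov_matrix = df_people.cov()

print(cov_matrix)
```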

## Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?


* Gender (Male/Female) - Label Encoding - There is no order or rank between the categories, so arbitrary labels (for example 0 and 1) can be assigned to them.

* Education Level (High School/Bachelor's/Master's/PhD) - Ordinal Encoding - There is a natural rank among the categories. We can use Target Encoding as well: if there is an "Income" feature in the dataset, we can encode each level by its mean income.

* Employment Status (Unemployed/Part-Time/Full-Time) - Target Encoding - If there is an "Income" feature in the dataset, we can encode each status by the mean income of its group.
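The three choices can be sketched together on a small made-up table. The rows, the `Income` column, and the variable names are assumptions for illustration; the question itself provides no data.

```python
import pandas as pd

# Hypothetical rows; 'Income' plays the role of the target.
df_people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
    "Income": [90000, 0, 30000, 70000],
})

encoded = df_people.copy()

# Gender: label encoding with arbitrary codes.
encoded["Gender"] = df_people["Gender"].map({"Female": 0, "Male": 1})

# Education Level: ordinal encoding that follows the natural rank.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_codes = {level: code for code, level in enumerate(education_order)}
encoded["Education Level"] = df_people["Education Level"].map(education_codes)

# Employment Status: target encoding with the mean income of each group.
mean_income = df_people.groupby("Employment Status")["Income"].mean()
encoded["Employment Status"] = df_people["Employment Status"].map(mean_income)

print(encoded)
```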

## Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [4]:
import pandas as pd

# preparation of the dataset

df_weather = pd.DataFrame({'Temperature': [17,19,35,31,18,21],
                           'Humidity': [71,83,77,92,95,64],
                           'Weather Condition' : ['Sunny','Cloudy','Sunny','Rainy', 'Rainy','Cloudy'],
                           'Wind Direction' : ['South', 'East', 'North', 'West', 'South', 'East']
                          })


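# Covariance is only defined for numerical features, so we restrict the calculation to them.
# numeric_only=True skips the categorical columns, which would otherwise raise an error in recent pandas versions.

df_weather.cov(numeric_only=True)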


Unnamed: 0,Temperature,Humidity
Temperature,57.5,11.6
Humidity,11.6,144.666667
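Interpretation: the diagonal entries are the variances (57.5 for Temperature, about 144.67 for Humidity), and the off-diagonal entry 11.6 is the covariance between the two. It is positive, so in this sample Temperature and Humidity tend to rise together. The size of the covariance depends on the units, so it does not show how strong the relationship is. The sketch below rescales it to a correlation, which comes out at roughly 0.13, a weak linear relationship. "Weather Condition" and "Wind Direction" are nominal categories with no numeric scale, so they are left out; encoding them with arbitrary integers would produce covariances that depend on the chosen codes rather than on the data.

```python
import pandas as pd

df_weather = pd.DataFrame({
    "Temperature": [17, 19, 35, 31, 18, 21],
    "Humidity": [71, 83, 77, 92, 95, 64],
    "Weather Condition": ["Sunny", "Cloudy", "Sunny", "Rainy", "Rainy", "Cloudy"],
    "Wind Direction": ["South", "East", "North", "West", "South", "East"],
})

# Keep only the continuous variables.
numeric = df_weather[["Temperature", "Humidity"]]

# Covariance shows the direction; correlation rescales it to the range [-1, 1].
cov_th = numeric.cov().loc["Temperature", "Humidity"]
corr_th = numeric.corr().loc["Temperature", "Humidity"]

print(cov_th, corr_th)
```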
