### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
### might choose one over the other.

Ans.

Ordinal encoding and label encoding are both methods of converting categorical data into numerical form, but there are some key differences between the two.

Label encoding is a method of encoding categorical data in which each unique category is assigned an integer value. For example, if we have a categorical variable "color" with three categories: red, blue, and green, we could assign the values 0, 1, and 2 to them. The integers are arbitrary identifiers (scikit-learn's LabelEncoder assigns them in sorted order of the category names) and are not meant to convey any relationship or order between the categories.

Ordinal encoding is a method of encoding categorical data in which the categories are assigned numerical values based on their order or rank. For example, if we have a categorical variable "size" with three categories: small, medium, and large, we could assign the values 0, 1, and 2 to them respectively. In this case, the numerical values do convey the order or rank of the categories.

Which encoding method to use depends on the nature and characteristics of the data being analyzed. If there is no clear order or ranking between the categories, label encoding may be more appropriate. However, if there is a natural order or rank between the categories, ordinal encoding may be more useful.

For example, in a dataset where there is a clear order among variable categories (such as ratings or size categories), ordinal encoding would be more appropriate. In contrast, in a dataset where the categories are arbitrary and have no natural order (such as colors or names), label encoding would be more appropriate, keeping in mind that a model which treats the integers as magnitudes may read a spurious order into them.
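
The contrast between the two encoders can be sketched with scikit-learn. This is a minimal illustration, assuming small made-up `sizes` and `colors` lists that are not part of the original question:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data for illustration only.
sizes = ["small", "large", "medium", "small"]
colors = ["red", "blue", "green", "red"]

# OrdinalEncoder with an explicit category order: small < medium < large.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform([[s] for s in sizes]).ravel()

# LabelEncoder assigns integers by sorted (alphabetical) class order,
# so the codes carry no meaningful ranking.
label = LabelEncoder()
color_codes = label.fit_transform(colors)

print(size_codes)      # codes follow the order given in `categories`
print(label.classes_)  # classes in sorted order
print(color_codes)
```

With the explicit `categories` argument the size codes respect the intended ranking, whereas the color codes only reflect alphabetical order.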

Overall, it is important to carefully consider the nature of the data and choose an appropriate encoding method that captures the most relevant information in the categorical variables.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
### a machine learning project.

Ans.

Target-guided ordinal encoding is a technique for encoding a categorical variable in which the integer assigned to each category is determined by the target variable. It combines ideas from ordinal encoding and mean encoding.

To apply target-guided ordinal encoding, we first calculate the mean of the target variable for each category of the categorical variable. We then sort the categories based on these means and assign a numerical value to each category based on its rank.

For example, consider a dataset that contains a categorical variable "race", with four categories: "White", "Asian", "Black", and "Hispanic", and a binary target variable "voter" indicating whether or not a person is a registered voter. We can apply target-guided ordinal encoding as follows: 

1. Calculate the mean of the target variable (voter) for each category of the categorical variable (race).

2. Sort the categories based on these means, from highest to lowest.

3. Assign a numerical value to each category based on its rank, with the highest mean assigned a value of 1, the next highest mean assigned a value of 2, and so on.
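
The three steps above can be sketched with pandas. This is a minimal illustration, assuming a hypothetical `city` feature and a binary `target` column with made-up values (not the "race"/"voter" data described above):

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "C", "C", "A"],
    "target": [1, 0, 1, 0, 1, 0, 1, 1],
})

# Step 1: mean of the target for each category.
means = df.groupby("city")["target"].mean()

# Step 2: sort categories by mean, highest first.
ordered = means.sort_values(ascending=False).index

# Step 3: assign ranks starting at 1 for the highest mean.
rank_map = {category: rank for rank, category in enumerate(ordered, start=1)}
df["city_encoded"] = df["city"].map(rank_map)

print(rank_map)
print(df)
```

In practice the `rank_map` would be computed on the training split only and then applied to validation and test data.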

This method of encoding could be useful in a machine learning project where there is a strong relationship between the categorical variable and the target variable. By assigning values based on the target variable, we can potentially improve the performance of the model by highlighting and utilizing important relationships within the dataset.

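However, target-guided ordinal encoding should be used with caution. Because the encoding is derived from the target itself, it can leak target information and lead to overfitting, especially when some categories contain only a few samples or the target variable is imbalanced. The category ranking should be computed on the training data only, and the model should be evaluated on held-out data to ensure the results generalize.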


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Ans.

Covariance is a statistical measure of how two variables change together. Its sign gives the direction of the linear relationship between them: when the covariance is positive, the variables tend to move in the same direction, and when it is negative, they tend to move in opposite directions.

In statistical analysis, covariance is important because it identifies the direction of the relationship between two variables and underlies related quantities such as correlation, which rescales covariance so that strength can be compared independently of units. It is particularly useful in portfolio theory, where it is used to reduce the overall risk of a portfolio by diversifying investments across assets that are less associated with each other.

The covariance between two variables can be calculated by the following formula:

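$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of X and Y, and $n$ is the number of observations.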

The result can be interpreted as follows: 

- A positive covariance indicates that the two variables move together in the same direction, while a negative covariance indicates they move in opposite directions.
- A covariance of zero indicates that there is no linear relationship between the two variables.
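
The sample covariance formula can be checked numerically. This is a minimal sketch, assuming two short made-up arrays `x` and `y`, that compares a direct computation against NumPy's `np.cov`:

```python
import numpy as np

# Hypothetical sample values for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Direct translation of the sample covariance formula (divides by n - 1).
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree because `np.cov` also divides by n - 1 by default.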

In summary, covariance is an important statistical concept that measures the relationship between two variables. It can help identify risk and diversification opportunities in portfolio theory and can also be used in other statistical analyses to understand associations between variables.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
### large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
### Show your code and explain the output.

In [1]:
#Solution:-

#Creating Dataset:-

import pandas as pd

df = pd.DataFrame({"Color": ["red","green","blue","red","green","blue"],
                   "Size": ["small","medium","large","small","medium","large"],
                   "Material": ["wood","metal","plastic","wood","metal","plastic"]})

In [2]:
df

Unnamed: 0,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic
3,red,small,wood
4,green,medium,metal
5,blue,large,plastic


In [20]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder works on one column at a time, so fit a fresh encoder per column
df_encoded_color = pd.DataFrame(LabelEncoder().fit_transform(df["Color"]), columns=["Color_encoded"])
df_encoded_size = pd.DataFrame(LabelEncoder().fit_transform(df["Size"]), columns=["Size_encoded"])
df_encoded_material = pd.DataFrame(LabelEncoder().fit_transform(df["Material"]), columns=["Material_encoded"])

# Append the encoded columns to the original data
df_encoded = pd.concat([df, df_encoded_color, df_encoded_size, df_encoded_material], axis=1)

In [22]:
df_encoded  # LabelEncoder numbers classes in alphabetical order. Color: blue->0, green->1, red->2; Size: large->0, medium->1, small->2; Material: metal->0, plastic->1, wood->2. The Size codes therefore do not follow the natural small < medium < large order.

Unnamed: 0,Color,Size,Material,Color_encoded,Size_encoded,Material_encoded
0,red,small,wood,2,2,2
1,green,medium,metal,1,1,0
2,blue,large,plastic,0,0,1
3,red,small,wood,2,2,2
4,green,medium,metal,1,1,0
5,blue,large,plastic,0,0,1
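
The category-to-integer mapping can also be read from the fitted encoder rather than inferred from the output. This is a minimal sketch, assuming the same three columns as above and a hypothetical `mappings` dictionary introduced only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["red", "green", "blue", "red", "green", "blue"],
                   "Size": ["small", "medium", "large", "small", "medium", "large"],
                   "Material": ["wood", "metal", "plastic", "wood", "metal", "plastic"]})

# `mappings` is a helper name used here for illustration.
mappings = {}
for column in ["Color", "Size", "Material"]:
    encoder = LabelEncoder()
    encoder.fit(df[column])
    # classes_ holds the categories in sorted order; the position is the code.
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)
```

Keeping one fitted encoder (or its mapping) per column also makes it possible to decode the integers later.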


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
### level. Interpret the results.

In [26]:
#Solution:-

#Creating Dataset:-

import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Age": [40, 30, 35, 45],
                   "Income": [60000, 50000, 58000, 39000],
                   "Education": ["Phd", "B.tech", "M.tech", "Intermediate"]
                   })
df

Unnamed: 0,Age,Income,Education
0,40,60000,Phd
1,30,50000,B.tech
2,35,58000,M.tech
3,45,39000,Intermediate


In [38]:
encoder = OrdinalEncoder(categories= [["Intermediate", "B.tech", "M.tech", "Phd"]])
Education_encoded = pd.DataFrame(encoder.fit_transform(df[["Education"]]), columns= ["Education_encoded"])

In [35]:
df_final = pd.concat([df, Education_encoded], axis = 1)

In [36]:
df_final

Unnamed: 0,Age,Income,Education,Education_encoded
0,40,60000,Phd,3.0
1,30,50000,B.tech,1.0
2,35,58000,M.tech,2.0
3,45,39000,Intermediate,0.0


In [40]:
df_final.cov(numeric_only=True)  # only Income and Education_encoded have a positive covariance: in this sample, income tends to rise with education level. Age has a negative covariance with both.

Unnamed: 0,Age,Income,Education_encoded
Age,41.666667,-25833.333333,-1.666667
Income,-25833.333333,90916666.666667,11833.333333
Education_encoded,-1.666667,11833.333333,1.666667
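
When reading this matrix, note that the size of a covariance depends on the units of the variables, so only the sign is directly comparable across pairs. This is a minimal sketch, assuming the same four rows and a hypothetical rescaling of Income to thousands:

```python
import pandas as pd

df = pd.DataFrame({"Age": [40, 30, 35, 45],
                   "Income": [60000, 50000, 58000, 39000],
                   "Education_encoded": [3.0, 1.0, 2.0, 0.0]})

cov = df.cov()

# Hypothetical rescaling: express Income in thousands instead of units.
df_k = df.assign(Income=df["Income"] / 1000)
cov_k = df_k.cov()

print(cov.loc["Age", "Income"])    # covariance with Income in units
print(cov_k.loc["Age", "Income"])  # same sign, magnitude divided by 1000
```

The sign of each entry is unchanged by the rescaling, while the magnitude shrinks by the same factor as the units.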


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Q6. You are working on a machine learning project with a dataset containing several categorical
### variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
### and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
### each variable, and why?

Ans.

For the given categorical variables, we can use the following encoding methods:

- Gender (Male/Female): Since gender is a binary variable here, we can use one-hot encoding to transform it into two dummy variables, "Male" and "Female". The value for a given sample in each column is either 1 or 0, representing the presence or absence of the corresponding gender. Because the two columns are exact complements, a single 0/1 column carries the same information, so one of them can be dropped to avoid redundancy.

- Education Level (High School/Bachelor's/Master's/PhD): For education level, we can use ordinal encoding to convert the levels to integer values based on their order. For example, "High School" could be encoded as 1, "Bachelor's" as 2, "Master's" as 3, and "PhD" as 4. This encoding preserves the order of the categories and suits variables that have a natural ordering.

- Employment Status (Unemployed/Part-Time/Full-Time): For employment status, we can use one-hot encoding to create three dummy variables, one for each category. The value for a given sample in each column is either 1 or 0, representing the presence or absence of the corresponding employment status.

The encoding methods chosen will depend on the nature of the categorical variable and the specific requirements of the machine learning algorithm being used. For example, some algorithms may work better with one encoding method over another.


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
### categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
### East/West). Calculate the covariance between each pair of variables and interpret the results.

In [42]:
# Creating Dataset:-

import pandas as pd

df = pd.DataFrame({"Temperature": [37.2, 38.4, 37.2, 33.4, 31.1, 32.0, 32.6],
                   "Humidity": [31.1, 30.8, 38.2, 54.8, 66.2, 67.3, 63.3],
                   "Weather Condition": ["Sunny", "Sunny", "Sunny", "Rainy", "Rainy", "Cloudy", "Cloudy"],
                   "Wind Direction": ["West", "West", "West", "South", "South", "West", "East"]})

In [43]:
df

Unnamed: 0,Temperature,Humidity,Weather Condition,Wind Direction
0,37.2,31.1,Sunny,West
1,38.4,30.8,Sunny,West
2,37.2,38.2,Sunny,West
3,33.4,54.8,Rainy,South
4,31.1,66.2,Rainy,South
5,32.0,67.3,Cloudy,West
6,32.6,63.3,Cloudy,East


In [54]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
df_WeatherCondition = pd.DataFrame(encoder.fit_transform(df[["Weather Condition"]]).toarray(), columns= ["Cloudy", "Rainy", "Sunny"])
df_WeatherCondition

Unnamed: 0,Cloudy,Rainy,Sunny
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
5,1.0,0.0,0.0
6,1.0,0.0,0.0


In [58]:
df_windDirection = pd.DataFrame(encoder.fit_transform(df[["Wind Direction"]]).toarray(), columns=["East", "South", "West"])  # OneHotEncoder orders categories alphabetically
df_windDirection

Unnamed: 0,East,South,West
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,0.0,0.0,1.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
5,0.0,0.0,1.0
6,1.0,0.0,0.0


In [61]:
df_final = pd.concat([df,df_WeatherCondition,df_windDirection], axis=1)
df_final

Unnamed: 0,Temperature,Humidity,Weather Condition,Wind Direction,Cloudy,Rainy,Sunny,East,South,West
0,37.2,31.1,Sunny,West,0.0,0.0,1.0,0.0,0.0,1.0
1,38.4,30.8,Sunny,West,0.0,0.0,1.0,0.0,0.0,1.0
2,37.2,38.2,Sunny,West,0.0,0.0,1.0,0.0,0.0,1.0
3,33.4,54.8,Rainy,South,0.0,1.0,0.0,0.0,1.0,0.0
4,31.1,66.2,Rainy,South,0.0,1.0,0.0,0.0,1.0,0.0
5,32.0,67.3,Cloudy,West,1.0,0.0,0.0,0.0,0.0,1.0
6,32.6,63.3,Cloudy,East,1.0,0.0,0.0,1.0,0.0,0.0


In [62]:
df_final.cov(numeric_only=True)

Unnamed: 0,Temperature,Humidity,Cloudy,Rainy,Sunny,East,South,West
Temperature,8.732857,-47.79119,-0.752381,-0.769048,1.521429,-0.32619,-0.769048,1.095238
Humidity,-47.79119,271.05619,5.019048,3.419048,-8.438095,2.17619,3.419048,-5.595238
Cloudy,-0.752381,5.019048,0.238095,-0.095238,-0.142857,0.119048,-0.095238,-0.02381
Rainy,-0.769048,3.419048,-0.095238,0.238095,-0.142857,-0.047619,0.238095,-0.190476
Sunny,1.521429,-8.438095,-0.142857,-0.142857,0.285714,-0.071429,-0.142857,0.214286
East,-0.32619,2.17619,0.119048,-0.047619,-0.071429,0.142857,-0.047619,-0.095238
South,-0.769048,3.419048,-0.095238,0.238095,-0.142857,-0.047619,0.238095,-0.190476
West,1.095238,-5.595238,-0.02381,-0.190476,0.214286,-0.095238,-0.190476,0.285714


We can observe:

1. Temperature has positive covariance with Sunny weather and with wind from the West, so in this sample sunny days and westerly wind coincide with higher temperatures.

2. Humidity has positive covariance with Cloudy weather, Rainy weather, and wind from the South or East, so in this sample cloudy or rainy days coincide with higher humidity.

Many more conclusions can be drawn from the covariance table in the same way.
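
When labeling one-hot columns, the names need to follow the order the encoder uses internally. This is a minimal sketch, assuming the same "Wind Direction" values as above, that reads the column names from the fitted encoder's `categories_` attribute instead of typing them by hand:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Wind Direction": ["West", "West", "West", "South", "South", "West", "East"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Wind Direction"]]).toarray()

# categories_ lists the categories in the order of the output columns.
columns = list(encoder.categories_[0])
df_wind = pd.DataFrame(encoded, columns=columns)

print(columns)
print(df_wind)
```

Taking the labels from the encoder keeps each dummy column aligned with the category it actually represents.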

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------