# Feature Engineering Assignment 

### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.



| Aspect | Ordinal Encoding | Label Encoding |
|---|---|---|
| Encoding Approach | Assigns a unique numerical value to each category in a categorical variable based on the order or rank of the categories. | Assigns a unique numerical label to each category in a categorical variable without considering any order or hierarchy among the categories. |
| Applicability | Suitable when there is a natural order or hierarchy among the categories of a variable. | Appropriate when there is no intrinsic order or hierarchy among the categories, and they are treated as distinct labels. |
| Interpretation | The encoded values carry ordinal information. Algorithms may assume a numerical relationship between the categories, so the chosen order should be meaningful. | The encoded labels carry no ordinal meaning; they only identify distinct categories. A model that treats them as numbers may still read an order into them. |

Example: Suppose we have a dataset with a "Size" column representing T-shirt sizes: "Small," "Medium," and "Large." The sizes have a clear order (Small < Medium < Large), so ordinal encoding with the values 0, 1, and 2 preserves it. If a column instead holds categories with no inherent order, such as a "Color" column with "Red," "Green," and "Blue," or if we are encoding the class labels of a target variable, label encoding can assign arbitrary labels like 0, 1, and 2, and those numbers should not be read as a ranking.

In summary, ordinal encoding is suitable when there is a natural order or ranking among the categories, while label encoding is appropriate when the categories are distinct labels without any assumed order.
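
The difference can be seen on the T-shirt sizes with scikit-learn. This is a minimal sketch: the `sizes` array is made up for illustration, and it relies on `LabelEncoder` assigning codes in sorted (alphabetical) order of the class names, while `OrdinalEncoder` follows the order passed in `categories`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data, not from the assignment dataset.
sizes = np.array(["Medium", "Small", "Large", "Small"])

# Ordinal encoding: the order is stated explicitly, so Small < Medium < Large.
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

# Label encoding: codes follow the sorted class names, not the size order.
label = LabelEncoder()
label_codes = label.fit_transform(sizes)

print(ordinal_codes)        # Small=0, Medium=1, Large=2
print(label_codes)          # Large=0, Medium=1, Small=2
print(label.classes_)
```

In scikit-learn, `LabelEncoder` is meant for a one-dimensional target column, while `OrdinalEncoder` works on a two-dimensional block of feature columns, which is why the input is reshaped above.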

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Target Guided Ordinal Encoding is a technique for encoding categorical variables based on their relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In Target Guided Ordinal Encoding, we replace each category with a numerical value based on the mean or median of the target variable for that category (either the statistic itself or the category's rank when sorted by it). This creates a monotonic relationship between the encoded variable and the target, which can improve the predictive power of the model. For example, the cells below encode the Titanic 'sex' and 'class' columns by the mean 'fare' in each category, treating 'fare' as the target.

In [1]:
import numpy as np
import pandas as pd

In [2]:
import seaborn as sns
df=sns.load_dataset('titanic')

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
# first pass the column which is to be encoded and then the target column
# get the mean and convert it to a dictionary
mean_fare= df.groupby('sex')['fare'].mean().to_dict()
mean_fare1= df.groupby('class')['fare'].mean().to_dict()

In [5]:
# Create a DataFrame for the encoded values
df1=pd.DataFrame({'sex_encoded': df['sex'].map(mean_fare),'class_encoded': df['class'].map(mean_fare1)})

In [6]:
df1.head()

Unnamed: 0,sex_encoded,class_encoded
0,25.523893,13.67555
1,44.479818,84.154687
2,44.479818,13.67555
3,44.479818,84.154687
4,25.523893,13.67555
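
The cells above map each category directly to its mean target value. One common variant turns those means into integer ranks instead. The sketch below assumes a made-up `toy` DataFrame with a hypothetical `city` feature and `price` target; it is not part of the Titanic example.

```python
import pandas as pd

# Hypothetical toy data: 'city' is the categorical feature, 'price' is the target.
toy = pd.DataFrame({
    "city": ["A", "B", "C", "A", "B", "C"],
    "price": [100, 300, 200, 120, 280, 220],
})

# Mean of the target per category, sorted from lowest to highest.
mean_price = toy.groupby("city")["price"].mean().sort_values()

# Assign ranks 0, 1, 2, ... in that order and map them back onto the column.
rank_map = {city: rank for rank, city in enumerate(mean_price.index)}
toy["city_encoded"] = toy["city"].map(rank_map)

print(rank_map)
print(toy)
```

Using ranks keeps the ordering implied by the target while discarding the spacing between the category means.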


### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a measure of how two variables change together. It is calculated by summing the products of each variable's deviations from its mean and dividing by n - 1 (for a sample). Covariance can be positive, negative, or zero. Positive covariance indicates that the variables tend to move in the same direction, while negative covariance indicates that they tend to move in opposite directions. Zero covariance indicates that there is no linear relationship between the variables; they may still be related in a non-linear way.

Covariance is an important tool in statistical analysis because it can be used to identify relationships between variables. For example, if the covariance between two variables is positive, we can infer that the variables tend to move in the same direction. This information can be used to make predictions about the future values of one variable based on the known values of the other variable.

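Formula of covariance (sample version), where $\bar{x}$ and $\bar{y}$ are the sample means and $n$ is the number of observations:

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$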
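
The formula can be checked by hand with a few lines of plain Python. This is a minimal sketch; the helper name `sample_covariance` and the small lists are illustrative only.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of deviation products divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return total / (n - 1)

# Two variables that move together: deviation products sum to 10, so cov = 10 / 3.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
cov_xy = sample_covariance(x, y)

# v is fully determined by u (v = u squared), yet the covariance is zero,
# because the relationship is not linear.
u = [-2, -1, 0, 1, 2]
v = [ui ** 2 for ui in u]
cov_uv = sample_covariance(u, v)

print(cov_xy, cov_uv)
```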

In [7]:
import seaborn as sns
import numpy as np

In [8]:
df=sns.load_dataset('iris')

In [9]:
df.cov(numeric_only=True)


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,0.685694,-0.042434,1.274315,0.516271
sepal_width,-0.042434,0.189979,-0.329656,-0.121639
petal_length,1.274315,-0.329656,3.116278,1.295609
petal_width,0.516271,-0.121639,1.295609,0.581006


### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [10]:
import pandas as pd
data=pd.DataFrame({'Color': ['red', 'green', 'blue'], 'Size': ['small', 'medium', 'large'], 'Material': ['wood', 'metal', 'plastic']})

In [11]:
data

Unnamed: 0,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic


In [12]:
# Encode Color: no order is given, so codes follow alphabetical order (blue=0, green=1, red=2)
from sklearn.preprocessing import OrdinalEncoder
encoder=OrdinalEncoder()
encoded=encoder.fit_transform(data[['Color']])
data['Encoded Color']=encoded

In [13]:
# Encode Size with an explicit order: small < medium < large
from sklearn.preprocessing import OrdinalEncoder
encoder= OrdinalEncoder(categories=[['small','medium','large']])
encoded=encoder.fit_transform(data[['Size']])
data['Encoded Size']=encoded

In [14]:
# One-hot encode Material: one indicator column per category
from sklearn.preprocessing import OneHotEncoder
encoder= OneHotEncoder()
encoded=encoder.fit_transform(data[['Material']])

In [15]:
df1=pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
encoded_data=pd.concat([data,df1],axis=1)

In [16]:
encoded_data

Unnamed: 0,Color,Size,Material,Encoded Color,Encoded Size,Material_metal,Material_plastic,Material_wood
0,red,small,wood,2.0,0.0,0.0,0.0,1.0
1,green,medium,metal,1.0,1.0,1.0,0.0,0.0
2,blue,large,plastic,0.0,2.0,0.0,1.0,0.0
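
The cells above use `OrdinalEncoder` and `OneHotEncoder`. Since the question asks for label encoding specifically, the sketch below applies scikit-learn's `LabelEncoder` to each column. The names `items`, `label_encoded`, and `mappings` are illustrative, and one encoder is fitted per column because `LabelEncoder` handles a single one-dimensional column at a time.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

items = pd.DataFrame({
    "Color": ["red", "green", "blue"],
    "Size": ["small", "medium", "large"],
    "Material": ["wood", "metal", "plastic"],
})

label_encoded = pd.DataFrame(index=items.index)
mappings = {}
for column in items.columns:
    encoder = LabelEncoder()  # a fresh encoder for each column
    label_encoded[column] = encoder.fit_transform(items[column])
    # classes_ is sorted, so the position of each name is its code.
    mappings[column] = {str(name): code for code, name in enumerate(encoder.classes_)}

print(label_encoded)
print(mappings)
```

Each column gets the integers 0 to 2 in alphabetical order of its category names, so for Size the codes are large=0, medium=1, small=2, which does not match the natural size order.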


### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [17]:
import pandas as pd
import numpy as np
import random


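# np.random.seed fixes Age and Income only; random.choices uses the
# standard-library generator, so Edu changes on every run.
np.random.seed(42)
Age=np.random.randint(18,50,50)
Income=np.random.randint(5000,500000,50)
E =['Phd','Graduation','Post_Graduation','Diploma','Chartered Accountant','Doctor','Engineer','High school','ACCA','Data Scientist']
Edu = random.choices(E, k=50)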


df=pd.DataFrame({'Age':Age,'Income':Income,'Edu_lvl':Edu})

In [18]:

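from sklearn.preprocessing import OrdinalEncoder

# Perform Ordinal Encoding
# No category order is passed, so the codes follow alphabetical order of the
# labels rather than any ranking of education levels.
ordinal_encoder = OrdinalEncoder()
df['encoded_edu'] = ordinal_encoder.fit_transform(df[['Edu_lvl']]).ravel()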

In [19]:
df.cov(numeric_only=True)


Unnamed: 0,Age,Income,encoded_edu
Age,88.175102,311043.9,-1.959184
Income,311043.925306,17966530000.0,-21486.387755
encoded_edu,-1.959184,-21486.39,7.102041


##### Covariance between 'Age' and 'Income':
The covariance between 'Age' and 'Income' is 3.110439e+05. The positive sign suggests that 'Income' tends to increase as 'Age' increases in this sample. The sign gives the direction of the relationship, but the magnitude depends on the units of both variables, so it does not indicate how strong the relationship is.

##### Covariance between 'Age' and 'encoded_edu':
The covariance between 'Age' and 'encoded_edu' is -1.959184. The negative sign indicates that higher 'encoded_edu' codes tend to go with lower ages in this sample. However, the codes were assigned without an explicit category order, so they follow alphabetical order of the labels rather than a ranking of education levels, and the sign has no substantive meaning.

##### Covariance between 'Income' and 'encoded_edu':
The covariance between 'Income' and 'encoded_edu' is -21486.387755. The negative sign indicates that higher 'encoded_edu' codes tend to go with lower incomes in this sample, and the large magnitude mainly reflects the scale of 'Income'. The same caveat about the arbitrary codes applies, and because all three columns were generated at random, a scale-free measure such as correlation is needed to judge whether any of these relationships is more than noise.
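
To judge strength, covariance can be rescaled into the Pearson correlation by dividing by the product of the two standard deviations. The sketch below uses plain Python and made-up `age` and `income` values (they are not the random data generated above) to show that changing the units of a variable changes the covariance but not the correlation.

```python
import math

def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

# Hypothetical values for illustration only.
age = [25, 32, 41, 47, 53]
income = [30000, 42000, 50000, 61000, 58000]
income_k = [value / 1000 for value in income]   # same incomes, in thousands

cov_raw = sample_covariance(age, income)
cov_k = sample_covariance(age, income_k)        # 1000 times smaller
r_raw = pearson_correlation(age, income)
r_k = pearson_correlation(age, income_k)        # unchanged by the rescaling

print(cov_raw, cov_k)
print(r_raw, r_k)
```

With a pandas DataFrame, `df.corr(numeric_only=True)` gives the same kind of scale-free matrix.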



### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

* For Gender, I would use one-hot encoding, because the categories have no rank.

* For Education Level, I would prefer ordinal encoding, because the categories can be ranked, for example: High School < Bachelor's < Master's < PhD. One-hot encoding can also be used if the levels are treated as having no internal hierarchy.

* For Employment Status, I would also use ordinal encoding, because we can consider that Unemployed < Part-Time < Full-Time. One-hot encoding can also be used if the statuses are treated as having no internal hierarchy.
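
These choices can be combined in one preprocessing step. The following is a sketch under stated assumptions: the `people` DataFrame is made up, and scikit-learn's `ColumnTransformer` is used here as one convenient way to apply different encoders to different columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows for illustration.
people = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Education Level": ["PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time"],
})

education_order = ["High School", "Bachelor's", "Master's", "PhD"]
employment_order = ["Unemployed", "Part-Time", "Full-Time"]

preprocess = ColumnTransformer(
    transformers=[
        # Gender has no rank: one indicator column per category.
        ("gender", OneHotEncoder(), ["Gender"]),
        # Ranked variables: codes follow the explicit orders above.
        ("ranked",
         OrdinalEncoder(categories=[education_order, employment_order]),
         ["Education Level", "Employment Status"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(people)
print(encoded)  # columns: Gender_Female, Gender_Male, education code, employment code
```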

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [20]:
import pandas as pd
import numpy as np
np.random.seed(42)


Temperature = np.random.uniform(15,35,50)
Humidity = np.random.uniform(30,80,50)

Weather_Condition = np.random.choice(['Sunny', 'Cloudy', 'Rainy'], size=50)
Wind_Direction = np.random.choice(['North', 'South', 'East', 'West'], size=50)

In [21]:
df= pd.DataFrame({'Temprature':Temperature, 'Humidity':Humidity, 'Weather_Condition':Weather_Condition, 'Wind_Direction': Wind_Direction })

In [22]:
df.head()

Unnamed: 0,Temprature,Humidity,Weather_Condition,Wind_Direction
0,22.490802,78.479231,Rainy,West
1,34.014286,68.756641,Rainy,West
2,29.639879,76.974947,Sunny,East
3,26.97317,74.741368,Sunny,East
4,18.120373,59.894999,Cloudy,South


In [23]:
from sklearn.preprocessing import OneHotEncoder
encoder= OneHotEncoder()
encoded = encoder.fit_transform(df[['Weather_Condition','Wind_Direction']])

encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())

df = pd.concat([df, encoded_df], axis=1)

In [24]:
df = df.drop(['Weather_Condition','Wind_Direction'],axis=1)

In [25]:
print(df.head())

   Temprature   Humidity  Weather_Condition_Cloudy  Weather_Condition_Rainy  \
0   22.490802  78.479231                       0.0                      1.0   
1   34.014286  68.756641                       0.0                      1.0   
2   29.639879  76.974947                       0.0                      0.0   
3   26.973170  74.741368                       0.0                      0.0   
4   18.120373  59.894999                       1.0                      0.0   

   Weather_Condition_Sunny  Wind_Direction_East  Wind_Direction_North  \
0                      0.0                  0.0                   0.0   
1                      0.0                  0.0                   0.0   
2                      1.0                  1.0                   0.0   
3                      1.0                  1.0                   0.0   
4                      0.0                  0.0                   0.0   

   Wind_Direction_South  Wind_Direction_West  
0                   0.0                

In [26]:
print(df.cov())

                          Temprature    Humidity  Weather_Condition_Cloudy  \
Temprature                 33.381402    5.514328                 -0.300153   
Humidity                    5.514328  235.379281                  0.516773   
Weather_Condition_Cloudy   -0.300153    0.516773                  0.205714   
Weather_Condition_Rainy     0.104616   -0.759486                 -0.097143   
Weather_Condition_Sunny     0.195536    0.242713                 -0.108571   
Wind_Direction_East         0.276332    1.742553                  0.044898   
Wind_Direction_North        0.033275   -0.792571                 -0.027755   
Wind_Direction_South        0.254679   -0.404967                  0.013061   
Wind_Direction_West        -0.564286   -0.545014                 -0.030204   

                          Weather_Condition_Rainy  Weather_Condition_Sunny  \
Temprature                               0.104616                 0.195536   
Humidity                                -0.759486              

#### Results:


1. Temperature and Humidity: The covariance between these two continuous variables is 5.514328 (the value 33.381402 on the diagonal is the variance of Temperature). The positive sign suggests that, in this sample, Humidity tends to increase as Temperature increases.

2. Weather_Condition_Cloudy and Temperature: The covariance between Weather_Condition_Cloudy and Temperature is -0.300153. This negative value suggests an inverse relationship, implying that when the weather condition is cloudy, the temperature tends to be lower.

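3. Weather_Condition_Rainy and Humidity: The covariance between Weather_Condition_Rainy and Humidity is -0.759486. This negative value suggests an inverse relationship, indicating that in this sample the humidity tends to be lower when the weather condition is rainy. Because the columns were generated independently at random, this reflects sampling noise rather than a real weather pattern.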

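4. Wind_Direction_South and Weather_Condition_Cloudy: The covariance between these two indicator columns is 0.013061. Since both are binary (0 or 1), their covariance reflects how often they co-occur; the small positive value suggests that South winds and cloudy weather appear together slightly more often than independence would predict. Two indicators from the same variable, such as Wind_Direction_South and Wind_Direction_East, can never both be 1 in the same row, so their covariance cannot be positive.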

5. Wind_Direction_West and Wind_Direction_North: The covariance between these two indicator columns is -0.078367. The negative sign is a consequence of the one-hot encoding: each row has exactly one wind direction, so when Wind_Direction_West is 1, Wind_Direction_North must be 0, and vice versa.
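
The behaviour of indicator columns from a single categorical variable can be checked on a small example. This sketch assumes a made-up `wind` series and uses `pd.get_dummies` in place of `OneHotEncoder` for brevity; every off-diagonal covariance comes out negative because the indicators are mutually exclusive.

```python
import numpy as np
import pandas as pd

# Made-up wind directions: East x3, North x2, South x2, West x1 (n = 8).
wind = pd.Series(["North", "South", "East", "West", "East", "South", "North", "East"])

dummies = pd.get_dummies(wind, dtype=float)   # one 0/1 column per direction
cov = dummies.cov()

# Collect every entry that is not on the diagonal.
off_diagonal = cov.values[~np.eye(len(cov), dtype=bool)]

print(cov)
print((off_diagonal < 0).all())
```

For two such indicators the sample covariance equals -n / (n - 1) times the product of their proportions; for East and North here that is -(8/7)(3/8)(2/8) = -6/56.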


## The End