# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

## Label encoding
Label encoding assigns a unique integer to each category of a variable.
The integers are usually assigned in alphabetical order or by category frequency, so they carry no meaningful ranking.

Example: if we have a categorical variable "color" with 3 possible values (red, blue, green), we can represent it using label encoding:

1. Red --> 1
2. Blue --> 2
3. Green --> 3

## Ordinal Encoding
Ordinal encoding is used for categorical data with an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order.

Example: if we have a categorical variable "Education Level" with 4 possible values ('High School', 'College', 'Graduate', 'Post Graduate'), we can represent it using ordinal encoding:

1. High School --> 1
2. College --> 2
3. Graduate --> 3
4. Post Graduate --> 4
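
To make the contrast concrete, here is a minimal sketch using scikit-learn (an assumption, since the text above does not name a library). `LabelEncoder` numbers categories alphabetically, while `OrdinalEncoder` accepts an explicit order through its `categories` argument. Both start counting at 0 rather than 1, and the sample values are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data, not taken from a real dataset
colors = ['red', 'blue', 'green', 'blue']
education = [['College'], ['High School'], ['Post Graduate'], ['Graduate']]

# LabelEncoder sorts the categories alphabetically: blue=0, green=1, red=2
color_codes = LabelEncoder().fit_transform(colors)

# OrdinalEncoder follows the order passed via `categories`
edu_order = ['High School', 'College', 'Graduate', 'Post Graduate']
edu_encoder = OrdinalEncoder(categories=[edu_order])
edu_codes = edu_encoder.fit_transform(education).ravel()

print(color_codes)  # [2 0 1 0]
print(edu_codes)    # [1. 0. 3. 2.]
```

As a rule of thumb, ordinal encoding suits a ranked feature such as education level, where the numeric order should be visible to the model. Label encoding is the more natural fit for an unordered variable such as color, or for a target column, where the integers serve only as identifiers.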

# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

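Target Guided Ordinal Encoding is a type of encoding in which each unique value in a categorical feature is assigned a value based on the mean or median of the target variable for that category. It is useful when a categorical feature has many levels and its categories differ in their typical target value, for example encoding a city column when predicting house prices. The example below replaces each city with its mean price.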


In [1]:
import pandas as pd

# creating dataframe
df = pd.DataFrame({
    'city' : ['New York', 'London', 'New York', 'Paris','London'],
    'price': [100,200,300,250,400]
})

In [2]:
df

       city  price
0  New York    100
1    London    200
2  New York    300
3     Paris    250
4    London    400


In [4]:
# mean by city
mean_price = df.groupby('city')['price'].mean().to_dict()

In [5]:
mean_price

{'London': 300, 'New York': 200, 'Paris': 250}

In [6]:
df['city_encoded']= df['city'].map(mean_price)

In [7]:
df

       city  price  city_encoded
0  New York    100           200
1    London    200           300
2  New York    300           200
3     Paris    250           250
4    London    400           300


In [8]:
df[['price','city_encoded']]

   price  city_encoded
0    100           200
1    200           300
2    300           200
3    250           250
4    400           300
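
The cells above map each city to its mean price directly. A common variant, sketched here as an assumption about what "ordinal" implies, ranks the categories by their mean target value and assigns the rank instead of the mean. The names `rank_map` and `city_rank` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'New York', 'Paris', 'London'],
    'price': [100, 200, 300, 250, 400],
})

# Mean target per category, sorted from lowest to highest
mean_price = df.groupby('city')['price'].mean().sort_values()

# Rank each city by its mean price: lowest mean -> 0, next -> 1, and so on
rank_map = {city: rank for rank, city in enumerate(mean_price.index)}

df['city_rank'] = df['city'].map(rank_map)
print(rank_map)  # {'New York': 0, 'Paris': 1, 'London': 2}
```

In a real project the mapping should be computed on the training split only and then applied to validation and test data, otherwise the encoding leaks target information.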


# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

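Covariance measures how two random variables vary together, that is, whether deviations of one variable from its mean tend to coincide with deviations of the other from its mean. It is important in statistical analysis because it reveals the direction of the linear relationship between two variables and underlies quantities such as correlation, regression coefficients, and principal components.
A positive covariance means the two variables tend to move in the same direction (for example, two assets whose returns rise and fall together), while a negative covariance means they tend to move in opposite directions. Its magnitude depends on the units of the variables, so it is not by itself a measure of how strong the relationship is.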

Covariance for a population

Cov(x, y) = Σ ((xi - x_bar) * (yi - y_bar)) / N

Covariance for a sample

Cov(x, y) = Σ ((xi - x_bar) * (yi - y_bar)) / (N - 1)

xi = i-th observation of x
yi = i-th observation of y
x_bar = Mean of x
y_bar = Mean of y
N = Number of observations (paired data points)
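
The two formulas can be checked with a short NumPy sketch. The arrays `x` and `y` are made-up values chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Products of deviations from the means: [6, 0, -1, 9], which sum to 14
dev_products = (x - x.mean()) * (y - y.mean())

pop_cov = dev_products.sum() / len(x)           # divide by N
sample_cov = dev_products.sum() / (len(x) - 1)  # divide by N - 1

print(pop_cov)  # 3.5

# np.cov uses the sample formula (N - 1) by default
print(np.isclose(sample_cov, np.cov(x, y)[0, 1]))  # True
```

Passing `ddof=0` to `np.cov` switches it to the population formula.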

# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [15]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data_dict = {'Color' : ['red','green','red','blue','green'],
             'Size'  : ['small','large','medium','small','medium'],
             'Material' : ['wood', 'plastic','metal','wood','metal']
            }

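# Columns to encode
data_col = ['Color', 'Size', 'Material']
# Build the dataframe
df = pd.DataFrame(data_dict)

# The encoder is refit on each column in turn, so every column gets its own 0-based codes
le = LabelEncoder()
df[data_col] = df[data_col].apply(le.fit_transform)

df.head()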


   Color  Size  Material
0      2     2         2
1      1     0         1
2      2     1         0
3      0     2         2
4      1     1         0
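
To explain the output: `LabelEncoder` sorts the categories of each column alphabetically and replaces each one with its position in that sorted list. The sketch below recovers the mapping for every column by fitting a separate encoder per column and reading its `classes_` attribute. The names `raw` and `mappings` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

raw = pd.DataFrame({
    'Color': ['red', 'green', 'red', 'blue', 'green'],
    'Size': ['small', 'large', 'medium', 'small', 'medium'],
    'Material': ['wood', 'plastic', 'metal', 'wood', 'metal'],
})

# One encoder per column; classes_ holds the sorted categories,
# and a category's index in classes_ is its integer code
mappings = {}
for col in raw.columns:
    encoder = LabelEncoder().fit(raw[col])
    mappings[col] = {str(label): code for code, label in enumerate(encoder.classes_)}

for col, mapping in mappings.items():
    print(col, mapping)
# Color {'blue': 0, 'green': 1, 'red': 2}
# Size {'large': 0, 'medium': 1, 'small': 2}
# Material {'metal': 0, 'plastic': 1, 'wood': 2}
```

Note that the codes for Size follow the alphabet (large < medium < small), not the natural size order, so they should be read as identifiers rather than ranks.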


# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [71]:
import numpy as np
import pandas as pd

np.random.seed(780)

# Generating Data
n=1000
age = np.random.randint(20,60,size=n)
income = 1000*age+ np.random.normal(loc=0, scale=1000, size=n)
Edu_level = np.random.choice(['High School','Bachelor','Masters','PhD'],size=n)

df = pd.DataFrame({
    'Age':age,
    'Income':income,
    'Education Level':Edu_level
})
df.head()


   Age        Income Education Level
0   41  41859.499268        Bachelor
1   45  45414.734329        Bachelor
2   54  54197.446904         Masters
3   31  30854.233199     High School
4   28  29646.930671             PhD


In [72]:
# Ordinal encoding on Education Level
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[['High School','Bachelor','Masters','PhD']])
edu_encoded = oe.fit_transform(df[['Education Level']])
df['Education Level'] = np.ravel(edu_encoded)

df.head()



   Age        Income  Education Level
0   41  41859.499268              1.0
1   45  45414.734329              1.0
2   54  54197.446904              2.0
3   31  30854.233199              0.0
4   28  29646.930671              3.0


In [73]:
# covariance

df.cov()

                           Age         Income  Education Level
Age                 132.513264       132108.2         0.368882
Income           132108.214928    132672000.0       367.974257
Education Level       0.368882       367.9743         1.225144
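
Interpretation: the diagonal holds the variance of each variable, and the off-diagonal entries hold the pairwise covariances. Age and Income have a large positive covariance, which is expected because Income was generated as 1000 times Age plus noise. The covariances involving Education Level are small because education was drawn at random, independently of the other two columns. Since covariance depends on the units of each variable, the raw numbers are hard to compare. The sketch below rescales the matrix printed above into correlations, with the values copied by hand from that output.

```python
import numpy as np
import pandas as pd

cols = ['Age', 'Income', 'Education Level']

# Covariance values copied from the df.cov() output above
cov = pd.DataFrame(
    [[132.513264, 132108.214928, 0.368882],
     [132108.214928, 132672000.0, 367.974257],
     [0.368882, 367.974257, 1.225144]],
    index=cols, columns=cols,
)

# corr(i, j) = cov(i, j) / (std_i * std_j)
std = np.sqrt(np.diag(cov.to_numpy()))
corr = cov / np.outer(std, std)

print(corr.round(3))
```

On the original dataframe, `df.corr()` gives the same rescaled view directly.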


# Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?


For Gender, we would use binary encoding (a single 0/1 column), as there are only two categories. For Education Level, we can use ordinal encoding, as the categories have an inherent order (High School < Bachelor's < Master's < PhD). For Employment Status, we can use one-hot encoding, treating the categories as nominal so that the model does not assume a ranking or equal spacing between them.
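
A minimal sketch of the three choices, using pandas and scikit-learn on a small made-up dataframe. The rows are illustrative and not taken from any real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows for illustration
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['High School', "Master's", 'PhD', "Bachelor's"],
    'Employment Status': ['Unemployed', 'Full-Time', 'Part-Time', 'Full-Time'],
})

# Binary encoding: one 0/1 column is enough for two categories
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})

# Ordinal encoding: pass the category order explicitly
edu_order = ['High School', "Bachelor's", "Master's", 'PhD']
edu_encoder = OrdinalEncoder(categories=[edu_order])
df['Education Level'] = edu_encoder.fit_transform(df[['Education Level']]).ravel()

# One-hot encoding: one indicator column per employment status
df = pd.get_dummies(df, columns=['Employment Status'], dtype=int)

print(df)
```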

# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [74]:
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(300)

# Generate data
n = 1000
temp = np.random.normal(45, 15, n)
humidity = np.random.normal(60, 10, n)
weather_condition = np.random.choice(['Sunny', 'Cloudy', 'Rainy'], size=n)
wind_direction = np.random.choice(['North', 'South', 'East', 'West'], size=n)

# Create dataframe
df = pd.DataFrame({
    'Temperature': temp, 
    'Humidity': humidity, 
    'Weather Condition': weather_condition, 
    'Wind Direction': wind_direction
})

# Show first few rows
df.head()


   Temperature   Humidity Weather Condition Wind Direction
0    22.722445  42.045473             Sunny           East
1    41.096929  50.470654             Rainy           West
2    23.631519  62.592838             Rainy           East
3    31.629016  59.497223            Cloudy          North
4    56.417450  60.901102            Cloudy          South


In [76]:
cov_matrix = df[['Temperature','Humidity']].cov()
print(cov_matrix)

             Temperature   Humidity
Temperature   221.438177   1.930439
Humidity        1.930439  94.794605


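The covariance between "Temperature" and "Humidity" is 1.930. The sign is positive, but the value is tiny relative to the two variances (it corresponds to a correlation of about 0.013), so the data show essentially no linear relationship. This is expected, because the two variables were generated independently. The variances of each variable are shown on the diagonal: Temperature (221.44) has a larger variance than Humidity (94.79), consistent with the standard deviations of 15 and 10 used to generate them.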

In [77]:
cov_matrix = pd.get_dummies(df[['Weather Condition', 'Wind Direction']]).cov()
print(cov_matrix)

                          Weather Condition_Cloudy  Weather Condition_Rainy  \
Weather Condition_Cloudy                  0.221998                -0.113325   
Weather Condition_Rainy                  -0.113325                 0.224944   
Weather Condition_Sunny                  -0.108673                -0.111619   
Wind Direction_East                       0.008657                 0.000525   
Wind Direction_North                     -0.004645                 0.001932   
Wind Direction_South                      0.000332                -0.001911   
Wind Direction_West                      -0.004344                -0.000546   

                          Weather Condition_Sunny  Wind Direction_East  \
Weather Condition_Cloudy                -0.108673             0.008657   
Weather Condition_Rainy                 -0.111619             0.000525   
Weather Condition_Sunny                  0.220291            -0.009181   
Wind Direction_East                     -0.009181             0.180484 