ASSIGNMENT: FE-4

1.  What is data encoding? How is it useful in data science?

Data encoding is the process of transforming data from one format to another for efficient storage, transmission, and processing. It involves converting data into a specific code or representation that computer systems can read, transmit, and interpret. In data science specifically, it most often means converting categorical (non-numeric) values into numbers that machine learning algorithms can use.

Data encoding is useful in data science for several reasons:

Compression: Data encoding can be used to compress data, reducing the amount of storage space required and making it easier and faster to transmit data over networks.

Standardization: Data encoding can be used to standardize data, ensuring that all data is represented in the same format, regardless of where it came from. This makes it easier to compare and analyze data from multiple sources.

Security: Encoding is sometimes confused with encryption, but encoding alone (for example, Base64) can be reversed by anyone and does not protect data. Sensitive data must be encrypted separately before storage or transmission.

Efficient processing: Data encoding can be used to transform data into a format that is optimized for specific types of processing, making it faster and more efficient to work with.

2.  What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding is used for categorical variables whose categories have no inherent order. Its most common form is one-hot encoding, which represents each category as its own binary column: a value of 1 indicates that the observation belongs to that category, and a value of 0 indicates that it does not.

For example, consider a dataset that contains information about the type of fruit purchased at a grocery store. The "type of fruit" variable is a categorical variable, with possible values of "apple", "orange", and "banana". To use this variable in a machine learning model, we can encode it using nominal encoding. We would create three binary variables, one for each possible fruit type. For each observation in the dataset, we would set the value of the appropriate variable to 1, and the values of the other variables to 0.

![image.png](attachment:7dfea141-6ee5-427a-85fa-be6d680082d2.png)
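
The fruit example can be sketched in a few lines with pandas. This is a minimal illustration, assuming a small made-up list of purchases; the `purchases` DataFrame and its `fruit` column are hypothetical names, not part of the original dataset.

```python
import pandas as pd

# Hypothetical purchases; the fruit names come from the example above.
purchases = pd.DataFrame({"fruit": ["apple", "orange", "banana", "apple"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(purchases["fruit"], dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column matching that row's fruit.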


3. In what situations is ordinal encoding preferred over one-hot encoding? Provide a practical example.

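Ordinal encoding is preferred over one-hot encoding when there is a clear order or hierarchy among the categories in a categorical variable. Ordinal encoding assigns each category a numerical value based on its position or rank in the order, so the order is preserved in a single column. Note that the resulting integers also imply equal spacing between adjacent categories, which may or may not hold for the data.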

One practical example of when ordinal encoding would be preferred over one-hot encoding is in the case of a survey question that asks respondents to rate their level of agreement with a statement on a Likert scale. The Likert scale has a clear order or hierarchy, with categories such as "strongly disagree", "disagree", "neutral", "agree", and "strongly agree".

Here's an example:

![image.png](attachment:e72a98a1-e695-43b6-b888-eb6563b191d6.png)
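
The Likert-scale case can be sketched with an explicit mapping from category to rank. This is a minimal illustration; the `responses` DataFrame and the 0 to 4 numbering are assumptions made for the example, and any order-preserving numbering would work equally well.

```python
import pandas as pd

# Category order from the Likert scale described above, lowest to highest.
likert_order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses.
responses = pd.DataFrame(
    {"response": ["agree", "neutral", "strongly disagree", "strongly agree"]}
)

# Map each category to its position in the order (0 = strongly disagree, 4 = strongly agree).
rank = {label: position for position, label in enumerate(likert_order)}
responses["response_encoded"] = responses["response"].map(rank)
print(responses)
```

Writing the order out explicitly avoids relying on alphabetical sorting, which would place "agree" before "disagree".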

4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

The choice of encoding technique would depend on the nature of the categorical data and the specific requirements of the machine learning algorithm being used. Here are some considerations to keep in mind:

If the categorical data is nominal (i.e., there is no inherent order or hierarchy among the categories), one-hot encoding is usually the preferred technique. One-hot encoding creates binary features for each category, which can be used as input to the machine learning algorithm.

If the categorical data is ordinal (i.e., there is an inherent order or hierarchy among the categories), ordinal encoding may be a more appropriate technique. In ordinal encoding, each category is assigned a numerical value based on its rank or position in the order.

If the number of unique values is small (e.g., fewer than about 10), either one-hot encoding or ordinal encoding may be appropriate, depending on the nature of the data. With 5 unique values, one-hot encoding adds only 5 binary columns, so it is a practical default when the data is nominal.

If the number of unique values is large (e.g., well above 10), one-hot encoding may lead to a high-dimensional feature space, which can cause problems such as overfitting or slow training times. In this case, other encoding techniques such as target encoding or frequency encoding may be more appropriate.
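
As an illustration of the frequency encoding mentioned above, the sketch below replaces each category with the share of rows in which it appears. The `city` column and its values are hypothetical, chosen only to keep the example small; a real high-cardinality column would have many more categories.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"city": ["paris", "lyon", "paris", "nice", "paris", "lyon"]})

# Relative frequency of each category across the rows.
freq = df["city"].value_counts(normalize=True)

# Replace each category with its frequency: one numeric column, however many categories exist.
df["city_freq"] = df["city"].map(freq)
print(df)
```

One limitation of this approach is that two categories with the same frequency receive the same encoded value and become indistinguishable.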

5.  In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

If we use nominal encoding to transform the two categorical columns, each unique category in each column will be converted to a new binary column. Therefore, the number of new columns created will depend on the number of unique categories in each column.

Let's assume that the first categorical column has 5 unique categories and the second categorical column has 3 unique categories. Using nominal encoding, we would create 5 new binary columns for the first categorical column and 3 new binary columns for the second categorical column. Therefore, in total, we would create 5 + 3 = 8 new columns.

In general, if the first categorical column has n1 unique categories and the second has n2, the total number of new columns is n1 + n2. The two original categorical columns are replaced by the new ones, so the encoded dataset has 3 + n1 + n2 columns in total (3 + 8 = 11 in the example above). If one category per column is dropped to avoid redundant columns, the number of new columns is (n1 - 1) + (n2 - 1) instead.
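
The calculation can be checked on a small stand-in dataset. This is a sketch under the same assumption as above (5 and 3 unique categories); the column names `cat_a`, `cat_b`, and `num_1` to `num_3` are hypothetical, and only 6 rows are used instead of 1000 because the row count does not affect the number of columns.

```python
import pandas as pd

# Hypothetical stand-in: 2 categorical + 3 numerical columns.
# Assumed cardinalities: cat_a has 5 unique values, cat_b has 3.
df = pd.DataFrame({
    "cat_a": ["a", "b", "c", "d", "e", "a"],
    "cat_b": ["x", "y", "z", "x", "y", "z"],
    "num_1": [1, 2, 3, 4, 5, 6],
    "num_2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "num_3": [10, 20, 30, 40, 50, 60],
})
cat_cols = ["cat_a", "cat_b"]

# n1 + n2: one new column per unique category.
new_columns = sum(df[col].nunique() for col in cat_cols)

# Full one-hot encoding, and the variant that drops one category per column.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)
encoded_drop_first = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=int)

print(new_columns)                  # 8
print(encoded.shape[1])             # 11 (3 numerical + 8 encoded)
print(encoded_drop_first.shape[1])  # 9  (3 numerical + 6 encoded)
```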

6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

If the categorical variables are nominal (i.e., there is no inherent order or hierarchy among the categories), one-hot encoding would be a good choice. This is because each category can be transformed into a binary feature that can be easily interpreted by machine learning algorithms. For example, the "species" column could be transformed into binary features representing each species, and the "habitat" column could be transformed into binary features representing each habitat type.

If a categorical variable is ordinal (i.e., there is an inherent order or hierarchy among the categories), ordinal encoding may be more appropriate. None of the three columns is clearly ordinal. Treating "diet" as an ordered scale, for example "herbivore", "omnivore", and "carnivore" encoded as 0, 1, and 2, only makes sense if the analysis explicitly defines such an order (for instance, by the proportion of meat in the diet); otherwise "diet" should be treated as nominal as well.

If the number of unique categories in the categorical variables is large, one-hot encoding may lead to a high-dimensional feature space, which could cause problems such as overfitting or slow training times. In such cases, other encoding techniques such as target encoding or frequency encoding could be considered.
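
A minimal sketch of the one-hot option, assuming all three columns are treated as nominal. The animals and their attribute values below are made up for illustration and are not from the original dataset.

```python
import pandas as pd

# Hypothetical animal records.
animals = pd.DataFrame({
    "species": ["lion", "zebra", "eagle"],
    "habitat": ["savanna", "savanna", "mountain"],
    "diet": ["carnivore", "herbivore", "carnivore"],
})

# Encode all three columns at once; pandas prefixes each new column with the source column name.
encoded = pd.get_dummies(animals, columns=["species", "habitat", "diet"], dtype=int)
print(encoded.columns.tolist())
```

The column-name prefixes (such as `habitat_savanna`) keep the encoded features readable and prevent clashes if two columns happen to share a category label.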

7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

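The "gender" feature is nominal, as there is no inherent order or hierarchy among its categories, so one-hot encoding can be used to transform it into binary features.

The "contract type" feature is treated as nominal here as well and is also one-hot encoded. (It could alternatively be treated as ordinal if the contract types are ranked by commitment length.)

The remaining features (age, monthly charges, and tenure) are already numerical and need no encoding. The steps are: fit a one-hot encoder on the two categorical columns, transform those columns, and concatenate the result with the numerical columns.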

In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# create a sample dataset
data = {'gender': ['male', 'female', 'male', 'female'],
        'age': [25, 30, 40, 50],
        'contract_type': ['month-to-month', 'one year', 'month-to-month', 'two year'],
        'monthly_charges': [50, 70, 80, 90],
        'tenure': [2, 5, 7, 10]}
df = pd.DataFrame(data)

# create a one-hot encoder for the gender and contract_type features
# (sparse_output replaces the older `sparse` argument in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(df[['gender', 'contract_type']])

# transform the categorical features using one-hot encoding
encoded_features = encoder.transform(df[['gender', 'contract_type']])
encoded_features_df = pd.DataFrame(encoded_features, columns=encoder.categories_[0].tolist()+encoder.categories_[1].tolist())

# concatenate the encoded features with the original features
df_encoded = pd.concat([df.drop(['gender', 'contract_type'], axis=1), encoded_features_df], axis=1)

df




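|   | gender | age | contract_type | monthly_charges | tenure |
|---|--------|-----|----------------|-----------------|--------|
| 0 | male   | 25  | month-to-month | 50              | 2      |
| 1 | female | 30  | one year       | 70              | 5      |
| 2 | male   | 40  | month-to-month | 80              | 7      |
| 3 | female | 50  | two year       | 90              | 10     |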


In [6]:
df_encoded

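|   | age | monthly_charges | tenure | female | male | month-to-month | one year | two year |
|---|-----|-----------------|--------|--------|------|----------------|----------|----------|
| 0 | 25  | 50              | 2      | 0.0    | 1.0  | 1.0            | 0.0      | 0.0      |
| 1 | 30  | 70              | 5      | 1.0    | 0.0  | 0.0            | 1.0      | 0.0      |
| 2 | 40  | 80              | 7      | 0.0    | 1.0  | 1.0            | 0.0      | 0.0      |
| 3 | 50  | 90              | 10     | 1.0    | 0.0  | 0.0            | 0.0      | 1.0      |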
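
The output above comes from treating contract_type as nominal. As an alternative sketch, if one assumes the contract types are ordered by commitment length (an assumption for illustration, not something the dataset states), the column could be ordinally encoded with an explicit mapping. The `contract_rank` dictionary and the `contract_type_encoded` column name are hypothetical.

```python
import pandas as pd

# Same contract_type values as in the sample dataset above.
df = pd.DataFrame(
    {"contract_type": ["month-to-month", "one year", "month-to-month", "two year"]}
)

# Assumed order: shortest to longest commitment.
contract_rank = {"month-to-month": 0, "one year": 1, "two year": 2}
df["contract_type_encoded"] = df["contract_type"].map(contract_rank)
print(df)
```

This keeps the feature in a single column, at the cost of implying that the step from month-to-month to one year is the same size as the step from one year to two years.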
