<div class="alert alert-block alert-info" align="center" style="padding: 10px;">
    <h1><b><u>Feature Engineering-4</u></b></h1>
</div>

**Q1. What is data encoding? How is it useful in data science?**

Data encoding refers to the process of converting categorical data into a numerical format. This conversion is essential because many machine learning algorithms require numerical input.

Data encoding is useful in data science for the following reasons:
1. **Algorithm Compatibility:** Many machine learning algorithms, such as regression and neural networks, require numerical data as input. Data encoding enables the use of these algorithms on categorical data.
2. **Feature Engineering:** Data encoding is a crucial step in feature engineering, where you transform and preprocess data to create meaningful features for machine learning models.
3. **Improved Model Performance:** Properly encoded categorical data can lead to improved model performance, as it allows algorithms to learn from categorical variables effectively.
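
As a rough illustration of the points above, the sketch below turns a text column into numbers in two common ways. The DataFrame and its column names (`color`, `price`) are made-up assumptions for demonstration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# Integer codes: one number per category (pandas sorts the categories first).
df["color_code"] = df["color"].astype("category").cat.codes

# One-hot columns: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(df)
print(one_hot)
```

Either result is numeric and can be passed to an estimator; which one is appropriate depends on the feature and the model.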

---

**Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.**

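Nominal encoding converts a nominal categorical feature, one whose categories have no inherent order or ranking, into numbers. The simplest form is label (integer) encoding, which assigns a unique integer to each category. The integers are arbitrary identifiers, so models that treat inputs as magnitudes (linear or distance-based models) can read a false order into them; one-hot encoding avoids that at the cost of extra columns.

For example, a sales table might store the product type as text in a `fruit` column, which can be label-encoded as follows: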

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'fruit': ['apple', 'banana', 'apple', 'orange', 'banana']}
df = pd.DataFrame(data)

# Label encoding: categories are sorted alphabetically, then numbered,
# so apple -> 0, banana -> 1, orange -> 2
label_encoder = LabelEncoder()
df['fruit_encoded'] = label_encoder.fit_transform(df['fruit'])
```
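
To check which integer was assigned to which category, the fitted encoder can be inspected. The following is a minimal sketch using the same fruit values; the variable names (`fruit`, `mapping`, `decoded`) are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruit = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana'])

encoder = LabelEncoder()
codes = encoder.fit_transform(fruit)

# classes_ holds the categories in sorted order; the index of each
# category in classes_ is the integer code assigned to it.
mapping = {category: code for code, category in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around matters in practice, because new data must be transformed with the same mapping.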

---
**Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.**

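Both techniques apply to nominal data, so the absence of an order does not by itself favor one over the other. Nominal encoding (label encoding) is preferred over one-hot encoding when the feature has many distinct categories, since one-hot encoding adds one column per category, or when the model is tree-based and can split on arbitrary integer codes without assuming they are ordered.

Example: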
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'day_of_week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']}
df = pd.DataFrame(data)

# Label encoding. Note that the codes follow alphabetical order, not calendar order:
# Friday -> 0, Monday -> 1, Thursday -> 2, Tuesday -> 3, Wednesday -> 4
label_encoder = LabelEncoder()
df['day_encoded'] = label_encoder.fit_transform(df['day_of_week'])
```
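
If the integer codes should follow calendar order instead of alphabetical order, the order has to be supplied explicitly. The sketch below uses `pd.Categorical` for that; it is one possible approach, and the variable names are illustrative.

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])

# Default behaviour: categories are sorted alphabetically before numbering.
alphabetical = days.astype('category').cat.codes.tolist()

# Explicit category order: codes follow the list given in `categories`.
calendar_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
calendar = pd.Categorical(days, categories=calendar_order, ordered=True).codes.tolist()

print(alphabetical)
print(calendar)
```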
        
---
**Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.**

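With only 5 unique values, either technique is workable, and the choice depends on whether the values are ordered. Label encoding gives each value a unique integer and keeps everything in a single column, which makes it the simpler and more compact choice when the values are ordinal or the model is tree-based. If the values are nominal and the model treats numbers as magnitudes, one-hot encoding is usually the safer choice: with 5 categories it adds only 5 indicator columns (4 if one is dropped), and it does not impose an artificial order. The column blow-up that makes one-hot encoding expensive only becomes a concern with many categories.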
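
To make the column cost concrete, here is a small sketch that one-hot encodes a feature with exactly 5 unique values. The `city` values are hypothetical sample data.

```python
import pandas as pd

# Hypothetical nominal feature with exactly 5 unique values.
city = pd.Series(
    ['Delhi', 'Mumbai', 'Pune', 'Chennai', 'Kolkata', 'Pune', 'Delhi'],
    name='city',
)

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(city, prefix='city', dtype=int)

# Dropping the first category removes the redundant column.
reduced = pd.get_dummies(city, prefix='city', drop_first=True, dtype=int)

print(full.shape[1], reduced.shape[1])
```

Label encoding the same feature would produce a single integer column instead.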

---

**Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.**

With nominal encoding (label encoding), each category in a column is replaced by a unique integer, so every categorical column yields exactly one encoded column, regardless of how many categories it contains.

Calculation: 2 categorical columns × 1 encoded column each = 2 encoded columns. If the encoded columns replace the originals, the dataset still has 5 columns (3 numerical + 2 encoded); if they are stored alongside the originals, it has 5 + 2 = 7 columns. The 1000 rows are unaffected.
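
The count can be verified on a synthetic dataset of the same shape. Everything in the sketch below is an assumption made for illustration: the column names, and the choice of 3 and 4 categories for the two categorical columns. The one-hot result is included only for contrast, since its column count depends on the number of categories.

```python
import pandas as pd

n_rows = 1000

# Hypothetical dataset: 2 categorical columns (3 and 4 categories) + 3 numerical.
df = pd.DataFrame({
    'cat_a': ['x', 'y', 'z', 'x'] * 250,
    'cat_b': ['p', 'q', 'r', 's'] * 250,
    'num_1': range(n_rows),
    'num_2': range(n_rows),
    'num_3': range(n_rows),
})

# Label encoding stored alongside the originals: one new column per categorical column.
labelled = df.copy()
for col in ['cat_a', 'cat_b']:
    labelled[col + '_encoded'] = labelled[col].astype('category').cat.codes

# One-hot encoding for contrast: one new column per category (3 + 4 = 7).
one_hot = pd.get_dummies(df, columns=['cat_a', 'cat_b'], dtype=int)

print(df.shape, labelled.shape, one_hot.shape)
```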

---

**Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.**

The choice of encoding technique depends on the nature of each categorical feature:
- **Species:** the categories are distinct and have no natural order or ranking, so nominal encoding can assign a unique integer to each species.
- **Habitat:** the environments in which animals live have no inherent order, so nominal encoding is appropriate.
- **Diet:** diet types have no inherent order, so nominal encoding applies here too.

All three features are nominal, so nominal encoding is justified for each of them. If the downstream model treats numbers as magnitudes, as linear and distance-based models do, one-hot encoding is the safer alternative, because integer codes would suggest an order that does not exist.
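
A minimal sketch of integer-encoding all three features at once is shown below, using scikit-learn's `OrdinalEncoder`, which, despite its name, simply numbers the sorted categories of each column. The animal records are invented sample data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample records; values are illustrative only.
animals = pd.DataFrame({
    'species': ['lion', 'eagle', 'shark', 'deer'],
    'habitat': ['savanna', 'mountain', 'ocean', 'forest'],
    'diet': ['carnivore', 'carnivore', 'carnivore', 'herbivore'],
})

# One integer per category, assigned independently for each column.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(animals)

# categories_ lists, per column, the categories in code order.
for column, categories in zip(animals.columns, encoder.categories_):
    print(column, list(categories))
print(encoded)
```

The codes carry no meaning beyond identity: `shark` receiving a larger number than `deer` is an artifact of alphabetical sorting.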

---
**Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.**

In this scenario, we have a mix of categorical and numerical features:

- **Gender** -> Categorical
- **Contract Type** -> Categorical
- **Age**, **Monthly Charges**, and **Tenure** -> Numerical

"Gender" is a binary categorical feature, so a single 0/1 column is enough to represent it. "Contract type" has more than two categories, so it is one-hot encoded. Age, Monthly Charges, and Tenure are already numerical, so they need no encoding.

Here is a step-by-step implementation of the encoding using Python and pandas:

```python
import pandas as pd

# Sample dataset
data = {
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'contract_type': ['month-to-month', 'one year', 'two years', 'one year', 'month-to-month'],
    'age': [30, 25, 35, 40, 28],
    'monthly_charges': [50.0, 60.0, 70.0, 55.0, 75.0],
    'tenure': [12, 24, 36, 48, 60]
}

df = pd.DataFrame(data)

# Step 1: binary encoding for 'gender' (Male -> 1, Female -> 0)
df['is_male'] = df['gender'].map({'Male': 1, 'Female': 0})

# Step 2: one-hot encoding for 'contract_type'.
# get_dummies removes the original column; dtype=int gives 0/1 instead of booleans.
df = pd.get_dummies(df, columns=['contract_type'], prefix='is', dtype=int)

# Step 3: drop the original 'gender' column, now replaced by 'is_male'
df = df.drop(['gender'], axis=1)

# The dataset now contains only numerical features
print(df)
```

```
   age  monthly_charges  tenure  is_male  is_month-to-month  is_one year  \
0   30             50.0      12        1                  1            0
1   25             60.0      24        0                  0            1
2   35             70.0      36        1                  0            0
3   40             55.0      48        0                  0            1
4   28             75.0      60        1                  1            0

   is_two years
0             0
1             0
2             1
3             0
4             0
```
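
The same steps can also be bundled into a single reusable transformer, which helps when the identical encoding must later be applied to new customers. The sketch below is one possible way to do this with scikit-learn's `ColumnTransformer`; it is an alternative to the pandas approach above, not a required step, and the transformer names (`'gender'`, `'contract'`) are arbitrary labels chosen here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'contract_type': ['month-to-month', 'one year', 'two years', 'one year', 'month-to-month'],
    'age': [30, 25, 35, 40, 28],
    'monthly_charges': [50.0, 60.0, 70.0, 55.0, 75.0],
    'tenure': [12, 24, 36, 48, 60],
})

preprocess = ColumnTransformer(
    transformers=[
        # Binary feature: keep a single 0/1 column (the first category is dropped).
        ('gender', OneHotEncoder(drop='if_binary'), ['gender']),
        # Multi-category feature: one indicator column per contract type.
        ('contract', OneHotEncoder(handle_unknown='ignore'), ['contract_type']),
    ],
    remainder='passthrough',  # numerical columns pass through unchanged
    sparse_threshold=0,       # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

The resulting array holds the gender indicator first, then the three contract-type indicators, then the three numerical columns.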
