**Q1. What is data encoding? How is it useful in data science?**

**Answer:**
Data encoding is the process of converting categorical data into a numerical format suitable for machine learning algorithms. In data science, it is essential because many machine learning algorithms require numerical input. Encoding allows us to represent categorical variables in a way that preserves their information while enabling algorithms to process them effectively.

---

**Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.**

**Answer:**
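Nominal encoding is the encoding of nominal variables, that is, categorical variables whose categories have no inherent order or ranking. In these answers the term refers to integer encoding (often called label encoding), which assigns a unique integer to each category. For example, in a dataset containing car colors ("red", "blue", "green"), the colors can be mapped as "red" -> 0, "blue" -> 1, "green" -> 2. The integers are arbitrary identifiers: models that treat inputs as magnitudes, such as linear regression, may wrongly read "green" (2) as greater than "red" (0), whereas tree-based models are generally less sensitive to this.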
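
As a minimal sketch of this mapping in plain Python (the sample list `colors` is hypothetical; only the three color codes come from the example above):

```python
# Hypothetical sample values; the codes match the mapping in the answer.
colors = ["red", "blue", "green", "blue", "red"]
color_to_code = {"red": 0, "blue": 1, "green": 2}

# Replace each label with its integer code.
encoded = [color_to_code[c] for c in colors]
print(encoded)  # [0, 1, 2, 1, 0]

# Invert the mapping to recover the original labels.
code_to_color = {code: color for color, code in color_to_code.items()}
decoded = [code_to_color[code] for code in encoded]
print(decoded == colors)  # True
```

Keeping the mapping in a dictionary makes the encoding explicit and reversible, which helps when results have to be reported in terms of the original labels.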

---

**Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.**

**Answer:**
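Nominal (integer) encoding is preferred over one-hot encoding when the categorical variable has a large number of unique categories, because one-hot encoding adds one column per category and produces a wide, sparse feature matrix. For example, a column of country names can contain close to 200 distinct values: one-hot encoding would add roughly 200 columns, while integer encoding keeps a single column. Because countries have no natural ordering, the integer codes carry no meaningful magnitude, so this approach works best with models that are less sensitive to arbitrary numeric order, such as tree-based models.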
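
The difference in width can be sketched with scikit-learn. This is an illustration under assumptions: the six-row `countries` sample is made up, and `OrdinalEncoder` stands in here for integer encoding of a feature column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample with 4 distinct countries; a real column could have far more.
countries = np.array([["India"], ["Brazil"], ["Japan"], ["Kenya"], ["Brazil"], ["India"]])

# Integer encoding: still one column; codes follow the sorted category order.
integer_encoded = OrdinalEncoder().fit_transform(countries)

# One-hot encoding: one column per distinct country.
one_hot_encoded = OneHotEncoder().fit_transform(countries).toarray()

print(integer_encoded.shape)  # (6, 1)
print(one_hot_encoded.shape)  # (6, 4)
```

With four countries the gap is small, but the one-hot width grows with every additional category while the integer-encoded column stays at width one.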

---

**Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.**

**Answer:**
For a categorical column with 5 unique values, label encoding (nominal encoding) is a workable choice. It assigns a unique integer to each category and keeps the feature as a single column, so it adds no dimensionality at all. One caveat: if the categories are unordered and the model is sensitive to numeric magnitude (for example, linear or distance-based models), one-hot encoding is also practical here, since 5 categories add only 5 columns and no artificial order.
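
A short sketch of what this looks like with `LabelEncoder`, assuming a hypothetical `departments` column with 5 unique values (the values are invented for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with 5 unique categories.
departments = ["sales", "hr", "it", "finance", "ops", "it", "sales"]

encoder = LabelEncoder()
codes = encoder.fit_transform(departments)

# classes_ lists the categories in sorted order; a category's index is its code.
print(list(encoder.classes_))  # ['finance', 'hr', 'it', 'ops', 'sales']
print(codes.tolist())          # [4, 1, 2, 0, 3, 2, 4]

# The mapping is reversible.
restored = encoder.inverse_transform(codes).tolist()
print(restored == departments)  # True
```

Inspecting `classes_` is a quick way to confirm which integer each category received before the codes are passed to a model.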

---

**Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.**

**Answer:**
For nominal encoding, each categorical column is replaced by a single numerical column, so the number of encoded columns equals the number of categorical columns: 2 × 1 = 2. These 2 columns replace the 2 original categorical columns rather than being added alongside them, so the dataset keeps 3 + 2 = 5 columns and 1000 rows; the net increase in column count is zero.
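
The column count can be checked with a small sketch. The array contents are placeholders chosen only to match the stated shape (1000 rows, 3 numerical and 2 categorical columns), and `OrdinalEncoder` is assumed as the integer encoder:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

n_rows = 1000

# 3 numerical columns (placeholder values).
numerical = np.zeros((n_rows, 3))

# 2 categorical columns (placeholder categories).
categorical = np.array([["a", "x"], ["b", "y"]] * (n_rows // 2))

# Each categorical column becomes exactly one numeric column.
encoded = OrdinalEncoder().fit_transform(categorical)

# Combine the untouched numerical columns with the encoded ones.
result = np.hstack([numerical, encoded])

print(encoded.shape)  # (1000, 2)
print(result.shape)   # (1000, 5)
```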

---

**Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.**

**Answer:**
Species, habitat, and diet are all nominal variables: their categories have no inherent order. Nominal encoding (label encoding) can therefore be applied to each of the three columns independently, giving every category an integer code. If the model is sensitive to numeric magnitude, one-hot encoding is a reasonable alternative for low-cardinality columns such as diet.
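
One way to encode several categorical columns at once is scikit-learn's `OrdinalEncoder`, which fits a separate mapping per column. The rows below are hypothetical sample data, not taken from any real dataset:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows: [species, habitat, diet]
animals = [
    ["lion", "savanna", "carnivore"],
    ["panda", "forest", "herbivore"],
    ["bear", "forest", "omnivore"],
    ["zebra", "savanna", "herbivore"],
]

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(animals)

# categories_ holds one sorted array of categories per column.
for name, cats in zip(["species", "habitat", "diet"], encoder.categories_):
    print(name, list(cats))

print(encoded.tolist())
# [[1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [0.0, 0.0, 2.0], [3.0, 1.0, 1.0]]
```

Each column gets its own code space, so code 0 means "bear" in the species column but "forest" in the habitat column.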

---

**Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.**

**Answer:**
For the given scenario:
1. Gender: In this dataset gender has two categories (male, female), so nominal encoding can represent them as 0 and 1. Note that scikit-learn's `LabelEncoder` assigns codes in sorted label order, so it produces female -> 0, male -> 1.
2. Contract type: If contract type has multiple categories (e.g., month-to-month, one-year, two-year), we can again use nominal encoding to assign numerical values to each category.
3. Age, Monthly charges, and Tenure: These are numerical features and do not require encoding.


```python
from sklearn.preprocessing import LabelEncoder

# Sample dataset with categorical features
dataset = {
    'gender': ['male', 'female', 'female', 'male', 'male'],
    'contract_type': ['month-to-month', 'one-year', 'month-to-month', 'two-year', 'one-year']
}

# Keep one fitted encoder per feature so each mapping can be reused or inverted later
encoders = {}

# Apply label encoding to each categorical feature
for feature in ['gender', 'contract_type']:
    encoders[feature] = LabelEncoder()
    dataset[feature] = encoders[feature].fit_transform(dataset[feature])

# Display encoded dataset
print(dataset)
```

Output:

```
{'gender': array([1, 0, 0, 1, 1]), 'contract_type': array([0, 1, 0, 2, 1])}
```
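
`LabelEncoder` assigns codes in sorted label order. If a specific mapping is wanted, for instance male -> 0 and female -> 1, one option is `OrdinalEncoder` with an explicit `categories` list. The sketch below reuses the sample values from the dataset above; the row layout and the chosen category order are assumptions made for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Rows of [gender, contract_type], same values as the sample dataset.
rows = [
    ["male", "month-to-month"],
    ["female", "one-year"],
    ["female", "month-to-month"],
    ["male", "two-year"],
    ["male", "one-year"],
]

# Explicit category order per column: a category's position becomes its code.
encoder = OrdinalEncoder(categories=[
    ["male", "female"],
    ["month-to-month", "one-year", "two-year"],
])
encoded = encoder.fit_transform(rows)

print(encoded.tolist())
# [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
```

Fixing the category order up front also keeps the codes stable when the encoder is later fitted on a different sample of the same data.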
