## Q1. What is data encoding? How is it useful in data science?

*Data encoding* is the process of converting categorical data into a numerical format so that it can be processed by machine learning algorithms. Most machine learning models require numerical input, so non-numeric columns such as text labels must be encoded before training. A suitable encoding also gives every category a consistent representation, which makes patterns in the data easier to model and analyze.

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

*Nominal encoding*, also known as label encoding, is a method where each unique category value is assigned an integer value. This type of encoding is used when the categorical data is not ordinal, meaning there is no inherent order among the categories. The integers act only as identifiers and are not meant to express a ranking.

*Example:*
In a dataset containing information about fruits with a "Fruit Type" column:
- Apple → 1
- Orange → 2
- Banana → 3
- Grape → 4

This encoding could be used in a scenario where you are predicting the price of fruit based on its type.
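
The mapping above can be sketched in Python as follows. This is a minimal illustration that assumes a pandas DataFrame with a "Fruit Type" column; the sample rows and the `Fruit Type Encoded` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; only the "Fruit Type" column comes from the example above.
df = pd.DataFrame({"Fruit Type": ["Apple", "Banana", "Grape", "Orange", "Apple"]})

# Explicit mapping that mirrors the list above.
fruit_codes = {"Apple": 1, "Orange": 2, "Banana": 3, "Grape": 4}

# Series.map replaces each category with its integer code.
df["Fruit Type Encoded"] = df["Fruit Type"].map(fruit_codes)

print(df)
```

A category that is missing from the mapping becomes NaN under `Series.map`, so it is worth checking for missing values after encoding.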

## Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Nominal encoding is preferred over one-hot encoding when the categorical column has a large number of unique values and the algorithm is not misled by the arbitrary order of the integer codes. Tree-based models, for example, are generally less sensitive to that order than linear or distance-based models.

*Example:*
In a dataset with the "Country" column having many unique country names, nominal encoding is more efficient in terms of memory usage and computational complexity compared to one-hot encoding, which would create a sparse matrix with many columns.
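
The difference in width can be seen on a small scale with the sketch below. It assumes a pandas DataFrame with a "Country" column; the six sample rows are hypothetical, and a real column with hundreds of countries would widen the gap accordingly.

```python
import pandas as pd

# Hypothetical "Country" column with five distinct values across six rows.
df = pd.DataFrame({"Country": ["India", "Brazil", "Japan", "Kenya", "India", "Chile"]})

# Integer encoding: pd.factorize assigns 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(df["Country"])
label_encoded = pd.DataFrame({"Country": codes})

# One-hot encoding: one column per distinct country.
one_hot = pd.get_dummies(df["Country"])

print(label_encoded.shape)  # one column regardless of the number of countries
print(one_hot.shape)        # one column per distinct country
```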


## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

If the categorical data has no inherent order and the number of unique values is small (5 in this case), *one-hot encoding* is generally preferred. It adds only five binary columns, and it avoids the ordinal relationship between categories that the integer codes of nominal encoding might imply.

For example, for the categories "Red", "Blue", "Green", "Yellow", and "Black":
- Red → [1, 0, 0, 0, 0]
- Blue → [0, 1, 0, 0, 0]
- Green → [0, 0, 1, 0, 0]
- Yellow → [0, 0, 0, 1, 0]
- Black → [0, 0, 0, 0, 1]
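
The vectors above can be reproduced with a few lines of plain Python. This is only a sketch of the mechanism; the helper name `one_hot` is hypothetical, and in practice a library routine would usually be used instead.

```python
# Category order is fixed explicitly so the vectors match the listing above.
categories = ["Red", "Blue", "Green", "Yellow", "Black"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


for color in categories:
    print(color, one_hot(color, categories))
```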

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

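Nominal encoding creates no new columns. Each categorical column is replaced in place by a single column of integer codes, so if the two columns have \( V_1 \) and \( V_2 \) unique values respectively, they still occupy 2 columns after encoding. The dataset keeps its shape of 1000 rows and \( 3 + 2 = 5 \) columns, which gives \( 5 - 5 = 0 \) new columns. By contrast, one-hot encoding would replace the 2 categorical columns with \( V_1 + V_2 \) columns.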
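
The column count can be checked with a small sketch. The DataFrame below is a hypothetical stand-in with four rows instead of 1000, and its column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical 5-column dataset: two categorical and three numerical columns.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Goa"],
    "plan": ["basic", "pro", "pro", "basic"],
    "age": [25, 41, 33, 29],
    "income": [30.0, 52.5, 47.0, 38.2],
    "visits": [3, 7, 2, 5],
})

encoded = df.copy()
for column in ["city", "plan"]:
    # cat.codes swaps each category for an integer, overwriting the original column.
    encoded[column] = encoded[column].astype("category").cat.codes

new_columns = encoded.shape[1] - df.shape[1]
print(df.shape, encoded.shape, new_columns)
```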

## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

For a dataset with categories like species, habitat, and diet, I would use *one-hot encoding*. These categories are nominal (no inherent order), and one-hot encoding avoids implying an order among them, which the integer codes of nominal encoding might do.
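
A minimal sketch of this choice with pandas is shown below. The column names follow the question, while the three sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical animal records; the values are made up for illustration.
animals = pd.DataFrame({
    "species": ["lion", "penguin", "frog"],
    "habitat": ["savanna", "polar", "wetland"],
    "diet": ["carnivore", "carnivore", "insectivore"],
})

# dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(animals, columns=["species", "habitat", "diet"], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each row ends up with exactly one 1 per original feature, so no category is ranked above another.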

## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

To predict customer churn from features like gender, age, contract type, monthly charges, and tenure, only the categorical features need to be encoded.

1. *Identify categorical features:* "Gender" and "Contract Type" are categorical.
2. *Gender:*
   - Typically has two categories: "Male" and "Female".
   - Use one-hot encoding or binary encoding (0 for Male, 1 for Female).
3. *Contract Type:*
   - Suppose it has three categories: "Month-to-Month", "One Year", "Two Year".
   - Use one-hot encoding to avoid any ordinal relationships.
     - Month-to-Month → [1, 0, 0]
     - One Year → [0, 1, 0]
     - Two Year → [0, 0, 1]
4. *Numerical features:* "Age", "Monthly Charges", and "Tenure" can be used directly.

*Step-by-step implementation:*

1. Import the necessary libraries (e.g., pandas, sklearn).
2. Load the dataset.
3. Apply one-hot encoding to "Gender" and "Contract Type" using pandas' get_dummies() or sklearn's OneHotEncoder.
4. Concatenate the encoded columns back to the original dataset. pandas' get_dummies() does this automatically when it is given the whole DataFrame and a list of columns to encode.
5. Ensure that the final dataset is numerical and suitable for machine learning algorithms.


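```python
import pandas as pd

# Load the dataset; the file name and column names are assumed for illustration.
df = pd.read_csv('customer_churn.csv')

# One-hot encode the categorical columns.
# drop_first=True drops the first level of each feature, so with the categories
# assumed above "Gender" becomes 1 column and "Contract Type" becomes 2.
# dtype=int yields 0/1 integers rather than booleans.
df_encoded = pd.get_dummies(
    df, columns=['Gender', 'Contract Type'], drop_first=True, dtype=int
)
```

This ensures all categorical data is transformed into a numerical format without implying any unintended order, so df_encoded is ready for machine learning algorithms as long as the remaining columns are already numeric.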
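
The steps above can also be tried without the CSV file. The sketch below uses a small in-memory DataFrame as a hypothetical stand-in for the churn dataset (all values are invented) and adds an explicit check that every column is numeric after encoding.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for customer_churn.csv; values are invented for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [34, 51, 27, 45],
    "Contract Type": ["Month-to-Month", "One Year", "Two Year", "Month-to-Month"],
    "Monthly Charges": [70.5, 55.0, 89.9, 42.3],
    "Tenure": [5, 24, 60, 12],
})

df_encoded = pd.get_dummies(
    df, columns=["Gender", "Contract Type"], drop_first=True, dtype=int
)

# Final step: confirm that every remaining column is numeric.
all_numeric = all(is_numeric_dtype(dtype) for dtype in df_encoded.dtypes)
print(df_encoded.columns.tolist())
print(all_numeric)
```

Because `drop_first=True` removes the first level of each feature, the dropped levels ("Female" and "Month-to-Month" here) are represented by all-zero indicator columns.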