## Feature Engineering-3

In [1]:
# Q1: What is data encoding? How is it useful in data science?

# Ans:

# Data encoding is the process of converting data from one format or representation to another.

# Uses in Data Science:
# Machine Learning Algorithms Require Numerical Input: Most machine learning algorithms are designed to work with 
# numerical data. They can't directly process text or categorical variables in their original form.   

# Enables Pattern Recognition: By encoding categorical data into numerical form, we allow the algorithms to identify 
# patterns and relationships within the data, which is crucial for tasks like prediction and classification.   

# Prevents Bias: Choosing an encoding that matches the variable type (for example, one-hot encoding for unordered 
# categories) keeps the model from reading an artificial order or magnitude into arbitrary category codes.
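
# A minimal sketch of the idea, assuming a hypothetical "colors" feature that is not part of any dataset in this
# notebook: pandas' factorize maps each distinct label to an integer code and keeps the original labels, so the
# encoding can be reversed.

```python
import pandas as pd

# Hypothetical categorical feature (illustrative values only)
colors = pd.Series(["red", "green", "blue", "green"])

# codes holds one integer per row; uniques holds the original labels in order of first appearance
codes, uniques = pd.factorize(colors)

# Indexing uniques with the codes recovers the original labels
decoded = uniques[codes]

print(list(codes))
print(list(decoded))
```

# The integer codes are only identifiers here; whether they are a suitable model input depends on the variable type.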



In [2]:
# Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

# Ans:

# Nominal encoding is a type of data encoding used to transform categorical variables where there's no inherent order 
# or ranking among the categories. The categories are distinct, but one isn't "better" or "higher" than another.

# Example: Customer Preferences

# Let's say we're analyzing customer preferences for different types of online content. One of our variables is 
# "Content_Type," which can have the following categories:

# Video
# Article
# Podcast
# Infographic

# Since there's no inherent order (one isn't "better" than the others), nominal encoding is appropriate. 
# One-hot encoding is a common nominal encoding technique. Here's how it would work:

# Create New Columns: We'd create four new columns, one for each category:  "Content_Type_Video", 
# "Content_Type_Article", "Content_Type_Podcast", and "Content_Type_Infographic".   

# Assign Binary Values: For each customer record:

# If a customer prefers "Video", the "Content_Type_Video" column would be 1, and the other three columns would be 0.
# If a customer prefers "Article", the "Content_Type_Article" column would be 1, and the others would be 0.
# And so on...
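
# The steps above can be sketched with pandas' get_dummies. The sample rows and the Customer_ID column are
# hypothetical, added only for illustration; dtype=int is used so the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical sample of customer preferences (illustrative data only)
prefs = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Content_Type": ["Video", "Article", "Podcast", "Infographic"],
})

# Create one binary indicator column per category of Content_Type
encoded = pd.get_dummies(prefs, columns=["Content_Type"], dtype=int)

print(encoded)
```

# Each row has exactly one indicator set to 1, matching the category that customer prefers.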



In [3]:
# Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

# Ans:

# Here "nominal encoding" refers to single-column alternatives to one-hot encoding, such as label encoding or 
# target encoding.

# 1: High-Cardinality Categories: When a categorical column has a large number of unique values, one-hot encoding 
# creates one new column per value, which increases memory usage and can slow training.
# 2: When a Meaningful Order Exists: If the categories have a natural ranking, the variable is ordinal rather than 
# nominal, and an integer (label) encoding that follows the ranking preserves that order for the model.
# 3: Tree-Based Models (Decision Trees, Random Forest, XGBoost, etc.): Tree-based models split on thresholds, so they 
# can often work with label-encoded categorical values without needing one-hot encoding.

# Example: Suppose we are working on a fraud detection system for an e-commerce platform, and one of the categorical 
# features is "Customer Region" with 500 unique values (e.g., "Cuttack", "Bhubaneswar", etc.).

# One-Hot Encoding would create 500 new binary columns, making the dataset sparse and computationally expensive.
# Nominal Encoding (e.g., Target Encoding) can replace each category with the observed fraud rate (the mean of the 
# target) in that region, reducing 500 columns to one while preserving valuable information.
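
# A minimal sketch of target encoding for the region example, assuming a hypothetical binary target column named
# Is_Fraud and a handful of made-up rows. In practice the per-region means should be computed on the training split
# only (often with smoothing), so that the target does not leak into the feature.

```python
import pandas as pd

# Hypothetical transactions; Is_Fraud is an assumed binary target (illustrative data only)
transactions = pd.DataFrame({
    "Customer_Region": ["Cuttack", "Cuttack", "Bhubaneswar", "Bhubaneswar", "Bhubaneswar", "Puri"],
    "Is_Fraud": [1, 0, 0, 0, 1, 0],
})

# Mean of the target per region = observed fraud rate in that region
region_fraud_rate = transactions.groupby("Customer_Region")["Is_Fraud"].mean()

# Replace each region with its fraud rate: one numeric column instead of one column per region
transactions["Region_Encoded"] = transactions["Customer_Region"].map(region_fraud_rate)

print(transactions)
```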

In [4]:
# Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use
# to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

# Ans:

# For a dataset containing categorical data with 5 unique values, I would choose One-Hot Encoding (OHE) in most cases.

# Why One-Hot Encoding?

# 1: Small Number of Categories (Low Cardinality)
# Since there are only 5 unique values, one-hot encoding will create only 5 new binary columns, which is manageable and
# does not significantly increase dimensionality.

# 2: Avoids Implicit Ordinality
# If we use Label Encoding (assigning numeric values like {A: 0, B: 1, C: 2, D: 3, E: 4}), machine learning models may 
# mistakenly interpret the values as ordinal, which is incorrect unless there is a natural ranking.

# 3: Works Well with Most ML Models
# One-hot encoding is widely supported and effective for models like logistic regression, neural networks, and k-nearest 
# neighbors, which would otherwise treat integer category codes as magnitudes or distances.
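
# The point about implicit ordinality can be illustrated with distances. This is a sketch with five placeholder
# categories A to E: under label encoding A looks four times farther from E than from B, while under one-hot encoding
# every pair of distinct categories is equally far apart.

```python
import numpy as np
import pandas as pd

categories = ["A", "B", "C", "D", "E"]  # placeholder category names

# Label encoding: integer codes 0..4
label_codes = {cat: code for code, cat in enumerate(categories)}
dist_label_ab = abs(label_codes["A"] - label_codes["B"])
dist_label_ae = abs(label_codes["A"] - label_codes["E"])

# One-hot encoding: one indicator column per category, rows labeled by category
one_hot = pd.get_dummies(pd.Series(categories), dtype=int)
one_hot.index = categories
dist_onehot_ab = np.linalg.norm(one_hot.loc["A"].to_numpy() - one_hot.loc["B"].to_numpy())
dist_onehot_ae = np.linalg.norm(one_hot.loc["A"].to_numpy() - one_hot.loc["E"].to_numpy())

print("label encoding:  ", dist_label_ab, dist_label_ae)
print("one-hot encoding:", dist_onehot_ab, dist_onehot_ae)
```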

In [5]:
# Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, 
# and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, 
# how many new columns would be created? Show your calculations.

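# Ans:

# The number of new columns depends on how many unique values each categorical column contains. Suppose column 1 has 
# n unique values and column 2 has m unique values. One-hot (nominal) encoding then creates n + m new binary columns, 
# which replace the 2 original categorical columns, so the encoded dataset has 3 + n + m columns in total.
# For example, with n = 3 and m = 4: 3 + 4 = 7 new columns, and 3 + 7 = 10 columns overall. If the first category of 
# each column is dropped to avoid multicollinearity, n + m - 2 columns are created instead.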
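
# The calculation can be checked on a small frame. This is a sketch with hypothetical column names and values, and
# 6 rows instead of 1000; the row count does not affect the number of columns.

```python
import pandas as pd

# Hypothetical dataset: 2 categorical columns and 3 numerical columns
df = pd.DataFrame({
    "cat_1": ["a", "b", "c", "a", "b", "c"],   # n = 3 unique values
    "cat_2": ["x", "y", "x", "y", "x", "y"],   # m = 2 unique values
    "num_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "num_2": [10, 20, 30, 40, 50, 60],
    "num_3": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

n = df["cat_1"].nunique()
m = df["cat_2"].nunique()

# Full one-hot encoding: n + m indicator columns replace the 2 categorical columns
encoded = pd.get_dummies(df, columns=["cat_1", "cat_2"], dtype=int)

# With drop_first=True one indicator per categorical column is dropped
encoded_drop_first = pd.get_dummies(df, columns=["cat_1", "cat_2"], drop_first=True, dtype=int)

print("new columns:", n + m, "| total columns:", encoded.shape[1])
print("new columns with drop_first:", n + m - 2, "| total columns:", encoded_drop_first.shape[1])
```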


In [6]:
# Q6. You are working with a dataset containing information about different types of animals, including their species, 
# habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for 
# machine learning algorithms? Justify your answer.

# Ans:

# For a dataset containing categorical information about animals (e.g., species, habitat, and diet), the best encoding 
# technique depends on the characteristics of each categorical feature.

# I would choose a hybrid approach:

# One-Hot Encoding for diet, assuming it has only three values (Herbivore, Carnivore, Omnivore).
# Label Encoding for species, if the dataset contains many unique species names. The integer codes carry no real 
# order, so this suits tree-based models better than linear or distance-based models.
# Target Encoding for habitat, if the dataset is large, has a target variable to predict, and habitat has many unique 
# categories (e.g., Forest, Desert, Savannah, Ocean, etc.).
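
# A rough sketch of the hybrid approach on made-up rows. The Is_Endangered target is purely hypothetical (the question
# does not name a target) and is included only because target encoding needs one; pandas' factorize stands in for
# label encoding here.

```python
import pandas as pd

# Hypothetical animal records (illustrative data only)
animals = pd.DataFrame({
    "Species": ["Lion", "Zebra", "Bear", "Shark", "Camel", "Lion"],
    "Habitat": ["Savannah", "Savannah", "Forest", "Ocean", "Desert", "Savannah"],
    "Diet": ["Carnivore", "Herbivore", "Omnivore", "Carnivore", "Herbivore", "Carnivore"],
    "Is_Endangered": [1, 0, 0, 1, 0, 1],  # hypothetical target, assumed for this sketch
})

# Diet: few categories, so one-hot encode
animals = pd.get_dummies(animals, columns=["Diet"], dtype=int)

# Species: many categories, so assign one integer code per species (label encoding)
animals["Species"] = pd.factorize(animals["Species"])[0]

# Habitat: replace each habitat with the mean of the target in that habitat (target encoding)
habitat_means = animals.groupby("Habitat")["Is_Endangered"].mean()
animals["Habitat"] = animals["Habitat"].map(habitat_means)

print(animals)
```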

In [7]:
# Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a 
# dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which 
# encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step 
# explanation of how you would implement the encoding.

# Ans:

# Step 1: Identify the categorical columns
# Given the features:
# Gender (Male, Female) → Categorical
# Contract Type (Month-to-Month, One-Year, Two-Year) → Categorical
# Age, Monthly Charges, and Tenure → Already numerical (No encoding needed)

# Step 2: Choosing encoding techniques
# Gender (Binary Categorical Feature: 2 unique values). We can use Label Encoding (0/1 Encoding)
# Contract Type (Nominal Feature: 3 unique values). We can use One-Hot Encoding.

# Step 3: Implementation

# Import the packages
import pandas as pd
from sklearn.preprocessing import LabelEncoder


# Sample dataset
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Contract': ['Month-to-Month', 'One-Year', 'Two-Year', 'Month-to-Month'],
    'Age': [25, 34, 45, 29],
    'Monthly_Charges': [70.5, 50.0, 90.0, 65.3],
    'Tenure': [5, 24, 36, 12]
})

# Encode Gender using Label Encoding (classes are sorted alphabetically: Female -> 0, Male -> 1)
label_encoder = LabelEncoder()
data['Gender'] = label_encoder.fit_transform(data['Gender'])

# Encode Contract Type using One-Hot Encoding
# drop_first=True drops 'Month-to-Month', which becomes the baseline category, to avoid multicollinearity
data = pd.get_dummies(data, columns=['Contract'], drop_first=True)

# Display transformed dataset
print(data)

   Gender  Age  Monthly_Charges  Tenure  Contract_One-Year  Contract_Two-Year
0       1   25             70.5       5              False              False
1       0   34             50.0      24               True              False
2       0   45             90.0      36              False               True
3       1   29             65.3      12              False              False
