Q1. What is data encoding? How is it useful in data science? 

In [1]:
# Data encoding is the process of converting categorical variables into a numerical format so machine learning algorithms can process them. It's essential because most algorithms cannot work with raw categorical data directly.

# Why it's useful:

# Models such as logistic regression, SVMs, and (in scikit-learn) decision trees require numerical inputs.

# Encoding retains the information carried by each category while making it machine-readable.
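As a minimal sketch of what "converting categories into numbers" means in practice, the snippet below maps a toy string column to integer codes with `pd.factorize`. The `sizes` column and its values are made up for illustration and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
sizes = pd.Series(["small", "large", "medium", "small"], name="size")

# factorize returns one integer code per row plus the unique values,
# listed in order of first appearance.
codes, uniques = pd.factorize(sizes)

print(list(codes))    # [0, 1, 2, 0]
print(list(uniques))  # ['small', 'large', 'medium']
```

The codes can be fed to a model, and `uniques` keeps the lookup needed to translate them back.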

Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

In [2]:
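# Nominal encoding converts an unordered (nominal) categorical variable into numbers.
# This notebook uses label encoding for it, which assigns a unique integer to each category.
# Note: scikit-learn documents LabelEncoder for target labels; OrdinalEncoder is the equivalent for feature columns.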

# Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Blue']
})

encoder = LabelEncoder()
data['Encoded_Color'] = encoder.fit_transform(data['Color'])
print(data)


   Color  Encoded_Color
0    Red              2
1  Green              1
2   Blue              0
3    Red              2
4   Blue              0
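The integer codes in the output above follow the alphabetical order of the categories. The sketch below, which reuses the same colour values, shows one way to inspect and reverse that mapping; `mapping` and `decoded` are names introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue', 'Red', 'Blue']

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; each position is that category's code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}

# inverse_transform recovers the original labels from the integer codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted `encoder` around matters: without it, the integers cannot be translated back to category names.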


Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

In [3]:
# Nominal (Label) encoding is preferred when:

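# The number of categories is large, so one-hot encoding would add one mostly-zero column per category.

# The feature only identifies a group and the model does not need a separate indicator column per category (tree-based models, for example, tend to be less sensitive to the arbitrary integer order).

# Practical example: encoding a country-code column ('USA', 'IND', 'UK', 'AUS') for identification purposes; across many countries, one-hot encoding would produce a wide, sparse matrix.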
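To make the sparsity argument concrete, here is a rough sketch comparing the width of the two encodings on a high-cardinality column. The 200 country-like codes (`C000` to `C199`) are synthetic placeholders, not real data.

```python
import pandas as pd

# Hypothetical high-cardinality column: 200 made-up codes, repeated to give 1000 rows.
country_codes = [f"C{i:03d}" for i in range(200)]
df = pd.DataFrame({"Country": country_codes * 5})

# One-hot: one indicator column per distinct category.
one_hot = pd.get_dummies(df, columns=["Country"])

# Integer codes: a single column, whatever the number of categories.
label = df["Country"].astype("category").cat.codes.to_frame("Country_code")

print(one_hot.shape)  # (1000, 200)
print(label.shape)    # (1000, 1)
```

Each one-hot row contains a single `True` among 200 columns, which is the sparsity the answer refers to.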

Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

In [4]:
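# Use one-hot encoding, provided the categories have no inherent order: with only 5 unique values
# it adds just 5 indicator columns and avoids implying a ranking between categories.
# The illustrative data below happens to contain 4 distinct cities across 5 rows.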
data = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Kolkata', 'Chennai', 'Mumbai']
})

encoded = pd.get_dummies(data, columns=['City'])
print(encoded)


   City_Chennai  City_Delhi  City_Kolkata  City_Mumbai
0         False        True         False        False
1         False       False         False         True
2         False       False          True        False
3          True       False         False        False
4         False       False         False         True


Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.

In [6]:
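# The question does not give the number of unique values, so assume:
# Categorical column A has 4 unique values.
# Categorical column B has 3 unique values.
# With nominal (label) encoding, each categorical column becomes one integer column (it is not expanded like one-hot).
# New columns created = 2 categorical columns x 1 encoded column each = 2, regardless of the number of unique values.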

# Simulated example
import numpy as np

df = pd.DataFrame({
    'Category1': np.random.choice(['A', 'B', 'C', 'D'], size=1000),
    'Category2': np.random.choice(['X', 'Y', 'Z'], size=1000),
    'Num1': np.random.randn(1000),
    'Num2': np.random.randn(1000),
    'Num3': np.random.randn(1000),
})

# Apply label encoding
le1 = LabelEncoder()
le2 = LabelEncoder()
df['Cat1_encoded'] = le1.fit_transform(df['Category1'])
df['Cat2_encoded'] = le2.fit_transform(df['Category2'])
print(df[['Cat1_encoded', 'Cat2_encoded']].head())


   Cat1_encoded  Cat2_encoded
0             2             2
1             2             0
2             2             1
3             2             2
4             2             0
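For contrast, the same calculation can be checked against one-hot encoding. This sketch uses a deterministic stand-in for the two categorical columns (still assuming 4 and 3 unique values) so that every category is guaranteed to appear; `label_cols` and `one_hot_cols` are illustrative names.

```python
import pandas as pd

# Deterministic stand-in for the two categorical columns (4 and 3 unique values, as assumed above).
df = pd.DataFrame({
    "Category1": ["A", "B", "C", "D"] * 250,
    "Category2": (["X", "Y", "Z"] * 334)[:1000],
})

# Label-style encoding: one integer column per categorical column.
label_cols = df.apply(lambda col: col.astype("category").cat.codes)

# One-hot encoding: one indicator column per unique value.
one_hot_cols = pd.get_dummies(df, columns=["Category1", "Category2"])

print(label_cols.shape[1])    # 2
print(one_hot_cols.shape[1])  # 7  (4 + 3)
```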


Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.

In [7]:
# Use one-hot encoding, since species, habitat, and diet are nominal categorical variables with no ordinal relationship.
df = pd.DataFrame({
    'Species': ['Dog', 'Cat', 'Bird', 'Dog', 'Cat'],
    'Habitat': ['Urban', 'Urban', 'Forest', 'Rural', 'Urban'],
    'Diet': ['Carnivore', 'Carnivore', 'Omnivore', 'Omnivore', 'Herbivore']
})

encoded_df = pd.get_dummies(df)
print(encoded_df)


   Species_Bird  Species_Cat  Species_Dog  Habitat_Forest  Habitat_Rural  \
0         False        False         True           False          False   
1         False         True        False           False          False   
2          True        False        False            True          False   
3         False        False         True           False           True   
4         False         True        False           False          False   

   Habitat_Urban  Diet_Carnivore  Diet_Herbivore  Diet_Omnivore  
0           True            True           False          False  
1           True            True           False          False  
2          False           False           False           True  
3          False           False           False           True  
4           True           False            True          False  


Q7. You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding

In [10]:
# Features: gender (cat), age (num), contract type (cat), monthly charges (num), tenure (num)

# Step-by-step:

# 1. Apply label encoding to gender (binary here, so it maps cleanly to 0/1).
# 2. Apply one-hot encoding to contract type (multi-class, no inherent order assumed).
# 3. Leave the numerical columns (age, monthly charges, tenure) as-is.

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Age': [30, 45, 22, 35, 60],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month', 'Two year', 'One year'],
    'MonthlyCharges': [70, 60, 75, 80, 55],
    'Tenure': [12, 24, 5, 36, 20]
})

# Encode Gender
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])

# One-Hot Encode Contract
df = pd.get_dummies(df, columns=['Contract'])

print("Encoded Dataset for ML:")
print(df)


Encoded Dataset for ML:
   Gender  Age  MonthlyCharges  Tenure  Contract_Month-to-month  \
0       1   30              70      12                     True   
1       0   45              60      24                    False   
2       0   22              75       5                     True   
3       1   35              80      36                    False   
4       1   60              55      20                    False   

   Contract_One year  Contract_Two year  
0              False              False  
1               True              False  
2              False              False  
3              False               True  
4               True              False  
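The same three steps could also be expressed as a single scikit-learn preprocessing object, which is convenient when the encoding must be re-applied to new data. This is an alternative sketch, not the approach used in the cell above: it swaps in `OrdinalEncoder` for the gender column and `OneHotEncoder` for the contract column, and the names `preprocess` and `X` are introduced here for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Age': [30, 45, 22, 35, 60],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month', 'Two year', 'One year'],
    'MonthlyCharges': [70, 60, 75, 80, 55],
    'Tenure': [12, 24, 5, 36, 20]
})

preprocess = ColumnTransformer(
    transformers=[
        # Step 1: integer-encode the binary gender column (Female -> 0, Male -> 1).
        ('gender', OrdinalEncoder(), ['Gender']),
        # Step 2: one-hot encode contract type; unseen categories become all zeros.
        ('contract', OneHotEncoder(handle_unknown='ignore'), ['Contract']),
    ],
    # Step 3: pass the numerical columns through unchanged.
    remainder='passthrough',
    # Always return a dense array so the result is easy to inspect.
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
print(X.shape)  # (5, 7): 1 gender + 3 contract + 3 numerical columns
```

Because `preprocess` stores the fitted categories, calling `preprocess.transform` on later data would apply the same mapping rather than re-deriving it.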
