**Q1. What is data encoding? How is it useful in data science?**


Data encoding is the process of converting categorical data into a numerical format that can be used by machine learning algorithms. It is useful in data science because most algorithms require numerical input, and encoding allows for the representation of categorical variables in a way that preserves their information while making them interpretable by models.



**Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.**


Nominal encoding converts categorical variables that have no intrinsic order into numerical values. For example, if a dataset records the colors of cars (red, blue, green), each color can be assigned a unique integer: red = 1, blue = 2, green = 3. The integers act only as labels and carry no ranking, so this works best with models that do not read the codes as magnitudes (tree-based models, for example); a linear or distance-based model would treat green (3) as "larger" than red (1).
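The color example can be sketched in pandas as follows. This is a minimal illustration, not a prescribed method: the `cars` DataFrame, the `color_codes` dictionary, and the `color_code` column name are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical car-color column; the mapping mirrors the example in the text.
cars = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})
color_codes = {"red": 1, "blue": 2, "green": 3}

# Replace each label with its integer code (unmapped labels would become NaN).
cars["color_code"] = cars["color"].map(color_codes)
print(cars)
```

Keeping the mapping in an explicit dictionary makes the codes reproducible and easy to invert later.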



**Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.**

Nominal encoding is preferred over one-hot encoding when dealing with high cardinality categorical variables, as one-hot encoding can lead to a significant increase in dimensionality. For instance, if you have a dataset with a "Country" feature containing 100 unique countries, one-hot encoding would create 100 new binary features, which can lead to sparse data and increased computational cost. Nominal encoding, on the other hand, would simply convert the countries into a single numerical feature.
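A rough way to see the dimensionality gap described above, assuming a synthetic `Country` column with 100 made-up labels (the label names and the row count are illustrative and do not come from any real dataset):

```python
import pandas as pd

# Hypothetical data: 100 synthetic country labels, each repeated 5 times.
countries = [f"country_{i:03d}" for i in range(100)]
df = pd.DataFrame({"Country": countries * 5})

# One-hot encoding: one indicator column per distinct country.
one_hot = pd.get_dummies(df, columns=["Country"])

# Integer coding: pd.factorize returns (codes, uniques); keep only the codes.
integer_coded = df.assign(Country=pd.factorize(df["Country"])[0])

print(one_hot.shape)        # (500, 100)
print(integer_coded.shape)  # (500, 1)
```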



**Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.**

I would use nominal encoding and assign each of the 5 unique values a distinct integer. For example, if the values are {A, B, C, D, E}, they could be encoded as A = 0, B = 1, C = 2, D = 3, E = 4. I chose this method because it keeps the feature in a single column and preserves a one-to-one mapping between categories and codes, whereas one-hot encoding would replace the column with 5 indicator columns.
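One possible way to produce exactly the A = 0 through E = 4 mapping is `pd.Categorical` with an explicit category list. The sample `values` series below is an assumption made up for illustration.

```python
import pandas as pd

# Hypothetical column containing the five categories in arbitrary order.
values = pd.Series(["C", "A", "E", "B", "D", "A"])

# Listing the categories explicitly pins the code assigned to each value.
cat = pd.Categorical(values, categories=["A", "B", "C", "D", "E"])
codes = pd.Series(cat.codes, index=values.index)

print(codes.tolist())  # [2, 0, 4, 1, 3, 0]
```

Without the explicit `categories` argument pandas would sort the labels itself, which happens to give the same result here but is not guaranteed to match a mapping chosen by hand.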



**Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.**

Nominal encoding replaces each categorical column with a single numerical column, regardless of how many categories it contains (say 3 in the first column and 4 in the second). No new columns are created: 2 encoded columns − 2 original categorical columns = 0 new columns. The dataset therefore keeps 3 numerical + 2 encoded = 5 columns, and still has 1000 rows.
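The column count can be checked with a small sketch. The column names, category labels, and filler numeric values are all hypothetical; only the 1000 × 5 shape and the 3 and 4 category counts follow the answer above.

```python
import pandas as pd

n = 1000
levels_1 = ["low", "mid", "high"]              # 3 categories (assumed labels)
levels_2 = ["north", "south", "east", "west"]  # 4 categories (assumed labels)

df = pd.DataFrame({
    "cat_1": [levels_1[i % 3] for i in range(n)],
    "cat_2": [levels_2[i % 4] for i in range(n)],
    "num_1": list(range(n)),
    "num_2": list(range(n)),
    "num_3": list(range(n)),
})

# Integer-code each categorical column in place; no columns are added.
encoded = df.copy()
for col in ["cat_1", "cat_2"]:
    encoded[col] = encoded[col].astype("category").cat.codes

print(df.shape, encoded.shape)  # (1000, 5) (1000, 5)
```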



In [1]:
import pandas as pd

# Sample data
data = pd.DataFrame({
    'A': ['Red', 'Blue', 'Green', 'Yellow', 'Red', 'Green', 'Blue', 'Yellow'],
    'B': ['Small', 'Medium', 'Large', 'Small', 'Medium', 'Large', 'Small', 'Medium'],
    'C': [1.2, 2.3, 3.1, 4.0, 5.5, 6.7, 7.8, 8.9],
    'D': [10, 20, 30, 40, 50, 60, 70, 80],
    'E': [100, 200, 300, 400, 500, 600, 700, 800]
})

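# One-hot encode the categorical columns: 'A' (4 categories) and 'B' (3 categories)
# are replaced by 4 + 3 = 7 indicator columns, giving 3 + 7 = 10 columns in total.
# dtype=int keeps the indicators as 0/1 on pandas versions that default to booleans.
encoded_data = pd.get_dummies(data, columns=['A', 'B'], dtype=int)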

# Show the encoded data
print(encoded_data)


     C   D    E  A_Blue  A_Green  A_Red  A_Yellow  B_Large  B_Medium  B_Small
0  1.2  10  100       0        0      1         0        0         0        1
1  2.3  20  200       1        0      0         0        0         1        0
2  3.1  30  300       0        1      0         0        1         0        0
3  4.0  40  400       0        0      0         1        0         0        1
4  5.5  50  500       0        0      1         0        0         1        0
5  6.7  60  600       0        1      0         0        1         0        0
6  7.8  70  700       1        0      0         0        0         0        1
7  8.9  80  800       0        0      0         1        0         1        0


**Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.**

In [2]:
import pandas as pd

data = pd.DataFrame({
    'species': ['mammal', 'bird', 'reptile', 'fish'],
    'habitat': ['forest', 'desert', 'ocean', 'freshwater'],
    'diet': ['herbivore', 'carnivore', 'omnivore', 'herbivore']
})

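# Species, habitat, and diet are nominal (no natural order) and have few levels,
# so one-hot encoding is used: one indicator column per category, no implied ranking.
# dtype=int keeps the indicators as 0/1 on pandas versions that default to booleans.
one_hot_encoded_data = pd.get_dummies(data, dtype=int)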
print(one_hot_encoded_data)


   species_bird  species_fish  species_mammal  species_reptile  \
0             0             0               1                0   
1             1             0               0                0   
2             0             0               0                1   
3             0             1               0                0   

   habitat_desert  habitat_forest  habitat_freshwater  habitat_ocean  \
0               0               1                   0              0   
1               1               0                   0              0   
2               0               0                   0              1   
3               0               0                   1              0   

   diet_carnivore  diet_herbivore  diet_omnivore  
0               0               1              0  
1               1               0              0  
2               0               0              1  
3               0               1              0  


**Q7. You are working on a project that involves predicting customer churn for a telecommunications company. The dataset contains 5 features, including the customer's gender, age, contract type, payment method, and tenure. Which encoding technique(s) would you use to transform the categorical data into a suitable format for machine learning algorithms? Provide a step-by-step explanation of how you would implement the encoding.**

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [25, 45, 35, 50],
    'contract_type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month'],
    'payment_method': ['Electronic Check', 'Mailed Check', 'Bank Transfer', 'Credit Card'],
    'tenure': [12, 24, 36, 48]
})

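# Step 1: Label-encode 'gender'. It has two levels here, so a single 0/1 column is enough.
# LabelEncoder sorts the classes alphabetically: Female -> 0, Male -> 1.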
label_encoder = LabelEncoder()
data['gender'] = label_encoder.fit_transform(data['gender'])

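# Step 2: One-hot encode 'contract_type' and 'payment_method' (multi-level nominal features).
# 'age' and 'tenure' are already numeric and are left unchanged.
# dtype=int keeps the indicators as 0/1 on pandas versions that default to booleans.
data = pd.get_dummies(data, columns=['contract_type', 'payment_method'], dtype=int)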

print(data)


   gender  age  tenure  contract_type_Month-to-Month  contract_type_One Year  \
0       1   25      12                             1                       0   
1       0   45      24                             0                       1   
2       0   35      36                             0                       0   
3       1   50      48                             1                       0   

   contract_type_Two Year  payment_method_Bank Transfer  \
0                       0                             0   
1                       0                             0   
2                       1                             1   
3                       0                             0   

   payment_method_Credit Card  payment_method_Electronic Check  \
0                           0                                1   
1                           0                                0   
2                           0                                0   
3                           1          